## Hot questions for using neural networks in OpenAI Gym

Question:

I have a simple PyTorch neural net that I copied from OpenAI and modified to some extent (mostly the input).

When I run my code, the output of the network remains the same on every episode, as if no training occurs.

I want to see if any training happens, or if some other reason causes the results to be the same.

How can I check whether the weights are actually changing?

Thanks

Answer:

It depends on what you are doing, but the easiest check is to inspect the weights of your model.

You can print them (and compare them with the ones from the previous iteration) using the following code:

```python
for parameter in model.parameters():
    print(parameter.data)
```

If the weights are changing, the neural network is being optimized (which doesn't necessarily mean it learns anything useful in particular).
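Comparing printed tensors by eye gets tedious, so one option is to snapshot the parameters before an optimizer step and compare them afterwards. The following is a minimal sketch, assuming a toy `torch.nn.Linear` model, a plain SGD optimizer and a made-up loss as stand-ins for your own network; `snapshot` and `weights_changed` are hypothetical helper names, not part of the PyTorch API.

```python
import torch


def snapshot(model):
    """Return detached copies of all parameters, keyed by parameter name."""
    return {name: p.detach().clone() for name, p in model.named_parameters()}


def weights_changed(before, model):
    """Map each parameter name to True if it differs from the snapshot."""
    return {name: not torch.equal(before[name], p.detach())
            for name, p in model.named_parameters()}


torch.manual_seed(0)
model = torch.nn.Linear(4, 2)                         # stand-in for your network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

before = snapshot(model)

# One training step on a made-up loss (mean of squared outputs).
loss = model(torch.randn(8, 4)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

changed = weights_changed(before, model)
print(changed)
```

If some entry stays `False` across many steps, it may be worth checking whether that parameter's `.grad` is `None` after `backward()`, which would suggest it is not connected to the loss, or whether it was ever passed to the optimizer.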

Question:

I am building a deep neural network in Keras, using images from OpenAI's Gym library.

I tried to reshape the images using the following code:

```python
def reshape_dimensions(observation):
    processed = np.mean(observation,2,keepdims = False)
    cropped = processed[35:195]
    result = cropped[::2,::2]
    return result
```

This gives me an image of shape (80,80) but every time I try to input that shape in the first layer of the Keras network it doesn't work.

What shape should I use so that I can develop the network further?

Attached the whole code:

**PART I** retrieves the training data

```python
import gym
import random
import numpy as np
from statistics import mean, median
from collections import Counter

### GAME VARIABLE SETTINGS ###
env = gym.make('MsPacman-v0')
env.reset()
goal_steps = 2000
score_requirement = 250
initial_games = 200

print('Options to play: ',env.unwrapped.get_action_meanings())

### DEFINE FUNCTIONS ####
def reshape_dimensions(observation):
    processed = np.mean(observation,2,keepdims = False)
    cropped = processed[35:195]
    result = cropped[::2,::2]
    return result

def initial_population():
    training_data = []
    scores = []
    accepted_scores = []
    for _ in range(initial_games):
        score = 0
        game_memory = []
        prev_obvservation = []
        for _ in range(goal_steps):
            #env.render()
            action = env.action_space.sample() #Take random action in the env
            observation, reward, done, info = env.step(action)
            reshape_observation = reshape_dimensions(observation)
            if len(prev_obvservation) > 0:
                game_memory.append([prev_obvservation, action])
            prev_obvservation = reshape_observation
            score = score + reward
            if done:
                break

        if score >= score_requirement:
            accepted_scores.append(score)
            for data in game_memory:
                if data[1] == 0:
                    output = [1,0,0,0,0,0,0,0,0]
                elif data[1] == 1:
                    output = [0,1,0,0,0,0,0,0,0]
                elif data[1] == 2:
                    output = [0,0,1,0,0,0,0,0,0]
                elif data[1] == 3:
                    output = [0,0,0,1,0,0,0,0,0]
                elif data[1] == 4:
                    output = [0,0,0,0,1,0,0,0,0]
                elif data[1] == 5:
                    output = [0,0,0,0,0,1,0,0,0]
                elif data[1] == 6:
                    output = [0,0,0,0,0,0,1,0,0]
                elif data[1] == 7:
                    output = [0,0,0,0,0,0,0,1,0]
                elif data[1] == 8:
                    output = [0,0,0,0,0,0,0,0,1]
                training_data.append([data[0],output])

        env.reset()
        scores.append(score)

    print('Average accepted scores:', mean(accepted_scores))
    print('Median accepted scores:', median(accepted_scores))
    print(Counter(accepted_scores))
    return training_data

### RUN CODE ###
training_data = initial_population()
np.save('data_for_training_200.npy', training_data)
```

**PART II** trains the model

```python
import gym
import random
import numpy as np
import keras
from statistics import mean, median
from collections import Counter
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam

### LOAD DATA ###
raw_training_data = np.load("data_for_training_200.npy")
training_data = [i[0:2] for i in raw_training_data]

print(np.shape(training_data))

### DEFINE FUNCTIONS ###
def neural_network_model():
    network = Sequential()
    network.add(Dense(100, activation = 'relu', input_shape = (80,80)))
    network.add(Dense(9,activation = 'softmax'))
    optimizer = Adam(lr = 0.001)
    network.compile(optimizer = optimizer, loss = 'categorical_crossentropy', metrics=['accuracy'])
    return network

def train_model(training_data):
    X = [i[0] for i in training_data]
    y = [i[1] for i in training_data]
    #X = np.array([i[0] for i in training_data])
    #y = np.array([i[1] for i in training_data])
    print('shape of X: ', np.shape(X))
    print('shape of y: ', np.shape(y))
    early_stopping_monitor = EarlyStopping(patience = 3)
    model = neural_network_model()
    model.fit(X, y, epochs = 20, callbacks = [early_stopping_monitor])
    return model

train_model(training_data = training_data)
```

Answer:

It seems like you are pre-processing the individual images correctly but putting them inside a list instead of an input tensor. From the error message, you have a list of 36859 arrays of shape (80,80), whereas you want a single array of shape (36859, 80, 80). You already have the code that does this, commented out: `X = np.array([i[0] for i in training_data])`. For it to work, you have to ensure that every `i[0]` has the same shape (80,80).
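The conversion can be tried in isolation. Below is a minimal sketch that uses randomly generated stand-in data rather than the real frames; the sample count of 5 is arbitrary. The flattening step at the end is an assumption that goes beyond the answer: a `Dense` layer acts on the last axis of its input, so a common choice is to flatten each frame to 6400 values and declare `input_shape=(6400,)` in the first layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for training_data: a list of [frame, one_hot_action] pairs.
training_data = [[rng.random((80, 80)), np.eye(9)[rng.integers(9)]]
                 for _ in range(5)]

# Every frame must have the same shape, otherwise np.array cannot stack them
# into one regular array.
assert all(np.shape(sample[0]) == (80, 80) for sample in training_data)

X = np.array([sample[0] for sample in training_data])  # one array, not a list
y = np.array([sample[1] for sample in training_data])

# Optional (assumption): one flat vector per sample for a Dense input layer.
X_flat = X.reshape(len(X), -1)

print(X.shape, y.shape, X_flat.shape)
```

Printing the shapes before calling `model.fit` is a cheap way to confirm that the first axis is the number of samples and that `y` has one row of 9 values per sample.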

Question:

I am currently getting into TensorFlow and have just started to grasp its graph-based concept. I tried to implement a neural network trained with gradient descent (Adam optimizer) to solve the CartPole environment. I start by randomly initializing my weights and then, during training, sample actions at random according to the probabilities that the current weights produce. When testing, I always take the action with the maximum probability. However, I always get a score that hovers around 10, with a variance of around 0.8. It doesn't change in any notable way, which makes it look as if the network takes purely random actions at every step and learns nothing at all. It seems that the weights are never updated correctly. Where and how do I need to do that?

Here's my code:

```python
import tensorflow as tf
import numpy as np
from gym.envs.classic_control import CartPoleEnv

env = CartPoleEnv()

learning_rate = 10**(-3)
gamma = 0.9999

n_train_trials = 10**3
n_test_trials = 10**2

n_actions = env.action_space.n
n_obs = env.observation_space.high.__len__()

goal_steps = 200

should_render = False

print_per_episode = 100

state_holder = tf.placeholder(dtype=tf.float32, shape=(None, n_obs), name='symbolic_state')
actions_one_hot_holder = tf.placeholder(dtype=tf.float32, shape=(None, n_actions),
                                        name='symbolic_actions_one_hot_holder')
discounted_rewards_holder = tf.placeholder(dtype=tf.float32, shape=None, name='symbolic_reward')

# initialize neurons list dynamically
def get_neurons_list():
    i = n_obs
    n_neurons_list = [i]

    while i < (n_obs * n_actions) // (n_actions // 2):
        i *= 2
        n_neurons_list.append(i)

    while i // 2 > n_actions:
        i = i // 2
        n_neurons_list.append(i)

    n_neurons_list.append(n_actions)

    # print(n_neurons_list)

    return n_neurons_list


with tf.name_scope('nonlinear_policy'):
    # create list of layers with sizes
    n_neurons_list = get_neurons_list()

    network = None

    for i in range((len(n_neurons_list) - 1)):
        theta = tf.Variable(tf.random_normal([n_neurons_list[i], n_neurons_list[i+1]]))
        bias = tf.Variable(tf.random_normal([n_neurons_list[i+1]]))

        if network is None:
            network = tf.matmul(state_holder, theta) + bias
        else:
            network = tf.matmul(network, theta) + bias

        if i < len(n_neurons_list) - 1:
            network = tf.nn.relu(network)

    action_probabilities = tf.nn.softmax(network)

    testing_action_choice = tf.argmax(action_probabilities, dimension=1, name='testing_action_choice')

with tf.name_scope('loss'):
    actually_chosen_probability = action_probabilities * actions_one_hot_holder

    L_theta = -1 * (tf.reduce_sum(tf.log(actually_chosen_probability)) * tf.reduce_sum(discounted_rewards_holder))

with tf.name_scope('train'):
    # We define the optimizer to use the ADAM optimizer, and ask it to minimize our loss
    gd_opt = tf.train.AdamOptimizer(learning_rate).minimize(L_theta)

sess = tf.Session()  # FOR NOW everything is symbolic, this object has to be called to compute each value of Q

# Start
sess.run(tf.global_variables_initializer())

observation = env.reset()
batch_rewards = []
states = []
action_one_hots = []

episode_rewards = []
episode_rewards_list = []
episode_steps_list = []

step = 0
episode_no = 0
while episode_no <= n_train_trials:
    if should_render: env.render()
    step += 1

    action_probability_values = sess.run(action_probabilities,
                                         feed_dict={state_holder: [observation]})
    # Choose the action using the action probabilities output by the policy implemented in tensorflow.
    action = np.random.choice(np.arange(n_actions), p=action_probability_values.ravel())

    # Calculating the one-hot action array for use by tensorflow
    action_arr = np.zeros(n_actions)
    action_arr[action] = 1.
    action_one_hots.append(action_arr)

    # Record states
    states.append(observation)

    observation, reward, done, info = env.step(action)
    # We don't want to go above 200 steps
    if step >= goal_steps:
        done = True

    batch_rewards.append(reward)
    episode_rewards.append(reward)

    # If the episode is done, and it contained at least one step, do the gradient updates
    if len(batch_rewards) > 0 and done:

        # First calculate the discounted rewards for each step
        batch_reward_length = len(batch_rewards)
        discounted_batch_rewards = batch_rewards.copy()
        for i in range(batch_reward_length):
            discounted_batch_rewards[i] *= (gamma ** (batch_reward_length - i - 1))

        # Next run the gradient descent step
        # Note that each of action_one_hots, states, discounted_batch_rewards has the first dimension as the length
        # of the current trajectory
        gradients = sess.run(gd_opt, feed_dict={actions_one_hot_holder: action_one_hots, state_holder: states,
                                                discounted_rewards_holder: discounted_batch_rewards})

        action_one_hots = []
        states = []
        batch_rewards = []

    if done:
        # Done with episode. Reset stuff.
        episode_no += 1

        episode_rewards_list.append(np.sum(episode_rewards))
        episode_steps_list.append(step)

        episode_rewards = []
        step = 0

        observation = env.reset()

        if episode_no % print_per_episode == 0:
            print("Episode {}: Average steps in last {} episodes".format(episode_no, print_per_episode),
                  np.mean(episode_steps_list[(episode_no - print_per_episode):episode_no]), '+-',
                  np.std(episode_steps_list[(episode_no - print_per_episode):episode_no])
                  )

observation = env.reset()

episode_rewards_list = []
episode_rewards = []
episode_steps_list = []

step = 0
episode_no = 0

print("Testing")
while episode_no <= n_test_trials:
    env.render()
    step += 1

    # For testing, we choose the action using an argmax.
    test_action, = sess.run([testing_action_choice], feed_dict={state_holder: [observation]})

    observation, reward, done, info = env.step(test_action[0])
    if step >= 200:
        done = True
    episode_rewards.append(reward)

    if done:
        episode_no += 1

        episode_rewards_list.append(np.sum(episode_rewards))
        episode_steps_list.append(step)

        episode_rewards = []
        step = 0
        observation = env.reset()

        if episode_no % print_per_episode == 0:
            print("Episode {}: Average steps in last {} episodes".format(episode_no, print_per_episode),
                  np.mean(episode_steps_list[(episode_no - print_per_episode):episode_no]), '+-',
                  np.std(episode_steps_list[(episode_no - print_per_episode):episode_no])
                  )
```

Answer:

Here is an example TensorFlow program that uses Q-learning to learn the **CartPole** environment from OpenAI Gym.

It is able to quickly learn to stay upright for 80 steps.

##### Here is the code:

```python
import math
import numpy as np
import sys
import random
sys.path.append("../gym")
from gym.envs.classic_control import CartPoleEnv
env = CartPoleEnv()

discount = 0.5
learning_rate = 0.5
gradient = .001
regularizaiton_factor = .1

import tensorflow as tf

tf_state = tf.placeholder( dtype=tf.float32 , shape=[4] )
tf_state_2d = tf.reshape( tf_state , [1,4] )

tf_action = tf.placeholder( dtype=tf.int32 )
tf_action_1hot = tf.reshape( tf.one_hot( tf_action , 2 ) , [1,2] )

tf_delta_reward = tf.placeholder( dtype=tf.float32 )
tf_value = tf.placeholder( dtype=tf.float32 )

tf_matrix1 = tf.Variable( tf.random_uniform([4,7], -.001, .001) )
tf_matrix2 = tf.Variable( tf.random_uniform([7,2], -.001, .001) )

tf_logits = tf.matmul( tf_state_2d , tf_matrix1 )
tf_logits = tf.matmul( tf_logits , tf_matrix2 )

tf_loss = -1 * learning_rate * ( tf_delta_reward + discount * tf_value - tf_logits ) * tf_action_1hot
tf_regularize = tf.reduce_mean( tf.square( tf_matrix1 )) + tf.reduce_mean( tf.square( tf_matrix2 ))
tf_train = tf.train.GradientDescentOptimizer(gradient).minimize( tf_loss + tf_regularize * regularizaiton_factor )

sess = tf.Session()
sess.run( tf.global_variables_initializer() )

def max_Q( state ) :
    actions = sess.run( tf_logits, feed_dict={ tf_state:state } )
    actions = actions[0]
    value = actions.max()
    action = 0 if actions[0] == value else 1
    return action , value

avg_age = 0
for trial in range(1,101) :
    # initialize state
    previous_state = env.reset()
    # initialize action and the value of the expected reward
    action , value = max_Q(previous_state)

    previous_reward = 0
    for age in range(1,301) :
        if trial % 100 == 0 :
            env.render()

        new_state, new_reward, done, info = env.step(action)
        new_state = new_state
        action, value = max_Q(new_state)

        # The cart-pole gym doesn't return a reward of Zero when done.
        if done :
            new_reward = 0

        delta_reward = new_reward - previous_reward

        # learning phase
        sess.run(tf_train, feed_dict={ tf_state:previous_state, tf_action:action, tf_delta_reward:delta_reward, tf_value:value })

        previous_state = new_state
        previous_reward = new_reward

        if done :
            break

    avg_age = avg_age * 0.95 + age * .05
    if trial % 50 == 0 :
        print("Average age =",int(round(avg_age))," , trial",trial," , discount",discount," , learning_rate",learning_rate," , gradient",gradient)
    elif trial % 10 == 0 :
        # Stay on the same line until the next "Average age" summary is printed.
        print(int(round(avg_age)), end=" ")
```
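The loss in the code above is built around the quantity `tf_delta_reward + discount * tf_value - tf_logits`, which has the form of a one-step temporal-difference error. For comparison, here is a minimal sketch of the textbook Q-learning error on plain NumPy arrays. The Q-values and transitions are made-up numbers, and `td_error` is a hypothetical helper, not part of the answer's code. The textbook form uses the reward itself plus the discounted maximum Q-value of the next state, and drops that bootstrap term on terminal transitions.

```python
import numpy as np


def td_error(q_state, action, reward, q_next_state, discount, done):
    """One-step Q-learning TD error for a single transition."""
    # Terminal transitions do not bootstrap from the next state.
    bootstrap = 0.0 if done else discount * np.max(q_next_state)
    target = reward + bootstrap
    return target - q_state[action]


q_state = np.array([0.2, 0.5])       # made-up Q(s, left), Q(s, right)
q_next_state = np.array([0.1, 0.4])  # made-up Q(s', left), Q(s', right)

# Non-terminal step: the target includes the discounted best next value.
print(td_error(q_state, 1, 1.0, q_next_state, discount=0.5, done=False))

# Terminal step: the target is the reward alone.
print(td_error(q_state, 1, 0.0, q_next_state, discount=0.5, done=True))
```

A positive error means the current estimate for the chosen action is too low, and a negative one means it is too high; minimizing the squared error moves the estimate toward the target.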

##### Here is the output:

```
6 18 23 30 Average age = 36 , trial 50 , discount 0.5 , learning_rate 0.5 , gradient 0.001
38 47 50 53 Average age = 55 , trial 100 , discount 0.5 , learning_rate 0.5 , gradient 0.001
```
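When reading these numbers, keep in mind that `avg_age` is an exponential moving average that starts at 0, so the early values understate the true episode length. A minimal sketch, assuming a hypothetical agent that survives exactly 80 steps in every trial (`running_average` is an illustrative helper that replays the same update rule):

```python
def running_average(ages, decay=0.95, start=0.0):
    """Replay the avg_age update rule over a list of episode lengths."""
    avg = start
    history = []
    for age in ages:
        avg = avg * decay + age * (1 - decay)
        history.append(avg)
    return history


# Hypothetical agent: every one of 100 trials lasts exactly 80 steps.
history = running_average([80] * 100)

# Average after 10 trials and after 100 trials.
print(round(history[9], 1), round(history[99], 1))
```

In this toy case the average is still below half of the true value after 10 trials and only approaches 80 toward the end, so the printed figures lag behind the agent's actual performance.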

##### Summary

I wasn't able to get Q-learning with a simple neural net to solve the **CartPole** problem, but have fun experimenting with different NN sizes and depths!

Hope you enjoy this code, cheers