Hot questions for using neural networks in OpenAI Gym

Question:

I have a simple PyTorch neural net that I copied from OpenAI and modified to some extent (mostly the input).

When I run my code, the output of the network remains the same on every episode, as if no training occurs.

I want to see whether any training happens, or whether something else causes the results to stay the same.

How can I check that the weights are actually changing?

Thanks


Answer:

It depends on what you are doing, but the easiest check is to look at the weights of your model.

You can print them (and compare them with those from the previous iteration) using the following code:

for parameter in model.parameters():
    print(parameter.data)

If the weights are changing, the neural network is being optimized (which doesn't necessarily mean it is learning anything useful).
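
As a minimal sketch (reusing the model and one training step from your own script), you can snapshot the parameters before an update and compare them afterwards; any nonzero difference means the optimizer really is moving the weights:

import torch

# Take detached copies of the current parameters so they won't change in place.
before = [p.detach().clone() for p in model.parameters()]

# ... run one training iteration here: forward pass, loss, backward(), optimizer.step() ...

# Compare the snapshot with the updated parameters.
for old, new in zip(before, model.parameters()):
    diff = (new.detach() - old).abs().max().item()
    print("changed" if diff > 0 else "unchanged", "| max abs diff:", diff)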

Question:

I am creating a deep neural network with Keras, using images from OpenAI's Gym library.

I tried to reshape the images using the following code:

def reshape_dimensions(observation):
    # Average over the colour channels to get a greyscale image
    processed = np.mean(observation, 2, keepdims=False)
    # Crop rows 35-194 and downsample by a factor of 2 in each direction
    cropped = processed[35:195]
    result = cropped[::2, ::2]

    return result

This gives me an image of shape (80, 80), but every time I try to use that shape for the first layer of the Keras network it doesn't work.

What shape should I use so I can develop the network further?

The whole code is attached below:

PART I retrieves the training data

import gym
import random
import numpy as np
from statistics import mean, median
from collections import Counter


### GAME VARIABLE SETTINGS ###
env = gym.make('MsPacman-v0')
env.reset()

goal_steps = 2000
score_requirement = 250
initial_games = 200

print('Options to play: ',env.unwrapped.get_action_meanings())


### DEFINE FUNCTIONS ####

def reshape_dimensions(observation):
    processed = np.mean(observation,2,keepdims = False)
    cropped = processed[35:195]
    result = cropped[::2,::2]

    return result

def initial_population():
    training_data = []
    scores = []
    accepted_scores = []

    for _ in range(initial_games):
        score = 0
        game_memory = []
        prev_observation = []
        for _ in range(goal_steps):
            #env.render()
            action = env.action_space.sample() #Take random action in the env
            observation, reward, done, info = env.step(action)

            reshape_observation = reshape_dimensions(observation)

            if len(prev_observation) > 0:
                game_memory.append([prev_observation, action])

            prev_observation = reshape_observation

            score = score + reward
            if done: 
                break

        if score >= score_requirement:
            accepted_scores.append(score)

            for data in game_memory:
                # One-hot encode the action (MsPacman has 9 possible actions)
                output = [0] * 9
                output[data[1]] = 1

                training_data.append([data[0], output])



        env.reset()
        scores.append(score)


    print('Average accepted scores:', mean(accepted_scores))
    print('Median accepted scores:', median(accepted_scores))
    print(Counter(accepted_scores))

    return training_data 



### RUN CODE ###

training_data = initial_population()
np.save('data_for_training_200.npy', training_data)

PART II trains the model

import gym
import random
import numpy as np
import keras
from statistics import mean, median
from collections import Counter
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam


### LOAD DATA ###

raw_training_data = np.load("data_for_training_200.npy")
training_data = [i[0:2] for i in raw_training_data]

print(np.shape(training_data))

### DEFINE FUNCTIONS ###


def neural_network_model():

    network = Sequential()
    network.add(Dense(100, activation = 'relu', input_shape = (80,80)))
    network.add(Dense(9,activation = 'softmax'))

    optimizer = Adam(lr = 0.001)

    network.compile(optimizer = optimizer, loss = 'categorical_crossentropy', metrics=['accuracy'])

    return network

def train_model(training_data):

    X = [i[0] for i in training_data]
    y = [i[1] for i in training_data]

    #X = np.array([i[0] for i in training_data])
    #y = np.array([i[1] for i in training_data])

    print('shape of X: ', np.shape(X))
    print('shape of y: ', np.shape(y))

    early_stopping_monitor = EarlyStopping(patience = 3)

    model = neural_network_model()

    model.fit(X, y, epochs = 20, callbacks = [early_stopping_monitor])

    return model

train_model(training_data = training_data)

Answer:

It seems like you are pre-processing the individual images correctly, but putting them into a Python list instead of a single input tensor. From the error message, you have a list of 36859 arrays of shape (80, 80), whereas Keras wants a single array of shape (36859, 80, 80). The code that does this is already there, just commented out: X = np.array([i[0] for i in training_data]). You only have to ensure that every i[0] has the same shape (80, 80) for this to work.
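
As a quick sketch using the training_data list from the question, stacking the samples with np.array gives a single input tensor, and printing the shapes is an easy sanity check:

import numpy as np

# Stack the per-sample arrays into single input/target arrays.
# Every i[0] must have the same shape (80, 80) for this to work.
X = np.array([i[0] for i in training_data])   # expected shape: (n_samples, 80, 80)
y = np.array([i[1] for i in training_data])   # expected shape: (n_samples, 9)

print('shape of X:', X.shape)
print('shape of y:', y.shape)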

Question:

I am currently getting into TensorFlow and have just now started to grasp its graph-like concept. I tried to implement a neural network trained with gradient descent (the Adam optimizer) to solve the CartPole environment.

I start by randomly initializing my weights and then take random actions (sampled from the current policy's action probabilities) during training. When testing, I always take the action with the maximum probability.

However, I always get a score that hovers around 10, with a variance of around 0.8. Always. It doesn't change in any notable fashion, which makes it look as if purely random actions are taken at every step and nothing is learned at all. As I said, it seems that the weights are never updated correctly. Where and how do I need to do that?

Here's my code:

import tensorflow as tf
import numpy as np
from gym.envs.classic_control import CartPoleEnv



env = CartPoleEnv()

learning_rate = 10**(-3)
gamma = 0.9999

n_train_trials = 10**3
n_test_trials = 10**2

n_actions = env.action_space.n
n_obs = env.observation_space.high.__len__()

goal_steps = 200

should_render = False

print_per_episode = 100

state_holder = tf.placeholder(dtype=tf.float32, shape=(None, n_obs), name='symbolic_state')
actions_one_hot_holder = tf.placeholder(dtype=tf.float32, shape=(None, n_actions),
                                        name='symbolic_actions_one_hot_holder')
discounted_rewards_holder = tf.placeholder(dtype=tf.float32, shape=None, name='symbolic_reward')

# initialize neurons list dynamically
def get_neurons_list():
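    # Build a list of layer sizes: start at n_obs, double while below
    # (n_obs * n_actions) // (n_actions // 2), then halve back down, and
    # finish with n_actions (e.g. [4, 8, 4, 2] for CartPole).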
    i = n_obs
    n_neurons_list = [i]

    while i < (n_obs * n_actions) // (n_actions // 2):
        i *= 2
        n_neurons_list.append(i)

    while i // 2 > n_actions:
        i = i // 2
        n_neurons_list.append(i)

    n_neurons_list.append(n_actions)

    # print(n_neurons_list)

    return n_neurons_list


with tf.name_scope('nonlinear_policy'):
    # create list of layers with sizes
    n_neurons_list = get_neurons_list()

    network = None

    for i in range((len(n_neurons_list) - 1)):
        theta = tf.Variable(tf.random_normal([n_neurons_list[i], n_neurons_list[i+1]]))
        bias = tf.Variable(tf.random_normal([n_neurons_list[i+1]]))

        if network is None:
            network = tf.matmul(state_holder, theta) + bias
        else:
            network = tf.matmul(network, theta) + bias

        if i < len(n_neurons_list) - 1:
            network = tf.nn.relu(network)

    action_probabilities = tf.nn.softmax(network)

    testing_action_choice = tf.argmax(action_probabilities, dimension=1, name='testing_action_choice')

with tf.name_scope('loss'):
    actually_chosen_probability = action_probabilities * actions_one_hot_holder

    L_theta = -1 * (tf.reduce_sum(tf.log(actually_chosen_probability)) * tf.reduce_sum(discounted_rewards_holder))


with tf.name_scope('train'):
    # Use the Adam optimizer and ask it to minimize our loss
    gd_opt = tf.train.AdamOptimizer(learning_rate).minimize(L_theta)


sess = tf.Session()  # FOR NOW everything is symbolic, this object has to be called to compute each value of Q

# Start

sess.run(tf.global_variables_initializer())

observation = env.reset()
batch_rewards = []
states = []
action_one_hots = []

episode_rewards = []
episode_rewards_list = []
episode_steps_list = []

step = 0
episode_no = 0
while episode_no <= n_train_trials:
    if should_render: env.render()
    step += 1

    action_probability_values = sess.run(action_probabilities,
                                         feed_dict={state_holder: [observation]})
    # Choose the action using the action probabilities output by the policy implemented in tensorflow.
    action = np.random.choice(np.arange(n_actions), p=action_probability_values.ravel())

    # Calculating the one-hot action array for use by tensorflow
    action_arr = np.zeros(n_actions)
    action_arr[action] = 1.
    action_one_hots.append(action_arr)

    # Record states
    states.append(observation)

    observation, reward, done, info = env.step(action)
    # We don't want to go above 200 steps
    if step >= goal_steps:
        done = True

    batch_rewards.append(reward)
    episode_rewards.append(reward)

    # If the episode is done, and it contained at least one step, do the gradient updates
    if len(batch_rewards) > 0 and done:

        # First calculate the discounted rewards for each step
        batch_reward_length = len(batch_rewards)
        discounted_batch_rewards = batch_rewards.copy()
        for i in range(batch_reward_length):
            discounted_batch_rewards[i] *= (gamma ** (batch_reward_length - i - 1))

        # Next run the gradient descent step
        # Note that each of action_one_hots, states, discounted_batch_rewards has the first dimension as the length
        # of the current trajectory
        gradients = sess.run(gd_opt, feed_dict={actions_one_hot_holder: action_one_hots, state_holder: states,
                                                discounted_rewards_holder: discounted_batch_rewards})


        action_one_hots = []
        states = []
        batch_rewards = []

    if done:
        # Done with episode. Reset stuff.
        episode_no += 1

        episode_rewards_list.append(np.sum(episode_rewards))
        episode_steps_list.append(step)

        episode_rewards = []

        step = 0

        observation = env.reset()

        if episode_no % print_per_episode == 0:
            print("Episode {}: Average steps in last {} episodes".format(episode_no, print_per_episode),
                  np.mean(episode_steps_list[(episode_no - print_per_episode):episode_no]), '+-',
                  np.std(episode_steps_list[(episode_no - print_per_episode):episode_no])
                  )


observation = env.reset()

episode_rewards_list = []
episode_rewards = []
episode_steps_list = []

step = 0
episode_no = 0

print("Testing")
while episode_no <= n_test_trials:
    env.render()
    step += 1

    # For testing, we choose the action using an argmax.
    test_action, = sess.run([testing_action_choice],
                            feed_dict={state_holder: [observation]})

    observation, reward, done, info = env.step(test_action[0])
    if step >= 200:
        done = True
    episode_rewards.append(reward)

    if done:
        episode_no += 1

        episode_rewards_list.append(np.sum(episode_rewards))
        episode_steps_list.append(step)

        episode_rewards = []
        step = 0
        observation = env.reset()

        if episode_no % print_per_episode == 0:
            print("Episode {}: Average steps in last {} episodes".format(episode_no, print_per_episode),
                  np.mean(episode_steps_list[(episode_no - print_per_episode):episode_no]), '+-',
                  np.std(episode_steps_list[(episode_no - print_per_episode):episode_no])
                  )

Answer:

Here is an example TensorFlow program that uses Q-learning to learn the CartPole OpenAI Gym environment.

It is able to quickly learn to stay upright for 80 steps.

Here is the code:

import math
import numpy as np
import sys
import random

sys.path.append("../gym")
from gym.envs.classic_control import CartPoleEnv

env = CartPoleEnv()

discount = 0.5
learning_rate = 0.5
gradient = .001
regularization_factor = .1

import tensorflow as tf

tf_state    = tf.placeholder( dtype=tf.float32 , shape=[4] )
tf_state_2d    = tf.reshape( tf_state , [1,4] )

tf_action   = tf.placeholder( dtype=tf.int32 )
tf_action_1hot = tf.reshape( tf.one_hot( tf_action , 2 ) , [1,2] )

tf_delta_reward = tf.placeholder( dtype=tf.float32 )
tf_value        = tf.placeholder( dtype=tf.float32 )

tf_matrix1   = tf.Variable( tf.random_uniform([4,7], -.001, .001) )
tf_matrix2   = tf.Variable( tf.random_uniform([7,2], -.001, .001) )

tf_logits    = tf.matmul( tf_state_2d , tf_matrix1 ) 
tf_logits    = tf.matmul( tf_logits , tf_matrix2 )


# TD-style loss: the error term (delta_reward + discount * value - the current
# Q estimate from tf_logits), masked to the supplied action by the one-hot vector.
tf_loss = -1 * learning_rate * ( tf_delta_reward + discount * tf_value - tf_logits ) * tf_action_1hot
# L2 regularization on both weight matrices.
tf_regularize = tf.reduce_mean( tf.square( tf_matrix1 )) + tf.reduce_mean( tf.square( tf_matrix2 ))
tf_train = tf.train.GradientDescentOptimizer(gradient).minimize( tf_loss + tf_regularize * regularization_factor )


sess = tf.Session()
sess.run( tf.global_variables_initializer() )

def max_Q( state ) :
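    # Greedy action selection: evaluate the Q network for `state` and return
    # the highest-valued action together with that value.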
    actions = sess.run( tf_logits, feed_dict={ tf_state:state } )
    actions = actions[0]
    value = actions.max()
    action = 0 if actions[0] == value else 1
    return action , value


avg_age = 0
for trial in range(1,101) :

    # initialize state
    previous_state = env.reset()
    # initialize action and the value of the expected reward
    action , value = max_Q(previous_state)


    previous_reward = 0
    for age in range(1,301) :
        if trial % 100 == 0 :
            env.render()

        new_state, new_reward, done, info = env.step(action)
        action, value = max_Q(new_state)

        # The cart-pole gym doesn't return a reward of Zero when done.
        if done :
            new_reward = 0

        delta_reward = new_reward - previous_reward

        # learning phase
        sess.run(tf_train, feed_dict={ tf_state:previous_state, tf_action:action, tf_delta_reward:delta_reward, tf_value:value })

        previous_state  = new_state
        previous_reward = new_reward

        if done :
            break

    avg_age = avg_age * 0.95 + age * .05
    if trial % 50 == 0 :
        print "Average age =",int(round(avg_age))," , trial",trial," , discount",discount," , learning_rate",learning_rate," , gradient",gradient
    elif trial % 10 == 0 :
        print(int(round(avg_age)), end=' ')

Here is the output:

6 18 23 30 Average age = 36  , trial 50  , discount 0.5  , learning_rate 0.5  , gradient 0.001
38 47 50 53 Average age = 55  , trial 100  , discount 0.5  , learning_rate 0.5  , gradient 0.001

Summary

I wasn't able to get Q-learning with a simple neural net to fully solve the CartPole problem, but have fun experimenting with different NN sizes and depths!

Hope you enjoy this code. Cheers!