In this tutorial we are going to show you how to beat the gym game CartPole-v0 using Keras. Beating the game means getting a score higher than 195 for 100 consecutive games. We are going to do this by imitating (guess why we call it an imitation network) the well-played games of a random agent. Before proceeding, make sure that you have Python installed and can create .py files or .ipynb files (Jupyter Notebook).
Firstly we are going to set up a baseline with a random agent. Then we are going to collect training data, or samples, for our model to learn from. The next step is to build a Keras model. Before training this model, we will show you how to use it to make predictions. After training the model we will show how it improved over time, and to finish up we will evaluate our model.
In short, we are going to perform the following steps:
1. Set up a baseline with a random agent.
2. Collect training data from well-played random games.
3. Build a Keras model.
4. Make predictions with the untrained model.
5. Train the model.
6. Plot the training history.
7. Evaluate the trained model.
The tutorial will assume that you make a new .py file, either using PyCharm or by renaming a .txt file to .py. For .ipynb files you will have to run pip install jupyter notebook and then start a Jupyter server by typing jupyter notebook in a terminal. This opens a localhost website on which you can create new .ipynb files.
The whole tutorial will require these imports to be at the top of the file (.py or .ipynb):
import gym
import random
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
For these imports to work you have to install them in your virtual environment, in short venv. In case you do not have a venv yet, it is advisable to create one. The reasons and the process are explained in our Python guide for Python in general (terminal), and when using PyCharm there is a JetBrains guide on setting up a venv in PyCharm.
Make sure that your console window or terminal is showing your venv name in brackets before entering the command:
(venv) C:\path\to\project>pip install gym==0.15.4 matplotlib==3.1.2 numpy==1.18.1 tensorflow==2.2.0
You can use this command (starting from pip install ...) to install the needed packages.
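In case you still have to create and activate the venv, a minimal example on Windows looks like this (on Linux or macOS the activate script lives in venv/bin/activate instead):
C:\path\to\project>python -m venv venv
C:\path\to\project>venv\Scripts\activate
After activation your prompt should show the (venv) prefix, and you can run the pip install command above.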
Issues with installing TensorFlow (backend of Keras)?
There is one very common issue with installing TensorFlow: a DLL load failure. The error occurs when installing TensorFlow without a Visual Studio installation. The solution is an answer posted by galocen on TensorFlow issue #35749 (first reply).
This is the CPU variant, which is fast enough for now. There is also a GPU variant which is much faster (about 10x), but a bit more complicated to install. The installation guide for the GPU version can be found here in case you want to install it now.
When you've installed the CPU version you might get a warning (recognizable by the starting W) saying that TensorFlow could not find its GPU library. That's fine; you can ignore this if you're using the CPU version.
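If you are not sure which variant you ended up with, a small optional check is to ask TensorFlow which GPUs it can see; an empty list simply means you are running the CPU version:
import tensorflow as tf
# Lists the physical GPUs TensorFlow can access; an empty list means CPU-only.
print(tf.config.list_physical_devices('GPU'))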
In order to see how good a model is, we are going to set up a baseline. This baseline performance is that of a random agent, which means that we can almost fully reuse the loop that is already explained in the gym guide. The changes are that we have to store the scores while playing and calculate the average at the end. We are going to reuse this part a few times, so it is a good idea to make a function out of it. Check the link if you are not yet familiar with functions, otherwise try to implement the random average calculator yourself.
Some hints in case you want to implement it yourself:
- Use env.action_space.sample() to get a random action.
- The game is over once the done signal given by env.step(action) becomes True, and this moment is unknown at the start of the game. This implies a certain loop construction, which one?
def random_average(nr_games=100):
    env = gym.make("CartPole-v0")
    collected_scores = []
    for _ in range(nr_games):
        env.reset()
        done = False
        score = 0
        while not done:
            action = env.action_space.sample()
            obs, reward, done, info = env.step(action)
            score += reward
        collected_scores.append(score)
    env.close()
    average = sum(collected_scores) / nr_games
    print(f"\nA random model played: {nr_games} games, with an average score of:"
          f" {average:5.2f}")
    return average
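Calling the function gives us our baseline. The exact value differs per run, but a random agent typically averages somewhere around 20 points, far below the 195 we are aiming for:
random_average(nr_games=100)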
Here we are going to collect training data from random games that scored higher than a minimum_score, in this case 100. We store only the results of these good games, as we want to learn from the well-played games only. The value 100 is a bit arbitrary: higher values will take more time, since random play usually doesn't perform that well, while lower values will give training examples that are not that good.
For this part we will implement the function collect_data with the arguments nr_games and minimum_score. The output values should be two lists: one with the observations and one with the actions taken. In order to do this you can create temporary lists while playing the game and only add them to the global lists when the score is high enough (take a look at the extend method for lists). You can either implement this yourself or copy the code from below. The main focus is on showing you how we can use AI to solve this problem, not yet on the coding.
The result contains nr_games times the full observations and actions from a well-played episode.
def collect_data(nr_games=50, minimum_score=100):
""" We are going to collect a fixed number of games having a high score. """
# Here we are going to collect all the data.
data_observations = [] # This is the X, or observation
data_actions = [] # This is the y, or label data
collected_scores = []
collected_games = 0
env = gym.make("CartPole-v0")
# While loop, since we do not know how many games to play until we have nr_games good ones.
while collected_games < nr_games:
# Start the default game loop
obs = env.reset()
done = False
# Temporarily store all actions, in case it is a good run.
temp_observations = []
temp_actions = []
score = 0
while not done:
# Pick a random action
action = env.action_space.sample()
# Store the action and rewards for when this is a good game
temp_observations.append(obs)
temp_actions.append(action)
# Update the obs, done and score
obs, reward, done, info = env.step(action)
score += reward
# Only store information about good games.
if score > minimum_score:
# Use extend to merge a list with a list.
data_observations.extend(temp_observations)
data_actions.extend(temp_actions)
# Update counters and give the users a message.
collected_scores.append(score)
collected_games += 1
# The \r is a flush action, this means that it will get overwritten on the next print.
print(f"\rAdded score: {score}, collected games: {collected_games}/{nr_games}", end='')
env.close()
print(f"\n\nCollected scores (higher is better):\n{collected_scores}")
print(f"\nCollected average: {sum(collected_scores) / nr_games}")
return data_observations, data_actions
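Collecting the data is then a single call; note that with minimum_score=100 this can take a little while, since random play only rarely reaches that score:
X, y = collect_data(nr_games=50, minimum_score=100)
# Both lists have one entry per time step of every stored game.
print(len(X), len(y))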
We are going to create a model with the Keras API from TensorFlow. The model shown below is a fully connected network, with as input a vector of shape 4 and as output a vector of shape 2 (with in between a few hidden layers that have 64, 128 and 64 nodes respectively). The model is going to update its variables using the Adam optimizer and MSE (mean squared error) as loss. What these do exactly is not important for now; they will take care of the learning process of the model. The softmax layer creates a distribution, meaning that the sum of the output equals 1. This way we can say something about which action is preferred.
def create_model():
    model = Sequential()
    model.add(Dense(64, input_shape=(4,), activation='relu'))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(2, activation='softmax'))  # We only have two values.
    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
    model.summary()
    return model
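If you want to verify the softmax claim yourself, a small optional check is to feed the (untrained) model a dummy batch and confirm that the two outputs sum to 1:
model = create_model()
dummy = np.zeros((1, 4))    # A batch containing a single all-zero observation.
probabilities = model.predict(dummy)
print(probabilities.shape)  # (1, 2): one prediction with two action probabilities.
print(probabilities.sum())  # Approximately 1.0, because of the softmax.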
Here we are going to show you an example of how to use the (untrained) model to predict an action based on input from CartPole. The output is given as the probability of choosing each particular action. The input is our model and all training samples X, where X is a list of observations.
When feeding an input to a model, it always expects to get a batch of data as a numpy array. To get a batch we have to put a single observation (X[0]) in a list before creating the numpy array. The reason for this is that we generally fit and predict hundreds or thousands of samples at once. When using the model on a single input we have to expand the dimensions, which can be done by putting the observation in a list (as done above) or, whenever it is already a numpy array, by using np.expand_dims(observation, axis=0). The axis=0 is required to indicate that we want to create an extra dimension before the 0th dimension (in our example, a (4,) will become (1, 4) with axis=0, and (4, 1) with axis=1).
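As a small illustration (using a made-up observation), both options produce exactly the same (1, 4) batch:
single_obs = np.array([0.1, 0.2, 0.3, 0.4])   # Shape (4,), one observation.
batch_a = np.array([single_obs])              # Wrapped in a list: shape (1, 4).
batch_b = np.expand_dims(single_obs, axis=0)  # Expanded dims: shape (1, 4).
print(batch_a.shape, batch_b.shape)           # (1, 4) (1, 4)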
# Let us show an example prediction of our model.
def show_example(model, X):
    # Sample id
    sample_id = 5
    # We convert the observation to a numpy array with shape (1, 4).
    # The leading 1 is required, because the model always expects to get a batch of values.
    observation = np.array([X[sample_id]])
    # Show the shape and values that are the model input.
    print(f"Shape: {observation.shape}, values: {observation}")
    # This is the model output, 2 values which sum up to 1.
    action_probabilities = model.predict(observation)[0]
    # Here we show each action and the model's predicted probability.
    print(f"\nAction and probability")
    for action, probability in zip(np.arange(2), action_probabilities):
        print(f"\tAction: {action}, probability: {probability * 100:6.2f}%")
    # The action we have to take is equal to the index of the highest value in the model output.
    print(f"\nBest action: {np.argmax(action_probabilities)}")
X, y = collect_data()
model = create_model()
show_example(model, X)
Shape: (1, 4), values: [[-0.05181469 -0.2076796 0.04708704 0.28343453]]
Action and probability
    Action: 0, probability: 50.52%
    Action: 1, probability: 49.48%
Best action: 0
The main thing to note here is that we changed the original input of shape (4,) to a batch by changing the shape to (1, 4), and that the model predicted action 0.
This is the moment to use the data we collected above to fit the model that is going to map the given X to the correct y. The model is going to find a (complex) function based on the labeled data we give it, which is why it is important that we only collected data from the games in which we did well.
model_x = np.array(X)
model_y = to_categorical(np.array(y), 2)
history = model.fit(model_x, model_y, epochs=20)
You might wonder about the to_categorical function. This is important because the model gives two outputs, hence the correct label has to be of the same shape. This is done by using a one-hot encoding. A beginner-friendly explanation is provided by Michael DelSole, but the basic idea is that we want our model to give a high probability to the action that we took, so we set that action to 1 and all other actions to zero. For example, if the taken action is 1, we will encode this as [0, 1], so the model will learn to pick action 1.
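A tiny example of what to_categorical does with our two possible actions:
print(to_categorical([0, 1, 1], 2))
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]]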
The history object returned by the model fit function provides us with information that we can use to plot the training progress. The information that is stored can be adjusted by adding extra metrics in the model.compile step.
Now to get an idea of how the loss and accuracy are changing during the model training process, we can use the following function. It takes as input the history of the model fit method.
def plot_history(history):
    """ Shows the loss (mean squared error) and accuracy over time. """
    plt.subplots(1, 2, figsize=(30, 10))
    plt.subplot(1, 2, 1)
    plt.plot(history.history['loss'])
    plt.title("Loss")
    plt.subplot(1, 2, 2)
    plt.plot(history.history['accuracy'])
    plt.title("Accuracy")
    plt.show()
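Using it is a single call with the history object that model.fit returned earlier:
plot_history(history)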
If you run this, you should see two graphs. The left one is the loss function, which is the MSE. This is the squared difference between the true label and the predicted label, averaged over all samples. This value should get smaller and smaller over time.
The other graph shows the accuracy, or how many values were predicted correctly by the model. This should be an increasing value. Note that a high accuracy does not necessarily mean a good model; can you think of why?
The final part is checking how well the model performs after training. This will be done by letting it play 100 games and printing the average score. Adjust the main loop by taking the observation from env.reset and env.step and transforming it into a model input. Predict the action with the model and feed that into env.step. Take good care of this function, since it will come up a few more times during this series.
A few hints:
- The model input has to be a numpy array; a list can be converted to a numpy array using np.array(list).
- Check the input shape using obs.shape and make sure it has a leading 1. Example: obs.shape should return (1, 4).
- This can be done by either storing the data in an extra list and then changing it to a numpy array, or by using np.expand_dims(var, axis=0) as done in the example model output code.
def evaluate_model(nr_games=100):
    env = gym.make("CartPole-v0")
    collected_scores = []
    for episode in range(1, nr_games + 1):
        obs = env.reset()
        done = False
        score = 0
        while not done:
            # Get the action from the model.
            model_x = np.array([obs])
            action = np.argmax(model.predict(model_x)[0])
            # Update everything.
            obs, reward, done, info = env.step(action)
            score += reward
        print(f"\r\tGame {episode:3d}/{nr_games:3d} score: {score}", end='')
        collected_scores.append(score)
    env.close()
    print(f"\n\nThe model played: {nr_games} games, with an average score of: {sum(collected_scores) / nr_games:5.2f}")
# Now let's compare our model versus a random model.
random_average()
print(f"\nOur model performance:")
evaluate_model(nr_games=100)
In this lesson we imitated the good games of a random model using Keras and in that way solved CartPole. We encourage you to play around with the different variables, such as minimum_score and the number of games, to see what kind of impact they have on the final model and the time it takes to run the program.
For an overview, the full program is available as a .ipynb file (Jupyter Notebook) and as a normal .py file.