Rank | Name | Score | Git Link |
---|---|---|---|
1 | Educo | 16.5 | educo |
It was the beginning of a new era: in 2013, Google DeepMind presented the first paper that implemented a DQN and successfully beat human players and earlier Reinforcement Learning (RL) techniques in multiple games. The DQN was trained using only raw pixel inputs and was able to get good performance in six of the seven games, without adjusting any hyperparameters [1].
This paper was followed by a second paper, presented on the 25th of February 2015. In it they present a DQN architecture and implementation that was able to beat or compete with humans in 49 different games, again using the same algorithm, network architecture and hyperparameters [2].
Fast forward to the 31st of March 2020: another Google DeepMind paper was released that finally conquered the Atari benchmark. Agent57 was the first deep reinforcement learning agent to obtain scores above the human baseline on all 57 Atari games. This includes one of the hardest RL games, Montezuma's Revenge, which is complicated due to the sparse rewards in the game [3].
To put it into the words of Károly Zsolnai-Fehér from Two Minute Papers: "What a time to be alive".
In this lesson we are going to reproduce a large part of the first paper, Playing Atari with Deep Reinforcement Learning. Please take your time to read through this paper and do not worry too much if you do not understand everything at once.
For this lesson we are going to follow the blog series Beat Atari with Deep Reinforcement Learning! by Adrien Lucas Ecoffet. Our story starts at Part 1: DQN; his earlier post, Part 0: Intro to RL, is an introduction to RL and describes Markov Decision Processes, Q functions and policies. Take your time to read his blog and see if you understand what he is doing; do not worry about the implementation (yet).
While reading, think about how you could implement his code using the knowledge obtained in previous lessons. Please note that this lesson might take several days to complete. Also, since the framework contains a lot of basic code that can be reused here, much of the code will point to the framework instead of being copied here.
To spread the work out, all the code is divided into four classes: Preprocessor, Model, Memory and Agent. Each of these will be discussed in more detail below.
The fun has begun... The first thing in his post is building a preprocessor. In a previous lesson we have built a preprocessor for the game of Pong; with that in mind, build this preprocessor. For simplicity we provide you with a FrameStack class. This is a wrapper that efficiently stores consecutive observations in a so-called LazyFrame. This implementation makes sure that no duplicate observations are stored (source code: FrameStack), and is similar to the blog's RingBuf.
When using the framework this can be implemented using:
from core.preprocessing.wrappers import FrameStack
env = FrameStack(BreakoutPreprocessor(env), stack=4)
Here you have to create the BreakoutPreprocessor yourself. Note that the FrameStack is the last wrapper; this has to do with the fact that the frames are stacked internally and you get a list back. The most important thing is to copy the frame before transforming it into a numpy array, and to make sure that the axes end up in the right order. This is because the frames are stacked on the first axis, e.g. (4, 210, 160, 3), instead of (210, 160, 3, 4). In any case use np.array(obs[:]) for the copying and obs.transpose([1, 2, 3, 0]) for possible order changes (docs: np.transpose).
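As a small illustration of the copy and the axis reordering (using a plain numpy array as a stand-in for a LazyFrame, with the raw Atari frame size):

import numpy as np

# Stand-in for a stacked LazyFrame: 4 RGB frames stacked on the first axis.
stacked = np.zeros((4, 210, 160, 3), dtype=np.uint8)

# obs[:] forces the lazy frames to materialise; np.array makes a hard copy.
obs = np.array(stacked[:])

# Move the stack axis to the back: (4, 210, 160, 3) -> (210, 160, 3, 4).
obs = obs.transpose([1, 2, 3, 0])
print(obs.shape)  # (210, 160, 3, 4)

With that in mind, a possible implementation of the BreakoutPreprocessor using the framework's BaseWrapper looks as follows: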
import numpy as np

from core.preprocessing.wrappers import BaseWrapper


class BreakoutPreprocessor(BaseWrapper):
    def __init__(self, env, *args, **kwargs):
        super().__init__(env, *args, **kwargs)

    def step(self, action):
        """ Here we get the basic output and change it. """
        # Perform the basic step.
        obs, reward, done, info = self.env.step(action)
        return self.preprocess(obs), reward, done, info

    def reset(self):
        """ Convert the reset state as well. """
        obs = self.env.reset()
        return self.preprocess(obs)

    @staticmethod
    def preprocess(obs):
        """ Preprocess the observation as required. """
        # Remember that we are in a vector environment
        obs_gray = np.mean(obs, axis=3).astype(np.uint8)
        obs_down_sample = obs_gray[:, ::2, ::2]
        return obs_down_sample
If you are not using the framework, a standalone version of the same preprocessor (this one also clips the rewards with np.sign, as done in the paper) looks like this:

import numpy as np


class BreakoutPreprocessor:
    def __init__(self, env, *args, **kwargs):
        self.env = env

    def __getattr__(self, name):
        return getattr(self.env, name)

    def __str__(self):
        return '<{}{}>'.format(type(self).__name__, self.env)

    def __repr__(self):
        return str(self)

    def step(self, action):
        """ Here we get the basic output and change it. """
        # Perform the basic step.
        obs, reward, done, info = self.env.step(action)
        return self.preprocess(obs), np.sign(reward), done, info

    def reset(self):
        """ Convert the reset state as well. """
        obs = self.env.reset()
        return self.preprocess(obs)

    @staticmethod
    def preprocess(obs):
        """ Preprocess the observation as required. """
        # Remember that we are in a vector environment
        obs_gray = np.mean(obs, axis=3).astype(np.uint8)
        obs_down_sample = obs_gray[:, ::2, ::2]
        return obs_down_sample
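As a quick sanity check of the preprocess step (the leading batch dimension comes from the vector environment, and 210 x 160 x 3 is the raw Atari frame size):

import numpy as np

# Two dummy RGB Atari frames with a leading batch dimension.
dummy = np.random.randint(0, 255, size=(2, 210, 160, 3), dtype=np.uint8)
processed = BreakoutPreprocessor.preprocess(dummy)
print(processed.shape, processed.dtype)  # (2, 105, 80) uint8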
In the paper and the blog the whole model is described and depicted. For simplicity, the model shown here extends a Keras BaseModel from the framework (this includes some premade checkpoint and saving options). The model should implement the following features:
- Epsilon decay over time (the update_epsilon method).
- An act method that returns the action(s) when given a state(s).
- Training on a batch of transitions (the fit_batch method).

import keras
import numpy as np

from core.models.extern import BaseModelKeras


class BreakoutModel(BaseModelKeras):
    def __init__(self, input_shape, output_shape):
        super().__init__(input_shape, output_shape)
        self.model = self.create_model(input_shape, output_shape)

        # Define epsilon
        self.epsilon = 1
        self.epsilon_max = self.epsilon
        self.epsilon_min = 0.05
        self.epsilon_steps = 1_000_000
        self.epsilon_dstep = (self.epsilon - self.epsilon_min) / self.epsilon_steps
        self.frames = 0

    @staticmethod
    def create_model(input_shape, output_shape, *args, **kwargs):
        """ Creates the model. """
        frames_input = keras.layers.Input(input_shape, name='frames')
        actions_input = keras.layers.Input((output_shape,), name='mask')

        # We store uint8 frames to reduce memory size compared to floats, so normalize here.
        normalize = keras.layers.Lambda(lambda x: x / 255.0)(frames_input)

        # Default convolutional layers with flattening
        conv_1 = keras.layers.Conv2D(16, (8, 8), strides=(4, 4), activation='relu')(normalize)
        conv_2 = keras.layers.Conv2D(32, (4, 4), strides=(2, 2), activation='relu')(conv_1)
        flatten = keras.layers.Flatten()(conv_2)

        # Hidden layer and final outputs
        hidden = keras.layers.Dense(256, activation='relu')(flatten)
        output = keras.layers.Dense(output_shape)(hidden)

        # We apply the masking, this is equal to `output * mask`
        masked = keras.layers.multiply([output, actions_input])

        # Create optimizer
        optimizer = keras.optimizers.RMSprop(lr=0.00025, rho=0.95, epsilon=0.01)

        # Create and compile model with an optimizer
        model = keras.models.Model(inputs=[frames_input, actions_input], outputs=masked)
        model.compile(optimizer, loss='mse')
        return model

    def predict(self, states):
        """ Return the predictions of your model. """
        return self.model.predict([states, np.ones((states.shape[0], self.output_shape))])

    def act(self, states):
        """ Return the actions to execute, combined with epsilon. """
        if np.random.random() <= self.epsilon:
            return np.random.randint(0, self.output_shape, states.shape[0])
        q_values = self.predict(states)
        return np.argmax(q_values, axis=1)

    def act_mask(self, states, mask: np.ndarray) -> np.ndarray:
        return self.model.predict([states, mask])

    def train(self, x, y):
        """ Trains the model on a single batch. """
        self.model.fit(x, y, verbose=0)

    def fit_batch(self, gamma, start_states, actions, rewards, is_terminal, next_states):
        # Make hard copies of the LazyFrames here; the memory savings only apply inside the replay memory.
        start_states = np.array(start_states[:]).transpose([0, 2, 3, 1])
        next_states = np.array(next_states[:]).transpose([0, 2, 3, 1])

        next_q_values: np.ndarray = self.act_mask(next_states, mask=np.ones_like(actions))
        next_q_values[is_terminal] = 0
        rewards[is_terminal] = -1

        q_values = rewards + gamma * np.max(next_q_values, axis=1)
        self.train([start_states, actions], actions * q_values[:, None])

    def update_epsilon(self):
        self.epsilon = max(self.epsilon_max - self.epsilon_dstep * self.frames, self.epsilon_min)
        self.frames += 1
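To make the target computation in fit_batch concrete, here is the same Q-target arithmetic written out in plain numpy on a few made-up transitions (two actions, batch of three):

import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, 0.0])
is_terminal = np.array([False, False, True])
# Predicted Q-values for the next states, one row per transition.
next_q_values = np.array([[0.5, 1.2], [0.3, 0.1], [2.0, 0.7]])

next_q_values[is_terminal] = 0   # no future reward after a terminal state
rewards[is_terminal] = -1        # the terminal penalty used in fit_batch above
targets = rewards + gamma * np.max(next_q_values, axis=1)
print(targets)  # approximately [2.188, 0.297, -1.0]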
Not added; if you have one or need one, contact the education committee.
The memory requirements are the same as in the past sessions, but there is one difference: the amount of data that is going to be stored is huge. A single stacked observation of shape (105, 80, 4) stored as uint8 takes 105 × 80 × 4 = 33,600 bytes, or about 33.6 kB. According to the paper we have to store one million of these, which leads to roughly 33 GB for the states alone. By using the lazy frames we can reduce this amount drastically, but it is still a lot of space for the Replay Memory.
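A quick back-of-the-envelope check of that estimate (the numbers follow from the (105, 80, 4) uint8 observations used in this lesson and the paper's buffer size of one million transitions):

# Naive (non-lazy) replay memory size estimate.
frame_bytes = 105 * 80 * 4               # one stacked uint8 observation: 33,600 bytes
states_only = 1_000_000 * frame_bytes    # ~33.6 GB
with_next_states = 2 * states_only       # ~67 GB if next states are stored separately
print(states_only / 1e9, with_next_states / 1e9)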
For this reason there is a better-optimized numpy replay memory that stores everything in the smallest possible format, so that everything stays as compact as possible. The implementation is available here, and when looking at that file you can see that it is part of the framework.
An example usage when using the framework code is given below:
from core.memory import BaseMemoryNumpy

# Setup memory
memory = BaseMemoryNumpy(
    size=120_000,
    shape=(105, 80, 4),
    action_space=env.action_space,
    stacked_frames=True
)
Not added; if you have one or need one, contact the education committee.
If you do not have access to the GitLab server, this is the whole memory class:
import numpy as np


class BaseMemoryNumpy:
    """
    The replay memory holds many previous states of the game environment.
    This helps stabilize training of the neural network, because the data
    is more diverse when sampled over thousands of different states.

    Parameters
    ----------
    size: int
        Capacity of the replay memory. This is the number of states.
    shape: Union[tuple, list]
        The dimensions of an observation.
    action_space: int
        Number of possible actions in the game environment.
        This is available in case you want to extend it with Q-values.
    stacked_frames: bool
        If True, states are stored as LazyFrames instead of numpy arrays.
    """

    def __init__(self, size, shape, action_space=None, stacked_frames=False):
        self.size = size
        self.shape = shape
        self.action_space = action_space
        self.stacked_frames = stacked_frames

        self.pointer = 0
        self.filled = False

        if stacked_frames:
            # States are expected to be LazyFrames
            self.states = [None for _ in range(size)]
            self.next_state = [None for _ in range(size)]
        else:
            self.states = np.zeros(shape=(size, *shape), dtype=np.float32)
            self.next_state = np.zeros(shape=(size, *shape), dtype=np.float32)

        self.actions = np.zeros(shape=size, dtype=np.uint8)
        self.rewards = np.zeros(shape=size, dtype=np.float16)
        self.done = np.zeros(shape=size, dtype=bool)

    def __getitem__(self, item):
        state = self.states[item]
        action = self.actions[item]
        reward = self.rewards[item]
        done = self.done[item]
        next_state = self.next_state[item]
        return state, action, reward, done, next_state

    def __len__(self):
        return self.pointer if not self.filled else self.size

    @property
    def is_full(self):
        """ Indicator for when the pointer has reached the limit. """
        return self.pointer == self.size

    @property
    def percentage(self):
        return self.pointer / self.size * 100 if not self.filled else 100.0

    def reset(self):
        self.pointer = 0
        self.filled = False

    def get_batch(self, batch_size=32):
        idx = np.random.randint(0, len(self) - 1, batch_size)

        if self.stacked_frames:
            states = [self.states[x] for x in idx]
            states_next = [self.next_state[x] for x in idx]
        else:
            states = self.states[idx]
            states_next = self.next_state[idx]

        actions = self.actions[idx]
        rewards = self.rewards[idx]
        done = self.done[idx]
        return states, actions, rewards, done, states_next

    def add(self, state, action, reward, done, next_state):
        if self.is_full:
            self.pointer = 0
            self.filled = True

        k = self.pointer
        self.pointer += 1

        self.states[k] = state
        self.actions[k] = action
        self.rewards[k] = reward
        self.done[k] = done
        self.next_state[k] = next_state
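A quick sanity check of the class above, using the plain numpy (non-stacked) variant with made-up data; the lesson itself uses the stacked_frames=True path with LazyFrames:

import numpy as np

memory = BaseMemoryNumpy(size=1_000, shape=(105, 80))
for _ in range(100):
    frame = np.random.rand(105, 80).astype(np.float32)
    memory.add(frame, action=1, reward=0.0, done=False, next_state=frame)

states, actions, rewards, done, next_states = memory.get_batch(batch_size=32)
print(states.shape, actions.shape)  # (32, 105, 80) (32,)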
The agent is split into three parts: __init__, run and evaluate.
In the __init__ you have to create the environment (and wrap it), the model, the memory and any saving logic.
The run method contains the game-loop logic and is very similar to previous creations.
Finally, evaluate can be used to evaluate your agent, and can also be used to initialize the running mean.
Good luck!
import time
import numpy as np
import os, sys, pathlib

# Get the top directory (this is required for console usage)
dir_atari = str(pathlib.Path(os.path.abspath(__file__)).parents[4])
sys.path.append(dir_atari)

dir_core = os.path.join(dir_atari, 'framework')
sys.path.append(dir_core)

from core import MultiEnv
from core.agents import AbstractAgent
from core.memory import BaseMemoryNumpy
from core.preprocessing.wrappers import FrameStack

from bots.breakout.dqn.deepmind import BreakoutModel
from bots.breakout.dqn.deepmind import BreakoutPreprocessor


class Breakout(AbstractAgent):
    nr_episodes = 1_000_000
    batch_size = 32
    shape = (105, 80, 4)

    resume = True
    save = True
    save_interval = 100
    save_folder = 'data'

    def __init__(self, setup):
        super().__init__(setup)
        self.setup = setup
        self.instances = sum(setup.values())

        # Get a breakout deterministic 4 environment
        self.env = MultiEnv(setup, use_multiprocessing=True)
        # Apply all the preprocessing; note that we add frame stacking last, since it reduces memory size.
        self.env = FrameStack(BreakoutPreprocessor(self.env), 4)

        # Retrieve the model, note that the model also requires a mask.
        self.action_space = list(self.env.action_space.values()).pop(0).n
        self.model = BreakoutModel(input_shape=self.shape, output_shape=self.action_space)

        # Setup memory
        self.memory = BaseMemoryNumpy(
            size=120_000,
            shape=self.shape,
            action_space=self.env.action_space,
            stacked_frames=True
        )
        # We update the memory, because we are using a one hot encoding for the actions
        self.memory.actions = np.zeros(shape=(self.memory.size, self.action_space), dtype=bool)

        # Define data storage of the process
        self.save_file = os.path.join(self.save_folder, self.current_time + ".txt")
        if not os.path.exists(self.save_folder):
            os.mkdir(self.save_folder)

        # For restoring a model from a checkpoint
        if self.resume:
            # self.save_name = self.model.create_save_directory(
            #     agent_name=self.__class__.__name__,
            #     game_name=next(iter(self.env.spec.keys())),
            #     custom_name="")
            self.model.dir_load = self.save_folder
            self.model.load_checkpoint(load_name='last')

    @property
    def current_time(self):
        return time.strftime('%Y-%b-%d-%a_%H.%M.%S')

    def run(self, nr_games=None):
        self.nr_episodes = nr_games if nr_games is not None else self.nr_episodes
        episodes = 0
        running_mean = self.evaluate(nr_games=250)  # Also initializes the replay memory

        scores = np.zeros((self.instances,))
        obs = self.env.reset()

        while episodes < self.nr_episodes:
            for _ in range(4):
                # We need to have a copy of the observation, otherwise we lose frame stacking.
                model_input = np.array(obs[:]).transpose([0, 2, 3, 1])

                # Retrieve the action
                self.model.update_epsilon()
                actions = self.model.act(states=model_input)

                # Perform a step
                next_obs, rewards, done, info = self.env.step(actions=actions)
                scores += rewards

                # We create a mask of the actions and add it to the memory
                actions = np.eye(self.action_space)[actions]

                # Separately add every observation to the memory.
                for args in zip(obs, actions, rewards, done, next_obs):
                    self.memory.add(*args)

                # Update the episodes.
                for idx in np.where(done)[0]:
                    print(f"Finished game {episodes + 1:7,d} / {self.nr_episodes:7,d}, "
                          f"frames: {self.model.frames: 10,d}, "
                          f"with score {int(scores[idx]): 3d}, "
                          f"epsilon {self.model.epsilon:0.3f}, " +
                          (f"memory: {self.memory.percentage: 6.2f}%, " if not self.memory.filled else '') +
                          f"running mean: {running_mean:6.3f}")

                    # Log the progress
                    with open(self.save_file, 'a+') as file:
                        file.write(f"episode {episodes + 1:7,d}, "
                                   f"score: {int(scores[idx]): 3d}, "
                                   f"running mean: {running_mean:6.3f}, "
                                   f"epsilon {self.model.epsilon:0.3f}, "
                                   f"frames: {self.model.frames: 10,d}\n")

                    # Update information
                    running_mean = running_mean * 0.99 + 0.01 * scores[idx]
                    episodes += 1
                    scores[idx] = 0

                    if self.save and episodes % self.save_interval == 0:
                        self.model.dir_save = self.save_folder
                        self.model.save_checkpoint('model.pkl')

                obs = next_obs[:]

            if len(self.memory) > 8 * self.batch_size:
                for _ in range(self.instances):
                    self.model.fit_batch(0.99, *self.memory.get_batch(self.batch_size))

    def evaluate(self, nr_games=10, render=False):
        print(f"\nEvaluating agent for {nr_games} games on {self.setup}", end='\n\n')

        venv = MultiEnv(self.setup, use_multiprocessing=True)
        venv = FrameStack(BreakoutPreprocessor(venv), 4)

        frames = 0
        episodes = 0
        score_sum = 0
        scores = np.zeros((self.instances,))
        obs = venv.reset()

        while episodes < nr_games:
            if render:
                venv.render()

            model_input = np.array(obs[:]).transpose([0, 2, 3, 1])
            actions = self.model.act(states=model_input)

            # Perform a step
            next_obs, rewards, done, info = venv.step(actions=actions)
            scores += rewards
            frames += 1

            # We create a mask of the actions and add it to the memory
            actions = np.eye(self.action_space)[actions]

            # Separately add every observation to the memory.
            for args in zip(obs, actions, rewards, done, next_obs):
                self.memory.add(*args)

            # Update the episodes.
            for idx in np.where(done)[0]:
                print(f"Finished game {episodes + 1:7,d} / {nr_games:3,d}, with score {int(scores[idx]): 3d}")
                score_sum += scores[idx]
                episodes += 1
                scores[idx] = 0

            obs = next_obs[:]

        venv.close()
        print(f"\nPlayed {nr_games} games, frames: {frames:6,d}, with an average of {score_sum / nr_games: 5.2f}\n")
        return score_sum / nr_games


if __name__ == '__main__':
    agent = Breakout({'BreakoutDeterministic-v4': 2})
    agent.run()
    agent.evaluate(nr_games=50, render=True)
Not added; if you have one or need one, contact the education committee.
Games played | Frames | Running mean |
---|---|---|
0 | 0 | 0 |
5,000 | 150,781 | 1.32 |
10,000 | 322,001 | 2.87 |
15,000 | 529,161 | 4.54 |
20,000 | 802,022 | 7.68 |
25,000 | 1,182,586 | 11.41 |
30,000 | 1,664,879 | 16.51 |
As you might have noticed, there is a lot going on in the creation of a machine learning agent. A good way to verify that everything is working correctly is to take your time and first test whether the model can overfit a single batch. If it is unable to do that, there is probably an implementation error.
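A minimal sketch of such an overfitting test, assuming the BreakoutModel defined earlier (and therefore the framework's BaseModelKeras) is importable; the batch here is random stand-in data:

import numpy as np

model = BreakoutModel(input_shape=(105, 80, 4), output_shape=4)

# One fixed batch of fake states, one-hot action masks and masked targets.
states = np.random.randint(0, 255, size=(32, 105, 80, 4), dtype=np.uint8)
masks = np.eye(4)[np.random.randint(0, 4, size=32)]
targets = masks * np.random.rand(32, 1)

# Fit repeatedly on the same batch; the loss should drop sharply.
history = model.model.fit([states, masks], targets, epochs=50, verbose=0)
print(history.history['loss'][0], '->', history.history['loss'][-1])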
Overall, the above steps appear in most machine learning projects: obtaining your training data, preprocessing the data, implementing a model suited to that kind of data, fine-tuning the hyperparameters, and evaluating your model and data over time.