Conversation

ghost commented Jun 26, 2020

In #1 we successfully implemented DQN starting with the framework we implemented in github.com/dbulhosa/ilsvrc for training Imagenet models. We tested our DQN implementation by solving OpenAI Gym's CartPole-v0 environment. In #2, #3, and #4 we tuned our hyperparameters and models to create agents that not only solved CartPole-v0 better than our first solution, but also solved Acrobot-v1, MountainCar-v0, and CartPole-v1. Solving these problems and extending our library in the process has made us confident enough in this framework to use it with Atari games.

We next extend the framework to accommodate Atari games. This requires:

  • Adding support for features built from observations, rather than passing observations directly to the model.
  • Writing code to serialize and deserialize replay buffer data in case the replay buffer needed (about 1M transitions) is too big to fit in memory.
  • Changing hyperparameters to match those from the Atari papers, including adding an exploration schedule and checking whether any learning rate scheduling is needed.

ghost commented Jun 27, 2020

Here's what we need to change to generalize the EnvironmentSequence class to fit the Atari use case:

  • Generalize the get_latest_observation method to a get_latest_feature method that returns whatever the latest input feature for the model is, whether it's just the observation itself or some processed version of it.
  • Implement some method that creates the necessary feature from a given observation or observation index.
  • Use the get_latest_feature method in the get_action method instead of get_latest_observation.
  • Tweak the get_states_length and sample_indices methods to sample indices that make sense for the features being created, specifically when said features are built from multiple stacked observations as is the case in Atari.
  • Change the get_minibatch method to use the sampled indices to generate the right features for the states array. Rename the states array to the features array.

Note that the simulation and existing replay buffers can stay the same. Our philosophy will be not to store features but to calculate them on the fly based on observations instead. In the case of Atari games this should significantly reduce the memory usage required at the expense of minimal compute. This also allows us to preserve the simulation, replay memory, target agent, double DQN logic, initialization, and minibatch generation.

We can make some of these methods abstract in a parent class and then define them in child classes. This way we will generalize the existing EnvironmentSequence class to different features!
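
A rough sketch of the proposed split, using names that show up later in this PR (the layout and signatures are assumptions, not the actual code):

```python
from abc import ABC, abstractmethod

import numpy as np


class SynchronousSequence(ABC):
    """Generalized sequence: simulation, replay buffer, and minibatch logic."""

    def __init__(self, model):
        self.model = model
        self.observation_buffer = []  # raw observations recorded during simulation

    @abstractmethod
    def get_feature_by_index(self, index):
        """Build the model input feature for the observation at `index`."""

    def get_latest_feature(self):
        # The latest feature is whatever the latest observation maps to, whether
        # that is the observation itself or some processed version of it.
        return self.get_feature_by_index(len(self.observation_buffer) - 1)

    def get_action(self):
        # Greedy action on the latest feature (exploration omitted for brevity).
        feature = self.get_latest_feature()
        return int(np.argmax(self.model.predict(feature[np.newaxis, ...])[0]))
```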

danielbulhosa and others added 9 commits June 26, 2020 21:45
We're moving towards creating child classes implementing different
feature creation functions. In order to do this we extracted the
relevant functions from the original `EnvironmentSequence` class
(now named `SynchronousSequence` for generality) into a new class.
We named the new class `EnvironmentSequence` to keep our code compatible
with the agents we've already implemented for classic control.
We may rename it later.

See the notes in the PR for the methods that were moved and why.
Hotfix: Accidentally committed these to master instead of atari_rl branch
We ended up simplifying the inheritance significantly
and moving some methods back to this class.

- Move `get_action` method back into this class. Replace the
  `get_latest_observation` call in this method with `get_latest_feature`.
  This modification generalizes this method to allow it to
  pass general features rather than observations to the agent
  in order to get the action.
- Moved `get_latest_observations` back into this class, but
  changed it to `get_latest_feature`. We no longer get the n latest
  observations, just the very latest feature (which is
  all we were using anyway). This method simply calls the
  abstract `get_feature_by_index` method which gets implemented
  in the child classes.
- We modify the `get_minibatch` method to use the aforementioned
  `get_feature_by_index` method along with the sampled indices
  to create the stack of states, rather than indexing the
  observation buffer directly. This generalizes the creation
  of minibatches, which now can be built directly from observations
  OR from features created from observations.
- Move `sample_indices` method back to this class, but change it
  to have a custom lower bound on the indices that can be sampled.
  This way, if we create features by stacking observations we can
  avoid sampling observations without enough preceding history
  in the buffer to create a stack.
- The custom lower bound in the `sample_indices` method is
  computed by a new abstract method named `get_states_start`.
- Added an abstract declaration for the `get_feature_by_index` method.

This makes it so that to define our child classes we only need
to define two methods!

- `get_feature_by_index` which handles getting the correct
  feature for a particular timestep.
- `get_states_start` to define the lower bound of timesteps
  we can sample to create features.
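
A hedged sketch of what such a child class could look like, building on the abstract parent sketched earlier (the `AtariSequence` name appears later in this PR; the stacking details here are assumptions):

```python
import numpy as np


class AtariSequence(SynchronousSequence):
    """Child class whose features are stacks of the most recent observations."""

    stack_size = 4  # the Atari papers stack 4 frames; value assumed here

    def get_states_start(self):
        # The first `stack_size - 1` timesteps lack enough preceding history
        # to build a full stack, so index sampling starts after them.
        return self.stack_size - 1

    def get_feature_by_index(self, index):
        # Stack the observation at `index` with its predecessors, oldest first.
        frames = self.observation_buffer[index - self.stack_size + 1: index + 1]
        return np.stack(frames, axis=-1)
```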

ghost commented Jun 27, 2020

Codewise we're mostly there; we just need to tailor the agent in our rundef to match the specs for the Atari agents. Everything else is in place and should in theory work!

NOTE: We will need to preprocess rewards as well to normalize them. Thus we'll want to add a method that preprocesses rewards, like the one we have for preprocessing features, to both the parent and child classes.

In the Atari papers the actual reward from the game is
modified to +1, -1, or 0 depending on whether it is zero
and, if not, on its sign. This is so that the same agent can
learn to play different games irrespective of the scale
of the rewards.

In order to allow this functionality in our framework
we have wrapped the reward buffer with a `get_reward_by_index`
method that allows us to do an intermediate transformation
on the rewards before using them to compute the minibatch
labels. This is similar to how we wrapped the observation
buffer with the `get_feature_by_index` method to allow
transformations of observations into features. It is the
next natural step towards abstracting environment outputs
so they can be preprocessed before passing them to the agent.

We modified the parent environment file to add the
abstract method `get_reward_by_index` and then implemented the
corresponding methods in each of the two child classes. For
the `EnvironmentSequence` class used for classical discrete
control, there is no transform; we just feed the raw rewards
to the agent as we have been doing.
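
Concretely, the Atari-side transform can be as simple as taking the sign of the raw reward, while the classical-control version is the identity (a sketch, not the exact code):

```python
import numpy as np


def clip_reward(raw_reward):
    # Atari-style reward clipping: +1 for positive, -1 for negative, 0 for zero.
    return float(np.sign(raw_reward))


# EnvironmentSequence.get_reward_by_index(i) -> reward_buffer[i]            (identity)
# AtariSequence.get_reward_by_index(i)       -> clip_reward(reward_buffer[i])
```
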
In the Atari paper the evaluation is done over an open-ended
number of episodes, but with an upper bound on the number of
frames. To allow this in our evaluation callback we had to
modify the evaluation code. We do this by:

- Adding a `num_max_iter` parameter to the constructor of the
  class to specify the maximum number of iterations.
- Allowing both this new parameter and the existing `num_episodes`
  to take on the value np.inf to specify they should be without
  bound.
- Using a while loop instead of a for loop to iterate over episodes
  in the simulation method. This way if the number of episodes is
  unbounded the loop will still work.
- Adding conditional checks in the simulation loop to check if
  we have exceeded the maximum number of iterations and halt
  and exit simulation correctly if so.
- Change reward and episode length arrays to lists to allow
  for possible variable length.
- Return, print, and log to Tensorboard the number of episodes
  simulated, so if `num_max_iter` is set to a finite number we
  know how many episodes we actually simulated. Obviously if
  there is no limit on the number of iterations but there is
  one on the number of episodes, the value printed and logged
  will always equal that limit.
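
A minimal sketch of that loop structure (the names and the single-step call are assumptions):

```python
import numpy as np


def simulate_episodes(env_step, num_episodes=np.inf, num_max_iter=np.inf):
    """Run episodes until either bound is hit; both bounds may be np.inf."""
    episode_rewards, episode_lengths = [], []  # lists, since length may vary
    total_iter, episodes_run = 0, 0

    while episodes_run < num_episodes and total_iter < num_max_iter:
        episode_reward, episode_length, done = 0.0, 0, False
        while not done and total_iter < num_max_iter:
            reward, done = env_step()  # assumed: one simulation step
            episode_reward += reward
            episode_length += 1
            total_iter += 1
        episode_rewards.append(episode_reward)
        episode_lengths.append(episode_length)
        episodes_run += 1  # the last episode may be partial if the iteration cap hit

    return episode_rewards, episode_lengths, episodes_run
```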

ghost commented Jun 29, 2020

Everything should be code complete. We are getting a CUDA error when running the rundef file. We need to take a look at that -- we think resetting the computer may be enough to fix it. Take a look at the FIXMEs in the rundef file for Breakout v00.

We also need to decide if we want to use double DQN and, if so, make the necessary tweaks to the rundef. These are all specified in the double DQN paper and, with the exception of the bias one, should be straightforward to implement.

FIXES: The reset ended up doing the trick for the CUDA error. We ended up deciding to use double DQN, but with the original DQN parameters and spec for simplicity -- mainly because of the extra work of implementing a constraint on the final layer's kernel biases. We also realized from reading the double DQN paper that frames produced during action repeats are skipped by the agent for learning altogether, and made changes in our code to enable this "frame skipping" functionality.

The `skip_frames` parameter determines whether frames
that the environment produces during skip actions are
saved in the buffer (and thus used for training) or not.
After reading the DQN and double DQN papers more closely
we realized that during action repeats frames seemed to
not be stored in the replay memory. The paper says the
DQN agents "do not see" these frames. Thus we added a
mechanism to skip these frames altogether and not add
them to the memory.

For backwards compatibility purposes this parameter
is defaulted to `False` in the class constructor.
Since we will be skipping frames altogether from the training
process we thought it would make sense to sum the rewards
of the skipped frames into a single reward.
That way if rewards are created by the environment in skipped
frames the agent still knows about them. We implemented this
by modifying the `get_reward_by_index` method to look at
all previous rewards in the action repeat window.
- We implemented skip frames in our generators, so we set the
  `skip_frames` parameter to `True` so that this agent would
  skip frames.
- In reading the double DQN paper we realized that many of the
  parameters in the original Atari paper are formulated in terms
  of "steps" where a step is four repeated actions and the resulting
  rewards, observations, etc. We reformulated our parameter definitions
  in terms of steps, where a step is defined by the number of
  actions repeated.
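
A sketch of that reward aggregation and the step bookkeeping (the window of 4 matches the papers' action repeat; the buffer layout and function names are assumptions, the real logic lives in `get_reward_by_index`):

```python
ACTION_REPEAT = 4  # one "step" = 4 repeated actions in the Atari papers


def get_summed_reward(reward_buffer, index, action_repeat=ACTION_REPEAT):
    # Sum the rewards emitted during the action-repeat window ending at `index`,
    # so rewards produced on skipped frames still reach the agent.
    window = reward_buffer[max(0, index - action_repeat + 1): index + 1]
    return sum(r for r in window if r is not None)


def steps_to_actions(num_steps, action_repeat=ACTION_REPEAT):
    # Paper hyperparameters quoted in "steps" translate to this many actions.
    return num_steps * action_repeat
```
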

ghost commented Jul 1, 2020

The rundef for the Atari Breakout agent runs, yay! The training runner does not run for this agent though. We have a couple of important errors to solve to fix this:

  • Our agent evaluation callback reimplemented the simulation method rather than reusing the one from the Sequence class used by the agent. As such the evaluation method does not know how to create features from observations, it has no notion of skip frames, and it does not have a buffer to create frames from. We think we'll probably want to rethink this method and have it create its own sequence instances so it has access to these methods. This is currently causing an error with evaluation, since the on-train-start callback call is passing image tensors with the wrong shape to the model.

  • Our stacked frames are stacking frames across different episodes, including trying to sum the None rewards at episode ends with rewards from the end of one episode and the beginning of the next. We need to prevent frame stacking across episodes. The easiest way to accomplish this is by excluding indices from minibatches that cross episodes. [DONE - Easy fix]

…sode ends

Certain frame stacks were intersecting episode ends. It doesn't make sense
to pass frame stacks to the model that intersect the transition between
episodes. We changed the logic calculating valid indices to calculate
the validity of all indices in a frame stack, not just the leading index.
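
A hedged numpy sketch of that validity check (the window convention and names are assumptions):

```python
import numpy as np


def compute_valid_indices(done_flags, stack_size=4):
    """Indices whose frame stack lies entirely within one episode.

    `done_flags` marks episode ends; an index is valid only if it has enough
    preceding frames for a full stack and none of the preceding frames in its
    stack window is an episode end.
    """
    done = np.asarray(done_flags, dtype=bool)
    valid = np.ones(len(done), dtype=bool)
    valid[: stack_size - 1] = False  # not enough history for a full stack
    for offset in range(1, stack_size):
        # A done flag `offset` frames back means the stack would cross episodes.
        valid[offset:] &= ~done[:-offset]
    return np.flatnonzero(valid)
```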

ghost commented Jul 2, 2020

We tried running an epoch and found that a single one (about 200k frames) would take about 48 minutes, which would mean a total running time of 30+ days. This is much longer than what was stated in the paper. We did some performance analysis by timing the simulation and the minibatch generator. We found that:

  • The minibatch getter was about an order of magnitude slower than the simulation loop.
  • The valid index computing and buffer sampling + feature generation steps were about an order of magnitude slower than other parts of the minibatch generation (including the model evaluation).

We think that if we make the valid index and observation buffer sampling and featurization calculations more efficient we can reduce epoch time by almost 10x. Will focus on this next before worrying about evaluation.

  • Valid index optimization. [DONE - 100x improvement by vectorizing with numpy]
  • Observation buffer sampling and feature computation: 15 ms for each of the states and next-states arrays.
    • The most expensive subroutines are taking the max of adjacent frame pairs and the cv2 transformations.
    • Each of these contributes equally to the time cost of the feature creation.
    • Each of the methods within the feature creation method contributes equally to the time cost.
    • NOTE: we checked, and the maximum and cv2 preprocessing functions commute! We can use this to our advantage.
      Maybe we should preprocess the images during simulation, and take the pairwise max during minibatch creation.
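
A hedged sketch of the preprocessing pieces being timed here (84x84 grayscale and the pairwise max follow the DQN papers; the exact cv2 calls are assumptions):

```python
import cv2
import numpy as np


def preprocess_frame(frame, size=(84, 84)):
    # Grayscale + downscale, as in the DQN papers (interpolation mode assumed).
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, size, interpolation=cv2.INTER_AREA)


def merge_adjacent(frame_a, frame_b):
    # Pixel-wise max over adjacent raw frames, used to remove sprite flicker.
    return np.maximum(frame_a, frame_b)


# If the two operations commute as noted above, the cheap option is to
# preprocess once during simulation and only take the pairwise max when
# assembling minibatches:
# feature_frame = merge_adjacent(preprocess_frame(f1), preprocess_frame(f2))
```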

We realized though that it would be cheaper memory-wise to store an 84 x 84 x 4 image stack vs. what we're doing currently, which is saving single 210 x 160 x 3 RGB images. It may thus be both more computationally efficient and memory efficient to save the features in the buffer instead of saving the raw observations. Making the change though will require changing how we handle recording into the buffer and how we handle feature transformations in the child sequence classes.

Also, just realized that due to frame skips stacks of 4 frames really correspond to 16 frames of play (since the replay buffer does not store intervening frames). Is this what we want? Does this make sense? It may but I need to think about it more. The answer to this question can be influenced by the question of how to properly do a feature buffer.


ghost commented Jul 17, 2020

We got it to work!!! Increasing the number of workers from 0 to 1 successfully brought the epoch time down to 11.7 minutes per epoch!!! The total time for running a full training (excluding evaluation) with 200M frames is now 2 days, which is fantastic. For reference, running models on Imagenet took us 2+ days, and the original DQN implementation took over a week to run!!!

We timed the per minibatch time and it came to 14ms, which aligns with our previous estimate for the forward and backpropagation time. We doubt we'll get it to run much better than this, mainly because in theory the bottleneck now should be the forward and backpropagation itself, not the minibatch generation. Only the latter is under our control. Irrespective, this performance is more than satisfactory.

The only things left to fix are:

  • That indexing bug we mentioned in the previous comment. A potential way to surface this bug is making the experience replay much smaller temporarily, so that the error doesn't get lost in sampling.
    • We fixed it. It was a simple bug: We weren't accounting for the length of the frame stack when invalidating replay memory indices in the leftmost side of the replay memory (the oldest).
  • Improving the evaluation callback as discussed in previous comments.
    • In Progress. See comment below.
  • Making v00/rundef.py into v01/rundef.py.
    • Done.
  • Frame stacks can technically cross episodes when we use get_latest_feature to select an action, if the action selection happens near the beginning of an episode.
    • We fixed this. When getting an action we check if the latest frames in the replay memory contain an episode end. If so we select a random action. Since episode starts are infrequent and frame stacks are pretty short (4-5 frames) this shouldn't really affect the performance of the agent much. It also was a pretty elegant, minimal solution codewise (as opposed to the keras-rl solution where black frames replace the frames from the preceding episode).

Also, we noticed that using the built-in model cloning method in Keras and manually compiling the copy we use for the target model seems to be faster than deepcopying. This is a tiny optimization we could look at in the future, but we don't think it's worth the extra code complexity.
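
A sketch of that cloning approach (assuming `tf.keras`; the compile settings are placeholders, not the ones used in this project):

```python
from tensorflow import keras


def make_target_model(model):
    # Clone the online network's architecture and copy its weights, instead of
    # deepcopying the whole model object; compile settings here are assumed.
    target = keras.models.clone_model(model)
    target.set_weights(model.get_weights())
    target.compile(optimizer=keras.optimizers.Adam(), loss="mse")
    return target
```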

Two changes:

- Add a case in get_states_length method to account for when pair_max
  is false and the frame stack we sample is 4 frames instead of 5.
- Use get_states_length method to calculate the correct point at which to
  invalidate indices at the far left of the valid index memory buffer.
  The get_states_length leftmost entries of the memory buffer are
  invalid for sampling because there are not enough frames to the left
  of these to sample a full stack. The fact this was not accounted for
  in the index invalidation logic was creating index errors.
When selecting actions we pass a stack of frames (our feature) to the
model. In particular we pass the most recent stack of frames (the
rightmost end) in the replay memory. If the action is selected early
in a new episode, some of the frames in the recent frame stack will
include frames from the previous episode. Thus if we stacked these
frames we would be passing frames from two different episodes to
the model. This clearly wouldn't make sense as a feature.

We therefore made a change to action selection: When the most
recent frame stack crosses an episode boundary we select a
random action instead of passing the frame stack to the model.
This avoids this issue. Furthermore, since frame stacks are
relatively small (4-5 frames) and episode starts are not that
common this should not affect the performance of the agent
very much.
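
A minimal sketch of that action-selection guard (names assumed):

```python
import random


def choose_action(done_buffer, stack_size, model_action, num_actions):
    # If any of the most recent `stack_size` frames marks an episode end, the
    # latest frame stack would mix two episodes, so act randomly instead.
    if any(done_buffer[-stack_size:]):
        return random.randrange(num_actions)
    return model_action()
```
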
As mentioned in previous commits, we solved the issue that
prevented us from running two threads: one for creating minibatches
and another for backpropagating. The problem had to do with copies
of the same model running on different threads.

We also removed the shuffling on minibatch indices. Since our
agent is stateful our minibatch generation only makes sense in
order. Hence the change.
…l callback

As stated in the PR, the evaluation callback as written cannot work
for non-classical environments, specifically our Atari one. This is
because the code for generating features to feed to the model to
get the next action is part of the AtariSequence class. The
evaluation callback does not have access to this class. In order
to be able to pass this class to the callback we have to make changes
to the AtariSequence class that would let it play nice with the callback.
The changes for this purpose were:

- Extract the core logic within the for loop in the `simulate` method
  in order to have the logic for simulating a "single iteration" in
  its own method. This new method is called `simulate_single`. This
  makes the logic of both the `simulate` and `simulate_single` methods
  clearer.

- Write getter methods for the latest state variables we didn't already
  have getters for, namely reward, done, and action. As we mention below,
  this helped clean up our logic. It also gives the evaluation callback
  an interface to request reward and done values from the AtariSequence class
  at each simulation single step to calculate model performance (based
  on total rewards) and control episode termination logic (based on
  done value).

We also removed some unnecessary complexity from the SynchronousSequence
class:

- Since the OpenAI Gym implementation of the Atari environments has
  frame skipping built in we removed our manual frame skipping. It
  is no longer present as a constructor parameter nor does it influence
  control flow in the simulation logic. We removed it both from the
  SynchronousSequence class and the AtariSequence class.

- Since we wrote a method for simulating a single step we did not
  require the variables `self.initial_sims` and `self.initial_iterations`.
  These variables were used to account for the fact that to fill the
  replay buffer initially by some amount N, we needed to do so in
  multiples of the minimum batch size. Using the method `simulate_single`
  instead of `simulate` for the initial filling of the replay
  buffer is significantly simpler and more intuitive. We updated references
  to these two extraneous variables throughout the class code, using
  `self.replay_buffer_min` instead.

- Removed internal variables keeping track of the last simulated state
  (e.g. self.prev_observation, self.prev_action, self.prev_done, self.prev_reward).
  The replay buffer contains all of this information already (in
  the rightmost buffer entries). Rather than storing these in separate
  variables we wrote getters to get the rightmost (index - 1) entry
  in each buffer. Since indexing in Python lists is O(1) we should not
  incur any serious compute cost from this, but it simplifies code
  significantly.
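
A stripped-down sketch of the resulting structure (the classic Gym `step` signature and all internals are assumed; the real class is far more involved):

```python
class SynchronousSequence:
    """Sketch of the simulate/simulate_single split and the buffer getters."""

    def __init__(self, env):
        self.env = env
        self.observation_buffer, self.action_buffer = [], []
        self.reward_buffer, self.done_buffer = [], []

    def simulate_single(self):
        # One environment step: choose an action, step, and record the outcome.
        action = self.env.action_space.sample()  # placeholder for the real policy
        observation, reward, done, _ = self.env.step(action)
        self.observation_buffer.append(observation)
        self.action_buffer.append(action)
        self.reward_buffer.append(reward)
        self.done_buffer.append(done)

    def simulate(self, num_steps):
        # `simulate` is now just a thin loop over `simulate_single`.
        for _ in range(num_steps):
            self.simulate_single()

    # Getters replace the old self.prev_* variables: the rightmost buffer entry
    # already holds the most recent value, and list indexing is O(1).
    def get_latest_reward(self):
        return self.reward_buffer[-1]

    def get_latest_done(self):
        return self.done_buffer[-1]
```
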
A major bug with the evaluation callback when used for the Atari
environment is that the callback did not contain the methods for
generating features and actions. In order to correct this we have
modified the custom callback class so that:

- Instead of passing the environment, the gamma value, and the
  exploration schedule, we just pass a method that constructs
  an instance of the appropriate SynchronousSequence subclass
  for the problem.

- Each time simulations are required for evaluating the performance
  of an agent, an instance of the SynchronousSequence subclass is
  created and passed to the callback class's `simulate_episodes`
  method. The instantiation of sequence classes can be seen in the
  `on_train_begin` and `on_epoch_end` methods.

- The `simulate_episodes` method utilizes the `simulate_single`,
  state getter methods, and internal iteration and episode tracker
  variables contained in the SynchronousSequence class. This way,
  the simulation logic of the callback has access to the logic for
  computing complex features and resulting actions contained in the
  sequence class.

- Given that the sequence class provides the complete logic for
  getting states we removed the methods for getting actions from
  the custom callback class. The sequence class takes care of this
  now.

We also added missing documentation for the callback class constructor
parameter `num_max_iter`.

ghost commented Jul 31, 2020

We made the necessary changes for having the evaluation callback class use the sequence class for generating features and selecting actions. Some outstanding items:

  • Debug evaluation callback and sequence constructor method. Getting some bugs at the moment. [DONE - Much easier than anticipated. Very minor bug.]
  • Fix classical control rundefs. The change to have the evaluation callback use the sequence constructor method is breaking for their rundefs. We need to make changes to those rundefs similar to the changes we just did for the breakout_v0 rundef. [DONE - we also took the opportunity to simplify how the instantiation of evaluation sequence instances gets done]
  • Make the Atari evaluation sequence constructor take 1 parameter instead of 0. It should take the exploration schedule lambda as a parameter. This way the evaluation callback can use the constructor to create an agent + environment that follows the random policy. This is needed for the initial agent evaluation at the beginning of training. [DONE - Added a parameter in the constructor to use the random policy or not instead]

We're so close!!!

- Assume that the sequence constructor has a parameter `random` that if
  set to true returns a sequence representing an agent using the
  random policy.
- Use this `random` parameter to instantiate the random policy
  agent in the `on_train_begin` method.
- Fix `agmeth.evaluate_state` method signature to include graph
  parameter (more recent addition). Pass graph to this method
  by pulling graph property from the sequence class constructed
  for evaluation.
- Instead of using the environment contained in the constructed sequence
  to get initial states, for each initial state we want to sample
  we construct a new sequence and simulate for the minimum number
  of steps needed to construct the initial feature. This is done
  in the `on_epoch_end` method. Doing it this way allows us to use
  the shared API we constructed for SynchronousSequence subclasses
  to use them to generate the necessary features.
We found that sampling 10,000 initial states at the end of each epoch
would take 20 minutes each time. This is more than 2x the length of an
epoch itself!

Instead of doing this each epoch we made two changes:

- Move the sampling of initial states to the construction of the
  custom callback. This way we only pay the cost of sampling the
  states once per training rather than once per epoch.

- Serialize the sampled initial states and save them into a compressed
  .npz file. This way we only incur the sampling cost once **per
  model** (possibly per environment) and we can have a consistent
  baseline of states for comparing the performance of different
  versions of the model (or different, but similar models with
  the same environment).

Doing this makes evaluation extremely fast, at the price of the
initial run to generate the sampled initial states.
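
A minimal sketch of that serialization (the file name is a placeholder):

```python
import numpy as np


def save_initial_states(states, path="initial_states.npz"):
    # Serialize the sampled initial states once so later runs and later model
    # versions can reuse the exact same evaluation baseline.
    np.savez_compressed(path, states=np.asarray(states))


def load_initial_states(path="initial_states.npz"):
    with np.load(path) as data:
        return data["states"]
```
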
…lass

We noticed that it would make more sense to have the constructor for
evaluation sequence subclass instances be included as part of the
class itself. We added a method `create_validation_instance` to each
subclass of `SynchronousSequence`. This method returns an instance
of the subclass with parameters chosen to be used for the validation
callback. The method takes the evaluation exploration schedule as a
parameter. This is what is expected by the (updated -- see next commits)
evaluation callback.

We also created the constructor for the evaluation of agents using the
`EnvironmentSequence` class, which we previously had not implemented.
This allows us to use the evaluation callback in its current implementation
with our classical control agents. Thus this makes our changes to the callback
backwards compatible with our previous work.
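
A sketch of the pattern (the constructor arguments and environment id below are placeholders, not the real signatures):

```python
import gym


class AtariSequence:
    """Minimal stand-in for the real sequence class (constructor args assumed)."""

    def __init__(self, env, epsilon):
        self.env = env
        self.epsilon = epsilon  # exploration schedule: iteration -> epsilon

    @classmethod
    def create_validation_instance(cls, eval_exploration_schedule):
        # Everything except the evaluation exploration schedule is fixed here,
        # so the evaluation callback only needs this one hook.
        return cls(env=gym.make("BreakoutDeterministic-v4"),
                   epsilon=eval_exploration_schedule)
```
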
…ck def

We made breaking changes to the evaluation callback and sequence subclasses
that made them not work with the classical control agents. We updated the
rundef of these agents to make them backwards compatible:

- Extract graph for these agents' models so we can run generators and
  models in different threads if needed for double DQN. Double DQN
  is not implemented for these models so it's not relevant, but the
  SynchronousSequence constructor expects the graph to be passed
  to it anyway. We may simplify this extraneous parameter for these
  agents at a later time.

- Move instantiation of sequence class before callback parameter
  definition so we can pass reference to sequence instance method
  to callback as parameter.

- Update evaluation callback parameters to match how the evaluation
  callback is currently defined. Of note is passing the sequence
  subclass's `create_validation_instance` method to the evaluation
  callback so it can create instances of the sequence for evaluation
  purposes.
…ctor param

- We have re-added the epsilon constructor parameter, which allows us to pass
  to the constructor a lambda that calculates the epsilon that should be used
  based on the iteration number.
- Rather than taking the parameter random(=True/False), the sequence constructor
  function now takes the epsilon lambda. This simplifies how we handle creating
  different types of agents for evaluation.
- Fixed a bug where the calculated number of episodes run was one larger than
  the number that were actually run.
- Fixed logic for serializing initial states sampled so that if the parameter
  `init_state_dir` is None we do not serialize the sampled states at the end
  of the sampling method.
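
For reference, a hedged sketch of what such epsilon lambdas can look like (the specific values are assumptions loosely based on the DQN papers):

```python
import numpy as np


def make_epsilon_schedule(eps_start=1.0, eps_end=0.1, anneal_iters=1_000_000):
    # Linear annealing from eps_start to eps_end, then constant at eps_end.
    return lambda iteration: float(
        np.interp(iteration, [0, anneal_iters], [eps_start, eps_end])
    )


train_epsilon = make_epsilon_schedule()
eval_epsilon = lambda iteration: 0.05   # small fixed epsilon for evaluation
random_epsilon = lambda iteration: 1.0  # purely random policy (initial evaluation)
```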

ghost commented Aug 2, 2020

We should be good to go! Finally!!! :D :D :D


ghost commented Aug 2, 2020

Note that, unlike the paper, instead of normalizing each individual reward we normalize groups of 3-5 rewards, depending on how many frames are skipped. In practice this should shrink the scale of the rewards (since we're mapping 3-5 rewards to the interval [-1, 1]), but it should overall be less restrictive than the normalization of the original paper. This is because we reward a group of actions with a positive reward even if intervening rewards in the group have negative outcomes. In theory rewarding positive reward groups should be overall more expressive than rewarding single positive rewards.
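
A tiny, hypothetical numerical illustration of the difference:

```python
import numpy as np

# One action-repeat group of per-frame rewards with a net-positive outcome:
group = [2.0, -3.0, 2.0, 1.0]

per_frame_clipped = [float(np.sign(r)) for r in group]  # [+1, -1, +1, +1]
group_clipped = float(np.sign(sum(group)))              # sign(+2.0) -> +1.0
```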


ghost commented Aug 7, 2020

We keep getting Process finished with exit code 137 (interrupted by signal 9: SIGKILL) after about 25 epochs. We're pretty sure it's a memory leak but we're not sure where. It may be that one of the replay buffers is growing larger than the maximum size. We'll try seeing if we can initialize the replay buffer with more instances than the replay buffer max.

If it's not this I'm not really sure what it is. It could be related to:

  • Copying the current model into the target model multiple times.
  • Making new sequence instances for the evaluation callback that are not getting deleted.

However, Python should be cleaning up all of these variables, so this would be quite surprising.

It's also worth noting that time per epoch continues to increase as training goes on, even beyond the 5th epoch, which is when the replay memory should reach its maximum size. This is surprising because the code is written so as to be independent of the replay memory size (list indexing is supposed to be O(1), for example, though maybe in practice this isn't exactly true). This is another thing we should look at, and it may be related to the memory leak.
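
One cheap way to test the buffer-growth hypothesis is to log the buffer lengths each epoch and check that they plateau at the replay memory maximum (buffer names here follow the sketches above and are assumptions about the real class):

```python
def log_buffer_sizes(sequence, epoch):
    # Cheap per-epoch check to rule out unbounded buffer growth: the lengths
    # should plateau at the replay memory maximum after the first few epochs.
    for name in ("observation_buffer", "action_buffer",
                 "reward_buffer", "done_buffer"):
        buffer = getattr(sequence, name, [])
        print(f"epoch {epoch}: len({name}) = {len(buffer)}")
```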
