Extend discrete classic control framework to train Atari double DQN model #5
base: master
Conversation
Here's what we need to change to generalize the existing framework.

Note that the simulation and existing replay buffers can stay the same. Our philosophy will be not to store features, but to calculate them on the fly from observations instead. In the case of Atari games this should significantly reduce the memory required, at the expense of minimal compute. It also lets us preserve the simulation, replay memory, target agent, double DQN logic, initialization, and minibatch generation. We can make some of these methods abstract in a parent class and then define them in child classes. This way we generalize the existing sequence class.
We're moving towards creating child classes implementing different feature creation functions. In order to do this we extracted the relevant functions from the original `EnvironmentSequence` class (now named `SynchronousSequence` for generality) into a new class. We named the new class `EnvironmentSequence` to keep our code compatible with the agents we've already implemented for classic control. We may rename it later. See the notes in the PR for the methods that were moved and why.
Hotfix: Accidentally committed these to master instead of atari_rl branch
We ended up simplifying the inheritance significantly and moving some methods back to this class.

- Moved the `get_action` method back into this class and replaced the `get_latest_observation` call in it with `get_latest_feature`. This generalizes the method so it passes arbitrary features, rather than raw observations, to the agent when selecting an action.
- Moved `get_latest_observations` back into this class, but changed it to `get_latest_feature`. We no longer fetch the n latest observations, just the very latest feature (which is all we were using anyway). This method simply calls the abstract `get_feature_by_index` method, which is implemented in the child classes.
- Modified the `get_minibatch` method to use `get_feature_by_index` together with the sampled indices to build the stack of states, rather than indexing the observation buffer directly. This generalizes minibatch creation: minibatches can now be built directly from observations OR from features computed from observations.
- Moved the `sample_indices` method back to this class, but gave it a custom lower bound on the indices that can be sampled. This way, if we create features by stacking observations, we avoid sampling observations without enough preceding history in the buffer to build a stack.
- The custom lower bound in `sample_indices` is computed by a new abstract method named `get_states_start`.
- Added an abstract declaration for the `get_feature_by_index` method.

With this, defining a child class only requires implementing two methods (see the sketch below):

- `get_feature_by_index`, which returns the correct feature for a particular timestep.
- `get_states_start`, which defines the lower bound on the timesteps we can sample when creating features.
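A minimal sketch of the resulting split, assuming simplified constructors and buffer layouts (the real classes carry more state, e.g. the model, graph, and done buffer):

```python
from abc import ABC, abstractmethod

import numpy as np


class SynchronousSequence(ABC):
    """Parent class: owns the replay buffers and the minibatch logic."""

    def __init__(self, observation_buffer, reward_buffer):
        self.observation_buffer = observation_buffer
        self.reward_buffer = reward_buffer

    @abstractmethod
    def get_feature_by_index(self, index):
        """Turn the buffered observation(s) ending at `index` into a feature."""

    @abstractmethod
    def get_states_start(self):
        """Lowest buffer index with enough history to build a feature."""

    def get_latest_feature(self):
        return self.get_feature_by_index(len(self.observation_buffer) - 1)

    def sample_indices(self, batch_size):
        # Custom lower bound keeps us from sampling indices without enough
        # preceding history to build a full feature.
        low, high = self.get_states_start(), len(self.observation_buffer) - 1
        return np.random.randint(low, high, size=batch_size)


class EnvironmentSequence(SynchronousSequence):
    """Classic control: the observation itself is the feature."""

    def get_feature_by_index(self, index):
        return self.observation_buffer[index]

    def get_states_start(self):
        return 0


class AtariSequence(SynchronousSequence):
    """Atari: features are stacks of (preprocessed) frames."""

    def __init__(self, observation_buffer, reward_buffer, stack_size=4):
        super().__init__(observation_buffer, reward_buffer)
        self.stack_size = stack_size

    def get_feature_by_index(self, index):
        # Preprocessing (grayscale, resize to 84x84) omitted for brevity.
        frames = self.observation_buffer[index - self.stack_size + 1:index + 1]
        return np.stack(frames, axis=-1)

    def get_states_start(self):
        return self.stack_size - 1
```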
Code-wise we're mostly there; we just need to tailor the agent in our rundef.

NOTE: We will need to preprocess rewards as well, to normalize them. So we'll want to add a method that preprocesses rewards, like the one we have for preprocessing features, to both the parent and child classes.
In the Atari papers the actual reward from the game is clipped to +1, -1, or 0 depending on its sign (or zero if the reward is zero). This is so that the same agent can learn to play different games irrespective of the scale of the rewards. To allow this in our framework we wrapped the reward buffer with a `get_reward_by_index` method that lets us apply an intermediate transformation to the rewards before using them to compute the minibatch labels. This is similar to how we wrapped the observation buffer with the `get_feature_by_index` method to allow transformations of observations into features. It is the next natural step towards abstracting environment outputs so they can be preprocessed before being passed to the agent. We modified the parent environment file to declare the abstract method `get_reward_by_index` and then implemented corresponding methods in each of the two child classes. For the `EnvironmentSequence` class used for classical discrete control there is no transform; we just feed the raw rewards to the agent as we have been doing.
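A minimal sketch of the Atari-side transform described here (the class and buffer details are simplified assumptions):

```python
import numpy as np


class AtariSequence:
    """Sketch: only the reward-wrapping part of the class is shown."""

    def __init__(self, reward_buffer):
        self.reward_buffer = reward_buffer

    def get_reward_by_index(self, index):
        # Clip the raw game reward to +1, -1, or 0, as in the DQN papers.
        return float(np.sign(self.reward_buffer[index]))


seq = AtariSequence(reward_buffer=[0.0, 4.0, -7.0])
print([seq.get_reward_by_index(i) for i in range(3)])  # [0.0, 1.0, -1.0]
```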
In the Atari paper, evaluation is done over an open-ended number of episodes but with an upper bound on the number of frames. To support this in our evaluation callback we modified the evaluation code as follows (see the sketch after this list):

- Added a `num_max_iter` parameter to the class constructor to specify the maximum number of iterations.
- Allowed both this new parameter and the existing `num_episodes` to take the value `np.inf`, meaning unbounded.
- Used a while loop instead of a for loop to iterate over episodes in the simulation method, so the loop still works when the number of episodes is unbounded.
- Added conditional checks in the simulation loop to detect when we have exceeded the maximum number of iterations, and to halt and exit the simulation correctly if so.
- Changed the reward and episode-length arrays to lists to allow for variable length.
- Returned, printed, and logged to Tensorboard the number of episodes simulated, so that if `num_max_iter` is set to a finite number we know how many episodes were actually simulated. Of course, if there is no limit on the number of iterations but there is one on the number of episodes, the value printed and logged will always equal that limit.
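A simplified sketch of that bookkeeping, assuming a hypothetical `run_episode` helper that returns a `(total_reward, episode_length)` pair; the real callback also logs to Tensorboard:

```python
import numpy as np


def simulate_episodes(run_episode, num_episodes=np.inf, num_max_iter=np.inf):
    """Run episodes until either bound is hit; both bounds may be np.inf."""
    rewards, lengths = [], []  # lists, since the episode count may vary
    total_iters, episodes_done = 0, 0

    while episodes_done < num_episodes and total_iters < num_max_iter:
        reward, length = run_episode()
        rewards.append(reward)
        lengths.append(length)
        total_iters += length
        episodes_done += 1

    return rewards, lengths, episodes_done
```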
Everything should be code complete. We are getting a CUDA error when running the rundef file; we need to take a look at that -- we think resetting the computer may be enough to fix it. Take a look at the FIXMEs in the rundef file for Breakout v00. We also need to decide whether we want to use double DQN and, if so, make the necessary tweaks to the rundef. These are all specified in the double DQN paper and, with the exception of the bias constraint, should be straightforward to implement.

FIXES: The reset ended up doing the trick for the CUDA error. We decided to use double DQN but with the original DQN parameters and spec for simplicity -- mainly because of the extra work of implementing a constraint on the final layer's kernel biases. We also realized from reading the double DQN paper that frames produced during action repeats are skipped by the agent for learning altogether, and made changes in our code to enable this "frame skipping" functionality.
The `skip_frames` parameter determines whether the frames that the environment produces during repeated actions are saved in the buffer (and thus used for training) or not. After reading the DQN and double DQN papers more closely we realized that frames produced during action repeats seem not to be stored in the replay memory; the paper says the DQN agents "do not see" these frames. Thus we added a mechanism to skip these frames altogether and not add them to the memory. For backwards compatibility this parameter defaults to `False` in the class constructor.
Since we will be skipping frames altogether from the training process, we thought it would make sense to sum the rewards from the skipped frames into a single reward. That way, if the environment produces rewards on skipped frames, the agent still learns about them. We implemented this by modifying the `get_reward_by_index` method to look at all the rewards in the action-repeat window (see the sketch below).
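One plausible way to combine this aggregation with the clipping from the earlier commit (the buffer layout and window size here are assumptions):

```python
import numpy as np


def get_reward_by_index(reward_buffer, index, action_repeat=4):
    # Sum the rewards earned over the action-repeat window ending at `index`,
    # then clip the aggregate to +1, -1, or 0.
    window = reward_buffer[max(0, index - action_repeat + 1):index + 1]
    return float(np.sign(sum(window)))


print(get_reward_by_index([0.0, 2.0, -1.0, 0.0, 3.0], index=4))  # 1.0
```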
- We implemented frame skipping in our generators, so we set the `skip_frames` parameter to `True` so that this agent skips frames.
- Reading the double DQN paper we realized that many of the parameters in the original Atari paper are formulated in terms of "steps", where a step is four repeated actions and the resulting rewards, observations, etc. We reformulated our parameter definitions in terms of steps, which are in turn defined in terms of the number of repeated actions.
The rundef for the Atari Breakout agent runs, yay! The training runner does not run for this agent yet, though. We have a couple of important errors to fix:
…sode ends

Certain frame stacks were intersecting episode ends. It doesn't make sense to pass the model frame stacks that straddle the transition between episodes. We changed the logic for calculating valid indices to check the validity of every index in a frame stack, not just the leading index.
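A sketch of that validity check, assuming a `done_buffer` aligned with the observation buffer and a stack of 4 frames:

```python
def index_is_valid(done_buffer, index, stack_size=4):
    """True if the frame stack ending at `index` stays within one episode."""
    if index < stack_size - 1:
        return False  # not enough history to build a full stack
    # None of the earlier frames in the stack may be terminal; a terminal
    # flag at `index` itself does not make the stack cross a boundary.
    return not any(done_buffer[index - stack_size + 1:index])


# Example: the episode ends at index 5, so stacks ending at 6, 7, 8 are invalid.
dones = [False] * 5 + [True] + [False] * 5
print([i for i in range(len(dones)) if index_is_valid(dones, i)])  # [3, 4, 5, 9, 10]
```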
We tried running an epoch and found that a single one (about 200k frames) would take about 48 minutes, which would mean a total running time of 30+ days. This is much longer than what was stated in the paper. We did some performance analysis by timing the simulation and the minibatch generator. We found that:
We think that if we make the valid-index calculation, observation-buffer sampling, and featurization more efficient, we can reduce epoch time by almost 10x. We will focus on this next before worrying about evaluation.
We realized, though, that it would be cheaper memory-wise to store an 84 x 84 x 4 image stack than what we're doing currently, which is saving single 210 x 160 x 3 RGB images. It may thus be both more computationally efficient and more memory efficient to save the features in the buffer instead of the raw observations. Making that change would require changing how we record into the buffer and how we handle feature transformations in the child sequence classes. Also, we just realized that due to frame skips a stack of 4 frames really corresponds to 16 frames of play (since the replay buffer does not store the intervening frames). Is this what we want? Does this make sense? It may, but I need to think about it more. The answer to this question can be influenced by the question of how to properly do a feature buffer.
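Rough arithmetic behind the memory claim, assuming uint8 storage for both representations:

```python
raw_frame = 210 * 160 * 3         # 100,800 bytes per stored RGB observation
feature_stack = 84 * 84 * 4       #  28,224 bytes per preprocessed 4-frame stack
print(raw_frame / feature_stack)  # ~3.57x smaller per buffer entry
```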
We got it to work!!! Increasing the number of workers from 0 to 1 brought the epoch time down to 11.7 minutes per epoch!!! The total time for running a full training (excluding evaluation) with 200M frames is now 2 days, which is fantastic. For reference, running models on Imagenet took us 2+ days, and the original DQN implementation took over a week to run!!! We timed the per-minibatch time and it came to 14ms, which aligns with our previous estimate for the forward and backpropagation time. We doubt we'll get it to run much better than this, mainly because in theory the bottleneck should now be the forward and backward passes themselves, not the minibatch generation, and only the latter is under our control. Regardless, this performance is more than satisfactory. The only things left to fix are:
Also, we noticed that using the built-in model cloning method in Keras and manually compiling the copy we use for the target model seems to be faster than deepcopying. This is a tiny optimization we could look at in the future, but we don't think it's worth the extra code complexity.
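For reference, a sketch of the Keras cloning approach mentioned above; the toy dense model, optimizer, and loss below are placeholders, not the actual DQN network:

```python
from keras.layers import Dense
from keras.models import Sequential, clone_model

# Toy stand-in for the DQN network; the real model is a conv net.
online_model = Sequential([Dense(4, input_shape=(8,), activation='relu')])
online_model.compile(optimizer='rmsprop', loss='mse')

# clone_model copies the architecture only, so copy the weights explicitly
# and compile the clone manually (as described in the comment above).
target_model = clone_model(online_model)
target_model.set_weights(online_model.get_weights())
target_model.compile(optimizer='rmsprop', loss='mse')
```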
Two changes:

- Add a case in the `get_states_length` method to account for when `pair_max` is false and the frame stack we sample is 4 frames instead of 5.
- Use the `get_states_length` method to calculate the correct point at which to invalidate indices at the far left of the valid-index memory buffer. The leftmost `get_states_length` entries of the memory buffer are invalid for sampling because there are not enough frames to their left to sample a full stack. Not accounting for this in the index invalidation logic was creating index errors.
When selecting actions we pass a stack of frames (our feature) to the model; in particular, we pass the most recent stack of frames (the rightmost end) in the replay memory. If the action is selected early in a new episode, some of the frames in the recent frame stack will come from the previous episode. If we stacked these frames we would be passing frames from two different episodes to the model, which clearly wouldn't make sense as a feature. We therefore made a change to action selection: when the most recent frame stack crosses an episode boundary, we select a random action instead of passing the frame stack to the model. This avoids the issue. Furthermore, since frame stacks are relatively small (4-5 frames) and episode starts are not that common, this should not affect the performance of the agent very much.
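A sketch of that boundary-aware action selection; the function and argument names are hypothetical, and epsilon-greedy exploration is omitted:

```python
import numpy as np


def select_action(model, frame_stack, recent_dones, num_actions):
    """Pick a greedy action from the latest frame stack, unless the stack
    crosses an episode boundary, in which case fall back to a random action.

    `recent_dones` holds the done flags of the earlier frames in the stack."""
    if frame_stack is None or any(recent_dones):
        return np.random.randint(num_actions)
    q_values = model.predict(frame_stack[np.newaxis, ...])
    return int(np.argmax(q_values))
```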
As mentioned in previous commits, we solved the issue that limited us from running two threads: one for creating minibatches and another for backpropagating. The problem had to do with copies of the same model running on different threads. We also removed the shuffling of minibatch indices. Since our agent is stateful, our minibatch generation only makes sense in order; hence the change.
…l callback

As stated in the PR, the evaluation callback as written cannot work for non-classical environments, specifically our Atari one. This is because the code for generating the features fed to the model to get the next action is part of the AtariSequence class, and the evaluation callback does not have access to this class. To be able to pass this class to the callback we had to make changes to the AtariSequence class that let it play nicely with the callback:

- Extracted the core logic inside the for loop of the `simulate` method so that the logic for simulating a single iteration lives in its own method, called `simulate_single`. This makes the logic of both `simulate` and `simulate_single` clearer.
- Wrote getter methods for the latest state variables we didn't already have, namely reward, done, and action. As we mention below, this helped clean up our logic. It also gives the evaluation callback an interface to ask the AtariSequence class for the reward and done values at each single simulation step, to calculate model performance (based on total rewards) and to control episode-termination logic (based on the done value).

We also removed some unnecessary complexity from the SynchronousSequence class:

- Since the OpenAI Gym implementation of the Atari environments has frame skipping built in, we removed our manual frame skipping. It is no longer a constructor parameter, nor does it influence control flow in the simulation logic. We removed it from both the SynchronousSequence class and the AtariSequence class.
- Since we wrote a method for simulating a single step, we no longer need the variables `self.initial_sims` and `self.initial_iterations`. These variables accounted for the fact that, to fill the replay buffer initially by some amount N, we needed to do so in multiples of the minimum batch size. Using `simulate_single` instead of `simulate` for the initial filling of the replay buffer is significantly simpler and more intuitive. We updated references to these two extraneous variables throughout the class, using `self.replay_buffer_min` instead.
- Removed the internal variables keeping track of the last simulated state (e.g. `self.prev_observation`, `self.prev_action`, `self.prev_done`, `self.prev_reward`). The replay buffer already contains all of this information (in the rightmost buffer entries). Rather than storing these in separate variables, we wrote getters that return the rightmost (index - 1) entry of each buffer. Since indexing Python lists is O(1), we should not incur any serious compute cost from this, and it simplifies the code significantly.
A major bug with the evaluation callback when used for the Atari environment is that the callback did not contain the methods for generating features and actions. To correct this we modified the custom callback class so that (a condensed sketch follows):

- Instead of passing the environment, the gamma value, and the exploration schedule, we just pass a method that constructs an instance of the appropriate SynchronousSequence subclass for the problem.
- Each time simulations are required to evaluate the performance of an agent, an instance of the SynchronousSequence subclass is created and passed to the callback class's `simulate_episodes` method. The instantiation of sequence classes can be seen in the `on_train_begin` and `on_epoch_end` methods.
- The `simulate_episodes` method uses `simulate_single`, the state getter methods, and the internal iteration and episode tracker variables contained in the SynchronousSequence class. This way the callback's simulation logic has access to the logic for computing complex features and the resulting actions contained in the sequence class.
- Given that the sequence class provides the complete logic for getting states, we removed the methods for getting actions from the custom callback class. The sequence class takes care of this now.

We also added missing documentation for the callback class constructor parameter `num_max_iter`.
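A condensed sketch of that wiring; the Keras `Callback` hooks are real, but the sequence getter names and the logging are assumptions made for illustration:

```python
from keras.callbacks import Callback


class EvaluationCallback(Callback):
    """Evaluation is delegated to a SynchronousSequence subclass instance."""

    def __init__(self, create_sequence, num_episodes, num_max_iter):
        super().__init__()
        # `create_sequence` builds a sequence instance, e.g. a bound
        # AtariSequence.create_validation_instance method.
        self.create_sequence = create_sequence
        self.num_episodes = num_episodes
        self.num_max_iter = num_max_iter

    def on_epoch_end(self, epoch, logs=None):
        sequence = self.create_sequence()
        rewards, episodes = self.simulate_episodes(sequence)
        print('episodes: %d, mean reward: %.2f'
              % (episodes, sum(rewards) / max(1, episodes)))

    def simulate_episodes(self, sequence):
        # The sequence handles feature construction and action selection;
        # the callback only steps it and tallies rewards per episode.
        rewards, episode_reward, episodes, iterations = [], 0.0, 0, 0
        while episodes < self.num_episodes and iterations < self.num_max_iter:
            sequence.simulate_single()
            episode_reward += sequence.get_latest_reward()  # assumed getter name
            iterations += 1
            if sequence.get_latest_done():                  # assumed getter name
                rewards.append(episode_reward)
                episode_reward, episodes = 0.0, episodes + 1
        return rewards, episodes
```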
We made the necessary changes to have the evaluation callback use the sequence class for generating features and selecting actions. Some outstanding items:
We're so close!!!
- Assume that the sequence constructor has a parameter `random` that, if set to true, returns a sequence representing an agent using the random policy.
- Use this `random` parameter to instantiate the random-policy agent in the `on_train_begin` method.
- Fix the `agmeth.evaluate_state` method signature to include the graph parameter (a more recent addition). Pass the graph to this method by pulling the graph property from the sequence class constructed for evaluation.
- Instead of using the environment contained in the constructed sequence to get initial states, for each initial state we want to sample we construct a new sequence and simulate it for the minimum number of steps needed to build the initial feature. This is done in the `on_epoch_end` method. Doing it this way lets us use the shared API we built for SynchronousSequence subclasses to generate the necessary features.
We found that sampling 10,000 initial states at the end of each epoch would take 20 minutes each time. This is more than 2x the length of an epoch itself! Instead of doing this every epoch we made two changes (see the sketch below):

- Move the sampling of initial states to the construction of the custom callback. This way we only pay the cost of sampling the states once per training run rather than once per epoch.
- Serialize the sampled initial states and save them into a compressed .npz file. This way we only incur the sampling cost once **per model** (possibly per environment), and we get a consistent baseline of states for comparing the performance of different versions of the model (or of different but similar models on the same environment). Doing this makes evaluation extremely fast, at the price of the initial run needed to generate the sampled initial states.
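A sketch of the caching idea; the function, the `sample_state` helper, and the archive key are hypothetical names:

```python
import os

import numpy as np


def load_or_sample_initial_states(sample_state, num_states, path=None):
    """Reuse cached initial states if present; otherwise sample and cache them."""
    if path is not None and os.path.exists(path):
        return np.load(path)['states']
    states = np.stack([sample_state() for _ in range(num_states)])
    if path is not None:
        np.savez_compressed(path, states=states)  # pay the sampling cost once
    return states
```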
…lass

We noticed that it would make more sense for the constructor of evaluation sequence instances to be part of the subclass itself. We added a method `create_validation_instance` to each subclass of `SynchronousSequence`. This method returns an instance of the subclass with parameters chosen for the validation callback, and it takes the evaluation exploration schedule as a parameter. This is what the (updated -- see next commits) evaluation callback expects. We also created the constructor for evaluating agents that use the `EnvironmentSequence` class, which we had not implemented before. This allows us to use the evaluation callback, in its current implementation, with our classical control agents, making our changes to the callback backwards compatible with our previous work.
…ck def

We made breaking changes to the evaluation callback and sequence subclasses that broke compatibility with the classical control agents. We updated the rundef of these agents to make them backwards compatible:

- Extract the graph for these agents' models so we can run generators and models in different threads if needed for double DQN. Double DQN is not implemented for these models, so it's not relevant here, but the SynchronousSequence constructor expects the graph to be passed anyway. We may simplify this extraneous parameter for these agents at a later time.
- Move the instantiation of the sequence class before the callback parameter definitions so we can pass a reference to a sequence instance method to the callback as a parameter.
- Update the evaluation callback parameters to match how the evaluation callback is currently defined. Of note is passing the sequence subclass's `create_validation_instance` method to the evaluation callback so it can create instances of the sequence for evaluation purposes.
…ctor param

- We re-added the epsilon constructor parameter, which lets us pass the constructor a lambda that calculates the epsilon to use based on the iteration number.
- Rather than taking the `random` parameter (True/False), the sequence constructor function now takes the epsilon lambda. This simplifies how we handle creating different types of agents for evaluation (see the example below).
- Fixed a bug where the calculated number of episodes run was one larger than the number actually run.
- Fixed the logic for serializing the sampled initial states so that if the parameter `init_state_dir` is None we do not serialize the sampled states at the end of the sampling method.
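For example, the epsilon parameter can now carry an arbitrary schedule; the values below are illustrative (the DQN papers anneal from 1.0 to 0.1 over the first million frames and evaluate with 0.05):

```python
# Training: linearly anneal epsilon from 1.0 to 0.1 over the first 1M iterations.
train_epsilon = lambda iteration: max(0.1, 1.0 - 0.9 * iteration / 1_000_000)

# Evaluation: a small constant epsilon; a random-policy agent is just epsilon = 1.
eval_epsilon = lambda iteration: 0.05
random_epsilon = lambda iteration: 1.0
```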
We should be good to go! Finally!!! :D :D :D
Note that unlike the paper, instead of normalizing each individual reward we normalize groups of 3-5 rewards, depending on how many frames are skipped. In practice this should shrink the scale of the rewards (since we're mapping 3-5 rewards to the interval [-1, 1]), but overall it should be less restrictive than the normalization in the original paper. This is because we give a group of actions a positive reward whenever its net outcome is positive, even if some rewards within the group are negative. In theory, rewarding positive reward groups should be more expressive than rewarding single positive rewards. A small worked example follows.
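```python
import numpy as np

group = [2.0, -1.0, -1.0, 3.0]           # rewards over one action-repeat window
per_frame = [np.sign(r) for r in group]  # paper-style clipping: [1, -1, -1, 1]
group_clipped = np.sign(sum(group))      # our scheme: sign(3.0) = 1.0
print(per_frame, group_clipped)
```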
We keep getting the same error. If it's not this, I'm not really sure what it is. It could be related to:
However, Python should be cleaning up all of these variables, so this would be quite surprising. It's also worth noting that the time per epoch continues to increase as training goes on, even beyond the 5th epoch, which is when the replay memory should reach its maximum size. Training time increasing as epochs go on is also surprising because the code is written to be independent of the replay memory size (indexing Python lists is supposed to be O(1)).
In #1 we successfully implemented DQN starting from the framework we implemented in github.com/dbulhosa/ilsvrc for training Imagenet models. We tested our DQN implementation by solving OpenAI Gym's CartPole-v0 environment. In #2, #3, and #4 we tuned our hyperparameters and models to create agents that solved not just CartPole-v0 better than our first solution, but also Acrobot-v1, MountainCar-v0, and CartPole-v1. Solving these problems and extending our library in the process has made us more confident in this framework, sufficiently so to use it with Atari games.

We next extend the framework to accommodate Atari games. This requires: