Extend discrete classic control framework to train Atari double DQN model #5
base: master
Conversation
Here's what we need to change to generalize the existing framework.

Note that the simulation and existing replay buffers can stay the same. Our philosophy will be not to store features, but to calculate them on the fly from observations instead. In the case of Atari games this should significantly reduce the memory required, at the expense of minimal compute. It also lets us preserve the simulation, replay memory, target agent, double DQN logic, initialization, and minibatch generation. We can make some of these methods abstract in a parent class and then define them in child classes. This way we generalize the existing sequence class.
We're moving towards creating child classes implementing different feature creation functions. In order to do this we extracted the relevant functions from the original `EnvironmentSequence` class (now named `SynchronousSequence` for generality) into a new class. We named the new class `EnvironmentSequence` to keep our code compatible with the agents we've already implemented for classic control. We may rename it later. See the notes in the PR for the methods that were moved and why.
Hotfix: Accidentally committed these to master instead of atari_rl branch
We ended up simplifying the inheritance significantly and moving some methods back to this class.

- Moved the `get_action` method back into this class and replaced the `get_latest_observation` call in it with `get_latest_feature`. This generalizes the method so it passes arbitrary features, rather than raw observations, to the agent when selecting an action.
- Moved `get_latest_observations` back into this class, but changed it to `get_latest_feature`. We no longer fetch the n latest observations, just the very latest feature (which is all we were using anyway). This method simply calls the abstract `get_feature_by_index` method, which is implemented in the child classes.
- Modified the `get_minibatch` method to use `get_feature_by_index` together with the sampled indices to build the stack of states, rather than indexing the observation buffer directly. This generalizes minibatch creation: minibatches can now be built directly from observations OR from features computed from observations.
- Moved the `sample_indices` method back to this class, but gave it a custom lower bound on the indices that can be sampled. This way, if we create features by stacking observations, we avoid sampling observations without enough preceding history in the buffer to build a stack.
- The custom lower bound in `sample_indices` is computed by a new abstract method named `get_states_start`.
- Added an abstract declaration for the `get_feature_by_index` method.

With this, defining a child class only requires implementing two methods (see the sketch below):

- `get_feature_by_index`, which returns the correct feature for a particular timestep.
- `get_states_start`, which defines the lower bound on the timesteps we can sample when creating features.
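A minimal sketch of the resulting split, assuming simplified constructors and buffer layouts (the real classes carry more state, e.g. the model, graph, and done buffer):

```python
from abc import ABC, abstractmethod

import numpy as np


class SynchronousSequence(ABC):
    """Parent class: owns the replay buffers and the minibatch logic."""

    def __init__(self, observation_buffer, reward_buffer):
        self.observation_buffer = observation_buffer
        self.reward_buffer = reward_buffer

    @abstractmethod
    def get_feature_by_index(self, index):
        """Turn the buffered observation(s) ending at `index` into a feature."""

    @abstractmethod
    def get_states_start(self):
        """Lowest buffer index with enough history to build a feature."""

    def get_latest_feature(self):
        return self.get_feature_by_index(len(self.observation_buffer) - 1)

    def sample_indices(self, batch_size):
        # Custom lower bound keeps us from sampling indices without enough
        # preceding history to build a full feature.
        low, high = self.get_states_start(), len(self.observation_buffer) - 1
        return np.random.randint(low, high, size=batch_size)


class EnvironmentSequence(SynchronousSequence):
    """Classic control: the observation itself is the feature."""

    def get_feature_by_index(self, index):
        return self.observation_buffer[index]

    def get_states_start(self):
        return 0


class AtariSequence(SynchronousSequence):
    """Atari: features are stacks of (preprocessed) frames."""

    def __init__(self, observation_buffer, reward_buffer, stack_size=4):
        super().__init__(observation_buffer, reward_buffer)
        self.stack_size = stack_size

    def get_feature_by_index(self, index):
        # Preprocessing (grayscale, resize to 84x84) omitted for brevity.
        frames = self.observation_buffer[index - self.stack_size + 1:index + 1]
        return np.stack(frames, axis=-1)

    def get_states_start(self):
        return self.stack_size - 1
```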
Code-wise we're mostly there; we just need to tailor the agent in our rundef.

NOTE: We will need to preprocess rewards as well, to normalize them. So we'll want to add a method that preprocesses rewards, like the one we have for preprocessing features, to both the parent and child classes.
In the Atari papers the actual reward from the game is clipped to +1, -1, or 0 depending on its sign (or zero if the reward is zero). This is so that the same agent can learn to play different games irrespective of the scale of the rewards. To allow this in our framework we wrapped the reward buffer with a `get_reward_by_index` method that lets us apply an intermediate transformation to the rewards before using them to compute the minibatch labels. This is similar to how we wrapped the observation buffer with the `get_feature_by_index` method to allow transformations of observations into features. It is the next natural step towards abstracting environment outputs so they can be preprocessed before being passed to the agent. We modified the parent environment file to declare the abstract method `get_reward_by_index` and then implemented corresponding methods in each of the two child classes. For the `EnvironmentSequence` class used for classical discrete control there is no transform; we just feed the raw rewards to the agent as we have been doing.
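A minimal sketch of the Atari-side transform described here (the class and buffer details are simplified assumptions):

```python
import numpy as np


class AtariSequence:
    """Sketch: only the reward-wrapping part of the class is shown."""

    def __init__(self, reward_buffer):
        self.reward_buffer = reward_buffer

    def get_reward_by_index(self, index):
        # Clip the raw game reward to +1, -1, or 0, as in the DQN papers.
        return float(np.sign(self.reward_buffer[index]))


seq = AtariSequence(reward_buffer=[0.0, 4.0, -7.0])
print([seq.get_reward_by_index(i) for i in range(3)])  # [0.0, 1.0, -1.0]
```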
In the Atari paper, evaluation is done over an open-ended number of episodes but with an upper bound on the number of frames. To support this in our evaluation callback we modified the evaluation code as follows (see the sketch after this list):

- Added a `num_max_iter` parameter to the class constructor to specify the maximum number of iterations.
- Allowed both this new parameter and the existing `num_episodes` to take the value `np.inf`, meaning unbounded.
- Used a while loop instead of a for loop to iterate over episodes in the simulation method, so the loop still works when the number of episodes is unbounded.
- Added conditional checks in the simulation loop to detect when we have exceeded the maximum number of iterations, and to halt and exit the simulation correctly if so.
- Changed the reward and episode-length arrays to lists to allow for variable length.
- Returned, printed, and logged to Tensorboard the number of episodes simulated, so that if `num_max_iter` is set to a finite number we know how many episodes were actually simulated. Of course, if there is no limit on the number of iterations but there is one on the number of episodes, the value printed and logged will always equal that limit.
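A simplified sketch of that bookkeeping, assuming a hypothetical `run_episode` helper that returns a `(total_reward, episode_length)` pair; the real callback also logs to Tensorboard:

```python
import numpy as np


def simulate_episodes(run_episode, num_episodes=np.inf, num_max_iter=np.inf):
    """Run episodes until either bound is hit; both bounds may be np.inf."""
    rewards, lengths = [], []  # lists, since the episode count may vary
    total_iters, episodes_done = 0, 0

    while episodes_done < num_episodes and total_iters < num_max_iter:
        reward, length = run_episode()
        rewards.append(reward)
        lengths.append(length)
        total_iters += length
        episodes_done += 1

    return rewards, lengths, episodes_done
```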
Everything should be code complete. We are getting a CUDA error when running the rundef file; we need to take a look at that -- we think resetting the computer may be enough to fix it. Take a look at the FIXMEs in the rundef file for Breakout v00. We also need to decide whether we want to use double DQN and, if so, make the necessary tweaks to the rundef. These are all specified in the double DQN paper and, with the exception of the bias constraint, should be straightforward to implement.

FIXES: The reset ended up doing the trick for the CUDA error. We decided to use double DQN but with the original DQN parameters and spec for simplicity -- mainly because of the extra work of implementing a constraint on the final layer's kernel biases. We also realized from reading the double DQN paper that frames produced during action repeats are skipped by the agent for learning altogether, and made changes in our code to enable this "frame skipping" functionality.
The `skip_frames` parameter determines whether the frames that the environment produces during repeated actions are saved in the buffer (and thus used for training) or not. After reading the DQN and double DQN papers more closely we realized that frames produced during action repeats seem not to be stored in the replay memory; the paper says the DQN agents "do not see" these frames. Thus we added a mechanism to skip these frames altogether and not add them to the memory. For backwards compatibility this parameter defaults to `False` in the class constructor.
Since we will be skipping frames altogether from the training process, we thought it would make sense to sum the rewards from the skipped frames into a single reward. That way, if the environment produces rewards on skipped frames, the agent still learns about them. We implemented this by modifying the `get_reward_by_index` method to look at all the rewards in the action-repeat window (see the sketch below).
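One plausible way to combine this aggregation with the clipping from the earlier commit (the buffer layout and window size here are assumptions):

```python
import numpy as np


def get_reward_by_index(reward_buffer, index, action_repeat=4):
    # Sum the rewards earned over the action-repeat window ending at `index`,
    # then clip the aggregate to +1, -1, or 0.
    window = reward_buffer[max(0, index - action_repeat + 1):index + 1]
    return float(np.sign(sum(window)))


print(get_reward_by_index([0.0, 2.0, -1.0, 0.0, 3.0], index=4))  # 1.0
```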
- We implemented frame skipping in our generators, so we set the `skip_frames` parameter to `True` so that this agent skips frames.
- Reading the double DQN paper we realized that many of the parameters in the original Atari paper are formulated in terms of "steps", where a step is four repeated actions and the resulting rewards, observations, etc. We reformulated our parameter definitions in terms of steps, which are in turn defined in terms of the number of repeated actions.
The rundef for the Atari Breakout agent runs, yay! The training runner does not run for this agent yet, though. We have a couple of important errors to fix:
…sode ends

Certain frame stacks were intersecting episode ends. It doesn't make sense to pass the model frame stacks that straddle the transition between episodes. We changed the logic for calculating valid indices to check the validity of every index in a frame stack, not just the leading index.
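A sketch of that validity check, assuming a `done_buffer` aligned with the observation buffer and a stack of 4 frames:

```python
def index_is_valid(done_buffer, index, stack_size=4):
    """True if the frame stack ending at `index` stays within one episode."""
    if index < stack_size - 1:
        return False  # not enough history to build a full stack
    # None of the earlier frames in the stack may be terminal; a terminal
    # flag at `index` itself does not make the stack cross a boundary.
    return not any(done_buffer[index - stack_size + 1:index])


# Example: the episode ends at index 5, so stacks ending at 6, 7, 8 are invalid.
dones = [False] * 5 + [True] + [False] * 5
print([i for i in range(len(dones)) if index_is_valid(dones, i)])  # [3, 4, 5, 9, 10]
```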
We tried running an epoch and found that a single one (about 200k frames) would take about 48 minutes, which would mean a total running time of 30+ days. This is much longer than what was stated in the paper. We did some performance analysis by timing the simulation and the minibatch generator. We found that:
We think that if we make the valid-index calculation, observation-buffer sampling, and featurization more efficient, we can reduce epoch time by almost 10x. We will focus on this next before worrying about evaluation.
We realized, though, that it would be cheaper memory-wise to store an 84 x 84 x 4 image stack than what we're doing currently, which is saving single 210 x 160 x 3 RGB images. It may thus be both more computationally efficient and more memory efficient to save the features in the buffer instead of the raw observations. Making that change would require changing how we record into the buffer and how we handle feature transformations in the child sequence classes. Also, we just realized that due to frame skips a stack of 4 frames really corresponds to 16 frames of play (since the replay buffer does not store the intervening frames). Is this what we want? Does this make sense? It may, but I need to think about it more. The answer to this question can be influenced by the question of how to properly do a feature buffer.
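Rough arithmetic behind the memory claim, assuming uint8 storage for both representations:

```python
raw_frame = 210 * 160 * 3         # 100,800 bytes per stored RGB observation
feature_stack = 84 * 84 * 4       #  28,224 bytes per preprocessed 4-frame stack
print(raw_frame / feature_stack)  # ~3.57x smaller per buffer entry
```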
We got it to work!!! Increasing the number of workers from 0 to 1 brought the epoch time down to 11.7 minutes per epoch!!! The total time for running a full training (excluding evaluation) with 200M frames is now 2 days, which is fantastic. For reference, running models on Imagenet took us 2+ days, and the original DQN implementation took over a week to run!!! We timed the per-minibatch time and it came to 14ms, which aligns with our previous estimate for the forward and backpropagation time. We doubt we'll get it to run much better than this, mainly because in theory the bottleneck should now be the forward and backward passes themselves, not the minibatch generation, and only the latter is under our control. Regardless, this performance is more than satisfactory. The only things left to fix are:
Also, we noticed that using the built-in model cloning method in Keras and manually compiling the copy we use for the target model seems to be faster than deepcopying. This is a tiny optimization we could look at in the future, but we don't think it's worth the extra code complexity.
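For reference, a sketch of the Keras cloning approach mentioned above; the toy dense model, optimizer, and loss below are placeholders, not the actual DQN network:

```python
from keras.layers import Dense
from keras.models import Sequential, clone_model

# Toy stand-in for the DQN network; the real model is a conv net.
online_model = Sequential([Dense(4, input_shape=(8,), activation='relu')])
online_model.compile(optimizer='rmsprop', loss='mse')

# clone_model copies the architecture only, so copy the weights explicitly
# and compile the clone manually (as described in the comment above).
target_model = clone_model(online_model)
target_model.set_weights(online_model.get_weights())
target_model.compile(optimizer='rmsprop', loss='mse')
```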
Two changes:

- Add a case in the `get_states_length` method to account for when `pair_max` is false and the frame stack we sample is 4 frames instead of 5.
- Use the `get_states_length` method to calculate the correct point at which to invalidate indices at the far left of the valid-index memory buffer. The leftmost `get_states_length` entries of the memory buffer are invalid for sampling because there are not enough frames to their left to sample a full stack. Not accounting for this in the index invalidation logic was creating index errors.
When selecting actions we pass a stack of frames (our feature) to the model; in particular, we pass the most recent stack of frames (the rightmost end) in the replay memory. If the action is selected early in a new episode, some of the frames in the recent frame stack will come from the previous episode. If we stacked these frames we would be passing frames from two different episodes to the model, which clearly wouldn't make sense as a feature. We therefore made a change to action selection: when the most recent frame stack crosses an episode boundary, we select a random action instead of passing the frame stack to the model. This avoids the issue. Furthermore, since frame stacks are relatively small (4-5 frames) and episode starts are not that common, this should not affect the performance of the agent very much.
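A sketch of that boundary-aware action selection; the function and argument names are hypothetical, and epsilon-greedy exploration is omitted:

```python
import numpy as np


def select_action(model, frame_stack, recent_dones, num_actions):
    """Pick a greedy action from the latest frame stack, unless the stack
    crosses an episode boundary, in which case fall back to a random action.

    `recent_dones` holds the done flags of the earlier frames in the stack."""
    if frame_stack is None or any(recent_dones):
        return np.random.randint(num_actions)
    q_values = model.predict(frame_stack[np.newaxis, ...])
    return int(np.argmax(q_values))
```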
As mentioned in previous commits, we solved the issue that limited us from running two threads: one for creating minibatches and another for backpropagating. The problem had to do with copies of the same model running on different threads. We also removed the shuffling of minibatch indices. Since our agent is stateful, our minibatch generation only makes sense in order; hence the change.
…l callback

As stated in the PR, the evaluation callback as written cannot work for non-classical environments, specifically our Atari one. This is because the code for generating the features fed to the model to get the next action is part of the AtariSequence class, and the evaluation callback does not have access to this class. To be able to pass this class to the callback we had to make changes to the AtariSequence class that let it play nicely with the callback:

- Extracted the core logic inside the for loop of the `simulate` method so that the logic for simulating a single iteration lives in its own method, called `simulate_single`. This makes the logic of both `simulate` and `simulate_single` clearer.
- Wrote getter methods for the latest state variables we didn't already have, namely reward, done, and action. As we mention below, this helped clean up our logic. It also gives the evaluation callback an interface to ask the AtariSequence class for the reward and done values at each single simulation step, to calculate model performance (based on total rewards) and to control episode-termination logic (based on the done value).

We also removed some unnecessary complexity from the SynchronousSequence class:

- Since the OpenAI Gym implementation of the Atari environments has frame skipping built in, we removed our manual frame skipping. It is no longer a constructor parameter, nor does it influence control flow in the simulation logic. We removed it from both the SynchronousSequence class and the AtariSequence class.
- Since we wrote a method for simulating a single step, we no longer need the variables `self.initial_sims` and `self.initial_iterations`. These variables accounted for the fact that, to fill the replay buffer initially by some amount N, we needed to do so in multiples of the minimum batch size. Using `simulate_single` instead of `simulate` for the initial filling of the replay buffer is significantly simpler and more intuitive. We updated references to these two extraneous variables throughout the class, using `self.replay_buffer_min` instead.
- Removed the internal variables keeping track of the last simulated state (e.g. `self.prev_observation`, `self.prev_action`, `self.prev_done`, `self.prev_reward`). The replay buffer already contains all of this information (in the rightmost buffer entries). Rather than storing these in separate variables, we wrote getters that return the rightmost (index - 1) entry of each buffer. Since indexing Python lists is O(1), we should not incur any serious compute cost from this, and it simplifies the code significantly.
A major bug with the evaluation callback when used for the Atari environment is that the callback did not contain the methods for generating features and actions. To correct this we modified the custom callback class so that (a condensed sketch follows):

- Instead of passing the environment, the gamma value, and the exploration schedule, we just pass a method that constructs an instance of the appropriate SynchronousSequence subclass for the problem.
- Each time simulations are required to evaluate the performance of an agent, an instance of the SynchronousSequence subclass is created and passed to the callback class's `simulate_episodes` method. The instantiation of sequence classes can be seen in the `on_train_begin` and `on_epoch_end` methods.
- The `simulate_episodes` method uses `simulate_single`, the state getter methods, and the internal iteration and episode tracker variables contained in the SynchronousSequence class. This way the callback's simulation logic has access to the logic for computing complex features and the resulting actions contained in the sequence class.
- Given that the sequence class provides the complete logic for getting states, we removed the methods for getting actions from the custom callback class. The sequence class takes care of this now.

We also added missing documentation for the callback class constructor parameter `num_max_iter`.
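A condensed sketch of that wiring; the Keras `Callback` hooks are real, but the sequence getter names and the logging are assumptions made for illustration:

```python
from keras.callbacks import Callback


class EvaluationCallback(Callback):
    """Evaluation is delegated to a SynchronousSequence subclass instance."""

    def __init__(self, create_sequence, num_episodes, num_max_iter):
        super().__init__()
        # `create_sequence` builds a sequence instance, e.g. a bound
        # AtariSequence.create_validation_instance method.
        self.create_sequence = create_sequence
        self.num_episodes = num_episodes
        self.num_max_iter = num_max_iter

    def on_epoch_end(self, epoch, logs=None):
        sequence = self.create_sequence()
        rewards, episodes = self.simulate_episodes(sequence)
        print('episodes: %d, mean reward: %.2f'
              % (episodes, sum(rewards) / max(1, episodes)))

    def simulate_episodes(self, sequence):
        # The sequence handles feature construction and action selection;
        # the callback only steps it and tallies rewards per episode.
        rewards, episode_reward, episodes, iterations = [], 0.0, 0, 0
        while episodes < self.num_episodes and iterations < self.num_max_iter:
            sequence.simulate_single()
            episode_reward += sequence.get_latest_reward()  # assumed getter name
            iterations += 1
            if sequence.get_latest_done():                  # assumed getter name
                rewards.append(episode_reward)
                episode_reward, episodes = 0.0, episodes + 1
        return rewards, episodes
```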
We made the necessary changes to have the evaluation callback use the sequence class for generating features and selecting actions. Some outstanding items:
We're so close!!!
- Assume that the sequence constructor has a parameter `random` that, if set to true, returns a sequence representing an agent using the random policy.
- Use this `random` parameter to instantiate the random-policy agent in the `on_train_begin` method.
- Fix the `agmeth.evaluate_state` method signature to include the graph parameter (a more recent addition). Pass the graph to this method by pulling the graph property from the sequence class constructed for evaluation.
- Instead of using the environment contained in the constructed sequence to get initial states, for each initial state we want to sample we construct a new sequence and simulate it for the minimum number of steps needed to build the initial feature. This is done in the `on_epoch_end` method. Doing it this way lets us use the shared API we built for SynchronousSequence subclasses to generate the necessary features.
We found that sampling 10,000 initial states at the end of each epoch would take 20 minutes each time. This is more than 2x the length of an epoch itself! Instead of doing this every epoch we made two changes (see the sketch below):

- Move the sampling of initial states to the construction of the custom callback. This way we only pay the cost of sampling the states once per training run rather than once per epoch.
- Serialize the sampled initial states and save them into a compressed .npz file. This way we only incur the sampling cost once **per model** (possibly per environment), and we get a consistent baseline of states for comparing the performance of different versions of the model (or of different but similar models on the same environment). Doing this makes evaluation extremely fast, at the price of the initial run needed to generate the sampled initial states.
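A sketch of the caching idea; the function, the `sample_state` helper, and the archive key are hypothetical names:

```python
import os

import numpy as np


def load_or_sample_initial_states(sample_state, num_states, path=None):
    """Reuse cached initial states if present; otherwise sample and cache them."""
    if path is not None and os.path.exists(path):
        return np.load(path)['states']
    states = np.stack([sample_state() for _ in range(num_states)])
    if path is not None:
        np.savez_compressed(path, states=states)  # pay the sampling cost once
    return states
```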
…lass

We noticed that it would make more sense for the constructor of evaluation sequence instances to be part of the subclass itself. We added a method `create_validation_instance` to each subclass of `SynchronousSequence`. This method returns an instance of the subclass with parameters chosen for the validation callback, and it takes the evaluation exploration schedule as a parameter. This is what the (updated -- see next commits) evaluation callback expects. We also created the constructor for evaluating agents that use the `EnvironmentSequence` class, which we had not implemented before. This allows us to use the evaluation callback, in its current implementation, with our classical control agents, making our changes to the callback backwards compatible with our previous work.
…ck def

We made breaking changes to the evaluation callback and sequence subclasses that broke compatibility with the classical control agents. We updated the rundef of these agents to make them backwards compatible:

- Extract the graph for these agents' models so we can run generators and models in different threads if needed for double DQN. Double DQN is not implemented for these models, so it's not relevant here, but the SynchronousSequence constructor expects the graph to be passed anyway. We may simplify this extraneous parameter for these agents at a later time.
- Move the instantiation of the sequence class before the callback parameter definitions so we can pass a reference to a sequence instance method to the callback as a parameter.
- Update the evaluation callback parameters to match how the evaluation callback is currently defined. Of note is passing the sequence subclass's `create_validation_instance` method to the evaluation callback so it can create instances of the sequence for evaluation purposes.
…ctor param

- We re-added the epsilon constructor parameter, which lets us pass the constructor a lambda that calculates the epsilon to use based on the iteration number.
- Rather than taking the `random` parameter (True/False), the sequence constructor function now takes the epsilon lambda. This simplifies how we handle creating different types of agents for evaluation (see the example below).
- Fixed a bug where the calculated number of episodes run was one larger than the number actually run.
- Fixed the logic for serializing the sampled initial states so that if the parameter `init_state_dir` is None we do not serialize the sampled states at the end of the sampling method.
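For example, the epsilon parameter can now carry an arbitrary schedule; the values below are illustrative (the DQN papers anneal from 1.0 to 0.1 over the first million frames and evaluate with 0.05):

```python
# Training: linearly anneal epsilon from 1.0 to 0.1 over the first 1M iterations.
train_epsilon = lambda iteration: max(0.1, 1.0 - 0.9 * iteration / 1_000_000)

# Evaluation: a small constant epsilon; a random-policy agent is just epsilon = 1.
eval_epsilon = lambda iteration: 0.05
random_epsilon = lambda iteration: 1.0
```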
We should be good to go! Finally!!! :D :D :D
Note that unlike the paper, instead of normalizing each individual reward we normalize groups of 3-5 rewards, depending on how many frames are skipped. In practice this should shrink the scale of the rewards (since we're mapping 3-5 rewards to the interval [-1, 1]), but overall it should be less restrictive than the normalization in the original paper. This is because we give a group of actions a positive reward whenever its net outcome is positive, even if some rewards within the group are negative. In theory, rewarding positive reward groups should be more expressive than rewarding single positive rewards. A small worked example follows.
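```python
import numpy as np

group = [2.0, -1.0, -1.0, 3.0]           # rewards over one action-repeat window
per_frame = [np.sign(r) for r in group]  # paper-style clipping: [1, -1, -1, 1]
group_clipped = np.sign(sum(group))      # our scheme: sign(3.0) = 1.0
print(per_frame, group_clipped)
```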
We keep getting the same error. If it's not this, I'm not really sure what it is. It could be related to:
However, Python should be cleaning up all of these variables, so this would be quite surprising. It's also worth noting that the time per epoch continues to increase as training goes on, even beyond the 5th epoch, which is when the replay memory should reach its maximum size. Training time increasing as epochs go on is also surprising because the code is written to be independent of the replay memory size (indexing Python lists is supposed to be O(1)).
In #1 we successfully implemented DQN starting from the framework we implemented in github.com/dbulhosa/ilsvrc for training Imagenet models. We tested our DQN implementation by solving OpenAI Gym's CartPole-v0 environment. In #2, #3, and #4 we tuned our hyperparameters and models to create agents that solved not just CartPole-v0 better than our first solution, but also Acrobot-v1, MountainCar-v0, and CartPole-v1. Solving these problems and extending our library in the process has made us more confident in this framework, sufficiently so to use it with Atari games.

We next extend the framework to accommodate Atari games. This requires: