[Feature Request] Allowing Multiple Rewards #1160
Comments
https://github.com/openai/gym/blob/master/gym/core.py#L86: where did you see that?
You have the info dictionary for that.
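A minimal sketch of that suggestion (hypothetical env and reward names, assuming the gym 4-tuple step API that SB3 expected at the time): keep the scalar reward the algorithm trains on, and expose each component through info.

import gym
import numpy as np
from gym import spaces

class MultiRewardEnv(gym.Env):
    """Toy env that reports per-component rewards via the info dict."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        self.steps += 1
        progress_reward = 1.0                      # e.g. getting closer to the end
        coin_reward = 0.5 if action == 1 else 0.0  # e.g. collecting a coin
        reward = progress_reward + coin_reward     # scalar reward, as gym.Env promises
        info = {"progress_reward": progress_reward, "coin_reward": coin_reward}
        done = self.steps >= 10
        return np.zeros(1, dtype=np.float32), reward, done, info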
Hey araffin, thanks for the speedy response :)
All envs, including custom envs, must inherit from gym.Env.
Hey qgallouedec, thanks for joining the convo. I'm not too sure then, because after training an SB3 model I manually used my environment's env.reset() and env.step() together with model.predict() to generate an episode, and when I modified my step function to return a tuple reward it worked fine. And I'm definitely inheriting from gym.Env.
You are asking a question about Python here, more than about gym or SB3. Inheritance must follow the Liskov substitution principle. One of the corollaries is that you can't override the type of the returned objects. See python/mypy#1237. In the context of your question, since the superclass gym.Env defines the reward returned by step() as a float, a subclass should not change it to a tuple.
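To make the substitution point concrete, a small standalone sketch (hypothetical classes, with step stripped down to returning only the reward): code written against the base class breaks as soon as a subclass changes the reward's type.

class BaseEnv:
    def step(self, action) -> float:
        return 1.0  # base contract: the reward is a float

class TupleRewardEnv(BaseEnv):
    def step(self, action):
        return (1.0, 0.5)  # changes the return type, so callers of BaseEnv can break

def total_return(env: BaseEnv, n_steps: int) -> float:
    # Written against BaseEnv, so it assumes scalar rewards.
    return sum(env.step(0) for _ in range(n_steps))

print(total_return(BaseEnv(), 3))    # 3.0
# total_return(TupleRewardEnv(), 3)  # raises TypeError: 0 + (1.0, 0.5)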
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env(env_id, monitor_kwargs=dict(info_keywords=("info_key_to_log",)))
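(For reference: Monitor copies each name listed in info_keywords from the info dict of an episode's last step into that episode's record, so the value ends up in the monitor CSV next to the episode reward and length. Note that info_keywords expects a tuple of key names, hence the trailing comma above.)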
Use this instead (here, I log prob):

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback

class TensorboardCallback(BaseCallback):
    def _on_step(self) -> bool:
        probs = self.locals["infos"][0]["prob"]
        self.logger.record("prob", probs)
        return True

PPO("MlpPolicy", "Taxi-v3", tensorboard_log="./tensorboard/").learn(10_000, callback=TensorboardCallback())
Hi Quentin, everything you said makes sense, thank you for that. I'm not too familiar with SB3, so I'll have to spend a bit of time understanding the logger and monitoring classes and testing it out. Thanks again! I'll close it for the time being.
🚀 Feature
In env.step(), allow the reward to be not just a scalar value but a list or tuple of rewards, e.g. reward = (reward1, reward2, reward3).
Motivation
OpenAI Gym allows you to return a tuple of rewards, e.g. in a car racing game: the reward for getting closer to the end, the reward for collecting coins, and so on.
There is much benefit in being able to log these individual rewards from the environment: it would allow for faster debugging, reward tuning, model explainability, etc.
Pitch
Ideally, there are two components:
Thank you!
Alternatives
No response
Additional context
No response