
[Feature Request] Allowing Multiple Rewards #1160

Closed
1 task done

James-R-Han opened this issue Nov 7, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@James-R-Han

🚀 Feature

In env.step(), the reward would not have to be just a scalar value; it could be a list or tuple of rewards.

Ex. reward = (reward1, reward2, reward3).

Motivation

  1. OpenAI Gym allows you to return a tuple of rewards. Ex. in a car racing game: one reward for getting closer to the finish, another for collecting coins, etc.

  2. There is much benefit if we can log the rewards from the environment. It would allow for faster debugging, reward tuning, model explainability, etc.

Pitch

Ideally, there are two components:

  1. SB3 would sum this reward tuple and proceed with standard model learning.
  2. Tensorboard logging would show the different reward types over an episode (there should be an option, perhaps via info, to name the rewards coming back from env.step()). A rough sketch of the idea is below.
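For illustration only, a rough sketch of the requested shape (not current SB3 behavior; the component names are made up):

reward = (0.1, 1.0)                   # e.g. (progress_reward, coin_reward) returned by env.step()
scalar_reward = float(sum(reward))    # what the algorithm would actually optimize
reward_names = ("progress", "coins")  # optional naming, e.g. passed through info, for tensorboard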

Thank you!

Alternatives

No response

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo
@James-R-Han James-R-Han added the enhancement New feature or request label Nov 7, 2022
@James-R-Han James-R-Han changed the title Allowing Multiple Rewards [Feature Request] Allowing Multiple Rewards Nov 7, 2022
@araffin
Member

araffin commented Nov 7, 2022

OpenAI Gym allows you to return a tuple of rewards.

https://github.com/openai/gym/blob/master/gym/core.py#L86

Where did you see that?
The type of the reward is a float.

There is much benefit if we can log the rewards from the environment.

You have got the info dictionary for that, and you can use wrappers/callbacks to log additional data.
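For example, a minimal sketch of the wrapper route (the wrapper name and the "reward_components" key are hypothetical, and this assumes the old 4-tuple step API):

import gym


class InfoLoggingWrapper(gym.Wrapper):
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # keep the scalar reward untouched, but copy whatever you want to log
        # into info (here, a hypothetical attribute of the wrapped env)
        info["reward_components"] = getattr(self.env, "last_reward_components", {})
        return obs, reward, done, info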

@James-R-Han
Author

James-R-Han commented Nov 7, 2022

Hey araffin, thanks for the speedy response :)

  1. If you define a custom environment, you can return whatever you want in the reward variable; OpenAI Gym is just a framework for defining the key steps of an RL environment. The problem is having an OpenAI Gym environment interact with SB3, which expects the reward to be a float or an integer.

  2. The info dictionary is a good idea, but when I initially thought of it, I wasn't sure how I would be able to integrate/pass that through tensorboard. Do you have any ideas on how to do this?

@qgallouedec
Collaborator

  1. If you define a custom environment, you can return whatever you want in the reward variable; OpenAI Gym is just a framework for defining the key steps of an RL environment. The problem is having an OpenAI Gym environment interact with SB3, which expects the reward to be a float or an integer.

All envs, including custom envs, must inherit from gym.Env. Therefore, the overridden step method must match the abstract method in terms of typing. Since gym.Env defines the reward as a float, your custom env must also return a float reward. This is something you can't override.
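For example, a minimal sketch of a conforming custom env (made-up spaces and reward components, old 4-tuple step API): the reward handed to SB3 stays a float, and the individual components go into the info dict instead.

import gym
import numpy as np
from gym import spaces


class CarRacingToyEnv(gym.Env):
    # Hypothetical env: step() returns a float reward as gym.Env requires,
    # while the separate reward components live in the info dict.
    observation_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
    action_space = spaces.Discrete(2)

    def reset(self):
        return np.zeros(3, dtype=np.float32)

    def step(self, action):
        progress_reward, coin_reward = 0.1, 1.0  # made-up component values
        reward = float(progress_reward + coin_reward)  # a single float for SB3
        info = {"progress_reward": progress_reward, "coin_reward": coin_reward}
        return np.zeros(3, dtype=np.float32), reward, False, info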

@James-R-Han
Author

Hey qgallouedec, thanks for joining the convo.

I'm not so sure then, because after training an SB3 model, I manually used my gym environment's env.reset() and env.step() together with model.predict() to generate an episode, and when I modified my step function to return a tuple reward, it worked fine. And I'm definitely inheriting from gym.Env.

@qgallouedec
Collaborator

qgallouedec commented Nov 7, 2022

You are asking a question about Python here, more than about gym or SB3.

Inheritance must follow the Liskov substitution principle. One of the corollaries is that you can't override a method with an incompatible return type. See python/mypy#1237

In the context of your question, since the superclass gym.Env defines the reward as a float, you can't override that with a tuple (if it works, it's because of Python's great resilience, not because the code is properly written or structured).
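A rough, self-contained illustration of that point (simplified classes, not gym itself):

from typing import Tuple


class Base:
    def step(self) -> Tuple[object, float, bool, dict]:
        raise NotImplementedError


class Broken(Base):
    # A type checker such as mypy flags this override because the declared
    # return type is incompatible with the one in Base (a Liskov violation).
    # Python itself does not enforce annotations, which is why the code still runs.
    def step(self) -> Tuple[object, Tuple[float, float], bool, dict]:
        return object(), (0.1, 1.0), False, {}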

@qgallouedec
Collaborator

qgallouedec commented Nov 7, 2022

2. The info dictionary is a good idea, but when I initially thought of it, I wasn't sure how I would be able to integrate/pass that through tensorboard. Do you have any ideas on how to do this?

This should work:

from stable_baselines3.common.env_util import make_vec_env

# info_keywords expects a tuple of keys from the info dict
env = make_vec_env(env_id, monitor_kwargs=dict(info_keywords=("info_key_to_log",)))

EDIT: typo make_env_env -> make_vec_env
EDIT 2: Example code won't log into tensorboard

@qgallouedec
Collaborator

qgallouedec commented Nov 7, 2022

Use this instead (here, I log "prob" from the info dict with the "Taxi-v3" env):

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback


class TensorboardCallback(BaseCallback):
    def _on_step(self) -> bool:
        # self.locals["infos"] holds the info dicts of the vectorized envs
        # for the current step; here we read "prob" from the first env
        probs = self.locals["infos"][0]["prob"]
        self.logger.record("prob", probs)
        return True


PPO("MlpPolicy", "Taxi-v3", tensorboard_log="./tensorboard/").learn(10_000, callback=TensorboardCallback())

@James-R-Han
Author

Hi Quentin, everything you said makes sense, thank you for that. I'm not too familiar with SB3, so I'll have to spend a bit of time understanding the logger and monitoring classes and testing it out. Thanks again! I'll close it for the time being.
