
[Feature Request] Allowing Multiple Rewards #1160

Closed
1 task done

James-R-Han opened this issue Nov 7, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@James-R-Han

🚀 Feature

In env.step(), the reward would not have to be just a scalar value; it could be a list or tuple of rewards.

Ex. reward = (reward1, reward2, reward3).

Motivation

  1. OpenAI Gym allows you to return a tuple of rewards. Ex. in a car racing game: one reward for getting closer to the finish, another for collecting coins, etc.

  2. There is much benefit if we can log the rewards from the environment. It would allow for faster debugging, reward tuning, model explainability, etc.

Pitch

Ideally, there are two components:

  1. SB3 would sum this reward tuple and proceed with standard model learning.
  2. Tensorboard logging would show the different reward types over an episode (there should be an option, perhaps via info, to name the rewards coming back from env.step()). A rough sketch of the idea is below.
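For illustration only, a rough sketch of the requested shape (not current SB3 behavior; the component names are made up):

reward = (0.1, 1.0)                   # e.g. (progress_reward, coin_reward) returned by env.step()
scalar_reward = float(sum(reward))    # what the algorithm would actually optimize
reward_names = ("progress", "coins")  # optional naming, e.g. passed through info, for tensorboard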

Thank you!

Alternatives

No response

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo
@James-R-Han James-R-Han added the enhancement New feature or request label Nov 7, 2022
@James-R-Han James-R-Han changed the title Allowing Multiple Rewards [Feature Request] Allowing Multiple Rewards Nov 7, 2022
@araffin
Member

araffin commented Nov 7, 2022

OpenAI Gym allows you to return a tuple of rewards.

https://github.com/openai/gym/blob/master/gym/core.py#L86

Where did you see that?
The type of the reward is a float.

There is much benefit if we can log the rewards from the environment.

You have got the info dictionary for that, and you can use wrappers/callbacks to log additional data.
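For example, a minimal sketch of the wrapper route (the wrapper name and the "reward_components" key are hypothetical, and this assumes the old 4-tuple step API):

import gym


class InfoLoggingWrapper(gym.Wrapper):
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # keep the scalar reward untouched, but copy whatever you want to log
        # into info (here, a hypothetical attribute of the wrapped env)
        info["reward_components"] = getattr(self.env, "last_reward_components", {})
        return obs, reward, done, info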

@James-R-Han
Author

James-R-Han commented Nov 7, 2022

Hey araffin, thanks for the speedy response :)

  1. If you define a custom environment, you can return whatever you want in the reward variable; OpenAI Gym is just a framework for defining the key steps of an RL environment. The problem is having an OpenAI Gym environment interact with SB3, which expects the reward to be a float or an integer.

  2. The info dictionary is a good idea, but when I initially thought of it, I wasn't sure how I would be able to integrate/pass that through tensorboard. Do you have any ideas on how to do this?

@qgallouedec
Collaborator

  1. If you define a custom environment, you can return whatever you want in the reward variable; OpenAI Gym is just a framework for defining the key steps of an RL environment. The problem is having an OpenAI Gym environment interact with SB3, which expects the reward to be a float or an integer.

All envs, including custom envs, must inherit from gym.Env. Therefore, the overridden step method must match the abstract method in terms of typing. Since gym.Env defines the reward as a float, your custom env must also return a float reward. This is something you can't override.
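For example, a minimal sketch of a conforming custom env (made-up spaces and reward components, old 4-tuple step API): the reward handed to SB3 stays a float, and the individual components go into the info dict instead.

import gym
import numpy as np
from gym import spaces


class CarRacingToyEnv(gym.Env):
    # Hypothetical env: step() returns a float reward as gym.Env requires,
    # while the separate reward components live in the info dict.
    observation_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
    action_space = spaces.Discrete(2)

    def reset(self):
        return np.zeros(3, dtype=np.float32)

    def step(self, action):
        progress_reward, coin_reward = 0.1, 1.0  # made-up component values
        reward = float(progress_reward + coin_reward)  # a single float for SB3
        info = {"progress_reward": progress_reward, "coin_reward": coin_reward}
        return np.zeros(3, dtype=np.float32), reward, False, info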

@James-R-Han
Author

Hey qgallouedec, thanks for joining the convo.

I'm not so sure then, because after training an SB3 model, I manually used my gym environment's env.reset() and env.step() together with model.predict() to generate an episode, and when I modified my step function to return a tuple reward, it worked fine. And I'm definitely inheriting from gym.Env.

@qgallouedec
Collaborator

qgallouedec commented Nov 7, 2022

You are asking a question about Python here, more than about gym or SB3.

Inheritance must follow the Liskov substitution principle. One of the corollaries is that you can't override a method with an incompatible return type. See python/mypy#1237

In the context of your question, since the superclass gym.Env defines the reward as a float, you can't override that with a tuple (if it works, it's because of Python's great resilience, not because the code is properly written or structured).
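A rough, self-contained illustration of that point (simplified classes, not gym itself):

from typing import Tuple


class Base:
    def step(self) -> Tuple[object, float, bool, dict]:
        raise NotImplementedError


class Broken(Base):
    # A type checker such as mypy flags this override because the declared
    # return type is incompatible with the one in Base (a Liskov violation).
    # Python itself does not enforce annotations, which is why the code still runs.
    def step(self) -> Tuple[object, Tuple[float, float], bool, dict]:
        return object(), (0.1, 1.0), False, {}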

@qgallouedec
Collaborator

qgallouedec commented Nov 7, 2022

2. The info dictionary is a good idea, but when I initially thought of it, I wasn't sure how I would be able to integrate/pass that through tensorboard. Do you have any ideas on how to do this?

This should work:

from stable_baselines3.common.env_util import make_vec_env

# info_keywords expects a tuple of keys from the info dict
env = make_vec_env(env_id, monitor_kwargs=dict(info_keywords=("info_key_to_log",)))

EDIT: typo make_env_env -> make_vec_env
EDIT 2: Example code won't log into tensorboard

@qgallouedec
Collaborator

qgallouedec commented Nov 7, 2022

Use this instead (here, I log "prob" from the info dict with the "Taxi-v3" env):

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback


class TensorboardCallback(BaseCallback):
    def _on_step(self) -> bool:
        # self.locals["infos"] holds the info dicts of the vectorized envs
        # for the current step; here we read "prob" from the first env
        probs = self.locals["infos"][0]["prob"]
        self.logger.record("prob", probs)
        return True


PPO("MlpPolicy", "Taxi-v3", tensorboard_log="./tensorboard/").learn(10_000, callback=TensorboardCallback())

@James-R-Han
Author

Hi Quentin, everything you said makes sense, thank you for that. I'm not too familiar with SB3, so I'll have to spend a bit of time understanding the logger and monitoring classes and testing it out. Thanks again! I'll close it for the time being.
