Implementation of advantage function #4476
Unanswered
gauss-clb
asked this question in
Community | Q&A
Replies: 0 comments
Regarding https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/coati/experience_maker/naive.py#L52:

Why does `value` use only the prompt part (see https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/coati/models/base/critic.py#L49), while `r` uses prompt + response?

Also, why is `reward = r - self.kl_coef * kl_divergence(action_log_probs, base_action_log_probs)`? Is there any theory to support it?
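To make the second question concrete, here is a minimal sketch of how I read that formula. It is not the coati implementation: the function names `approx_kl` and `kl_penalized_reward` are hypothetical, and I am assuming the KL term is estimated as the mean per-token difference between the actor's and the base model's log-probabilities over the response tokens, which may differ from what `kl_divergence` actually computes.

```python
from typing import Sequence


def approx_kl(action_log_probs: Sequence[float],
              base_action_log_probs: Sequence[float]) -> float:
    """Assumed estimator: mean per-token log-ratio between actor and base model."""
    if len(action_log_probs) != len(base_action_log_probs) or not action_log_probs:
        raise ValueError("log-prob sequences must be non-empty and equal in length")
    total = sum(a - b for a, b in zip(action_log_probs, base_action_log_probs))
    return total / len(action_log_probs)


def kl_penalized_reward(r: float,
                        kl_coef: float,
                        action_log_probs: Sequence[float],
                        base_action_log_probs: Sequence[float]) -> float:
    """Reward-model score minus a penalty for drifting from the base model."""
    return r - kl_coef * approx_kl(action_log_probs, base_action_log_probs)


# Hypothetical numbers: a 2-token response where the actor is more confident
# than the base model on both tokens, so the penalty lowers the reward.
reward = kl_penalized_reward(
    r=1.0,
    kl_coef=0.1,
    action_log_probs=[-1.0, -2.0],
    base_action_log_probs=[-1.5, -2.5],
)
```

Under this reading, the reward equals `r` when the actor and the base model assign identical log-probabilities, and it shrinks as the actor drifts away from the base model.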