[skyrl-train] Add example for on-policy distillation #585
Merged
SumanthRH merged 19 commits into NovaSky-AI:main on Nov 6, 2025
Conversation
Contributor
Code Review
This pull request adds an example for on-policy distillation, which is a great addition. The core logic change to support a separate reference model path is correct. I've added a few comments to the new example files:
- In main_on_policy_distill.py, I've suggested a refactoring for clarity and a change to the loss calculation to improve training stability.
- In run_on_policy_distill_math.sh, I've pointed out a misleading checkpoint path and a missing newline at the end of the file.
Overall, the changes look good and the example is very helpful.
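As background for the loss-calculation comment above, here is a minimal sketch of a reverse-KL token penalty of the kind this example builds on. The function name and signature are illustrative assumptions, not skyrl-train's actual `apply_reward_kl_penalty` interface:

```python
import torch

# Illustrative sketch only: names and signature are assumptions, not
# skyrl-train's actual `apply_reward_kl_penalty` interface.
def reverse_kl_penalty(student_logprobs: torch.Tensor,
                       teacher_logprobs: torch.Tensor,
                       response_mask: torch.Tensor) -> torch.Tensor:
    # On tokens sampled from the student, log pi_student - log pi_teacher is
    # a single-sample estimate of the reverse KL, KL(pi_student || pi_teacher).
    kl = student_logprobs - teacher_logprobs
    # Return the negated KL as a per-token reward so divergence is penalized.
    return -kl * response_mask
```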
skyrl-train/examples/on_policy_distillation/main_on_policy_distill.py (outdated)
skyrl-train/examples/on_policy_distillation/run_on_policy_distill_math.sh (outdated)
SumanthRH reviewed Oct 28, 2025
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
SumanthRH reviewed Oct 28, 2025
skyrl-train/examples/on_policy_distillation/run_on_policy_distill_math.sh (outdated)
…00/SkyRL into on_policy_distillation
tyler-griggs reviewed Nov 4, 2025
Member
tyler-griggs left a comment
Can you add a README.md to this examples sub-directory with a brief overview of what is in this example and how to run it?
skyrl-train/examples/on_policy_distillation/run_on_policy_distill_math.sh (outdated)
Added a README for On-Policy Distillation with usage instructions and references.
Added details about On-Policy Distillation and reverse KL loss in the README.
Updated README.md to enhance explanation of On-Policy Distillation and provide quickstart instructions.
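For reference, the reverse KL objective those README commits describe is conventionally written as follows (a math sketch of the standard definition, not copied from the README):

```latex
% Reverse KL from student \pi_\theta to teacher \pi_T, estimated on-policy,
% i.e. on sequences x sampled from the student itself:
\mathrm{KL}\left(\pi_\theta \,\|\, \pi_T\right)
  = \mathbb{E}_{x \sim \pi_\theta}\left[\log \pi_\theta(x) - \log \pi_T(x)\right]
```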
SumanthRH approved these changes Nov 6, 2025
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
Adds an example of extending the `RayPPOTrainer` to do on-policy distillation by providing a custom `apply_reward_kl_penalty` function, a pass-through advantage estimator, and using the `importance_sampling` loss function as detailed in the [thinky blog](https://tinker-docs.thinkingmachines.ai/losses#policy-gradient-importance_sampling).

[Figure: training curve for qwen3-4b-base]

Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
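As a rough illustration of the `importance_sampling` loss named in this commit message, following the linked Tinker docs rather than skyrl-train's actual loss registry (names and signatures below are assumptions):

```python
import torch

# Sketch of an unclipped importance-sampling policy-gradient loss in the
# spirit of the Tinker docs linked above; names and signatures are
# assumptions, not skyrl-train's loss interface.
def importance_sampling_loss(logprobs: torch.Tensor,
                             sampler_logprobs: torch.Tensor,
                             advantages: torch.Tensor,
                             response_mask: torch.Tensor) -> torch.Tensor:
    # Per-token ratio between the current policy and the policy that
    # generated the samples (e.g. the inference engine's weights).
    ratio = torch.exp(logprobs - sampler_logprobs)
    per_token = -ratio * advantages * response_mask
    # Mean over response tokens only.
    return per_token.sum() / response_mask.sum().clamp(min=1)
```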
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
# Overview

Adds an example of extending the `RayPPOTrainer` to do on-policy distillation by providing a custom `apply_reward_kl_penalty` function, a pass-through advantage estimator, and using the `importance_sampling` loss function as detailed in the [thinky blog](https://tinker-docs.thinkingmachines.ai/losses#policy-gradient-importance_sampling).

### Distilling RL-trained qwen3-4b-base (DAPO recipe) back into qwen3-4b-base

[Figure: training curve]

Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Overview
Adds an example of extending the `RayPPOTrainer` to do on-policy distillation by providing a custom `apply_reward_kl_penalty` function, a pass-through advantage estimator, and using the `importance_sampling` loss function as detailed in the thinky blog.
Distilling RL-trained qwen3-4b-base (DAPO recipe) back into qwen3-4b-base
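To make the "pass-through advantage estimator" concrete, a minimal sketch under assumed names (not the repo's registered estimator interface): with the reverse-KL penalty already folded into the token-level rewards, the estimator simply forwards them as advantages.

```python
import torch

# Minimal sketch under assumed names, not skyrl-train's registered
# advantage-estimator interface. The reverse-KL penalty is assumed to have
# already been applied to the token-level rewards upstream.
def passthrough_advantages(token_level_rewards: torch.Tensor,
                           response_mask: torch.Tensor):
    advantages = token_level_rewards * response_mask
    # No GAE and no value baseline; returns simply mirror the advantages.
    return advantages, advantages
```

Combined with the reverse-KL reward and the importance-sampling loss sketched earlier, this reduces the PPO training loop to pure on-policy distillation against the teacher.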