Reproducing StackLLaMA #401

@mnoukhov

Description

I've reproduced the whole StackLLaMA pipeline using the changes in #398, #399, and #400

Here is the corresponding wandb report

A couple notes:

  • As my base LLaMA I used huggyllama/llama-7b
  • My supervised fine-tuning run was better than in the blog post, reaching a lower perplexity
  • My reward modelling run was worse than the blog post's (67%), reaching only 63% after one epoch. So I ran it for two epochs and got ~66%, which I felt was sufficient
  • The RL training curves look very similar. I found that I could achieve similar performance with a lower KL coefficient (0.02) in less training time (600 epochs vs 1200), but I still have the run with the original KL coefficient (0.2)
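For reference, the KL coefficient mentioned above corresponds to the `init_kl_coef` setting in trl's `PPOConfig`. A minimal sketch of the lower-KL configuration, assuming trl's PPO trainer API (the model path and other values here are illustrative, not taken from the actual run):

```python
from trl import PPOConfig

# Hypothetical config for the lower-KL run; "./llama-7b-se" stands in for
# the locally merged supervised fine-tuned model.
config = PPOConfig(
    model_name="./llama-7b-se",
    init_kl_coef=0.02,   # lowered from the blog post's 0.2
    adap_kl_ctrl=True,   # trl's adaptive KL controller (the default)
)
```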

I've also published my adapter weights on the hub
https://huggingface.co/mnoukhov/llama-7b-se-peft
https://huggingface.co/mnoukhov/llama-7b-se-rm-peft
https://huggingface.co/mnoukhov/llama-7b-se-rl-peft

Use the merge_peft script in #398 to merge huggyllama/llama-7b and llama-7b-se-peft to make llama-7b-se.
Then merge llama-7b-se with llama-7b-se-rm-peft to make the reward model, and with llama-7b-se-rl-peft to make StackLLaMA.
