Closed
Description
I've reproduced the whole StackLLaMA pipeline using the changes in #398 #399 #400
Here is the corresponding wandb report
A couple notes:
- As my base LLaMA I used huggyllama/llama-7b
- My supervised fine-tuning run was better than in the blog post, reaching a lower perplexity
- My reward modelling run was worse than the blog post's (67%), only reaching 63% after one epoch. So I ran it for two epochs and got ~66%, which I felt was sufficient
- The RL training curves look very similar. I found that I could achieve similar performance with a lower KL coefficient (0.02) in less training time (600 epochs vs. 1200), but I still have the run with the original KL coefficient (0.2)
I've also published my adapter weights on the hub
https://huggingface.co/mnoukhov/llama-7b-se-peft
https://huggingface.co/mnoukhov/llama-7b-se-rm-peft
https://huggingface.co/mnoukhov/llama-7b-se-rl-peft
Use the merge_peft script in #398 to merge huggyllama/llama-7b and llama-7b-se-peft into llama-7b-se. Then merge llama-7b-se with llama-7b-se-rm-peft to make the reward model, and with llama-7b-se-rl-peft to make StackLLaMA.
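For reference, a minimal sketch of what such a merge looks like with peft's `merge_and_unload`; the model and adapter names come from this issue, but the output paths are illustrative and the actual merge_peft script in #398 may differ in details:

```python
# Sketch: fold a LoRA adapter's weights into its base model and save the result.
def merge_adapter(base_model_name, adapter_name, output_dir):
    # Imports are deferred so the function can be defined without
    # torch/transformers/peft installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(
        base_model_name, torch_dtype=torch.float16
    )
    # Attach the adapter, then merge its weights into the base model
    # and drop the adapter wrappers.
    model = PeftModel.from_pretrained(base, adapter_name)
    model = model.merge_and_unload()

    model.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(base_model_name).save_pretrained(output_dir)

# Usage (names from this issue, output path illustrative):
# merge_adapter("huggyllama/llama-7b", "mnoukhov/llama-7b-se-peft", "llama-7b-se")
```

The same function can then be pointed at llama-7b-se with the rm or rl adapter to produce the reward model or StackLLaMA.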