Reproducing StackLLaMA #401

@mnoukhov

Description

I've reproduced the whole StackLLaMA pipeline using the changes in #398, #399, and #400

Here is the corresponding wandb report

A couple notes:

  • As my base LLaMA I used huggyllama/llama-7b
  • My supervised fine-tuning run was better than in the blog post, reaching a lower perplexity
  • My reward modelling run was worse than the blog post's (67%), reaching only 63% after one epoch. So I ran it for two epochs and got ~66%, which I felt was sufficient
  • The RL training curves look very similar. I found that I could achieve similar performance with a lower KL coefficient (0.02) in less training time (600 epochs vs 1200), but I still have the run with the original KL coefficient (0.2)
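For reference, the KL coefficient mentioned above corresponds to the `init_kl_coef` setting in trl's `PPOConfig`. A minimal sketch of the lower-KL configuration, assuming trl's PPO trainer API (the model path and other values here are illustrative, not taken from the actual run):

```python
from trl import PPOConfig

# Hypothetical config for the lower-KL run; "./llama-7b-se" stands in for
# the locally merged supervised fine-tuned model.
config = PPOConfig(
    model_name="./llama-7b-se",
    init_kl_coef=0.02,   # lowered from the blog post's 0.2
    adap_kl_ctrl=True,   # trl's adaptive KL controller (the default)
)
```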

I've also published my adapter weights on the hub
https://huggingface.co/mnoukhov/llama-7b-se-peft
https://huggingface.co/mnoukhov/llama-7b-se-rm-peft
https://huggingface.co/mnoukhov/llama-7b-se-rl-peft

Use the merge_peft script in #398 to merge huggyllama/llama-7b and llama-7b-se-peft to make llama-7b-se.
Then merge llama-7b-se with llama-7b-se-rm-peft to make the reward model, and with llama-7b-se-rl-peft to make StackLLaMA.
