Adding Unsloth for LoRA / QLoRA SFT Algorithm #23
Open

Maxusmusti wants to merge 10 commits into main from lora-check
Commits (10):
- 41eba5c: First pass LoRA SFT implementation (Maxusmusti)
- bb283c7: Update deps for new unsloth release versions (Maxusmusti)
- fd077c7: Update lora target module selection (Maxusmusti)
- b35bcdf: Updating data processing and dropout (Maxusmusti)
- 8394437: Update distributed settings (Maxusmusti)
- c3095b0: Separate deps and update package refs (Maxusmusti)
- d66f8a0: Remove extra backend (Maxusmusti)
- 586b781: Add basic docs/examples (Maxusmusti)
- 58c87a3: First round coderabbit feedback (Maxusmusti)
- 7f0d4c3: Add post-fix xformers version for proper cuda compatibility (Maxusmusti)
New file added (190 lines):
# LoRA + SFT Usage Guide

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that allows you to fine-tune large language models with significantly reduced memory requirements. Training hub implements LoRA combined with supervised fine-tuning (SFT) using the optimized Unsloth backend.

## Quick Start

### Basic LoRA Training

```python
from training_hub import lora_sft

result = lora_sft(
    model_path="meta-llama/Llama-2-7b-hf",
    data_path="./training_data.jsonl",
    ckpt_output_dir="./outputs",
    lora_r=16,          # LoRA rank
    lora_alpha=32,      # LoRA scaling parameter
    num_epochs=3,
    learning_rate=2e-4
)
```

### Single-GPU Launch

For standard single-GPU training, run your script directly with Python (same as other algorithms):

```bash
python my_training_script.py
```

### Multi-GPU Launch

**Important:** Unlike other algorithms in training-hub, which handle distributed setup internally, LoRA training requires `torchrun` for multi-GPU setups due to Unsloth's distributed training requirements:

```bash
# For 4 GPUs
torchrun --nproc-per-node=4 my_training_script.py

# For 8 GPUs
torchrun --nproc-per-node=8 my_training_script.py
```

## Installation

```bash
pip install training-hub[lora]
```

This includes:
- Unsloth optimizations for 2x faster training and 70% less VRAM
- PyTorch-optimized xformers for better performance
- TRL for advanced training features
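
If your CUDA toolkit does not match the prebuilt wheels, the CUDA-pinned extras mentioned under Troubleshooting below (for example `pip install training-hub[lora-cu129]` or `pip install training-hub[lora-cu130]`) may be the better starting point; which extra maps to which CUDA version is not spelled out here, so check the project's packaging metadata.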

## LoRA Parameters

### Core LoRA Settings
- **`lora_r`**: LoRA rank (default: 16). Higher values capture more information but use more memory.
- **`lora_alpha`**: LoRA scaling parameter (default: 32). Controls the magnitude of LoRA updates.
- **`lora_dropout`**: Dropout rate for LoRA layers (default: 0.0, the value Unsloth is optimized for).
- **`target_modules`**: List of modules to apply LoRA to (default: auto-detect).
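
The sketch below simply shows these four settings passed together; the values are illustrative rather than recommendations, and the module names assume a Llama-style architecture (see Custom Target Modules below).

```python
from training_hub import lora_sft

# Illustrative values only; lora_r, lora_alpha, lora_dropout, and target_modules
# are the settings described in the list above.
result = lora_sft(
    model_path="meta-llama/Llama-2-7b-hf",
    data_path="./training_data.jsonl",
    ckpt_output_dir="./outputs",
    lora_r=16,
    lora_alpha=32,
    lora_dropout=0.0,                    # keep at 0.0 for Unsloth's fast path
    target_modules=["q_proj", "v_proj"]  # omit to auto-detect
)
```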

### Quantization (QLoRA)
For even lower memory usage, enable 4-bit quantization:

```python
result = lora_sft(
    model_path="meta-llama/Llama-2-13b-hf",
    data_path="./data.jsonl",
    ckpt_output_dir="./outputs",
    lora_r=64,              # Higher rank for quantized model
    lora_alpha=128,
    load_in_4bit=True,      # Enable QLoRA
    learning_rate=1e-4      # Lower LR for quantized training
)
```

## Dataset Formats

LoRA training supports the same dataset formats as SFT:

### Messages Format (Recommended)
```json
{
  "messages": [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
  ]
}
```

### Alpaca Format
```json
{
  "instruction": "Explain machine learning",
  "input": "",
  "output": "Machine learning is..."
}
```
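
Since `data_path` points at a `.jsonl` file, each training example is one JSON object per line. A minimal sketch for producing such a file in the messages format (the records and filename here are made up for illustration):

```python
import json

# Hypothetical records following the messages format shown above.
records = [
    {
        "messages": [
            {"role": "user", "content": "What is machine learning?"},
            {"role": "assistant", "content": "Machine learning is..."},
        ]
    },
]

# Write one JSON object per line (JSONL).
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```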

## Memory Benefits

LoRA provides significant memory savings compared to full fine-tuning by only training low-rank adaptation matrices instead of the full model weights. The exact memory reduction depends on your specific model, LoRA configuration, and batch size settings.
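
As a rough back-of-the-envelope: for a single d x k weight matrix, LoRA trains r * (d + k) adapter parameters instead of the full d * k. The sketch below works this out for one 4096 x 4096 projection at rank 16; the matrix size is an assumption (Llama-2-7B's hidden size), and real totals depend on which modules you target and on optimizer state.

```python
# Rough trainable-parameter comparison for one targeted weight matrix.
# Assumes a 4096 x 4096 projection and lora_r = 16 (illustrative only).
d, k, r = 4096, 4096, 16

full_params = d * k        # parameters updated by full fine-tuning
lora_params = r * (d + k)  # parameters in the low-rank A and B matrices

print(f"full fine-tuning: {full_params:,}")                  # 16,777,216
print(f"LoRA adapter:     {lora_params:,}")                  # 131,072
print(f"fraction trained: {lora_params / full_params:.2%}")  # 0.78%
```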

## Multi-GPU Training

For distributed training across multiple GPUs:

```python
result = lora_sft(
    model_path="meta-llama/Llama-2-7b-hf",
    data_path="./large_dataset.jsonl",
    ckpt_output_dir="./outputs",

    # LoRA settings
    lora_r=32,
    lora_alpha=64,

    # Distributed training
    effective_batch_size=128,  # Total across all GPUs
    micro_batch_size=2,        # Per GPU

    # Training settings
    num_epochs=3,
    learning_rate=2e-4
)
```

Launch with torchrun:
```bash
torchrun --nproc-per-node=4 my_script.py
```

## Performance Tips

1. **Use Unsloth optimizations** (included by default)
2. **Enable BF16** for better performance: `bf16=True`
3. **Use sample packing**: `sample_packing=True`
4. **Optimize batch sizes**: Start with `micro_batch_size=2` and adjust
5. **For large models**: Use `load_in_4bit=True` for QLoRA
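
Putting several of these tips into a single call might look like the sketch below; the parameter names are the ones used elsewhere in this guide, and the values are only starting points to tune for your hardware.

```python
from training_hub import lora_sft

# Starting-point values only; adjust micro_batch_size (and drop load_in_4bit
# if you have memory headroom) based on your GPU.
result = lora_sft(
    model_path="meta-llama/Llama-2-7b-hf",
    data_path="./training_data.jsonl",
    ckpt_output_dir="./outputs",
    lora_r=16,
    lora_alpha=32,
    bf16=True,             # tip 2: BF16
    sample_packing=True,   # tip 3: sample packing
    micro_batch_size=2,    # tip 4: start small and adjust
    load_in_4bit=True      # tip 5: QLoRA for large models
)
```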

## Advanced Configuration

### Custom Target Modules
```python
result = lora_sft(
    model_path="meta-llama/Llama-2-7b-hf",
    data_path="./data.jsonl",
    ckpt_output_dir="./outputs",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Attention only
    lora_r=16,
    lora_alpha=32
)
```

> **Collaborator (review comment):** Does wandb need to be installed separately or does it come with one of the new dependencies? Iirc,

### Weights & Biases Integration
```python
result = lora_sft(
    model_path="meta-llama/Llama-2-7b-hf",
    data_path="./data.jsonl",
    ckpt_output_dir="./outputs",
    lora_r=16,
    lora_alpha=32,
    wandb_project="my-lora-project",
    wandb_entity="my-team"
)
```
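
As the review comment above notes, it is not confirmed here whether `wandb` is pulled in by the `[lora]` extras; if it is not, installing it separately with `pip install wandb` before enabling these options is the usual workaround.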

## Examples

See [lora_example.py](../lora_example.py) for complete working examples including:
- Basic LoRA training
- QLoRA with 4-bit quantization
- Multi-GPU distributed training
- Different dataset format handling

## Troubleshooting

### Memory Issues
- Reduce `micro_batch_size`
- Enable `load_in_4bit=True` for QLoRA
- Lower the `lora_r` value

### Multi-GPU Issues
- Ensure you're using `torchrun` for multi-GPU (not direct Python execution)
- Check that `effective_batch_size` is divisible by `nproc_per_node * micro_batch_size` (see the sanity-check sketch after this list)
- For very large models, try `enable_model_splitting=True`
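
A quick way to sanity-check that relationship before launching (the gradient-accumulation line is an assumption about how these knobs usually relate, not something this guide specifies):

```python
# Hypothetical pre-launch sanity check for the divisibility rule above.
nproc_per_node = 4          # GPUs passed to torchrun --nproc-per-node
micro_batch_size = 2        # per-GPU batch size
effective_batch_size = 128  # total batch size across all GPUs

per_optimizer_step = nproc_per_node * micro_batch_size
assert effective_batch_size % per_optimizer_step == 0, (
    "effective_batch_size must be divisible by nproc_per_node * micro_batch_size"
)

# Typically the remainder is made up with gradient accumulation.
print("gradient accumulation steps:", effective_batch_size // per_optimizer_step)  # 16
```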

### Installation Issues
- If xformers conflicts occur, the LoRA extras use PyTorch-optimized builds
- For CUDA version issues, try the appropriate extra: `[lora-cu129]` or `[lora-cu130]`

> **Review comment:** @Maxusmusti Would you mind adding this flag to the README? I think in most cases this flag will need to be present, unless the users get lucky with the prebuilt wheels. I can't suggest it, but the same change needs to be made on line 106.