# OpenSyntheticCC

OpenSyntheticCC is a repository for fine-tuning language models on synthetic Chain-of-Thought (CoT) and code datasets. It provides scripts and configurations for distributed training, especially with DeepSpeed, and supports large-scale supervised fine-tuning.

## Features
- Fine-tuning on synthetic CoT and code datasets
- Distributed training with DeepSpeed and torchrun
- Customizable training parameters via shell scripts
- Data collation and tokenization for instruction-following tasks (a sketch follows this list)
- Example scripts for quick start
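To make the collation step concrete, here is a minimal sketch of how instruction-following examples are commonly tokenized for supervised fine-tuning, with the prompt tokens masked out of the loss. The prompt template and function name below are illustrative assumptions, not this repository's actual implementation (see `finetune.py` for that):

```python
# Illustrative sketch only: the template and helper name are assumptions,
# not OpenSyntheticCC's actual code (see finetune.py for the real logic).
from transformers import AutoTokenizer

IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips labels with this value


def build_example(tokenizer, instruction: str, response: str, max_len: int = 2048):
    """Tokenize one instruction/response pair, masking the prompt in the labels."""
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"  # assumed template
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    full_ids = tokenizer(prompt + response, add_special_tokens=False)["input_ids"]
    full_ids = full_ids[: max_len - 1] + [tokenizer.eos_token_id]
    # Compute the loss only on response tokens: prompt positions get IGNORE_INDEX.
    labels = [IGNORE_INDEX] * min(len(prompt_ids), len(full_ids)) + full_ids[len(prompt_ids):]
    return {"input_ids": full_ids, "labels": labels}


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for the demo
    ex = build_example(tok, "Add 2 and 3.", "2 + 3 = 5")
    print(len(ex["input_ids"]), ex["labels"][:8])
```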
## Installation
- Clone the repository:
  ```bash
  git clone https://github.com/your-username/OpenSyntheticCC.git
  cd OpenSyntheticCC
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
## Data Format
- Format: JSONL; each line should contain `instruction` and `response` fields.
- Note: for privacy reasons, the dataset is not open-sourced at this time. Scripts for generating synthetic datasets will be released in the future.
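For illustration, a single line of the training file might look like the following. The instruction and response text here are invented; only the field names `instruction` and `response` come from the format described above:

```json
{"instruction": "Write a Python function that checks whether a string is a palindrome.", "response": "Let's reason step by step: a string is a palindrome if it equals its reverse.\n\ndef is_palindrome(s: str) -> bool:\n    return s == s[::-1]"}
```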
## Training
- Edit `sft.sh` to set your model path, data path, and output directory.
- Run the script:
  ```bash
  bash sft.sh
  ```
- The script uses `torchrun` and DeepSpeed for distributed training. Training parameters (batch size, learning rate, etc.) can be modified in `sft.sh`.
- You can also run `finetune.py` directly:
  ```bash
  python finetune.py --model_name_or_path <MODEL_PATH> --data_path <DATA_PATH> --output_dir <OUTPUT_DIR> ...
  ```
- See `sft.sh` for a full example of arguments.
- DeepSpeed configuration is provided in `deepspeed.json` (a sketch of a typical configuration follows this list).
- The script supports multi-node and multi-GPU training.
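As a rough illustration of what `deepspeed.json` typically contains, here is a minimal ZeRO stage-2 configuration with mixed precision. The stage and values are assumptions for illustration; the `"auto"` placeholders are resolved by the Hugging Face Trainer integration, and the file shipped in this repository may differ:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_clipping": "auto"
}
```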
## Project Structure
- `finetune.py`: Main training script for supervised fine-tuning.
- `sft.sh`: Example shell script for distributed training.
- `deepspeed.json`: DeepSpeed configuration for efficient large-model training.
- `git.sh`: Helper script for quick git add/commit/push.
- `.gitignore`: Ignores logs, archives, and Java-related files.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request.
## License
This project is licensed under the MIT License.