OpenSyntheticCC

Introduction

OpenSyntheticCC is a repository for fine-tuning language models on synthetic Chain-of-Thought (CoT) and code datasets. It provides scripts and configuration for large-scale supervised fine-tuning, with distributed training handled by DeepSpeed and torchrun.

Features

  • Fine-tuning on synthetic CoT and code datasets
  • Distributed training with DeepSpeed and torchrun
  • Customizable training parameters via shell scripts
  • Data collation and tokenization for instruction-following tasks
  • Example scripts for a quick start

Installation

  1. Clone the repository:
    git clone https://github.com/richardodliu/OpenSyntheticCC.git
    cd OpenSyntheticCC
  2. Install dependencies:
    pip install -r requirements.txt

Usage

1. Prepare your dataset

  • Format: JSONL; each line is a JSON object with instruction and response fields (see the sketch after this list).
  • Note: For privacy reasons, the dataset is not open-sourced at this time. We plan to release scripts for generating synthetic datasets in the future.
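
A quick way to sanity-check the expected shape is to write and parse a toy record yourself. The snippet below is illustrative only; the file name and the instruction/response text are made up, not part of this repository:

    # Create a one-line toy dataset in the expected JSONL shape.
    cat > toy_data.jsonl <<'EOF'
    {"instruction": "Write a Python function that sums a list.", "response": "def total(xs):\n    return sum(xs)"}
    EOF
    # Each line must parse as standalone JSON with both fields present.
    python -c "import json; [json.loads(l) for l in open('toy_data.jsonl')]" && echo "valid JSONL"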

2. Fine-tune a model

  • Edit sft.sh to set your model path, data path, and output directory.
  • Run the script:
    bash sft.sh
  • The script uses torchrun and DeepSpeed for distributed training. Training parameters (batch size, learning rate, etc.) can be modified in sft.sh; a sketch of a typical launch command follows below.
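
For orientation, a single-node launch inside sft.sh typically looks something like the sketch below. This is a hedged example, not the repository's actual script: the torchrun flags are standard, the first three finetune.py arguments are documented in this README, and the remaining flags are assumptions based on common Hugging Face Trainer conventions.

    # Hypothetical single-node, 8-GPU launch; adjust to match the real sft.sh.
    # Flags after --output_dir are assumed HF Trainer conventions, not confirmed.
    torchrun --nproc_per_node=8 finetune.py \
        --model_name_or_path /path/to/base-model \
        --data_path /path/to/train.jsonl \
        --output_dir ./checkpoints/sft-run \
        --deepspeed deepspeed.json \
        --per_device_train_batch_size 4 \
        --learning_rate 2e-5 \
        --num_train_epochs 3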

3. Custom Training

  • You can also run finetune.py directly:
    python finetune.py --model_name_or_path <MODEL_PATH> --data_path <DATA_PATH> --output_dir <OUTPUT_DIR> ...
  • See sft.sh for a complete example invocation with all arguments.

Distributed Training

  • DeepSpeed configuration is provided in deepspeed.json.
  • The script supports multi-node and multi-GPU training; a hedged multi-node sketch follows below.
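
For multi-node runs, each node launches the same command with its own rank. The sketch below uses standard torchrun rendezvous flags; the environment variable names and everything after finetune.py are illustrative assumptions, mirroring the single-node sketch above:

    # Hypothetical two-node launch (run once per node with its own NODE_RANK).
    torchrun --nnodes=2 --nproc_per_node=8 \
        --node_rank=$NODE_RANK \
        --master_addr=$MASTER_ADDR --master_port=29500 \
        finetune.py \
        --model_name_or_path /path/to/base-model \
        --data_path /path/to/train.jsonl \
        --output_dir ./checkpoints/sft-run \
        --deepspeed deepspeed.json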

File Overview

  • finetune.py: Main training script for supervised fine-tuning.
  • sft.sh: Example shell script for distributed training.
  • deepspeed.json: DeepSpeed configuration for efficient large model training.
  • git.sh: Helper script for quick git add/commit/push.
  • .gitignore: Ignore rules for logs, archives, and Java-related files.

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

License

This project is licensed under the MIT License.
