Skip to content

Commit dfc63d8

Browse files
authored
[data] add coig-p dataset (hiyouga#7657)
1 parent 77cdde7 commit dfc63d8

11 files changed

Lines changed: 325 additions & 915 deletions

File tree

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -384,6 +384,7 @@ You also can add a custom chat template to [template.py](src/llamafactory/data/t
384384

385385
- [DPO mixed (en&zh)](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k)
386386
- [UltraFeedback (en)](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)
387+
- [COIG-P (en&zh)](https://huggingface.co/datasets/m-a-p/COIG-P)
387388
- [RLHF-V (en)](https://huggingface.co/datasets/openbmb/RLHF-V-Dataset)
388389
- [VLFeedback (en)](https://huggingface.co/datasets/Zhihui/VLFeedback)
389390
- [Orca DPO Pairs (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)

README_zh.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -387,6 +387,7 @@ https://github.com/user-attachments/assets/43b700c6-a178-41db-b1f8-8190a5d3fcfc
387387

388388
- [DPO mixed (en&zh)](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k)
389389
- [UltraFeedback (en)](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)
390+
- [COIG-P (en&zh)](https://huggingface.co/datasets/m-a-p/COIG-P)
390391
- [RLHF-V (en)](https://huggingface.co/datasets/openbmb/RLHF-V-Dataset)
391392
- [VLFeedback (en)](https://huggingface.co/datasets/Zhihui/VLFeedback)
392393
- [Orca DPO Pairs (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)

data/README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -85,7 +85,7 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
8585

8686
### Pre-training Dataset
8787

88-
- [Example dataset](c4_demo.json)
88+
- [Example dataset](c4_demo.jsonl)
8989

9090
In pre-training, only the `text` column will be used for model learning.
9191

data/README_zh.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -85,7 +85,7 @@
8585

8686
### 预训练数据集
8787

88-
- [样例数据集](c4_demo.json)
88+
- [样例数据集](c4_demo.jsonl)
8989

9090
在预训练时,只有 `text` 列中的内容会用于模型学习。
9191

data/c4_demo.json

Lines changed: 0 additions & 902 deletions
This file was deleted.

data/c4_demo.jsonl

Lines changed: 300 additions & 0 deletions
Large diffs are not rendered by default.

data/dataset_info.json

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -527,6 +527,16 @@
527527
"rejected": "rejected"
528528
}
529529
},
530+
"coig_p": {
531+
"hf_hub_url": "m-a-p/COIG-P",
532+
"ranking": true,
533+
"formatting": "sharegpt",
534+
"columns": {
535+
"messages": "conversations",
536+
"chosen": "chosen",
537+
"rejected": "rejected"
538+
}
539+
},
530540
"rlhf_v": {
531541
"hf_hub_url": "llamafactory/RLHF-V",
532542
"ranking": true,
@@ -622,7 +632,7 @@
622632
}
623633
},
624634
"c4_demo": {
625-
"file_name": "c4_demo.json",
635+
"file_name": "c4_demo.jsonl",
626636
"columns": {
627637
"prompt": "text"
628638
}

requirements.txt

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
transformers>=4.41.2,<=4.51.1,!=4.46.*,!=4.47.*,!=4.48.0
2-
datasets>=2.16.0,<=3.4.1
3-
accelerate>=0.34.0,<=1.5.2
4-
peft>=0.14.0,<=0.15.0
2+
datasets>=2.16.0,<=3.5.0
3+
accelerate>=0.34.0,<=1.6.0
4+
peft>=0.14.0,<=0.15.1
55
trl>=0.8.6,<=0.9.6
66
tokenizers>=0.19.0,<=0.21.0
77
gradio>=4.38.0,<=5.21.0

src/llamafactory/__init__.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -20,9 +20,9 @@
2020
Dependency graph:
2121
main:
2222
transformers>=4.41.2,<=4.51.1,!=4.46.*,!=4.47.*,!=4.48.0
23-
datasets>=2.16.0,<=3.4.1
24-
accelerate>=0.34.0,<=1.5.2
25-
peft>=0.14.0,<=0.15.0
23+
datasets>=2.16.0,<=3.5.0
24+
accelerate>=0.34.0,<=1.6.0
25+
peft>=0.14.0,<=0.15.1
2626
trl>=0.8.6,<=0.9.6
2727
attention:
2828
transformers>=4.42.4 (gemma+fa2)

src/llamafactory/extras/misc.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -90,9 +90,9 @@ def check_version(requirement: str, mandatory: bool = False) -> None:
9090
def check_dependencies() -> None:
9191
r"""Check the version of the required packages."""
9292
check_version("transformers>=4.41.2,<=4.51.1,!=4.46.0,!=4.46.1,!=4.46.2,!=4.46.3,!=4.47.0,!=4.47.1,!=4.48.0")
93-
check_version("datasets>=2.16.0,<=3.4.1")
94-
check_version("accelerate>=0.34.0,<=1.5.2")
95-
check_version("peft>=0.14.0,<=0.15.0")
93+
check_version("datasets>=2.16.0,<=3.5.0")
94+
check_version("accelerate>=0.34.0,<=1.6.0")
95+
check_version("peft>=0.14.0,<=0.15.1")
9696
check_version("trl>=0.8.6,<=0.9.6")
9797
if is_transformers_version_greater_than("4.46.0") and not is_transformers_version_greater_than("4.48.1"):
9898
logger.warning_rank0_once("There are known bugs in transformers v4.46.0-v4.48.0, please use other versions.")

0 commit comments

Comments
 (0)