Commit 80ef317

Add xtune to finetuning (opea-project#1432)
Signed-off-by: jilongwa <[email protected]>
1 parent a763c35 commit 80ef317

32 files changed: +4937 −1 lines

.github/workflows/docker/compose/finetuning-compose.yaml (4 additions, 0 deletions)

```diff
@@ -11,3 +11,7 @@ services:
     build:
       dockerfile: comps/finetuning/src/Dockerfile.intel_hpu
     image: ${REGISTRY:-opea}/finetuning-gaudi:${TAG:-latest}
+  finetuning-xtune:
+    build:
+      dockerfile: comps/finetuning/src/Dockerfile.xtune
+    image: ${REGISTRY:-opea}/finetuning-xtune:${TAG:-latest}
```

comps/finetuning/deployment/docker_compose/compose.yaml (20 additions, 0 deletions)

```diff
@@ -15,6 +15,26 @@ services:
       - HF_TOKEN=${HF_TOKEN}
     ipc: host
     restart: always
+  finetuning-xtune:
+    image: ${REGISTRY:-opea}/finetuning-xtune:${TAG:-latest}
+    container_name: finetuning-xtune
+    ports:
+      - "${PORT1:-8015}:8015"
+      - "${PORT2:-8265}:8265"
+      - "${PORT3:-7860}:7860"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - HF_TOKEN=${HF_TOKEN}
+    devices:
+      - "/dev/dri:/dev/dri"
+    volumes:
+      - ${DATA:-/data}:${DATA:-/data}
+    group_add:
+      - ${RENDER_GROUP_ID:-110}
+    ipc: host
+    restart: always
   finetuning-gaudi:
     extends: finetuning
     image: ${REGISTRY:-opea}/finetuning-gaudi:${TAG:-latest}
```
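All host-facing settings in the new service use Compose's `${VAR:-default}` interpolation, so `PORT1`, `PORT2`, `PORT3`, `DATA`, and `RENDER_GROUP_ID` can be overridden from the host environment and otherwise fall back to the defaults shown. The expansion rule itself is plain POSIX shell:

```shell
# ${VAR:-default} expands to $VAR when it is set and non-empty,
# otherwise to the default on the right of ":-".
unset PORT1
echo "${PORT1:-8015}"   # prints 8015

PORT1=9000
echo "${PORT1:-8015}"   # prints 9000
```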

comps/finetuning/src/Dockerfile.xtune (new file, 56 additions)

```dockerfile
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Use the same Python version as Ray
FROM python:3.10.14

ARG HF_TOKEN
ARG DATA

ENV HF_TOKEN=$HF_TOKEN
ENV DATA=$DATA

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

RUN wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg && \
    echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy unified" | \
    tee /etc/apt/sources.list.d/intel-gpu-jammy.list && \
    apt update -y && \
    apt install -y \
    libze-intel-gpu1 libze1 intel-opencl-icd clinfo \
    libze-dev intel-ocloc \
    intel-level-zero-gpu-raytracing \
    vim \
    rsync

COPY comps /home/user/comps

RUN chown -R user /home/user/comps/finetuning

ENV PATH=$PATH:/home/user/.local/bin
RUN cd /home/user/comps/finetuning/src/integrations/xtune && git config --global user.name "test" && git config --global user.email "test" && bash prepare_xtune.sh

RUN python -m pip install --upgrade pip setuptools peft && \
    python -m pip install -r /home/user/comps/finetuning/src/requirements.txt && \
    python -m pip install --no-deps transformers==4.45.0 datasets==2.21.0 accelerate==0.34.2 peft==0.12.0

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/finetuning/src

RUN echo PKGPATH=$(python3 -c "import pkg_resources; print(pkg_resources.get_distribution('oneccl-bind-pt').location)") >> run.sh && \
    echo 'export LD_LIBRARY_PATH=$PKGPATH/oneccl_bindings_for_pytorch/opt/mpi/lib/:$LD_LIBRARY_PATH' >> run.sh && \
    echo 'source $PKGPATH/oneccl_bindings_for_pytorch/env/setvars.sh' >> run.sh && \
    echo 'export FINETUNING_COMPONENT_NAME="XTUNE_FINETUNING"' >> run.sh && \
    echo ray start --head --dashboard-host=0.0.0.0 >> run.sh && \
    echo export RAY_ADDRESS=http://localhost:8265 >> run.sh && \
    echo python opea_finetuning_microservice.py >> run.sh && \
    echo 'export DATA=$DATA' >> run.sh && \
    echo 'ZE_AFFINITY_MASK=0 llamafactory-cli webui &' >> run.sh

CMD bash run.sh
```
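The final `RUN` instruction assembles the container's startup script one `echo` append at a time; single-quoted lines defer `$`-expansion until the script actually runs, while unquoted lines are expanded at build time. A minimal sketch of the same pattern, shortened to a few of the lines above (the real script also wires up the oneCCL library paths):

```shell
# Build a startup script by appending lines, as the Dockerfile's RUN does.
rm -f run.sh
echo 'export FINETUNING_COMPONENT_NAME="XTUNE_FINETUNING"' >> run.sh
echo 'ray start --head --dashboard-host=0.0.0.0' >> run.sh
echo 'export RAY_ADDRESS=http://localhost:8265' >> run.sh
echo 'python opea_finetuning_microservice.py' >> run.sh
cat run.sh
```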

comps/finetuning/src/README.md (12 additions, 1 deletion)

````diff
@@ -39,7 +39,9 @@ ray start --address='${head_node_ip}:6379'
 
 ```bash
 export HF_TOKEN=${your_huggingface_token}
-python finetuning_service.py
+# export FINETUNING_COMPONENT_NAME="which component you want to run"
+# export FINETUNING_COMPONENT_NAME="OPEA_FINETUNING" or export FINETUNING_COMPONENT_NAME="XTUNE_FINETUNING"
+python opea_finetuning_microservice.py
 ```
 
 ## 🚀2. Start Microservice with Docker (Option 2)
@@ -99,6 +101,10 @@ cd ../deployment/docker_compose
 docker compose -f compose.yaml up finetuning-gaudi -d
 ```
 
+### 2.3 Setup Xtune on Arc A770
+
+Please follow the [doc](./integrations/xtune/README.md) to install Xtune on Arc A770.
+
 ## 🚀3. Consume Finetuning Service
 
 ### 3.1 Upload a training file
@@ -261,6 +267,11 @@ curl http://${your_ip}:8015/v1/finetune/list_checkpoints -X POST -H "Content-Typ
 
 After fine-tuning job is done, fine-tuned model can be chosen from listed checkpoints, then the fine-tuned model can be used in other microservices. For example, fine-tuned reranking model can be used in [reranks](../../rerankings/src/README.md) microservice by assign its path to the environment variable `RERANK_MODEL_ID`, fine-tuned embedding model can be used in [embeddings](../../embeddings/src/README.md) microservice by assign its path to the environment variable `model`, LLMs after instruction tuning can be used in [llms](../../llms/src/text-generation/README.md) microservice by assign its path to the environment variable `your_hf_llm_model`.
 
+### 3.5 Xtune
+
+Once you have followed `2.3 Setup Xtune on Arc A770`, you can access the Xtune web UI at http://localhost:7860/.
+Please see the [Xtune doc](./integrations/xtune/README.md) for details.
+
 ## 🚀4. Descriptions for Finetuning parameters
 
 We utilize [OpenAI finetuning parameters](https://platform.openai.com/docs/api-reference/fine-tuning) and extend it with more customizable parameters, see the definitions at [finetune_config](https://github.com/opea-project/GenAIComps/blob/main/comps/finetuning/src/integrations/finetune_config.py).
````

comps/finetuning/src/integrations/finetune_config.py (16 additions, 0 deletions)

```diff
@@ -37,6 +37,21 @@ class LoraConfig(BaseModel):
     target_modules: Optional[List[str]] = None
 
 
+class XtuneConfig(BaseModel):
+    tool: str = ""
+    trainer: str = ""
+    model: str = ""
+    config_file: str = ""
+    dataset: str = ""
+    dataset_root: str = ""
+    device: str = ""
+
+    @validator("tool")
+    def check_task(cls, v: str):
+        assert v in ["", "clip", "adaclip"]
+        return v
+
+
 class GeneralConfig(BaseModel):
     base_model: str = None
     tokenizer_name: Optional[str] = None
@@ -48,6 +63,7 @@ class GeneralConfig(BaseModel):
     save_strategy: str = "no"
     config: LoadConfig = LoadConfig()
     lora_config: Optional[LoraConfig] = LoraConfig()
+    xtune_config: Optional[XtuneConfig] = XtuneConfig()
     enable_gradient_checkpointing: bool = False
     task: str = "instruction_tuning"
```
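The `@validator("tool")` hook restricts `tool` to the supported backends (or empty, the default). A dependency-free sketch of the same membership check, with plain Python standing in for the pydantic validator (`check_tool` and `VALID_TOOLS` are illustrative names, not from the commit):

```python
VALID_TOOLS = ["", "clip", "adaclip"]  # values XtuneConfig.tool accepts

def check_tool(v: str) -> str:
    # Mirror the validator: reject anything outside the known set,
    # return the value unchanged when it is valid.
    if v not in VALID_TOOLS:
        raise ValueError(f"unsupported xtune tool: {v!r}")
    return v

print(check_tool("clip"))  # prints clip
try:
    check_tool("lora")
except ValueError as e:
    print(e)  # prints unsupported xtune tool: 'lora'
```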

New file (241 additions, 0 deletions)

```python
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os
import random
import re
import time
import urllib.parse
import uuid
from pathlib import Path
from typing import Dict

from fastapi import BackgroundTasks, File, Form, HTTPException, UploadFile
from pydantic_yaml import to_yaml_file
from ray.job_submission import JobSubmissionClient

from comps import CustomLogger, OpeaComponent, OpeaComponentRegistry
from comps.cores.proto.api_protocol import (
    FileObject,
    FineTuningJob,
    FineTuningJobCheckpoint,
    FineTuningJobIDRequest,
    FineTuningJobList,
    UploadFileRequest,
)
from comps.finetuning.src.integrations.finetune_config import FinetuneConfig, FineTuningParams

logger = CustomLogger("opea")

DATASET_BASE_PATH = "datasets"
JOBS_PATH = "jobs"
OUTPUT_DIR = "output"

if not os.path.exists(DATASET_BASE_PATH):
    os.mkdir(DATASET_BASE_PATH)
if not os.path.exists(JOBS_PATH):
    os.mkdir(JOBS_PATH)
if not os.path.exists(OUTPUT_DIR):
    os.mkdir(OUTPUT_DIR)

FineTuningJobID = str
CheckpointID = str
CheckpointPath = str

CHECK_JOB_STATUS_INTERVAL = 5  # Check every 5 secs

ray_client: JobSubmissionClient = None

running_finetuning_jobs: Dict[FineTuningJobID, FineTuningJob] = {}
finetuning_job_to_ray_job: Dict[FineTuningJobID, str] = {}
checkpoint_id_to_checkpoint_path: Dict[CheckpointID, CheckpointPath] = {}


# Background task to periodically update the job status
def update_job_status(job_id: FineTuningJobID):
    while True:
        job_status = ray_client.get_job_status(finetuning_job_to_ray_job[job_id])
        status = str(job_status).lower()
        # Ray status "stopped" is OpenAI status "cancelled"
        status = "cancelled" if status == "stopped" else status
        logger.info(f"Status of job {job_id} is '{status}'")
        running_finetuning_jobs[job_id].status = status
        if status in ("succeeded", "cancelled", "failed"):
            break
        time.sleep(CHECK_JOB_STATUS_INTERVAL)


async def save_content_to_local_disk(save_path: str, content):
    save_path = Path(save_path)
    try:
        if isinstance(content, str):
            with open(save_path, "w", encoding="utf-8") as file:
                file.write(content)
        else:
            with save_path.open("wb") as fout:
                content = await content.read()
                fout.write(content)
    except Exception as e:
        logger.info(f"Write file failed. Exception: {e}")
        raise HTTPException(status_code=500, detail=f"Write file {save_path} failed. Exception: {e}")


async def upload_file(purpose: str = Form(...), file: UploadFile = File(...)):
    return UploadFileRequest(purpose=purpose, file=file)


@OpeaComponentRegistry.register("XTUNE_FINETUNING")
class XtuneFinetuning(OpeaComponent):
    """A specialized finetuning component derived from OpeaComponent for finetuning services."""

    def __init__(self, name: str, description: str, config: dict = None):
        super().__init__(name, "finetuning", description, config)

    def create_finetuning_jobs(self, request: FineTuningParams, background_tasks: BackgroundTasks):
        model = request.model
        train_file = request.training_file
        finetune_config = FinetuneConfig(General=request.General)
        on_xpu = finetune_config.General.xtune_config.device == "XPU"
        if os.getenv("HF_TOKEN", None):
            finetune_config.General.config.token = os.getenv("HF_TOKEN", None)

        job = FineTuningJob(
            id=f"ft-job-{uuid.uuid4()}",
            model=model,
            created_at=int(time.time()),
            training_file=train_file,
            hyperparameters={},
            status="running",
            seed=random.randint(0, 1000) if request.seed is None else request.seed,
        )

        finetune_config_file = f"{JOBS_PATH}/{job.id}.yaml"
        to_yaml_file(finetune_config_file, finetune_config)

        global ray_client
        ray_client = JobSubmissionClient() if ray_client is None else ray_client
        xtune_cfg = finetune_config.General.xtune_config
        if xtune_cfg.tool == "clip":
            ray_job_id = ray_client.submit_job(
                # Entrypoint shell command to execute
                entrypoint=f"cd integrations/xtune/src/llamafactory/clip_finetune && export DATA={xtune_cfg.dataset_root} && bash scripts/clip_finetune/{xtune_cfg.trainer}.sh {xtune_cfg.dataset} {xtune_cfg.model} 0 {xtune_cfg.device} > /tmp/test.log 2>&1 || true",
            )
        elif on_xpu:
            ray_job_id = ray_client.submit_job(
                # Entrypoint shell command to execute
                entrypoint=f"cd integrations/xtune/src/llamafactory/adaclip_finetune && python train.py --config {xtune_cfg.config_file} --frames_dir {xtune_cfg.dataset_root}{xtune_cfg.dataset}/frames --top_k 16 --freeze_cnn --frame_agg mlp --resume {xtune_cfg.model} --xpu --batch_size 8 > /tmp/test.log 2>&1 || true",
            )
        else:
            ray_job_id = ray_client.submit_job(
                # Entrypoint shell command to execute
                entrypoint=f"cd integrations/xtune/src/llamafactory/adaclip_finetune && python train.py --config {xtune_cfg.config_file} --frames_dir {xtune_cfg.dataset_root}{xtune_cfg.dataset}/frames --top_k 16 --freeze_cnn --frame_agg mlp --resume {xtune_cfg.model} --batch_size 8 > /tmp/test.log 2>&1 || true",
            )

        logger.info(f"Submitted Ray job: {ray_job_id} ...")

        running_finetuning_jobs[job.id] = job
        finetuning_job_to_ray_job[job.id] = ray_job_id

        background_tasks.add_task(update_job_status, job.id)

        return job

    def list_finetuning_jobs(self):
        return FineTuningJobList(data=list(running_finetuning_jobs.values()), has_more=False)

    def retrieve_finetuning_job(self, request: FineTuningJobIDRequest):
        fine_tuning_job_id = request.fine_tuning_job_id

        job = running_finetuning_jobs.get(fine_tuning_job_id)
        if job is None:
            raise HTTPException(status_code=404, detail=f"Fine-tuning job '{fine_tuning_job_id}' not found!")
        return job

    def cancel_finetuning_job(self, request: FineTuningJobIDRequest):
        fine_tuning_job_id = request.fine_tuning_job_id

        ray_job_id = finetuning_job_to_ray_job.get(fine_tuning_job_id)
        if ray_job_id is None:
            raise HTTPException(status_code=404, detail=f"Fine-tuning job '{fine_tuning_job_id}' not found!")

        global ray_client
        ray_client = JobSubmissionClient() if ray_client is None else ray_client
        ray_client.stop_job(ray_job_id)

        job = running_finetuning_jobs.get(fine_tuning_job_id)
        job.status = "cancelled"
        return job

    def list_finetuning_checkpoints(self, request: FineTuningJobIDRequest):
        fine_tuning_job_id = request.fine_tuning_job_id

        job = running_finetuning_jobs.get(fine_tuning_job_id)
        if job is None:
            raise HTTPException(status_code=404, detail=f"Fine-tuning job '{fine_tuning_job_id}' not found!")
        output_dir = os.path.join(OUTPUT_DIR, job.id)
        checkpoints = []
        if os.path.exists(output_dir):
            # Iterate over the contents of the directory and add an entry for each
            for file in os.listdir(output_dir):
                file_path = os.path.join(output_dir, file)
                if os.path.isdir(file_path) and file.startswith("checkpoint"):
                    steps = re.findall(r"\d+", file)[0]
                    checkpoints.append(
                        FineTuningJobCheckpoint(
                            id=f"ftckpt-{uuid.uuid4()}",  # Generate a unique ID
                            created_at=int(time.time()),  # Use the current timestamp
                            fine_tuned_model_checkpoint=file_path,  # Directory path itself
                            fine_tuning_job_id=fine_tuning_job_id,
                            object="fine_tuning.job.checkpoint",
                            step_number=steps,
                        )
                    )
            if job.status == "succeeded":
                checkpoints.append(
                    FineTuningJobCheckpoint(
                        id=f"ftckpt-{uuid.uuid4()}",  # Generate a unique ID
                        created_at=int(time.time()),  # Use the current timestamp
                        fine_tuned_model_checkpoint=output_dir,  # Directory path itself
                        fine_tuning_job_id=fine_tuning_job_id,
                        object="fine_tuning.job.checkpoint",
                    )
                )

        return checkpoints

    async def upload_training_files(self, request: UploadFileRequest):
        file = request.file
        if file is None:
            raise HTTPException(status_code=404, detail="upload file failed!")
        filename = urllib.parse.quote(file.filename, safe="")
        save_path = os.path.join(DATASET_BASE_PATH, filename)
        await save_content_to_local_disk(save_path, file)

        file_bytes = os.path.getsize(save_path)
        file_info = FileObject(
            id=f"file-{uuid.uuid4()}",
            object="file",
            bytes=file_bytes,
            created_at=int(time.time()),
            filename=filename,
            purpose="fine-tune",
        )

        return file_info

    def invoke(self, *args, **kwargs):
        pass

    def check_health(self) -> bool:
        """Checks the health of the component.

        Returns:
            bool: True if the component is healthy, False otherwise.
        """
        return True
```
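Two behaviors in this file are easy to verify in isolation: the Ray-to-OpenAI status mapping done in `update_job_status`, and the step number pulled out of a `checkpoint-<N>` directory name in `list_finetuning_checkpoints`. A standalone sketch of both (the function names here are illustrative, not from the commit):

```python
import re

def normalize_status(ray_status: str) -> str:
    # Ray reports a stopped job as "stopped"; the OpenAI API calls it "cancelled".
    status = str(ray_status).lower()
    return "cancelled" if status == "stopped" else status

def checkpoint_step(dirname: str) -> str:
    # First run of digits in the directory name, e.g. "checkpoint-500" -> "500".
    return re.findall(r"\d+", dirname)[0]

print(normalize_status("STOPPED"))        # prints cancelled
print(checkpoint_step("checkpoint-500"))  # prints 500
```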
