Can we decouple the data preprocessing/tokenization step from the fine-tuning phase? #2497

Hi @joecummings, following our discussion in https://github.com/kubeflow/trainer/pull/2410 and on the Kubeflow WG Training call, we found `torchtune` to be an amazing tool for fine-tuning LLMs and decided to adopt it as the low-level runtime for the Kubeflow LLM Trainer. We have already started the implementation based on `torchtune`. Thanks for your engagement in the discussion.

![Image](https://github.com/user-attachments/assets/d1ef53fc-04f5-4273-a432-2738bcc96377)

However, we want to **decouple the data preprocessing and tokenization step from the main fine-tuning phase**, so as to:

- **Reduce GPU usage time**: we will wrap `torchtune` in a container and request GPU resources for it (GPUs are expensive and billed by usage time)
- **Integrate the data preprocessing / tokenization step with our [data initializer](https://github.com/kubeflow/trainer/issues/2210)**: run these steps ahead of fine-tuning and offload them to CPU
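
To make the idea concrete, here is a rough sketch of the split we have in mind. It does not use any `torchtune` API: `toy_tokenize`, `preprocess_to_disk`, `load_pretokenized`, and the JSON Lines cache format are all hypothetical names and choices made up for illustration, and the whitespace tokenizer is only a stand-in for a real model tokenizer.

```python
import json
import tempfile
from pathlib import Path


def toy_tokenize(text, vocab):
    """Hypothetical stand-in tokenizer: maps whitespace-separated words to integer IDs."""
    # A new word gets the next free ID; a known word reuses its existing ID.
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def preprocess_to_disk(samples, out_path):
    """CPU-only stage: tokenize every sample once and persist the IDs as JSON Lines."""
    vocab = {}
    with open(out_path, "w", encoding="utf-8") as f:
        for text in samples:
            f.write(json.dumps({"tokens": toy_tokenize(text, vocab)}) + "\n")
    return vocab


def load_pretokenized(path):
    """GPU stage: read the cached token IDs back; no tokenizer is needed here."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["tokens"] for line in f]


if __name__ == "__main__":
    # In our setup the two stages would run in separate containers that share a volume;
    # a temporary directory plays the role of that volume in this sketch.
    with tempfile.TemporaryDirectory() as shared_volume:
        cache = Path(shared_volume) / "tokens.jsonl"
        preprocess_to_disk(["hello world hello", "world again"], cache)  # CPU job
        print(load_pretokenized(cache))  # what the fine-tuning job would consume
```

The point of the sketch is only the boundary: the first stage would run in the data initializer on CPU, and the GPU container would start from the cached token IDs instead of raw text.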

We wonder whether `torchtune` has a best practice for achieving these goals. We would appreciate any suggestions you could offer. Thanks!

Also /cc @andreyvelich @tenzen-y @astefanutti @deepanker13 @saileshd1402 @seanlaii
