
Can we decouple the data preprocessing/tokenization step from the fine-tuning phase? #2497

@Electronic-Waste

Hi @joecummings, after our discussions in kubeflow/trainer#2410 and on the Kubeflow WG Training call, we concluded that torchtune is an amazing tool for fine-tuning LLMs and decided to adopt it as the low-level runtime for Kubeflow LLM Trainer. We have already started the implementation based on torchtune. Thanks for your engagement in the discussion.


However, we want to decouple the data preprocessing and tokenization step from the main fine-tuning phase, in order to:

  • Reduce GPU time: we will wrap torchtune in a container and request GPU resources for it (GPUs are expensive and billed by usage time)
  • Integrate the data preprocessing / tokenization step with our data initializer: run these steps ahead of fine-tuning and offload them to CPU
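
A minimal sketch of the split described above, under loud assumptions: it uses only the Python standard library, `toy_tokenize` is a hypothetical stand-in for a real model tokenizer, and the function names and JSON Lines layout are illustrative only. None of this is torchtune's API. The idea is that a CPU-only step writes token IDs to shared storage, and the GPU job later reads them without tokenizing again.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one integer id per whitespace-separated word."""
    return [sum(word.encode("utf-8")) % 50_000 for word in text.split()]


def preprocess_to_disk(
    samples: Iterable[str],
    tokenize: Callable[[str], List[int]],
    out_path: Path,
    max_len: int = 512,
) -> int:
    """CPU-only step: tokenize every sample and write one JSON record per line."""
    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        for text in samples:
            tokens = tokenize(text)[:max_len]  # truncate to the maximum sequence length
            f.write(json.dumps({"tokens": tokens}) + "\n")
            count += 1
    return count


def load_pretokenized(path: Path) -> Iterator[List[int]]:
    """GPU-job step: read token ids back without invoking the tokenizer."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["tokens"]


# Stage 1 would run in a CPU-only container (for example, a data initializer);
# stage 2 would run in the GPU container. Both run here in one process for illustration.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "train.jsonl"
    written = preprocess_to_disk(
        ["hello world", "fine-tune llms on gpus"], toy_tokenize, shard
    )
    loaded = list(load_pretokenized(shard))
```

In a real setup, the output path would point at a volume shared between the two containers, and the training job would need a dataset implementation that accepts pre-tokenized records.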

Does torchtune have a best practice for achieving these goals? We would appreciate any suggestions you can offer. Thanks!

Also /cc @andreyvelich @tenzen-y @astefanutti @deepanker13 @saileshd1402 @seanlaii
