
Fix for Squad Dataset Download #13893

Merged
rhmukundan merged 20 commits into main from rhmukundan/fix-squad-dataset-download
Jun 17, 2025

Conversation


@rhmukundan rhmukundan commented Jun 11, 2025

Fix: Prevent Race Condition When Downloading SQuAD Dataset

Current Issue

When running with multiple GPUs per node and/or multiple nodes, the SQuAD dataset download can fail due to a race condition: multiple processes attempt to download the dataset simultaneously, and the download fails.

Fix

This fix ensures that the SQuAD dataset is downloaded in a separate SLURM job using only 1 node and 1 GPU per node. This prevents concurrent downloads and eliminates the race condition.

with run.Experiment(exp_name) as exp:
    if not SKIP_IMPORT:
        assert args.hf_token is not None, "HF token is required for importing checkpoint from HuggingFace"
        exp.add(*import_ckpt_experiment(executor, model(), source=f"hf://{HF_MODEL_URI}"))
        # Prepare SQuAD dataset before finetuning
        exp.add(*prepare_squad_dataset_experiment(executor, HF_MODEL_URI, seq_length=4096))

@rhmukundan rhmukundan self-assigned this Jun 11, 2025
@rhmukundan rhmukundan marked this pull request as ready for review June 11, 2025 19:00
@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from 24d8b8b to 5143a37 on June 11, 2025 19:29

guyueh1 commented Jun 12, 2025

I have a few questions:

  • Where are these added functions used?
  • Is the purpose to avoid downloading SQuAD from the network and to use cached files instead?
  • If yes, I think the existing SQuAD dataset should be able to use cached files if you properly set the HF_HOME or HF_HUB env var, but I haven't tried it recently; I can confirm later.
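As a hedged illustration of the caching suggestion above (the shared path below is a placeholder, not a NeMo default), pointing every process at one shared Hugging Face cache is done by setting the env vars before any HF library is imported:

```python
import os

# Placeholder path; substitute your cluster's shared filesystem.
SHARED_CACHE = "/shared/cache/huggingface"

# HF libraries (datasets, hub, transformers) derive their cache locations
# from HF_HOME, so setting it before any HF import makes every process
# read and write the same cache instead of re-downloading.
os.environ["HF_HOME"] = SHARED_CACHE
# Optionally narrow the datasets cache to a subdirectory.
os.environ["HF_DATASETS_CACHE"] = os.path.join(SHARED_CACHE, "datasets")
```

Note that a shared cache alone does not serialize concurrent first-time downloads; it only lets later runs reuse an already-populated cache.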


rhmukundan commented Jun 12, 2025

@guyueh1 Thanks for your comment.
It uses the same function in NeMo to download the dataset. Currently, the download runs as part of the overall job (num_nodes > 1 and num_gpus_per_node > 1), which causes the race condition that prevents the download and throws an error.
To fix this, I have created the prepare_squad_dataset and prepare_squad_dataset_experiment functions, which run a separate SLURM job (1 node, 1 GPU per node) to avoid the race condition and download the dataset successfully. The functions are essentially what the current NeMo implementation (https://github.com/NVIDIA/NeMo/blob/577b7fb77f6767e23dec2926ba9b9d2a3722713b/scripts/performance/utils.py#L523) already uses to fetch the SQuAD dataset; what has changed is how the download runs (1 node, 1 GPU per node) in a separate SLURM job, similar to the checkpoint download.

This implementation will be used in finetune recipes like the following:

with run.Experiment(exp_name) as exp:
    if not SKIP_IMPORT:
        assert args.hf_token is not None, "HF token is required for importing checkpoint from HuggingFace"
        exp.add(*import_ckpt_experiment(executor, model(), source=f"hf://{HF_MODEL_URI}"))
        # Prepare SQuAD dataset before finetuning
        exp.add(*prepare_squad_dataset_experiment(executor, HF_MODEL_URI, seq_length=4096))
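For intuition only, here is a standalone sketch of the race being avoided: when several ranks write the same dataset file concurrently, readers can observe partial files and the job errors out. A write-to-temp-then-atomic-rename pattern is one generic workaround (this is not the NeMo implementation; the function name and approach are illustrative, and the PR instead sidesteps the race by running the download in a dedicated 1-node, 1-GPU job):

```python
import os
import tempfile

def download_once(target_path, fetch):
    """Write fetch() to target_path so that concurrent callers are safe.

    Illustrative sketch only. Each process writes to a unique temp file
    and then atomically renames it into place, so readers never see a
    partially written dataset and repeated calls are harmless.
    """
    if os.path.exists(target_path):
        return target_path  # another process already finished
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target_path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(fetch())
        os.replace(tmp, target_path)  # atomic rename on POSIX
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)
    return target_path
```

Running the preparation in a single-process job, as this PR does, avoids even needing such coordination, since only one writer ever exists.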


guyueh1 commented Jun 12, 2025

@rhmukundan I see. Can you provide an example of how to use it in a run script?

@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from 577b7fb to c64ca64 on June 12, 2025 18:33
@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from b92b424 to 83c3056 on June 12, 2025 18:41
@rhmukundan

@guyueh1: I have pushed the finetuning LLAMA4 Maverick recipe (which includes the fix) and also added the fix to the LLAMA3 70B finetuning file.
cc: @ashbhandare


guyueh1 commented Jun 13, 2025

@malay-nagda can you review the added utility functions in util.py?

@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from 3ccd871 to 26bdc30 on June 14, 2025 21:25
@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from 938e3b8 to 8f49d4f on June 15, 2025 04:47
@rhmukundan rhmukundan requested a review from malay-nagda June 15, 2025 04:48
@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from bf8cfdd to 1113307 on June 16, 2025 03:49
rhmukundan and others added 8 commits June 17, 2025 20:03
@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from fa6f8c2 to 766f71f on June 17, 2025 14:34
@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from f882925 to a95713b on June 17, 2025 15:16
@malay-nagda left a comment

LGTM

@github-actions

[🤖]: Hi @rhmukundan 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc


guyueh1 commented Jun 17, 2025

@rhmukundan is this ready for merge?

@guyueh1 left a comment

LGTM

if not SKIP_IMPORT:
    assert args.hf_token is not None, "HF token is required for importing checkpoint from HuggingFace"
    exp.add(*import_ckpt_experiment(executor, model(), source=f"hf://{HF_MODEL_URI}"))
    exp.add(*prepare_squad_dataset_experiment(executor, HF_MODEL_URI, seq_length=4096))

It's best to add a command-line argument, for instance standalone_dataset_preparation, for users to indicate whether they want this or the default behavior; that argument should default to false.
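A minimal sketch of the suggested flag (the name standalone_dataset_preparation comes from this review comment; how it is wired into the recipe script is an assumption):

```python
import argparse

# Hypothetical wiring for the suggested flag; action="store_true" makes
# it default to False, so the existing in-job download path is unchanged
# unless the user opts in.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--standalone_dataset_preparation",
    action="store_true",
    help="Prepare the SQuAD dataset in a separate 1-node/1-GPU SLURM job "
    "before finetuning, instead of inside the training job.",
)
```

A recipe could then gate the exp.add(*prepare_squad_dataset_experiment(...)) call on args.standalone_dataset_preparation.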

# Set this to True if dataset is already downloaded. If set to False,
# downloaded from HuggingFace
SKIP_IMPORT = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not very clear; previously, without this PR, weren't we already downloading the dataset from HuggingFace? My impression is that it still happens somewhere in the dataset-building process, just not explicitly, right? I think the difference is that here you are separating it out as a new nemo-run experiment. With a comment like this, users would think setting it to True without a local file will error out, but in reality it won't?

Could you further explain what different things happen between False and True here?

@rhmukundan rhmukundan merged commit 6a945ab into main Jun 17, 2025
71 of 89 checks passed
@rhmukundan rhmukundan deleted the rhmukundan/fix-squad-dataset-download branch June 17, 2025 16:26
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Jul 23, 2025

* Fix for Squad Dataset Download
* Giving the option to pass the sequence length from the finetune script
* Rebase Pushing llama4 finetuning e128 script and llama3 70b finetuning to include the dataset download fix
* Finetune Llama4 Recipe with dataset download fix
* Address PR comments
* Tweaks to finetune_llama4_e128
* Addressing PR comments
* Giving an option to have either AutoTokenizer or NullTokenizer for preparing the dataset
* Fix kwargs
* User passing vocab_size while using the NullTokenizer for downloading dataset
* Adding model configs for finetune llama4
* Rebase Introducing the fix to llama3 finetuning recipes as well
* Setting default vocab_size to None in prepare_squad_dataset_experiment function
* Fix merge conflicts
* Fixing the search condition for the dataset
* Apply isort and black reformatting
* Removing NullTokenizer from Finetuning scripts
* Import cleanup
* Apply isort and black reformatting
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Aug 5, 2025
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Aug 5, 2025
nasretdinovr pushed a commit to nasretdinovr/NeMo that referenced this pull request Aug 8, 2025

4 participants