
Fix for Squad Dataset Download #13893

Merged
rhmukundan merged 20 commits into main from rhmukundan/fix-squad-dataset-download
Jun 17, 2025

Conversation


@rhmukundan rhmukundan commented Jun 11, 2025

Fix: Prevent Race Condition When Downloading SQuAD Dataset

Current Issue

When running with multiple GPUs per node and/or multiple nodes, the SQuAD dataset download can fail due to a race condition: multiple processes attempt to download the dataset simultaneously, and the download fails.

Fix

This fix ensures that the SQuAD dataset is downloaded in a separate SLURM job using only 1 node and 1 GPU per node. This prevents concurrent downloads and eliminates the race condition.

with run.Experiment(exp_name) as exp:
    if not SKIP_IMPORT:
        assert args.hf_token is not None, "HF token is required for importing checkpoint from HuggingFace"
        exp.add(*import_ckpt_experiment(executor, model(), source=f"hf://{HF_MODEL_URI}"))
        # Prepare SQuAD dataset before finetuning
        exp.add(*prepare_squad_dataset_experiment(executor, HF_MODEL_URI, seq_length=4096))

@rhmukundan rhmukundan self-assigned this Jun 11, 2025
@rhmukundan rhmukundan marked this pull request as ready for review June 11, 2025 19:00
@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from 24d8b8b to 5143a37 on June 11, 2025 19:29

guyueh1 commented Jun 12, 2025

I have a few questions:

  • Where are these added functions used?
  • Is the purpose to avoid downloading SQuAD from the network and to use cached files instead?
  • If yes, I think the existing SQuAD dataset should be able to use cached files if you properly set the HF_HOME or HF_HUB env var, but I haven't tried it recently; I can confirm later.
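As a hedged illustration of the caching suggestion above (the shared path below is a placeholder, not a NeMo default), pointing every process at one shared Hugging Face cache is done by setting the env vars before any HF library is imported:

```python
import os

# Placeholder path; substitute your cluster's shared filesystem.
SHARED_CACHE = "/shared/cache/huggingface"

# HF libraries (datasets, hub, transformers) derive their cache locations
# from HF_HOME, so setting it before any HF import makes every process
# read and write the same cache instead of re-downloading.
os.environ["HF_HOME"] = SHARED_CACHE
# Optionally narrow the datasets cache to a subdirectory.
os.environ["HF_DATASETS_CACHE"] = os.path.join(SHARED_CACHE, "datasets")
```

Note that a shared cache alone does not serialize concurrent first-time downloads; it only lets later runs reuse an already-populated cache.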


rhmukundan commented Jun 12, 2025

@guyueh1 Thanks for your comment.
It uses the same function in NeMo to download the dataset. Currently, the download runs as part of the overall job (num_nodes > 1 and num_gpus_per_node > 1), which causes the race condition that prevents the download and throws an error.
To fix this, I have created the prepare_squad_dataset and prepare_squad_dataset_experiment functions, which run a separate SLURM job (1 node, 1 GPU per node) to avoid the race condition and download the dataset successfully. The functions are essentially what the current NeMo implementation (https://github.com/NVIDIA/NeMo/blob/577b7fb77f6767e23dec2926ba9b9d2a3722713b/scripts/performance/utils.py#L523) already uses to fetch the SQuAD dataset; what has changed is how the download runs (1 node, 1 GPU per node) in a separate SLURM job, similar to the checkpoint download.

This implementation will be used in finetune recipes like the following:

with run.Experiment(exp_name) as exp:
    if not SKIP_IMPORT:
        assert args.hf_token is not None, "HF token is required for importing checkpoint from HuggingFace"
        exp.add(*import_ckpt_experiment(executor, model(), source=f"hf://{HF_MODEL_URI}"))
        # Prepare SQuAD dataset before finetuning
        exp.add(*prepare_squad_dataset_experiment(executor, HF_MODEL_URI, seq_length=4096))
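For intuition only, here is a standalone sketch of the race being avoided: when several ranks write the same dataset file concurrently, readers can observe partial files and the job errors out. A write-to-temp-then-atomic-rename pattern is one generic workaround (this is not the NeMo implementation; the function name and approach are illustrative, and the PR instead sidesteps the race by running the download in a dedicated 1-node, 1-GPU job):

```python
import os
import tempfile

def download_once(target_path, fetch):
    """Write fetch() to target_path so that concurrent callers are safe.

    Illustrative sketch only. Each process writes to a unique temp file
    and then atomically renames it into place, so readers never see a
    partially written dataset and repeated calls are harmless.
    """
    if os.path.exists(target_path):
        return target_path  # another process already finished
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target_path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(fetch())
        os.replace(tmp, target_path)  # atomic rename on POSIX
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)
    return target_path
```

Running the preparation in a single-process job, as this PR does, avoids even needing such coordination, since only one writer ever exists.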


guyueh1 commented Jun 12, 2025

@rhmukundan I see. Can you provide an example of how to use it in a run script?

@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from 577b7fb to c64ca64 on June 12, 2025 18:33
@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from b92b424 to 83c3056 on June 12, 2025 18:41
@rhmukundan

@guyueh1: I have pushed the finetuning LLAMA4 Maverick recipe (which includes the fix) and also added the fix to the LLAMA3 70B finetuning file.
cc: @ashbhandare


guyueh1 commented Jun 13, 2025

@malay-nagda can you review the added utility functions in util.py?

@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from 3ccd871 to 26bdc30 on June 14, 2025 21:25
@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from 938e3b8 to 8f49d4f on June 15, 2025 04:47
@rhmukundan rhmukundan requested a review from malay-nagda June 15, 2025 04:48
@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from bf8cfdd to 1113307 on June 16, 2025 03:49
rhmukundan and others added 8 commits June 17, 2025 20:03
@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from fa6f8c2 to 766f71f on June 17, 2025 14:34
@rhmukundan rhmukundan force-pushed the rhmukundan/fix-squad-dataset-download branch from f882925 to a95713b on June 17, 2025 15:16
@malay-nagda left a comment

LGTM

@github-actions

[🤖]: Hi @rhmukundan 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc


guyueh1 commented Jun 17, 2025

@rhmukundan is this ready for merge?

@guyueh1 left a comment

LGTM

if not SKIP_IMPORT:
    assert args.hf_token is not None, "HF token is required for importing checkpoint from HuggingFace"
    exp.add(*import_ckpt_experiment(executor, model(), source=f"hf://{HF_MODEL_URI}"))
    exp.add(*prepare_squad_dataset_experiment(executor, HF_MODEL_URI, seq_length=4096))

It's best to add a command-line argument, for instance standalone_dataset_preparation, for users to indicate whether they want this or the default behavior; that argument should default to false.
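A minimal sketch of the suggested flag (the name standalone_dataset_preparation comes from this review comment; how it is wired into the recipe script is an assumption):

```python
import argparse

# Hypothetical wiring for the suggested flag; action="store_true" makes
# it default to False, so the existing in-job download path is unchanged
# unless the user opts in.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--standalone_dataset_preparation",
    action="store_true",
    help="Prepare the SQuAD dataset in a separate 1-node/1-GPU SLURM job "
    "before finetuning, instead of inside the training job.",
)
```

A recipe could then gate the exp.add(*prepare_squad_dataset_experiment(...)) call on args.standalone_dataset_preparation.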

# Set this to True if dataset is already downloaded. If set to False,
# downloaded from HuggingFace
SKIP_IMPORT = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not very clear; previously, without this PR, weren't we already downloading the dataset from HuggingFace? My impression is that it still happens somewhere in the dataset-building process, just not explicitly, right? I think the difference is that here you are separating it out as a new nemo-run experiment. With a comment like this, users would think setting it to True without a local file will error out, but in reality it won't?

Could you further explain what different things happen between False and True here?

@rhmukundan rhmukundan merged commit 6a945ab into main Jun 17, 2025
71 of 89 checks passed
@rhmukundan rhmukundan deleted the rhmukundan/fix-squad-dataset-download branch June 17, 2025 16:26
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Jul 23, 2025

* Fix for Squad Dataset Download
* Giving the option to pass the sequence length from the finetune script
* Rebase Pushing llama4 finetuning e128 script and llama3 70b finetuning to include the dataset download fix
* Finetune Llama4 Recipe with dataset download fix
* Address PR comments
* Tweaks to finetune_llama4_e128
* Addressing PR comments
* Giving an option to have either AutoTokenizer or NullTokenizer for preparing the dataset
* Fix kwargs
* User passing vocab_size while using the NullTokenizer for downloading dataset
* Adding model configs for finetune llama4
* Rebase Introducing the fix to llama3 finetuning recipes as well
* Setting default vocab_size to None in prepare_squad_dataset_experiment function
* Fix merge conflicts
* Fixing the search condition for the dataset
* Apply isort and black reformatting
* Removing NullTokenizer from Finetuning scripts
* Import cleanup
* Apply isort and black reformatting
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Aug 5, 2025
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Aug 5, 2025
nasretdinovr pushed a commit to nasretdinovr/NeMo that referenced this pull request Aug 8, 2025

4 participants