Conversation
I have a few questions
@guyueh1 Thanks for your comment. This implementation will be used in finetune recipes like the following:
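For example (a minimal sketch; `exp`, `executor`, and `HF_MODEL_URI` are assumed to be set up by the surrounding finetune recipe, and the helper name matches the snippet reviewed below):

```python
# Sketch only: exp, executor, and HF_MODEL_URI come from the surrounding
# finetune recipe; prepare_squad_dataset_experiment is the utility in question.
exp.add(*prepare_squad_dataset_experiment(executor, HF_MODEL_URI, seq_length=4096))
```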
@rhmukundan I see. Can you provide an example of how to use it in a run script?
@guyueh1: I have pushed the Finetuning LLAMA4 Maverick recipe (which includes the fix) and also added the fix to the LLAMA3 70b finetuning file.
@malay-nagda can you review the added utility functions in |
[🤖]: Hi @rhmukundan 👋, We wanted to let you know that a CI/CD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
@rhmukundan is this ready for merge?
    if not SKIP_IMPORT:
        assert args.hf_token is not None, "HF token is required for importing checkpoint from HuggingFace"
        exp.add(*import_ckpt_experiment(executor, model(), source=f"hf://{HF_MODEL_URI}"))
    exp.add(*prepare_squad_dataset_experiment(executor, HF_MODEL_URI, seq_length=4096))
It's best to add a command-line argument, for instance standalone_dataset_preparation, so users can indicate whether they want this or the default behavior; that argument should default to False.
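For instance (a sketch, not the merged implementation; `exp`, `executor`, and `HF_MODEL_URI` are assumed to be defined as in the snippet above):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical opt-in flag suggested above; store_true makes it default to False.
parser.add_argument(
    "--standalone_dataset_preparation",
    action="store_true",
    help="Prepare the SQuAD dataset in a separate single-GPU job before training.",
)
args = parser.parse_args()

# Only run the standalone preparation task when the user opts in;
# otherwise fall back to the default (implicit) download path.
if args.standalone_dataset_preparation:
    exp.add(*prepare_squad_dataset_experiment(executor, HF_MODEL_URI, seq_length=4096))
```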
    # Set this to True if dataset is already downloaded. If set to False,
    # downloaded from HuggingFace
    SKIP_IMPORT = False
This is not very clear. Previously, without this PR, were we not downloading the dataset from HuggingFace? My impression is that it was still done somewhere in the dataset-building process, just not explicitly, right? I think the difference is that here you are separating it out as a new nemo-run experiment. With a comment like this, users would think that setting it to True without a local file will error out, but in reality it won't?
Could you further explain what happens differently between False and True here?
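In other words, my current reading is something like this (a sketch only; the flag and helper names come from the snippets above, and the semantics are exactly what I'm asking you to confirm):

```python
# Hedged reading of the flag, to be confirmed by the author:
if not SKIP_IMPORT:
    # False (default): run dataset preparation as an explicit, standalone
    # nemo-run task, so a single process downloads SQuAD up front.
    exp.add(*prepare_squad_dataset_experiment(executor, HF_MODEL_URI, seq_length=4096))
else:
    # True: skip the standalone step. The dataset is expected to already be
    # on disk; if it is not, it would presumably still be fetched implicitly
    # inside the dataset-building code, as before this PR (hence no hard error).
    pass
```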
* Fix for Squad Dataset Download
* Giving the option to pass the sequence length from the finetune script
* Rebase: pushing llama4 finetuning e128 script and llama3 70b finetuning to include the dataset download fix
* Finetune Llama4 Recipe with dataset download fix
* Address PR comments
* Tweaks to finetune_llama4_e128
* Addressing PR comments
* Giving an option to have either AutoTokenizer or NullTokenizer for preparing the dataset
* Fix kwargs
* User passing vocab_size while using the NullTokenizer for downloading dataset
* Adding model configs for finetune llama4
* Rebase: introducing the fix to llama3 finetuning recipes as well
* Setting default vocab_size to None in prepare_squad_dataset_experiment function
* Fix merge conflicts
* Fixing the search condition for the dataset
* Apply isort and black reformatting
* Removing NullTokenizer from Finetuning scripts
* Import cleanup
* Apply isort and black reformatting

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: rhmukundan <rhmukundan@users.noreply.github.com>
Co-authored-by: rhmukundan <rhmukundan@users.noreply.github.com>
Fix: Prevent Race Condition When Downloading SQuAD Dataset
Current Issue
When running with multiple GPUs per node and/or multiple nodes, downloading the SQuAD dataset can fail due to a race condition: multiple processes attempt to download the dataset simultaneously, causing the download to fail.
Fix
This fix ensures that the SQuAD dataset is downloaded in a separate SLURM job using only 1 node and 1 GPU per node. This prevents concurrent downloads and eliminates the race condition.
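A minimal sketch of the approach (illustrative only; the executor attribute names assume NeMo-Run's SlurmExecutor, and the clone-and-restrict pattern is an assumption rather than the exact merged code):

```python
import nemo_run as run

def dataset_prep_executor(train_executor: run.SlurmExecutor) -> run.SlurmExecutor:
    """Derive a 1-node, 1-task executor so exactly one process downloads SQuAD."""
    executor = train_executor.clone()  # keep account/partition/container settings
    executor.nodes = 1
    executor.ntasks_per_node = 1
    return executor

# The preparation task would run on the restricted executor before the
# multi-node finetuning task is added to the experiment, e.g.:
# exp.add(*prepare_squad_dataset_experiment(
#     dataset_prep_executor(executor), HF_MODEL_URI, seq_length=4096))
```

Running the preparation task on this restricted executor before the multi-node finetuning task serializes the download and removes the race.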