force model preparation on eval if torch.compile enabled #1993
Luca-Calabria wants to merge 9 commits into huggingface:main
Conversation
LGTM. I tested the following command as well, and the change makes sense.

Converted to draft because the fix fails when running torch.compile + --use_regional_compilation.

Example of a failing command line: PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \

@Luca-Calabria Sync this PR with main, please.

The last fix also works with --use_regional_compilation and doesn't produce regressions between the "train+eval" and "just eval" runs.

I do not understand how this fix helps.
Your last example runs successfully with and without the fix:
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16 \
--use_regional_compilation

On main commit 13df65f:
***** eval metrics *****
epoch = 2.0
eval_exact_match = 0.3122
eval_f1 = 8.2287
eval_graph_compliation_duration = 1.2816
eval_runtime = 0:00:15.42
eval_samples = 10784
eval_samples_per_second = 699.327
eval_steps_per_second = 87.416
max_memory_allocated (GB) = 6.02
memory_allocated (GB) = 3.84
total_memory_available (GB) = 94.62

With this PR:
***** eval metrics *****
epoch = 2.0
eval_exact_match = 0.3122
eval_f1 = 8.2287
eval_graph_compliation_duration = 1.2824
eval_runtime = 0:00:15.42
eval_samples = 10784
eval_samples_per_second = 699.186
eval_steps_per_second = 87.398
max_memory_allocated (GB) = 6.03
memory_allocated (GB) = 3.84
total_memory_available (GB) = 94.62
This PR wants to fix the regression with torch.compile without --use_regional_compilation when you run the two example commands from the description ("Train+Eval" and "just Eval"). BUT my first fix broke the regional compilation feature. The new fix keeps working without regional compilation and removes the regression.
@Luca-Calabria I understand the regression, but I do not think your fix is correct.
I think if you want to use a TorchDynamo-compiled model during evaluation, you would need to manually compile the model using torch.compile() before passing it to the Trainer.
The last commit fixes the difference in self.model between runs with and without regional compilation enabled. Now both cases modify a copy of self.model as expected.
This portion of code in the accelerator (optimum/habana/accelerate/accelerator.py on main, at if self.use_regional_compilation:) shows two different behaviors. This is why this line in evaluation_loop (optimum/habana/transformers/trainer.py on main) behaves differently. A solution for that could be something like this: if self.use_regional_compilation: …
@IlyasMoutawwakil I see you fixed a similar issue about regional_compilation in #2014. Is your fix just for deepspeed initialization, or does it also work for single-card training?
The last commit removed the memory inefficiency.
```diff
  model = self._wrap_model(self.model, training=False, dataloader=dataloader)

- if len(self.accelerator._models) == 0 and model is self.model:
+ if (len(self.accelerator._models) == 0 or self.args.torch_compile) and model is self.model:
```
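The difference between the two guards can be sketched with a small stand-in; the function names and boolean parameters below are illustrative, not from the codebase:

```python
def should_prepare_old(num_prepared_models: int, model_is_self_model: bool) -> bool:
    """Original guard: prepare (and compile) only if nothing was prepared yet."""
    return num_prepared_models == 0 and model_is_self_model

def should_prepare_new(num_prepared_models: int, torch_compile: bool,
                       model_is_self_model: bool) -> bool:
    """Guard from the diff: also re-prepare whenever torch.compile is enabled."""
    return (num_prepared_models == 0 or torch_compile) and model_is_self_model

# "just eval": accelerator._models is empty, so both guards prepare the model
assert should_prepare_old(0, True) and should_prepare_new(0, True, True)

# "train+eval": training already populated accelerator._models, so the old
# guard skips preparation and evaluation runs an uncompiled model...
assert not should_prepare_old(1, True)
# ...while the proposed guard still prepares it when torch.compile is on
assert should_prepare_new(1, True, True)
```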
I don't understand the point here: if len(self.accelerator._models) == 0, the model will be prepared.
Why re-prepare it if torch.compile is enabled?
It is explained in the description of this PR: when you run the "train+eval" example, after the train step the condition len(self.accelerator._models) == 0 is false because the accelerator has its _models list populated with the current model; this means that when evaluation_loop is called, the model preparation is skipped, but the model actually is not compiled.
When you run the "just eval" example it works, because the accelerator has an empty _models list.
Should I force a cleanup of the accelerator._models list after the train step, maybe?
> this means that when evaluation_loop is called, the model preparation is skipped but the model actually is not compiled.

I mean, yes, that's to be expected: compilation should only be done once, and the returned model will stay compiled for the rest of its life (except at checkpoint save time, where it is unwrapped to be saved).
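The compile-once behavior described here can be mimicked with a minimal stand-in for the accelerator (the class and its counter are illustrative, not real Accelerate attributes):

```python
class FakeAccelerator:
    """Toy stand-in: prepare a model once, then hand back the same object."""
    def __init__(self):
        self._models = []        # mirrors accelerator._models
        self.compile_calls = 0   # illustrative counter for the expensive step

    def prepare(self, model):
        if any(m is model for m in self._models):
            return model         # already prepared: reuse it as-is
        self.compile_calls += 1  # stands in for torch.compile / wrapping
        self._models.append(model)
        return model

acc = FakeAccelerator()
model = object()
trained = acc.prepare(model)      # prepared (and "compiled") during training
evaluated = acc.prepare(trained)  # evaluation reuses the same prepared model
assert trained is evaluated and acc.compile_calls == 1
```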
Sorry if it's self-evident, but I'm asking because, if what you're saying is true, I'm wondering why it wasn't flagged in transformers before 😅 it sounds like a universal issue.
@yafshar this is different from what you described as the expected flow. Maybe I misunderstood?
I removed this solution anyway. Please take a look at the new, less intrusive, solution I committed.
It still is intrusive 😅 for example, FSDP + torch.compile will result in the evaluation loop running with an FSDP-wrapped model, which I believe doesn't even have a .generate method 🤔
Could you please share in which part of the code you see this?
The line self.model = model propagates the effects of accelerator.prepare, which wraps the model in all kinds of distributed-training wrappers. Using this wrapped model during evaluation will break it in some scenarios.
It is a single small line change, but the impacts are gigantic.
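The FSDP concern can be illustrated with a toy wrapper: distributed-training wrappers typically delegate forward but not model-specific methods such as .generate (these classes are illustrative stand-ins, not the real FSDP wrapper):

```python
class TinyModel:
    def forward(self, x):
        return 2 * x
    def generate(self, x):           # model-specific method used at eval time
        return [x, 2 * x]

class DistributedWrapper:
    """Stand-in for a wrapper like FSDP: exposes forward, not generate."""
    def __init__(self, module):
        self.module = module
    def forward(self, x):
        return self.module.forward(x)

wrapped = DistributedWrapper(TinyModel())
assert wrapped.forward(3) == 6            # the training path still works
assert not hasattr(wrapped, "generate")   # eval code calling .generate breaks
```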
I updated the code to set self.model = model when running with torch.compile(). Now, during evaluation, it will use the model compiled during training.
@IlyasMoutawwakil I double checked the transformers repo on NVIDIA here https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L4394 and it seems to work when running torch.compile(), reporting the same eval_samples_per_second value in both the "train+eval" and "just eval" cases. The odd thing is that model, in the train+eval case, is not the compiled one (the same for self.model), but self.model_wrapped is compiled and seems not to be used by self.prediction_step. Running the "just eval" case, it prepares the model here https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L4394 and model is the compiled one. It is not clear how it can produce the same performance. NB: this result has been obtained using --torch_compile_backend inductor.
Hi @yafshar @IlyasMoutawwakil, I double checked how the original transformers repo works on NVIDIA and I found the same Gaudi behavior running with
The mismatch between "train+eval" and "just eval" eval metrics can also be reproduced with the cudagraph backend. Given this result, I think we can close (not merge) this PR, as no further action is required on our end. What do you think?
@Luca-Calabria I agree. I do not think we need to do anything else here. Please close this PR. Thanks!



Running the training example in torch.compile() mode, the evaluation step skips the model compilation, reporting wrong metrics.
To fix it, we force a call to accelerator.prepare_model for both the training and evaluation steps, allowing the model compilation.
EDIT 06/06:
The new proposed solution is to update trainer.model with the compiled model when torch.compile is enabled (and regional_compilation is disabled).
Current main already updates trainer.model to the compiled model when torch.compile() + regional_compilation is enabled.
The fix has also been tested in Lazy and Eager modes to check that they still work as expected.
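The EDIT 06/06 behavior can be sketched as a simple guard deciding when trainer.model is replaced by the compiled model (the function and parameter names are illustrative, not from the codebase):

```python
def pick_trainer_model(current_model, compiled_model,
                       torch_compile: bool, regional_compilation: bool):
    """Replace trainer.model with the compiled model only for plain torch.compile;
    with regional compilation enabled, main already handles the update."""
    if torch_compile and not regional_compilation:
        return compiled_model
    return current_model

base, compiled = object(), object()
assert pick_trainer_model(base, compiled, True, False) is compiled   # this PR's fix
assert pick_trainer_model(base, compiled, True, True) is base        # handled on main
assert pick_trainer_model(base, compiled, False, False) is base      # no torch.compile
```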
System setup: Gaudi3 1HPU v1.22.0-279
install
git clone https://github.com/huggingface/optimum-habana
cd optimum-habana
pip install -e .
cd examples/question-answering/
pip install -r requirements.txt
commandline
torch.compile
=== Train+Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16
=== just Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16
Lazy Mode
=== Train+Eval example
PT_HPU_LAZY_MODE=1 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad/ \
--use_habana \
--use_lazy_mode true \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16
=== Just Eval example
PT_HPU_LAZY_MODE=1 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad/ \
--use_habana \
--use_lazy_mode true \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16
Eager Mode
=== Train+Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad2/ \
--use_habana \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16
=== Just Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad2/ \
--use_habana \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16
Results
Train+Eval

Just Eval
