
force model preparation on eval if torch.compile enabled#1993

Closed
Luca-Calabria wants to merge 9 commits into huggingface:main from Luca-Calabria:lcalabri/train_and_eval_must_compile_model

Conversation

@Luca-Calabria
Contributor

@Luca-Calabria Luca-Calabria commented May 22, 2025

When running the training example in torch.compile() mode, the evaluation step skips model compilation and reports wrong metrics.
To fix this, we force a call to accelerator.prepare_model for both the training and evaluation steps, allowing the model to be compiled.

EDIT 06/06:
The new proposed solution is to update trainer.model with the compiled model when torch.compile is enabled (and regional_compilation is disabled).
Current main already updates trainer.model to the compiled model when torch.compile() + regional_compilation is enabled.

The fix has also been tested in Lazy and Eager modes to check that they still work as expected.

System setup: Gaudi3 1HPU v1.22.0-279

install

git clone https://github.com/huggingface/optimum-habana
cd optimum-habana
pip install -e .
cd examples/question-answering/
pip install -r requirements.txt

commandline

torch.compile

=== Train+Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

=== just Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

Lazy Mode

=== Train+Eval example
PT_HPU_LAZY_MODE=1 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad/ \
--use_habana \
--use_lazy_mode true \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

=== Just Eval example
PT_HPU_LAZY_MODE=1 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad/ \
--use_habana \
--use_lazy_mode true \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

Eager Mode

=== Train+Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad2/ \
--use_habana \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

=== Just Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad2/ \
--use_habana \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

Results

Train+Eval
[screenshot of eval metrics]

Just Eval
[screenshot of eval metrics]

@Luca-Calabria Luca-Calabria requested a review from regisss as a code owner May 22, 2025 09:57
@Luca-Calabria Luca-Calabria marked this pull request as draft May 22, 2025 10:36
@Luca-Calabria Luca-Calabria marked this pull request as ready for review May 23, 2025 08:52
@malkomes
Contributor

malkomes commented Jun 4, 2025

LGTM. I tested the following command as well:

PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path meta-llama/Llama-2-7b-chat-hf \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad_output/ \
--use_habana \
--use_lazy_mode false \
--torch_compile_backend hpu_backend \
--torch_compile \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

and the change makes sense.

@Luca-Calabria Luca-Calabria marked this pull request as draft June 4, 2025 14:00
@Luca-Calabria
Contributor Author

Converted to draft because the fix fails when running torch.compile + --use_regional_compilation.

@Luca-Calabria
Contributor Author

Example of a failing command line:

PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16 \
--use_regional_compilation

@yafshar
Contributor

yafshar commented Jun 4, 2025

@Luca-Calabria Sync this PR with main, please.

@Luca-Calabria
Contributor Author

The last fix also works for --use_regional_compilation and doesn't produce regressions between the "train+eval" and "just eval" runs.

@yafshar
Contributor

yafshar commented Jun 4, 2025

The last fix also works for --use_regional_compilation and doesn't produce regressions between the "train+eval" and "just eval" runs.

I do not understand how this fix helps.
self.model and self.model_wrapped are the same for your last failing command line example, and the condition len(self.accelerator._models) == 0 and model is self.model is False as well, since len(self.accelerator._models) > 0 here.

@yafshar
Contributor

yafshar commented Jun 4, 2025

Your last example runs successfully with and without the fix:

>>> PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16 \
--use_regional_compilation

on main commit 13df65f

***** eval metrics *****
  epoch                           =        2.0
  eval_exact_match                =     0.3122
  eval_f1                         =     8.2287
  eval_graph_compliation_duration =     1.2816
  eval_runtime                    = 0:00:15.42
  eval_samples                    =      10784
  eval_samples_per_second         =    699.327
  eval_steps_per_second           =     87.416
  max_memory_allocated (GB)       =       6.02
  memory_allocated (GB)           =       3.84
  total_memory_available (GB)     =      94.62

with this PR

***** eval metrics *****
  epoch                           =        2.0
  eval_exact_match                =     0.3122
  eval_f1                         =     8.2287
  eval_graph_compliation_duration =     1.2824
  eval_runtime                    = 0:00:15.42
  eval_samples                    =      10784
  eval_samples_per_second         =    699.186
  eval_steps_per_second           =     87.398
  max_memory_allocated (GB)       =       6.03
  memory_allocated (GB)           =       3.84
  total_memory_available (GB)     =      94.62

@Luca-Calabria
Contributor Author

This PR aims to fix the regression with torch.compile without --use_regional_compilation when you run these two commands:
=== Train+Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

=== just Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

BUT my first fix broke the regional compilation feature. The new fix works both with and without regional compilation and removes the regression.

@yafshar
Contributor

yafshar commented Jun 4, 2025

@Luca-Calabria I understand the regression, but I do not think your fix is correct.

@yafshar
Contributor

yafshar commented Jun 4, 2025

I think if you want to use a TorchDynamo-compiled model during evaluation, you need to manually compile the model using torch.compile() before passing it to the Trainer.

@Luca-Calabria
Contributor Author

The last commit fixes the difference in how self.model is handled with and without regional compilation. Now both cases modify a copy of self.model as expected.
It also forces model preparation in evaluation_loop when torch.compile() is enabled.

@Luca-Calabria
Contributor Author

This portion of code in optimum/habana/accelerate/accelerator.py on main:

if self.use_regional_compilation:
    compile_regions(model, compile_kwargs)
else:
    model = torch.compile(model, **compile_kwargs)

shows two different behaviors:
compile_regions modifies the trainer's self.model instance directly, in place;
torch.compile() returns a compiled wrapper, so it modifies only a copy of self.model.

This is why, in evaluation_loop, this line in optimum/habana/transformers/trainer.py on main
works when run with --use_regional_compilation (the model is already prepared, so preparation can be skipped) but produces a regression when torch.compile() runs without --use_regional_compilation (model preparation is skipped even though the model was never compiled).
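The in-place vs. returned-copy divergence described above can be sketched with plain Python mocks (no torch dependency; Region, CompiledWrapper, and the two compile_* functions are hypothetical stand-ins, not the real optimum-habana APIs):

```python
class Region:
    """Stands in for a submodule that regional compilation rewrites in place."""
    def __init__(self):
        self.compiled = False

class CompiledWrapper:
    """Stands in for the OptimizedModule object that torch.compile returns."""
    def __init__(self, orig_mod):
        self._orig_mod = orig_mod

def compile_regions_mock(model):
    # Regional compilation: mutates the model object the trainer already holds.
    for region in model:
        region.compiled = True

def compile_whole_mock(model):
    # Whole-model compile: returns a new wrapper; the trainer's reference to
    # the original model is untouched unless it is explicitly reassigned.
    return CompiledWrapper(model)

trainer_model = [Region(), Region()]
compile_regions_mock(trainer_model)
assert all(r.compiled for r in trainer_model)   # trainer sees the effect

trainer_model2 = [Region()]
wrapped = compile_whole_mock(trainer_model2)
assert not trainer_model2[0].compiled           # trainer's reference is NOT compiled
assert wrapped._orig_mod is trainer_model2      # compilation lives only in the wrapper
```

This is the same asymmetry the comment describes: after whole-model torch.compile, the compiled object exists only in the returned wrapper, so any code still holding the original reference (here, the trainer's self.model) never sees it.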

A solution for that could be something like this:

if self.use_regional_compilation:
    import copy
    model_compiled = copy.deepcopy(model)
    compile_regions(model_compiled, compile_kwargs)
else:
    model_compiled = torch.compile(model, **compile_kwargs)

@IlyasMoutawwakil I see you fixed a similar issue about regional_compilation here #2014. Is your fix just for DeepSpeed initialization, or does it also work for single-card training?

@Luca-Calabria
Contributor Author

The last commit removed the memory inefficiency.

@Luca-Calabria Luca-Calabria marked this pull request as ready for review June 6, 2025 09:58
model = self._wrap_model(self.model, training=False, dataloader=dataloader)

-    if len(self.accelerator._models) == 0 and model is self.model:
+    if (len(self.accelerator._models) == 0 or self.args.torch_compile) and model is self.model:
Member

I don't understand the point here: if len(self.accelerator._models) == 0, the model will be prepared.
Why re-prepare it if torch.compile is enabled?

Contributor Author

@Luca-Calabria Luca-Calabria Jun 6, 2025

It is explained in the description of this PR: when you run the "train+eval" example, after the train step the condition len(self.accelerator._models) == 0 is false because the accelerator's _models list is populated with the current model; this means that when evaluation_loop is called, model preparation is skipped even though the model is not actually compiled.
When you run the "just eval" example it works because the accelerator's _models list is empty.

Should I force a cleanup of the accelerator._models list after the train step instead?
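The guard behavior described above can be illustrated with a minimal mock (MockAccelerator and MockTrainer are simplified stand-ins, not the real accelerate/transformers classes):

```python
class MockAccelerator:
    def __init__(self):
        self._models = []

    def prepare(self, model):
        # The real accelerator.prepare would also wrap/compile the model here.
        self._models.append(model)
        return model

class MockTrainer:
    def __init__(self):
        self.accelerator = MockAccelerator()
        self.model = object()

    def evaluation_loop_prepares(self, model):
        # Mirrors the original guard: only prepare if nothing was prepared yet.
        return len(self.accelerator._models) == 0 and model is self.model

# "just eval": nothing prepared yet -> guard is True, so the model gets
# prepared (and, with torch.compile enabled, compiled)
t = MockTrainer()
assert t.evaluation_loop_prepares(t.model) is True

# "train+eval": training already called prepare -> guard is False, so
# evaluation skips preparation, which is the skip described in this thread
t.accelerator.prepare(t.model)
assert t.evaluation_loop_prepares(t.model) is False
```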

Member

this means that when evaluation_loop is called, the model preparation is skipped but the model actually is not compiled.

I mean yes, that's to be expected; compilation should only be done once, and the returned model will stay compiled for the rest of its life (except at checkpoint save time, when it is unwrapped to be saved).

Member

Sorry if it's self-evident, but I'm asking because if what you're saying is true, I'm wondering why it wasn't flagged in transformers before 😅 it sounds like a universal issue.

Contributor Author

@yafshar this is different from what you described as expected flow. Maybe I misunderstood?

Contributor Author

I removed this solution anyway. Please take a look at the new, less intrusive solution I committed.

Member

It still is intrusive 😅 for example, FSDP + torch.compile will result in the evaluation loop running with an FSDP-wrapped model, which I believe doesn't even have a .generate method 🤔

Contributor Author

Could you please share which part of the code shows this?

Member

The line self.model = model propagates the effects of accelerator.prepare, which wraps the model in all kinds of distributed training wrappers. Using this wrapped model during evaluation will break it in some scenarios.
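A toy illustration of the concern (Model and DistributedWrapper are hypothetical classes, not the real FSDP API): a distributed wrapper can expose forward-style calls without exposing the underlying model's generate, so evaluation code that calls generate on the wrapped model breaks until the model is unwrapped:

```python
class Model:
    def generate(self):
        return "generated text"

class DistributedWrapper:
    """Stands in for an FSDP/DDP-style wrapper that forwards calls to the
    wrapped module but does not expose its generate method."""
    def __init__(self, module):
        self.module = module

    def __call__(self, *args):
        # Simplified forward pass; real wrappers shard/sync parameters here.
        return self.module

wrapped = DistributedWrapper(Model())
assert not hasattr(wrapped, "generate")               # eval via .generate would fail
assert wrapped.module.generate() == "generated text"  # unwrapping recovers it
```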

Member

@IlyasMoutawwakil IlyasMoutawwakil Jun 6, 2025

it is a single small line change but the impacts are gigantic

@Luca-Calabria
Contributor Author

I updated the code to set self.model = model when running with torch.compile(). Now evaluation will use the model compiled during training.

@Luca-Calabria
Contributor Author

Luca-Calabria commented Jun 17, 2025

@IlyasMoutawwakil I double-checked the transformers repo on NVIDIA here https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L4394 and torch.compile() seems to work there, reporting the same eval_samples_per_second value for both the "train+eval" and "just eval" cases. The odd thing is that, in the train+eval case, model is not the compiled one:
model = BertForQuestionAnswering( (bert): BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(30522, 1024, padding_idx=0) (position_embeddings): Embedding(512, 1024) (token_type_embeddings): Embedding(2, 1024) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( (layer): ModuleList( (0-23): 24 x BertLayer( (attention): BertAttention( (self): BertSdpaSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) ) (qa_outputs): Linear(in_features=1024, out_features=2, bias=True)

(the same for self.model), but self.model_wrapped is compiled and does not seem to be used by self.prediction_step. When running the "just eval" case, the model is prepared here https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L4394 and model is the compiled one:
model = OptimizedModule( (_orig_mod): BertForQuestionAnswering( (bert): BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(30522, 1024, padding_idx=0) (position_embeddings): Embedding(512, 1024) (token_type_embeddings): Embedding(2, 1024) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( (layer): ModuleList( (0-23): 24 x BertLayer( (attention): BertAttention( (self): BertSdpaSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) ) (qa_outputs): Linear(in_features=1024, out_features=2, bias=True) ) )

It is not clear how this can produce the same performance.

NB: this result was obtained using --torch_compile_backend inductor
https://docs.pytorch.org/docs/stable/torch.compiler.html

@Luca-Calabria
Contributor Author

Hi @yafshar @IlyasMoutawwakil, I double-checked how the original transformers repo works on NVIDIA and found the same behavior as on Gaudi:
[screenshot of the command and results]

The mismatch between "train+eval" and "just eval" eval metrics can also be reproduced with the cudagraph backend:
[screenshots of the two runs]

After this result, I think we can close this PR without merging, as no further action is required on our end. What do you think?

@yafshar
Contributor

yafshar commented Jul 3, 2025

@Luca-Calabria I agree. I do not think we need to do anything else here. Please close this PR. Thanks
