
force model preparation on eval if torch.compile enabled#1993

Closed
Luca-Calabria wants to merge 9 commits into huggingface:main from Luca-Calabria:lcalabri/train_and_eval_must_compile_model

Conversation

@Luca-Calabria
Contributor

@Luca-Calabria Luca-Calabria commented May 22, 2025

When running the training example in torch.compile() mode, the evaluation step skips model compilation and reports wrong metrics.
To fix this, we force a call to accelerator.prepare_model for both the training and evaluation steps, allowing the model to be compiled.

EDIT 06/06:
The new proposed solution is to update trainer.model with the compiled model when torch.compile is enabled (and regional_compilation is disabled).
Current main already updates trainer.model to the compiled model when torch.compile() + regional_compilation is enabled.

The fix has also been tested in Lazy and Eager modes to check that they still work as expected.

System setup: Gaudi3 1HPU v1.22.0-279

install

git clone https://github.com/huggingface/optimum-habana
cd optimum-habana
pip install -e .
cd examples/question-answering/
pip install -r requirements.txt

commandline

torch.compile

=== Train+Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

=== just Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

Lazy Mode

=== Train+Eval example
PT_HPU_LAZY_MODE=1 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad/ \
--use_habana \
--use_lazy_mode true \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

=== Just Eval example
PT_HPU_LAZY_MODE=1 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad/ \
--use_habana \
--use_lazy_mode true \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

Eager Mode

=== Train+Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad2/ \
--use_habana \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

=== Just Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad2/ \
--use_habana \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

Results

Train+Eval
[screenshot of eval metrics]

Just Eval
[screenshot of eval metrics]

@Luca-Calabria Luca-Calabria requested a review from regisss as a code owner May 22, 2025 09:57
@Luca-Calabria Luca-Calabria marked this pull request as draft May 22, 2025 10:36
@Luca-Calabria Luca-Calabria marked this pull request as ready for review May 23, 2025 08:52
@malkomes
Contributor

malkomes commented Jun 4, 2025

LGTM. I tested the following command as well:

PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path meta-llama/Llama-2-7b-chat-hf \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad_output/ \
--use_habana \
--use_lazy_mode false \
--torch_compile_backend hpu_backend \
--torch_compile \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

and the change makes sense.

@Luca-Calabria Luca-Calabria marked this pull request as draft June 4, 2025 14:00
@Luca-Calabria
Contributor Author

Converted to draft because the fix fails when running torch.compile + --use_regional_compilation.

@Luca-Calabria
Contributor Author

Example of a failing command line:

PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16 \
--use_regional_compilation

@yafshar
Contributor

yafshar commented Jun 4, 2025

@Luca-Calabria Sync this PR with main, please.

@Luca-Calabria
Contributor Author

The last fix also works for --use_regional_compilation and doesn't produce regressions between the "train+eval" and "just eval" runs.

@yafshar
Contributor

yafshar commented Jun 4, 2025

The last fix also works for --use_regional_compilation and doesn't produce regressions between the "train+eval" and "just eval" runs.

I do not understand how this fix helps.
self.model and self.model_wrapped are the same for your last failing command line example, and the condition len(self.accelerator._models) == 0 and model is self.model is False as well, since len(self.accelerator._models) > 0 here.

@yafshar
Contributor

yafshar commented Jun 4, 2025

Your last example runs successfully with and without the fix:

>>> PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16 \
--use_regional_compilation

on main commit 13df65f

***** eval metrics *****
  epoch                           =        2.0
  eval_exact_match                =     0.3122
  eval_f1                         =     8.2287
  eval_graph_compliation_duration =     1.2816
  eval_runtime                    = 0:00:15.42
  eval_samples                    =      10784
  eval_samples_per_second         =    699.327
  eval_steps_per_second           =     87.416
  max_memory_allocated (GB)       =       6.02
  memory_allocated (GB)           =       3.84
  total_memory_available (GB)     =      94.62

with this PR

***** eval metrics *****
  epoch                           =        2.0
  eval_exact_match                =     0.3122
  eval_f1                         =     8.2287
  eval_graph_compliation_duration =     1.2824
  eval_runtime                    = 0:00:15.42
  eval_samples                    =      10784
  eval_samples_per_second         =    699.186
  eval_steps_per_second           =     87.398
  max_memory_allocated (GB)       =       6.03
  memory_allocated (GB)           =       3.84
  total_memory_available (GB)     =      94.62

@Luca-Calabria
Contributor Author

This PR aims to fix the regression with torch.compile without --use_regional_compilation when you run these two commands:
=== Train+Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

=== just Eval example
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 python run_qa.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_eval \
--per_device_eval_batch_size 8 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/squad1/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16 \
--sdp_on_bf16

BUT my first fix broke the regional compilation feature. The new fix works both with and without regional compilation and removes the regression.

@yafshar
Contributor

yafshar commented Jun 4, 2025

@Luca-Calabria I understand the regression, but I do not think your fix is correct.

@yafshar
Contributor

yafshar commented Jun 4, 2025

I think if you want to use a TorchDynamo-compiled model during evaluation, you need to manually compile the model using torch.compile() before passing it to the Trainer.

@Luca-Calabria
Contributor Author

The last commit fixes the difference in how self.model is handled with and without regional compilation. Now both cases modify a copy of self.model as expected.
It also forces model preparation in evaluation_loop when torch.compile() is enabled.

@Luca-Calabria
Contributor Author

This portion of code in optimum/habana/accelerate/accelerator.py on main:

if self.use_regional_compilation:
    compile_regions(model, compile_kwargs)
else:
    model = torch.compile(model, **compile_kwargs)

shows two different behaviors:
compile_regions modifies the trainer's self.model instance directly, in place;
torch.compile() returns a compiled wrapper, so it modifies only a copy of self.model.

This is why, in evaluation_loop, this line in optimum/habana/transformers/trainer.py on main
works when run with --use_regional_compilation (the model is already prepared, so preparation can be skipped) but produces a regression when torch.compile() runs without --use_regional_compilation (model preparation is skipped even though the model was never compiled).
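The in-place vs. returned-copy divergence described above can be sketched with plain Python mocks (no torch dependency; Region, CompiledWrapper, and the two compile_* functions are hypothetical stand-ins, not the real optimum-habana APIs):

```python
class Region:
    """Stands in for a submodule that regional compilation rewrites in place."""
    def __init__(self):
        self.compiled = False

class CompiledWrapper:
    """Stands in for the OptimizedModule object that torch.compile returns."""
    def __init__(self, orig_mod):
        self._orig_mod = orig_mod

def compile_regions_mock(model):
    # Regional compilation: mutates the model object the trainer already holds.
    for region in model:
        region.compiled = True

def compile_whole_mock(model):
    # Whole-model compile: returns a new wrapper; the trainer's reference to
    # the original model is untouched unless it is explicitly reassigned.
    return CompiledWrapper(model)

trainer_model = [Region(), Region()]
compile_regions_mock(trainer_model)
assert all(r.compiled for r in trainer_model)   # trainer sees the effect

trainer_model2 = [Region()]
wrapped = compile_whole_mock(trainer_model2)
assert not trainer_model2[0].compiled           # trainer's reference is NOT compiled
assert wrapped._orig_mod is trainer_model2      # compilation lives only in the wrapper
```

This is the same asymmetry the comment describes: after whole-model torch.compile, the compiled object exists only in the returned wrapper, so any code still holding the original reference (here, the trainer's self.model) never sees it.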

A solution for that could be something like this:

if self.use_regional_compilation:
    import copy
    model_compiled = copy.deepcopy(model)
    compile_regions(model_compiled, compile_kwargs)
else:
    model_compiled = torch.compile(model, **compile_kwargs)

@IlyasMoutawwakil I see you fixed a similar issue about regional_compilation here #2014. Is your fix just for DeepSpeed initialization, or does it also work for single-card training?

@Luca-Calabria
Contributor Author

The last commit removed the memory inefficiency.

@Luca-Calabria Luca-Calabria marked this pull request as ready for review June 6, 2025 09:58
model = self._wrap_model(self.model, training=False, dataloader=dataloader)

-    if len(self.accelerator._models) == 0 and model is self.model:
+    if (len(self.accelerator._models) == 0 or self.args.torch_compile) and model is self.model:
Member

I don't understand the point here: if len(self.accelerator._models) == 0, the model will be prepared.
Why re-prepare it if torch.compile is enabled?

Contributor Author

@Luca-Calabria Luca-Calabria Jun 6, 2025

It is explained in the description of this PR: when you run the "train+eval" example, after the train step the condition len(self.accelerator._models) == 0 is false because the accelerator's _models list is populated with the current model; this means that when evaluation_loop is called, model preparation is skipped even though the model is not actually compiled.
When you run the "just eval" example it works because the accelerator's _models list is empty.

Should I force a cleanup of the accelerator._models list after the train step instead?
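The guard behavior described above can be illustrated with a minimal mock (MockAccelerator and MockTrainer are simplified stand-ins, not the real accelerate/transformers classes):

```python
class MockAccelerator:
    def __init__(self):
        self._models = []

    def prepare(self, model):
        # The real accelerator.prepare would also wrap/compile the model here.
        self._models.append(model)
        return model

class MockTrainer:
    def __init__(self):
        self.accelerator = MockAccelerator()
        self.model = object()

    def evaluation_loop_prepares(self, model):
        # Mirrors the original guard: only prepare if nothing was prepared yet.
        return len(self.accelerator._models) == 0 and model is self.model

# "just eval": nothing prepared yet -> guard is True, so the model gets
# prepared (and, with torch.compile enabled, compiled)
t = MockTrainer()
assert t.evaluation_loop_prepares(t.model) is True

# "train+eval": training already called prepare -> guard is False, so
# evaluation skips preparation, which is the skip described in this thread
t.accelerator.prepare(t.model)
assert t.evaluation_loop_prepares(t.model) is False
```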

Member

this means that when evaluation_loop is called, the model preparation is skipped but the model actually is not compiled.

I mean yes, that's to be expected; compilation should only be done once, and the returned model will stay compiled for the rest of its life (except at checkpoint save time, when it is unwrapped to be saved).

Member

Sorry if it's self-evident, but I'm asking because if what you're saying is true, I'm wondering why it wasn't flagged in transformers before 😅 it sounds like a universal issue.

Contributor Author

@yafshar this is different from what you described as expected flow. Maybe I misunderstood?

Contributor Author

I removed this solution anyway. Please take a look at the new, less intrusive solution I committed.

Member

It still is intrusive 😅 for example, FSDP + torch.compile will result in the evaluation loop running with an FSDP-wrapped model, which I believe doesn't even have a .generate method 🤔

Contributor Author

Could you please share which part of the code shows this?

Member

The line self.model = model propagates the effects of accelerator.prepare, which wraps the model in all kinds of distributed training wrappers. Using this wrapped model during evaluation will break it in some scenarios.
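A toy illustration of the concern (Model and DistributedWrapper are hypothetical classes, not the real FSDP API): a distributed wrapper can expose forward-style calls without exposing the underlying model's generate, so evaluation code that calls generate on the wrapped model breaks until the model is unwrapped:

```python
class Model:
    def generate(self):
        return "generated text"

class DistributedWrapper:
    """Stands in for an FSDP/DDP-style wrapper that forwards calls to the
    wrapped module but does not expose its generate method."""
    def __init__(self, module):
        self.module = module

    def __call__(self, *args):
        # Simplified forward pass; real wrappers shard/sync parameters here.
        return self.module

wrapped = DistributedWrapper(Model())
assert not hasattr(wrapped, "generate")               # eval via .generate would fail
assert wrapped.module.generate() == "generated text"  # unwrapping recovers it
```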

Member

@IlyasMoutawwakil IlyasMoutawwakil Jun 6, 2025

it is a single small line change but the impacts are gigantic

@Luca-Calabria
Contributor Author

I updated the code to set self.model = model when running with torch.compile(). Now evaluation will use the model compiled during training.

@Luca-Calabria
Contributor Author

Luca-Calabria commented Jun 17, 2025

@IlyasMoutawwakil I double-checked the transformers repo on NVIDIA here https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L4394 and torch.compile() seems to work there, reporting the same eval_samples_per_second value for both the "train+eval" and "just eval" cases. The odd thing is that, in the train+eval case, model is not the compiled one:
model = BertForQuestionAnswering( (bert): BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(30522, 1024, padding_idx=0) (position_embeddings): Embedding(512, 1024) (token_type_embeddings): Embedding(2, 1024) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( (layer): ModuleList( (0-23): 24 x BertLayer( (attention): BertAttention( (self): BertSdpaSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) ) (qa_outputs): Linear(in_features=1024, out_features=2, bias=True)

(the same for self.model), but self.model_wrapped is compiled and does not seem to be used by self.prediction_step. When running the "just eval" case, the model is prepared here https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L4394 and model is the compiled one:
model = OptimizedModule( (_orig_mod): BertForQuestionAnswering( (bert): BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(30522, 1024, padding_idx=0) (position_embeddings): Embedding(512, 1024) (token_type_embeddings): Embedding(2, 1024) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( (layer): ModuleList( (0-23): 24 x BertLayer( (attention): BertAttention( (self): BertSdpaSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) ) (qa_outputs): Linear(in_features=1024, out_features=2, bias=True) ) )

It is not clear how this can produce the same performance.

NB: this result was obtained using --torch_compile_backend inductor
https://docs.pytorch.org/docs/stable/torch.compiler.html

@Luca-Calabria
Contributor Author

Hi @yafshar @IlyasMoutawwakil, I double-checked how the original transformers repo works on NVIDIA and found the same behavior as on Gaudi:
[screenshot of the command and results]

The mismatch between "train+eval" and "just eval" eval metrics can also be reproduced with the cudagraph backend:
[screenshots of the two runs]

After this result, I think we can close this PR without merging, as no further action is required on our end. What do you think?

@yafshar
Contributor

yafshar commented Jul 3, 2025

@Luca-Calabria I agree. I do not think we need to do anything else here. Please close this PR. Thanks
