Encoder-Decoder Gemma #38332
Conversation
Hey! Thanks for the PR! It would be nice to split encoder and decoder. The decoder pretty much already exists as Gemma3, no? So it's simpler to write already!
Can you also add the tests! 🤗
super().__init__(config, device)

class EncdecGemma2Attention(nn.Module):
let's split cross and self please!
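For illustration, a minimal sketch of that split (class names assumed, heavily simplified: no multi-head projections, masking, RoPE or caching). The two modules differ only in where the keys and values come from:

```python
import torch
import torch.nn as nn


class SelfAttentionSketch(nn.Module):
    """Queries, keys and values all come from the same sequence."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(hidden_states), self.k_proj(hidden_states), self.v_proj(hidden_states)
        return torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1) @ v


class CrossAttentionSketch(nn.Module):
    """Queries come from the decoder, keys/values from the encoder output."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor, encoder_hidden_states: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(hidden_states)
        k, v = self.k_proj(encoder_hidden_states), self.v_proj(encoder_hidden_states)
        return torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1) @ v
```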
return attn_output, attn_weights

def make_sliding_mask(
Can you use create_sliding_window_mask, potentially with an or_mask for bidirectional? 🤗
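To make the suggestion concrete, here is a tiny standalone illustration in plain PyTorch (not the library's masking API): a bidirectional sliding window is just the OR of the causal sliding-window predicate with its mirror image, which is exactly the kind of thing an `or_mask` hook can express.

```python
import torch


def causal_sliding_window(q_idx, kv_idx, window):
    # Causal sliding window: a query may attend to keys in (q - window, q].
    return (kv_idx <= q_idx) & (q_idx - kv_idx < window)


def bidirectional_sliding_window(q_idx, kv_idx, window):
    # OR the causal predicate with its mirror to get a symmetric window,
    # i.e. |q - kv| < window, which is what bidirectional encoder
    # self-attention with a sliding window needs.
    return causal_sliding_window(q_idx, kv_idx, window) | causal_sliding_window(kv_idx, q_idx, window)


q = torch.arange(6).unsqueeze(-1)   # query positions as a column
kv = torch.arange(6).unsqueeze(0)   # key positions as a row
print(bidirectional_sliding_window(q, kv, window=3).int())
```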
if self.is_decoder:
    # cross attention
    self.cross_attn = EncdecGemma2Attention(
        config=config,
        layer_idx=layer_idx,
        is_cross_attention=True,
    )
    self.pre_cross_attn_layernorm = EncdecGemma2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
    self.post_cross_attn_layernorm = EncdecGemma2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
we should split encoder and decoder layers!
# setup sliding window for self-attention mask
if self.is_sliding and attention_mask is not None:  # efficient SDPA and no padding
    # In prefill, we may be larger than sliding window
    effective_seq_len = max(cache_position.shape[0], self.sliding_window)
    # For FA2, the mask is 2D and is of shape [bs, processed_tokens] (not [bs, max_cache_len]),
    # thus we must slice from the right (at most `effective_seq_len` elements)
    if self.config._attn_implementation == "flash_attention_2":
        attention_mask = attention_mask[:, -effective_seq_len:]
    # Otherwise, the mask is 4D of shape [bs, 1, query_len, max_cache_len] thus we must slice
    # from the left, with an offset if we are beyond the sliding window
    else:
        attention_mask = make_sliding_mask(
            attention_mask,
            self.sliding_window,
            # Decoder self-attention: causal attention
            # Encoder self-attention: bidirectional attention
            bidirectional=not self.is_decoder,
        )
        # In case we are beyond the sliding window, we need to correctly offset the mask slicing
        offset = cache_position[-1] - effective_seq_len + 1
        # Should only be used when beyond the sliding window (i.e. offset > 0)
        offset = torch.clamp(offset, min=0)
        # equivalent to: `attention_mask = attention_mask[:, :, :, offset : offset + effective_seq_len]`,
        # but without data-dependent slicing (i.e. torch.compile friendly)
        mask_indexes = torch.arange(
            min(effective_seq_len, attention_mask.shape[-1]), device=attention_mask.device
        )
        mask_indexes += offset
        attention_mask = attention_mask[:, :, :, mask_indexes]
not needed if you leverage the new causal mask API
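For reference, a hedged sketch of what "leverage the new causal mask API" could look like, assuming the helpers in `transformers.masking_utils` and reusing the kwarg pattern quoted further down in this thread (toy config, not the PR's actual code): build the masks once at model level instead of re-slicing inside every layer.

```python
import torch
from transformers import Gemma2Config
from transformers.masking_utils import create_causal_mask, create_sliding_window_causal_mask

# Toy config just to make the sketch self-contained.
config = Gemma2Config(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=2,
    num_key_value_heads=1, intermediate_size=64, sliding_window=4,
)
config._attn_implementation = "eager"

batch, seq_len = 1, 6
mask_kwargs = {
    "config": config,
    "input_embeds": torch.zeros(batch, seq_len, config.hidden_size),
    "attention_mask": torch.ones(batch, seq_len, dtype=torch.long),
    "cache_position": torch.arange(seq_len),
    "past_key_values": None,
}

# One mask per attention type; layers then just pick the one they need.
masks = {
    "full_attention": create_causal_mask(**mask_kwargs),
    "sliding_attention": create_sliding_window_causal_mask(**mask_kwargs),
}
print({name: None if m is None else tuple(m.shape) for name, m in masks.items()})
```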
return shifted_input_ids

class EncdecGemma2Stack(EncdecGemma2PreTrainedModel):
not a fan of stacks, let's split encoder and decoder! I know t5 has that but it's not a good precedent!
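A rough sketch of the shape this could take (assumed names, heavily simplified; embeddings, norms, masks, caching and heads omitted): a dedicated encoder whose layers use bidirectional self-attention only, and a dedicated decoder whose layers own the cross-attention, rather than one Stack class switching on `is_decoder`.

```python
import torch.nn as nn


class EncoderSketch(nn.Module):
    def __init__(self, layers):
        super().__init__()
        # Layers with bidirectional self-attention only.
        self.layers = nn.ModuleList(layers)

    def forward(self, hidden_states):
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states


class DecoderSketch(nn.Module):
    def __init__(self, layers):
        super().__init__()
        # Layers with causal self-attention plus cross-attention over the encoder output.
        self.layers = nn.ModuleList(layers)

    def forward(self, hidden_states, encoder_hidden_states):
        for layer in self.layers:
            hidden_states = layer(hidden_states, encoder_hidden_states)
        return hidden_states
```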
if output_hidden_states:
    all_hidden_states += (hidden_states,)

if self.gradient_checkpointing and self.training:
for convenience use GradientCheckpointingLayer!
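For context, a minimal sketch of the suggestion (the layer body is a placeholder): subclassing `GradientCheckpointingLayer` lets the base class reroute the layer call through `torch.utils.checkpoint` during training once gradient checkpointing is enabled on the parent model, so the explicit `if self.gradient_checkpointing and self.training:` branch in the stack's forward can go away.

```python
import torch
import torch.nn as nn
from transformers.modeling_layers import GradientCheckpointingLayer


class ToyLayer(GradientCheckpointingLayer):
    # Placeholder body; a real layer would hold attention, MLP and norms.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states):
        return self.mlp(hidden_states)


layer = ToyLayer(8)
# Without checkpointing enabled this behaves like a plain nn.Module call; when the
# parent model calls gradient_checkpointing_enable(), the call is checkpointed in training.
print(layer(torch.randn(2, 8)).shape)
```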
@ArthurZucker thanks for your suggestions! I made several updates to this PR. Please take another look!
@ArthurZucker I just marked this PR as ready for review. Could you please take another look?
Kudos it is very very nice! 🤗
Last 2 nits!
mask_kwargs = {
    "config": self.config,
    "input_embeds": encoder_hidden_states,
    "attention_mask": encoder_attention_mask,
    "cache_position": cache_position,
    "past_key_values": None,
}
The mask kwargs are almost the same, we can probably just re-use them?
Sorry, you mean the same as the self-attention mask kwargs? Perhaps better not to reuse, because out of the 5 kwargs, only 2 are the same; the other 3 are different.
ah okay okay no worries
@auto_docstring
class T5GemmaForSequenceClassification(T5GemmaPreTrainedModel):
this class and the next I would rather add upon request!
Since this model is T5Gemma and has the encoder-decoder architecture, researchers would naturally expect it to follow the T5 usage. It would be weird if classification tasks were not supported.
Ok makes sense
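For what it's worth, the expected usage would then mirror T5-style classification. A hedged sketch (the checkpoint name is taken from the training command further down in this thread; the classification head would be freshly initialized):

```python
import torch
from transformers import AutoTokenizer, T5GemmaForSequenceClassification

checkpoint = "google/t5gemma-s-s-ul2"  # assumption: same checkpoint as used later in this thread
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5GemmaForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

inputs = tokenizer("This encoder-decoder model works nicely.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))
```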
* Rename encdecgemma to t5gemma.
* Split attention into self- and cross-attention
* Split stack into encoder and decoder
* Add test cases
* Add auto configuration
…points are uploaded.).
* replace docstrings with auto_docstrings
* remove checkpoint layers
* remove deprecate_kwargs
Thanks a lot for bearing with me!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
The Token Classification implementation was not tested end-to-end:

[INFO|trainer.py:4352] 2025-07-12 12:27:59,071 >> ***** Running Evaluation *****
[INFO|trainer.py:4354] 2025-07-12 12:27:59,071 >> Num examples = 3250
[INFO|trainer.py:4357] 2025-07-12 12:27:59,071 >> Batch size = 8
Traceback (most recent call last):
File "/home/stefan/Repositories/transformers-t5gemma/examples/pytorch/token-classification/run_ner.py", line 653, in <module>
main()
File "/home/stefan/Repositories/transformers-t5gemma/examples/pytorch/token-classification/run_ner.py", line 584, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/Repositories/transformers-t5gemma/src/transformers/trainer.py", line 2206, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/Repositories/transformers-t5gemma/src/transformers/trainer.py", line 2656, in _inner_training_loop
self._maybe_log_save_evaluate(
File "/home/stefan/Repositories/transformers-t5gemma/src/transformers/trainer.py", line 3095, in _maybe_log_save_evaluate
metrics = self._evaluate(trial, ignore_keys_for_eval)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/Repositories/transformers-t5gemma/src/transformers/trainer.py", line 3044, in _evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/Repositories/transformers-t5gemma/src/transformers/trainer.py", line 4198, in evaluate
output = eval_loop(
^^^^^^^^^^
File "/home/stefan/Repositories/transformers-t5gemma/src/transformers/trainer.py", line 4488, in evaluation_loop
metrics = self.compute_metrics(
^^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/Repositories/transformers-t5gemma/examples/pytorch/token-classification/run_ner.py", line 535, in compute_metrics
predictions = np.argmax(predictions, axis=2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/venvs/t5gemma/lib/python3.12/site-packages/numpy/_core/fromnumeric.py", line 1359, in argmax
return _wrapfunc(a, 'argmax', axis=axis, out=out, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/venvs/t5gemma/lib/python3.12/site-packages/numpy/_core/fromnumeric.py", line 54, in _wrapfunc
return _wrapit(obj, method, *args, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefan/venvs/t5gemma/lib/python3.12/site-packages/numpy/_core/fromnumeric.py", line 42, in _wrapit
conv = _array_converter(obj)
^^^^^^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
Tested with latest:

BATCH_SIZE=16
LR=5e-05
EPOCHS=20
SEED=1
python3 run_ner.py \
--model_name_or_path google/t5gemma-s-s-ul2 \
--dataset_name conll2003 \
--output_dir ./t5gemma-bs${BATCH_SIZE}-lr${LR}-e${EPOCHS}-${SEED} \
--eval_strategy epoch \
--save_strategy epoch \
--per_device_train_batch_size ${BATCH_SIZE} \
--learning_rate ${LR} \
--num_train_epochs ${EPOCHS} \
--load_best_model_at_end=True \
--bf16 \
--do_train \
--do_eval
Hey, thanks for raising this! There seems to be some problem with eval, as the model returns hidden states as well. A temporary fix is to disable them in modeling_t5gemma.py.
We'll make an update later.
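As an illustration only (an assumption on my side, not the exact patch referred to above), an equivalent user-side workaround is to make sure `compute_metrics` only ever sees the logits, via the Trainer's `preprocess_logits_for_metrics` hook:

```python
def preprocess_logits_for_metrics(logits, labels):
    # Drop extra outputs such as hidden states so stacking predictions stays homogeneous.
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits

# trainer = Trainer(..., preprocess_logits_for_metrics=preprocess_logits_for_metrics)
```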
What does this PR do?
Add support for encoder-decoder Gemma (https://arxiv.org/abs/2504.06225)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.