Conversation

@Cyrilvallez (Member)

What does this PR do?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment


Thanks 🤗
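
The diff under review replaces the dtype-minimum "epsilon" with a small positive constant: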

```diff
-target_magnitude: torch.Tensor = torch.mean(hidden_states_0**2, dim=-1, keepdim=True) ** 0.5
-epsilon_tensor = torch.tensor(torch.finfo().min)
+target_magnitude = torch.mean(hidden_states_0**2, dim=-1, keepdim=True) ** 0.5
+epsilon_tensor = torch.tensor(1e-5)
```
Collaborator

Seems a little bit unrelated to me, where does that come from?

@Cyrilvallez (Member Author) commented on Jul 1, 2025

Otherwise we can get NaN on those layers because of the line `current_hidden_state = current_hidden_state * (target_magnitude / torch.maximum(new_magnitude, epsilon_tensor))` - the max op is completely useless if we compare against the minimum possible value for the dtype, since any non-negative magnitude already exceeds it. Given the variable's name, epsilon, I assumed it was a typo and was supposed to be a small positive value for numerical stability.
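
To make the failure mode concrete, here is a minimal sketch of the scaling step with toy standalone tensors (not the actual model code; an all-zero hidden state stands in for a degenerate layer output):

```python
import torch

# Toy stand-ins: hidden_states_0 is the reference activation,
# current_hidden_state is a degenerate (all-zero) layer output.
hidden_states_0 = torch.randn(2, 8)
current_hidden_state = torch.zeros(2, 8)

target_magnitude = torch.mean(hidden_states_0**2, dim=-1, keepdim=True) ** 0.5
new_magnitude = torch.mean(current_hidden_state**2, dim=-1, keepdim=True) ** 0.5  # all zeros

# Old value: the most negative finite float (~ -3.4e38). Any non-negative
# magnitude already exceeds it, so torch.maximum never clamps, the division
# produces inf, and 0 * inf gives NaN.
bad_eps = torch.tensor(torch.finfo().min)
nan_out = current_hidden_state * (target_magnitude / torch.maximum(new_magnitude, bad_eps))

# New value: a small positive floor keeps the ratio finite.
good_eps = torch.tensor(1e-5)
ok_out = current_hidden_state * (target_magnitude / torch.maximum(new_magnitude, good_eps))

print(nan_out.isnan().any())  # tensor(True)
print(ok_out.isnan().any())   # tensor(False)
```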

@ydshieh (Collaborator) commented on Jul 1, 2025

run-slow: gemma3n

@github-actions (bot) commented on Jul 1, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/gemma3n']
quantizations: [] ...

@Cyrilvallez (Member Author)

All good, thanks @ydshieh for the slow tests! (IntegrationTests are still skipped, I will check them soon)

@Cyrilvallez Cyrilvallez merged commit dbc9832 into main Jul 1, 2025
22 checks passed
@Cyrilvallez Cyrilvallez deleted the fix-gemma3n-tests branch July 1, 2025 08:34
@Cyrilvallez Cyrilvallez added the for patch label (tags changes that should be included in the next patch) Jul 1, 2025
@danielhanchen (Contributor)

Do you guys know why the training loss is exceptionally high? I don't think it's due to gradient accumulation - the loss does decrease quickly, but it's very weird

@ArthurZucker (Collaborator)

👀 no idea!

@danielhanchen (Contributor)

Wait, I misspoke - gradient accumulation in fact does not work correctly.

But the losses are still suspiciously high
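
For context, a common way gradient accumulation "does not work correctly" (a general pitfall, not necessarily the cause here) is averaging each micro-batch's mean loss instead of normalizing by the total number of valid tokens across the accumulation window, which skews the loss whenever micro-batches contain different numbers of non-padded tokens. A minimal sketch of the discrepancy:

```python
import torch

# Hypothetical per-token losses for two micro-batches of unequal size,
# e.g. after masking out padded positions.
losses_a = torch.tensor([2.0, 2.0, 2.0, 2.0])  # 4 valid tokens
losses_b = torch.tensor([4.0, 4.0])            # 2 valid tokens

# Naive accumulation: average the per-micro-batch means.
naive = (losses_a.mean() + losses_b.mean()) / 2      # (2.0 + 4.0) / 2 = 3.0

# Equivalent-to-full-batch: one mean over all valid tokens.
full_batch = torch.cat([losses_a, losses_b]).mean()  # 16.0 / 6 ~ 2.67

print(naive.item(), full_batch.item())  # 3.0 vs ~2.67
```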

Cyrilvallez added a commit that referenced this pull request Jul 4, 2025
* remove the skips

* fix the epsilon to a small value (does not make sense otherwise)

* safeguard

* overload test_eager_matches_sdpa

* Update test_modeling_common.py

* skip appropriate tests

* correct no_split_layer

* fix all devices issue

* fix backward

* fix
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
* remove the skips

* fix the epsilon to a small value (does not make sense otherwise)

* safeguard

* overload test_eager_matches_sdpa

* Update test_modeling_common.py

* skip appropriate tests

* correct no_split_layer

* fix all devices issue

* fix backward

* fix
