
Conversation

@gante (Contributor) commented Feb 17, 2025

What does this PR do?

Reviewers: this PR applies the same pattern to all models. In essence, you only need to review one model carefully (the main models are T5 and Whisper).


This PR removes the option to use the legacy cache format in encoder-decoder models, where its removal had been scheduled for v4.48 or earlier.

The pattern originally introduced in Whisper also got updated: if the model is used in decoder-only mode, a Cache instance is now accepted directly. The previous pattern was wasteful and (IMO) unintuitive: we were initializing a new cache, only to then discard the cross-attention part 🤔 The updated pattern uses the provided cache as-is. When the model is used as an encoder-decoder, on the other hand, only EncoderDecoderCache instances are accepted (a deprecation cycle was added for the old behavior).
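
For readers less familiar with the new cache classes, here is a minimal sketch of the caller-side usage this implies, assuming the `EncoderDecoderCache` and `DynamicCache` classes exported by `transformers`; the checkpoint and prompt are illustrative, not taken from this PR:

```python
# Minimal sketch: handing an explicit EncoderDecoderCache to an encoder-decoder model.
# The self-attention cache comes first, the cross-attention cache second.
from transformers import AutoTokenizer, T5ForConditionalGeneration, DynamicCache, EncoderDecoderCache

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: How are you?", return_tensors="pt")
past_key_values = EncoderDecoderCache(DynamicCache(), DynamicCache())

outputs = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```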

The PR also:

  • Adds support for assisted generation with EncoderDecoderCache
  • Updates the docstrings for past_key_values in the touched models
  • Adds more comments so we can immediately understand what's going on
  • Standardizes a few minor related differences (mostly in Whisper, to match the more readable T5-based models)

✅ slow T5 tests are all green

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gante gante force-pushed the v4_48_deprecations branch from 2addcb6 to 515d266 Compare February 17, 2025 16:27
@gante gante changed the title from "[generate] remove legacy cache in encoder-decoder models (deprecated in v4.48)" to "[generate] remove legacy cache in t5-based encoder-decoder models (deprecated in v4.48)" on Feb 17, 2025
@gante gante force-pushed the v4_48_deprecations branch from 515d266 to fd365f0 Compare February 17, 2025 16:54
Comment on lines +261 to +262
# set flag that curr layer for cross-attn is already updated so we can re-use in subsequent calls
past_key_value.is_updated[self.layer_idx] = True
@gante (Contributor, Author) commented Feb 18, 2025

I took this pattern from T5 and applied it everywhere: it is much clearer than the one originally in Whisper, since the flag is set AFTER the update is done.
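
For context, a simplified sketch of the pattern in question, assuming the `is_updated` flag and `cross_attention_cache` attribute on `EncoderDecoderCache`; the function and projection names are illustrative placeholders rather than the actual modeling code:

```python
# Simplified sketch of the cross-attention caching pattern described above.
# `k_proj`/`v_proj` and the function name are hypothetical placeholders.
def get_cross_attention_kv(past_key_value, layer_idx, encoder_hidden_states, k_proj, v_proj):
    is_updated = past_key_value.is_updated.get(layer_idx, False) if past_key_value is not None else False
    if past_key_value is not None and is_updated:
        # cross-attention K/V were already computed in a previous decoding step: reuse them
        key_states = past_key_value.cross_attention_cache.key_cache[layer_idx]
        value_states = past_key_value.cross_attention_cache.value_cache[layer_idx]
    else:
        key_states = k_proj(encoder_hidden_states)
        value_states = v_proj(encoder_hidden_states)
        if past_key_value is not None:
            key_states, value_states = past_key_value.cross_attention_cache.update(
                key_states, value_states, layer_idx
            )
            # set the flag AFTER the update is done, so subsequent calls can safely reuse the entry
            past_key_value.is_updated[layer_idx] = True
    return key_states, value_states
```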

@gante gante marked this pull request as ready for review February 18, 2025 15:41
@gante gante requested review from eustlb and zucchini-nlp February 18, 2025 15:46
@gante gante changed the title from "[generate] remove legacy cache in t5-based encoder-decoder models (deprecated in v4.48)" to "[generate] remove legacy cache in t5 and whisper-based encoder-decoder models (deprecated in v4.48)" on Feb 18, 2025
@gante gante changed the title from "[generate] remove legacy cache in t5 and whisper-based encoder-decoder models (deprecated in v4.48)" to "[generate] remove legacy cache in t5 and whisper-based models (deprecated in v4.48)" on Feb 18, 2025
@zucchini-nlp (Member) left a comment

Thanks a lot for cleaning our repo, those warnings are indeed a bit annoying! 🧼

I looked only at T5 and Qwen2Audio and left a few questions to be sure I understand it correctly. The logic for dispatching the cache when it is None seems a bit involved.

Comment on lines +1447 to +1449
"You are passing a decoder-only cache to a model that is used as an encoder-decoder model. "
"This behavior is deprecated and will be removed in v4.52. To avoid this warning, please pass an "
"`EncoderDecoderCache` (e.g. `EncoderDecoderCache(past_key_values, DynamicCache())`)."
Member

in which situations can this happen? I believe if we are using generate(), this should never happen. If users are passing cache objects, they are supposed to use EncoderDecoderCache (as of prev deprecation warning msg)

Contributor Author

100% agreed it should never happen. However, it was supported in the refactor (see here), copy-pasted from Whisper. Hence the deprecation cycle, as a good practice.

For reference, back when it was added to Whisper, the Whisper team was experimenting with new architectures -- a decoder-only model that could also receive the inputs from another encoder. In hindsight, using EncoderDecoderCache there would have been better, but I didn't see the issue when reviewing the original PR :)
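
To make the deprecation path concrete, a hedged sketch of what such a shim could look like; `_ensure_encoder_decoder_cache` is a hypothetical helper name, and the warning text follows the snippet quoted above:

```python
# Hypothetical helper illustrating the deprecation shim discussed above: a plain
# decoder-only cache handed to an encoder-decoder model is wrapped, with a warning,
# instead of being rejected outright.
from transformers import DynamicCache, EncoderDecoderCache
from transformers.utils import logging

logger = logging.get_logger(__name__)


def _ensure_encoder_decoder_cache(past_key_values):
    if past_key_values is None or isinstance(past_key_values, EncoderDecoderCache):
        return past_key_values
    logger.warning_once(
        "You are passing a decoder-only cache to a model that is used as an encoder-decoder model. "
        "This behavior is deprecated and will be removed in v4.52. To avoid this warning, please pass an "
        "`EncoderDecoderCache` (e.g. `EncoderDecoderCache(past_key_values, DynamicCache())`)."
    )
    return EncoderDecoderCache(past_key_values, DynamicCache())
```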

Comment on lines +1457 to +1460
elif past_key_values is not None:
logger.warning_once(
"`use_cache` is set to `False` but `past_key_values` is passed. `past_key_values` will be ignored."
)
Member

Nice! We don't set the cache to None in LLMs, so it would be cool to propagate this change.

Contributor Author

Will do!

I've noticed the cache docstrings in LLMs are also outdated, so I'll open a PR with both :)

@gante (Contributor, Author) commented Feb 19, 2025

(@zucchini-nlp @eustlb -- moving the Qwen2 Audio changes into a separate PR that should be merged before this one, as the nature of those changes is different)

@gante (Contributor, Author) commented May 27, 2025

(superseded by the PRs @vasqu is working on)

@gante gante closed this May 27, 2025