
Commit 7ae5ef1

just squash into one commit
1 parent e021bf6 commit 7ae5ef1

47 files changed, +104 -294 lines changed

docs/source/en/model_doc/albert.md

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). This
 - Embedding size E is different from hidden size H justified because the embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V being the vocab size). If E < H, it has less parameters.
 - Layers are split in groups that share parameters (to save memory).
 - Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have been swapped or not.
+- The `head_mask` argument is ignored when using any attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.

 ### Using Scaled Dot Product Attention (SDPA)

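Not part of the commit, but a minimal sketch of what the added note means in practice for ALBERT, assuming the `albert-base-v2` checkpoint and an illustrative `head_mask` that disables one head; with any non-"eager" attention implementation the mask below would be silently ignored.

```python
import torch
from transformers import AutoTokenizer, AlbertModel

model_id = "albert-base-v2"  # assumed checkpoint, not named in the commit
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load with the "eager" attention implementation so head_mask takes effect
model = AlbertModel.from_pretrained(model_id, attn_implementation="eager")

inputs = tokenizer("Paris is the capital of France.", return_tensors="pt")

# One value per (layer, head): 1.0 keeps a head, 0.0 masks it out
head_mask = torch.ones(model.config.num_hidden_layers, model.config.num_attention_heads)
head_mask[0, 0] = 0.0  # illustrative: disable the first head of the first layer

outputs = model(**inputs, head_mask=head_mask)
print(outputs.last_hidden_state.shape)
```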
docs/source/en/model_doc/bart.md

Lines changed: 1 addition & 0 deletions
@@ -55,6 +55,7 @@ This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The
 * mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)
 * permute sentences
 * rotate the document to make it start at a specific token
+- The `head_mask` argument is ignored when using any attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.

 ## Implementation Notes

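As a sketch of the span-masking behaviour mentioned in the context lines (not part of the commit), a single `<mask>` token can stand in for a whole span that the model fills in; the `facebook/bart-large` checkpoint is an assumption.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

model_id = "facebook/bart-large"  # assumed checkpoint, not named in the commit
tokenizer = BartTokenizer.from_pretrained(model_id)
model = BartForConditionalGeneration.from_pretrained(model_id)

# A single <mask> token can cover a span of missing tokens
inputs = tokenizer("UN Chief Says There Is No <mask> in Syria", return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```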
docs/source/en/model_doc/biogpt.md

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ This model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The
 - BioGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left.
 - BioGPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows BioGPT to generate syntactically coherent text as it can be observed in the run_generation.py example script.
 - The model can take the `past_key_values` (for PyTorch) as input, which is the previously computed key/value attention pairs. Using this (past_key_values or past) value prevents the model from re-computing pre-computed values in the context of text generation. For PyTorch, see past_key_values argument of the BioGptForCausalLM.forward() method for more information on its usage.
+- The `head_mask` argument is ignored when using any attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.

 ### Using Scaled Dot Product Attention (SDPA)

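A minimal generation sketch for the KV-cache behaviour described above (not part of the commit); `generate` reuses `past_key_values` internally when `use_cache=True`, and the `microsoft/biogpt` checkpoint and prompt are illustrative.

```python
from transformers import BioGptForCausalLM, BioGptTokenizer

model_id = "microsoft/biogpt"  # assumed checkpoint
tokenizer = BioGptTokenizer.from_pretrained(model_id)
model = BioGptForCausalLM.from_pretrained(model_id)

inputs = tokenizer("COVID-19 is", return_tensors="pt")

# use_cache=True keeps past_key_values between decoding steps instead of recomputing them
output_ids = model.generate(**inputs, max_new_tokens=20, use_cache=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```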
docs/source/en/model_doc/data2vec.md

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ The original code for vision can be found [here](https://github.com/facebookrese
 - For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction
 - For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
 - For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.
+- The `head_mask` argument is ignored when using any attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.

 ### Using Scaled Dot Product Attention (SDPA)

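A brief sketch (not part of the commit) of the "preprocessing is identical to RobertaModel" point: `AutoTokenizer` resolves to the RoBERTa-style tokenizer for a Data2VecText checkpoint; the `facebook/data2vec-text-base` checkpoint name is an assumption.

```python
from transformers import AutoTokenizer, Data2VecTextModel

model_id = "facebook/data2vec-text-base"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)  # RoBERTa-style tokenization
model = Data2VecTextModel.from_pretrained(model_id)

inputs = tokenizer("data2vec learns from text, audio and images.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```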
docs/source/en/model_doc/gpt_bigcode.md

Lines changed: 4 additions & 0 deletions
@@ -46,8 +46,12 @@ The main differences compared to GPT2.
 - Merge the key and value caches into one (this changes the format of layer_past/ present, does it risk creating problems?)
 - Use the memory layout (self.num_heads, 3, self.head_dim) instead of `(3, self.num_heads, self.head_dim)` for the QKV tensor with MHA. (prevents an overhead with the merged key and values, but makes the checkpoints incompatible with the original openai-community/gpt2 model).

+
 You can read more about the optimizations in the [original pull request](https://github.com/huggingface/transformers/pull/22575)

+> [!NOTE]
+> The `head_mask` argument is ignored when using any attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.
+
 ## Combining Starcoder and Flash Attention 2

 First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.

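A minimal PyTorch sketch (not part of the commit) of the two fused-QKV memory layouts contrasted in the bullets above, with illustrative dimensions; it only demonstrates how the same buffer is sliced, not the model's actual attention code.

```python
import torch

num_heads, head_dim = 4, 8
fused_qkv = torch.randn(num_heads * 3 * head_dim)  # output of a fused QKV projection for one token

# GPT-2 style layout (3, num_heads, head_dim): split along the leading axis
q2, k2, v2 = fused_qkv.view(3, num_heads, head_dim).unbind(dim=0)

# GPTBigCode layout (num_heads, 3, head_dim): q/k/v interleaved per head, which keeps
# each head's key and value adjacent for the merged KV cache
q1, k1, v1 = fused_qkv.view(num_heads, 3, head_dim).unbind(dim=1)

print(q1.shape, k1.shape, v1.shape)  # each is (num_heads, head_dim)
```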
docs/source/en/model_doc/hubert.md

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
 - Hubert is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
 - Hubert model was fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded
   using [`Wav2Vec2CTCTokenizer`].
-
+- The `head_mask` argument is ignored when using any attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.

 ## Using Flash Attention 2

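A minimal sketch (not part of the commit) of the CTC decoding step described in the context lines; the `facebook/hubert-large-ls960-ft` checkpoint and the silent dummy waveform are illustrative.

```python
import torch
from transformers import AutoProcessor, HubertForCTC

model_id = "facebook/hubert-large-ls960-ft"  # assumed CTC fine-tuned checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = HubertForCTC.from_pretrained(model_id)

waveform = torch.zeros(16000)  # 1 s of silence at 16 kHz, stand-in for real audio
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))  # CTC-decodes token ids back to text
```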
docs/source/en/model_doc/m2m_100.md

Lines changed: 3 additions & 0 deletions
@@ -51,6 +51,9 @@ multilingual it expects the sequences in a certain format: A special language id
 source and target text. The source text format is `[lang_code] X [eos]`, where `lang_code` is source language
 id for source text and target language id for target text, with `X` being the source or target text.

+> [!NOTE]
+> The `head_mask` argument is ignored when using any attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.
+
 The [`M2M100Tokenizer`] depends on `sentencepiece` so be sure to install it before running the
 examples. To install `sentencepiece` run `pip install sentencepiece`.

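A short sketch (not part of the commit) of the `[lang_code] X [eos]` format in practice, assuming the `facebook/m2m100_418M` checkpoint and an English-to-French pair: the tokenizer prepends the source language id, and the target language id is forced as the first generated token.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_id = "facebook/m2m100_418M"  # assumed checkpoint
tokenizer = M2M100Tokenizer.from_pretrained(model_id, src_lang="en")
model = M2M100ForConditionalGeneration.from_pretrained(model_id)

inputs = tokenizer("Life is like a box of chocolates.", return_tensors="pt")

# Force the target language id as the first generated token
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```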
docs/source/en/model_doc/mbart.md

Lines changed: 3 additions & 0 deletions
@@ -35,6 +35,9 @@ You can find all the original mBART checkpoints under the [AI at Meta](https://h
 > [!TIP]
 > Click on the mBART models in the right sidebar for more examples of applying mBART to different language tasks.

+> [!NOTE]
+> The `head_mask` argument is ignored when using any attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.
+
 The example below demonstrates how to translate text with [`Pipeline`] or the [`AutoModel`] class.

 <hfoptions id="usage">

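For the `Pipeline` route mentioned in the context lines, a minimal sketch (not part of the commit), assuming the `facebook/mbart-large-50-many-to-many-mmt` checkpoint and an English-to-French pair.

```python
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/mbart-large-50-many-to-many-mmt",  # assumed checkpoint
    src_lang="en_XX",
    tgt_lang="fr_XX",
)
print(translator("The head of the United Nations says there is no military solution in Syria."))
```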
docs/source/en/model_doc/musicgen.md

Lines changed: 3 additions & 0 deletions
@@ -62,6 +62,9 @@ python src/transformers/models/musicgen/convert_musicgen_transformers.py \
 --checkpoint small --pytorch_dump_folder /output/path --safe_serialization
 ```

+> [!NOTE]
+> The `head_mask` argument is ignored when using any attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.
+
 ## Generation

 MusicGen is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly

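A minimal sketch (not part of the commit) of sampling-mode generation as introduced in the Generation section; the `facebook/musicgen-small` checkpoint, prompt, guidance scale, and token budget are illustrative.

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

model_id = "facebook/musicgen-small"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = MusicgenForConditionalGeneration.from_pretrained(model_id)

inputs = processor(text=["80s pop track with bassy drums and synth"], padding=True, return_tensors="pt")

# do_sample=True selects the sampling mode rather than greedy decoding
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
print(audio_values.shape)  # (batch_size, num_channels, sequence_length)
```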
docs/source/en/model_doc/musicgen_melody.md

Lines changed: 3 additions & 0 deletions
@@ -44,6 +44,9 @@ There are two key differences with MusicGen:
 1. The audio prompt is used here as a conditional signal for the generated audio sample, whereas it's used for audio continuation in [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen).
 2. Conditional text and audio signals are concatenated to the decoder's hidden states instead of being used as a cross-attention signal, as in MusicGen.

+> [!NOTE]
+> The `head_mask` argument is ignored when using any attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.
+
 ## Generation

 MusicGen Melody is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly better results than greedy, thus we encourage sampling mode to be used where possible. Sampling is enabled by default, and can be explicitly specified by setting `do_sample=True` in the call to [`MusicgenMelodyForConditionalGeneration.generate`], or by overriding the model's generation config (see below).

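A minimal sketch (not part of the commit) of the `do_sample=True` call described above for MusicGen Melody, assuming the `facebook/musicgen-melody` checkpoint and a text-only prompt without an audio conditioning signal.

```python
from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

model_id = "facebook/musicgen-melody"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = MusicgenMelodyForConditionalGeneration.from_pretrained(model_id)

inputs = processor(text=["acoustic folk song with a gentle guitar"], padding=True, return_tensors="pt")

# Sampling mode, as recommended over greedy decoding
audio_values = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(audio_values.shape)
```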