
Commit 793f90b

Merge branch 'main' into fix-add-new-model-like-tokenizer
2 parents: af9e504 + 19224c3

File tree: 75 files changed, +91 -91 lines changed


docs/source/en/internal/import_utils.md

Lines changed: 1 addition & 1 deletion

@@ -38,7 +38,7 @@ However, no method can be called on that object:
 ```python
 >>> DetrImageProcessorFast.from_pretrained()
 ImportError:
-DetrImageProcessorFast requires the Torchvision library but it was not found in your environment. Checkout the instructions on the
+DetrImageProcessorFast requires the Torchvision library but it was not found in your environment. Check out the instructions on the
 installation page: https://pytorch.org/get-started/locally/ and follow the ones that match your environment.
 Please note that you may need to restart your runtime after installation.
 ```
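The documentation being fixed here describes how a class whose optional backend (torchvision) is missing still imports, but raises at call time. A hedged sketch of that placeholder pattern; the helper class below is illustrative only and is not the actual transformers implementation:

```python
# Minimal sketch of a placeholder-object pattern (illustrative only, not the
# real transformers internals). The idea: when an optional backend such as
# torchvision is missing, expose a stand-in whose every attribute access
# raises a helpful ImportError instead of failing at import time.
class _MissingBackendPlaceholder:
    def __init__(self, class_name: str, backend: str, hint: str):
        self._class_name = class_name
        self._backend = backend
        self._hint = hint

    def __getattr__(self, name):
        raise ImportError(
            f"{self._class_name} requires the {self._backend} library but it was not "
            f"found in your environment. Check out the instructions on the "
            f"installation page: {self._hint}"
        )


# Hypothetical stand-in for DetrImageProcessorFast when torchvision is absent.
DetrImageProcessorFastStub = _MissingBackendPlaceholder(
    "DetrImageProcessorFast", "Torchvision", "https://pytorch.org/get-started/locally/"
)
# DetrImageProcessorFastStub.from_pretrained()  # -> ImportError with the message above
```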

examples/flax/question-answering/run_qa.py

Lines changed: 1 addition & 1 deletion

@@ -546,7 +546,7 @@ def main():
     # region Tokenizer check: this script requires a fast tokenizer.
     if not isinstance(tokenizer, PreTrainedTokenizerFast):
         raise ValueError(
-            "This example script only works for models that have a fast tokenizer. Checkout the big table of models at"
+            "This example script only works for models that have a fast tokenizer. Check out the big table of models at"
             " https://huggingface.co/transformers/index.html#supported-frameworks to find the model types that meet"
             " this requirement"
         )
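The same fast-tokenizer guard (and the same Checkout → Check out fix) appears again in the PyTorch and TensorFlow QA scripts and the NER script below. As a standalone illustration of what the guard enforces, here is a hedged sketch; the checkpoint name is an arbitrary example:

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Load a tokenizer and confirm it is a "fast" (Rust-backed) tokenizer before
# relying on fast-only features such as offset mappings.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

if not isinstance(tokenizer, PreTrainedTokenizerFast):
    raise ValueError(
        "This example script only works for models that have a fast tokenizer. Check out the big table of models at"
        " https://huggingface.co/transformers/index.html#supported-frameworks to find the model types that meet"
        " this requirement"
    )

print(tokenizer.is_fast)  # True when the Rust tokenizers backend is in use
```

The fast-tokenizer requirement exists because these examples use offset mappings to align answer spans with tokens, a feature only the Rust-backed tokenizers provide.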

examples/modular-transformers/configuration_my_new_model.py

Lines changed: 1 addition & 1 deletion

@@ -36,7 +36,7 @@ class MyNewModelConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
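This GQA docstring recurs across many of the model configs touched below. It describes converting a multi-head checkpoint to a grouped-query checkpoint by mean-pooling the key/value heads within each group. A minimal sketch of that pooling, assuming the usual `(num_heads * head_dim, hidden_size)` projection-weight layout; names and shapes are illustrative:

```python
import torch


def meanpool_kv_heads(kv_weight: torch.Tensor, num_heads: int, num_kv_heads: int) -> torch.Tensor:
    """Mean-pool a k_proj / v_proj weight from `num_heads` heads down to `num_kv_heads` groups."""
    assert num_heads % num_kv_heads == 0, "query heads must divide evenly into key/value groups"
    out_dim, hidden_size = kv_weight.shape
    head_dim = out_dim // num_heads
    group_size = num_heads // num_kv_heads

    # Split into (num_kv_heads, group_size, head_dim, hidden_size) and average
    # the heads inside each group, then flatten back to a projection weight.
    grouped = kv_weight.view(num_kv_heads, group_size, head_dim, hidden_size)
    pooled = grouped.mean(dim=1)
    return pooled.reshape(num_kv_heads * head_dim, hidden_size)


# Toy example: 8 key/value heads of size 64, pooled into 2 GQA groups.
k_proj = torch.randn(8 * 64, 512)
print(meanpool_kv_heads(k_proj, num_heads=8, num_kv_heads=2).shape)  # torch.Size([128, 512])
```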

examples/modular-transformers/configuration_new_model.py

Lines changed: 1 addition & 1 deletion

@@ -34,7 +34,7 @@ class NewModelConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 head_dim (`int`, *optional*, defaults to 256):

examples/pytorch/question-answering/run_qa.py

Lines changed: 1 addition & 1 deletion

@@ -357,7 +357,7 @@ def main():
     # Tokenizer check: this script requires a fast tokenizer.
     if not isinstance(tokenizer, PreTrainedTokenizerFast):
         raise ValueError(
-            "This example script only works for models that have a fast tokenizer. Checkout the big table of models at"
+            "This example script only works for models that have a fast tokenizer. Check out the big table of models at"
             " https://huggingface.co/transformers/index.html#supported-frameworks to find the model types that meet"
             " this requirement"
         )

examples/pytorch/token-classification/run_ner.py

Lines changed: 1 addition & 1 deletion

@@ -399,7 +399,7 @@ def get_label_list(labels):
     # Tokenizer check: this script requires a fast tokenizer.
     if not isinstance(tokenizer, PreTrainedTokenizerFast):
         raise ValueError(
-            "This example script only works for models that have a fast tokenizer. Checkout the big table of models at"
+            "This example script only works for models that have a fast tokenizer. Check out the big table of models at"
             " https://huggingface.co/transformers/index.html#supported-frameworks to find the model types that meet"
             " this requirement"
         )

examples/tensorflow/question-answering/run_qa.py

Lines changed: 1 addition & 1 deletion

@@ -378,7 +378,7 @@ def main():
     # region Tokenizer check: this script requires a fast tokenizer.
     if not isinstance(tokenizer, PreTrainedTokenizerFast):
         raise ValueError(
-            "This example script only works for models that have a fast tokenizer. Checkout the big table of models at"
+            "This example script only works for models that have a fast tokenizer. Check out the big table of models at"
             " https://huggingface.co/transformers/index.html#supported-frameworks to find the model types that meet"
            " this requirement"
         )

src/transformers/models/aria/configuration_aria.py

Lines changed: 1 addition & 1 deletion

@@ -49,7 +49,7 @@ class AriaTextConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):

src/transformers/models/aria/modular_aria.py

Lines changed: 1 addition & 1 deletion

@@ -120,7 +120,7 @@ class AriaTextConfig(LlamaConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):

src/transformers/models/bamba/configuration_bamba.py

Lines changed: 1 addition & 1 deletion

@@ -53,7 +53,7 @@ class BambaConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `8`.
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
 The non-linear activation function (function or string) in the decoder.

src/transformers/models/bitnet/configuration_bitnet.py

Lines changed: 1 addition & 1 deletion

@@ -48,7 +48,7 @@ class BitNetConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 hidden_act (`str` or `function`, *optional*, defaults to `"relu2"`):

src/transformers/models/chameleon/configuration_chameleon.py

Lines changed: 1 addition & 1 deletion

@@ -125,7 +125,7 @@ class ChameleonConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):

src/transformers/models/chameleon/convert_chameleon_weights_to_hf.py

Lines changed: 1 addition & 1 deletion

@@ -446,7 +446,7 @@ def main():
         "--model_size",
         choices=["7B", "30B"],
         help=""
-        " models correspond to the finetuned versions, and are specific to the Chameleon official release. For more details on Chameleon, checkout the original repo: https://github.com/facebookresearch/chameleon",
+        " models correspond to the finetuned versions, and are specific to the Chameleon official release. For more details on Chameleon, check out the original repo: https://github.com/facebookresearch/chameleon",
     )
     parser.add_argument(
         "--output_dir",

src/transformers/models/cohere/configuration_cohere.py

Lines changed: 1 addition & 1 deletion

@@ -56,7 +56,7 @@ class CohereConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):

src/transformers/models/cohere2/configuration_cohere2.py

Lines changed: 1 addition & 1 deletion

@@ -52,7 +52,7 @@ class Cohere2Config(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):

src/transformers/models/cohere2/modular_cohere2.py

Lines changed: 1 addition & 1 deletion

@@ -74,7 +74,7 @@ class Cohere2Config(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):

src/transformers/models/csm/configuration_csm.py

Lines changed: 2 additions & 2 deletions

@@ -54,7 +54,7 @@ class CsmDepthDecoderConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
@@ -235,7 +235,7 @@ class CsmConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf).
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
 The non-linear activation function (function or string) in the backbone model Transformer decoder.

src/transformers/models/deepseek_v3/configuration_deepseek_v3.py

Lines changed: 1 addition & 1 deletion

@@ -52,7 +52,7 @@ class DeepseekV3Config(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 n_shared_experts (`int`, *optional*, defaults to 1):

src/transformers/models/diffllama/configuration_diffllama.py

Lines changed: 1 addition & 1 deletion

@@ -48,7 +48,7 @@ class DiffLlamaConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):

src/transformers/models/emu3/configuration_emu3.py

Lines changed: 1 addition & 1 deletion

@@ -138,7 +138,7 @@ class Emu3TextConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):

src/transformers/models/falcon_h1/configuration_falcon_h1.py

Lines changed: 1 addition & 1 deletion

@@ -50,7 +50,7 @@ class FalconH1Config(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `8`.
 hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
 The non-linear activation function (function or string) in the decoder.

src/transformers/models/gemma/configuration_gemma.py

Lines changed: 1 addition & 1 deletion

@@ -47,7 +47,7 @@ class GemmaConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 head_dim (`int`, *optional*, defaults to 256):
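Since this docstring recurs verbatim across the Gemma-family configs below, here is a short hedged sketch of how `num_key_value_heads` selects the attention variant it describes; the values are arbitrary:

```python
from transformers import GemmaConfig

# GQA: 16 query heads share 4 key/value heads (each group of 4 query heads
# attends with one key/value head).
gqa_config = GemmaConfig(num_attention_heads=16, num_key_value_heads=4)

# MQA: a single key/value head shared by all query heads.
mqa_config = GemmaConfig(num_attention_heads=16, num_key_value_heads=1)

# MHA: one key/value head per query head (num_key_value_heads == num_attention_heads).
mha_config = GemmaConfig(num_attention_heads=16, num_key_value_heads=16)

print(gqa_config.num_key_value_heads, mqa_config.num_key_value_heads, mha_config.num_key_value_heads)
```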

src/transformers/models/gemma/convert_gemma_weights_to_hf.py

Lines changed: 1 addition & 1 deletion

@@ -151,7 +151,7 @@ def main():
         "--model_size",
         default="7B",
         choices=["2B", "7B", "tokenizer_only"],
-        help="'f' models correspond to the finetuned versions, and are specific to the Gemma2 official release. For more details on Gemma2, checkout the original repo: https://huggingface.co/google/gemma-7b",
+        help="'f' models correspond to the finetuned versions, and are specific to the Gemma2 official release. For more details on Gemma2, check out the original repo: https://huggingface.co/google/gemma-7b",
     )
     parser.add_argument(
         "--output_dir",

src/transformers/models/gemma/modular_gemma.py

Lines changed: 1 addition & 1 deletion

@@ -74,7 +74,7 @@ class GemmaConfig(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 head_dim (`int`, *optional*, defaults to 256):

src/transformers/models/gemma2/configuration_gemma2.py

Lines changed: 1 addition & 1 deletion

@@ -47,7 +47,7 @@ class Gemma2Config(PretrainedConfig):
 `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
 `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
 converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
-by meanpooling all the original heads within that group. For more details checkout [this
+by meanpooling all the original heads within that group. For more details, check out [this
 paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
 `num_attention_heads`.
 head_dim (`int`, *optional*, defaults to 256):

src/transformers/models/gemma2/convert_gemma2_weights_to_hf.py

Lines changed: 1 addition & 1 deletion

@@ -184,7 +184,7 @@ def main():
         "--model_size",
         default="9B",
         choices=["9B", "27B", "tokenizer_only"],
-        help="'f' models correspond to the finetuned versions, and are specific to the Gemma22 official release. For more details on Gemma2, checkout the original repo: https://huggingface.co/google/gemma-7b",
+        help="'f' models correspond to the finetuned versions, and are specific to the Gemma22 official release. For more details on Gemma2, check out the original repo: https://huggingface.co/google/gemma-7b",
     )
     parser.add_argument(
         "--output_dir",
