Sharing our plans for v5 here!
Right now, the distinction between tokenizers (e.g. Bart, Albert) isn't explicit: we don't know which algorithm (Unigram, WordPiece, etc.) each one actually uses. The current convert-slow mechanism hides this detail. Instead of relying on "convert slow", we want to make tokenizer definitions explicit (use `tokenizers` the way we use `torch` for model definitions), providing a single source of truth.
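For illustration, here is what "hidden" means in practice today: to find out which algorithm a fast tokenizer uses, you have to inspect the serialized `tokenizers` backend rather than read it off the class definition. The checkpoint name below is only an example.

```python
import json
from transformers import AutoTokenizer

# The class name (e.g. LlamaTokenizerFast) does not tell you the algorithm;
# you have to dig into the serialized backend to find it.
tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")  # example checkpoint
backend = json.loads(tok.backend_tokenizer.to_str())
print(backend["model"]["type"])  # "BPE" -- not visible from the class itself
```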
✅ Plan for v5
1. Remove legacy artifacts
   - Remove saving `special_tokens_map.json` and `added_tokens.json`
   - Remove `_eventually_correct_t5_max_length` and related old code
   - Drop old warnings, legacy flags, and unused tf / jax support
   - Trim unnecessary deps (protobuf, sentencepiece) and imports from core files (the added-tokens decoder etc. is only needed for sentencepiece)
2. Clean up outdated models
3. Remove redundant tokenizer definitions (starting with those in `convert_slow_tokenizer.py`) and use modular to isolate diffs. Use `tokenizers` for explicit definitions. `LlamaTokenizer` will look like the sketch below: an out-of-the-box trainable tokenizer that *is* Llama; you can pass no vocab/merges and get a brand-new one:
```python
from tokenizers import Tokenizer, Regex, decoders, normalizers, processors
from tokenizers.models import BPE

from transformers import PreTrainedTokenizerFast


class LlamaTokenizer(PreTrainedTokenizerFast):
    def __init__(self, vocab=None, merges=None, unk_token="<u>", bos_token="<b>", eos_token="<e>"):
        # The algorithm (BPE) is defined explicitly here, no conversion step needed.
        self._tokenizer = Tokenizer(BPE(
            vocab,
            merges,
            unk_token=unk_token,
            fuse_unk=True,
            byte_fallback=True,
            dropout=None,
        ))
        self._tokenizer.normalizer = normalizers.Sequence([
            normalizers.Strip(left=False, right=True),
            normalizers.Replace(Regex(" {2,}"), "▁"),
        ])
        # sentencepiece-style decoding (concrete values filled in for illustration)
        self._tokenizer.decoder = decoders.Metaspace(replacement="▁", prepend_scheme="first")
        # Only attach the template processor once the special tokens exist in the vocab
        # (they don't if you start from an empty, to-be-trained tokenizer).
        bos_id = self._tokenizer.token_to_id(bos_token)
        eos_id = self._tokenizer.token_to_id(eos_token)
        if bos_id is not None and eos_id is not None:
            self._tokenizer.post_processor = processors.TemplateProcessing(
                single="<b>:0 $A:0 <e>:0",
                pair="<b>:0 $A:0 <b>:0 $B:1 <b>:1",
                special_tokens=[("<b>", bos_id), ("<e>", eos_id)],
            )
```

The `PreTrainedTokenizerFast` will have all the nice logic for `add_eos_token`, `add_bos_token`, etc.
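As a quick usage sketch (hypothetical, and only exercising the raw backend): because the BPE model can start empty, you can instantiate the class with no vocab and train it directly with a `tokenizers` trainer.

```python
from tokenizers import trainers

# Start from an empty, Llama-like tokenizer and train it on a (toy) corpus.
tok = LlamaTokenizer()  # no vocab/merges -> fresh, trainable BPE model
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["<u>", "<b>", "<e>"])
tok._tokenizer.train_from_iterator(["some text", "more text"], trainer=trainer)
print(tok._tokenizer.get_vocab_size())
```

A real implementation would re-attach the post-processor and special-token logic after training; the point here is only that the explicit definition doubles as a trainable, from-scratch tokenizer.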
4. Simplify call hierarchy
Current flow:
`__call__` → `_call_one` → `encode` → `batch_encode_plus` | `encode_plus` → `_encode_plus` (slow) → `tokenize` + `convert_tokens_to_ids` → `prepare_for_model` (slow)
Too complex → simplify to be reusable & maintainable
Make it friendlier for simple tokenizers (e.g. blt)
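To make the direction concrete, here is a purely hypothetical sketch of a flatter call path on top of a `tokenizers` backend; none of these names or signatures are final.

```python
from tokenizers import Tokenizer


class SimplifiedTokenizer:
    """Hypothetical sketch of a flatter call path; names and behavior are illustrative only."""

    def __init__(self, backend: Tokenizer):
        self._tokenizer = backend

    def __call__(self, text):
        # Single entry point: normalize to a batch, encode once, unwrap if needed.
        is_batched = isinstance(text, (list, tuple))
        encodings = self._tokenizer.encode_batch(list(text) if is_batched else [text])
        ids = [enc.ids for enc in encodings]
        return ids if is_batched else ids[0]
```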
5. Batch encoding/decoding: `encode` already supports batching. Update `decode` to support batch decoding too (a possible shape is sketched below).
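This is a hedged sketch only (the real signature may differ), assuming a `tokenizers` backend on `self._tokenizer`:

```python
# Hypothetical: `decode` accepts either one sequence of ids or a batch of sequences.
def decode(self, token_ids, skip_special_tokens=False):
    if len(token_ids) > 0 and isinstance(token_ids[0], (list, tuple)):
        # Batch of sequences -> use the backend's batch decoding.
        return self._tokenizer.decode_batch(token_ids, skip_special_tokens=skip_special_tokens)
    return self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
```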
6. Reduce bloat in loading tokenizers: today loading goes through many calls and relies on the config, etc. Make it simpler (see the sketch below).
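For illustration only (the repo id is a placeholder, and this skips everything the current loader handles): in the simplest case, loading could amount to fetching `tokenizer.json` and building the backend from it.

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Illustrative lean loading path: one file, one constructor call.
tokenizer_file = hf_hub_download("org/model", "tokenizer.json")  # placeholder repo id
backend = Tokenizer.from_file(tokenizer_file)
print(backend.encode("Hello world").ids)
```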
7. Update tests: freeze the integration tests, then fully rewrite and simplify them, given that we expect `tokenizers` to "work".
8. Migration guide: provide clear instructions for moving to v5 (conversion).
9. Docs: more and better docs about training tokenizers!