Sharing our plans for v5 here!
Right now, the distinction between tokenizers (e.g. Bart, Albert) isn't explicit: we don't know which algorithm (Unigram, WordPiece, etc.) each one actually uses. The current convert-slow mechanism hides this detail. Instead of relying on "convert slow", we want to make tokenizer definitions explicit (use `tokenizers` the way we use `torch` for model definitions), providing a single source of truth.
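For illustration, here is what "hidden" means in practice today: to find out which algorithm a fast tokenizer uses, you have to inspect the serialized `tokenizers` backend rather than read it off the class definition. The checkpoint name below is only an example.

```python
import json
from transformers import AutoTokenizer

# The class name (e.g. LlamaTokenizerFast) does not tell you the algorithm;
# you have to dig into the serialized backend to find it.
tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")  # example checkpoint
backend = json.loads(tok.backend_tokenizer.to_str())
print(backend["model"]["type"])  # "BPE" -- not visible from the class itself
```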
✅ Plan for v5
1. Remove legacy artifacts
   - Remove saving `special_tokens_map.json` and `added_tokens.json`
   - Remove `_eventually_correct_t5_max_length` and related old code
   - Drop old warnings, legacy flags, and unused tf / jax support
   - Trim unnecessary deps (protobuf, sentencepiece) and imports from core files (the added-tokens decoder etc. is only needed for sentencepiece)
2. Clean up outdated models
3. Remove redundant tokenizer definitions (starting with those in `convert_slow_tokenizer.py`) and use modular to isolate diffs. Use `tokenizers` for explicit definitions. `LlamaTokenizer` will look like the sketch below: an out-of-the-box trainable tokenizer that *is* Llama; you can pass no vocab/merges and get a brand-new one:
```python
from tokenizers import Tokenizer, Regex, decoders, normalizers, processors
from tokenizers.models import BPE

from transformers import PreTrainedTokenizerFast


class LlamaTokenizer(PreTrainedTokenizerFast):
    def __init__(self, vocab=None, merges=None, unk_token="<u>", bos_token="<b>", eos_token="<e>"):
        # The algorithm (BPE) is defined explicitly here, no conversion step needed.
        self._tokenizer = Tokenizer(BPE(
            vocab,
            merges,
            unk_token=unk_token,
            fuse_unk=True,
            byte_fallback=True,
            dropout=None,
        ))
        self._tokenizer.normalizer = normalizers.Sequence([
            normalizers.Strip(left=False, right=True),
            normalizers.Replace(Regex(" {2,}"), "▁"),
        ])
        # sentencepiece-style decoding (concrete values filled in for illustration)
        self._tokenizer.decoder = decoders.Metaspace(replacement="▁", prepend_scheme="first")
        # Only attach the template processor once the special tokens exist in the vocab
        # (they don't if you start from an empty, to-be-trained tokenizer).
        bos_id = self._tokenizer.token_to_id(bos_token)
        eos_id = self._tokenizer.token_to_id(eos_token)
        if bos_id is not None and eos_id is not None:
            self._tokenizer.post_processor = processors.TemplateProcessing(
                single="<b>:0 $A:0 <e>:0",
                pair="<b>:0 $A:0 <b>:0 $B:1 <b>:1",
                special_tokens=[("<b>", bos_id), ("<e>", eos_id)],
            )
```

The `PreTrainedTokenizerFast` will have all the nice logic for `add_eos_token`, `add_bos_token`, etc.
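As a quick usage sketch (hypothetical, and only exercising the raw backend): because the BPE model can start empty, you can instantiate the class with no vocab and train it directly with a `tokenizers` trainer.

```python
from tokenizers import trainers

# Start from an empty, Llama-like tokenizer and train it on a (toy) corpus.
tok = LlamaTokenizer()  # no vocab/merges -> fresh, trainable BPE model
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["<u>", "<b>", "<e>"])
tok._tokenizer.train_from_iterator(["some text", "more text"], trainer=trainer)
print(tok._tokenizer.get_vocab_size())
```

A real implementation would re-attach the post-processor and special-token logic after training; the point here is only that the explicit definition doubles as a trainable, from-scratch tokenizer.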
4. Simplify call hierarchy
Current flow:
`__call__` → `_call_one` → `encode` → `batch_encode_plus` | `encode_plus` → `_encode_plus` (slow) → `tokenize` + `convert_tokens_to_ids` → `prepare_for_model` (slow)
Too complex → simplify to be reusable & maintainable
Make it friendlier for simple tokenizers (e.g. blt)
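To make the direction concrete, here is a purely hypothetical sketch of a flatter call path on top of a `tokenizers` backend; none of these names or signatures are final.

```python
from tokenizers import Tokenizer


class SimplifiedTokenizer:
    """Hypothetical sketch of a flatter call path; names and behavior are illustrative only."""

    def __init__(self, backend: Tokenizer):
        self._tokenizer = backend

    def __call__(self, text):
        # Single entry point: normalize to a batch, encode once, unwrap if needed.
        is_batched = isinstance(text, (list, tuple))
        encodings = self._tokenizer.encode_batch(list(text) if is_batched else [text])
        ids = [enc.ids for enc in encodings]
        return ids if is_batched else ids[0]
```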
5. Batch encoding/decoding: `encode` already supports batching. Update `decode` to support batch decoding too (a possible shape is sketched below).
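This is a hedged sketch only (the real signature may differ), assuming a `tokenizers` backend on `self._tokenizer`:

```python
# Hypothetical: `decode` accepts either one sequence of ids or a batch of sequences.
def decode(self, token_ids, skip_special_tokens=False):
    if len(token_ids) > 0 and isinstance(token_ids[0], (list, tuple)):
        # Batch of sequences -> use the backend's batch decoding.
        return self._tokenizer.decode_batch(token_ids, skip_special_tokens=skip_special_tokens)
    return self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
```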
6. Reduce bloat in loading tokenizers: today loading goes through many calls and relies on the config, etc. Make it simpler (see the sketch below).
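For illustration only (the repo id is a placeholder, and this skips everything the current loader handles): in the simplest case, loading could amount to fetching `tokenizer.json` and building the backend from it.

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Illustrative lean loading path: one file, one constructor call.
tokenizer_file = hf_hub_download("org/model", "tokenizer.json")  # placeholder repo id
backend = Tokenizer.from_file(tokenizer_file)
print(backend.encode("Hello world").ids)
```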
7. Update tests: freeze the integration tests, then fully rewrite and simplify them, given that we expect `tokenizers` to "work".
8. Migration guide: provide clear instructions for moving to v5 (conversion).
9. Docs: more and better docs about training tokenizers!