
support bpe tokenizer in convert #2228


Merged · 3 commits · Jul 25, 2023

Conversation

ftgreat (Contributor) commented Jul 15, 2023

Our released Aquila models use a BPE tokenizer, so in convert.py we add one branch that preprocesses the BPE tokenizer vocab into SentencePiece form, so that downstream modules such as inference or int4 quantization can be used unchanged. We have made sure all encoding ids are identical and that there is no impact on other modules.

Could you please review this PR? Thanks.
Related issue: #2093
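
To illustrate the general idea, here is a minimal sketch assuming a GPT-2 style vocab.json that maps token strings to ids; it is an illustration only, not the actual diff of this PR:

```python
# Sketch only: read a BPE-style vocab.json and expose it as the
# (token bytes, score) pairs a SentencePiece-shaped vocab path expects.
import json
from pathlib import Path

def load_bpe_vocab(vocab_json: Path):
    # vocab.json maps token string -> integer id, e.g. {"hello": 123, ...}
    with open(vocab_json, encoding="utf-8") as f:
        vocab = json.load(f)
    # Sort by id so the output order matches the ids the model was trained with.
    for token, _id in sorted(vocab.items(), key=lambda kv: kv[1]):
        # BPE vocabs carry no per-token log-probabilities, so emit a dummy
        # score of 0.0 to keep the downstream interface unchanged.
        yield token.encode("utf-8"), 0.0
```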

howard0su (Collaborator)

Can you provide test instructions so that I can verify the change?

ftgreat (Contributor, Author) commented Jul 18, 2023

Can you provide test instructions so that I can verify the change?

Instructions:
python convert.py models/7B --vocab-only --outfile models/aquila-vocab.bin --vocabtype bpe

Requirements:
Put vocab.json in the models dir. vocab.json comes from the Aquila tokenizer: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/Aquila-tokenizer-hf/vocab.json
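
As a quick sanity check before running the conversion, one could confirm that vocab.json parses and that its ids form a contiguous range. This is a sketch, not part of the PR, and it assumes vocab.json sits in the model directory used in the command above:

```python
# Illustrative check: make sure vocab.json parses and its ids form a
# contiguous 0..N-1 range before running convert.py with --vocabtype bpe.
import json

with open("models/7B/vocab.json", encoding="utf-8") as f:  # assumed location
    vocab = json.load(f)

ids = sorted(vocab.values())
assert ids == list(range(len(ids))), "token ids are not a contiguous 0..N-1 range"
print(f"{len(vocab)} BPE tokens loaded")
```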

klosax (Contributor) commented Jul 19, 2023

Note: Using a llama model with a gpt2 tokenizer will be fully supported in the new ggml file format. ggml-org/ggml#302

ftgreat (Contributor, Author) commented Jul 25, 2023

Note: Using a llama model with a gpt2 tokenizer will be fully supported in the new ggml file format. ggerganov/ggml#302

Could you please share the support schedule?
And how can we add our released models? Thanks.
