
support bpe tokenizer in convert #2228


Merged · 3 commits · Jul 25, 2023

Conversation

ftgreat (Contributor) commented Jul 15, 2023

Our released Aquila models use a BPE tokenizer, so in convert.py we add one branch that preprocesses the BPE tokenizer vocab into SentencePiece form, so that downstream modules such as inference or int4 quantization can be used unchanged. We have made sure all encoding ids are identical and that there is no impact on other modules.

Could you please review this PR? Thanks.
Related issue: #2093
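
To illustrate the general idea, here is a minimal sketch assuming a GPT-2 style vocab.json that maps token strings to ids; it is an illustration only, not the actual diff of this PR:

```python
# Sketch only: read a BPE-style vocab.json and expose it as the
# (token bytes, score) pairs a SentencePiece-shaped vocab path expects.
import json
from pathlib import Path

def load_bpe_vocab(vocab_json: Path):
    # vocab.json maps token string -> integer id, e.g. {"hello": 123, ...}
    with open(vocab_json, encoding="utf-8") as f:
        vocab = json.load(f)
    # Sort by id so the output order matches the ids the model was trained with.
    for token, _id in sorted(vocab.items(), key=lambda kv: kv[1]):
        # BPE vocabs carry no per-token log-probabilities, so emit a dummy
        # score of 0.0 to keep the downstream interface unchanged.
        yield token.encode("utf-8"), 0.0
```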

howard0su (Collaborator)

Can you provide test instructions so that I can verify the change?

ftgreat (Contributor, Author) commented Jul 18, 2023

Can you provide test instructions so that I can verify the change?

Instructions:
python convert.py models/7B --vocab-only --outfile models/aquila-vocab.bin --vocabtype bpe

Requirements:
Put vocab.json in the models dir. vocab.json comes from the Aquila tokenizer: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/Aquila-tokenizer-hf/vocab.json
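
As a quick sanity check before running the conversion, one could confirm that vocab.json parses and that its ids form a contiguous range. This is a sketch, not part of the PR, and it assumes vocab.json sits in the model directory used in the command above:

```python
# Illustrative check: make sure vocab.json parses and its ids form a
# contiguous 0..N-1 range before running convert.py with --vocabtype bpe.
import json

with open("models/7B/vocab.json", encoding="utf-8") as f:  # assumed location
    vocab = json.load(f)

ids = sorted(vocab.values())
assert ids == list(range(len(ids))), "token ids are not a contiguous 0..N-1 range"
print(f"{len(vocab)} BPE tokens loaded")
```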

klosax (Contributor) commented Jul 19, 2023

Note: Using a llama model with a gpt2 tokenizer will be fully supported in the new ggml file format. ggml-org/ggml#302

ftgreat (Contributor, Author) commented Jul 25, 2023

Note: Using a llama model with a gpt2 tokenizer will be fully supported in the new ggml file format. ggerganov/ggml#302

Could you please share the support schedule?
And how can we add our released models? Thanks.
