[User] Producing tokenizer.model from transformers tokenizers.json #2443

Closed
BaiqingL opened this issue Jul 29, 2023 · 6 comments

@BaiqingL

I have a Llama 2 7B model fine-tuned for a downstream task and stored in transformers format, i.e. my model file structure looks like this:

-a---           7/28/2023  4:30 PM            623 config.json
-a---           7/28/2023  4:30 PM            160 generation_config.json
-a---           7/27/2023  4:00 AM     9976672446 pytorch_model-00001-of-00002.bin
-a---           7/27/2023  4:00 AM     3500355411 pytorch_model-00002-of-00002.bin
-a---           7/27/2023  4:00 AM          26788 pytorch_model.bin.index.json
-a---           7/27/2023  4:00 AM            576 special_tokens_map.json
-a---           7/27/2023  4:00 AM            698 tokenizer_config.json
-a---           7/27/2023  4:00 AM        1843709 tokenizer.json
-a---           7/27/2023  4:00 AM           6011 training_args.bin

I know convert.py expects the original Llama 2 structure; how would I modify it to make this work? I'm not sure what the tokenizer.model file format is, or how to convert the tokenizer.json file into it.
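For background, the two files are different serializations: tokenizer.model is a SentencePiece protobuf, while tokenizer.json is the Hugging Face tokenizers library's own JSON format. A minimal sketch for inspecting what the JSON side contains (assumes the tokenizers package is installed; the path is illustrative):

```python
# Inspect the Hugging Face tokenizer.json (not the SentencePiece tokenizer.model).
# Requires: pip install tokenizers
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
print("vocab size:", tok.get_vocab_size())

enc = tok.encode("Hello, world!")
print("ids:", enc.ids)
print("tokens:", enc.tokens)
```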

@BaiqingL (Author)

Also, just curious: has anyone compared inference latency between this project and a simple transformers text-generation pipeline?

@klosax (Contributor)

klosax commented Jul 29, 2023

I guess you could make it work by copying the original tokenizer.model to the folder.

@BaiqingL (Author)

Hey @klosax, I added some special tokens for the downstream task, so unfortunately I don't think I can do that.
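A quick way to check exactly what was added on top of the base vocab (a minimal sketch, assuming transformers is installed; the model path is illustrative):

```python
# See which tokens the fine-tune added on top of the base Llama 2 vocab.
# Requires: pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./my-finetuned-llama-2-7b")
print("total vocab size:", len(tok))
print("added tokens:", tok.get_added_vocab())  # added-token string -> id
```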

@klosax (Contributor)

klosax commented Aug 3, 2023

It looks like a solution was added in PR #2228

You could try something like:
python convert.py <model-dir> --vocab-only --outfile <model-dir>/tokenizer.model --vocabtype bpe
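
If the vocab export works, the full conversion should then pick up the generated tokenizer.model, along the lines of:
python convert.py <model-dir> --outtype f16
(flags as of the convert.py of that era; python convert.py --help shows the current options)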

@BaiqingL closed this as completed Aug 9, 2023
@akramIOT

akramIOT commented Mar 8, 2024

raise FileNotFoundError(f"Could not find any of {[self._FILES[vt] for vt in vocab_types]}")
FileNotFoundError: Could not find any of ['tokenizer.model', 'tokenizer.json']

@jferments

I am running into this same issue trying to convert Llama 3 70B into GGUF: there is no tokenizer.model file, only tokenizer.json. When I run
python3 convert-hf-to-gguf.py --outfile ~/code/models/llm/llama-3-70B/llama-3-70B.gguf --outtype f16 ~/code/models/llm/llama-3-70B/
I get the following error:
raise FileNotFoundError('Cannot find Llama BPE tokenizer')
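For anyone diagnosing the same error, a minimal standard-library sketch to check which tokenizer files the converter can actually see (the directory below is the one from the command above):

```python
# Check which tokenizer artifacts exist in the model directory.
# Standard library only; the path is illustrative.
from pathlib import Path

model_dir = Path("~/code/models/llm/llama-3-70B").expanduser()
for name in ("tokenizer.model", "tokenizer.json", "tokenizer_config.json"):
    status = "found" if (model_dir / name).exists() else "missing"
    print(f"{name}: {status}")
```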
