[User] Producing tokenizer.model from transformers tokenizers.json #2443

Closed
BaiqingL opened this issue Jul 29, 2023 · 6 comments

@BaiqingL

I have a Llama 2 7B model fine-tuned for a downstream task and stored in transformers format, i.e. my model file structure looks like this:

-a---           7/28/2023  4:30 PM            623 config.json
-a---           7/28/2023  4:30 PM            160 generation_config.json
-a---           7/27/2023  4:00 AM     9976672446 pytorch_model-00001-of-00002.bin
-a---           7/27/2023  4:00 AM     3500355411 pytorch_model-00002-of-00002.bin
-a---           7/27/2023  4:00 AM          26788 pytorch_model.bin.index.json
-a---           7/27/2023  4:00 AM            576 special_tokens_map.json
-a---           7/27/2023  4:00 AM            698 tokenizer_config.json
-a---           7/27/2023  4:00 AM        1843709 tokenizer.json
-a---           7/27/2023  4:00 AM           6011 training_args.bin

I know convert.py expects the original Llama 2 structure; how would I modify it to make this work? I'm not sure what the tokenizer.model file format is, or how to convert the tokenizer.json file into it.
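For background, the two files are different serializations: tokenizer.model is a SentencePiece protobuf, while tokenizer.json is the Hugging Face tokenizers library's own JSON format. A minimal sketch for inspecting what the JSON side contains (assumes the tokenizers package is installed; the path is illustrative):

```python
# Inspect the Hugging Face tokenizer.json (not the SentencePiece tokenizer.model).
# Requires: pip install tokenizers
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
print("vocab size:", tok.get_vocab_size())

enc = tok.encode("Hello, world!")
print("ids:", enc.ids)
print("tokens:", enc.tokens)
```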

@BaiqingL (Author)

Also, just curious: has anyone compared inference latency between this project and a simple transformers text-generation pipeline?

@klosax (Contributor)

klosax commented Jul 29, 2023

I guess you could make it work by copying the original tokenizer.model to the folder.

@BaiqingL (Author)

Hey @klosax, I added some special tokens for the downstream task, so unfortunately I don't think I can do that.
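A quick way to check exactly what was added on top of the base vocab (a minimal sketch, assuming transformers is installed; the model path is illustrative):

```python
# See which tokens the fine-tune added on top of the base Llama 2 vocab.
# Requires: pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./my-finetuned-llama-2-7b")
print("total vocab size:", len(tok))
print("added tokens:", tok.get_added_vocab())  # added-token string -> id
```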

@klosax (Contributor)

klosax commented Aug 3, 2023

It looks like a solution was added in PR #2228

You could try something like:
python convert.py <model-dir> --vocab-only --outfile <model-dir>/tokenizer.model --vocabtype bpe
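
If the vocab export works, the full conversion should then pick up the generated tokenizer.model, along the lines of:
python convert.py <model-dir> --outtype f16
(flags as of the convert.py of that era; python convert.py --help shows the current options)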

@BaiqingL closed this as completed Aug 9, 2023
@akramIOT

akramIOT commented Mar 8, 2024

raise FileNotFoundError(f"Could not find any of {[self._FILES[vt] for vt in vocab_types]}")
FileNotFoundError: Could not find any of ['tokenizer.model', 'tokenizer.json']

@jferments

I am running into this same issue trying to convert Llama 3 70B into GGUF: there is no tokenizer.model file, only tokenizer.json. When I run
python3 convert-hf-to-gguf.py --outfile ~/code/models/llm/llama-3-70B/llama-3-70B.gguf --outtype f16 ~/code/models/llm/llama-3-70B/
I get the following error:
raise FileNotFoundError('Cannot find Llama BPE tokenizer')
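For anyone diagnosing the same error, a minimal standard-library sketch to check which tokenizer files the converter can actually see (the directory below is the one from the command above):

```python
# Check which tokenizer artifacts exist in the model directory.
# Standard library only; the path is illustrative.
from pathlib import Path

model_dir = Path("~/code/models/llm/llama-3-70B").expanduser()
for name in ("tokenizer.model", "tokenizer.json", "tokenizer_config.json"):
    status = "found" if (model_dir / name).exists() else "missing"
    print(f"{name}: {status}")
```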
