No special token handling in imatrix, beam-search and others #6804
Hmm, this might be the reason I'm seeing reports from people saying imatrix doesn't work properly with Llama 3 models yet (low-quality output).
Not sure if this is related to this issue specifically, but IQ3 quants of L3 are definitely broken right now. Strangely, IQ4 quants seem OK. Here are some PPL calcs I ran on a few different quants a couple of days ago. As you can see, IQ3_XS is totally borked.
Llama 3 70b: [PPL table not preserved]
Llama 3 8b: [PPL table not preserved]
@ikawrakow Any idea what could cause this? Have you done any tests so far regarding imatrix and IQ quants for Llama 3?
Try to quantize with this flag:
I have moved on to other stuff, so the … Having said that, I'm of course not completely oblivious to the hype around L3, so I did some quick tests myself. Everything was done with build 8b1b1f4. I cannot confirm the PPL values being reported here. For instance, for L3-70B-Instruct and a context of 512, I get these results for …
No imatrix for … My observation is that quantization errors as measured by PPL (i.e., …) …
This issue was closed because it has been inactive for 14 days since being marked as stale.
Some models are extremely sensitive to the prompt format being correct. Without it they generate gibberish.
beam-search calls llama_tokenize with parse_special = false. Once I switched that to true, the special tokens in my prompts were parsed correctly and it generated reasonable output.
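To illustrate the difference, here is a minimal sketch of the two tokenization modes using the llama_tokenize helper from llama.cpp's common/common.h; the exact signatures vary between versions, and the ChatML prompt and model path are just placeholders:

```cpp
// Sketch only: compares tokenization with and without special-token parsing.
// Assumes the common llama_tokenize(ctx, text, add_special, parse_special) helper
// from common/common.h; signatures may differ across llama.cpp versions.
#include "common.h"
#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model   * model = llama_load_model_from_file(argv[1], llama_model_default_params());
    llama_context * ctx   = llama_new_context_with_model(model, llama_context_default_params());

    const std::string prompt = "<|im_start|>user\nHello<|im_end|>\n"; // ChatML-style prompt

    // parse_special = false: "<|im_start|>" is split into ordinary text tokens.
    const auto plain   = llama_tokenize(ctx, prompt, /*add_special=*/true, /*parse_special=*/false);
    // parse_special = true: "<|im_start|>" collapses to its single special token id.
    const auto special = llama_tokenize(ctx, prompt, /*add_special=*/true, /*parse_special=*/true);

    printf("parse_special=false -> %zu tokens\n", plain.size());
    printf("parse_special=true  -> %zu tokens\n", special.size());

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

With a model whose vocab defines the ChatML markers as special tokens, the second call should produce fewer tokens, since each marker maps to one id instead of several text pieces.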
It is also set to false in the imatrix generation. As a result, sample data generated from common chat and instruction datasets in the model's prompt format is not tokenized the same way the model sees it during regular inference. Shouldn't it have an impact that zero real prompt formats were evaluated for the imatrix generation?
To get a better idea of the impact, I tested this with the perplexity measurement, which also does not parse special tokens. In my quick ChatML test with CodeQwen-1.5 the perplexity went up by 40% once special tokens were parsed. Maybe that's due to the raw chunking, which evaluates multiple prompts at once and also breaks them in the middle?
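To make the "breaks them in the middle" point concrete, here is a rough sketch of position-based chunking over a tokenized dataset, in the spirit of how the perplexity tool walks the data; this is not the actual perplexity code, and all numbers are made up:

```cpp
// Illustrative sketch, not the actual perplexity implementation: chunks are cut
// purely by token position, so a chat-formatted prompt can be split mid-turn.
#include <cstdio>
#include <vector>

int main() {
    const int n_ctx = 512;                       // evaluation context size
    std::vector<int> tokens(5 * n_ctx + 123, 0); // stand-in for the tokenized dataset

    for (size_t start = 0; start + n_ctx <= tokens.size(); start += n_ctx) {
        // Every chunk is [start, start + n_ctx), regardless of where a
        // <|im_start|> ... <|im_end|> block begins or ends in the data.
        printf("evaluating chunk of %d tokens at offset %zu\n", n_ctx, start);
    }
    return 0;
}
```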
Side note: The tokenization took 500x longer with parse_special = true.
Maybe it's worth investigating why the PPL went up when special tokens were enabled, and whether special token parsing could improve imatrix results? A reason why it might be disabled is stated here.