No special token handling in imatrix, beam-search and others #6804

Closed
BrickBee opened this issue Apr 21, 2024 · 6 comments

Comments

@BrickBee

Some models are extremely sensitive to the prompt format being correct; without it they generate gibberish.
beam-search calls llama_tokenize with parse_special = false. Once I switched that to true, the special tokens in my prompts were parsed correctly and the model generated reasonable output.

It is also set to false in the imatrix generation. As a result, sample data built from common chat and instruction datasets in the model's prompt format is not tokenized the way the model sees it during regular inference. Shouldn't it matter that no real prompt formats are evaluated during imatrix generation?

To get a better idea of the impact, I tested this with the perplexity measurement, which also does not parse special tokens. In a quick ChatML test with CodeQwen-1.5, the perplexity went up by 40% once special tokens were parsed. Maybe that's due to the raw chunking, which evaluates multiple prompts at once and also breaks them in the middle?
Side note: the tokenization took 500x longer with parse_special = true.

Maybe it's worth investigating why the PPL went up when special tokens were enabled, and whether special-token parsing could improve imatrix results? A reason why it might be disabled is stated here.
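
To make the mismatch concrete, here is a minimal sketch of the comparison (an illustration, not code from this issue; it assumes the llama_tokenize and model-loading helpers from llama.cpp's common library of roughly this era, exact names and signatures vary between builds, and the ChatML prompt is just an example):

```cpp
// Sketch: tokenize a ChatML-style prompt with and without special-token parsing.
// Assumes llama.cpp common-library helpers; the exact API differs between builds.
#include "common.h"
#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

static void dump_tokens(const llama_context * ctx, const std::string & text, bool parse_special) {
    // add_special = true lets the tokenizer add BOS/EOS as the model expects;
    // parse_special controls whether marker strings such as "<|im_start|>"
    // are mapped to their special-token ids or tokenized as plain text.
    std::vector<llama_token> toks = llama_tokenize(ctx, text, /*add_special=*/true, parse_special);
    printf("parse_special=%d -> %zu tokens:", (int) parse_special, toks.size());
    for (llama_token t : toks) {
        printf(" %d", t);
    }
    printf("\n");
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }
    llama_backend_init();
    llama_model   * model = llama_load_model_from_file(argv[1], llama_model_default_params());
    llama_context * ctx   = llama_new_context_with_model(model, llama_context_default_params());

    // Example ChatML prompt, similar to what a chat/instruction dataset would contain.
    const std::string prompt = "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n";

    dump_tokens(ctx, prompt, /*parse_special=*/false); // what imatrix/perplexity/beam-search did
    dump_tokens(ctx, prompt, /*parse_special=*/true);  // what the model sees during chat inference

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

With parse_special = false the ChatML markers are split into ordinary text tokens; with true they collapse to single special-token ids, so the two runs produce different token counts and ids for the same prompt.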

@Dampfinchen

Dampfinchen commented Apr 21, 2024

Hmm, this might be the reason I'm seeing reports from people saying imatrix doesn't work properly with Llama 3 models yet (low-quality output).

@candre23

Not sure if this is related to this issue specifically, but iQ3 quants of L3 are definitely broken right now. Strangely, iQ4 quants seem OK. Here are some PPL calcs I ran on a few different quants a couple of days ago. As you can see, iQ3xs is totally borked.

Llama 3 70b

Q2k       Final estimate: PPL = 12.8166 +/- 0.22308
iQ3xs     Final estimate: PPL = 552.0451 +/- 12.31402
Q3km      Final estimate: PPL = 10.0883 +/- 0.17180
iQ4xs     Final estimate: PPL = 9.4791 +/- 0.15818
Q4km      Final estimate: PPL = 6.2366 +/- 0.09157
Q5km      Final estimate: PPL = 6.1289 +/- 0.08950

Llama 3 8b

Q2k       Final estimate: PPL = 12.9246 +/- 0.20490
iQ3xs     Final estimate: PPL = 109.0483 +/- 2.12390
Q3km      Final estimate: PPL = 10.2122 +/- 0.16656
iQ4xs     Final estimate: PPL = 9.7191 +/- 0.15669
Q4km      Final estimate: PPL = 9.6414 +/- 0.15494
Q5km      Final estimate: PPL = 9.5472 +/- 0.15408
Q6k       Final estimate: PPL = 9.5235 +/- 0.15396
Q8        Final estimate: PPL = 9.5181 +/- 0.15410
fp16      Final estimate: PPL = 9.5158 +/- 0.15418

@Dampfinchen

@ikawrakow Any idea what could cause this? Have you done any tests so far in regards to imatrix and IQ quants for Llama 3?

@David-AU-github

Try quantizing with the flag:
--leave-output-tensor
For iq3xs ... it may help?
This flag raises the file size slightly, but keeps the output tensor at the original fp16/fp32 regardless of imatrix or regular quantization.
Are the IQ2-ish quants also broken?

@ikawrakow
Contributor

ikawrakow commented Apr 23, 2024

> @ikawrakow Any idea what could cause this? Have you done any tests so far in regards to imatrix and IQ quants for Llama 3?

@Dampfinchen

I have moved on to other stuff, so the llama.cpp community will have to sort it out.

Having said that, I'm of course not completely oblivious to the hype around L3, so I did some quick tests myself. Everything was done with build 8b1b1f4. I cannot confirm the PPL values being reported here. For instance, for L3-70B-Instruct and a context of 512, I get these results for wiki.test.raw (based on the PPL values quoted above, I assume we are looking at the instruct-tuned rather than the base models):

Quantization    PPL
Q8_0            5.6973 +/- 0.03905
Q5_K_S          5.8738 +/- 0.04056
IQ4_XS          6.0607 +/- 0.04211
IQ3_XS          6.4845 +/- 0.04530

No imatrix for Q8_0, imatrix from wiki.train.raw using 100 chunks at a context of 4096 for the others. The L3-8B-Instruct results I get are also different: PPL = 9.1759 +/- 0.06994 for fp16, and PPL = 9.9989 +/- 0.07473 for IQ3_XS.

My observation is that quantization errors as measured by PPL (i.e., PPL(Q)/PPL(fp16) - 1) are significantly higher for L3 compared to all other models I have experimented with in the past. My hand-wavy explanation is that this is likely due to the much larger vocabulary. But, as far as I can tell, the larger PPL quantization errors do not translate into worse performance in practice. For instance, the HellaSwag score of IQ3_XS is only 0.7 percentage points lower than Q8_0 after 2000 tasks, despite the ~14% higher perplexity.
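
Plugging the quoted values into that error definition gives roughly the following (using Q8_0 as a stand-in for fp16 on the 70B model, since no fp16 value is quoted for it):

$$
\frac{6.4845}{5.6973} - 1 \approx 0.138 \ \text{(L3-70B, IQ3\_XS vs Q8\_0)}, \qquad \frac{9.9989}{9.1759} - 1 \approx 0.090 \ \text{(L3-8B, IQ3\_XS vs fp16)}.
$$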


This issue was closed because it has been inactive for 14 days since being marked as stale.
