No special token handling in imatrix, beam-search and others #6804

Closed
BrickBee opened this issue Apr 21, 2024 · 6 comments

Comments

@BrickBee

Some models are extremely sensitive to the prompt format being correct; without it they generate gibberish.
beam-search calls llama_tokenize with parse_special = false. Once I switched that to true, the special tokens in my prompts were parsed correctly and the model generated reasonable output.

It is also set to false in the imatrix generation. As a result, sample data built from common chat and instruction datasets in the model's prompt format is not tokenized the way the model sees it during regular inference. Shouldn't it matter that no real prompt formats are evaluated during imatrix generation?

To get a better idea of the impact, I tested this with the perplexity measurement, which also does not parse special tokens. In a quick ChatML test with CodeQwen-1.5, the perplexity went up by 40% once special tokens were parsed. Maybe that's due to the raw chunking, which evaluates multiple prompts at once and also breaks them in the middle?
Side note: the tokenization took 500x longer with parse_special = true.

Maybe it's worth investigating why the PPL went up when special tokens were enabled, and whether special-token parsing could improve imatrix results? A reason why it might be disabled is stated here.
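
To make the mismatch concrete, here is a minimal sketch of the comparison (an illustration, not code from this issue; it assumes the llama_tokenize and model-loading helpers from llama.cpp's common library of roughly this era, exact names and signatures vary between builds, and the ChatML prompt is just an example):

```cpp
// Sketch: tokenize a ChatML-style prompt with and without special-token parsing.
// Assumes llama.cpp common-library helpers; the exact API differs between builds.
#include "common.h"
#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

static void dump_tokens(const llama_context * ctx, const std::string & text, bool parse_special) {
    // add_special = true lets the tokenizer add BOS/EOS as the model expects;
    // parse_special controls whether marker strings such as "<|im_start|>"
    // are mapped to their special-token ids or tokenized as plain text.
    std::vector<llama_token> toks = llama_tokenize(ctx, text, /*add_special=*/true, parse_special);
    printf("parse_special=%d -> %zu tokens:", (int) parse_special, toks.size());
    for (llama_token t : toks) {
        printf(" %d", t);
    }
    printf("\n");
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }
    llama_backend_init();
    llama_model   * model = llama_load_model_from_file(argv[1], llama_model_default_params());
    llama_context * ctx   = llama_new_context_with_model(model, llama_context_default_params());

    // Example ChatML prompt, similar to what a chat/instruction dataset would contain.
    const std::string prompt = "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n";

    dump_tokens(ctx, prompt, /*parse_special=*/false); // what imatrix/perplexity/beam-search did
    dump_tokens(ctx, prompt, /*parse_special=*/true);  // what the model sees during chat inference

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

With parse_special = false the ChatML markers are split into ordinary text tokens; with true they collapse to single special-token ids, so the two runs produce different token counts and ids for the same prompt.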

@Dampfinchen

Dampfinchen commented Apr 21, 2024

Hmm, this might be the reason I'm seeing reports from people saying imatrix doesn't work properly with Llama 3 models yet (low-quality output).

@candre23

Not sure if this is related to this issue specifically, but iQ3 quants of L3 are definitely broken right now. Strangely, iQ4 quants seem OK. Here are some PPL calcs I ran on a few different quants a couple of days ago. As you can see, iQ3xs is totally borked.

Llama 3 70b

Q2k       Final estimate: PPL = 12.8166 +/- 0.22308
iQ3xs     Final estimate: PPL = 552.0451 +/- 12.31402
Q3km      Final estimate: PPL = 10.0883 +/- 0.17180
iQ4xs     Final estimate: PPL = 9.4791 +/- 0.15818
Q4km      Final estimate: PPL = 6.2366 +/- 0.09157
Q5km      Final estimate: PPL = 6.1289 +/- 0.08950

Llama 3 8b

Q2k       Final estimate: PPL = 12.9246 +/- 0.20490
iQ3xs     Final estimate: PPL = 109.0483 +/- 2.12390
Q3km      Final estimate: PPL = 10.2122 +/- 0.16656
iQ4xs     Final estimate: PPL = 9.7191 +/- 0.15669
Q4km      Final estimate: PPL = 9.6414 +/- 0.15494
Q5km      Final estimate: PPL = 9.5472 +/- 0.15408
Q6k       Final estimate: PPL = 9.5235 +/- 0.15396
Q8        Final estimate: PPL = 9.5181 +/- 0.15410
fp16      Final estimate: PPL = 9.5158 +/- 0.15418

@Dampfinchen

@ikawrakow Any idea what could cause this? Have you done any tests so far in regards to imatrix and IQ quants for Llama 3?

@David-AU-github

Try quantizing with the flag:
--leave-output-tensor
For iq3xs ... it may help?
This flag raises the file size slightly, but keeps the output tensor at the original fp16/fp32 regardless of imatrix or regular quantization.
Are the IQ2-ish quants also broken?

@ikawrakow
Contributor

ikawrakow commented Apr 23, 2024

> @ikawrakow Any idea what could cause this? Have you done any tests so far in regards to imatrix and IQ quants for Llama 3?

@Dampfinchen

I have moved on to other stuff, so the llama.cpp community will have to sort it out.

Having said that, I'm of course not completely oblivious to the hype around L3, so I did some quick tests myself. Everything was done with build 8b1b1f4. I cannot confirm the PPL values being reported here. For instance, for L3-70B-Instruct and a context of 512, I get these results for wiki.test.raw (based on the PPL values quoted above, I assume we are looking at the instruct-tuned rather than the base models):

Quantization    PPL
Q8_0            5.6973 +/- 0.03905
Q5_K_S          5.8738 +/- 0.04056
IQ4_XS          6.0607 +/- 0.04211
IQ3_XS          6.4845 +/- 0.04530

No imatrix for Q8_0, imatrix from wiki.train.raw using 100 chunks at a context of 4096 for the others. The L3-8B-Instruct results I get are also different: PPL = 9.1759 +/- 0.06994 for fp16, and PPL = 9.9989 +/- 0.07473 for IQ3_XS.

My observation is that quantization errors as measured by PPL (i.e., PPL(Q)/PPL(fp16) - 1) are significantly higher for L3 compared to all other models I have experimented with in the past. My hand-wavy explanation is that this is likely due to the much larger vocabulary. But, as far as I can tell, the larger PPL quantization errors do not translate into worse performance in practice. For instance, the HellaSwag score of IQ3_XS is only 0.7 percentage points lower than Q8_0 after 2000 tasks, despite the ~14% higher perplexity.
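
Plugging the quoted values into that error definition gives roughly the following (using Q8_0 as a stand-in for fp16 on the 70B model, since no fp16 value is quoted for it):

$$
\frac{6.4845}{5.6973} - 1 \approx 0.138 \ \text{(L3-70B, IQ3\_XS vs Q8\_0)}, \qquad \frac{9.9989}{9.1759} - 1 \approx 0.090 \ \text{(L3-8B, IQ3\_XS vs fp16)}.
$$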


This issue was closed because it has been inactive for 14 days since being marked as stale.
