Qwen 2 inference problem #493

Closed · Sadeghi85 opened this issue Jun 7, 2024 · Discussed in #492 · 15 comments

Comments

@Sadeghi85

Discussed in #492

Originally posted by Sadeghi85 June 7, 2024
I tried Qwen 2 7B with ExLlamaV2, but the output is gibberish. There is a related discussion over at llama.cpp: ggml-org/llama.cpp#7805

@Sadeghi85
Author

There is also an issue on the Qwen repo:

QwenLM/Qwen3#485

@turboderp
Member

So I've narrowed it down to the attention function, and I've committed a possible solution to the dev branch. I say possible because it comes down to some internal switching logic in PyTorch that I'm not entirely sure about. But since Torch 2.3.0 finally supports lower-right causal masking, ExLlama can use SDPA instead of matmul attention. SDPA upcasts in the fused attention kernel, which prevents the overflow, and at least Qwen2-7B seems to be working without flash-attn.
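For reference, here's a rough sketch of the kind of call this enables (not the actual ExLlama code; shapes are just illustrative, and a GQA model would have fewer K/V heads), using the lower-right-aligned causal bias that PyTorch 2.3 added so the mask stays correct when the query is shorter than the cached keys/values:

```python
# Illustrative sketch only, not ExLlamaV2's implementation.
import torch
import torch.nn.functional as F
from torch.nn.attention.bias import causal_lower_right  # new in PyTorch 2.3

device = "cuda" if torch.cuda.is_available() else "cpu"
bsz, n_heads, head_dim = 1, 28, 128   # Qwen2-7B-like shapes (illustrative)
q_len, kv_len = 16, 512               # new tokens vs. tokens already in the cache

q = torch.randn(bsz, n_heads, q_len,  head_dim, device=device, dtype=torch.float16)
k = torch.randn(bsz, n_heads, kv_len, head_dim, device=device, dtype=torch.float16)
v = torch.randn(bsz, n_heads, kv_len, head_dim, device=device, dtype=torch.float16)

# Lower-right alignment keeps the causal mask valid when q_len < kv_len.
# The fused SDPA kernels accumulate in FP32 internally, which is what avoids
# the FP16 overflow seen with explicit matmul attention on this model.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_lower_right(q_len, kv_len))
```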

I'm not able to test xformers since I can't find a prebuilt wheel and the dependencies are broken at the moment.

There still seem to be some issues with the Q4 cache, working on those.

@bartowski1182
Contributor

@turboderp any chance that upcasting would benefit P40 performance by using FP32? 👀

What specifically do you mean about the xformers dependencies being broken? I'm using torch 2.3.0 and xformers with "no" issues, but maybe I'm missing something.

@turboderp
Member

xformers was working, but I currently don't have it installed, and I can't install it because Arch updated me to CUDA 12.5 and gcc13. I can't downgrade, because earlier CUDA versions need gcc12, which I can't install alongside gcc13, and xformers refuses to compile because of incompatibilities with CUDA 12.5. So I'm kind of stuck unless I want to spend the next however many hours getting all the right versions of everything synced up. I guess with timeshift I'm pretty sure I won't completely brick my desktop, but it's still not a very appealing thought.

As for upcasting: no. Now that attention defaults to SDPA on Torch 2.3.0 it should run smoother, and there are other places I could switch over to FP32 compute, but the matmul kernels specifically would need some special attention to work in FP32. Perhaps it could be done... eh... so much else on the list too, though.

@bartowski1182
Contributor

Have you considered Docker? I run CUDA 12.2 in Docker with torch 2.3.0 and xformers; I can walk you through it. It probably wouldn't be your endgame solution, but it would help you figure this out.

If the P40 performance gain isn't basically free, I wouldn't bother. GGUF performance is good enough for that specific card; exllamav2 should just stay the SOTA for SOTA cards rather than bend over backwards for tiny gains on ancient cards lol. Was just curious.

@Ph0rk0z

Ph0rk0z commented Jun 8, 2024

I compile it in a conda environment to avoid this issue. For the P40, just xformers may be enough. I only have one pascal card left in use at the moment so I should try it and see what happens. For SD it automatically sped up inference regardless of my compute setting. But lots of other non "SOTA" cards benefit from xformers too.

@turboderp
Member

Added a Q8 cache mode now, which seems to work great with Qwen2-7B.

@bartowski1182
Contributor

Oh hell yes, been looking forward to Q8

@turboderp
Member

Q6 also works well with this model; it's available in v0.1.5 now.
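In case it helps anyone landing here, a minimal usage sketch, assuming the new Q6/Q8 cache classes follow the same pattern as ExLlamaV2Cache / ExLlamaV2Cache_Q4 (the model path is a placeholder):

```python
# Sketch of loading a model with a quantized KV cache in exllamav2 v0.1.5+.
# ExLlamaV2Cache_Q6 / ExLlamaV2Cache_Q8 are assumed to be exported like ExLlamaV2Cache_Q4.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer, ExLlamaV2Cache_Q6

config = ExLlamaV2Config("/path/to/Qwen2-7B-exl2")   # placeholder path
model = ExLlamaV2(config)

cache = ExLlamaV2Cache_Q6(model, lazy = True)        # or ExLlamaV2Cache_Q8 / ExLlamaV2Cache
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
```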

@remichu-ai

So at the moment, Q4 is not working but Q6 and Q8 are?

@turboderp
Member

turboderp commented Jun 10, 2024

Correct. Though it's worth noting that Qwen2-7B already has a very small cache. With FP16 precision it's 56 kB per token, vs. 128 kB per token for Llama3-8B (or 512 kB per token for Llama2-7B!).
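A quick back-of-the-envelope for where those numbers come from (assuming the published configs: Qwen2-7B has 28 layers and 4 KV heads, Llama3-8B has 32 layers and 8 KV heads, Llama2-7B has 32 layers and 32 KV heads, all with a head dim of 128):

```python
# Per-token KV cache size = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (FP16)
def kv_bytes_per_token(layers, kv_heads, head_dim=128, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

print(kv_bytes_per_token(28, 4) // 1024)    # Qwen2-7B:   56 kB
print(kv_bytes_per_token(32, 8) // 1024)    # Llama3-8B: 128 kB
print(kv_bytes_per_token(32, 32) // 1024)   # Llama2-7B: 512 kB
```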

So overall, Qwen2-7B with Q6 cache still uses about 30% less VRAM per token than Llama3-8B with Q4 cache. For precision, I did some quick HumanEval tests, and it's within the margin of error from Q6 and up:

| Model | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
|----------|------|------|--------|---------|---------------|
| Qwen2-7B | FP16 | Q4   | 19.74% | 46.34%  | 40.72 |
| Qwen2-7B | FP16 | Q6   | 61.65% | 81.70%  | 15.20 |
| Qwen2-7B | FP16 | Q8   | 62.37% | 81.09%  | 15.18 |
| Qwen2-7B | FP16 | FP16 | 61.16% | 82.31%  | 15.16 |

@Sadeghi85
Author

I tested v0.1.5 and it's working, thanks.

What is the difference between the 8-bit cache and the Q caches? In v0.1.5 only the 8-bit cache doesn't work for me; all the Q caches are working.

@turboderp
Member

The 8-bit mode is FP8, and it's deprecated. It performs worse than Q4 in every respect. But Q4 is very unreliable for Qwen2-7B.
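As a generic illustration (not what ExLlamaV2's kernels actually do) of why integer quantization with per-group scales tends to hold up better than a plain FP8 cast at the same bit width:

```python
# Toy comparison: FP8 round-trip vs. 8-bit group quantization with per-group scales.
# Purely illustrative; the real Q4/Q6/Q8 cache modes are implemented differently.
import torch

x = torch.randn(1, 4096) * 4.0  # stand-in for K/V activations (float32 for the toy math)

# (a) plain FP8 (e4m3) cast and back
fp8 = x.to(torch.float8_e4m3fn).to(torch.float32)

# (b) symmetric 8-bit quantization in groups of 32 with a per-group scale
g = x.view(-1, 32)
scale = g.abs().amax(dim=1, keepdim=True) / 127.0
q8 = (torch.round(g / scale).clamp(-127, 127) * scale).view_as(x)

print((x - fp8).abs().mean().item())  # FP8 error
print((x - q8).abs().mean().item())   # grouped int8 error, typically smaller
```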

@remichu-ai

Can I ask if Qwen2 72B is working with the Q4 cache? I just tried it, but it seemed to generate non-stop. It could also be an issue on my end.

@turboderp
Member

The 72B version seems to work fine with Q4:

| Model | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
|-----------|--------|------|-------|-------|-------|
| Qwen2-72B | 6.0bpw | Q4   | 70.36 | 87.19 | 10.31 |
| Qwen2-72B | 6.0bpw | Q6   | 69.32 | 85.36 | 10.26 |
| Qwen2-72B | 6.0bpw | Q8   | 71.28 | 85.36 | 10.23 |
| Qwen2-72B | 6.0bpw | FP16 | 70.8  | 83.5  | 10.17 |
