Qwen 2 inference problem #493

Closed · Sadeghi85 opened this issue Jun 7, 2024 · Discussed in #492 · 15 comments

Comments

@Sadeghi85

Discussed in #492

Originally posted by Sadeghi85 June 7, 2024
I tried Qwen 2 7B with ExLlamaV2, but the output is gibberish. There is a related discussion over at llama.cpp: ggml-org/llama.cpp#7805

@Sadeghi85
Author

There is also an issue on the Qwen repo:

QwenLM/Qwen3#485

@turboderp
Member

So I've narrowed it down to the attention function, and I've committed a possible solution to the dev branch. I say possible because it comes down to some internal switching logic in PyTorch that I'm not entirely sure about. But since Torch 2.3.0 finally supports lower-right causal masking, ExLlama can use SDPA instead of matmul attention. SDPA upcasts in the fused attention kernel, which prevents the overflow, and at least Qwen2-7B seems to be working without flash-attn.
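For reference, here's a rough sketch of the kind of call this enables (not the actual ExLlama code; shapes are just illustrative, and a GQA model would have fewer K/V heads), using the lower-right-aligned causal bias that PyTorch 2.3 added so the mask stays correct when the query is shorter than the cached keys/values:

```python
# Illustrative sketch only, not ExLlamaV2's implementation.
import torch
import torch.nn.functional as F
from torch.nn.attention.bias import causal_lower_right  # new in PyTorch 2.3

device = "cuda" if torch.cuda.is_available() else "cpu"
bsz, n_heads, head_dim = 1, 28, 128   # Qwen2-7B-like shapes (illustrative)
q_len, kv_len = 16, 512               # new tokens vs. tokens already in the cache

q = torch.randn(bsz, n_heads, q_len,  head_dim, device=device, dtype=torch.float16)
k = torch.randn(bsz, n_heads, kv_len, head_dim, device=device, dtype=torch.float16)
v = torch.randn(bsz, n_heads, kv_len, head_dim, device=device, dtype=torch.float16)

# Lower-right alignment keeps the causal mask valid when q_len < kv_len.
# The fused SDPA kernels accumulate in FP32 internally, which is what avoids
# the FP16 overflow seen with explicit matmul attention on this model.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_lower_right(q_len, kv_len))
```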

I'm not able to test xformers since I can't find a prebuilt wheel and the dependencies are broken at the moment.

There still seem to be some issues with the Q4 cache, working on those.

@bartowski1182
Contributor

@turboderp any chance that upcasting would benefit P40 performance by using FP32? 👀

What specifically do you mean about the xformers dependencies being broken? I'm using torch 2.3.0 and xformers with "no" issues, but maybe I'm missing something.

@turboderp
Member

xformers was working, but I currently don't have it installed, and I can't install it because Arch updated me to CUDA 12.5 and gcc13. I can't downgrade, because earlier CUDA versions need gcc12, which I can't install alongside gcc13, and xformers refuses to compile because of incompatibilities with CUDA 12.5. So I'm kind of stuck unless I want to spend the next however many hours getting all the right versions of everything synced up. I guess with timeshift I'm pretty sure I won't completely brick my desktop, but it's still not a very appealing thought.

As for upcasting: no. Now that attention defaults to SDPA on Torch 2.3.0 it should run smoother, and there are other places I could switch over to FP32 compute, but the matmul kernels specifically would need some special attention to work in FP32. Perhaps it could be done... eh... so much else on the list too, though.

@bartowski1182
Contributor

Have you considered Docker? I run CUDA 12.2 in Docker with torch 2.3.0 and xformers; I can walk you through it. It probably wouldn't be your endgame solution, but it would help you figure this out.

If the P40 performance gain isn't basically free, I wouldn't bother. GGUF performance is good enough for that specific card; exllamav2 should just stay the SOTA for SOTA cards rather than bend over backwards for tiny gains on ancient cards lol. Was just curious.

@Ph0rk0z

Ph0rk0z commented Jun 8, 2024

I compile it in a conda environment to avoid this issue. For the P40, just xformers may be enough. I only have one pascal card left in use at the moment so I should try it and see what happens. For SD it automatically sped up inference regardless of my compute setting. But lots of other non "SOTA" cards benefit from xformers too.

@turboderp
Member

Added a Q8 cache mode now, which seems to work great with Qwen2-7B.

@bartowski1182
Contributor

Oh hell yes, been looking forward to Q8

@turboderp
Member

Q6 also works well with this model; it's available in v0.1.5 now.
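In case it helps anyone landing here, a minimal usage sketch, assuming the new Q6/Q8 cache classes follow the same pattern as ExLlamaV2Cache / ExLlamaV2Cache_Q4 (the model path is a placeholder):

```python
# Sketch of loading a model with a quantized KV cache in exllamav2 v0.1.5+.
# ExLlamaV2Cache_Q6 / ExLlamaV2Cache_Q8 are assumed to be exported like ExLlamaV2Cache_Q4.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer, ExLlamaV2Cache_Q6

config = ExLlamaV2Config("/path/to/Qwen2-7B-exl2")   # placeholder path
model = ExLlamaV2(config)

cache = ExLlamaV2Cache_Q6(model, lazy = True)        # or ExLlamaV2Cache_Q8 / ExLlamaV2Cache
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
```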

@remichu-ai

So at the moment, Q4 is not working but Q6 and Q8 are?

@turboderp
Member

turboderp commented Jun 10, 2024

Correct. Though it's worth noting that Qwen2-7B already has a very small cache. With FP16 precision it's 56 kB per token, vs. 128 kB per token for Llama3-8B (or 512 kB per token for Llama2-7B!).
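A quick back-of-the-envelope for where those numbers come from (assuming the published configs: Qwen2-7B has 28 layers and 4 KV heads, Llama3-8B has 32 layers and 8 KV heads, Llama2-7B has 32 layers and 32 KV heads, all with a head dim of 128):

```python
# Per-token KV cache size = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (FP16)
def kv_bytes_per_token(layers, kv_heads, head_dim=128, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

print(kv_bytes_per_token(28, 4) // 1024)    # Qwen2-7B:   56 kB
print(kv_bytes_per_token(32, 8) // 1024)    # Llama3-8B: 128 kB
print(kv_bytes_per_token(32, 32) // 1024)   # Llama2-7B: 512 kB
```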

So overall, Qwen2-7B with Q6 cache still uses about 30% less VRAM per token than Llama3-8B with Q4 cache. For precision, I did some quick HumanEval tests, and it's within the margin of error from Q6 and up:

| Model | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
|----------|------|------|--------|---------|---------------|
| Qwen2-7B | FP16 | Q4   | 19.74% | 46.34%  | 40.72 |
| Qwen2-7B | FP16 | Q6   | 61.65% | 81.70%  | 15.20 |
| Qwen2-7B | FP16 | Q8   | 62.37% | 81.09%  | 15.18 |
| Qwen2-7B | FP16 | FP16 | 61.16% | 82.31%  | 15.16 |

@Sadeghi85
Author

I tested v0.1.5 and it's working, thanks.

What is the difference between the 8-bit cache and the Q caches? In v0.1.5 only the 8-bit cache doesn't work for me; all the Q caches are working.

@turboderp
Member

The 8-bit mode is FP8, and it's deprecated. It performs worse than Q4 in every respect. But Q4 is very unreliable for Qwen2-7B.
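As a generic illustration (not what ExLlamaV2's kernels actually do) of why integer quantization with per-group scales tends to hold up better than a plain FP8 cast at the same bit width:

```python
# Toy comparison: FP8 round-trip vs. 8-bit group quantization with per-group scales.
# Purely illustrative; the real Q4/Q6/Q8 cache modes are implemented differently.
import torch

x = torch.randn(1, 4096) * 4.0  # stand-in for K/V activations (float32 for the toy math)

# (a) plain FP8 (e4m3) cast and back
fp8 = x.to(torch.float8_e4m3fn).to(torch.float32)

# (b) symmetric 8-bit quantization in groups of 32 with a per-group scale
g = x.view(-1, 32)
scale = g.abs().amax(dim=1, keepdim=True) / 127.0
q8 = (torch.round(g / scale).clamp(-127, 127) * scale).view_as(x)

print((x - fp8).abs().mean().item())  # FP8 error
print((x - q8).abs().mean().item())   # grouped int8 error, typically smaller
```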

@remichu-ai

Can I ask if Qwen2 72B is working with the Q4 cache? I just tried it, but it seemed to generate non-stop. It could also be an issue on my end.

@turboderp
Member

The 72B version seems to work fine with Q4:

| Model | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
|-----------|--------|------|-------|-------|-------|
| Qwen2-72B | 6.0bpw | Q4   | 70.36 | 87.19 | 10.31 |
| Qwen2-72B | 6.0bpw | Q6   | 69.32 | 85.36 | 10.26 |
| Qwen2-72B | 6.0bpw | Q8   | 71.28 | 85.36 | 10.23 |
| Qwen2-72B | 6.0bpw | FP16 | 70.8  | 83.5  | 10.17 |
