Qwen 2 inference problem #493
There is also an issue on the Qwen repo.
So I've narrowed it down to the attention function, and I've committed a possible solution to the dev branch. I say possible because it comes down to some internal switching logic in PyTorch that I'm not entirely sure about. But basically, since Torch 2.3.0 now supports lower-right causal masking (finally!), ExLlama can use SDPA instead of matmul attention. SDPA upcasts inside the fused attention kernel, which prevents the overflow, and at least Qwen2-7B seems to be working without flash-attn.

I'm not able to test xformers since I can't find a prebuilt wheel and the dependencies are broken at the moment. There still seem to be some issues with the Q4 cache; working on those.
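For reference, here's roughly what that SDPA path looks like. This is a minimal sketch rather than ExLlama's actual code, and it assumes the `torch.nn.attention.bias.causal_lower_right` helper added in Torch 2.3.0:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention.bias import causal_lower_right  # added in torch 2.3.0

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Toy decode-step shapes (roughly Qwen2-7B-like): one new query token attending
# to a longer cached key/value sequence, which is where lower-right vs. upper-left
# causal alignment matters.
bsz, heads, q_len, kv_len, head_dim = 1, 28, 1, 128, 128
q = torch.randn(bsz, heads, q_len, head_dim, dtype=dtype, device=device)
k = torch.randn(bsz, heads, kv_len, head_dim, dtype=dtype, device=device)
v = torch.randn(bsz, heads, kv_len, head_dim, dtype=dtype, device=device)

# Lower-right alignment lets the single query row attend to every cached position.
# The fused SDPA kernels accumulate the softmax in higher precision internally,
# which is what avoids the FP16 overflow that unfused matmul attention can hit.
mask = causal_lower_right(q_len, kv_len)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([1, 28, 1, 128])
```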
@turboderp any chance that upcasting would benefit P40 performance by using FP32? 👀 What do you mean specifically by the xformers dependencies being broken? I'm using torch 2.3.0 and xformers with "no" issues, but maybe I'm missing something.
xformers was working, but I currently don't have it installed and I can't install it, because Arch updated me to CUDA 12.5 and gcc13, and I can't downgrade since earlier CUDA versions need gcc12, which I can't install alongside gcc13. xformers refuses to compile because of incompatibilities with CUDA 12.5, so I'm kinda stuck unless I want to spend the next however many hours getting all the right versions of everything synced up. With timeshift I'm pretty sure I won't completely brick my desktop, but it's still not a very appealing thought.

As for upcasting: no. Now that it defaults to SDPA on Torch 2.3.0, attention should run smoother, and there are other places I could switch over to FP32 compute, but the matmul kernels specifically would need some special attention to work in FP32. Perhaps it could be done... so much else on the list, though.
Have you considered Docker? I run CUDA 12.2 in Docker with torch 2.3.0 and xformers; I can walk you through it. It probably wouldn't be your endgame solution, but it would help you figure this out.

If the P40 gains aren't basically free I wouldn't bother. GGUF performance is good enough for that specific card; exllamav2 should stay the SOTA option for SOTA cards rather than bend over backwards for tiny gains from ancient cards lol, was just curious.
I compile it in a conda environment to avoid this issue. For the P40, xformers alone may be enough. I only have one Pascal card left in use at the moment, so I should try it and see what happens. For SD it automatically sped up inference regardless of my compute setting, and lots of other non-"SOTA" cards benefit from xformers too.
Added a Q8 cache mode now, which seems to work great with Qwen2-7B.
Oh hell yes, been looking forward to Q8 |
Q6 also works well with this model; it's available in v0.1.5 now.
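For anyone who wants to try the quantized cache modes, this is roughly what it looks like with the library's standard loading pattern. The model path is a placeholder, and the class names (`ExLlamaV2Cache_Q6`, `ExLlamaV2Cache_Q8`) are as I understand the v0.1.5 API, so double-check against the repo's examples:

```python
from exllamav2 import (
    ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer,
    ExLlamaV2Cache_Q6, ExLlamaV2Cache_Q8,  # quantized K/V cache variants
)
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/Qwen2-7B-exl2")  # placeholder model directory
model = ExLlamaV2(config)

# Pick the cache precision here; swap in ExLlamaV2Cache_Q6 (or the default FP16
# ExLlamaV2Cache) to trade accuracy against VRAM per token.
cache = ExLlamaV2Cache_Q8(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Hello, Qwen2!", max_new_tokens=64))
```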
So at the moment, Q4 is not working but Q6 and Q8 are?
Correct. Though it's worth noting that Qwen2-7B already has a very small cache: with FP16 precision it's 56 kB per token, vs. 128 kB per token for Llama3-8B (or 512 kB per token for Llama2-7B!). So overall, Qwen2-7B with Q6 cache still uses about 30% less VRAM per token than Llama3-8B with Q4 cache.

As for precision, I did some quick HumanEval tests, and it's within the margin of error from Q6 and up.
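To put rough numbers on that VRAM comparison, here is a back-of-envelope sketch that just scales the FP16 figures by quantized bits / 16 and ignores overhead such as group scales:

```python
# Per-token K/V cache sizes quoted above, in kB at FP16.
qwen2_7b_fp16 = 56
llama3_8b_fp16 = 128

# Approximate quantized sizes by scaling with bits / 16.
qwen2_q6 = qwen2_7b_fp16 * 6 / 16     # ~21 kB per token
llama3_q4 = llama3_8b_fp16 * 4 / 16   # ~32 kB per token

saving = 1 - qwen2_q6 / llama3_q4
print(f"Qwen2-7B @ Q6 uses ~{saving:.0%} less cache per token than Llama3-8B @ Q4")
# ~34%, i.e. "about 30% less" as stated above
```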
I tested v0.1.5 and it's working, thanks. What is the difference between the 8-bit cache and the Q caches? In v0.1.5, only 8-bit doesn't work for me; all the Q caches are working.
The 8-bit mode is FP8, and it's deprecated. It performs worse than Q4 in every respect. But Q4 is very unreliable for Qwen2-7B. |
Can I ask if Qwen2-72B is working with the Q4 cache? I just tried it, but it seems to run into non-stop generation. It could also be an issue on my end.
The 72B version seems to work fine with Q4.
Discussed in #492
Originally posted by Sadeghi85 June 7, 2024
I tried Qwen2-7B with ExLlamaV2, but the output is gibberish. There is a discussion over at llama.cpp: ggml-org/llama.cpp#7805