CUDA: fastdiv, launch bounds for mmvq + q8_1 quant #15802
Conversation
Thanks for the initial efforts! Spotted two more instances where we could apply fastdiv/fastmodulo as well if I'm not mistaken.
ggml/src/ggml-cuda/mmvq.cu
const int sample_y = sample_dst;
const uint32_t channel_dst = blockIdx.y;
const uint32_t channel_x = ncols_dst == 1 && ids ? ids[channel_dst] : fastdiv(channel_dst, channel_ratio);
const uint32_t channel_y = ncols_dst == 1 && ids ? channel_dst % nchannels_y : channel_dst;
Have we tried applying fastmodulo to nchannels_y?
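For reference, a minimal sketch of the magic-number division trick behind fastdiv/fastmodulo. The real helpers come from #15715; the names, formula details and overflow bound here are illustrative assumptions, not the PR's code.

#include <cstdint>
#include <cuda_runtime.h>

// Host side: precompute a magic multiplier mp and shift L for a runtime divisor d,
// so the device code never executes a hardware divide. Assumes d >= 1.
static uint3 make_fastdiv_values(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint32_t{1} << L) < d) {
        ++L;
    }
    const uint32_t mp = (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1);
    return make_uint3(mp, L, d); // x = magic multiplier, y = shift, z = original divisor
}

// Device side: n / d becomes one __umulhi, one add and one shift.
static __device__ __forceinline__ uint32_t fastdiv(uint32_t n, const uint3 fd) {
    return (__umulhi(n, fd.x) + n) >> fd.y;
}

// n % d reuses the quotient: one extra multiply and subtract.
static __device__ __forceinline__ uint32_t fastmodulo(uint32_t n, const uint3 fd) {
    return n - fastdiv(n, fd) * fd.z;
}

// Assumed usage for the line above: compute make_fastdiv_values(nchannels_y) on the host,
// pass the uint3 into the kernel, and replace channel_dst % nchannels_y with
// fastmodulo(channel_dst, nchannels_y_fd). Valid for n well below 2^31, which holds for
// these tensor indices.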
const uint32_t ncols_x, const uint32_t nchannels_y, const uint32_t stride_row_x, const uint32_t stride_col_y,
const uint32_t stride_col_dst, const uint3 channel_ratio, const uint32_t stride_channel_x,
const uint32_t stride_channel_y, const uint32_t stride_channel_dst, const uint3 sample_ratio,
const uint32_t stride_sample_x, const uint32_t stride_sample_y, const uint32_t stride_sample_dst) {

constexpr int qk = ggml_cuda_type_traits<type>::qk;
qk, qi, vdr and QK8_1 are all compile-time constants. We should therefore be able to replace the divisions and modulo used to determine kbx, kby and kqs in the first for loop with fastdiv/fastmodulo as well (lines 178-182).
Those are all powers of 2 though, shouldn't the compiler be able to replace the divisions/modulos with shifts/bitwise ands? Is the use of signed integers in this context problematic?
Those are all powers of 2 though, shouldn't the compiler be able to replace the divisions/modulos with shifts/bitwise ands? Is the use of signed integers in this context problematic?
If they are indeed powers of 2 the compiler will do the correct optimizations for us at compile time (see godbolt). While it takes 2 more instructions to do signed integer division/modulo compared to unsigned (see this Stack Overflow post if interested in the details), that should be negligible for our use case.
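For illustration, a hypothetical snippet (not code from the PR) of what the power-of-2 case boils down to: the divisor is a compile-time constant, so the compiler lowers both variants to shifts and bitwise ANDs, with the signed one needing a small fix-up for negative operands.

#include <cstdint>

__device__ __forceinline__ void split_unsigned(uint32_t i, uint32_t & kbx, uint32_t & kqs) {
    constexpr uint32_t QK = 32; // compile-time power-of-2 block size, like qk/QK8_1
    kbx = i / QK;               // lowered to i >> 5, no divide instruction
    kqs = i % QK;               // lowered to i & 31
}

__device__ __forceinline__ void split_signed(int i, int & kbx, int & kqs) {
    constexpr int QK = 32;
    kbx = i / QK;               // still a shift, plus a couple of instructions to round toward zero for negative i
    kqs = i % QK;               // likewise an AND plus sign handling
}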
I also enabled Performance
Thanks for the effort, looks good from my side!
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant

This PR uses the new fastdiv code from #15715 for MMVQ + q8_1 quantization, and also adds __launch_bounds__ to the quantization kernel.

Performance

fastdiv alone is consistently faster; the addition of __launch_bounds__ is faster on average and always faster for batch size 1, which is the most important use case.
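A rough sketch of the __launch_bounds__ part, with a hypothetical kernel and an assumed block size rather than the PR's actual quantization kernel:

#include <cstdint>
#include <cuda_runtime.h>

#define EXAMPLE_BLOCK_SIZE 256 // assumed block size for this example

// __launch_bounds__(maxThreadsPerBlock) tells the compiler the kernel is never launched
// with more than this many threads per block, so it can budget registers for that case
// instead of the architectural maximum, which is where the speedup comes from.
static __global__ void __launch_bounds__(EXAMPLE_BLOCK_SIZE)
scale_f32(const float * __restrict__ x, float * __restrict__ y, const float scale, const int64_t n) {
    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    y[i] = x[i] * scale;
}

// The host-side launch has to respect the bound, e.g.:
//   const int64_t nblocks = (n + EXAMPLE_BLOCK_SIZE - 1) / EXAMPLE_BLOCK_SIZE;
//   scale_f32<<<nblocks, EXAMPLE_BLOCK_SIZE, 0, stream>>>(x, y, scale, n);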