CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ #12098
Merged
Fixes LostRuins#1390 .
The logic for the combination of V100s and `GGML_CUDA_FORCE_MMQ` seems to be wrong on master. By default, when compiling without `GGML_CUDA_FORCE_MMQ`, the MMQ kernels should only be compiled for batch sizes up to `MMQ_DP4A_MAX_BATCH_SIZE` if FP16 tensor core hardware is available but int8 tensor core hardware is not (basically only V100s); template specializations for higher batch sizes will never be used. However, the condition for this seems to have been inverted: without `GGML_CUDA_FORCE_MMQ` the unneeded template specializations were being compiled, and with it the host code could attempt to launch nonexistent kernels.
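A minimal sketch of the intended decision, using illustrative names rather than the exact ggml-cuda identifiers (`mmq_needs_large_batch_kernels`, the capability flags, and the constant value are assumptions for illustration):

```cpp
// Hedged sketch, not the actual ggml-cuda code: whether MMQ template
// specializations for large batch sizes need to be compiled/used.
constexpr int MMQ_DP4A_MAX_BATCH_SIZE = 64; // illustrative value only

bool mmq_needs_large_batch_kernels(bool fp16_mma_available, bool int8_mma_available, bool force_mmq) {
    if (force_mmq) {
        // With GGML_CUDA_FORCE_MMQ, MMQ may be used for all batch sizes,
        // so the large-batch specializations must exist.
        return true;
    }
    // FP16 tensor cores without int8 tensor cores: essentially only V100.
    // On such devices MMQ is only used up to MMQ_DP4A_MAX_BATCH_SIZE,
    // so the large-batch specializations are never needed.
    const bool v100_like = fp16_mma_available && !int8_mma_available;
    return !v100_like;
}
```

Per the PR description, the condition on master was effectively inverted relative to this: the unneeded specializations were built in the default configuration, while with `GGML_CUDA_FORCE_MMQ` the host code could try to launch kernels that had never been compiled.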