Summary
Using --flash-attn on with --spec-type draft-mtp causes a fatal crash in
ggml_cuda_flash_attn_ext_mma_f16 on SM 8.7 (Jetson Orin Nano). The crash
occurs during common_context_can_seq_rm after the MTP draft context
initializes. Workaround: --flash-attn off — MTP works correctly without FA.
Environment
- Hardware: Jetson Orin Nano Developer Kit
- CUDA SM: 8.7
- JetPack: 7.2 / CUDA 13.2
- llama.cpp: b9592
- Build flags: GGML_CUDA=ON GGML_CUDA_F16=ON GGML_CPU_AARCH64=ON CMAKE_CUDA_ARCHITECTURES=87
- Model: Gemma 4 E4B (unsloth GGUF, gemma4 architecture)
- Draft model: MTP head GGUF (mtp-gemma-4-E4B-it.gguf)
Reproduction
llama-server \
-m gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf \
--model-draft mtp-gemma-4-E4B-it.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--n-gpu-layers 999 \
--ctx-size 16384 \
--parallel 1 \
--no-mmap \
--batch-size 128 \
--ubatch-size 128 \
--flash-attn on \
--cache-type-k q4_0 \
--fit off \
--no-warmup \
--swa-full \
--cache-prompt \
--context-shift \
--keep 256 \
--cache-ram 512 \
--host 0.0.0.0 \
--port 8081
Observed behavior
Fatal abort in fattn.cu:110 during startup. MTP draft context loads and logs
shared KV cache layers (see below), then common_context_can_seq_rm triggers
a decode that hits the MMA F16 flash attention kernel and crashes.
Notable pre-crash warnings:
W llama_kv_cache: layer 3: sharing with layer 41.
W llama_kv_cache: layer 0: sharing with layer 40. ← same pointer
W llama_kv_cache: layer 1: sharing with layer 40. ← same pointer
W llama_kv_cache: layer 2: sharing with layer 40. ← same pointer
Stack trace
#3 ggml_cuda_flash_attn_ext_mma_f16(ggml_backend_cuda_context&, ggml_tensor*)
#4 ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*)
#5 ggml_backend_sched_graph_compute_async
#6 llama_context::graph_compute(ggml_cgraph*, bool)
#7 llama_context::process_ubatch(...)
#8 llama_context::decode(llama_batch const&)
#9 llama_decode
#10 common_context_can_seq_rm(llama_context*)
#11 server_context_impl::load_model(common_params&)
Expected behavior
FA should work alongside MTP, or fail gracefully rather than aborting.
Workaround
--flash-attn off — MTP initializes and runs correctly. Confirmed ~1.4×
speedup (36–42% draft acceptance, ~24 t/s effective vs ~17 t/s baseline)
on this hardware despite FA being disabled.
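For reference, the workaround amounts to the reproduction command with only --flash-attn changed to off. The sketch below wraps it in a function. The names run_mtp_without_fa and LLAMA_SERVER are hypothetical conveniences, not part of llama.cpp; every flag is copied unchanged from the reproduction command.

```shell
# Sketch of the workaround: identical to the reproduction command except
# that --flash-attn is set to off.
# Assumptions: run_mtp_without_fa and LLAMA_SERVER are illustrative names;
# LLAMA_SERVER defaults to llama-server found on PATH.
run_mtp_without_fa() {
  "${LLAMA_SERVER:-llama-server}" \
    -m gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf \
    --model-draft mtp-gemma-4-E4B-it.gguf \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    --n-gpu-layers 999 \
    --ctx-size 16384 \
    --parallel 1 \
    --no-mmap \
    --batch-size 128 \
    --ubatch-size 128 \
    --flash-attn off \
    --cache-type-k q4_0 \
    --fit off \
    --no-warmup \
    --swa-full \
    --cache-prompt \
    --context-shift \
    --keep 256 \
    --cache-ram 512 \
    --host 0.0.0.0 \
    --port 8081 \
    "$@"   # any extra arguments are appended after the copied flags
}
```

Calling run_mtp_without_fa starts the server with FA disabled. Setting LLAMA_SERVER to a different path selects another build, which may help when comparing builds.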
Notes
The crash is reproducible 100% of the time with FA on + MTP. --fit off and
--no-warmup do not prevent it: the crash occurs in common_context_can_seq_rm,
which runs unconditionally after draft context init.