
[CUDA] FA+MTP crash: ggml_cuda_flash_attn_ext_mma_f16 fatal error in common_context_can_seq_rm on SM87 (Jetson Orin) #24457

Description

@rivascd

Summary

Using --flash-attn on with --spec-type draft-mtp causes a fatal crash in
ggml_cuda_flash_attn_ext_mma_f16 on SM 8.7 (Jetson Orin Nano). The crash
occurs during common_context_can_seq_rm after the MTP draft context
initializes. Workaround: --flash-attn off — MTP works correctly without FA.

Environment

  • Hardware: Jetson Orin Nano Developer Kit
  • CUDA SM: 8.7
  • JetPack: 7.2 / CUDA 13.2
  • llama.cpp: b9592
  • Build flags: GGML_CUDA=ON GGML_CUDA_F16=ON GGML_CPU_AARCH64=ON CMAKE_CUDA_ARCHITECTURES=87
  • Model: Gemma 4 E4B (unsloth GGUF, gemma4 architecture)
  • Draft model: MTP head GGUF (mtp-gemma-4-E4B-it.gguf)

Reproduction

llama-server \
  -m gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf \
  --model-draft mtp-gemma-4-E4B-it.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --n-gpu-layers 999 \
  --ctx-size 16384 \
  --parallel 1 \
  --no-mmap \
  --batch-size 128 \
  --ubatch-size 128 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --fit off \
  --no-warmup \
  --swa-full \
  --cache-prompt \
  --context-shift \
  --keep 256 \
  --cache-ram 512 \
  --host 0.0.0.0 \
  --port 8081
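
To compare the two flash attention settings, the command above can be wrapped so that only --flash-attn changes between runs. The sketch below is one way to do that. build_cmd and the FLASH_ATTN variable are hypothetical names introduced here, and the script only prints the command rather than launching the server.

```shell
#!/bin/sh
# FLASH_ATTN is a hypothetical variable for this sketch; "off" is the reported workaround.
FLASH_ATTN="${FLASH_ATTN:-off}"

# Print the llama-server command from this report with the given --flash-attn value.
# All other flags are copied unchanged from the reproduction command.
build_cmd() {
  set -- llama-server \
    -m gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf \
    --model-draft mtp-gemma-4-E4B-it.gguf \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    --n-gpu-layers 999 \
    --ctx-size 16384 \
    --parallel 1 \
    --no-mmap \
    --batch-size 128 \
    --ubatch-size 128 \
    --flash-attn "$1" \
    --cache-type-k q4_0 \
    --fit off \
    --no-warmup \
    --swa-full \
    --cache-prompt \
    --context-shift \
    --keep 256 \
    --cache-ram 512 \
    --host 0.0.0.0 \
    --port 8081
  echo "$*"
}

build_cmd "$FLASH_ATTN"
```

Running it with FLASH_ATTN=on prints the crashing configuration; the default prints the workaround configuration. Keeping every other flag fixed means any difference in behavior can be attributed to the flash attention setting alone.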

Observed behavior

Fatal abort in fattn.cu:110 during startup. MTP draft context loads and logs
shared KV cache layers (see below), then common_context_can_seq_rm triggers
a decode that hits the MMA F16 flash attention kernel and crashes.

Notable pre-crash warnings:

W llama_kv_cache: layer 3: sharing with layer 41.
W llama_kv_cache: layer 0: sharing with layer 40. ← same pointer
W llama_kv_cache: layer 1: sharing with layer 40. ← same pointer
W llama_kv_cache: layer 2: sharing with layer 40. ← same pointer
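
The sharing pattern in these warnings is easier to read when grouped by target layer. A minimal sketch, assuming the log lines have exactly the field layout shown above (a different log prefix would shift the awk field numbers); kv_sharing_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Group "layer X: sharing with layer Y" warnings by the shared target layer Y.
# Assumes the field layout shown in this report: W llama_kv_cache: layer X: sharing with layer Y.
kv_sharing_summary() {
  awk '/llama_kv_cache: layer [0-9]+: sharing with layer [0-9]+/ {
    src = $4; sub(/:$/, "", src)    # "3:"  -> "3"
    dst = $8; sub(/\.$/, "", dst)   # "41." -> "41"
    targets[dst] = targets[dst] " " src
  }
  END { for (t in targets) printf "%s:%s\n", t, targets[t] }' | sort -n
}

# Demo input: the four warnings quoted in this report.
kv_sharing_summary <<'EOF'
W llama_kv_cache: layer 3: sharing with layer 41.
W llama_kv_cache: layer 0: sharing with layer 40.
W llama_kv_cache: layer 1: sharing with layer 40.
W llama_kv_cache: layer 2: sharing with layer 40.
EOF
```

Each output line lists a target layer followed by the layers that share its KV cache, so layers 0, 1, and 2 appear together under layer 40.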

Stack trace

#3 ggml_cuda_flash_attn_ext_mma_f16(ggml_backend_cuda_context&, ggml_tensor*)
#4 ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*)
#5 ggml_backend_sched_graph_compute_async
#6 llama_context::graph_compute(ggml_cgraph*, bool)
#7 llama_context::process_ubatch(...)
#8 llama_context::decode(llama_batch const&)
#9 llama_decode
#10 common_context_can_seq_rm(llama_context*)
#11 server_context_impl::load_model(common_params&)

Expected behavior

FA should work alongside MTP, or fail gracefully rather than aborting.

Workaround

Run with --flash-attn off. MTP then initializes and runs correctly. Even
with FA disabled, MTP gives a confirmed ~1.4× speedup on this hardware
(~24 t/s effective vs ~17 t/s baseline) at 36–42% draft acceptance.
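
The ~1.4× figure follows directly from the two throughput numbers quoted above. A small sketch of the arithmetic; speedup is a hypothetical helper name, and the inputs are the approximate values from this report rather than new measurements.

```shell
#!/bin/sh
# Approximate throughput figures quoted in this report (tokens per second).
baseline_tps=17
mtp_tps=24

# Ratio of the first argument to the second, rounded to two decimals.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup "$mtp_tps" "$baseline_tps"   # prints 1.41
```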

Notes

The crash reproduces 100% of the time with FA on and MTP enabled. --fit off
and --no-warmup do not prevent it, because the crash occurs in
common_context_can_seq_rm, which runs unconditionally after draft context init.
