[CUDA] FA+MTP crash: ggml_cuda_flash_attn_ext_mma_f16 fatal error in common_context_can_seq_rm on SM87 (Jetson Orin) #24457

## Summary

Using `--flash-attn on` with `--spec-type draft-mtp` causes a fatal crash in
`ggml_cuda_flash_attn_ext_mma_f16` on SM 8.7 (Jetson Orin Nano). The crash
occurs during `common_context_can_seq_rm` after the MTP draft context
initializes. Workaround: `--flash-attn off` — MTP works correctly without FA.

## Environment

- **Hardware:** Jetson Orin Nano Developer Kit
- **CUDA SM:** 8.7
- **JetPack:** 7.2 / CUDA 13.2
- **llama.cpp:** b9592
- **Build flags:** `GGML_CUDA=ON GGML_CUDA_F16=ON GGML_CPU_AARCH64=ON CMAKE_CUDA_ARCHITECTURES=87`
- **Model:** Gemma 4 E4B (unsloth GGUF, gemma4 architecture)
- **Draft model:** MTP head GGUF (`mtp-gemma-4-E4B-it.gguf`)

## Reproduction

```bash
llama-server \
  -m gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf \
  --model-draft mtp-gemma-4-E4B-it.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --n-gpu-layers 999 \
  --ctx-size 16384 \
  --parallel 1 \
  --no-mmap \
  --batch-size 128 \
  --ubatch-size 128 \
  --flash-attn on \
  --cache-type-k q4_0 \
  --fit off \
  --no-warmup \
  --swa-full \
  --cache-prompt \
  --context-shift \
  --keep 256 \
  --cache-ram 512 \
  --host 0.0.0.0 \
  --port 8081
```

## Observed behavior

Fatal abort in `fattn.cu:110` during startup. MTP draft context loads and logs
shared KV cache layers (see below), then `common_context_can_seq_rm` triggers
a decode that hits the MMA F16 flash attention kernel and crashes.

Notable pre-crash warnings:

    W llama_kv_cache: layer 3: sharing with layer 41.
    W llama_kv_cache: layer 0: sharing with layer 40.  ← same pointer
    W llama_kv_cache: layer 1: sharing with layer 40.  ← same pointer
    W llama_kv_cache: layer 2: sharing with layer 40.  ← same pointer

## Stack trace

    #3  ggml_cuda_flash_attn_ext_mma_f16(ggml_backend_cuda_context&, ggml_tensor*)
    #4  ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*)
    #5  ggml_backend_sched_graph_compute_async
    #6  llama_context::graph_compute(ggml_cgraph*, bool)
    #7  llama_context::process_ubatch(...)
    #8  llama_context::decode(llama_batch const&)
    #9  llama_decode
    #10 common_context_can_seq_rm(llama_context*)
    #11 server_context_impl::load_model(common_params&)

## Expected behavior

FA should work alongside MTP, or fail gracefully rather than aborting.

## Workaround

With `--flash-attn off`, MTP initializes and runs correctly. Measured speedup
on this hardware is ~1.4× (36–42% draft acceptance, ~24 t/s effective vs
~17 t/s baseline), even with FA disabled.
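
As a quick sanity check, the ~1.4× figure can be recomputed from the two throughput numbers quoted above. This is only arithmetic on the rounded values from this report, not a new measurement; the variable names are illustrative.

```bash
# Throughput figures quoted in this report, in tokens/s (rounded).
baseline_tps=17
mtp_tps=24

# Ratio of MTP throughput to baseline throughput, to two decimals.
speedup=$(awk -v a="$mtp_tps" -v b="$baseline_tps" 'BEGIN { printf "%.2f", a / b }')
echo "speedup: ${speedup}x"
```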

## Notes

The crash reproduces on every run with FA on + MTP. `--fit off` and
`--no-warmup` do not prevent it, because the crash occurs in
`common_context_can_seq_rm`, which runs unconditionally after draft context init.
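
For anyone trying to confirm this on similar hardware, the sketch below starts the server once per `--flash-attn` mode and reports how each run ended. It reuses the flags from Reproduction unchanged. Everything else is an assumption for illustration: GNU `timeout` is available, 120 seconds is enough for startup, a fatal abort surfaces as exit status 134 (128 + SIGABRT), and the helper names and log file names are hypothetical.

```bash
#!/usr/bin/env bash
# A/B check: run llama-server with --flash-attn on, then off, and report the outcome.
# STARTUP_TIMEOUT is an assumed value; raise it if model loading takes longer.
STARTUP_TIMEOUT="${STARTUP_TIMEOUT:-120}"

# Map an exit status returned via `timeout` to a short description.
classify_exit() {
  case "$1" in
    124) echo "still running at timeout (startup survived)" ;;
    134) echo "aborted (SIGABRT)" ;;
    *)   echo "exited with status $1" ;;
  esac
}

# $1 is the --flash-attn mode; all other flags are copied from Reproduction.
run_server() {
  timeout "$STARTUP_TIMEOUT" llama-server \
    -m gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf \
    --model-draft mtp-gemma-4-E4B-it.gguf \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    --n-gpu-layers 999 \
    --ctx-size 16384 \
    --parallel 1 \
    --no-mmap \
    --batch-size 128 \
    --ubatch-size 128 \
    --flash-attn "$1" \
    --cache-type-k q4_0 \
    --fit off \
    --no-warmup \
    --swa-full \
    --cache-prompt \
    --context-shift \
    --keep 256 \
    --cache-ram 512 \
    --host 0.0.0.0 \
    --port 8081 \
    > "server-fa-$1.log" 2>&1
}

for fa in on off; do
  run_server "$fa"
  status=$?
  echo "--flash-attn $fa: $(classify_exit "$status")"
done
```

Based on the behavior described in this report, the `on` run would be expected to abort during startup while the `off` run keeps serving until the timeout stops it. The per-mode log files keep the pre-crash warnings for comparison.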
