Severe prompt processing slowdown on CUDA backend after NVIDIA driver 610.62 (RTX 5070 Ti, llama.cpp backend) #2071

Description

@Frusciante79

LM Studio: 0.4.17 (Build 3)
Operating system: Windows 11

After updating to NVIDIA driver 610.62, I observe a significant regression in prompt processing performance in LM Studio using the CUDA backend (llama.cpp).

The model executes correctly and the GPU is properly detected. There is no CPU fallback or crash. However, prompt processing (the prefill phase) is significantly slower than it was before the driver update.

Generation speed remains relatively stable after prompt processing completes, indicating the issue is isolated to the prefill/prompt processing stage rather than full inference throughput.

A key observation is that GPU utilization remains active, but efficiency during prompt processing appears degraded, suggesting a possible regression in CUDA kernel execution or KV cache handling.

Steps to reproduce the behavior:

1. Open LM Studio
2. Load model: nex-agi_Nex-N2-mini-Q4_K_M.gguf
3. Select CUDA or CUDA12 backend (llama.cpp)
4. Use a long context prompt (4K–8K+ tokens)
5. Start generation
6. Observe prompt processing timing and token throughput

Expected behavior

Consistent prompt processing performance similar to prior behavior (~29–30 tokens/s stable), without degradation over increasing context size.

Actual behavior

Prompt processing starts relatively fast (~150 tokens/s initial estimate)
Performance degrades progressively during longer prompts (~154 → 147 → 134 tokens/s)
Noticeable slowdown compared to the previous expected baseline (~29–30 tokens/s stable in prior conditions)
GPU remains active throughout execution
No CPU fallback observed

Logs
slot prompt_clear: clearing prompt with 0 tokens
slot process_sing: saving idle slot to prompt cache
slot print_timing: prompt processing, n_tokens = 2048 → 4096 → 6144
performance trend: 154 → 147 → 134 tokens/s (degrading)
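
To put a number on the trend in the log summary above, the throughput samples can be compared against the first one. This is only a rough sketch: it assumes the three throughput figures correspond one-to-one to the three `n_tokens` values in the order listed, which the summary does not state explicitly. The function name `relative_drop` is made up for this illustration.

```python
def relative_drop(samples):
    """Return the percent change of each sample relative to the first one."""
    base = samples[0]
    return [round((s - base) / base * 100, 1) for s in samples]


# Values copied from the log summary; the pairing between the two
# lists is an assumption, not something the logs confirm.
n_tokens = [2048, 4096, 6144]
tok_per_s = [154, 147, 134]

drops = relative_drop(tok_per_s)
for n, t, d in zip(n_tokens, tok_per_s, drops):
    print(f"n_tokens={n:5d}  {t} tokens/s  {d:+.1f}% vs first sample")
```

Under that assumption, throughput at 6144 tokens is roughly 13% lower than at 2048 tokens.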

Additional context
GPU: NVIDIA RTX 5070 Ti
Driver: 610.62
Backend: CUDA (llama.cpp)
Same model shows significantly better prompt processing behavior prior to driver update
Issue is reproducible across multiple runs and sessions
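
For a before/after comparison across driver versions, it may help to record the GPU name and driver version alongside each measurement. The sketch below uses `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`. The helper names `parse_gpu_line` and `query_gpus` are hypothetical, and the sample line is only shaped like the configuration reported here, not captured output.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]


def parse_gpu_line(line):
    """Split one CSV line from nvidia-smi into (gpu_name, driver_version)."""
    name, driver = (field.strip() for field in line.rsplit(",", 1))
    return name, driver


def query_gpus():
    """Return [(name, driver), ...], or [] when nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Hypothetical sample line, shaped like the setup described in this report.
sample = "NVIDIA GeForce RTX 5070 Ti, 610.62"
print(parse_gpu_line(sample))
```

Calling `query_gpus()` on the affected machine before and after a driver rollback would give a label to attach to each set of prompt processing timings.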

Hypothesis

Possible regression in NVIDIA 610.62 affecting CUDA execution behavior in llama.cpp backend, specifically:

KV cache reuse efficiency
CUDA memory allocation patterns
Kernel scheduling / prefill optimization path
Potential loss of CUDA graph optimizations during prompt processing

Request

Could you please confirm whether recent NVIDIA driver changes (610.x branch) may affect prompt processing performance in the CUDA backend (llama.cpp), and whether any LM Studio configuration changes are required for optimal performance?
