Severe prompt processing slowdown on CUDA backend after NVIDIA driver 610.62 (RTX 5070 Ti, llama.cpp backend)

LM Studio: 0.4.17 (Build 3)
Operating system: Windows 11

After updating to NVIDIA driver 610.62, I observe a significant regression in prompt processing performance in LM Studio using the CUDA backend (llama.cpp).

The model executes correctly and the GPU is properly detected; there is no CPU fallback or crash. However, prompt processing (the prefill phase) is significantly slower than it was before the driver update.

Generation speed remains relatively stable after prompt processing completes, indicating the issue is isolated to the prefill/prompt processing stage rather than full inference throughput.

A key observation is that GPU utilization remains active, but efficiency during prompt processing appears degraded, suggesting a possible regression in CUDA kernel execution or KV cache handling.


Steps to reproduce the behavior:

Open LM Studio
Load model: nex-agi_Nex-N2-mini-Q4_K_M.gguf
Select CUDA or CUDA12 backend (llama.cpp)
Use a long context prompt (4K–8K+ tokens)
Start generation
Observe prompt processing timing and token throughput

Expected behavior

Consistent prompt processing performance similar to prior behavior (~29–30 tokens/s stable), without degradation over increasing context size.

Actual behavior

Prompt processing starts relatively fast (~150 tokens/s initial estimate)
Performance degrades progressively during longer prompts (~154 → 147 → 134 tokens/s)
Noticeable slowdown compared to previous expected baseline (~29–30 tok/s stable in prior conditions)
GPU remains active throughout execution
No CPU fallback observed

Logs

slot prompt_clear: clearing prompt with 0 tokens
slot process_sing: saving idle slot to prompt cache
slot print_timing: prompt processing, n_tokens = 2048 → 4096 → 6144
performance trend: 154 → 147 → 134 tokens/s (degrading)
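
To put a number on the trend above, here is a minimal sketch in Python. It only uses the three throughput figures from the log summary; the names `REPORTED`, `relative_drop`, and `chunk_seconds` are mine, and the per-chunk timing rests on an assumption that each reported rate applies to the tokens added since the previous sample, which the logs do not confirm.

```python
# Figures copied from the log summary above: (cumulative n_tokens, tokens/s).
REPORTED = [(2048, 154.0), (4096, 147.0), (6144, 134.0)]


def relative_drop(first: float, last: float) -> float:
    """Fractional throughput loss going from `first` to `last` tokens/s."""
    return (first - last) / first


def chunk_seconds(samples):
    """Estimated seconds spent on each chunk of the prompt.

    ASSUMPTION: each rate describes only the tokens added since the
    previous sample (not a running average over the whole prompt).
    """
    seconds = []
    previous_tokens = 0
    for n_tokens, rate in samples:
        seconds.append((n_tokens - previous_tokens) / rate)
        previous_tokens = n_tokens
    return seconds


if __name__ == "__main__":
    drop = relative_drop(REPORTED[0][1], REPORTED[-1][1])
    print(f"throughput drop from first to last sample: {drop:.1%}")
    for (n_tokens, rate), secs in zip(REPORTED, chunk_seconds(REPORTED)):
        print(f"up to {n_tokens} tokens at {rate} tok/s: ~{secs:.1f} s for this chunk")
```

Under that assumption the drop between the first and last sample is roughly 13%, so each additional 2048-token chunk takes noticeably longer than the one before it.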

Additional context

GPU: NVIDIA RTX 5070 Ti
Driver: 610.62
Backend: CUDA (llama.cpp)
Same model shows significantly better prompt processing behavior prior to driver update
Issue is reproducible across multiple runs and sessions
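
For anyone trying to reproduce this, it may help to record the exact GPU name and driver version next to each timing run, so that before/after comparisons are unambiguous. The following is a minimal sketch that shells out to `nvidia-smi` with its `--query-gpu` option; the helper names `parse_gpu_info` and `read_gpu_info` are hypothetical, and it assumes `nvidia-smi` is on `PATH`.

```python
import subprocess

# nvidia-smi query that prints one "name, driver_version" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]


def parse_gpu_info(csv_text: str):
    """Turn nvidia-smi CSV output into a list of (gpu_name, driver_version)."""
    rows = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so a comma inside the name cannot break parsing.
        name, _, driver = line.rpartition(",")
        rows.append((name.strip(), driver.strip()))
    return rows


def read_gpu_info():
    """Run nvidia-smi and return the parsed rows (assumes it is on PATH)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_info(result.stdout)


if __name__ == "__main__":
    for gpu_name, driver_version in read_gpu_info():
        print(f"{gpu_name}: driver {driver_version}")
```

Running this once before and once after a driver change, and pasting the output alongside the prompt processing timings, ties each measurement to a specific driver.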

Hypothesis

Possible regression in NVIDIA 610.62 affecting CUDA execution behavior in llama.cpp backend, specifically:

KV cache reuse efficiency
CUDA memory allocation patterns
kernel scheduling / prefill optimization path
potential loss of CUDA graph optimizations during prompt processing

Request

Could you please confirm whether recent NVIDIA driver changes (610.x branch) may affect prompt processing performance in the CUDA backend (llama.cpp), and whether any LM Studio configuration changes are needed to restore the previous performance?

Severe prompt processing slowdown on CUDA backend after NVIDIA driver 610.62 (RTX 5070 Ti, llama.cpp backend) #2071
