Bug: Garbled output with --split-mode layer on asymmetric multi-GPU setup (V100 32G+16G) #1500

@netqer

Description

What happened?

Hello,

I am encountering an issue where llama-server produces garbled/invalid output when using --split-mode layer on a dual-GPU setup with asymmetric VRAM. However, using --split-mode graph works perfectly with the same hardware and model.

Environment:

OS: Linux (Ubuntu)
GPUs: 2x NVIDIA V100 SXM2
GPU 0: 16GB VRAM
GPU 1: 32GB VRAM
CUDA Version: 12.8
Model:

Path: /home/xx/models/qwen35_gguf/unsloth/Qwen3___5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf
Type: GGUF (Q6_K_XL)
Steps to Reproduce:

Working Command (--split-mode graph):
Running the following command produces normal, coherent text output.

./llama-server -m "/home/xx/models/qwen35_gguf/unsloth/Qwen3___5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf" --jinja -ngl 99 --threads 50 --ctx-size 32684 --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20 --presence-penalty 1.0 --host 0.0.0.0 --split-mode graph

========================================================================

Failing Command (--split-mode layer):
Running the following command starts the server successfully, but the generated output is completely garbled (mojibake).

./llama-server -m "/home/xx/models/qwen35_gguf/unsloth/Qwen3___5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf" --jinja -ngl 99 --threads 50 --ctx-size 32684 --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20 --presence-penalty 1.0 --host 0.0.0.0 --split-mode layer
Actual Output (Garbled):
When using --split-mode layer, the response looks like this:

%#,&151&-".)4.35,2-2#,!*#+2'43%(13"#&)-20#*50,)*"%&'1#,%(&4+2%.#(5.-5!&'2-352+!32&'"2,05.&+3(1#

Expected Output:
Normal natural language text, similar to what is produced when using --split-mode graph.

Additional Context:

The VRAM configuration is asymmetric (32GB + 16GB). I suspect the layer-splitting logic might be miscalculating memory usage or tensor distribution across the uneven GPUs when --split-mode layer is selected.
The server does not crash; it simply generates invalid tokens.
-ngl 99 is used to offload all layers to GPU.
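As a diagnostic (not a fix), it may help to narrow down whether the automatic layer distribution is at fault. The sketch below uses the standard llama.cpp options --tensor-split (explicit per-GPU split ratio) and the CUDA environment variable CUDA_VISIBLE_DEVICES; the 1,2 ratio is an assumption matching the 16GB/32GB cards, and the remaining flags are taken from the failing command above.

```shell
MODEL="/home/xx/models/qwen35_gguf/unsloth/Qwen3___5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf"

# 1) Pin the layer split explicitly to the VRAM ratio (GPU 0 ~1/3, GPU 1 ~2/3).
#    If output becomes coherent, the automatic split calculation is suspect.
./llama-server -m "$MODEL" --jinja -ngl 99 --ctx-size 32684 \
  --split-mode layer --tensor-split 1,2

# 2) Rule out a per-device problem by running on each GPU alone.
#    Clean output on both cards would point at the multi-GPU layer path.
CUDA_VISIBLE_DEVICES=0 ./llama-server -m "$MODEL" --jinja -ngl 99
CUDA_VISIBLE_DEVICES=1 ./llama-server -m "$MODEL" --jinja -ngl 99
```

If --tensor-split changes the behavior, including that result in the report would likely help maintainers localize the bug.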

Thank you.

Name and Version

version: 4347 (233225d)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

No response

Relevant log output

Metadata

Labels: wontfix (This will not be worked on)