
Misc. bug: Speed degradation in bin-win-cpu-x64 compared to bin-win-avx2-x64 on Intel Core i7-12700H #13664


Closed
howlger opened this issue May 20, 2025 · 10 comments
Labels
bug Something isn't working

Comments

@howlger
Contributor

howlger commented May 20, 2025

Name and Version

First bad version:

llama-cli --version
load_backend: loaded RPC backend from C:\Program Files\llama.cpp\models\llama-b5276-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Program Files\llama.cpp\models\llama-b5276-bin-win-cpu-x64\ggml-cpu-alderlake.dll
version: 5276 (9f2da587)
built with clang version 18.1.8 for x86_64-pc-windows-msvc

Current version, still affected:

llama-cli --version
load_backend: loaded RPC backend from C:\Program Files\llama.cpp\models\llama-b5432-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Program Files\llama.cpp\models\llama-b5432-bin-win-cpu-x64\ggml-cpu-alderlake.dll
version: 5432 (4245e622)
built with clang version 18.1.8 for x86_64-pc-windows-msvc

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

llama-cli -m gemma-3-1b-it-Q4_K_M.gguf -no-cnv -p "Tell me about the capital of France" --seed 3

Problem description & steps to reproduce

Since commit 9f2da58, first shipped in llama-b5276-bin-win-cpu-x64.zip (release b5276), I see a degradation of the eval time on Windows 11 with an Intel Core i7-12700H CPU: eval time is now more than ten times slower than in the previous build b5275 with llama-b5275-bin-win-avx2-x64.zip.

The current release b5432, which provides llama-b5432-bin-win-cpu-x64.zip and no longer a ...-win-avx2-x64.zip, is also affected: eval time is still slow.
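A minimal way to reproduce the comparison is to run the same command from both extracted release folders and compare the reported eval time (a sketch; the install paths below are only illustrative):

```shell
# Fast baseline: b5275 AVX2 build
cd "C:\Program Files\llama.cpp\llama-b5275-bin-win-avx2-x64"
.\llama-cli.exe -m ..\gemma-3-1b-it-Q4_K_M.gguf -no-cnv -p "Tell me about the capital of France" --seed 3

# First bad build: b5276 generic CPU package
cd "C:\Program Files\llama.cpp\llama-b5276-bin-win-cpu-x64"
.\llama-cli.exe -m ..\gemma-3-1b-it-Q4_K_M.gguf -no-cnv -p "Tell me about the capital of France" --seed 3

# Compare the "eval time" line printed by llama_perf_context_print at the end of each run.
```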

Tested with gemma-3-1b-it-Q4_K_M.gguf (the slowdown was also confirmed with Qwen3-30B-A3B-UD-Q2_K_XL.gguf and other models):

llama-b5275-bin-win-avx2-x64: eval time = 61.90 ms per token
llama-cli -m gemma-3-1b-it-Q4_K_M.gguf -no-cnv -p "Tell me about the capital of France" --seed 3
build: 5275 (93c4e239) with MSVC 19.43.34808.0 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 38 key-value pairs and 340 tensors from ..\gemma-3-1b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 3 1b It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 1B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 1b Pt
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 32768
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 1152
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 26
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 6912
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 4
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  20:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:            gemma3.attention.sliding_window u32              = 512
llama_model_loader: - kv  22:             gemma3.attention.head_count_kv u32              = 1
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  26:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  33:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  35:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - kv  37:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  157 tensors
llama_model_loader: - type q5_0:  117 tensors
llama_model_loader: - type q8_0:   14 tensors
llama_model_loader: - type q4_K:   39 tensors
llama_model_loader: - type q6_K:   13 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 762.49 MiB (6.40 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 1152
print_info: n_layer          = 26
print_info: n_head           = 4
print_info: n_head_kv        = 1
print_info: n_rot            = 256
print_info: n_swa            = 512
print_info: n_swa_pattern    = 6
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 6912
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 999.89 M
print_info: general.name     = Gemma 3 1b It
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/27 layers to GPU
load_tensors:  CPU_AARCH64 model buffer size =    71.98 MiB
load_tensors:   CPU_Mapped model buffer size =   762.49 MiB
.............................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 26, can_shift = 1, padding = 32
llama_kv_cache_unified:        CPU KV buffer size =   104.00 MiB
llama_kv_cache_unified: KV self size  =  104.00 MiB, K (f16):   52.00 MiB, V (f16):   52.00 MiB
llama_context:        CPU compute buffer size =   514.25 MiB
llama_context: graph nodes  = 1099
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 14

system_info: n_threads = 14 (n_threads_batch = 14) / 20 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 3
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

Tell me about the capital of France?

The capital of France is Paris.

It's a beautiful and historic city known for:

*   **Iconic Landmarks:** The Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, Arc de Triomphe, etc.
*   **Art and Culture:** World-class museums, theaters, and a vibrant arts scene.
*   **Food:** Delicious pastries, croissants, steak frites, and fine dining.
*   **Fashion:** A global center for style and design.
*   **History:** A long and fascinating history, from Roman rule to the French Revolution.

Do you want to know more about any specific aspect, like its history, or something else?**
 [end of text]


llama_perf_sampler_print:    sampling time =      62.78 ms /   154 runs   (    0.41 ms per token,  2452.89 tokens per second)
llama_perf_context_print:        load time =     965.96 ms
llama_perf_context_print: prompt eval time =      83.68 ms /     8 tokens (   10.46 ms per token,    95.61 tokens per second)
llama_perf_context_print:        eval time =    8975.52 ms /   145 runs   (   61.90 ms per token,    16.16 tokens per second)
llama_perf_context_print:       total time =    9225.57 ms /   153 tokens
llama-b5276-bin-win-cpu-x64: eval time = 1043.79 ms per token
llama-cli -m gemma-3-1b-it-Q4_K_M.gguf -no-cnv -p "Tell me about the capital of France" --seed 3
load_backend: loaded RPC backend from C:\Program Files\llama.cpp\llama-b5276-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Program Files\llama.cpp\llama-b5276-bin-win-cpu-x64\ggml-cpu-alderlake.dll
build: 5276 (9f2da587) with clang version 18.1.8 for x86_64-pc-windows-msvc
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 38 key-value pairs and 340 tensors from ..\gemma-3-1b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 3 1b It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 1B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 1b Pt
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 32768
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 1152
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 26
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 6912
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 4
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  20:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:            gemma3.attention.sliding_window u32              = 512
llama_model_loader: - kv  22:             gemma3.attention.head_count_kv u32              = 1
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  26:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  33:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  35:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - kv  37:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  157 tensors
llama_model_loader: - type q5_0:  117 tensors
llama_model_loader: - type q8_0:   14 tensors
llama_model_loader: - type q4_K:   39 tensors
llama_model_loader: - type q6_K:   13 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 762.49 MiB (6.40 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 1152
print_info: n_layer          = 26
print_info: n_head           = 4
print_info: n_head_kv        = 1
print_info: n_rot            = 256
print_info: n_swa            = 512
print_info: n_swa_pattern    = 6
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 6912
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 999.89 M
print_info: general.name     = Gemma 3 1b It
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/27 layers to GPU
load_tensors:  CPU_AARCH64 model buffer size =    71.98 MiB
load_tensors:   CPU_Mapped model buffer size =   762.49 MiB
.............................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 26, can_shift = 1, padding = 32
llama_kv_cache_unified:        CPU KV buffer size =   104.00 MiB
llama_kv_cache_unified: KV self size  =  104.00 MiB, K (f16):   52.00 MiB, V (f16):   52.00 MiB
llama_context:        CPU compute buffer size =   514.25 MiB
llama_context: graph nodes  = 1099
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 14

system_info: n_threads = 14 (n_threads_batch = 14) / 20 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |

sampler seed: 3
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

Tell me about the capital of France?

The capital of France is Paris.

It's a beautiful and historic city known for:

*   **Iconic Landmarks:** The Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, Arc de Triomphe, etc.
*   **Art and Culture:** A center for world-class museums, theaters, and music.
*   **Food:** Delicious pastries, croissants, steak frites, and wine.
*   **Fashion:** A global fashion hub.
*   **History:** Rich and complex history, dating back to Roman times.
*   **Parks and Gardens:** Numerous beautiful parks and gardens.
*   **Shopping:** From luxury boutiques to vintage stores.

Do you want to know more about any specific aspect of Paris?  For example, would you like to know about its history, its food, or its attractions?**
 [end of text]


llama_perf_sampler_print:    sampling time =      77.47 ms /   188 runs   (    0.41 ms per token,  2426.78 tokens per second)
llama_perf_context_print:        load time =     933.92 ms
llama_perf_context_print: prompt eval time =      93.72 ms /     8 tokens (   11.72 ms per token,    85.36 tokens per second)
llama_perf_context_print:        eval time =  186837.74 ms /   179 runs   ( 1043.79 ms per token,     0.96 tokens per second)
llama_perf_context_print:       total time =  187151.78 ms /   187 tokens
llama-b5432-bin-win-cpu-x64: eval time = 1132.96 ms per token
llama-cli -m gemma-3-1b-it-Q4_K_M.gguf -no-cnv -p "Tell me about the capital of France" --seed 3
load_backend: loaded RPC backend from C:\Program Files\llama.cpp\llama-b5432-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Program Files\llama.cpp\llama-b5432-bin-win-cpu-x64\ggml-cpu-alderlake.dll
build: 5432 (4245e622) with clang version 18.1.8 for x86_64-pc-windows-msvc
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 38 key-value pairs and 340 tensors from ..\gemma-3-1b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 3 1b It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 1B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 1b Pt
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 32768
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 1152
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 26
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 6912
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 4
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  20:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:            gemma3.attention.sliding_window u32              = 512
llama_model_loader: - kv  22:             gemma3.attention.head_count_kv u32              = 1
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  26:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  33:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  35:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - kv  37:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  157 tensors
llama_model_loader: - type q5_0:  117 tensors
llama_model_loader: - type q8_0:   14 tensors
llama_model_loader: - type q4_K:   39 tensors
llama_model_loader: - type q6_K:   13 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 762.49 MiB (6.40 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 1152
print_info: n_layer          = 26
print_info: n_head           = 4
print_info: n_head_kv        = 1
print_info: n_rot            = 256
print_info: n_swa            = 512
print_info: n_swa_pattern    = 6
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 6912
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 999.89 M
print_info: general.name     = Gemma 3 1b It
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/27 layers to GPU
load_tensors:  CPU_AARCH64 model buffer size =    71.98 MiB
load_tensors:   CPU_Mapped model buffer size =   762.49 MiB
.............................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified:        CPU KV buffer size =    16.00 MiB
llama_kv_cache_unified: size =   16.00 MiB (  4096 cells,   4 layers), K (f16):    8.00 MiB, V (f16):    8.00 MiB
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 2560 cells
llama_kv_cache_unified:        CPU KV buffer size =    55.00 MiB
llama_kv_cache_unified: size =   55.00 MiB (  2560 cells,  22 layers), K (f16):   27.50 MiB, V (f16):   27.50 MiB
llama_context:        CPU compute buffer size =   514.25 MiB
llama_context: graph nodes  = 1151
llama_context: graph splits = 1
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 14

system_info: n_threads = 14 (n_threads_batch = 14) / 20 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |

sampler seed: 3
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

Tell me about the capital of France?

The capital of France is Paris.

It's a beautiful and historic city known for:

*   **Iconic Landmarks:** The Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, Arc de Triomphe, etc.
*   **Art and Culture:** A center for world-class museums, theaters, and music.
*   **Food:** Delicious pastries, croissants, steak frites, and wine.
*   **Fashion:** A global fashion hub.
*   **History:** Rich and complex history, dating back to Roman times.
*   **Parks and Gardens:** Numerous beautiful parks and gardens.
*   **Shopping:** From luxury boutiques to vintage stores.

Do you want to know more about any specific aspect of Paris?  For example, would you like to know about its history, its food, or its attractions?**
 [end of text]


llama_perf_sampler_print:    sampling time =      98.59 ms /   188 runs   (    0.52 ms per token,  1906.83 tokens per second)
llama_perf_context_print:        load time =    1010.25 ms
llama_perf_context_print: prompt eval time =      96.47 ms /     8 tokens (   12.06 ms per token,    82.93 tokens per second)
llama_perf_context_print:        eval time =  202800.39 ms /   179 runs   ( 1132.96 ms per token,     0.88 tokens per second)
llama_perf_context_print:       total time =  203144.88 ms /   187 tokens

First Bad Commit

9f2da58

Relevant log output

@slaren
Member

slaren commented May 20, 2025

Does it happen with fewer threads? Try running llama-bench -m gemma-3-1b-it-Q4_K_M.gguf -p 64 -n 32 -t 6,8,10,14

@howlger
Contributor Author

howlger commented May 20, 2025

No speed degradation for llama-cli with -t 6! Thx.

Results of llama-bench -m gemma-3-1b-it-Q4_K_M.gguf -p 64 -n 32 -t 6,8,10,14:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 6 | pp64 | 228.89 ± 8.37 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 6 | tg32 | 21.63 ± 20.14 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 8 | pp64 | 26.25 ± 1.53 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 8 | tg32 | 6.19 ± 3.63 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 10 | pp64 | 20.02 ± 1.34 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 10 | tg32 | 0.92 ± 0.07 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 14 | pp64 | 13.87 ± 0.51 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 14 | tg32 | 0.41 ± 0.00 |

build: 4245e62 (5432)

@slaren added the bug (Something isn't working) label and removed the bug-unconfirmed label on May 21, 2025
@pwilkin
Contributor

pwilkin commented May 22, 2025

Not sure if this is related, but I've noticed a clear drop in generation speed since updating my local build to the current version (I can't pinpoint the exact previous commit, but it was roughly 3-4 days old). So I ran llama-bench, and generation on 8 threads (I have an i7-9700K, which has 8 cores) is indeed degraded:

(dev-venv) ilintar@LinuksowaJaskinia:/mnt/win/k/models/unsloth/Qwen3-30B-A3B-GGUF$ llama-bench -fa 1 -ctk q8_0 -ctv q8_0 -m Qwen3-30B-A3B-UD-Q4_K_XL.gguf -ot "(up_exps|down_exps)=CPU" -t 2,4,6,8 -p 512 -n 512 -r 10 -d 4096
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

| model | size | params | backend | threads | type_k | type_v | fa | ot | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA,BLAS | 2 | q8_0 | q8_0 | 1 | (up_exps\|down_exps)=CPU | pp512 @ d4096 | 247.08 ± 20.81 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA,BLAS | 2 | q8_0 | q8_0 | 1 | (up_exps\|down_exps)=CPU | tg512 @ d4096 | 14.04 ± 0.43 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA,BLAS | 4 | q8_0 | q8_0 | 1 | (up_exps\|down_exps)=CPU | pp512 @ d4096 | 260.58 ± 20.95 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA,BLAS | 4 | q8_0 | q8_0 | 1 | (up_exps\|down_exps)=CPU | tg512 @ d4096 | 16.86 ± 1.07 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA,BLAS | 6 | q8_0 | q8_0 | 1 | (up_exps\|down_exps)=CPU | pp512 @ d4096 | 278.37 ± 6.08 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA,BLAS | 6 | q8_0 | q8_0 | 1 | (up_exps\|down_exps)=CPU | tg512 @ d4096 | 16.90 ± 2.14 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA,BLAS | 8 | q8_0 | q8_0 | 1 | (up_exps\|down_exps)=CPU | pp512 @ d4096 | 259.70 ± 12.45 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA,BLAS | 8 | q8_0 | q8_0 | 1 | (up_exps\|down_exps)=CPU | tg512 @ d4096 | 4.90 ± 0.76 |

@slaren
Member

slaren commented May 22, 2025

It's not the same issue; this one is related to a change in the Windows releases. Please open a new issue and try to find which commit introduced yours.

@pwilkin
Contributor

pwilkin commented May 22, 2025

Okay, I'll try to do a binary search tomorrow to narrow it down and open an issue afterwards.
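For reference, a rough sketch of that bisection with git bisect (the endpoint commits are placeholders; each step needs a rebuild plus a short llama-bench run):

```shell
git bisect start
git bisect bad HEAD                       # current build: generation is slow
git bisect good <last-known-fast-commit>  # placeholder for the last commit that was fast
# at each step: rebuild, run llama-bench, then mark the result
git bisect good    # or: git bisect bad
git bisect reset   # once git reports the first bad commit
```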

@slaren
Member

slaren commented May 24, 2025

The latest release should have fixed this issue.

@howlger
Contributor Author

howlger commented May 24, 2025

The latest release should have fixed this issue.

Unfortunately, the latest release (build b5476: 17fc817) does not work for me. When running llama-bench -m gemma-3-1b-it-Q4_K_M.gguf -p 64 -n 32 -t 6,8,10,14 I get the following error:

load_backend: loaded RPC backend from C:\Program Files\llama.cpp\llama-b5476-bin-win-cpu-x64\ggml-rpc.dll
main: error: CPU backend is not loaded

Note that the following line, which appeared in previous releases, is missing even though the file ggml-cpu-alderlake.dll exists (llama-b5476-bin-win-cpu-x64.zip contains the same set of files as e.g. llama-b5451-bin-win-cpu-x64.zip):

load_backend: loaded CPU backend from C:\Program Files\llama.cpp\llama-b5451-bin-win-cpu-x64\ggml-cpu-alderlake.dll

When using llama-cli, the error looks like this:

...
load_tensors: loading model tensors, this can take a while... (mmap = true)
llama_model_load: error loading model: make_cpu_buft_list: no CPU backend found
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'Qwen3-30B-A3B-UD-Q2_K_XL.gguf'
main: error: unable to load model

@slaren
Member

slaren commented May 24, 2025

The latest release uses OpenMP, so if you don't have it installed it may fail to load. Try installing the latest version of the Visual C++ redistributable.

Apparently this library is not included in the VC redistributable. I have added it now, and it should be bundled with the next llama.cpp release.

@howlger
Contributor Author

howlger commented May 25, 2025

I confirm that this issue is fixed in the current release b5478 (llama-b5478-bin-win-cpu-x64.zip). Thank you for all your work, not only in this case!

Results of llama-bench -m gemma-3-1b-it-Q4_K_M.gguf -p 64 -n 32 -t 6,8,10,14:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 6 | pp64 | 212.88 ± 2.43 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 6 | tg32 | 17.86 ± 12.00 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 8 | pp64 | 35.62 ± 0.47 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 8 | tg32 | 6.65 ± 0.46 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 10 | pp64 | 33.51 ± 0.28 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 10 | tg32 | 5.17 ± 0.04 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 14 | pp64 | 34.54 ± 0.35 |
| gemma3 1B Q4_K - Medium | 762.49 MiB | 999.89 M | RPC | 99 | 14 | tg32 | 4.49 ± 0.14 |

build: f5cd27b (5478)

For the fastest speed, it seems best to run llama-cli with the -t n option, where n is the number of performance cores, right? In my case, with an Intel Core i7-12700H, that would be -t 6 (six performance cores).
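A sketch of such a run (note that -t only limits the number of generation threads; it does not pin them to specific cores):

```shell
# 6 threads to match the six performance cores of the i7-12700H
llama-cli -m gemma-3-1b-it-Q4_K_M.gguf -no-cnv -t 6 --seed 3 -p "Tell me about the capital of France"
```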

@howlger
Contributor Author

howlger commented May 25, 2025

Closing this issue as fixed via #13758, #13756 and #13763. Thx @slaren for fixing it.

@howlger howlger closed this as completed May 25, 2025