Phi-2 q4_km generating gibberish on ARM devices #4618

Closed
LostRuins opened this issue Dec 24, 2023 · 11 comments

@LostRuins
Collaborator

LostRuins commented Dec 24, 2023

Running the latest commit, testing the model https://huggingface.co/afrideva/phi-2-uncensored-GGUF/blob/main/phi-2-uncensored.q4_k_m.gguf in Termux.

./main -m ../phi-2-uncensored.q4_k_m.gguf -n 10 -p "Hi, my name is"

main: build = 1691 (7082d24)
main: built with clang version 17.0.6 for aarch64-unknown-linux-android24
main: seed  = 1703404966
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from ../phi-2-uncensored.q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = Phi2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
llama_model_loader: - kv   5:                           phi2.block_count u32              = 32
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  195 tensors
llama_model_loader: - type q4_K:   81 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2560
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 10240
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 2.78 B
llm_load_print_meta: model size       = 1.66 GiB (5.14 BPW)
llm_load_print_meta: general.name     = Phi2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 '�'
llm_load_tensors: ggml ctx size       =    0.12 MiB
llm_load_tensors: system memory used  = 1704.75 MiB
....................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_build_graph: non-view tensors processed: 774/774
llama_new_context_with_model: compute buffer total size = 113.19 MiB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 10, n_keep = 0

Hi, my name is vocwich Reeves TABLEeco Feahar Reeves sill Reeves

It generates complete gibberish.

The same model works fine on my x86_64 Windows device.
Also, q2_k works fine on both systems.

Is it possible that the ARM FMA or ARM NEON intrinsics are responsible for this issue?

Also tagging @ebeyabraham

@ggerganov
Member

Not sure - works on M2 Ultra

What is the output from:

make tests && ./tests/test-backend-ops

Any fails?

@LostRuins
Collaborator Author

LostRuins commented Dec 24, 2023

Sorry, that took a really long time to compile on my Android device.

The tests crash partway through, consistently at the same point, right after MOE(n_experts=8,n_experts_per_tok=2,n_tokens=1,n_embd=4096,n_ff=8192).
I am not sure whether it's due to running out of memory, a segfault, or something else. Here's the terminal output:

tests.txt

I modified the .cpp file to remove the line at https://github.com/ggerganov/llama.cpp/blob/master/tests/test-backend-ops.cpp#L1580 and recompiled.

tests_2.txt

All other test-backend-ops tests seem to be passing. I think the MOE test is unrelated, though, since Phi is not a MoE model. Also, to clarify, the main binary is built from the unmodified Makefile by just running make main. The environment is Termux on Android 9.

@LostRuins
Collaborator Author

LostRuins commented Dec 25, 2023

Is there an easy way to modify the Makefile to skip either NEON or FMA? I'd like to see if I can pinpoint which one is causing the issue.

Also, I checked with another user over Discord; they used Termux with compile and run settings identical to mine, and their output was coherent. So it might be a device-specific thing? I'm slightly confused and wondering if anybody else has the same issue.

For reference, my device (the one that doesn't work):

SoC: Samsung Exynos 9810 (S5E9810)
CPU: 4x 2.9 GHz Exynos M3 Meerkat, 4x 1.9 GHz ARM Cortex-A55, Cores: 8

@ggerganov
Member

Does it work with this patch:

diff --git a/ggml-quants.c b/ggml-quants.c
index a15a2404..b5c76f00 100644
--- a/ggml-quants.c
+++ b/ggml-quants.c
@@ -5602,7 +5602,7 @@ void ggml_vec_dot_q4_K_q8_K(const int n, float * restrict s, const void * restri
 
     const int nb = n / QK_K;
 
-#ifdef __ARM_NEON
+#ifdef __ARM_NEON_XXX
 
     const uint8x16_t m4b = vdupq_n_u8(0xf);
 

@LostRuins
Collaborator Author

Unfortunately not; it's still generating rubbish.

@LostRuins
Collaborator Author

Another update: I have done a full search and replace across all files, changing every instance of __ARM_NEON to __ARM_NEON_XXX, and it is working correctly now. So whatever the bug was, it was indeed in the NEON code.

I will try to slowly replace each instance until I find the one responsible, unless you have a better approach to suggest.
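
A blanket rename like that can be done with a one-liner along these lines, run from the repo root (assuming GNU grep and sed, as Termux provides; shown only as an illustration, not the exact command used):

grep -rl --include='*.c' --include='*.h' '__ARM_NEON' . | xargs sed -i 's/__ARM_NEON/__ARM_NEON_XXX/g'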

@LostRuins
Collaborator Author

By trial and error, I have narrowed it down to ggml_vec_dot_q5_K_q8_K()

By skipping https://github.com/ggerganov/llama.cpp/blob/master/ggml-quants.c#L5879, this model works perfectly on my device.
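
Concretely, that is the same rename as in the patch above, applied to the guard inside ggml_vec_dot_q5_K_q8_K() (a sketch, assuming that guard mirrors the q4_K one shown earlier):

-#ifdef __ARM_NEON
+#ifdef __ARM_NEON_XXX

which makes just that kernel fall back to the scalar reference implementation.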

@ggerganov
Member

Could you verify that #4630 also fixes the issue on that device?

@LostRuins
Collaborator Author

LostRuins commented Dec 25, 2023

Yes, that seems to have fixed it! Awesome.

Though I am wondering why the unit tests didn't catch that.

@lts-rad

lts-rad commented Dec 25, 2023

I checked q4_k_m on my build with #4630 and that's fine, but I think there's still something off with q5_k_m: https://huggingface.co/afrideva/phi-2-uncensored-GGUF/blob/main/phi-2-uncensored.q5_k_m.gguf

@ggerganov
Member

> Though I am wondering why the unit tests didn't catch that.

The tests cannot catch integer overflows. Should be fixed now.
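
As an illustration of why this class of bug can slip past numeric tests (a hypothetical sketch in plain C, not the actual ggml kernel): products accumulated into a 16-bit lane wrap silently under worst-case inputs, while random signed test data mostly cancels and stays in range.

#include <stdint.h>
#include <stdio.h>

// Hypothetical illustration only -- not ggml code. A 16-bit accumulator
// wraps under worst-case int8 products, while a widened 32-bit accumulator
// stays correct. Random signed test data mostly cancels, so the wrap can
// easily go unnoticed by a test suite.
int main(void) {
    int8_t a[32], b[32];
    for (int i = 0; i < 32; ++i) { a[i] = 120; b[i] = 120; } // worst-case magnitudes

    int16_t acc16 = 0; // mimics a 16-bit SIMD lane
    int32_t acc32 = 0; // widened accumulator
    for (int i = 0; i < 32; ++i) {
        acc16 = (int16_t)(acc16 + a[i] * b[i]); // 32 * 14400 = 460800 exceeds INT16_MAX
        acc32 += a[i] * b[i];
    }

    printf("16-bit accumulator: %d\n", acc16); // prints 2048 (wrapped)
    printf("32-bit accumulator: %d\n", acc32); // prints 460800
    return 0;
}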
