Phi-2 q4_km generating gibberish on ARM devices #4618

Closed
LostRuins opened this issue Dec 24, 2023 · 11 comments

@LostRuins
Collaborator

LostRuins commented Dec 24, 2023

Running the latest commit, testing the model https://huggingface.co/afrideva/phi-2-uncensored-GGUF/blob/main/phi-2-uncensored.q4_k_m.gguf in Termux.

./main -m ../phi-2-uncensored.q4_k_m.gguf -n 10 -p "Hi, my name is"

main: build = 1691 (7082d24)
main: built with clang version 17.0.6 for aarch64-unknown-linux-android24
main: seed  = 1703404966
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from ../phi-2-uncensored.q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = Phi2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
llama_model_loader: - kv   5:                           phi2.block_count u32              = 32
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  195 tensors
llama_model_loader: - type q4_K:   81 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2560
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 10240
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 2.78 B
llm_load_print_meta: model size       = 1.66 GiB (5.14 BPW)
llm_load_print_meta: general.name     = Phi2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 '�'
llm_load_tensors: ggml ctx size       =    0.12 MiB
llm_load_tensors: system memory used  = 1704.75 MiB
....................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_build_graph: non-view tensors processed: 774/774
llama_new_context_with_model: compute buffer total size = 113.19 MiB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 10, n_keep = 0

Hi, my name is vocwich Reeves TABLEeco Feahar Reeves sill Reeves

It generates complete gibberish.

The same model works fine on my x86_64 Windows device.
Also, q2_k works fine on both systems.

Is it possible that the ARM FMA or ARM NEON intrinsics are responsible for this issue?

Also tagging @ebeyabraham

@ggerganov
Member

Not sure - works on M2 Ultra

What is the output from:

make tests && ./tests/test-backend-ops

Any fails?

@LostRuins
Collaborator Author

LostRuins commented Dec 24, 2023

Sorry, that took a really long time to compile on my Android device.

The tests crash partway through, consistently at the same point, right after MOE(n_experts=8,n_experts_per_tok=2,n_tokens=1,n_embd=4096,n_ff=8192).
I am not sure whether it's due to running out of memory, a segfault, or something else. Here's the terminal output:

tests.txt

I modified the .cpp file to remove the line at https://github.com/ggerganov/llama.cpp/blob/master/tests/test-backend-ops.cpp#L1580 and recompiled.

tests_2.txt

All other test-backend-ops tests seem to be passing. I think the MOE test is unrelated, though, since Phi is not a MoE model. Also, to clarify, the main binary is built from the unmodified Makefile by just running make main. The environment is Termux on Android 9.

@LostRuins
Collaborator Author

LostRuins commented Dec 25, 2023

Is there an easy way to modify the Makefile to skip either NEON or FMA? I'd like to see if I can pinpoint which one is causing the issue.

Also, I checked with another user over Discord; they used Termux with compile and run settings identical to mine, and their output was coherent. So it might be a device-specific thing? I'm slightly confused and wondering if anybody else has the same issue.

For reference, my device (the one that doesn't work):

SoC: Samsung Exynos 9810 (S5E9810)
CPU: 4x 2.9 GHz Exynos M3 Meerkat, 4x 1.9 GHz ARM Cortex-A55, Cores: 8

@ggerganov
Member

Does it work with this patch:

diff --git a/ggml-quants.c b/ggml-quants.c
index a15a2404..b5c76f00 100644
--- a/ggml-quants.c
+++ b/ggml-quants.c
@@ -5602,7 +5602,7 @@ void ggml_vec_dot_q4_K_q8_K(const int n, float * restrict s, const void * restri
 
     const int nb = n / QK_K;
 
-#ifdef __ARM_NEON
+#ifdef __ARM_NEON_XXX
 
     const uint8x16_t m4b = vdupq_n_u8(0xf);
 

@LostRuins
Collaborator Author

Unfortunately not; it's still generating rubbish.

@LostRuins
Collaborator Author

Another update: I have done a full search and replace across all files, changing every instance of __ARM_NEON to __ARM_NEON_XXX, and it is working correctly now. So whatever the bug was, it was indeed in the NEON code.

I will try to slowly replace each instance until I find the one responsible, unless you have a better approach to suggest.
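
A blanket rename like that can be done with a one-liner along these lines, run from the repo root (assuming GNU grep and sed, as Termux provides; shown only as an illustration, not the exact command used):

grep -rl --include='*.c' --include='*.h' '__ARM_NEON' . | xargs sed -i 's/__ARM_NEON/__ARM_NEON_XXX/g'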

@LostRuins
Collaborator Author

By trial and error, I have narrowed it down to ggml_vec_dot_q5_K_q8_K()

By skipping https://github.com/ggerganov/llama.cpp/blob/master/ggml-quants.c#L5879, this model works perfectly on my device.
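
Concretely, that is the same rename as in the patch above, applied to the guard inside ggml_vec_dot_q5_K_q8_K() (a sketch, assuming that guard mirrors the q4_K one shown earlier):

-#ifdef __ARM_NEON
+#ifdef __ARM_NEON_XXX

which makes just that kernel fall back to the scalar reference implementation.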

@ggerganov
Member

Could you verify that #4630 also fixes the issue on that device?

@LostRuins
Collaborator Author

LostRuins commented Dec 25, 2023

Yes, that seems to have fixed it! Awesome.

Though I am wondering why the unit tests didn't catch that.

@lts-rad

lts-rad commented Dec 25, 2023

I checked q4_k_m on my build with #4630 and that's fine, but I think there's still something off with q5_k_m: https://huggingface.co/afrideva/phi-2-uncensored-GGUF/blob/main/phi-2-uncensored.q5_k_m.gguf

@ggerganov
Member

> Though I am wondering why the unit tests didn't catch that.

The tests cannot catch integer overflows. Should be fixed now.
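
As an illustration of why this class of bug can slip past numeric tests (a hypothetical sketch in plain C, not the actual ggml kernel): products accumulated into a 16-bit lane wrap silently under worst-case inputs, while random signed test data mostly cancels and stays in range.

#include <stdint.h>
#include <stdio.h>

// Hypothetical illustration only -- not ggml code. A 16-bit accumulator
// wraps under worst-case int8 products, while a widened 32-bit accumulator
// stays correct. Random signed test data mostly cancels, so the wrap can
// easily go unnoticed by a test suite.
int main(void) {
    int8_t a[32], b[32];
    for (int i = 0; i < 32; ++i) { a[i] = 120; b[i] = 120; } // worst-case magnitudes

    int16_t acc16 = 0; // mimics a 16-bit SIMD lane
    int32_t acc32 = 0; // widened accumulator
    for (int i = 0; i < 32; ++i) {
        acc16 = (int16_t)(acc16 + a[i] * b[i]); // 32 * 14400 = 460800 exceeds INT16_MAX
        acc32 += a[i] * b[i];
    }

    printf("16-bit accumulator: %d\n", acc16); // prints 2048 (wrapped)
    printf("32-bit accumulator: %d\n", acc32); // prints 460800
    return 0;
}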
