Got error when running llama.cpp with OpenMPI #2827

Closed
hunter-xue opened this issue Aug 27, 2023 · 1 comment
Ubuntu 22.04 with OpenMPI installed and working well. The git branch is b1079.
I compile with the command below:

make CC=mpicc CXX=mpicxx LLAMA_MPI=1

then start with this command:

mpirun -hostfile ./hostfile -n 8 /home/ubuntu/llama.cpp/main -m /home/ubuntu/llama.cpp/models/chinese-alpaca-2-7b-q4_0.gguf -n 128 -p "hello. "

but it fails with the following error:

llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size =  117.41 MB
GGML_ASSERT: llama.cpp:2834: n_threads > 0
[vm10-100-1-215:376355] *** Process received signal ***
llama_new_context_with_model: compute buffer total size =  117.41 MB
llama_new_context_with_model: compute buffer total size =  117.41 MB
llama_new_context_with_model: compute buffer total size =  117.41 MB
llama_new_context_with_model: compute buffer total size =  117.41 MB
GGML_ASSERT: llama.cpp:2834: n_threads > 0
GGML_ASSERT: llama.cpp:2834: n_threads > 0
GGML_ASSERT: llama.cpp:2834: n_threads > 0
GGML_ASSERT: llama.cpp:2834: n_threads > 0
[vm10-100-1-215:376354] *** Process received signal ***
[vm10-100-1-215:376358] *** Process received signal ***
[vm10-100-1-215:376353] *** Process received signal ***
[vm10-100-1-215:376356] *** Process received signal ***
[vm10-100-1-215:376355] Signal: Aborted (6)
[vm10-100-1-215:376355] Signal code:  (-6)
[vm10-100-1-215:376354] Signal: Aborted (6)
[vm10-100-1-215:376354] Signal code:  (-6)
[vm10-100-1-215:376358] Signal: Aborted (6)
[vm10-100-1-215:376358] Signal code:  (-6)
[vm10-100-1-215:376353] Signal: Aborted (6)
[vm10-100-1-215:376353] Signal code:  (-6)
[vm10-100-1-215:376356] Signal: Aborted (6)
[vm10-100-1-215:376356] Signal code:  (-6)
[vm10-100-1-215:376355] [ 0] [vm10-100-1-215:376358] [ 0] [vm10-100-1-215:376353] [ 0] [vm10-100-1-215:376354] [ 0] [vm10-100-1-215:376356] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7b00be2520]
[vm10-100-1-215:376355] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f6dfb9a2520]
[vm10-100-1-215:376353] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fcc2fc59520]
[vm10-100-1-215:376354] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f96ef10f520]
[vm10-100-1-215:376356] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f765279a520]
[vm10-100-1-215:376358] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f7b00c36a7c]
[vm10-100-1-215:376355] [ 2] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f76527eea7c]
[vm10-100-1-215:376358] [ 2] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fcc2fcada7c]
[vm10-100-1-215:376354] [ 2] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f96ef163a7c]
[vm10-100-1-215:376356] [ 2] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f6dfb9f6a7c]
[vm10-100-1-215:376353] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f7b00be2476]
[vm10-100-1-215:376355] [ 3] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f96ef10f476]
[vm10-100-1-215:376356] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f96ef0f57f3]
[vm10-100-1-215:376356] [ 4] /home/ubuntu/llama.cpp/main(+0x5a314)[0x5640622f1314]
[vm10-100-1-215:376356] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f7b00bc87f3]
[vm10-100-1-215:376355] [ 4] /home/ubuntu/llama.cpp/main(+0x5a314)[0x55ff14970314]
[vm10-100-1-215:376355] [ 5] /home/ubuntu/llama.cpp/main(+0x5a47f)[0x55ff1497047f]
[vm10-100-1-215:376355] [ 6] /home/ubuntu/llama.cpp/main(+0x5ab92)[0x55ff14970b92]
[vm10-100-1-215:376355] [ 7] /home/ubuntu/llama.cpp/main(+0x90762)[0x55ff149a6762]
[vm10-100-1-215:376355] [ 8] /home/ubuntu/llama.cpp/main(+0xf7da)[0x55ff149257da]
[vm10-100-1-215:376355] [ 9] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f765279a476]
[vm10-100-1-215:376358] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f76527807f3]
[vm10-100-1-215:376358] [ 4] /home/ubuntu/llama.cpp/main(+0x5a314)[0x558379897314]
[vm10-100-1-215:376358] [ 5] /home/ubuntu/llama.cpp/main(+0x5a47f)[0x55837989747f]
[vm10-100-1-215:376358] [ 6] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fcc2fc59476]
[vm10-100-1-215:376354] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fcc2fc3f7f3]
[vm10-100-1-215:376354] [ 4] /home/ubuntu/llama.cpp/main(+0x5a314)[0x55736efb1314]
[vm10-100-1-215:376354] [ 5] /home/ubuntu/llama.cpp/main(+0x5a47f)[0x55736efb147f]
[vm10-100-1-215:376354] [ 6] /home/ubuntu/llama.cpp/main(+0x5ab92)[0x55736efb1b92]
[vm10-100-1-215:376354] [ 7] /home/ubuntu/llama.cpp/main(+0x90762)[0x55736efe7762]
[vm10-100-1-215:376354] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f6dfb9a2476]
[vm10-100-1-215:376353] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f6dfb9887f3]
[vm10-100-1-215:376353] [ 4] /home/ubuntu/llama.cpp/main(+0x5a314)[0x55af7db6d314]
[vm10-100-1-215:376353] [ 5] /home/ubuntu/llama.cpp/main(+0x5a47f)[0x55af7db6d47f]
[vm10-100-1-215:376353] [ 6] /home/ubuntu/llama.cpp/main(+0x5ab92)[0x55af7db6db92]
[vm10-100-1-215:376353] [ 7] /home/ubuntu/llama.cpp/main(+0x90762)[0x55af7dba3762]
[vm10-100-1-215:376353] /home/ubuntu/llama.cpp/main(+0x5ab92)[0x558379897b92]
[vm10-100-1-215:376358] [ 7] /home/ubuntu/llama.cpp/main(+0x90762)[0x5583798cd762]
[vm10-100-1-215:376358] [ 8] /home/ubuntu/llama.cpp/main(+0xf7da)[0x55837984c7da]
[vm10-100-1-215:376358] [ 9] [ 8] /home/ubuntu/llama.cpp/main(+0xf7da)[0x55736ef667da]
[vm10-100-1-215:376354] [ 9] [ 5] /home/ubuntu/llama.cpp/main(+0x5a47f)[0x5640622f147f]
[vm10-100-1-215:376356] [ 6] /home/ubuntu/llama.cpp/main(+0x5ab92)[0x5640622f1b92]
[vm10-100-1-215:376356] [ 7] /home/ubuntu/llama.cpp/main(+0x90762)[0x564062327762]
[vm10-100-1-215:376356] [ 8] /home/ubuntu/llama.cpp/main(+0xf7da)[0x5640622a67da]
[vm10-100-1-215:376356] [ 9] [ 8] /home/ubuntu/llama.cpp/main(+0xf7da)[0x55af7db227da]
[vm10-100-1-215:376353] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f6dfb989d90]
[vm10-100-1-215:376353] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f6dfb989e40]
[vm10-100-1-215:376353] [11] /home/ubuntu/llama.cpp/main(+0x12fc5)[0x55af7db25fc5]
[vm10-100-1-215:376353] *** End of error message ***
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f7b00bc9d90]
[vm10-100-1-215:376355] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f7b00bc9e40]
[vm10-100-1-215:376355] [11] /home/ubuntu/llama.cpp/main(+0x12fc5)[0x55ff14928fc5]
[vm10-100-1-215:376355] *** End of error message ***
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f96ef0f6d90]
[vm10-100-1-215:376356] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f96ef0f6e40]
[vm10-100-1-215:376356] [11] /home/ubuntu/llama.cpp/main(+0x12fc5)[0x5640622a9fc5]
[vm10-100-1-215:376356] *** End of error message ***
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f7652781d90]
[vm10-100-1-215:376358] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f7652781e40]
[vm10-100-1-215:376358] [11] /home/ubuntu/llama.cpp/main(+0x12fc5)[0x55837984ffc5]
[vm10-100-1-215:376358] *** End of error message ***
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fcc2fc40d90]
[vm10-100-1-215:376354] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fcc2fc40e40]
[vm10-100-1-215:376354] [11] /home/ubuntu/llama.cpp/main(+0x12fc5)[0x55736ef69fc5]
[vm10-100-1-215:376354] *** End of error message ***
llama_new_context_with_model: compute buffer total size =  117.41 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


..............................................................................................
llama_new_context_with_model: kv self size  =  256.00 MB
.....................................................................llama_new_context_with_model: compute buffer total size =  117.41 MB
.GGML_ASSERT: llama.cpp:2834: n_threads > 0
...[vm10-100-1-215:376352] *** Process received signal ***
.....[vm10-100-1-215:376352] Signal: Aborted (6)
[vm10-100-1-215:376352] Signal code:  (-6)
......[vm10-100-1-215:376352] [ 0] .../lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ff0c40be520]
[vm10-100-1-215:376352] [ 1] ../lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7ff0c4112a7c]
[vm10-100-1-215:376352] [ 2] ../lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7ff0c40be476]
[vm10-100-1-215:376352] [ 3] ../lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7ff0c40a47f3]
[vm10-100-1-215:376352] [ 4] /home/ubuntu/llama.cpp/main(+0x5a314)[0x56224177c314]
.
[vm10-100-1-215:376352] [ 5] /home/ubuntu/llama.cpp/main(+0x5a47f)[0x56224177c47f]
[vm10-100-1-215:376352] [ 6] /home/ubuntu/llama.cpp/main(+0x5ab92)[0x56224177cb92]
[vm10-100-1-215:376352] [ 7] /home/ubuntu/llama.cpp/main(+0x90762)[0x5622417b2762]
[vm10-100-1-215:376352] [ 8] /home/ubuntu/llama.cpp/main(+0xf7da)[0x5622417317da]
[vm10-100-1-215:376352] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7ff0c40a5d90]
[vm10-100-1-215:376352] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7ff0c40a5e40]
[vm10-100-1-215:376352] [11] /home/ubuntu/llama.cpp/main(+0x12fc5)[0x562241734fc5]
[vm10-100-1-215:376352] *** End of error message ***
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size =  117.41 MB
GGML_ASSERT: llama.cpp:2834: n_threads > 0
[vm10-100-1-215:376361] *** Process received signal ***
[vm10-100-1-215:376361] Signal: Aborted (6)
[vm10-100-1-215:376361] Signal code:  (-6)
[vm10-100-1-215:376361] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fabe2dc7520]
[vm10-100-1-215:376361] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fabe2e1ba7c]
[vm10-100-1-215:376361] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fabe2dc7476]
[vm10-100-1-215:376361] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fabe2dad7f3]
[vm10-100-1-215:376361] [ 4] /home/ubuntu/llama.cpp/main(+0x5a314)[0x558c5659c314]
[vm10-100-1-215:376361] [ 5] /home/ubuntu/llama.cpp/main(+0x5a47f)[0x558c5659c47f]
[vm10-100-1-215:376361] [ 6] /home/ubuntu/llama.cpp/main(+0x5ab92)[0x558c5659cb92]
[vm10-100-1-215:376361] [ 7] /home/ubuntu/llama.cpp/main(+0x90762)[0x558c565d2762]
[vm10-100-1-215:376361] [ 8] /home/ubuntu/llama.cpp/main(+0xf7da)[0x558c565517da]
[vm10-100-1-215:376361] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fabe2daed90]
[vm10-100-1-215:376361] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fabe2daee40]
[vm10-100-1-215:376361] [11] /home/ubuntu/llama.cpp/main(+0x12fc5)[0x558c56554fc5]
[vm10-100-1-215:376361] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node vm10-100-1-215 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
slaren (Member) commented Aug 27, 2023

The issue is that MPI calls llama_eval with 0 threads in the workers:

https://github.com/ggerganov/llama.cpp/blob/230d46c723edf5999752e4cb67fd94edb19ef9c7/llama.cpp#L5528-L5538

But an assert was added in llama_eval_internal that prevents this:
https://github.com/ggerganov/llama.cpp/blob/230d46c723edf5999752e4cb67fd94edb19ef9c7/llama.cpp#L2848

Ideally, the MPI backend would use the number of threads given on the command line. In the meantime, I guess we could remove the assert.
