
sycl : Overcoming workaround for mmap() allocation on Windows #13482


Merged: 2 commits merged into ggml-org:master on May 20, 2025

Conversation

@s-Nick (Collaborator) commented May 12, 2025

This PR removes the workaround for an mmap bug that affects some Intel GPUs on Linux. The bug is not present on Windows, so there is no reason to keep the workaround in place there. This introduces a small OS-dependent split in the codebase, but it delivers good performance improvements.
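
To illustrate the kind of OS-dependent split described above, here is a minimal sketch under the assumption that the workaround is gated behind a compile-time platform check; the function name `sycl_needs_mmap_workaround` is hypothetical and does not come from the PR itself.

```cpp
// Hypothetical sketch, not the actual ggml-sycl code: the Linux-only mmap
// workaround can be gated behind a compile-time OS check so that Windows
// takes the direct mmap path.
static bool sycl_needs_mmap_workaround() {
#if defined(_WIN32)
    return false;  // the mmap bug has not been observed on Windows
#else
    return true;   // keep the workaround for the affected Intel GPUs on Linux
#endif
}
```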

The work introduced here is based on #13109

N.B. All numbers were collected with GGML_SYCL_DISABLE_OPT=0.

Lunar Lake's performance (this PR)

| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | pp512 | 1330.42 ± 6.59 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | tg128 | 58.92 ± 0.46 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | pp512 | 2044.01 ± 13.08 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | tg128 | 44.47 ± 0.13 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | pp512 | 320.23 ± 0.97 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | tg128 | 22.66 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | pp512 | 533.16 ± 1.41 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | tg128 | 15.41 ± 0.44 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | pp512 | 1402.31 ± 7.56 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | tg128 | 28.55 ± 0.06 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | pp512 | 502.78 ± 1.02 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | tg128 | 35.83 ± 0.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | pp512 | 807.02 ± 2.71 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | tg128 | 23.57 ± 0.08 |

build: 0e1009f (5334)

Lunar Lake's performance (#13109)

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 1323.21 ± 8.43 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 52.47 ± 0.42 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 1994.78 ± 6.69 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 40.50 ± 0.10 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 297.47 ± 0.49 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 21.58 ± 0.08 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 499.53 ± 2.32 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 15.54 ± 0.31 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 907.84 ± 0.56 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 27.54 ± 0.09 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 477.35 ± 0.33 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 33.95 ± 0.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 757.61 ± 1.53 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 21.80 ± 0.32 |

build: f7e7d2a (5331)

Battlemage (B580) performance (this PR)

| model | size | params | backend | ngl | threads | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | pp512 | 7314.80 ± 23.23 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | tg128 | 71.10 ± 2.21 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | pp512 | 7419.09 ± 27.47 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | tg128 | 88.57 ± 0.12 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | pp512 | 2147.78 ± 6.70 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | tg128 | 40.59 ± 0.07 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | pp512 | 2189.34 ± 2.19 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | tg128 | 38.32 ± 0.02 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | pp512 | 5605.63 ± 22.70 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | tg128 | 72.54 ± 0.29 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | pp512 | 3002.45 ± 4.25 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | tg128 | 62.49 ± 0.04 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | pp512 | 3103.20 ± 3.79 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | tg128 | 58.64 ± 0.01 |

build: 0e1009f (5334)

Battlemage (B580) performance (#13109)

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | 0 | pp512 | 7067.24 ± 53.67 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | 0 | tg128 | 64.51 ± 0.33 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | 0 | pp512 | 7132.89 ± 28.96 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | 0 | tg128 | 78.58 ± 0.19 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | pp512 | 2109.49 ± 2.46 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | tg128 | 38.37 ± 0.11 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | pp512 | 2143.62 ± 0.99 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | tg128 | 36.33 ± 0.03 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | 0 | pp512 | 5322.20 ± 22.77 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | 0 | tg128 | 64.48 ± 0.08 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | pp512 | 2936.43 ± 7.73 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | tg128 | 57.50 ± 0.11 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | pp512 | 3024.06 ± 8.17 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | tg128 | 54.19 ± 0.05 |

build: f7e7d2a (5331)

Logs for different GPUs on Linux

This section contains logs showing that this patch works on Linux without affecting performance or correctness.

Lunar Lake

lnl-test.txt

lnl_bench.txt
master_lnl.txt

Battlemage B580

bmg-test.txt

bmg_bench.txt
master_bmg.txt

PVC

pvc-test.txt

pvc_bench.txt
master_pvc.txt

ARC A770

arc-test.txt

arc_bench.txt
master_arc.txt

llama-cli output

bmg_cli_output.txt
lnl_cli_output.txt
pvc_cli_output.txt
arc_cli_output.txt

@s-Nick s-Nick requested a review from Alcpz May 12, 2025 13:04
@github-actions github-actions bot added the examples, ggml (changes relating to the ggml tensor library for machine learning), and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels May 12, 2025
@NeoZhangJianyu (Collaborator)

@s-Nick
This PR title is about mmap(), but there are code changes to other functions as well.

Could you remove the other code changes from this PR?

@s-Nick s-Nick changed the title from "[SYCL] Overcoming workaround for mmap() allocation on Windows" to "[SYCL] Overcoming workaround for mmap() allocation on Windows and remove useless wait" May 15, 2025
@s-Nick s-Nick force-pushed the add_win_mmap_support branch from 0e1009f to 083f56b on May 16, 2025 08:03
@NeoZhangJianyu (Collaborator)

All wait() calls in the SYCL backend have been confirmed to be necessary.
Don't remove them without detailed testing.

@s-Nick (Collaborator, Author) commented May 16, 2025

Thank you for your review @NeoZhangJianyu.

I updated the description, adding logs from llama-bench, llama-cli and test-backend-ops to address your concerns. I hope everything is clear now. If necessary, I can run the other tests available in llama.cpp.

@s-Nick s-Nick marked this pull request as ready for review May 16, 2025 13:43
@Rbiessy (Collaborator) left a comment


I'm not confident we can remove that many waits, unfortunately. Hopefully reverting them will not add back the waits that you saw being removed in the models used?

@NeoZhangJianyu (Collaborator) commented May 19, 2025

> Thank you for your review @NeoZhangJianyu.
>
> I updated the description, adding logs from llama-bench, llama-cli and test-backend-ops to address your concerns. I hope everything is clear now. If necessary, I can run the other tests available in llama.cpp.

I tested every wait() when I handled an issue before.
At that time, I made sure every wait() was necessary.

s-Nick added 2 commits May 19, 2025 14:27
After some testing I found that mmap is supported on Windows and on many GPUs on Linux. Therefore I removed the workaround for Windows since it is not necessary.

The SYCL backend introduced a workaround that allows running llama-bench without specifying the `--mmp 0` flag.
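
For context only, a minimal sketch of the idea behind using mmap'd weights directly: if the backend can copy from ordinary host memory, it can also copy from a read-only memory-mapped model file, which is why forcing `--mmp 0` becomes unnecessary. The function name and structure below are illustrative assumptions, not the actual ggml-sycl code from this PR.

```cpp
// Illustrative sketch (not the actual ggml-sycl implementation): uploading
// weights works the same whether host_ptr points into malloc'd memory or
// into a read-only memory-mapped model file.
#include <sycl/sycl.hpp>

static void * upload_weights(sycl::queue & q, const void * host_ptr, size_t bytes) {
    void * dev_ptr = sycl::malloc_device(bytes, q);   // device USM allocation
    q.memcpy(dev_ptr, host_ptr, bytes).wait();        // plain copy handles mmap'd memory
    return dev_ptr;
}
```
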
@s-Nick s-Nick force-pushed the add_win_mmap_support branch from 88252ac to a2afcb3 on May 19, 2025 13:36
@Rbiessy (Collaborator) left a comment


LGTM, you can edit the PR title since it's not removing waits anymore.

@s-Nick s-Nick changed the title from "[SYCL] Overcoming workaround for mmap() allocation on Windows and remove useless wait" to "[SYCL] Overcoming workaround for mmap() allocation on Windows" May 19, 2025
@Alcpz Alcpz changed the title from "[SYCL] Overcoming workaround for mmap() allocation on Windows" to "sycl : Overcoming workaround for mmap() allocation on Windows" May 19, 2025
@NeoZhangJianyu (Collaborator) left a comment


Thank you for your work to make the code better!

@NeoZhangJianyu NeoZhangJianyu merged commit f7c9429 into ggml-org:master May 20, 2025
46 checks passed
infil00p pushed a commit to baseweight/llama.cpp that referenced this pull request May 22, 2025
…rg#13482)

* Remove mmap workaround on Windows

After some testing I found that mmap is supported on Windows and on many GPUs on Linux. Therefore I removed the workaround for Windows since it is not necessary.

* Update llama-bench README

The SYCL backend introduced a workaround that allows running llama-bench without specifying the `--mmp 0` flag.
Labels
examples, ggml (changes relating to the ggml tensor library for machine learning), SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language)