vulkan: initialize devices properly for LLAMA_SPLIT_MODE_NONE #7552


Merged 1 commit into ggml-org:master on May 28, 2024

Conversation

@Adriankhl Adriankhl (Contributor) commented May 27, 2024

Before this change, split mode "none" does not work with the Vulkan backend because the backend is not properly initialized before offloading:

.\bin\main.exe -m "C:\Users\adriankhl\git\models\Meta-Llama-3-8B-Instruct.Q5_K_M.gguf" --prompt "hello" -sm "none"

gives an error.

This is because the Vulkan backend is initialized here
https://github.com/ggerganov/llama.cpp/blob/d6ef0e77dd25f54fb5856af47e3926cf6f36c281/ggml-vulkan.cpp#L6014-L6024
which is called from llama_get_device_count, and llama_get_device_count is not called when split mode is "none".

This PR makes llm_load_tensors call llama_get_device_count even when split mode is "none".

Something worth thinking about: is relying on llama_get_device_count to initialize the Vulkan backend a good idea? It feels a bit cryptic.

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 553 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8457.35ms p(95)=21491.72ms fails=, finish reason: stop=507 truncated=46
  • Prompt processing (pp): avg=101.31tk/s p(95)=464.1tk/s
  • Token generation (tg): avg=35.07tk/s p(95)=50.48tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=fix_vulkan_device commit=bd00902cda298ef8a595ca78eb6360546a010263

[Benchmark time-series charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing]

@slaren slaren (Member) commented May 27, 2024

Should be fixed on the vulkan backend instead.

@Adriankhl Adriankhl force-pushed the fix_vulkan_device branch from bd00902 to c61e8e9 Compare May 27, 2024 07:57
@Adriankhl Adriankhl changed the title ggml: initialize vulkan devices properly for LLAMA_SPLIT_MODE_NONE vulkan: initialize devices properly for LLAMA_SPLIT_MODE_NONE May 27, 2024
@Adriankhl Adriankhl force-pushed the fix_vulkan_device branch from c61e8e9 to ee84384 Compare May 27, 2024 07:59
@Adriankhl Adriankhl (Contributor, Author)

> Should be fixed on the vulkan backend instead.

Ok, changed it.

@github-actions github-actions bot added the Vulkan Issues specific to the Vulkan backend label May 27, 2024
@mofosyne mofosyne added Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix bugfix fixes an issue or bug labels May 27, 2024
@slaren slaren requested a review from 0cc4m May 27, 2024 12:04
@0cc4m 0cc4m (Collaborator) left a comment

Thank you.

@0cc4m 0cc4m merged commit 56411a9 into ggml-org:master May 28, 2024
65 checks passed