Stuck loading VRAM with ROCm multi-GPU #3991
Comments
My rig with 2× 7900 XTX:
|
I had the same kind of problems too. You have to build it with make; cmake caused the CUDA errors for me. |
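For reference, a minimal sketch of the make-based ROCm build described above, assuming ROCm is installed in its default /opt/rocm location (which the llama.cpp Makefile of this era expects):
# start from a clean tree so no CPU-only objects get reused
make clean
# build with hipBLAS/ROCm support (the same flag used later in this thread)
make LLAMA_HIPBLAS=1 -j$(nproc)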
Thanks @8XXD8. Loading is OK now, but I get an endless stream of '#' as output:
running:
|
Have you tried other models?
I had some proper responses too. |
Must I make with
OR
for multi GPU? |
Wow, you are so COOL! I got my first proper output with 2 GPUs:
Got this output:
|
Congrats, so happy for you.
|
BTW, for make, what commands did you run @wizd? |
You are right, GPU selection must be done at compile time. |
So it's |
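For illustration, compile-time GPU selection can look like the sketch below; the GPU_TARGETS variable name and the gfx value are assumptions based on the llama.cpp Makefile of this period, so check your own checkout (with CMake the equivalent variable is typically AMDGPU_TARGETS):
# both the 7900 XTX and the 7900 XT are Navi 31, i.e. gfx1100
make LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1100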
If I use
I get
|
It's only possible with this make command,
and the output is
but I get gibberish
|
Sorry, my post was wrong, because this line compiles without GPU support and runs on the CPU:
So the bug is still there. When I turn on two GPUs:
|
Yes, when I use 2 GPUs it's a problem. |
I posted upstream on the RCCL GitHub: ROCm/rccl#957 |
So it's just this that works with GPU,
right? |
No, compiling it like this produces a build that only runs on the CPU. |
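A rough way to check whether a given binary was actually built with GPU offload; the exact log wording varies between llama.cpp versions, so treat the grep pattern and the model path as placeholders:
# a GPU-enabled build reports layers being offloaded to the GPU; a CPU-only build does not
./main -ngl 99 -m <model.gguf> -p "test" 2>&1 | grep -i offload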
I don't use HIP_VISIBLE_DEVICES, just
and it works for me. |
Thanks, it's compiled with GPU support now,
but...
still getting gibberish |
Can you perhaps outline it like this guide, from start to finish... |
Apparently NVIDIA also has this problem: #3772 |
HIP_VISIBLE_DEVICES is an environment variable read by ROCm at run time and has no reason to be part of the command line for compiling. Simply running "make LLAMA_HIPBLAS=1" is all that is required on my Ubuntu 22.04 server. |
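To make the compile-time vs. run-time distinction concrete (the model path is only a placeholder):
# compile time: build with ROCm/hipBLAS support once
make LLAMA_HIPBLAS=1
# run time: let ROCm see only the first GPU
HIP_VISIBLE_DEVICES=0 ./main -ngl 99 -m <model.gguf> -p "Hello"
# run time: expose both GPUs (also the behaviour when the variable is unset)
HIP_VISIBLE_DEVICES=0,1 ./main -ngl 99 -m <model.gguf> -p "Hello"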
Unclear if this is related, but I can't load any model at all on multi-GPU ROCm: segmentation fault after model load with ROCm multi-GPU, multi-gfx. As best I can remember it worked a couple of months ago, but it has now been broken for at least 2 weeks. Tested on: Arch Linux kernel 6.5.9, ROCm 5.7.1, llama.cpp 4a4fd3e.
rocminfo
make LLAMA_HIPBLAS=1
./main -ngl 99 -m ../koboldcpp/models/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/mistral-7b-instruct-v0.1.Q5_K_M.gguf -mg 0 -p "Write a function in TypeScript that sums numbers"
|
This could be RDNA3/gfx1100 specific. I cross-compiled from a Debian NVIDIA build for the gfx900 target and it worked fine. |
Well, I managed to run into the same problem. It's odd that main works with this:
but libllama.so produces garbage
|
Anyway, my lanes are not equal, which could be a problem, since in ROCm 5.7 multi-GPU is just a preview (a rough way to check the negotiated link width is sketched below):
PCIe® slots connected to the GPU must have identical PCIe lane width or bifurcation settings, and support PCIe 3.0 Atomics. Refer to "How ROCm uses PCIe Atomics" for more information. Example:
✓ GPU0 PCIe x16 connection + GPU1 PCIe x16 connection
✓ GPU0 PCIe x8 connection + GPU1 PCIe x8 connection
X GPU0 PCIe x16 connection + GPU1 PCIe x8 connection |
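A rough way to check the negotiated link width of each GPU; the bus ID below is an example and the exact flags are assumptions, so confirm against your rocm-smi and lspci versions:
# list each GPU's PCI bus ID
rocm-smi --showbus
# inspect the negotiated link state for one of those bus IDs, e.g. 03:00.0
sudo lspci -vv -s 03:00.0 | grep LnkSta
# both GPUs should report the same width, e.g. "LnkSta: Speed 16GT/s, Width x16"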
Have you tried removing one of the GPU-s? I mean physically, not just disabling it with HIP_VISIBLE_DEVICES. |
I believe the problem lies in how the initialization process in ROCm is bugged. It has been fixed, but I believe the fix will not be released until ROCm 6.0.0, unless you build ROCm with a self-compiled rocBLAS and Tensile after these commits: |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Context
Once it loads, it gets stuck at loading VRAM.
My computer is running dual AMD GPUs (7900 XTX and 7900 XT), Ubuntu 22.04, ROCm 5.7.
ROCM-SMI Output
$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Steps to Reproduce
Failure Logs