@FIR-2010 - Enable Ollama Support with Triton MAT_MUL Integration #58

Merged
akapoor3518 merged 1 commit into main from llama-cpp-mat-mul on Jun 26, 2026

@akapoor3518

Last login: Mon Jun 22 19:53:12 2026 from 10.0.2.2
root@tsisim:~# ollama run qwen:0.5b "Where is Amazon River?"
Amazon River is located in the state of Kansas, United States.

root@tsisim:~# ollama run Gemma3:270M "Where is Amazon river?"
Amazon River is located in South America, specifically in the Amazon basin.
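
The same prompts can presumably also be sent over HTTP, since the server log in this comment shows Ollama listening on port 11434 and serving `POST /api/generate`. The sketch below is an illustration, not part of this PR: the helper names `build_generate_request` and `generate` are hypothetical, the host `127.0.0.1` is an assumption, and the long default timeout is chosen only because the logged request took about two minutes on the simulator.

```python
import json
import urllib.request

# Port 11434 is taken from the "Listening on [::]:11434" log line; the host is an assumption.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_generate_request(model, prompt, url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=600):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("qwen:0.5b", "Where is Amazon River?")` would be the HTTP counterpart of the first `ollama run` command above.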

root@tsisim:~# ls -lrt /usr/local/
total 32
drwxr-xr-x 2 root root 4096 Dec 13 2025 games
drwxr-xr-x 2 root root 4096 Dec 13 2025 include
drwxr-xr-x 2 root root 4096 Dec 13 2025 src
drwxr-xr-x 2 root root 4096 Dec 13 2025 sbin
drwxr-xr-x 2 root root 4096 Dec 13 2025 etc
lrwxrwxrwx 1 root root 9 Dec 13 2025 man -> share/man
drwxr-xr-x 7 root root 4096 Dec 13 2025 share
lrwxrwxrwx 1 root root 45 Jun 22 19:04 ollama-arm64-release -> /tsi/tsi-sw/anoop_ollama/ollama-arm64-release
drwxr-xr-x 2 root root 4096 Jun 22 19:04 bin
drwxr-xr-x 4 root root 4096 Jun 25 18:37 lib
root@tsisim:~#

root@tsisim:~# journalctl -u ollama -f
Jun 25 18:39:03 tsisim ollama[585]: [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
Jun 25 18:39:03 tsisim ollama[585]: [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
Jun 25 18:39:03 tsisim ollama[585]: [GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
Jun 25 18:39:03 tsisim ollama[585]: [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
Jun 25 18:39:03 tsisim ollama[585]: [GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
Jun 25 18:39:03 tsisim ollama[585]: time=2026-06-25T18:39:03.224Z level=INFO source=routes.go:1569 msg="Listening on [::]:11434 (version 0.0.0)"
Jun 25 18:39:03 tsisim ollama[585]: time=2026-06-25T18:39:03.346Z level=INFO source=runner.go:80 msg="discovering available GPUs..."
Jun 25 18:39:08 tsisim ollama[585]: time=2026-06-25T18:39:08.138Z level=INFO source=runner.go:551 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH=[/tsi/tsi-sw/anoop_ollama/ollama-arm64-release/bin] extra_envs=[] error="llamarunner free vram reporting not supported"
Jun 25 18:39:08 tsisim ollama[585]: time=2026-06-25T18:39:08.224Z level=INFO source=types.go:129 msg="inference compute" id=cpu library=cpu compute="" name=cpu description=cpu libdirs=ollama driver="" pci_id="" type="" total="3.3 GiB" available="3.0 GiB"
Jun 25 18:39:08 tsisim ollama[585]: time=2026-06-25T18:39:08.225Z level=INFO source=routes.go:1610 msg="entering low vram mode" "total vram"="0 B" threshold="20.0 GiB"
Jun 25 18:50:01 tsisim ollama[585]: [GIN] 2026/06/25 - 18:50:01 | 200 | 6.466177ms | 127.0.0.1 | HEAD "/"
Jun 25 18:50:01 tsisim ollama[585]: [GIN] 2026/06/25 - 18:50:01 | 200 | 709.054833ms | 127.0.0.1 | POST "/api/show"
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /tsi/ollama-models/blobs/sha256-fad2a06e4cc705c2fa8bec5477ddb00dc0c859ac184c34dcc5586663774161ca (version GGUF V3 (latest))
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 0: general.architecture str = qwen2
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 1: general.name str = Qwen2-beta-0_5B-Chat
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 2: qwen2.block_count u32 = 24
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 4: qwen2.embedding_length u32 = 1024
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 2816
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 8: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 9: qwen2.use_parallel_residual bool = true
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 10: tokenizer.ggml.model str = gpt2
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 11: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 12: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 13: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 151643
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 15: tokenizer.ggml.padding_token_id u32 = 151643
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151643
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 17: tokenizer.chat_template str = {% for message in messages %}{% if lo...
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 18: general.quantization_version u32 = 2
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 19: general.file_type u32 = 2
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - type f32: 121 tensors
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - type q4_0: 169 tensors
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - type q6_K: 1 tensors
Jun 25 18:50:04 tsisim ollama[585]: print_info: file format = GGUF V3 (latest)
Jun 25 18:50:04 tsisim ollama[585]: print_info: file type = Q4_0
Jun 25 18:50:04 tsisim ollama[585]: print_info: file size = 371.02 MiB (5.02 BPW)
Jun 25 18:50:04 tsisim ollama[585]: load: missing or unrecognized pre-tokenizer type, using: 'default'
Jun 25 18:50:05 tsisim ollama[585]: load: printing all EOG tokens:
Jun 25 18:50:05 tsisim ollama[585]: load: - 151643 ('<|endoftext|>')
Jun 25 18:50:05 tsisim ollama[585]: load: - 151645 ('<|im_end|>')
Jun 25 18:50:05 tsisim ollama[585]: load: special tokens cache size = 293
Jun 25 18:50:05 tsisim ollama[585]: load: token to piece cache size = 0.9338 MB
Jun 25 18:50:05 tsisim ollama[585]: print_info: arch = qwen2
Jun 25 18:50:05 tsisim ollama[585]: print_info: vocab_only = 1
Jun 25 18:50:05 tsisim ollama[585]: print_info: model type = ?B
Jun 25 18:50:05 tsisim ollama[585]: print_info: model params = 619.57 M
Jun 25 18:50:05 tsisim ollama[585]: print_info: general.name = Qwen2-beta-0_5B-Chat
Jun 25 18:50:05 tsisim ollama[585]: print_info: vocab type = BPE
Jun 25 18:50:05 tsisim ollama[585]: print_info: n_vocab = 151936
Jun 25 18:50:05 tsisim ollama[585]: print_info: n_merges = 151387
Jun 25 18:50:05 tsisim ollama[585]: print_info: BOS token = 151643 '<|endoftext|>'
Jun 25 18:50:05 tsisim ollama[585]: print_info: EOS token = 151643 '<|endoftext|>'
Jun 25 18:50:05 tsisim ollama[585]: print_info: EOT token = 151645 '<|im_end|>'
Jun 25 18:50:05 tsisim ollama[585]: print_info: PAD token = 151643 '<|endoftext|>'
Jun 25 18:50:05 tsisim ollama[585]: print_info: LF token = 198 'Ċ'
Jun 25 18:50:05 tsisim ollama[585]: print_info: EOG token = 151643 '<|endoftext|>'
Jun 25 18:50:05 tsisim ollama[585]: print_info: EOG token = 151645 '<|im_end|>'
Jun 25 18:50:05 tsisim ollama[585]: print_info: max token length = 256
Jun 25 18:50:05 tsisim ollama[585]: llama_model_load: vocab only - skipping tensors
Jun 25 18:50:05 tsisim ollama[585]: time=2026-06-25T18:50:05.886Z level=INFO source=server.go:402 msg="starting runner" cmd="/tsi/tsi-sw/anoop_ollama/ollama-arm64-release/bin/ollama runner --model /tsi/ollama-models/blobs/sha256-fad2a06e4cc705c2fa8bec5477ddb00dc0c859ac184c34dcc5586663774161ca --port 41211"
Jun 25 18:50:05 tsisim ollama[585]: time=2026-06-25T18:50:05.902Z level=INFO source=server.go:507 msg="system memory" total="3.3 GiB" free="2.9 GiB" free_swap="8.0 GiB"
Jun 25 18:50:05 tsisim ollama[585]: time=2026-06-25T18:50:05.907Z level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/tsi/ollama-models/blobs/sha256-fad2a06e4cc705c2fa8bec5477ddb00dc0c859ac184c34dcc5586663774161ca library=cpu parallel=1 required="0 B" gpus=1
Jun 25 18:50:05 tsisim ollama[585]: time=2026-06-25T18:50:05.910Z level=INFO source=server.go:547 msg=offload library=cpu layers.requested=-1 layers.model=25 layers.offload=0 layers.split=[] memory.available="[3.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="993.2 MiB" memory.required.partial="0 B" memory.required.kv="384.0 MiB" memory.required.allocations="[993.2 MiB]" memory.weights.total="287.6 MiB" memory.weights.repeating="165.8 MiB" memory.weights.nonrepeating="121.7 MiB" memory.graph.full="298.8 MiB" memory.graph.partial="420.5 MiB"
Jun 25 18:50:06 tsisim ollama[585]: time=2026-06-25T18:50:06.093Z level=INFO source=runner.go:907 msg="starting llama runner"
Jun 25 18:50:08 tsisim ollama[585]: time=2026-06-25T18:50:08.883Z level=INFO source=ggml.go:104 msg=system CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
Jun 25 18:50:08 tsisim ollama[585]: time=2026-06-25T18:50:08.903Z level=INFO source=runner.go:967 msg="Server listening on 127.0.0.1:41211"
Jun 25 18:50:08 tsisim ollama[585]: time=2026-06-25T18:50:08.926Z level=INFO source=runner.go:830 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:4 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jun 25 18:50:08 tsisim ollama[585]: time=2026-06-25T18:50:08.942Z level=INFO source=server.go:1274 msg="waiting for llama runner to start responding"
Jun 25 18:50:08 tsisim ollama[585]: llama_model_load_from_file_impl: using device Tsavorite (txe) (unknown id) - 128 MiB free
Jun 25 18:50:08 tsisim ollama[585]: time=2026-06-25T18:50:08.946Z level=INFO source=server.go:1308 msg="waiting for server to become available" status="llm server loading model"
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /tsi/ollama-models/blobs/sha256-fad2a06e4cc705c2fa8bec5477ddb00dc0c859ac184c34dcc5586663774161ca (version GGUF V3 (latest))
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 0: general.architecture str = qwen2
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 1: general.name str = Qwen2-beta-0_5B-Chat
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 2: qwen2.block_count u32 = 24
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 4: qwen2.embedding_length u32 = 1024
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 2816
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 8: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 9: qwen2.use_parallel_residual bool = true
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 10: tokenizer.ggml.model str = gpt2
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 11: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 12: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 13: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 151643
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 15: tokenizer.ggml.padding_token_id u32 = 151643
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151643
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 17: tokenizer.chat_template str = {% for message in messages %}{% if lo...
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 18: general.quantization_version u32 = 2
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 19: general.file_type u32 = 2
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - type f32: 121 tensors
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - type q4_0: 169 tensors
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - type q6_K: 1 tensors
Jun 25 18:50:09 tsisim ollama[585]: print_info: file format = GGUF V3 (latest)
Jun 25 18:50:09 tsisim ollama[585]: print_info: file type = Q4_0
Jun 25 18:50:09 tsisim ollama[585]: print_info: file size = 371.02 MiB (5.02 BPW)
Jun 25 18:50:10 tsisim ollama[585]: load: missing or unrecognized pre-tokenizer type, using: 'default'
Jun 25 18:50:11 tsisim ollama[585]: load: printing all EOG tokens:
Jun 25 18:50:11 tsisim ollama[585]: load: - 151643 ('<|endoftext|>')
Jun 25 18:50:11 tsisim ollama[585]: load: - 151645 ('<|im_end|>')
Jun 25 18:50:11 tsisim ollama[585]: load: special tokens cache size = 293
Jun 25 18:50:11 tsisim ollama[585]: load: token to piece cache size = 0.9338 MB
Jun 25 18:50:11 tsisim ollama[585]: print_info: arch = qwen2
Jun 25 18:50:11 tsisim ollama[585]: print_info: vocab_only = 0
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_ctx_train = 32768
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_embd = 1024
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_layer = 24
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_head = 16
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_head_kv = 16
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_rot = 64
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_swa = 0
Jun 25 18:50:11 tsisim ollama[585]: print_info: is_swa_any = 0
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_embd_head_k = 64
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_embd_head_v = 64
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_gqa = 1
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_embd_k_gqa = 1024
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_embd_v_gqa = 1024
Jun 25 18:50:11 tsisim ollama[585]: print_info: f_norm_eps = 0.0e+00
Jun 25 18:50:11 tsisim ollama[585]: print_info: f_norm_rms_eps = 1.0e-06
Jun 25 18:50:11 tsisim ollama[585]: print_info: f_clamp_kqv = 0.0e+00
Jun 25 18:50:11 tsisim ollama[585]: print_info: f_max_alibi_bias = 0.0e+00
Jun 25 18:50:11 tsisim ollama[585]: print_info: f_logit_scale = 0.0e+00
Jun 25 18:50:11 tsisim ollama[585]: print_info: f_attn_scale = 0.0e+00
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_ff = 2816
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_expert = 0
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_expert_used = 0
Jun 25 18:50:11 tsisim ollama[585]: print_info: causal attn = 1
Jun 25 18:50:11 tsisim ollama[585]: print_info: pooling type = -1
Jun 25 18:50:11 tsisim ollama[585]: print_info: rope type = 2
Jun 25 18:50:11 tsisim ollama[585]: print_info: rope scaling = linear
Jun 25 18:50:11 tsisim ollama[585]: print_info: freq_base_train = 10000.0
Jun 25 18:50:11 tsisim ollama[585]: print_info: freq_scale_train = 1
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_ctx_orig_yarn = 32768
Jun 25 18:50:11 tsisim ollama[585]: print_info: rope_finetuned = unknown
Jun 25 18:50:11 tsisim ollama[585]: print_info: model type = 0.5B
Jun 25 18:50:11 tsisim ollama[585]: print_info: model params = 619.57 M
Jun 25 18:50:11 tsisim ollama[585]: print_info: general.name = Qwen2-beta-0_5B-Chat
Jun 25 18:50:11 tsisim ollama[585]: print_info: vocab type = BPE
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_vocab = 151936
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_merges = 151387
Jun 25 18:50:11 tsisim ollama[585]: print_info: BOS token = 151643 '<|endoftext|>'
Jun 25 18:50:11 tsisim ollama[585]: print_info: EOS token = 151643 '<|endoftext|>'
Jun 25 18:50:11 tsisim ollama[585]: print_info: EOT token = 151645 '<|im_end|>'
Jun 25 18:50:11 tsisim ollama[585]: print_info: PAD token = 151643 '<|endoftext|>'
Jun 25 18:50:11 tsisim ollama[585]: print_info: LF token = 198 'Ċ'
Jun 25 18:50:11 tsisim ollama[585]: print_info: EOG token = 151643 '<|endoftext|>'
Jun 25 18:50:11 tsisim ollama[585]: print_info: EOG token = 151645 '<|im_end|>'
Jun 25 18:50:11 tsisim ollama[585]: print_info: max token length = 256
Jun 25 18:50:11 tsisim ollama[585]: load_tensors: loading model tensors, this can take a while... (mmap = false)
Jun 25 18:50:11 tsisim ollama[585]: load_tensors: offloading 0 repeating layers to GPU
Jun 25 18:50:11 tsisim ollama[585]: load_tensors: offloaded 0/25 layers to GPU
Jun 25 18:50:11 tsisim ollama[585]: load_tensors: CPU model buffer size = 371.02 MiB
Jun 25 18:50:26 tsisim ollama[585]: llama_context: constructing llama_context
Jun 25 18:50:26 tsisim ollama[585]: llama_context: n_seq_max = 1
Jun 25 18:50:26 tsisim ollama[585]: llama_context: n_ctx = 4096
Jun 25 18:50:26 tsisim ollama[585]: llama_context: n_ctx_per_seq = 4096
Jun 25 18:50:26 tsisim ollama[585]: llama_context: n_batch = 512
Jun 25 18:50:26 tsisim ollama[585]: llama_context: n_ubatch = 512
Jun 25 18:50:26 tsisim ollama[585]: llama_context: causal_attn = 1
Jun 25 18:50:26 tsisim ollama[585]: llama_context: flash_attn = disabled
Jun 25 18:50:26 tsisim ollama[585]: llama_context: kv_unified = false
Jun 25 18:50:26 tsisim ollama[585]: llama_context: freq_base = 10000.0
Jun 25 18:50:26 tsisim ollama[585]: llama_context: freq_scale = 1
Jun 25 18:50:26 tsisim ollama[585]: llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
Jun 25 18:50:26 tsisim ollama[585]: llama_context: CPU output buffer size = 0.58 MiB
Jun 25 18:50:26 tsisim ollama[585]: llama_kv_cache: CPU KV buffer size = 384.00 MiB
Jun 25 18:50:30 tsisim ollama[585]: llama_kv_cache: size = 384.00 MiB ( 4096 cells, 24 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
Jun 25 18:50:31 tsisim ollama[585]: llama_context: tsavorite compute buffer size = 20.50 MiB
Jun 25 18:50:31 tsisim ollama[585]: llama_context: CPU compute buffer size = 298.75 MiB
Jun 25 18:50:31 tsisim ollama[585]: llama_context: graph nodes = 942
Jun 25 18:50:31 tsisim ollama[585]: llama_context: graph splits = 196
Jun 25 18:50:31 tsisim ollama[585]: time=2026-06-25T18:50:31.249Z level=INFO source=server.go:1312 msg="llama runner started in 25.36 seconds"
Jun 25 18:50:31 tsisim ollama[585]: time=2026-06-25T18:50:31.250Z level=INFO source=sched.go:485 msg="loaded runners" count=1
Jun 25 18:50:31 tsisim ollama[585]: time=2026-06-25T18:50:31.250Z level=INFO source=server.go:1274 msg="waiting for llama runner to start responding"
Jun 25 18:50:31 tsisim ollama[585]: time=2026-06-25T18:50:31.254Z level=INFO source=server.go:1312 msg="llama runner started in 25.37 seconds"
Jun 25 18:52:06 tsisim ollama[1197]: TSI deploy yaml=/tsi/tsi-sw/anoop_ollama/ollama-arm64-release/bin/tsavorite-model-deployment.yaml txe_count=1 multi_thread_enable=0
Jun 25 18:52:06 tsisim ollama[1197]: finalize 4
Jun 25 18:52:06 tsisim ollama[1197]: OPU Profiling Results:
Jun 25 18:52:06 tsisim ollama[1197]: Profiler disabled
Jun 25 18:52:06 tsisim ollama[585]: [GIN] 2026/06/25 - 18:52:06 | 200 | 2m4s | 127.0.0.1 | POST "/api/generate"
Jun 25 18:57:06 tsisim ollama[585]: time=2026-06-25T18:57:06.667Z level=WARN source=server.go:1757 msg="llama server stopped" pid=1197
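
The memory figures logged for qwen:0.5b above can be cross-checked with simple arithmetic. The sketch below assumes the K and V caches each hold `n_ctx * n_layer * n_embd_k_gqa` f16 elements (2 bytes each) and that BPW means file bits divided by parameter count. Those formulas are inferred from the logged values, not taken from the llama.cpp source, and the function names are hypothetical.

```python
MIB = 1024 * 1024


def kv_tensor_mib(n_ctx, n_layer, n_embd_kv, bytes_per_elem=2):
    """Size in MiB of one cache tensor (K or V) under the assumed formula."""
    return n_ctx * n_layer * n_embd_kv * bytes_per_elem / MIB


def bits_per_weight(file_size_mib, n_params):
    """Average bits per weight: file size in bits over parameter count."""
    return file_size_mib * MIB * 8 / n_params


# Values copied from the qwen:0.5b log lines: n_ctx = 4096, n_layer = 24,
# n_embd_k_gqa = n_embd_v_gqa = 1024, file size = 371.02 MiB, params = 619.57 M.
k_mib = kv_tensor_mib(n_ctx=4096, n_layer=24, n_embd_kv=1024)
v_mib = kv_tensor_mib(n_ctx=4096, n_layer=24, n_embd_kv=1024)
bpw = bits_per_weight(371.02, 619.57e6)

print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {k_mib + v_mib:.2f} MiB")
print(f"BPW = {bpw:.2f}")
```

Under these assumptions the results agree with the logged `K (f16): 192.00 MiB, V (f16): 192.00 MiB`, the 384.00 MiB KV buffer, and `5.02 BPW`.
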
Jun 25 18:59:56 tsisim ollama[585]: [GIN] 2026/06/25 - 18:59:56 | 200 | 97.951µs | 127.0.0.1 | HEAD "/"
Jun 25 18:59:57 tsisim ollama[585]: [GIN] 2026/06/25 - 18:59:57 | 200 | 941.549726ms | 127.0.0.1 | POST "/api/show"
Jun 25 18:59:59 tsisim ollama[585]: time=2026-06-25T18:59:59.236Z level=INFO source=server.go:218 msg="enabling flash attention"
Jun 25 18:59:59 tsisim ollama[585]: time=2026-06-25T18:59:59.241Z level=INFO source=server.go:402 msg="starting runner" cmd="/tsi/tsi-sw/anoop_ollama/ollama-arm64-release/bin/ollama runner --ollama-engine --model /tsi/ollama-models/blobs/sha256-735af2139dc652bf01112746474883d79a52fa1c19038265d363e3d42556f7a2 --port 40943"
Jun 25 18:59:59 tsisim ollama[585]: time=2026-06-25T18:59:59.246Z level=INFO source=server.go:678 msg="loading model" "model layers"=19 requested=-1
Jun 25 18:59:59 tsisim ollama[585]: time=2026-06-25T18:59:59.251Z level=INFO source=server.go:684 msg="system memory" total="3.3 GiB" free="2.9 GiB" free_swap="8.0 GiB"
Jun 25 18:59:59 tsisim ollama[585]: time=2026-06-25T18:59:59.446Z level=INFO source=runner.go:907 msg="starting llama runner"
Jun 25 19:00:01 tsisim ollama[585]: time=2026-06-25T19:00:01.968Z level=INFO source=ggml.go:104 msg=system CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
Jun 25 19:00:01 tsisim ollama[585]: time=2026-06-25T19:00:01.977Z level=INFO source=runner.go:967 msg="Server listening on 127.0.0.1:40943"
Jun 25 19:00:01 tsisim ollama[585]: time=2026-06-25T19:00:01.994Z level=INFO source=runner.go:830 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:4 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jun 25 19:00:02 tsisim ollama[585]: time=2026-06-25T19:00:01.997Z level=INFO source=runner.go:830 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:4 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jun 25 19:00:02 tsisim ollama[585]: time=2026-06-25T19:00:02.007Z level=INFO source=runner.go:830 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:4 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jun 25 19:00:02 tsisim ollama[585]: llama_model_load_from_file_impl: using device Tsavorite (txe) (unknown id) - 128 MiB free
Jun 25 19:00:02 tsisim ollama[585]: time=2026-06-25T19:00:02.017Z level=INFO source=sched.go:485 msg="loaded runners" count=1
Jun 25 19:00:02 tsisim ollama[585]: time=2026-06-25T19:00:02.018Z level=INFO source=server.go:1274 msg="waiting for llama runner to start responding"
Jun 25 19:00:02 tsisim ollama[585]: time=2026-06-25T19:00:02.021Z level=INFO source=server.go:1308 msg="waiting for server to become available" status="llm server loading model"
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: loaded meta data with 36 key-value pairs and 236 tensors from /tsi/ollama-models/blobs/sha256-735af2139dc652bf01112746474883d79a52fa1c19038265d363e3d42556f7a2 (version GGUF V3 (latest))
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 0: general.architecture str = gemma3
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 1: general.type str = model
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 2: general.size_label str = 268M
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 3: general.license str = gemma
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 4: general.base_model.count u32 = 1
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 5: general.base_model.0.name str = Gemma 3 270m
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 6: general.base_model.0.organization str = Google
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 7: general.base_model.0.repo_url str = https://huggingface.co/google/gemma-3...
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 8: general.tags arr[str,4] = ["gemma3", "gemma", "google", "text-g...
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 9: gemma3.context_length u32 = 32768
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 10: gemma3.embedding_length u32 = 640
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 11: gemma3.block_count u32 = 18
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 12: gemma3.feed_forward_length u32 = 2048
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 13: gemma3.attention.head_count u32 = 4
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 14: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 15: gemma3.attention.key_length u32 = 256
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 16: gemma3.attention.value_length u32 = 256
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 17: gemma3.rope.freq_base f32 = 1000000.000000
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 18: gemma3.attention.sliding_window u32 = 512
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 19: gemma3.attention.head_count_kv u32 = 1
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 20: tokenizer.ggml.model str = llama
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = default
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,262144] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 23: tokenizer.ggml.scores arr[f32,262144] = [-1000.000000, -1000.000000, -1000.00...
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,262144] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 2
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 1
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 27: tokenizer.ggml.unknown_token_id u32 = 3
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 0
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = true
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 30: tokenizer.ggml.add_sep_token bool = false
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 34: general.quantization_version u32 = 2
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 35: general.file_type u32 = 7
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - type f32: 109 tensors
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - type q8_0: 127 tensors
Jun 25 19:00:03 tsisim ollama[585]: print_info: file format = GGUF V3 (latest)
Jun 25 19:00:03 tsisim ollama[585]: print_info: file type = Q8_0
Jun 25 19:00:03 tsisim ollama[585]: print_info: file size = 271.81 MiB (8.50 BPW)
Jun 25 19:00:04 tsisim ollama[585]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Jun 25 19:00:04 tsisim ollama[585]: load: printing all EOG tokens:
Jun 25 19:00:04 tsisim ollama[585]: load: - 1 ('<eos>')
Jun 25 19:00:04 tsisim ollama[585]: load: - 106 ('<end_of_turn>')
Jun 25 19:00:04 tsisim ollama[585]: load: special tokens cache size = 6414
Jun 25 19:00:05 tsisim ollama[585]: load: token to piece cache size = 1.9446 MB
Jun 25 19:00:05 tsisim ollama[585]: print_info: arch = gemma3
Jun 25 19:00:05 tsisim ollama[585]: print_info: vocab_only = 0
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_ctx_train = 32768
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_embd = 640
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_layer = 18
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_head = 4
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_head_kv = 1
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_rot = 256
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_swa = 512
Jun 25 19:00:05 tsisim ollama[585]: print_info: is_swa_any = 1
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_embd_head_k = 256
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_embd_head_v = 256
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_gqa = 4
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_embd_k_gqa = 256
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_embd_v_gqa = 256
Jun 25 19:00:05 tsisim ollama[585]: print_info: f_norm_eps = 0.0e+00
Jun 25 19:00:05 tsisim ollama[585]: print_info: f_norm_rms_eps = 1.0e-06
Jun 25 19:00:05 tsisim ollama[585]: print_info: f_clamp_kqv = 0.0e+00
Jun 25 19:00:05 tsisim ollama[585]: print_info: f_max_alibi_bias = 0.0e+00
Jun 25 19:00:05 tsisim ollama[585]: print_info: f_logit_scale = 0.0e+00
Jun 25 19:00:05 tsisim ollama[585]: print_info: f_attn_scale = 6.2e-02
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_ff = 2048
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_expert = 0
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_expert_used = 0
Jun 25 19:00:05 tsisim ollama[585]: print_info: causal attn = 1
Jun 25 19:00:05 tsisim ollama[585]: print_info: pooling type = 0
Jun 25 19:00:05 tsisim ollama[585]: print_info: rope type = 2
Jun 25 19:00:05 tsisim ollama[585]: print_info: rope scaling = linear
Jun 25 19:00:05 tsisim ollama[585]: print_info: freq_base_train = 1000000.0
Jun 25 19:00:05 tsisim ollama[585]: print_info: freq_scale_train = 1
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_ctx_orig_yarn = 32768
Jun 25 19:00:05 tsisim ollama[585]: print_info: rope_finetuned = unknown
Jun 25 19:00:05 tsisim ollama[585]: print_info: model type = 270M
Jun 25 19:00:05 tsisim ollama[585]: print_info: model params = 268.10 M
Jun 25 19:00:05 tsisim ollama[585]: print_info: general.name = n/a
Jun 25 19:00:05 tsisim ollama[585]: print_info: vocab type = SPM
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_vocab = 262144
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_merges = 0
Jun 25 19:00:05 tsisim ollama[585]: print_info: BOS token = 2 '<bos>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: EOS token = 1 '<eos>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: EOT token = 106 '<end_of_turn>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: UNK token = 3 '<unk>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: PAD token = 0 '<pad>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: LF token = 248 '<0x0A>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: EOG token = 1 '<eos>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: EOG token = 106 '<end_of_turn>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: max token length = 48
Jun 25 19:00:05 tsisim ollama[585]: load_tensors: loading model tensors, this can take a while... (mmap = false)
Jun 25 19:00:05 tsisim ollama[585]: load_tensors: offloading 0 repeating layers to GPU
Jun 25 19:00:05 tsisim ollama[585]: load_tensors: offloaded 0/19 layers to GPU
Jun 25 19:00:05 tsisim ollama[585]: load_tensors: CPU model buffer size = 271.81 MiB
Jun 25 19:00:12 tsisim ollama[585]: llama_init_from_model: model default pooling_type is [0], but [-1] was specified
Jun 25 19:00:12 tsisim ollama[585]: llama_context: constructing llama_context
Jun 25 19:00:12 tsisim ollama[585]: llama_context: n_seq_max = 1
Jun 25 19:00:12 tsisim ollama[585]: llama_context: n_ctx = 4096
Jun 25 19:00:12 tsisim ollama[585]: llama_context: n_ctx_per_seq = 4096
Jun 25 19:00:12 tsisim ollama[585]: llama_context: n_batch = 512
Jun 25 19:00:12 tsisim ollama[585]: llama_context: n_ubatch = 512
Jun 25 19:00:12 tsisim ollama[585]: llama_context: causal_attn = 1
Jun 25 19:00:12 tsisim ollama[585]: llama_context: flash_attn = enabled
Jun 25 19:00:12 tsisim ollama[585]: llama_context: kv_unified = false
Jun 25 19:00:12 tsisim ollama[585]: llama_context: freq_base = 1000000.0
Jun 25 19:00:12 tsisim ollama[585]: llama_context: freq_scale = 1
Jun 25 19:00:12 tsisim ollama[585]: llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
Jun 25 19:00:12 tsisim ollama[585]: llama_context: CPU output buffer size = 1.00 MiB
Jun 25 19:00:12 tsisim ollama[585]: llama_kv_cache_iswa: using full-size SWA cache (ref: https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
Jun 25 19:00:12 tsisim ollama[585]: llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
Jun 25 19:00:12 tsisim ollama[585]: llama_kv_cache: CPU KV buffer size = 12.00 MiB
Jun 25 19:00:12 tsisim ollama[585]: llama_kv_cache: size = 12.00 MiB ( 4096 cells, 3 layers, 1/1 seqs), K (f16): 6.00 MiB, V (f16): 6.00 MiB
Jun 25 19:00:12 tsisim ollama[585]: llama_kv_cache_iswa: creating SWA KV cache, size = 4096 cells
Jun 25 19:00:12 tsisim ollama[585]: llama_kv_cache: CPU KV buffer size = 60.00 MiB
Jun 25 19:00:13 tsisim ollama[585]: llama_kv_cache: size = 60.00 MiB ( 4096 cells, 15 layers, 1/1 seqs), K (f16): 30.00 MiB, V (f16): 30.00 MiB
Jun 25 19:00:13 tsisim ollama[585]: llama_context: tsavorite compute buffer size = 3.25 MiB
Jun 25 19:00:13 tsisim ollama[585]: llama_context: CPU compute buffer size = 513.25 MiB
Jun 25 19:00:13 tsisim ollama[585]: llama_context: graph nodes = 729
Jun 25 19:00:13 tsisim ollama[585]: llama_context: graph splits = 258
Jun 25 19:00:13 tsisim ollama[585]: time=2026-06-25T19:00:13.791Z level=INFO source=server.go:1312 msg="llama runner started in 14.55 seconds"
Jun 25 19:01:14 tsisim ollama[6314]: TSI deploy yaml=/tsi/tsi-sw/anoop_ollama/ollama-arm64-release/bin/tsavorite-model-deployment.yaml txe_count=1 multi_thread_enable=0
Jun 25 19:01:14 tsisim ollama[6314]: finalize 4
Jun 25 19:01:14 tsisim ollama[6314]: OPU Profiling Results:
Jun 25 19:01:14 tsisim ollama[6314]: Profiler disabled
Jun 25 19:01:14 tsisim ollama[585]: [GIN] 2026/06/25 - 19:01:14 | 200 | 1m17s | 127.0.0.1 | POST "/api/generate"
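
As a sanity check on the KV cache lines in the log above, the reported sizes can be reproduced with simple arithmetic. This is a minimal sketch, not taken from the llama.cpp source: it assumes the per-layer K width equals the logged `n_embd_v_gqa = 256` (the K width is not shown in this excerpt, but the log reports equal K and V sizes), and the helper name `kv_cache_mib` is hypothetical.

```python
# Figures copied from the log above.
N_CELLS = 4096          # "size = 4096 cells"
N_EMBD_V_GQA = 256      # "print_info: n_embd_v_gqa = 256"
N_EMBD_K_GQA = 256      # ASSUMPTION: same width as V (the log reports K == V sizes)
BYTES_PER_F16 = 2       # K and V are stored as f16
MIB = 1024 * 1024


def kv_cache_mib(n_layers, cells=N_CELLS):
    """Return (K MiB, V MiB) for a cache spanning n_layers layers (hypothetical helper)."""
    k_bytes = cells * N_EMBD_K_GQA * BYTES_PER_F16 * n_layers
    v_bytes = cells * N_EMBD_V_GQA * BYTES_PER_F16 * n_layers
    return k_bytes / MIB, v_bytes / MIB


if __name__ == "__main__":
    # The log lists 3 layers in the non-SWA cache and 15 layers in the SWA cache.
    for label, layers in (("non-SWA", 3), ("SWA", 15)):
        k, v = kv_cache_mib(layers)
        print(f"{label}: K = {k:.2f} MiB, V = {v:.2f} MiB, total = {k + v:.2f} MiB")
```

Under these assumptions the totals come out to 12.00 MiB and 60.00 MiB, matching the two `llama_kv_cache: size = ...` lines in the log.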

@atrivedi-tsavoritesi atrivedi-tsavoritesi left a comment


Can we list which models are being offloaded to the TMU? I tested Qwen:0.5b, Gemma3:270M, and smolVLM-256M, and only smolVLM-256M used the TMU; none of the other models did.

Currently, some shapes for
smolVLM-256M
and Tiny-Llama-v0.3-FP32-1.1B-F32.gguf are getting offloaded to the OPU.

@akapoor3518

Author

smolVLM-256M
and Tiny-Llama-v0.3-FP32-1.1B-F32.gguf getting offloaded to OPU

With this PR, models are getting offloaded to TMU
smolVLM-256M
and Tiny-Llama-v0.3-FP32-1.1B-F32.gguf getting offloaded to OPU

@akapoor3518

Author

@atrivedi-tsavoritesi The models below are getting offloaded:
smolVLM-256M
and Tiny-Llama-v0.3-FP32-1.1B-F32.gguf

@atrivedi-tsavoritesi atrivedi-tsavoritesi left a comment


@akapoor3518 By the way, POSIX does not work if you use these models, i.e. smolVLM-256M. I am approving it regardless, as we can track that as a separate item.
It worked on POSIX; I have tried many times. Did you run
export USER_DRAM_SIZE=8192

Look at this Confluence page:
https://tsavoritesi.atlassian.net/wiki/x/AwAhZQ

@akapoor3518 akapoor3518 merged commit e679325 into main Jun 26, 2026
4 of 8 checks passed
@akapoor3518 akapoor3518 deleted the llama-cpp-mat-mul branch June 26, 2026 04:30