@FIR-2010 - Enable Ollama Support with Triton MAT_MUL Integration #58
Can we list which models are being offloaded to the TMU? I tested Qwen:0.5b, Gemma3:270M, and smolVLM-256M, and only smolVLM-256M used the TMU; none of the other models did.
Currently, some shapes for smolVLM-256M and Tiny-Llama-v0.3-FP32-1.1B-F32.gguf are getting offloaded to the OPU. With this PR, models are getting offloaded to the TMU.
@atrivedi-tsavoritesi The models below are getting offloaded.
@akapoor3518 By the way, POSIX does not work with these models, i.e. smolVLM-256M. I am approving it regardless, since we can track that as a separate item.
It worked on POSIX; I have tried it many times. Did you set this first?
export USER_DRAM_SIZE=8192
See this Confluence page:
https://tsavoritesi.atlassian.net/wiki/x/AwAhZQ
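As a quick sanity check for the setting above, here is a minimal sketch that exports the variable and confirms a child process inherits it. The helper name `check_dram_size` is hypothetical, and what `USER_DRAM_SIZE` controls is not stated in this thread; the sketch only verifies that the value is visible to child processes.

```shell
# Export the size quoted above so that child processes inherit it.
export USER_DRAM_SIZE=8192

# Hypothetical helper: report the value a child process would actually see.
check_dram_size() {
  sh -c 'echo "USER_DRAM_SIZE=${USER_DRAM_SIZE:-unset}"'
}

check_dram_size
```

An `export` only affects the current shell and its children. If the server runs as a systemd service, as the `journalctl -u ollama` output further down suggests, the variable would probably need to be set in the service environment instead.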
Last login: Mon Jun 22 19:53:12 2026 from 10.0.2.2
root@tsisim:~# ollama run qwen:0.5b "Where is Amazon River?"
Amazon River is located in the state of Kansas, United States.
root@tsisim:~# ollama run Gemma3:270M "Where is Amazon river?"
Amazon River is located in South America, specifically in the Amazon basin.
root@tsisim:~# ls -lrt /usr/local/
total 32
drwxr-xr-x 2 root root 4096 Dec 13 2025 games
drwxr-xr-x 2 root root 4096 Dec 13 2025 include
drwxr-xr-x 2 root root 4096 Dec 13 2025 src
drwxr-xr-x 2 root root 4096 Dec 13 2025 sbin
drwxr-xr-x 2 root root 4096 Dec 13 2025 etc
lrwxrwxrwx 1 root root 9 Dec 13 2025 man -> share/man
drwxr-xr-x 7 root root 4096 Dec 13 2025 share
lrwxrwxrwx 1 root root 45 Jun 22 19:04 ollama-arm64-release -> /tsi/tsi-sw/anoop_ollama/ollama-arm64-release
drwxr-xr-x 2 root root 4096 Jun 22 19:04 bin
drwxr-xr-x 4 root root 4096 Jun 25 18:37 lib
root@tsisim:~# journalctl -u ollama -f
Jun 25 18:39:03 tsisim ollama[585]: [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
Jun 25 18:39:03 tsisim ollama[585]: [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
Jun 25 18:39:03 tsisim ollama[585]: [GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
Jun 25 18:39:03 tsisim ollama[585]: [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
Jun 25 18:39:03 tsisim ollama[585]: [GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
Jun 25 18:39:03 tsisim ollama[585]: time=2026-06-25T18:39:03.224Z level=INFO source=routes.go:1569 msg="Listening on [::]:11434 (version 0.0.0)"
Jun 25 18:39:03 tsisim ollama[585]: time=2026-06-25T18:39:03.346Z level=INFO source=runner.go:80 msg="discovering available GPUs..."
Jun 25 18:39:08 tsisim ollama[585]: time=2026-06-25T18:39:08.138Z level=INFO source=runner.go:551 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH=[/tsi/tsi-sw/anoop_ollama/ollama-arm64-release/bin] extra_envs=[] error="llamarunner free vram reporting not supported"
Jun 25 18:39:08 tsisim ollama[585]: time=2026-06-25T18:39:08.224Z level=INFO source=types.go:129 msg="inference compute" id=cpu library=cpu compute="" name=cpu description=cpu libdirs=ollama driver="" pci_id="" type="" total="3.3 GiB" available="3.0 GiB"
Jun 25 18:39:08 tsisim ollama[585]: time=2026-06-25T18:39:08.225Z level=INFO source=routes.go:1610 msg="entering low vram mode" "total vram"="0 B" threshold="20.0 GiB"
Jun 25 18:50:01 tsisim ollama[585]: [GIN] 2026/06/25 - 18:50:01 | 200 | 6.466177ms | 127.0.0.1 | HEAD "/"
Jun 25 18:50:01 tsisim ollama[585]: [GIN] 2026/06/25 - 18:50:01 | 200 | 709.054833ms | 127.0.0.1 | POST "/api/show"
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /tsi/ollama-models/blobs/sha256-fad2a06e4cc705c2fa8bec5477ddb00dc0c859ac184c34dcc5586663774161ca (version GGUF V3 (latest))
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 0: general.architecture str = qwen2
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 1: general.name str = Qwen2-beta-0_5B-Chat
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 2: qwen2.block_count u32 = 24
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 4: qwen2.embedding_length u32 = 1024
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 2816
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 8: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 9: qwen2.use_parallel_residual bool = true
Jun 25 18:50:03 tsisim ollama[585]: llama_model_loader: - kv 10: tokenizer.ggml.model str = gpt2
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 11: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 12: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 13: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 151643
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 15: tokenizer.ggml.padding_token_id u32 = 151643
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151643
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 17: tokenizer.chat_template str = {% for message in messages %}{% if lo...
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 18: general.quantization_version u32 = 2
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - kv 19: general.file_type u32 = 2
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - type f32: 121 tensors
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - type q4_0: 169 tensors
Jun 25 18:50:04 tsisim ollama[585]: llama_model_loader: - type q6_K: 1 tensors
Jun 25 18:50:04 tsisim ollama[585]: print_info: file format = GGUF V3 (latest)
Jun 25 18:50:04 tsisim ollama[585]: print_info: file type = Q4_0
Jun 25 18:50:04 tsisim ollama[585]: print_info: file size = 371.02 MiB (5.02 BPW)
Jun 25 18:50:04 tsisim ollama[585]: load: missing or unrecognized pre-tokenizer type, using: 'default'
Jun 25 18:50:05 tsisim ollama[585]: load: printing all EOG tokens:
Jun 25 18:50:05 tsisim ollama[585]: load: - 151643 ('<|endoftext|>')
Jun 25 18:50:05 tsisim ollama[585]: load: - 151645 ('<|im_end|>')
Jun 25 18:50:05 tsisim ollama[585]: load: special tokens cache size = 293
Jun 25 18:50:05 tsisim ollama[585]: load: token to piece cache size = 0.9338 MB
Jun 25 18:50:05 tsisim ollama[585]: print_info: arch = qwen2
Jun 25 18:50:05 tsisim ollama[585]: print_info: vocab_only = 1
Jun 25 18:50:05 tsisim ollama[585]: print_info: model type = ?B
Jun 25 18:50:05 tsisim ollama[585]: print_info: model params = 619.57 M
Jun 25 18:50:05 tsisim ollama[585]: print_info: general.name = Qwen2-beta-0_5B-Chat
Jun 25 18:50:05 tsisim ollama[585]: print_info: vocab type = BPE
Jun 25 18:50:05 tsisim ollama[585]: print_info: n_vocab = 151936
Jun 25 18:50:05 tsisim ollama[585]: print_info: n_merges = 151387
Jun 25 18:50:05 tsisim ollama[585]: print_info: BOS token = 151643 '<|endoftext|>'
Jun 25 18:50:05 tsisim ollama[585]: print_info: EOS token = 151643 '<|endoftext|>'
Jun 25 18:50:05 tsisim ollama[585]: print_info: EOT token = 151645 '<|im_end|>'
Jun 25 18:50:05 tsisim ollama[585]: print_info: PAD token = 151643 '<|endoftext|>'
Jun 25 18:50:05 tsisim ollama[585]: print_info: LF token = 198 'Ċ'
Jun 25 18:50:05 tsisim ollama[585]: print_info: EOG token = 151643 '<|endoftext|>'
Jun 25 18:50:05 tsisim ollama[585]: print_info: EOG token = 151645 '<|im_end|>'
Jun 25 18:50:05 tsisim ollama[585]: print_info: max token length = 256
Jun 25 18:50:05 tsisim ollama[585]: llama_model_load: vocab only - skipping tensors
Jun 25 18:50:05 tsisim ollama[585]: time=2026-06-25T18:50:05.886Z level=INFO source=server.go:402 msg="starting runner" cmd="/tsi/tsi-sw/anoop_ollama/ollama-arm64-release/bin/ollama runner --model /tsi/ollama-models/blobs/sha256-fad2a06e4cc705c2fa8bec5477ddb00dc0c859ac184c34dcc5586663774161ca --port 41211"
Jun 25 18:50:05 tsisim ollama[585]: time=2026-06-25T18:50:05.902Z level=INFO source=server.go:507 msg="system memory" total="3.3 GiB" free="2.9 GiB" free_swap="8.0 GiB"
Jun 25 18:50:05 tsisim ollama[585]: time=2026-06-25T18:50:05.907Z level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/tsi/ollama-models/blobs/sha256-fad2a06e4cc705c2fa8bec5477ddb00dc0c859ac184c34dcc5586663774161ca library=cpu parallel=1 required="0 B" gpus=1
Jun 25 18:50:05 tsisim ollama[585]: time=2026-06-25T18:50:05.910Z level=INFO source=server.go:547 msg=offload library=cpu layers.requested=-1 layers.model=25 layers.offload=0 layers.split=[] memory.available="[3.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="993.2 MiB" memory.required.partial="0 B" memory.required.kv="384.0 MiB" memory.required.allocations="[993.2 MiB]" memory.weights.total="287.6 MiB" memory.weights.repeating="165.8 MiB" memory.weights.nonrepeating="121.7 MiB" memory.graph.full="298.8 MiB" memory.graph.partial="420.5 MiB"
Jun 25 18:50:06 tsisim ollama[585]: time=2026-06-25T18:50:06.093Z level=INFO source=runner.go:907 msg="starting llama runner"
Jun 25 18:50:08 tsisim ollama[585]: time=2026-06-25T18:50:08.883Z level=INFO source=ggml.go:104 msg=system CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
Jun 25 18:50:08 tsisim ollama[585]: time=2026-06-25T18:50:08.903Z level=INFO source=runner.go:967 msg="Server listening on 127.0.0.1:41211"
Jun 25 18:50:08 tsisim ollama[585]: time=2026-06-25T18:50:08.926Z level=INFO source=runner.go:830 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:4 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jun 25 18:50:08 tsisim ollama[585]: time=2026-06-25T18:50:08.942Z level=INFO source=server.go:1274 msg="waiting for llama runner to start responding"
Jun 25 18:50:08 tsisim ollama[585]: llama_model_load_from_file_impl: using device Tsavorite (txe) (unknown id) - 128 MiB free
Jun 25 18:50:08 tsisim ollama[585]: time=2026-06-25T18:50:08.946Z level=INFO source=server.go:1308 msg="waiting for server to become available" status="llm server loading model"
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /tsi/ollama-models/blobs/sha256-fad2a06e4cc705c2fa8bec5477ddb00dc0c859ac184c34dcc5586663774161ca (version GGUF V3 (latest))
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 0: general.architecture str = qwen2
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 1: general.name str = Qwen2-beta-0_5B-Chat
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 2: qwen2.block_count u32 = 24
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 4: qwen2.embedding_length u32 = 1024
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 2816
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 8: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 9: qwen2.use_parallel_residual bool = true
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 10: tokenizer.ggml.model str = gpt2
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 11: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 12: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 13: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 151643
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 15: tokenizer.ggml.padding_token_id u32 = 151643
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151643
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 17: tokenizer.chat_template str = {% for message in messages %}{% if lo...
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 18: general.quantization_version u32 = 2
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - kv 19: general.file_type u32 = 2
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - type f32: 121 tensors
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - type q4_0: 169 tensors
Jun 25 18:50:09 tsisim ollama[585]: llama_model_loader: - type q6_K: 1 tensors
Jun 25 18:50:09 tsisim ollama[585]: print_info: file format = GGUF V3 (latest)
Jun 25 18:50:09 tsisim ollama[585]: print_info: file type = Q4_0
Jun 25 18:50:09 tsisim ollama[585]: print_info: file size = 371.02 MiB (5.02 BPW)
Jun 25 18:50:10 tsisim ollama[585]: load: missing or unrecognized pre-tokenizer type, using: 'default'
Jun 25 18:50:11 tsisim ollama[585]: load: printing all EOG tokens:
Jun 25 18:50:11 tsisim ollama[585]: load: - 151643 ('<|endoftext|>')
Jun 25 18:50:11 tsisim ollama[585]: load: - 151645 ('<|im_end|>')
Jun 25 18:50:11 tsisim ollama[585]: load: special tokens cache size = 293
Jun 25 18:50:11 tsisim ollama[585]: load: token to piece cache size = 0.9338 MB
Jun 25 18:50:11 tsisim ollama[585]: print_info: arch = qwen2
Jun 25 18:50:11 tsisim ollama[585]: print_info: vocab_only = 0
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_ctx_train = 32768
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_embd = 1024
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_layer = 24
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_head = 16
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_head_kv = 16
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_rot = 64
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_swa = 0
Jun 25 18:50:11 tsisim ollama[585]: print_info: is_swa_any = 0
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_embd_head_k = 64
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_embd_head_v = 64
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_gqa = 1
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_embd_k_gqa = 1024
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_embd_v_gqa = 1024
Jun 25 18:50:11 tsisim ollama[585]: print_info: f_norm_eps = 0.0e+00
Jun 25 18:50:11 tsisim ollama[585]: print_info: f_norm_rms_eps = 1.0e-06
Jun 25 18:50:11 tsisim ollama[585]: print_info: f_clamp_kqv = 0.0e+00
Jun 25 18:50:11 tsisim ollama[585]: print_info: f_max_alibi_bias = 0.0e+00
Jun 25 18:50:11 tsisim ollama[585]: print_info: f_logit_scale = 0.0e+00
Jun 25 18:50:11 tsisim ollama[585]: print_info: f_attn_scale = 0.0e+00
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_ff = 2816
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_expert = 0
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_expert_used = 0
Jun 25 18:50:11 tsisim ollama[585]: print_info: causal attn = 1
Jun 25 18:50:11 tsisim ollama[585]: print_info: pooling type = -1
Jun 25 18:50:11 tsisim ollama[585]: print_info: rope type = 2
Jun 25 18:50:11 tsisim ollama[585]: print_info: rope scaling = linear
Jun 25 18:50:11 tsisim ollama[585]: print_info: freq_base_train = 10000.0
Jun 25 18:50:11 tsisim ollama[585]: print_info: freq_scale_train = 1
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_ctx_orig_yarn = 32768
Jun 25 18:50:11 tsisim ollama[585]: print_info: rope_finetuned = unknown
Jun 25 18:50:11 tsisim ollama[585]: print_info: model type = 0.5B
Jun 25 18:50:11 tsisim ollama[585]: print_info: model params = 619.57 M
Jun 25 18:50:11 tsisim ollama[585]: print_info: general.name = Qwen2-beta-0_5B-Chat
Jun 25 18:50:11 tsisim ollama[585]: print_info: vocab type = BPE
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_vocab = 151936
Jun 25 18:50:11 tsisim ollama[585]: print_info: n_merges = 151387
Jun 25 18:50:11 tsisim ollama[585]: print_info: BOS token = 151643 '<|endoftext|>'
Jun 25 18:50:11 tsisim ollama[585]: print_info: EOS token = 151643 '<|endoftext|>'
Jun 25 18:50:11 tsisim ollama[585]: print_info: EOT token = 151645 '<|im_end|>'
Jun 25 18:50:11 tsisim ollama[585]: print_info: PAD token = 151643 '<|endoftext|>'
Jun 25 18:50:11 tsisim ollama[585]: print_info: LF token = 198 'Ċ'
Jun 25 18:50:11 tsisim ollama[585]: print_info: EOG token = 151643 '<|endoftext|>'
Jun 25 18:50:11 tsisim ollama[585]: print_info: EOG token = 151645 '<|im_end|>'
Jun 25 18:50:11 tsisim ollama[585]: print_info: max token length = 256
Jun 25 18:50:11 tsisim ollama[585]: load_tensors: loading model tensors, this can take a while... (mmap = false)
Jun 25 18:50:11 tsisim ollama[585]: load_tensors: offloading 0 repeating layers to GPU
Jun 25 18:50:11 tsisim ollama[585]: load_tensors: offloaded 0/25 layers to GPU
Jun 25 18:50:11 tsisim ollama[585]: load_tensors: CPU model buffer size = 371.02 MiB
Jun 25 18:50:26 tsisim ollama[585]: llama_context: constructing llama_context
Jun 25 18:50:26 tsisim ollama[585]: llama_context: n_seq_max = 1
Jun 25 18:50:26 tsisim ollama[585]: llama_context: n_ctx = 4096
Jun 25 18:50:26 tsisim ollama[585]: llama_context: n_ctx_per_seq = 4096
Jun 25 18:50:26 tsisim ollama[585]: llama_context: n_batch = 512
Jun 25 18:50:26 tsisim ollama[585]: llama_context: n_ubatch = 512
Jun 25 18:50:26 tsisim ollama[585]: llama_context: causal_attn = 1
Jun 25 18:50:26 tsisim ollama[585]: llama_context: flash_attn = disabled
Jun 25 18:50:26 tsisim ollama[585]: llama_context: kv_unified = false
Jun 25 18:50:26 tsisim ollama[585]: llama_context: freq_base = 10000.0
Jun 25 18:50:26 tsisim ollama[585]: llama_context: freq_scale = 1
Jun 25 18:50:26 tsisim ollama[585]: llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
Jun 25 18:50:26 tsisim ollama[585]: llama_context: CPU output buffer size = 0.58 MiB
Jun 25 18:50:26 tsisim ollama[585]: llama_kv_cache: CPU KV buffer size = 384.00 MiB
Jun 25 18:50:30 tsisim ollama[585]: llama_kv_cache: size = 384.00 MiB ( 4096 cells, 24 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
Jun 25 18:50:31 tsisim ollama[585]: llama_context: tsavorite compute buffer size = 20.50 MiB
Jun 25 18:50:31 tsisim ollama[585]: llama_context: CPU compute buffer size = 298.75 MiB
Jun 25 18:50:31 tsisim ollama[585]: llama_context: graph nodes = 942
Jun 25 18:50:31 tsisim ollama[585]: llama_context: graph splits = 196
Jun 25 18:50:31 tsisim ollama[585]: time=2026-06-25T18:50:31.249Z level=INFO source=server.go:1312 msg="llama runner started in 25.36 seconds"
Jun 25 18:50:31 tsisim ollama[585]: time=2026-06-25T18:50:31.250Z level=INFO source=sched.go:485 msg="loaded runners" count=1
Jun 25 18:50:31 tsisim ollama[585]: time=2026-06-25T18:50:31.250Z level=INFO source=server.go:1274 msg="waiting for llama runner to start responding"
Jun 25 18:50:31 tsisim ollama[585]: time=2026-06-25T18:50:31.254Z level=INFO source=server.go:1312 msg="llama runner started in 25.37 seconds"
Jun 25 18:52:06 tsisim ollama[1197]: TSI deploy yaml=/tsi/tsi-sw/anoop_ollama/ollama-arm64-release/bin/tsavorite-model-deployment.yaml txe_count=1 multi_thread_enable=0
Jun 25 18:52:06 tsisim ollama[1197]: finalize 4
Jun 25 18:52:06 tsisim ollama[1197]: OPU Profiling Results:
Jun 25 18:52:06 tsisim ollama[1197]: Profiler disabled
Jun 25 18:52:06 tsisim ollama[585]: [GIN] 2026/06/25 - 18:52:06 | 200 | 2m4s | 127.0.0.1 | POST "/api/generate"
Jun 25 18:57:06 tsisim ollama[585]: time=2026-06-25T18:57:06.667Z level=WARN source=server.go:1757 msg="llama server stopped" pid=1197
Jun 25 18:59:56 tsisim ollama[585]: [GIN] 2026/06/25 - 18:59:56 | 200 | 97.951µs | 127.0.0.1 | HEAD "/"
Jun 25 18:59:57 tsisim ollama[585]: [GIN] 2026/06/25 - 18:59:57 | 200 | 941.549726ms | 127.0.0.1 | POST "/api/show"
Jun 25 18:59:59 tsisim ollama[585]: time=2026-06-25T18:59:59.236Z level=INFO source=server.go:218 msg="enabling flash attention"
Jun 25 18:59:59 tsisim ollama[585]: time=2026-06-25T18:59:59.241Z level=INFO source=server.go:402 msg="starting runner" cmd="/tsi/tsi-sw/anoop_ollama/ollama-arm64-release/bin/ollama runner --ollama-engine --model /tsi/ollama-models/blobs/sha256-735af2139dc652bf01112746474883d79a52fa1c19038265d363e3d42556f7a2 --port 40943"
Jun 25 18:59:59 tsisim ollama[585]: time=2026-06-25T18:59:59.246Z level=INFO source=server.go:678 msg="loading model" "model layers"=19 requested=-1
Jun 25 18:59:59 tsisim ollama[585]: time=2026-06-25T18:59:59.251Z level=INFO source=server.go:684 msg="system memory" total="3.3 GiB" free="2.9 GiB" free_swap="8.0 GiB"
Jun 25 18:59:59 tsisim ollama[585]: time=2026-06-25T18:59:59.446Z level=INFO source=runner.go:907 msg="starting llama runner"
Jun 25 19:00:01 tsisim ollama[585]: time=2026-06-25T19:00:01.968Z level=INFO source=ggml.go:104 msg=system CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
Jun 25 19:00:01 tsisim ollama[585]: time=2026-06-25T19:00:01.977Z level=INFO source=runner.go:967 msg="Server listening on 127.0.0.1:40943"
Jun 25 19:00:01 tsisim ollama[585]: time=2026-06-25T19:00:01.994Z level=INFO source=runner.go:830 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:4 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jun 25 19:00:02 tsisim ollama[585]: time=2026-06-25T19:00:01.997Z level=INFO source=runner.go:830 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:4 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jun 25 19:00:02 tsisim ollama[585]: time=2026-06-25T19:00:02.007Z level=INFO source=runner.go:830 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:4 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Jun 25 19:00:02 tsisim ollama[585]: llama_model_load_from_file_impl: using device Tsavorite (txe) (unknown id) - 128 MiB free
Jun 25 19:00:02 tsisim ollama[585]: time=2026-06-25T19:00:02.017Z level=INFO source=sched.go:485 msg="loaded runners" count=1
Jun 25 19:00:02 tsisim ollama[585]: time=2026-06-25T19:00:02.018Z level=INFO source=server.go:1274 msg="waiting for llama runner to start responding"
Jun 25 19:00:02 tsisim ollama[585]: time=2026-06-25T19:00:02.021Z level=INFO source=server.go:1308 msg="waiting for server to become available" status="llm server loading model"
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: loaded meta data with 36 key-value pairs and 236 tensors from /tsi/ollama-models/blobs/sha256-735af2139dc652bf01112746474883d79a52fa1c19038265d363e3d42556f7a2 (version GGUF V3 (latest))
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 0: general.architecture str = gemma3
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 1: general.type str = model
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 2: general.size_label str = 268M
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 3: general.license str = gemma
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 4: general.base_model.count u32 = 1
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 5: general.base_model.0.name str = Gemma 3 270m
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 6: general.base_model.0.organization str = Google
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 7: general.base_model.0.repo_url str = https://huggingface.co/google/gemma-3...
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 8: general.tags arr[str,4] = ["gemma3", "gemma", "google", "text-g...
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 9: gemma3.context_length u32 = 32768
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 10: gemma3.embedding_length u32 = 640
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 11: gemma3.block_count u32 = 18
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 12: gemma3.feed_forward_length u32 = 2048
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 13: gemma3.attention.head_count u32 = 4
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 14: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 15: gemma3.attention.key_length u32 = 256
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 16: gemma3.attention.value_length u32 = 256
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 17: gemma3.rope.freq_base f32 = 1000000.000000
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 18: gemma3.attention.sliding_window u32 = 512
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 19: gemma3.attention.head_count_kv u32 = 1
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 20: tokenizer.ggml.model str = llama
Jun 25 19:00:02 tsisim ollama[585]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = default
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,262144] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 23: tokenizer.ggml.scores arr[f32,262144] = [-1000.000000, -1000.000000, -1000.00...
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,262144] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 2
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 1
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 27: tokenizer.ggml.unknown_token_id u32 = 3
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 0
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = true
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 30: tokenizer.ggml.add_sep_token bool = false
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 34: general.quantization_version u32 = 2
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - kv 35: general.file_type u32 = 7
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - type f32: 109 tensors
Jun 25 19:00:03 tsisim ollama[585]: llama_model_loader: - type q8_0: 127 tensors
Jun 25 19:00:03 tsisim ollama[585]: print_info: file format = GGUF V3 (latest)
Jun 25 19:00:03 tsisim ollama[585]: print_info: file type = Q8_0
Jun 25 19:00:03 tsisim ollama[585]: print_info: file size = 271.81 MiB (8.50 BPW)
Jun 25 19:00:04 tsisim ollama[585]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Jun 25 19:00:04 tsisim ollama[585]: load: printing all EOG tokens:
Jun 25 19:00:04 tsisim ollama[585]: load: - 1 ('<eos>')
Jun 25 19:00:04 tsisim ollama[585]: load: - 106 ('<end_of_turn>')
Jun 25 19:00:04 tsisim ollama[585]: load: special tokens cache size = 6414
Jun 25 19:00:05 tsisim ollama[585]: load: token to piece cache size = 1.9446 MB
Jun 25 19:00:05 tsisim ollama[585]: print_info: arch = gemma3
Jun 25 19:00:05 tsisim ollama[585]: print_info: vocab_only = 0
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_ctx_train = 32768
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_embd = 640
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_layer = 18
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_head = 4
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_head_kv = 1
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_rot = 256
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_swa = 512
Jun 25 19:00:05 tsisim ollama[585]: print_info: is_swa_any = 1
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_embd_head_k = 256
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_embd_head_v = 256
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_gqa = 4
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_embd_k_gqa = 256
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_embd_v_gqa = 256
Jun 25 19:00:05 tsisim ollama[585]: print_info: f_norm_eps = 0.0e+00
Jun 25 19:00:05 tsisim ollama[585]: print_info: f_norm_rms_eps = 1.0e-06
Jun 25 19:00:05 tsisim ollama[585]: print_info: f_clamp_kqv = 0.0e+00
Jun 25 19:00:05 tsisim ollama[585]: print_info: f_max_alibi_bias = 0.0e+00
Jun 25 19:00:05 tsisim ollama[585]: print_info: f_logit_scale = 0.0e+00
Jun 25 19:00:05 tsisim ollama[585]: print_info: f_attn_scale = 6.2e-02
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_ff = 2048
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_expert = 0
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_expert_used = 0
Jun 25 19:00:05 tsisim ollama[585]: print_info: causal attn = 1
Jun 25 19:00:05 tsisim ollama[585]: print_info: pooling type = 0
Jun 25 19:00:05 tsisim ollama[585]: print_info: rope type = 2
Jun 25 19:00:05 tsisim ollama[585]: print_info: rope scaling = linear
Jun 25 19:00:05 tsisim ollama[585]: print_info: freq_base_train = 1000000.0
Jun 25 19:00:05 tsisim ollama[585]: print_info: freq_scale_train = 1
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_ctx_orig_yarn = 32768
Jun 25 19:00:05 tsisim ollama[585]: print_info: rope_finetuned = unknown
Jun 25 19:00:05 tsisim ollama[585]: print_info: model type = 270M
Jun 25 19:00:05 tsisim ollama[585]: print_info: model params = 268.10 M
Jun 25 19:00:05 tsisim ollama[585]: print_info: general.name = n/a
Jun 25 19:00:05 tsisim ollama[585]: print_info: vocab type = SPM
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_vocab = 262144
Jun 25 19:00:05 tsisim ollama[585]: print_info: n_merges = 0
Jun 25 19:00:05 tsisim ollama[585]: print_info: BOS token = 2 '<bos>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: EOS token = 1 '<eos>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: EOT token = 106 '<end_of_turn>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: UNK token = 3 '<unk>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: PAD token = 0 '<pad>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: LF token = 248 '<0x0A>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: EOG token = 1 '<eos>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: EOG token = 106 '<end_of_turn>'
Jun 25 19:00:05 tsisim ollama[585]: print_info: max token length = 48
Jun 25 19:00:05 tsisim ollama[585]: load_tensors: loading model tensors, this can take a while... (mmap = false)
Jun 25 19:00:05 tsisim ollama[585]: load_tensors: offloading 0 repeating layers to GPU
Jun 25 19:00:05 tsisim ollama[585]: load_tensors: offloaded 0/19 layers to GPU
Jun 25 19:00:05 tsisim ollama[585]: load_tensors: CPU model buffer size = 271.81 MiB
Jun 25 19:00:12 tsisim ollama[585]: llama_init_from_model: model default pooling_type is [0], but [-1] was specified
Jun 25 19:00:12 tsisim ollama[585]: llama_context: constructing llama_context
Jun 25 19:00:12 tsisim ollama[585]: llama_context: n_seq_max = 1
Jun 25 19:00:12 tsisim ollama[585]: llama_context: n_ctx = 4096
Jun 25 19:00:12 tsisim ollama[585]: llama_context: n_ctx_per_seq = 4096
Jun 25 19:00:12 tsisim ollama[585]: llama_context: n_batch = 512
Jun 25 19:00:12 tsisim ollama[585]: llama_context: n_ubatch = 512
Jun 25 19:00:12 tsisim ollama[585]: llama_context: causal_attn = 1
Jun 25 19:00:12 tsisim ollama[585]: llama_context: flash_attn = enabled
Jun 25 19:00:12 tsisim ollama[585]: llama_context: kv_unified = false
Jun 25 19:00:12 tsisim ollama[585]: llama_context: freq_base = 1000000.0
Jun 25 19:00:12 tsisim ollama[585]: llama_context: freq_scale = 1
Jun 25 19:00:12 tsisim ollama[585]: llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
Jun 25 19:00:12 tsisim ollama[585]: llama_context: CPU output buffer size = 1.00 MiB
Jun 25 19:00:12 tsisim ollama[585]: llama_kv_cache_iswa: using full-size SWA cache (ref: https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
Jun 25 19:00:12 tsisim ollama[585]: llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
Jun 25 19:00:12 tsisim ollama[585]: llama_kv_cache: CPU KV buffer size = 12.00 MiB
Jun 25 19:00:12 tsisim ollama[585]: llama_kv_cache: size = 12.00 MiB ( 4096 cells, 3 layers, 1/1 seqs), K (f16): 6.00 MiB, V (f16): 6.00 MiB
Jun 25 19:00:12 tsisim ollama[585]: llama_kv_cache_iswa: creating SWA KV cache, size = 4096 cells
Jun 25 19:00:12 tsisim ollama[585]: llama_kv_cache: CPU KV buffer size = 60.00 MiB
Jun 25 19:00:13 tsisim ollama[585]: llama_kv_cache: size = 60.00 MiB ( 4096 cells, 15 layers, 1/1 seqs), K (f16): 30.00 MiB, V (f16): 30.00 MiB
Jun 25 19:00:13 tsisim ollama[585]: llama_context: tsavorite compute buffer size = 3.25 MiB
Jun 25 19:00:13 tsisim ollama[585]: llama_context: CPU compute buffer size = 513.25 MiB
Jun 25 19:00:13 tsisim ollama[585]: llama_context: graph nodes = 729
Jun 25 19:00:13 tsisim ollama[585]: llama_context: graph splits = 258
Jun 25 19:00:13 tsisim ollama[585]: time=2026-06-25T19:00:13.791Z level=INFO source=server.go:1312 msg="llama runner started in 14.55 seconds"
Jun 25 19:01:14 tsisim ollama[6314]: TSI deploy yaml=/tsi/tsi-sw/anoop_ollama/ollama-arm64-release/bin/tsavorite-model-deployment.yaml txe_count=1 multi_thread_enable=0
Jun 25 19:01:14 tsisim ollama[6314]: finalize 4
Jun 25 19:01:14 tsisim ollama[6314]: OPU Profiling Results:
Jun 25 19:01:14 tsisim ollama[6314]: Profiler disabled
Jun 25 19:01:14 tsisim ollama[585]: [GIN] 2026/06/25 - 19:01:14 | 200 | 1m17s | 127.0.0.1 | POST "/api/generate"
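To compare the two runs above without scrolling through the full journal, a small filter can pull out the lines that show how much work each backend was given. This is a sketch, not part of the PR: the function name `summarize_backends` is hypothetical, and it relies only on the `print_info: arch`, `compute buffer size`, and `graph splits` lines as they appear in this log. A non-zero `tsavorite` compute buffer suggests the backend was assigned part of the graph, but it does not by itself say whether the TMU or the OPU executed it.

```shell
# summarize_backends: read journal text on stdin and print, for each model load,
# the architecture together with each backend's compute buffer size and the
# number of graph splits.
summarize_backends() {
  awk '
    # Remember the most recent architecture name (last field on the line).
    /print_info: arch +=/ { arch = $NF }

    # Lines look like: "... llama_context: <backend> compute buffer size = <n> MiB"
    /llama_context: .* compute buffer size/ {
      for (i = 1; i <= NF; i++) if ($i == "llama_context:") backend = $(i + 1)
      printf "%s %s_compute_buffer=%s MiB\n", arch, backend, $(NF - 1)
    }

    # Lines look like: "... llama_context: graph splits = <n>"
    /llama_context: graph splits/ { printf "%s graph_splits=%s\n", arch, $NF }
  '
}

# Example usage (assumes the service unit is named "ollama", as in the log above):
#   journalctl -u ollama --no-pager | summarize_backends
```

On the log above this would report a 20.50 MiB `tsavorite` buffer with 196 graph splits for qwen2, and a 3.25 MiB `tsavorite` buffer with 258 graph splits for gemma3.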