The model loads and runs normally after server startup. However, if the model remains idle long enough to be unloaded from VRAM (or otherwise released by the server's idle memory management), subsequent requests cause a CUDA failure during the first inference after the model is reloaded.
The error occurs consistently after the idle → unload (sleeping mode) → reload cycle and does not occur immediately after startup.
ggml_cuda_compute_forward: MUL_MAT failed
CUDA error: invalid argument
The model can be loaded successfully, all initialization steps complete normally, and the crash only appears when processing the first request after the model has been unloaded and reloaded.
In versions earlier than b9557 ( the version I used is b9555 ), the issue does not result in an immediate error. Instead, after exiting sleep mode and reloading the model, inference performance drops significantly, suggesting that the model may not be fully loaded onto the GPU or that some computations are no longer being executed on the GPU as expected. The corresponding logs are also included in the Log Output section below.
Logs ( old version )
```
[56280] 4.11.090.386 I srv handle_sleep: server is exiting sleeping state
[56280] 4.11.090.387 I srv load_model: loading model 'D:/LLMs/Qwen3.6-27B-MTP-IQ4_NL.gguf'
[56280] 4.12.240.981 I srv load_model: [mtmd] estimated worst-case memory usage of mmproj is 1161.02 MiB
[56280] 4.12.591.172 I srv load_model: [spec] estimated memory usage of MTP context is 420.52 MiB
[56280] 4.12.591.189 I common_init_result: fitting params to device memory ...
[56280] 4.12.591.190 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
[56280] 4.19.470.426 W llama_context: n_ctx_seq (72192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[56280] 4.19.627.377 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[56280] 4.19.839.289 I srv load_model: creating MTP draft context against the target model 'D:/LLMs/Qwen3.6-27B-MTP-IQ4_NL.gguf'
[56280] 4.19.839.333 W llama_context: n_ctx_seq (72192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[56280] 4.19.867.800 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
[56280] 4.19.867.803 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
[56280] 4.19.867.803 W load_hparams: more info: #16842
[56280]
[56280] 4.20.655.546 I srv load_model: loaded multimodal model, 'D:/LLMs/Qwen3.6-27B-MTP-IQ4_NL-mmproj.gguf'
[56280] 4.20.655.552 I srv load_model: initializing slots, n_slots = 1
[56280] 4.20.733.251 I common_context_can_seq_rm: the context supports bounded partial sequence removal
[56280] 4.20.765.380 I common_speculative_impl_ngram_mod: adding speculative implementation 'ngram-mod'
[56280] 4.20.765.383 I common_speculative_impl_ngram_mod: - n_match=24, n_max=48, n_min=16
[56280] 4.20.765.386 I common_speculative_impl_ngram_mod: - mod size=4194304 (16.000 MB)
[56280] 4.20.765.389 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
[56280] 4.20.765.390 I common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
[56280] 4.20.765.391 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes, devices=[default]
[56280] 4.20.794.929 I srv load_model: speculative decoding context initialized
[56280] 4.20.794.933 I slot load_model: id 0 | task -1 | new slot, n_ctx = 72192
[56280] 4.20.794.998 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
[56280] 4.20.795.001 I srv load_model: use --cache-ram 0 to disable the prompt cache
[56280] 4.20.802.534 I srv load_model: for more info see #16391
[56280] 4.20.802.537 I srv load_model: context checkpoints enabled, max = 32, min spacing = 256
[56280] 4.20.802.547 I srv update_slots: all slots are idle
[56280] 4.20.838.673 I srv params_from_: Chat format: peg-native
[56280] 4.20.838.856 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
[56280] 4.20.838.859 I srv get_availabl: updating prompt cache
[56280] 4.20.838.863 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
[56280] 4.20.838.865 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 72192 tokens, 8589934592 est)
[56280] 4.20.838.866 I srv get_availabl: prompt cache update took 0.01 ms
[56280] 4.20.838.913 I slot launch_slot_: id 0 | task 1973 | processing task, is_child = 0
[56280] 4.25.371.386 I slot print_timing: id 0 | task 1973 | prompt processing, n_tokens = 4096, progress = 0.49, t = 4.53 s / 903.70 tokens per second
[56280] 4.27.696.266 I slot print_timing: id 0 | task 1973 | prompt processing, n_tokens = 6144, progress = 0.74, t = 6.86 s / 895.97 tokens per second
[56280] 4.29.735.945 I slot print_timing: id 0 | task 1973 | prompt processing, n_tokens = 7791, progress = 0.94, t = 8.90 s / 875.69 tokens per second
[56280] 4.29.763.765 I slot create_check: id 0 | task 1973 | created context checkpoint 1 of 32 (pos_min = 7790, pos_max = 7790, n_tokens = 7791, size = 180.208 MiB)
[56280] 4.30.355.618 I slot print_timing: id 0 | task 1973 | prompt processing, n_tokens = 8303, progress = 1.00, t = 9.52 s / 872.47 tokens per second
[56280] 4.30.383.988 I slot create_check: id 0 | task 1973 | created context checkpoint 2 of 32 (pos_min = 8302, pos_max = 8302, n_tokens = 8303, size = 182.218 MiB)
[56280] 4.30.485.285 I begin: ngram_mod occupancy = 5955/4194304 (0.00)
[56280] 4.34.885.250 I slot print_timing: id 0 | task 1973 | n_decoded = 100, tg = 22.73 t/s
[56280] 4.38.010.444 I slot print_timing: id 0 | task 1973 | n_decoded = 189, tg = 25.12 t/s
...
[56280] 6.03.464.444 I slot print_timing: id 0 | task 1973 | n_decoded = 3309, tg = 35.59 t/s
[56280] 6.05.774.810 I slot print_timing: id 0 | task 1973 | prompt eval time = 9646.85 ms / 8307 tokens ( 1.16 ms per token, 861.11 tokens per second)
[56280] 6.05.774.816 I slot print_timing: id 0 | task 1973 | eval time = 95288.92 ms / 3382 tokens ( 28.18 ms per token, 35.49 tokens per second)
[56280] 6.05.774.817 I slot print_timing: id 0 | task 1973 | total time = 104935.77 ms / 11689 tokens
[56280] 6.05.774.817 I slot print_timing: id 0 | task 1973 | graphs reused = 740
[56280] 6.05.774.818 I slot print_timing: id 0 | task 1973 | draft acceptance = 0.69849 ( 2590 accepted / 3708 generated)
[56280] 6.05.774.836 I statistics ngram-mod: #calls(b,g,a) = 1 791 31, #gen drafts = 31, #acc drafts = 31, #gen tokens = 1428, #acc tokens = 873, dur(b,g,a) = 0.370, 1.646, 0.015 ms
[56280] 6.05.774.841 I statistics draft-mtp: #calls(b,g,a) = 1 760 760, #gen drafts = 760, #acc drafts = 669, #gen tokens = 2280, #acc tokens = 1717, dur(b,g,a) = 0.001, 6319.715, 0.773 ms
[56280] 6.05.775.201 I slot release: id 0 | task 1973 | stop processing: n_tokens = 11688, truncated = 0
[56280] 6.05.775.225 I srv update_slots: all slots are idle �[34m6.18.294.761�[0m �[32mI �[0msrv proxy_reques: proxying request to model Qwen27B/Qwen3.6-27B-MTP on port 56280
[56280] 6.05.817.120 I srv params_from_: Chat format: peg-native
[56280] 6.05.819.299 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 111031212
[56280] 6.05.819.304 I srv get_availabl: updating prompt cache
[56280] 6.05.820.428 W srv prompt_save: - saving prompt with length 11688, total state size = 583.807 MiB (draft: 45.879 MiB)
[56280] 6.05.938.037 I srv load: - looking for better prompt, base f_keep = 0.000, sim = 0.001
[56280] 6.05.938.044 I srv update: - cache state: 1 prompts, 946.233 MiB (limits: 8192.000 MiB, 72192 tokens, 101188 est)
[56280] 6.05.938.045 I srv update: - prompt 00000163110B7CD0: 11688 tokens, checkpoints: 2, 946.233 MiB
[56280] 6.05.938.048 I srv get_availabl: prompt cache update took 118.74 ms
[56280] 6.05.938.177 I slot launch_slot_: id 0 | task 2790 | processing task, is_child = 0
[56280] 6.05.938.192 I slot update_slots: id 0 | task 2790 | Checking checkpoint with [8302, 8302] against 3...
[56280] 6.05.938.193 I slot update_slots: id 0 | task 2790 | Checking checkpoint with [7790, 7790] against 3...
[56280] 6.05.938.194 W slot update_slots: id 0 | task 2790 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com//pull/13194#issuecomment-2868343055)
[56280] 6.05.938.197 W slot update_slots: id 0 | task 2790 | erased invalidated context checkpoint (pos_min = 7790, pos_max = 7790, n_tokens = 7791, n_swa = 0, pos_next = 0, size = 180.208 MiB)
[56280] 6.05.942.834 W slot update_slots: id 0 | task 2790 | erased invalidated context checkpoint (pos_min = 8302, pos_max = 8302, n_tokens = 8303, n_swa = 0, pos_next = 0, size = 182.218 MiB)
[56280] 6.07.443.957 I slot create_check: id 0 | task 2790 | created context checkpoint 1 of 32 (pos_min = 1223, pos_max = 1223, n_tokens = 1224, size = 154.431 MiB)
[56280] 6.09.726.661 I slot print_timing: id 0 | task 2790 | prompt processing, n_tokens = 3272, progress = 0.79, t = 3.79 s / 863.68 tokens per second
[56280] 6.10.200.108 I slot print_timing: id 0 | task 2790 | prompt processing, n_tokens = 3629, progress = 0.88, t = 4.26 s / 851.50 tokens per second
[56280] 6.10.225.165 I slot create_check: id 0 | task 2790 | created context checkpoint 2 of 32 (pos_min = 3628, pos_max = 3628, n_tokens = 3629, size = 163.871 MiB)
[56280] 6.10.806.087 I slot print_timing: id 0 | task 2790 | prompt processing, n_tokens = 4141, progress = 1.00, t = 4.87 s / 850.68 tokens per second
[56280] 6.10.833.328 I slot create_check: id 0 | task 2790 | created context checkpoint 3 of 32 (pos_min = 4140, pos_max = 4140, n_tokens = 4141, size = 165.881 MiB)
[56280] 6.10.926.252 I begin: ngram_mod occupancy = 8862/4194304 (0.00)
[56280] 6.14.281.611 I slot print_timing: id 0 | task 2790 | n_decoded = 102, tg = 30.41 t/s
[56280] 6.17.340.648 I slot print_timing: id 0 | task 2790 | n_decoded = 180, tg = 28.07 t/s
...
[56280] 8.29.197.005 I slot print_timing: id 0 | task 2790 | n_decoded = 3707, tg = 26.81 t/s
[56280] 8.30.117.777 I slot print_timing: id 0 | task 2790 | prompt eval time = 4988.76 ms / 4145 tokens ( 1.20 ms per token, 830.87 tokens per second)
[56280] 8.30.117.781 I slot print_timing: id 0 | task 2790 | eval time = 139190.75 ms / 3737 tokens ( 37.25 ms per token, 26.85 tokens per second)
[56280] 8.30.117.782 I slot print_timing: id 0 | task 2790 | total time = 144179.51 ms / 7882 tokens
[56280] 8.30.117.782 I slot print_timing: id 0 | task 2790 | graphs reused = 1958
[56280] 8.30.117.783 I slot print_timing: id 0 | task 2790 | draft acceptance = 0.56290 ( 2479 accepted / 4404 generated)
[56280] 8.30.117.796 I statistics ngram-mod: #calls(b,g,a) = 2 2049 45, #gen drafts = 45, #acc drafts = 45, #gen tokens = 2100, #acc tokens = 1006, dur(b,g,a) = 0.631, 4.065, 1.182 ms
[56280] 8.30.117.800 I statistics draft-mtp: #calls(b,g,a) = 2 2004 2004, #gen drafts = 2004, #acc drafts = 1646, #gen tokens = 6012, #acc tokens = 4063, dur(b,g,a) = 0.002, 16293.805, 2.227 ms
[56280] 8.30.117.954 I slot release: id 0 | task 2790 | stop processing: n_tokens = 7882, truncated = 0
[56280] 8.30.117.980 I srv update_slots: all slots are idle
[56280] 8.50.327.489 I que start_loop: entering sleeping state
[56280] cmd_child_to_router:sleep
[56280] 8.50.327.750 I srv handle_sleep: server is entering sleeping state
Name and Version
version: 9668 (32120c1)
built with MSVC 19.44.35225.0 for x64
Operating systems
Windows
GGML backends
CUDA
Hardware
Ryzen 7950X & NVIDIA RTX 3090
Models
huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_NL.gguf
Problem description & steps to reproduce
Problem description
The model loads and runs normally after server startup. However, if the model remains idle long enough to be unloaded from VRAM (or otherwise released by the server's idle memory management), subsequent requests cause a CUDA failure during the first inference after the model is reloaded.
The error occurs consistently after the idle → unload (sleeping mode) → reload cycle and does not occur immediately after startup.
The failure happens during the forward pass with the following error:
The model can be loaded successfully, all initialization steps complete normally, and the crash only appears when processing the first request after the model has been unloaded and reloaded.
Steps to reproduce
The full settings will be listed here:
Settings
Then here is the steps:
sleep-idle-secondsfor models, where the model I used is Qwen3.6-27B-MTP-IQ4_NL.Additional problem in older version
In versions earlier than b9557 ( the version I used is b9555 ), the issue does not result in an immediate error. Instead, after exiting sleep mode and reloading the model, inference performance drops significantly, suggesting that the model may not be fully loaded onto the GPU or that some computations are no longer being executed on the GPU as expected. The corresponding logs are also included in the Log Output section below.
First Bad Commit
b9557
Relevant log output
Logs
Logs ( old version )
```[56280] 4.11.090.386 I srv handle_sleep: server is exiting sleeping state
[56280] 4.11.090.387 I srv load_model: loading model 'D:/LLMs/Qwen3.6-27B-MTP-IQ4_NL.gguf'
[56280] 4.12.240.981 I srv load_model: [mtmd] estimated worst-case memory usage of mmproj is 1161.02 MiB
[56280] 4.12.591.172 I srv load_model: [spec] estimated memory usage of MTP context is 420.52 MiB
[56280] 4.12.591.189 I common_init_result: fitting params to device memory ...
[56280] 4.12.591.190 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
[56280] 4.19.470.426 W llama_context: n_ctx_seq (72192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[56280] 4.19.627.377 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[56280] 4.19.839.289 I srv load_model: creating MTP draft context against the target model 'D:/LLMs/Qwen3.6-27B-MTP-IQ4_NL.gguf'
[56280] 4.19.839.333 W llama_context: n_ctx_seq (72192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[56280] 4.19.867.800 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
[56280] 4.19.867.803 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
[56280] 4.19.867.803 W load_hparams: more info: #16842
[56280]
[56280] 4.20.655.546 I srv load_model: loaded multimodal model, 'D:/LLMs/Qwen3.6-27B-MTP-IQ4_NL-mmproj.gguf'
[56280] 4.20.655.552 I srv load_model: initializing slots, n_slots = 1
[56280] 4.20.733.251 I common_context_can_seq_rm: the context supports bounded partial sequence removal
[56280] 4.20.765.380 I common_speculative_impl_ngram_mod: adding speculative implementation 'ngram-mod'
[56280] 4.20.765.383 I common_speculative_impl_ngram_mod: - n_match=24, n_max=48, n_min=16
[56280] 4.20.765.386 I common_speculative_impl_ngram_mod: - mod size=4194304 (16.000 MB)
[56280] 4.20.765.389 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
[56280] 4.20.765.390 I common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
[56280] 4.20.765.391 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes, devices=[default]
[56280] 4.20.794.929 I srv load_model: speculative decoding context initialized
[56280] 4.20.794.933 I slot load_model: id 0 | task -1 | new slot, n_ctx = 72192
[56280] 4.20.794.998 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
[56280] 4.20.795.001 I srv load_model: use --cache-ram 0 to disable the prompt cache
[56280] 4.20.802.534 I srv load_model: for more info see #16391
[56280] 4.20.802.537 I srv load_model: context checkpoints enabled, max = 32, min spacing = 256
[56280] 4.20.802.547 I srv update_slots: all slots are idle
[56280] 4.20.838.673 I srv params_from_: Chat format: peg-native
[56280] 4.20.838.856 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
[56280] 4.20.838.859 I srv get_availabl: updating prompt cache
[56280] 4.20.838.863 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
[56280] 4.20.838.865 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 72192 tokens, 8589934592 est)
[56280] 4.20.838.866 I srv get_availabl: prompt cache update took 0.01 ms
[56280] 4.20.838.913 I slot launch_slot_: id 0 | task 1973 | processing task, is_child = 0
[56280] 4.25.371.386 I slot print_timing: id 0 | task 1973 | prompt processing, n_tokens = 4096, progress = 0.49, t = 4.53 s / 903.70 tokens per second
[56280] 4.27.696.266 I slot print_timing: id 0 | task 1973 | prompt processing, n_tokens = 6144, progress = 0.74, t = 6.86 s / 895.97 tokens per second
[56280] 4.29.735.945 I slot print_timing: id 0 | task 1973 | prompt processing, n_tokens = 7791, progress = 0.94, t = 8.90 s / 875.69 tokens per second
[56280] 4.29.763.765 I slot create_check: id 0 | task 1973 | created context checkpoint 1 of 32 (pos_min = 7790, pos_max = 7790, n_tokens = 7791, size = 180.208 MiB)
[56280] 4.30.355.618 I slot print_timing: id 0 | task 1973 | prompt processing, n_tokens = 8303, progress = 1.00, t = 9.52 s / 872.47 tokens per second
[56280] 4.30.383.988 I slot create_check: id 0 | task 1973 | created context checkpoint 2 of 32 (pos_min = 8302, pos_max = 8302, n_tokens = 8303, size = 182.218 MiB)
[56280] 4.30.485.285 I begin: ngram_mod occupancy = 5955/4194304 (0.00)
[56280] 4.34.885.250 I slot print_timing: id 0 | task 1973 | n_decoded = 100, tg = 22.73 t/s
[56280] 4.38.010.444 I slot print_timing: id 0 | task 1973 | n_decoded = 189, tg = 25.12 t/s
...
[56280] 6.03.464.444 I slot print_timing: id 0 | task 1973 | n_decoded = 3309, tg = 35.59 t/s
[56280] 6.05.774.810 I slot print_timing: id 0 | task 1973 | prompt eval time = 9646.85 ms / 8307 tokens ( 1.16 ms per token, 861.11 tokens per second)
[56280] 6.05.774.816 I slot print_timing: id 0 | task 1973 | eval time = 95288.92 ms / 3382 tokens ( 28.18 ms per token, 35.49 tokens per second)
[56280] 6.05.774.817 I slot print_timing: id 0 | task 1973 | total time = 104935.77 ms / 11689 tokens
[56280] 6.05.774.817 I slot print_timing: id 0 | task 1973 | graphs reused = 740
[56280] 6.05.774.818 I slot print_timing: id 0 | task 1973 | draft acceptance = 0.69849 ( 2590 accepted / 3708 generated)
[56280] 6.05.774.836 I statistics ngram-mod: #calls(b,g,a) = 1 791 31, #gen drafts = 31, #acc drafts = 31, #gen tokens = 1428, #acc tokens = 873, dur(b,g,a) = 0.370, 1.646, 0.015 ms
[56280] 6.05.774.841 I statistics draft-mtp: #calls(b,g,a) = 1 760 760, #gen drafts = 760, #acc drafts = 669, #gen tokens = 2280, #acc tokens = 1717, dur(b,g,a) = 0.001, 6319.715, 0.773 ms
[56280] 6.05.775.201 I slot release: id 0 | task 1973 | stop processing: n_tokens = 11688, truncated = 0
[56280] 6.05.775.225 I srv update_slots: all slots are idle �[34m6.18.294.761�[0m �[32mI �[0msrv proxy_reques: proxying request to model Qwen27B/Qwen3.6-27B-MTP on port 56280
[56280] 6.05.817.120 I srv params_from_: Chat format: peg-native
[56280] 6.05.819.299 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 111031212
[56280] 6.05.819.304 I srv get_availabl: updating prompt cache
[56280] 6.05.820.428 W srv prompt_save: - saving prompt with length 11688, total state size = 583.807 MiB (draft: 45.879 MiB)
[56280] 6.05.938.037 I srv load: - looking for better prompt, base f_keep = 0.000, sim = 0.001
[56280] 6.05.938.044 I srv update: - cache state: 1 prompts, 946.233 MiB (limits: 8192.000 MiB, 72192 tokens, 101188 est)
[56280] 6.05.938.045 I srv update: - prompt 00000163110B7CD0: 11688 tokens, checkpoints: 2, 946.233 MiB
[56280] 6.05.938.048 I srv get_availabl: prompt cache update took 118.74 ms
[56280] 6.05.938.177 I slot launch_slot_: id 0 | task 2790 | processing task, is_child = 0
[56280] 6.05.938.192 I slot update_slots: id 0 | task 2790 | Checking checkpoint with [8302, 8302] against 3...
[56280] 6.05.938.193 I slot update_slots: id 0 | task 2790 | Checking checkpoint with [7790, 7790] against 3...
[56280] 6.05.938.194 W slot update_slots: id 0 | task 2790 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com//pull/13194#issuecomment-2868343055)
[56280] 6.05.938.197 W slot update_slots: id 0 | task 2790 | erased invalidated context checkpoint (pos_min = 7790, pos_max = 7790, n_tokens = 7791, n_swa = 0, pos_next = 0, size = 180.208 MiB)
[56280] 6.05.942.834 W slot update_slots: id 0 | task 2790 | erased invalidated context checkpoint (pos_min = 8302, pos_max = 8302, n_tokens = 8303, n_swa = 0, pos_next = 0, size = 182.218 MiB)
[56280] 6.07.443.957 I slot create_check: id 0 | task 2790 | created context checkpoint 1 of 32 (pos_min = 1223, pos_max = 1223, n_tokens = 1224, size = 154.431 MiB)
[56280] 6.09.726.661 I slot print_timing: id 0 | task 2790 | prompt processing, n_tokens = 3272, progress = 0.79, t = 3.79 s / 863.68 tokens per second
[56280] 6.10.200.108 I slot print_timing: id 0 | task 2790 | prompt processing, n_tokens = 3629, progress = 0.88, t = 4.26 s / 851.50 tokens per second
[56280] 6.10.225.165 I slot create_check: id 0 | task 2790 | created context checkpoint 2 of 32 (pos_min = 3628, pos_max = 3628, n_tokens = 3629, size = 163.871 MiB)
[56280] 6.10.806.087 I slot print_timing: id 0 | task 2790 | prompt processing, n_tokens = 4141, progress = 1.00, t = 4.87 s / 850.68 tokens per second
[56280] 6.10.833.328 I slot create_check: id 0 | task 2790 | created context checkpoint 3 of 32 (pos_min = 4140, pos_max = 4140, n_tokens = 4141, size = 165.881 MiB)
[56280] 6.10.926.252 I begin: ngram_mod occupancy = 8862/4194304 (0.00)
[56280] 6.14.281.611 I slot print_timing: id 0 | task 2790 | n_decoded = 102, tg = 30.41 t/s
[56280] 6.17.340.648 I slot print_timing: id 0 | task 2790 | n_decoded = 180, tg = 28.07 t/s
...
[56280] 8.29.197.005 I slot print_timing: id 0 | task 2790 | n_decoded = 3707, tg = 26.81 t/s
[56280] 8.30.117.777 I slot print_timing: id 0 | task 2790 | prompt eval time = 4988.76 ms / 4145 tokens ( 1.20 ms per token, 830.87 tokens per second)
[56280] 8.30.117.781 I slot print_timing: id 0 | task 2790 | eval time = 139190.75 ms / 3737 tokens ( 37.25 ms per token, 26.85 tokens per second)
[56280] 8.30.117.782 I slot print_timing: id 0 | task 2790 | total time = 144179.51 ms / 7882 tokens
[56280] 8.30.117.782 I slot print_timing: id 0 | task 2790 | graphs reused = 1958
[56280] 8.30.117.783 I slot print_timing: id 0 | task 2790 | draft acceptance = 0.56290 ( 2479 accepted / 4404 generated)
[56280] 8.30.117.796 I statistics ngram-mod: #calls(b,g,a) = 2 2049 45, #gen drafts = 45, #acc drafts = 45, #gen tokens = 2100, #acc tokens = 1006, dur(b,g,a) = 0.631, 4.065, 1.182 ms
[56280] 8.30.117.800 I statistics draft-mtp: #calls(b,g,a) = 2 2004 2004, #gen drafts = 2004, #acc drafts = 1646, #gen tokens = 6012, #acc tokens = 4063, dur(b,g,a) = 0.002, 16293.805, 2.227 ms
[56280] 8.30.117.954 I slot release: id 0 | task 2790 | stop processing: n_tokens = 7882, truncated = 0
[56280] 8.30.117.980 I srv update_slots: all slots are idle
[56280] 8.50.327.489 I que start_loop: entering sleeping state
[56280] cmd_child_to_router:sleep
[56280] 8.50.327.750 I srv handle_sleep: server is entering sleeping state