LocalAI version:
v3.8.0 in container (HIP-BLAS)
Describe the bug
If llama.cpp fails to load a model, any subsequent attempt to load another model (any model) will also fail.
To Reproduce
- Try to load any model that will fail to load: one with an unsupported architecture, one that consumes too much VRAM, etc. The cause doesn't matter, as long as the load fails.
- Try to load any other model that is known to load properly on its own.
- Observe the failure (a minimal API-level reproduction sketch follows this list).
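The same sequence can be driven over the OpenAI-compatible API. A minimal sketch, assuming LocalAI listens on localhost:8080 and that `broken-model` / `working-model` are hypothetical placeholders for any failing and any known-good model respectively:

```go
// Reproduction sketch: first request a model that fails to load,
// then a model that normally loads fine. With this bug, the second
// request also fails instead of loading the healthy model.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func chat(model string) {
	body := []byte(fmt.Sprintf(
		`{"model": %q, "messages": [{"role": "user", "content": "Hi"}]}`, model))
	resp, err := http.Post(
		"http://localhost:8080/v1/chat/completions", // assumed LocalAI address
		"application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Printf("%s -> %d: %s\n", model, resp.StatusCode, out)
}

func main() {
	chat("broken-model")  // fails to load (unsupported arch / OOM), as expected
	chat("working-model") // should succeed, but fails after the load failure above
}
```

Expected: the second request loads `working-model` and returns a completion. Actual: it fails with the same "failed to load model with internal loader" error shown in the logs below.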
Logs
Sequence: loading Devstral 2 (not yet supported), then loading Devstral 1 (supported and otherwise working). Note in the log below how the error from the first failed load leaks into the chat history and is templated into the prompt of the second request:
DBG GRPC(Devstral-Small-2-24B-Instruct-2512-Q4_K_XL-256K-CTX-127.0.0.1:46589): stderr common_init_from_params: failed to load model '/models/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf', try reducing --n-gpu-layers if you're running out of VRAM
2025-12-10T13:31:09.126595216+01:00 12:31PM DBG GRPC(Devstral-Small-2-24B-Instruct-2512-Q4_K_XL-256K-CTX-127.0.0.1:46589): stderr srv load_model: failed to load model, '/models/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf'
2025-12-10T13:31:09.128167056+01:00 12:31PM ERR Failed to load model Devstral-Small-2-24B-Instruct-2512-Q4_K_XL-256K-CTX with backend llama-cpp error="failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = " modelID=Devstral-Small-2-24B-Instruct-2512-Q4_K_XL-256K-CTX
2025-12-10T13:31:09.128201730+01:00 12:31PM DBG No choices in the response, skipping
2025-12-10T13:31:09.128201730+01:00 12:31PM DBG No choices in the response, skipping
2025-12-10T13:31:09.128201730+01:00 12:31PM DBG No choices in the response, skipping
2025-12-10T13:31:09.128214287+01:00 12:31PM ERR Stream ended with error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc =
2025-12-10T13:31:09.128363460+01:00 12:31PM INF HTTP request method=POST path=/v1/chat/completions status=200
2025-12-10T13:31:24.084600914+01:00 12:31PM DBG context local model name not found, setting to the first model first model name=Qwen3-Coder-REAP-363B-A35B-TQ1_0
2025-12-10T13:31:25.257372788+01:00 12:31PM DBG guessDefaultsFromFile: NGPULayers set NGPULayers=99999999
2025-12-10T13:31:25.257372788+01:00 12:31PM DBG guessDefaultsFromFile: template already set name=mistralai_devstral-small-2507
2025-12-10T13:31:25.257637398+01:00 12:31PM DBG Chat endpoint configuration read: &{modelConfigFile:/models/mistralai_devstral-small-2507.yaml PredictionOptions:{BasicModelRequest:{Model:mistralai_Devstral-Small-2507-Q4_K_M.gguf} Language: Translate:false N:0 TopP:0xc000f48640 TopK:0xc000f48648 Temperature:0xc000f48650 Maxtokens:0xc000f48680 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000f48678 TypicalP:0xc000f48670 Seed:0xc000f48690 Logprobs:{Enabled:false} TopLogprobs:<nil> LogitBias:map[] NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 ClipSkip:0 Tokenizer:} Name:mistralai_devstral-small-2507 F16:0xc000f485e8 Threads:0xc000f48630 Debug:0xc0028a33e0 Roles:map[] Embeddings:0xc000f48689 Backend:llama-cpp TemplateConfig:{Chat:{{.Input -}}
2025-12-10T13:31:25.257637398+01:00 ChatMessage:{{if eq .RoleName "user" -}}
2025-12-10T13:31:25.257637398+01:00 [INST] {{.Content }} [/INST]
2025-12-10T13:31:25.257637398+01:00 {{- else if .FunctionCall -}}
2025-12-10T13:31:25.257637398+01:00 [TOOL_CALLS] {{toJson .FunctionCall}} [/TOOL_CALLS]
2025-12-10T13:31:25.257637398+01:00 {{- else if eq .RoleName "tool" -}}
2025-12-10T13:31:25.257637398+01:00 [TOOL_RESULTS] {{.Content}} [/TOOL_RESULTS]
2025-12-10T13:31:25.257637398+01:00 {{- else -}}
2025-12-10T13:31:25.257637398+01:00 {{ .Content -}}
2025-12-10T13:31:25.257637398+01:00 {{ end -}} Completion:{{.Input}}
2025-12-10T13:31:25.257637398+01:00 Edit: Functions:[AVAILABLE_TOOLS] [{{range .Functions}}{"type": "function", "function": {"name": "{{.Name}}", "description": "{{.Description}}", "parameters": {{toJson .Parameters}} }}{{end}} ] [/AVAILABLE_TOOLS]{{.Input }} UseTokenizerTemplate:false JoinChatMessagesByCharacter:0xc000f50220 Multimodal: ReplyPrefix:} KnownUsecaseStrings:[FLAG_COMPLETION FLAG_CHAT] KnownUsecases:0xc00221b708 Pipeline:{TTS: LLM: Transcription: VAD:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:true GrammarConfig:{ParallelCalls:true DisableParallelNewLines:true MixedMode:false NoMixedFreeString:false NoGrammar:true Prefix: ExpectStringsAfterJSON:false PropOrder: SchemaType: GrammarTriggers:[]} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[(?s)\[TOOL\_CALLS\](.*)] ArgumentRegex:[] ArgumentRegexKey: ArgumentRegexValue: ReplaceFunctionResults:[{Key:(?s)^[^{\[]* Value:} {Key:(?s)[^}\]]*$ Value:} {Key:(?s)\[TOOL\_CALLS\] Value:} {Key:(?s)\[\/TOOL\_CALLS\] Value:}] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000f48668 MirostatTAU:0xc000f48660 Mirostat:0xc000f485f0 NGPULayers:0xc000d6aab8 MMap:0xc000f48620 MMlock:0xc000f48689 LowVRAM:0xc000f48689 Reranking:0xc000f48689 Grammar: StopWords:[<|im_end|> <dummy32000> </tool_call> <|eot_id|> <|end_of_text|> </s> [/TOOL_CALLS] [/ACTIONS]] Cutstrings:[] ExtractRegex:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000f485d8 NUMA:false LoraAdapter: LoraBase: LoraAdapters:[] LoraScales:[] LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: LoadFormat: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 DisableLogStatus:false DType: LimitMMPerPrompt:{LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0} MMProj: FlashAttention:0xc000f11d00 NoKVOffloading:false CacheTypeK:q4_0 CacheTypeV:q4_0 RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 CFGScale:0} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: AudioPath:} CUDA:false DownloadFiles:[] Description: Usage: Options:[gpu] Overrides:[] MCP:{Servers: Stdio:} Agent:{MaxAttempts:0 MaxIterations:0 EnableReasoning:false EnablePlanning:false EnableMCPPrompts:false EnablePlanReEvaluator:false}}
2025-12-10T13:31:25.257719114+01:00 12:31PM DBG Parameters: &{modelConfigFile:/models/mistralai_devstral-small-2507.yaml PredictionOptions:{BasicModelRequest:{Model:mistralai_Devstral-Small-2507-Q4_K_M.gguf} Language: Translate:false N:0 TopP:0xc000f48640 TopK:0xc000f48648 Temperature:0xc000f48650 Maxtokens:0xc000f48680 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000f48678 TypicalP:0xc000f48670 Seed:0xc000f48690 Logprobs:{Enabled:false} TopLogprobs:<nil> LogitBias:map[] NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 ClipSkip:0 Tokenizer:} Name:mistralai_devstral-small-2507 F16:0xc000f485e8 Threads:0xc000f48630 Debug:0xc0028a33e0 Roles:map[] Embeddings:0xc000f48689 Backend:llama-cpp TemplateConfig:{Chat:{{.Input -}}
2025-12-10T13:31:25.257719114+01:00 ChatMessage:{{if eq .RoleName "user" -}}
2025-12-10T13:31:25.257719114+01:00 [INST] {{.Content }} [/INST]
2025-12-10T13:31:25.257719114+01:00 {{- else if .FunctionCall -}}
2025-12-10T13:31:25.257719114+01:00 [TOOL_CALLS] {{toJson .FunctionCall}} [/TOOL_CALLS]
2025-12-10T13:31:25.257719114+01:00 {{- else if eq .RoleName "tool" -}}
2025-12-10T13:31:25.257719114+01:00 [TOOL_RESULTS] {{.Content}} [/TOOL_RESULTS]
2025-12-10T13:31:25.257719114+01:00 {{- else -}}
2025-12-10T13:31:25.257719114+01:00 {{ .Content -}}
2025-12-10T13:31:25.257719114+01:00 {{ end -}} Completion:{{.Input}}
2025-12-10T13:31:25.257719114+01:00 Edit: Functions:[AVAILABLE_TOOLS] [{{range .Functions}}{"type": "function", "function": {"name": "{{.Name}}", "description": "{{.Description}}", "parameters": {{toJson .Parameters}} }}{{end}} ] [/AVAILABLE_TOOLS]{{.Input }} UseTokenizerTemplate:false JoinChatMessagesByCharacter:0xc000f50220 Multimodal: ReplyPrefix:} KnownUsecaseStrings:[FLAG_COMPLETION FLAG_CHAT] KnownUsecases:0xc00221b708 Pipeline:{TTS: LLM: Transcription: VAD:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:true GrammarConfig:{ParallelCalls:true DisableParallelNewLines:true MixedMode:false NoMixedFreeString:false NoGrammar:true Prefix: ExpectStringsAfterJSON:false PropOrder: SchemaType: GrammarTriggers:[]} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[(?s)\[TOOL\_CALLS\](.*)] ArgumentRegex:[] ArgumentRegexKey: ArgumentRegexValue: ReplaceFunctionResults:[{Key:(?s)^[^{\[]* Value:} {Key:(?s)[^}\]]*$ Value:} {Key:(?s)\[TOOL\_CALLS\] Value:} {Key:(?s)\[\/TOOL\_CALLS\] Value:}] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000f48668 MirostatTAU:0xc000f48660 Mirostat:0xc000f485f0 NGPULayers:0xc000d6aab8 MMap:0xc000f48620 MMlock:0xc000f48689 LowVRAM:0xc000f48689 Reranking:0xc000f48689 Grammar: StopWords:[<|im_end|> <dummy32000> </tool_call> <|eot_id|> <|end_of_text|> </s> [/TOOL_CALLS] [/ACTIONS]] Cutstrings:[] ExtractRegex:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000f485d8 NUMA:false LoraAdapter: LoraBase: LoraAdapters:[] LoraScales:[] LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: LoadFormat: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 DisableLogStatus:false DType: LimitMMPerPrompt:{LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0} MMProj: FlashAttention:0xc000f11d00 NoKVOffloading:false CacheTypeK:q4_0 CacheTypeV:q4_0 RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 CFGScale:0} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: AudioPath:} CUDA:false DownloadFiles:[] Description: Usage: Options:[gpu] Overrides:[] MCP:{Servers: Stdio:} Agent:{MaxAttempts:0 MaxIterations:0 EnableReasoning:false EnablePlanning:false EnableMCPPrompts:false EnablePlanReEvaluator:false}}
2025-12-10T13:31:25.257976563+01:00 12:31PM DBG templated message for chat: [INST] Hi [/INST]
2025-12-10T13:31:25.257976563+01:00 12:31PM DBG templated message for chat: Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc =
2025-12-10T13:31:25.257992157+01:00 12:31PM DBG templated message for chat: [INST] hi [/INST]
2025-12-10T13:31:25.258004826+01:00 12:31PM DBG templated message for chat: <span class='error'>Error: Failed to process stream</span>
2025-12-10T13:31:25.258016260+01:00 12:31PM DBG templated message for chat: [INST] hi [/INST]
2025-12-10T13:31:25.258049272+01:00 12:31PM DBG templated message for chat: Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc =
2025-12-10T13:31:25.258049272+01:00 12:31PM DBG templated message for chat: [INST] hi [/INST]
2025-12-10T13:31:25.258063174+01:00 12:31PM DBG templated message for chat: Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc =
2025-12-10T13:31:25.258075357+01:00 12:31PM DBG templated message for chat: [INST] hi [/INST]
2025-12-10T13:31:25.258103927+01:00 12:31PM DBG Prompt (before templating): [INST] Hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]<span class='error'>Error: Failed to process stream</span>[INST] hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]
2025-12-10T13:31:25.258225843+01:00 12:31PM DBG Template found, input modified to: [INST] Hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]<span class='error'>Error: Failed to process stream</span>[INST] hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]
2025-12-10T13:31:25.258225843+01:00 12:31PM DBG Prompt (after templating): [INST] Hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]<span class='error'>Error: Failed to process stream</span>[INST] hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]
2025-12-10T13:31:25.258225843+01:00 12:31PM DBG Stream request received
2025-12-10T13:31:25.258311895+01:00 12:31PM DBG Sending chunk: {"created":1765369885,"object":"chat.completion.chunk","id":"e3f2b78b-7048-458b-af1c-407aebfddc0a","model":"mistralai_devstral-small-2507","choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant","content":null}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}