
Failure to load model prevents loading of further models #7503

@Expro

Description


LocalAI version:
v3.8.0 in container (HIP-BLAS)

Describe the bug
If llama.cpp fails to load a model, any subsequent attempt to load another model (any model) will also fail.

To Reproduce

  1. Try to load any model that will cause a load failure - a model with an unsupported architecture, a model that consumes too much VRAM, it doesn't matter, as long as the load fails.
  2. Try to load any other model that is known to load properly on its own.
  3. Observe the failure. (A minimal reproduction sketch follows this list.)
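For completeness, a minimal reproduction sketch against the OpenAI-compatible `/v1/chat/completions` endpoint (the same endpoint seen in the logs below). The base URL, port, and non-streaming flag are assumptions; the two model names are taken from the log excerpt and should be substituted with whatever failing/working pair you have configured.

```python
# Reproduction sketch, assuming LocalAI listens on http://localhost:8080.
import requests

BASE_URL = "http://localhost:8080"  # hypothetical address of the LocalAI instance

# Model names from the log excerpt; replace with your own failing/working pair.
FAILING_MODEL = "Devstral-Small-2-24B-Instruct-2512-Q4_K_XL-256K-CTX"  # fails to load
WORKING_MODEL = "mistralai_devstral-small-2507"                        # loads fine on its own


def chat(model: str) -> None:
    """Send a single non-streaming chat completion request and print the outcome."""
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": "Hi"}],
            "stream": False,
        },
        timeout=300,
    )
    print(model, "->", resp.status_code, resp.text[:200])


# Step 1: trigger a load failure (unsupported architecture / not enough VRAM).
chat(FAILING_MODEL)

# Step 2: request a model that normally loads fine. With this bug, it also fails with
# "failed to load model with internal loader: ... rpc error: code = Canceled".
chat(WORKING_MODEL)
```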

Logs
Sequence of events: loading Devstral 2 (not yet supported), then loading Devstral 1 (supported, otherwise working):


 DBG GRPC(Devstral-Small-2-24B-Instruct-2512-Q4_K_XL-256K-CTX-127.0.0.1:46589): stderr common_init_from_params: failed to load model '/models/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf', try reducing --n-gpu-layers if you're running out of VRAM
2025-12-10T13:31:09.126595216+01:00 12:31PM DBG GRPC(Devstral-Small-2-24B-Instruct-2512-Q4_K_XL-256K-CTX-127.0.0.1:46589): stderr srv    load_model: failed to load model, '/models/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf'
2025-12-10T13:31:09.128167056+01:00 12:31PM ERR Failed to load model Devstral-Small-2-24B-Instruct-2512-Q4_K_XL-256K-CTX with backend llama-cpp error="failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = " modelID=Devstral-Small-2-24B-Instruct-2512-Q4_K_XL-256K-CTX
2025-12-10T13:31:09.128201730+01:00 12:31PM DBG No choices in the response, skipping
2025-12-10T13:31:09.128201730+01:00 12:31PM DBG No choices in the response, skipping
2025-12-10T13:31:09.128201730+01:00 12:31PM DBG No choices in the response, skipping
2025-12-10T13:31:09.128214287+01:00 12:31PM ERR Stream ended with error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = 
2025-12-10T13:31:09.128363460+01:00 12:31PM INF HTTP request method=POST path=/v1/chat/completions status=200
2025-12-10T13:31:24.084600914+01:00 12:31PM DBG context local model name not found, setting to the first model first model name=Qwen3-Coder-REAP-363B-A35B-TQ1_0
2025-12-10T13:31:25.257372788+01:00 12:31PM DBG guessDefaultsFromFile: NGPULayers set NGPULayers=99999999
2025-12-10T13:31:25.257372788+01:00 12:31PM DBG guessDefaultsFromFile: template already set name=mistralai_devstral-small-2507
2025-12-10T13:31:25.257637398+01:00 12:31PM DBG Chat endpoint configuration read: &{modelConfigFile:/models/mistralai_devstral-small-2507.yaml PredictionOptions:{BasicModelRequest:{Model:mistralai_Devstral-Small-2507-Q4_K_M.gguf} Language: Translate:false N:0 TopP:0xc000f48640 TopK:0xc000f48648 Temperature:0xc000f48650 Maxtokens:0xc000f48680 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000f48678 TypicalP:0xc000f48670 Seed:0xc000f48690 Logprobs:{Enabled:false} TopLogprobs:<nil> LogitBias:map[] NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 ClipSkip:0 Tokenizer:} Name:mistralai_devstral-small-2507 F16:0xc000f485e8 Threads:0xc000f48630 Debug:0xc0028a33e0 Roles:map[] Embeddings:0xc000f48689 Backend:llama-cpp TemplateConfig:{Chat:{{.Input -}}
2025-12-10T13:31:25.257637398+01:00  ChatMessage:{{if eq .RoleName "user" -}}
2025-12-10T13:31:25.257637398+01:00 [INST] {{.Content }} [/INST]
2025-12-10T13:31:25.257637398+01:00 {{- else if .FunctionCall -}}
2025-12-10T13:31:25.257637398+01:00 [TOOL_CALLS] {{toJson .FunctionCall}} [/TOOL_CALLS]
2025-12-10T13:31:25.257637398+01:00 {{- else if eq .RoleName "tool" -}}
2025-12-10T13:31:25.257637398+01:00 [TOOL_RESULTS] {{.Content}} [/TOOL_RESULTS]
2025-12-10T13:31:25.257637398+01:00 {{- else -}}
2025-12-10T13:31:25.257637398+01:00 {{ .Content -}}
2025-12-10T13:31:25.257637398+01:00 {{ end -}} Completion:{{.Input}}
2025-12-10T13:31:25.257637398+01:00  Edit: Functions:[AVAILABLE_TOOLS] [{{range .Functions}}{"type": "function", "function": {"name": "{{.Name}}", "description": "{{.Description}}", "parameters": {{toJson .Parameters}} }}{{end}} ] [/AVAILABLE_TOOLS]{{.Input }} UseTokenizerTemplate:false JoinChatMessagesByCharacter:0xc000f50220 Multimodal: ReplyPrefix:} KnownUsecaseStrings:[FLAG_COMPLETION FLAG_CHAT] KnownUsecases:0xc00221b708 Pipeline:{TTS: LLM: Transcription: VAD:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:true GrammarConfig:{ParallelCalls:true DisableParallelNewLines:true MixedMode:false NoMixedFreeString:false NoGrammar:true Prefix: ExpectStringsAfterJSON:false PropOrder: SchemaType: GrammarTriggers:[]} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[(?s)\[TOOL\_CALLS\](.*)] ArgumentRegex:[] ArgumentRegexKey: ArgumentRegexValue: ReplaceFunctionResults:[{Key:(?s)^[^{\[]* Value:} {Key:(?s)[^}\]]*$ Value:} {Key:(?s)\[TOOL\_CALLS\] Value:} {Key:(?s)\[\/TOOL\_CALLS\] Value:}] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000f48668 MirostatTAU:0xc000f48660 Mirostat:0xc000f485f0 NGPULayers:0xc000d6aab8 MMap:0xc000f48620 MMlock:0xc000f48689 LowVRAM:0xc000f48689 Reranking:0xc000f48689 Grammar: StopWords:[<|im_end|> <dummy32000> </tool_call> <|eot_id|> <|end_of_text|> </s> [/TOOL_CALLS] [/ACTIONS]] Cutstrings:[] ExtractRegex:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000f485d8 NUMA:false LoraAdapter: LoraBase: LoraAdapters:[] LoraScales:[] LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: LoadFormat: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 DisableLogStatus:false DType: LimitMMPerPrompt:{LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0} MMProj: FlashAttention:0xc000f11d00 NoKVOffloading:false CacheTypeK:q4_0 CacheTypeV:q4_0 RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 CFGScale:0} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: AudioPath:} CUDA:false DownloadFiles:[] Description: Usage: Options:[gpu] Overrides:[] MCP:{Servers: Stdio:} Agent:{MaxAttempts:0 MaxIterations:0 EnableReasoning:false EnablePlanning:false EnableMCPPrompts:false EnablePlanReEvaluator:false}}
2025-12-10T13:31:25.257719114+01:00 12:31PM DBG Parameters: &{modelConfigFile:/models/mistralai_devstral-small-2507.yaml PredictionOptions:{BasicModelRequest:{Model:mistralai_Devstral-Small-2507-Q4_K_M.gguf} Language: Translate:false N:0 TopP:0xc000f48640 TopK:0xc000f48648 Temperature:0xc000f48650 Maxtokens:0xc000f48680 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000f48678 TypicalP:0xc000f48670 Seed:0xc000f48690 Logprobs:{Enabled:false} TopLogprobs:<nil> LogitBias:map[] NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 ClipSkip:0 Tokenizer:} Name:mistralai_devstral-small-2507 F16:0xc000f485e8 Threads:0xc000f48630 Debug:0xc0028a33e0 Roles:map[] Embeddings:0xc000f48689 Backend:llama-cpp TemplateConfig:{Chat:{{.Input -}}
2025-12-10T13:31:25.257719114+01:00  ChatMessage:{{if eq .RoleName "user" -}}
2025-12-10T13:31:25.257719114+01:00 [INST] {{.Content }} [/INST]
2025-12-10T13:31:25.257719114+01:00 {{- else if .FunctionCall -}}
2025-12-10T13:31:25.257719114+01:00 [TOOL_CALLS] {{toJson .FunctionCall}} [/TOOL_CALLS]
2025-12-10T13:31:25.257719114+01:00 {{- else if eq .RoleName "tool" -}}
2025-12-10T13:31:25.257719114+01:00 [TOOL_RESULTS] {{.Content}} [/TOOL_RESULTS]
2025-12-10T13:31:25.257719114+01:00 {{- else -}}
2025-12-10T13:31:25.257719114+01:00 {{ .Content -}}
2025-12-10T13:31:25.257719114+01:00 {{ end -}} Completion:{{.Input}}
2025-12-10T13:31:25.257719114+01:00  Edit: Functions:[AVAILABLE_TOOLS] [{{range .Functions}}{"type": "function", "function": {"name": "{{.Name}}", "description": "{{.Description}}", "parameters": {{toJson .Parameters}} }}{{end}} ] [/AVAILABLE_TOOLS]{{.Input }} UseTokenizerTemplate:false JoinChatMessagesByCharacter:0xc000f50220 Multimodal: ReplyPrefix:} KnownUsecaseStrings:[FLAG_COMPLETION FLAG_CHAT] KnownUsecases:0xc00221b708 Pipeline:{TTS: LLM: Transcription: VAD:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:true GrammarConfig:{ParallelCalls:true DisableParallelNewLines:true MixedMode:false NoMixedFreeString:false NoGrammar:true Prefix: ExpectStringsAfterJSON:false PropOrder: SchemaType: GrammarTriggers:[]} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[(?s)\[TOOL\_CALLS\](.*)] ArgumentRegex:[] ArgumentRegexKey: ArgumentRegexValue: ReplaceFunctionResults:[{Key:(?s)^[^{\[]* Value:} {Key:(?s)[^}\]]*$ Value:} {Key:(?s)\[TOOL\_CALLS\] Value:} {Key:(?s)\[\/TOOL\_CALLS\] Value:}] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000f48668 MirostatTAU:0xc000f48660 Mirostat:0xc000f485f0 NGPULayers:0xc000d6aab8 MMap:0xc000f48620 MMlock:0xc000f48689 LowVRAM:0xc000f48689 Reranking:0xc000f48689 Grammar: StopWords:[<|im_end|> <dummy32000> </tool_call> <|eot_id|> <|end_of_text|> </s> [/TOOL_CALLS] [/ACTIONS]] Cutstrings:[] ExtractRegex:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000f485d8 NUMA:false LoraAdapter: LoraBase: LoraAdapters:[] LoraScales:[] LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: LoadFormat: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 DisableLogStatus:false DType: LimitMMPerPrompt:{LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0} MMProj: FlashAttention:0xc000f11d00 NoKVOffloading:false CacheTypeK:q4_0 CacheTypeV:q4_0 RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 CFGScale:0} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: AudioPath:} CUDA:false DownloadFiles:[] Description: Usage: Options:[gpu] Overrides:[] MCP:{Servers: Stdio:} Agent:{MaxAttempts:0 MaxIterations:0 EnableReasoning:false EnablePlanning:false EnableMCPPrompts:false EnablePlanReEvaluator:false}}
2025-12-10T13:31:25.257976563+01:00 12:31PM DBG templated message for chat: [INST] Hi [/INST]
2025-12-10T13:31:25.257976563+01:00 12:31PM DBG templated message for chat: Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = 
2025-12-10T13:31:25.257992157+01:00 12:31PM DBG templated message for chat: [INST] hi [/INST]
2025-12-10T13:31:25.258004826+01:00 12:31PM DBG templated message for chat: <span class='error'>Error: Failed to process stream</span>
2025-12-10T13:31:25.258016260+01:00 12:31PM DBG templated message for chat: [INST] hi [/INST]
2025-12-10T13:31:25.258049272+01:00 12:31PM DBG templated message for chat: Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = 
2025-12-10T13:31:25.258049272+01:00 12:31PM DBG templated message for chat: [INST] hi [/INST]
2025-12-10T13:31:25.258063174+01:00 12:31PM DBG templated message for chat: Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = 
2025-12-10T13:31:25.258075357+01:00 12:31PM DBG templated message for chat: [INST] hi [/INST]
2025-12-10T13:31:25.258103927+01:00 12:31PM DBG Prompt (before templating): [INST] Hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]<span class='error'>Error: Failed to process stream</span>[INST] hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]
2025-12-10T13:31:25.258225843+01:00 12:31PM DBG Template found, input modified to: [INST] Hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]<span class='error'>Error: Failed to process stream</span>[INST] hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]
2025-12-10T13:31:25.258225843+01:00 12:31PM DBG Prompt (after templating): [INST] Hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]<span class='error'>Error: Failed to process stream</span>[INST] hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc = [INST] hi [/INST]
2025-12-10T13:31:25.258225843+01:00 12:31PM DBG Stream request received
2025-12-10T13:31:25.258311895+01:00 12:31PM DBG Sending chunk: {"created":1765369885,"object":"chat.completion.chunk","id":"e3f2b78b-7048-458b-af1c-407aebfddc0a","model":"mistralai_devstral-small-2507","choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant","content":null}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
