I’m asking for advice because there’s an issue with local models that I just can’t solve. Since it happens with all the models I’ve tested, I’m fairly sure it’s a configuration or code problem rather than something model-specific.
At first, the model works correctly. For a while, it calls tools properly, writes to files, and reads them without any issue. But once the conversation has gone on for some time, it stops actually writing to the files and only writes the content in the chat, while behaving as if it had called write_file or read_file. The hardware side is not the problem; it has worked fine throughout. What I had completely underestimated was context management.
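To make the symptom concrete, here is a minimal sketch of a check that flags a reply which names a tool in plain text without emitting a real call. It assumes an OpenAI-style message dict with a "content" string and an optional "tool_calls" list; that shape and the function name mentions_tool_without_call are assumptions for illustration, not something my setup is confirmed to use.

```python
import re

# Tool names taken from the post; extend as needed.
TOOL_NAMES = ("write_file", "read_file")


def mentions_tool_without_call(message: dict, tool_names=TOOL_NAMES) -> bool:
    """Return True if the text names a tool but no structured call was emitted.

    Assumes an OpenAI-style message: {"role", "content", optional "tool_calls"}.
    """
    # A non-empty tool_calls list means the model really invoked a tool.
    if message.get("tool_calls"):
        return False
    content = message.get("content") or ""
    # Match any tool name as a whole word inside the plain chat text.
    pattern = r"\b(" + "|".join(re.escape(name) for name in tool_names) + r")\b"
    return re.search(pattern, content) is not None
```

A reply that trips this check could be rejected and retried instead of being accepted as if the file had been written. It is only a heuristic: a reply that legitimately discusses write_file in prose would be flagged too.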
The problem isn’t that local models are bad at long contexts. Qwen, on paper, supports 128,000 tokens. The issue is what happens to quality as that window fills up. Around 60–70% of capacity, the model starts to ignore information that was read earlier. It doesn’t fail dramatically; it simply and silently forgets the constraints set at the beginning of the prompt. You end up with output that looks plausible but does not satisfy requirements specified 10,000 tokens earlier.
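One way to act on this would be to cap how full the window is allowed to get and to always keep the system prompt. The sketch below is only an illustration under stated assumptions: messages are dicts with "role" and "content", the token count is a rough estimate of about 4 characters per token rather than a real tokenizer, and trim_history, estimate_tokens, and the 0.6 threshold are hypothetical names and values chosen to match the 60–70% figure above.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); a real tokenizer is more accurate.
    return max(1, len(text) // 4)


def trim_history(messages, context_window=128_000, max_fill=0.6):
    """Keep every system message plus the newest turns that fit the budget."""
    budget = int(context_window * max_fill)
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # System messages are always kept, so they are charged to the budget first.
    used = sum(estimate_tokens(m.get("content") or "") for m in system)

    kept = []
    # Walk from newest to oldest and stop at the first turn that does not fit.
    for message in reversed(rest):
        cost = estimate_tokens(message.get("content") or "")
        if used + cost > budget:
            break
        kept.append(message)
        used += cost

    kept.reverse()  # Restore chronological order.
    return system + kept
```

Calling trim_history(history) before each request would drop the oldest turns once the estimate passes 60% of the window, while the system prompt with the formatting rules stays at the front. Dropping turns loses information, so summarizing the removed turns may be a better fit for some pipelines.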
I realized this because the pipeline was producing technically correct outputs that nevertheless violated a formatting rule I had defined in the system prompt. It took me two days to understand that it wasn’t a logic error: the model simply could no longer “see” the beginning of its own context.
Is there a way to solve this issue so the local model doesn’t break down or start hallucinating?