
feat: token counter for conversation length validation #6428

@qnixsynapse

Description

Problem Statement

Jan’s chat application currently has no reliable way to detect when a conversation exceeds the model’s context window.

  • Previously we relied on error messages returned by llama.cpp (e.g., “out of context”).
  • llama.cpp no longer emits these errors; instead it returns stop_reason: "length" in the final OpenAI‑compatible chunk (see the detection sketch after this list).
  • As a result, the existing “out of context” handling is broken:
    • When the number of input tokens exceeds the model’s context length, a popup still appears, but clicking “truncate input” does not actually truncate the prompt.
    • llama.cpp then enables context shifting, yet still throws an error because the input token count remains larger than the allowed context size.
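
The detection itself is a small check on the final streamed chunk. Below is a minimal TypeScript sketch, assuming an OpenAI‑compatible response shape; the exact field name (stop_reason as reported here, finish_reason in the standard OpenAI schema) should be verified against the llama.cpp build Jan ships, so the check accepts either.

```ts
// Minimal sketch, assuming an OpenAI-compatible streamed response from llama.cpp.
// Field names are assumptions: the issue reports stop_reason, the usual OpenAI
// schema uses finish_reason, so both are checked.
interface StreamChunkChoice {
  delta?: { content?: string };
  finish_reason?: string | null;
  stop_reason?: string | null;
}

interface StreamChunk {
  choices: StreamChunkChoice[];
}

/** True when the final chunk indicates the context/length limit was hit. */
function hitLengthLimit(chunk: StreamChunk): boolean {
  const choice = chunk.choices[0];
  if (!choice) return false;
  return (choice.stop_reason ?? choice.finish_reason) === "length";
}
```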

Feature Idea

Introduce a token counter that runs on every incoming user message (and on system messages added to the conversation), sketched in code after this list, to:

  1. Calculate the cumulative token count of the entire conversation (including system, user, and assistant messages) using the same tokeniser that the backend model uses.
  2. Validate against the model’s max context length before sending the request to llama.cpp.
  3. If the upcoming request would exceed the limit, apply one of the following strategies (configurable):
    • Truncate the oldest user/assistant messages until the token budget fits.
    • Summarise the truncated portion (optional future enhancement).
    • Show a UI warning with an actionable “Truncate input” button that now actually performs the truncation based on the token counter.
  4. Update the UI to reflect the current token usage (e.g., “Tokens: 3 200 / 4 096”).
  5. Fallback handling – if, for any reason, llama.cpp still returns stop_reason: "length", gracefully recover by re‑truncating and resubmitting the request.
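
As a sketch of items 1–3 (and the basis for item 5), the TypeScript below counts tokens for the whole conversation via llama-server’s /tokenize endpoint and drops the oldest non-system messages until the conversation fits. The endpoint being reachable at baseUrl, the reservedForReply budget, and the omission of chat‑template overhead tokens are all assumptions for illustration, not a final design.

```ts
// Minimal sketch, assuming llama-server exposes POST /tokenize at `baseUrl` and
// ignoring chat-template overhead tokens (a real implementation must account for
// those). Names such as reservedForReply are illustrative, not an API proposal.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Count tokens for one string with the same tokeniser the backend model uses.
async function countTokens(baseUrl: string, text: string): Promise<number> {
  const res = await fetch(`${baseUrl}/tokenize`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ content: text }),
  });
  const data = (await res.json()) as { tokens: number[] };
  return data.tokens.length;
}

// Cumulative token count of the entire conversation (system + user + assistant).
async function conversationTokens(baseUrl: string, messages: ChatMessage[]): Promise<number> {
  let total = 0;
  for (const m of messages) {
    total += await countTokens(baseUrl, m.content);
  }
  return total;
}

// Drop the oldest non-system messages until the conversation fits within the
// model's context length minus a reservation for the reply.
async function truncateToFit(
  baseUrl: string,
  messages: ChatMessage[],
  maxContext: number,
  reservedForReply = 512,
): Promise<ChatMessage[]> {
  const budget = maxContext - reservedForReply;
  const kept = [...messages];
  while ((await conversationTokens(baseUrl, kept)) > budget) {
    const idx = kept.findIndex((m) => m.role !== "system"); // oldest user/assistant message
    if (idx === -1) break; // only system messages left; nothing more to drop
    kept.splice(idx, 1);
  }
  return kept;
}
```

The same conversationTokens() result could drive the “Tokens: 3 200 / 4 096” indicator in item 4, so the UI and the validation share one source of truth.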

Acceptance Criteria

| # | Condition | Expected Outcome |
|---|-----------|------------------|
| 1 | Token count > model context length before request | UI blocks send, shows warning, and either truncates automatically or after user confirmation |
| 2 | User clicks “Truncate input” | Oldest messages are removed until token count ≤ context limit; request proceeds without error |
| 3 | Token counter stays in sync with llama.cpp tokeniser | Token counts reported in UI match the actual tokens sent to the backend |
| 4 | stop_reason: "length" still returned | System detects it, re‑applies truncation, and retries transparently |
| 5 | Normal conversation flow (token count ≤ limit) | No warning shown; token usage indicator updates live |
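
For criterion 4, the fallback can wrap the send path: if the final chunk still reports a length stop, re‑truncate with a larger reply reservation and resubmit. A minimal sketch follows, reusing the hypothetical hitLengthLimit(), truncateToFit(), ChatMessage, and StreamChunk from the earlier sketches; sendChat() stands in for whatever Jan actually uses to submit the request and resolve the final streamed chunk.

```ts
// Minimal sketch of the criterion 4 fallback. sendChat() is a placeholder for
// Jan's real request path and is assumed to resolve with the final streamed chunk.
async function sendWithLengthFallback(
  baseUrl: string,
  messages: ChatMessage[],
  maxContext: number,
  sendChat: (msgs: ChatMessage[]) => Promise<StreamChunk>,
  maxRetries = 2,
): Promise<StreamChunk> {
  let current = await truncateToFit(baseUrl, messages, maxContext);
  let finalChunk = await sendChat(current);

  for (let attempt = 0; attempt < maxRetries && hitLengthLimit(finalChunk); attempt++) {
    // The backend still ran out of context: reserve more room (heuristic) and resubmit.
    current = await truncateToFit(baseUrl, current, maxContext, 1024 * (attempt + 2));
    finalChunk = await sendChat(current);
  }
  return finalChunk;
}
```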

Implementing this token counter will restore robust length validation, prevent unhandled errors, and give users clear visibility into token consumption.
