
feat: token counter for conversation length validation #6428

@qnixsynapse

Description

Problem Statement

Jan’s chat application currently has no reliable way to detect when a conversation exceeds the model’s context window.

  • Previously we relied on error messages returned by llama.cpp (e.g., “out of context”).
  • llama.cpp no longer emits these errors; instead it returns stop_reason: "length" in the final OpenAI‑compatible chunk (see the detection sketch after this list).
  • As a result, the existing “out of context” handling is broken:
    • When the number of input tokens exceeds the model’s context length, a popup still appears, but clicking “truncate input” does not actually truncate the prompt.
    • llama.cpp then enables context shifting, yet still throws an error because the input token count remains larger than the allowed context size.
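
The detection itself is a small check on the final streamed chunk. Below is a minimal TypeScript sketch, assuming an OpenAI‑compatible response shape; the exact field name (stop_reason as reported here, finish_reason in the standard OpenAI schema) should be verified against the llama.cpp build Jan ships, so the check accepts either.

```ts
// Minimal sketch, assuming an OpenAI-compatible streamed response from llama.cpp.
// Field names are assumptions: the issue reports stop_reason, the usual OpenAI
// schema uses finish_reason, so both are checked.
interface StreamChunkChoice {
  delta?: { content?: string };
  finish_reason?: string | null;
  stop_reason?: string | null;
}

interface StreamChunk {
  choices: StreamChunkChoice[];
}

/** True when the final chunk indicates the context/length limit was hit. */
function hitLengthLimit(chunk: StreamChunk): boolean {
  const choice = chunk.choices[0];
  if (!choice) return false;
  return (choice.stop_reason ?? choice.finish_reason) === "length";
}
```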

Feature Idea

Introduce a token counter that runs on every incoming user message (and on system messages added to the conversation), sketched in code after this list, to:

  1. Calculate the cumulative token count of the entire conversation (including system, user, and assistant messages) using the same tokeniser that the backend model uses.
  2. Validate against the model’s max context length before sending the request to llama.cpp.
  3. If the upcoming request would exceed the limit, apply one of the following strategies (configurable):
    • Truncate the oldest user/assistant messages until the token budget fits.
    • Summarise the truncated portion (optional future enhancement).
    • Show a UI warning with an actionable “Truncate input” button that now actually performs the truncation based on the token counter.
  4. Update the UI to reflect the current token usage (e.g., “Tokens: 3 200 / 4 096”).
  5. Fallback handling – if, for any reason, llama.cpp still returns stop_reason: "length", gracefully recover by re‑truncating and resubmitting the request.
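
As a sketch of items 1–3 (and the basis for item 5), the TypeScript below counts tokens for the whole conversation via llama-server’s /tokenize endpoint and drops the oldest non-system messages until the conversation fits. The endpoint being reachable at baseUrl, the reservedForReply budget, and the omission of chat‑template overhead tokens are all assumptions for illustration, not a final design.

```ts
// Minimal sketch, assuming llama-server exposes POST /tokenize at `baseUrl` and
// ignoring chat-template overhead tokens (a real implementation must account for
// those). Names such as reservedForReply are illustrative, not an API proposal.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Count tokens for one string with the same tokeniser the backend model uses.
async function countTokens(baseUrl: string, text: string): Promise<number> {
  const res = await fetch(`${baseUrl}/tokenize`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ content: text }),
  });
  const data = (await res.json()) as { tokens: number[] };
  return data.tokens.length;
}

// Cumulative token count of the entire conversation (system + user + assistant).
async function conversationTokens(baseUrl: string, messages: ChatMessage[]): Promise<number> {
  let total = 0;
  for (const m of messages) {
    total += await countTokens(baseUrl, m.content);
  }
  return total;
}

// Drop the oldest non-system messages until the conversation fits within the
// model's context length minus a reservation for the reply.
async function truncateToFit(
  baseUrl: string,
  messages: ChatMessage[],
  maxContext: number,
  reservedForReply = 512,
): Promise<ChatMessage[]> {
  const budget = maxContext - reservedForReply;
  const kept = [...messages];
  while ((await conversationTokens(baseUrl, kept)) > budget) {
    const idx = kept.findIndex((m) => m.role !== "system"); // oldest user/assistant message
    if (idx === -1) break; // only system messages left; nothing more to drop
    kept.splice(idx, 1);
  }
  return kept;
}
```

The same conversationTokens() result could drive the “Tokens: 3 200 / 4 096” indicator in item 4, so the UI and the validation share one source of truth.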

Acceptance Criteria

| # | Condition | Expected Outcome |
|---|-----------|------------------|
| 1 | Token count > model context length before request | UI blocks send, shows warning, and either truncates automatically or after user confirmation |
| 2 | User clicks “Truncate input” | Oldest messages are removed until token count ≤ context limit; request proceeds without error |
| 3 | Token counter stays in sync with llama.cpp tokeniser | Token counts reported in UI match the actual tokens sent to the backend |
| 4 | stop_reason: "length" still returned | System detects it, re‑applies truncation, and retries transparently |
| 5 | Normal conversation flow (token count ≤ limit) | No warning shown; token usage indicator updates live |
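
For criterion 4, the fallback can wrap the send path: if the final chunk still reports a length stop, re‑truncate with a larger reply reservation and resubmit. A minimal sketch follows, reusing the hypothetical hitLengthLimit(), truncateToFit(), ChatMessage, and StreamChunk from the earlier sketches; sendChat() stands in for whatever Jan actually uses to submit the request and resolve the final streamed chunk.

```ts
// Minimal sketch of the criterion 4 fallback. sendChat() is a placeholder for
// Jan's real request path and is assumed to resolve with the final streamed chunk.
async function sendWithLengthFallback(
  baseUrl: string,
  messages: ChatMessage[],
  maxContext: number,
  sendChat: (msgs: ChatMessage[]) => Promise<StreamChunk>,
  maxRetries = 2,
): Promise<StreamChunk> {
  let current = await truncateToFit(baseUrl, messages, maxContext);
  let finalChunk = await sendChat(current);

  for (let attempt = 0; attempt < maxRetries && hitLengthLimit(finalChunk); attempt++) {
    // The backend still ran out of context: reserve more room (heuristic) and resubmit.
    current = await truncateToFit(baseUrl, current, maxContext, 1024 * (attempt + 2));
    finalChunk = await sendChat(current);
  }
  return finalChunk;
}
```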

Implementing this token counter will restore robust length validation, prevent unhandled errors, and give users clear visibility into token consumption.
