
Commit 3cd166f

Merge pull request #236 from Pratiyush/feat/50-prompt-caching-batch
feat: prompt caching + batch API scaffold (#50)
2 parents 8ff4eea + 9cf116b commit 3cd166f

6 files changed

Lines changed: 1076 additions & 1 deletion


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -68,6 +68,7 @@ dist-site/
.llmwiki-synth-state.json
.llmwiki-queue.json
.llmwiki-dream-state.json
+.llmwiki-batch-state.json

# v1.0 (#160): multi-agent skill mirrors are derived from .claude/skills/
# by `llmwiki install-skills`. The canonical source ships with the repo;

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ Versions below 1.0 are pre-production — API and file formats may change.

### Added

- **Prompt caching + batch API scaffold** (#50) — new `llmwiki/cache.py` module lands the plumbing for Anthropic `cache_control: {type: "ephemeral"}` usage on the stable ingest prefix (CLAUDE.md schema + `wiki/index.md` + `wiki/overview.md`). Public surface: `make_cached_block()`, `make_plain_block()`, `CachedPrompt` (frozen dataclass with `stable_prefix` / `dynamic_suffix`), `build_messages()` that emits the Anthropic-shaped message array with the header on the prefix block only. Cost preview: `estimate_tokens()` (char/4 heuristic, stdlib-only — no tokenizer dep), `estimate_cost()` returning a `CostEstimate` with per-bucket (prefix / fresh / output) breakdown, `format_estimate()` for the `--estimate` CLI output, `warn_prefix_too_small()` that flags prefixes below the 1024-token cache floor, `MODEL_PRICING` rate card for Sonnet 4.6 / Haiku 4 / Opus 4 (input, cached_input, cache_write, output USD/MTok). Batch state persistence: `BatchJob`, `BatchState`, `load_batch_state()`, `save_batch_state()`, `add_pending()` (dedup by batch_id), `mark_completed()` — all round-tripped through `.llmwiki-batch-state.json` (gitignored). New `llmwiki synthesize --estimate` CLI flag walks the discovered raw sessions, prices the batch assuming the first call is a cache write and the rest are hits, prints a line-item breakdown plus total. Docs: `docs/reference/prompt-caching.md`. 49 tests cover: cache-block shape, CachedPrompt empty-edge cases, build_messages structure, token/cost math (invariant: cached_input < input for every model, breakdown sums to total, rejects unknown models + negative tokens), batch-state round-trip, `add_pending` dedup, CLI wiring.
- **Ollama backend scaffold for local LLM synthesis** (#35) — new `llmwiki/synth/ollama.py` delivers the `OllamaSynthesizer` backend against the existing `BaseSynthesizer` contract. Stdlib-only HTTP via `urllib` (no new dependency). Configurable through `sessions_config.json` → `synthesis.backend = "ollama"` with `model` / `base_url` / `timeout` / `max_retries` fields (defaults: `llama3.1:8b` at `http://127.0.0.1:11434`, 60s timeout, 3 retries with exponential backoff). Privacy-by-default: loopback host only; a warning logs once if the user points the backend at a non-local host. `is_available()` probes `/api/tags` so callers can branch before long synthesis runs. Graceful error handling: `OllamaUnavailableError` (connection refused / DNS failure — no retries, caller skips), `OllamaHTTPError` (non-2xx after retries), `OllamaError` (non-JSON body, non-string response field). New `resolve_backend()` in `pipeline.py` selects backend from config (`dummy` | `ollama`); unknown names fall back to dummy with a warning. New `llmwiki synthesize [--check | --dry-run | --force]` CLI subcommand surfaces backend status without running synthesis. 43 tests (mocked HTTP — no network in CI): config parsing, URL construction, availability probing, retry + backoff on 5xx and socket timeout, no-retry on 4xx / connection refused, non-JSON response handling, unicode round-trip, curly-brace-safe prompt rendering, CLI registration, resolver fallback.

- **`wiki/candidates/` approval workflow** (#51) — new `llmwiki/candidates.py` module with `list`, `promote`, `merge`, `discard`, and `stale_candidates` primitives. New pages from `/wiki-ingest` that represent brand-new entities/concepts can now land in `wiki/candidates/<kind>/<slug>.md` with `status: candidate` instead of going straight into the trusted wiki. `/wiki-review` slash command (`.claude/commands/wiki-review.md`) + `llmwiki candidates <action>` CLI walk through the queue. Merge folds the candidate's body under a `## Candidate merge — <date>` heading in the target and archives the source. Discard moves to `wiki/archive/candidates/<timestamp>/` with a timestamped `.reason.txt` audit file. New `stale_candidates` lint rule (12th overall) flags candidates sitting idle > 30 days. 34 tests cover: all 4 action paths, frontmatter status rewrite, staleness computation, kind inference, error handling.

docs/reference/prompt-caching.md

Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
# Prompt caching + batch API

> Status: scaffold (v1.1.0 · #50). The plumbing (cache-block
> construction, token estimator, batch-state store) lives in
> `llmwiki/cache.py`. The actual Anthropic backend that consumes it
> lands in v1.2 behind a separate PR.

## Why cache prompts?

Every `/wiki-sync` and `/wiki-ingest` bundles the same stable prefix
with every source file it asks the model to summarize:

- The `CLAUDE.md` schema (~3 k tokens)
- The current `wiki/index.md` (grows with the wiki)
- The current `wiki/overview.md`

On a 500-page wiki that prefix is ≈ 30 k tokens **per request**. Marking
the prefix with `cache_control: { type: "ephemeral" }` tells Anthropic
to cache it server-side. The first call pays the `cache_write` rate
(125 % of the fresh `input` rate in the rate card); subsequent calls
that hit the cache pay the `cached_input` rate (10 % of the fresh
`input` rate) instead of the full input rate.
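
As a rough worked example of the paragraph above, the sketch below prices a 30 k-token prefix at the `claude-sonnet-4-6` rates listed in this document's rate card. The helper name `prefix_cost` is hypothetical and is not part of `llmwiki/cache.py`.

```python
# Rates in USD per 1M tokens, copied from the rate card in this document.
INPUT = 3.00
CACHED_INPUT = 0.30
CACHE_WRITE = 3.75


def prefix_cost(tokens: int, rate_per_mtok: float) -> float:
    """Cost in USD of sending `tokens` tokens at the given per-MTok rate."""
    return tokens * rate_per_mtok / 1_000_000


prefix_tokens = 30_000  # the 500-page-wiki figure quoted above

uncached = prefix_cost(prefix_tokens, INPUT)          # every request, no caching
first_call = prefix_cost(prefix_tokens, CACHE_WRITE)  # first request writes the cache
cache_hit = prefix_cost(prefix_tokens, CACHED_INPUT)  # later requests read the cache

print(f"uncached:  ${uncached:.4f}")
print(f"write:     ${first_call:.4f}")
print(f"cache hit: ${cache_hit:.4f}")
```

Under these rates the write costs more than a plain request, so caching only pays off once at least one later request actually hits the cache.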

sage-wiki reports **50–90 % savings** on bulk ingest with this pattern.

## Build a cached prompt

```python
from llmwiki.cache import CachedPrompt, build_messages

# claude_md_schema, current_index, current_overview and session_body
# are strings the caller has already loaded.
prompt = CachedPrompt(
    stable_prefix=claude_md_schema + current_index + current_overview,
    dynamic_suffix=session_body,
)

messages = build_messages(prompt)
# [
#   {
#     "role": "user",
#     "content": [
#       {"type": "text", "text": "...schema + index + overview...",
#        "cache_control": {"type": "ephemeral"}},
#       {"type": "text", "text": "...session body..."},
#     ],
#   },
# ]
```
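
Anthropic caches everything up to and including the block that carries
`cache_control`, so the header goes on the *last* block you want cached.
`build_messages()` therefore always puts the cached prefix block before
the plain dynamic-suffix block.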
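
To make the block ordering concrete, here is a minimal standalone sketch of how such a builder could work. It reuses the names from the public surface (`CachedPrompt`, `make_cached_block`, `make_plain_block`, `build_messages`), but the bodies are assumptions written for illustration, not the code in `llmwiki/cache.py`. In particular, skipping empty parts is a guess at how the empty-edge cases behave.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class CachedPrompt:
    """Stand-in for llmwiki.cache.CachedPrompt (field names from the docs)."""
    stable_prefix: str
    dynamic_suffix: str


def make_cached_block(text: str) -> dict[str, Any]:
    """Text block carrying the ephemeral cache header."""
    return {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}


def make_plain_block(text: str) -> dict[str, Any]:
    """Text block with no cache header."""
    return {"type": "text", "text": text}


def build_messages(prompt: CachedPrompt) -> list[dict[str, Any]]:
    """Emit one user message: cached prefix first, plain suffix second."""
    content: list[dict[str, Any]] = []
    # Assumption: empty parts are skipped rather than sent as empty blocks.
    if prompt.stable_prefix:
        content.append(make_cached_block(prompt.stable_prefix))
    if prompt.dynamic_suffix:
        content.append(make_plain_block(prompt.dynamic_suffix))
    return [{"role": "user", "content": content}]


messages = build_messages(CachedPrompt("schema + index", "session body"))
print(messages[0]["content"][0]["cache_control"])
```

Only the first block carries `cache_control`, so the per-session body never becomes part of the cached span.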

## Estimate cost before you spend

```
$ llmwiki synthesize --estimate
627 new sessions, prefix 3,944 tok
Model: claude-sonnet-4-6 (first write)
Prefix: 3,944 tok $0.0148
Fresh: 1,274 tok $0.0038
Output: 1,000 tok $0.0150
Total: $0.0336
+ 626 subsequent sessions (cache hit): $17.9484

Batch total: $17.9820 (model claude-sonnet-4-6)
```

`--estimate` never calls the API: it uses the `char / 4` heuristic
from `estimate_tokens()` and the rate table in `MODEL_PRICING`. Treat
it as ± 20 %; the real numbers come back in `usage` on each response.
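
The first-write line items in the sample output can be reproduced by hand. The sketch below assumes the `char / 4` heuristic rounds down (the real rounding in `estimate_tokens()` is not shown here) and uses a hypothetical `usd` helper; the rates are the `claude-sonnet-4-6` row of the rate card.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count: one token per four characters (floor is an assumption)."""
    return len(text) // 4


def usd(tokens: int, rate_per_mtok: float) -> float:
    """Hypothetical helper: cost in USD at a per-1M-token rate."""
    return tokens * rate_per_mtok / 1_000_000


# claude-sonnet-4-6 rates from the rate card (USD per 1M tokens).
RATES = {"input": 3.00, "cached_input": 0.30, "cache_write": 3.75, "output": 15.00}

prefix = usd(3_944, RATES["cache_write"])  # first call writes the cache
fresh = usd(1_274, RATES["input"])         # per-session body, never cached
output = usd(1_000, RATES["output"])
total = prefix + fresh + output

print(f"Prefix: ${prefix:.4f}")
print(f"Fresh:  ${fresh:.4f}")
print(f"Output: ${output:.4f}")
print(f"Total:  ${total:.4f}")
```

These four figures match the `Prefix`, `Fresh`, `Output`, and `Total` lines above. The subsequent-session total depends on each session's own size, so it cannot be derived from the first session alone.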

If the prefix is below the minimum cache size that
`warn_prefix_too_small()` checks (1 024 tokens; Anthropic documents
higher minimums for some models), `--estimate` prints a warning:

```
warning: prefix is 400 tok (< 1024 min) — Anthropic will not cache it;
savings estimate is best-case only.
```
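
A check like this is simple to sketch. The function below is hypothetical: the real `warn_prefix_too_small()` signature and message format are not shown in this document, so the name `prefix_warning`, its return type, and the wording are assumptions.

```python
from typing import Optional

MIN_CACHE_TOKENS = 1024  # cache floor quoted in this document


def prefix_warning(prefix_tokens: int, minimum: int = MIN_CACHE_TOKENS) -> Optional[str]:
    """Return a warning string if the prefix is too small to cache, else None."""
    if prefix_tokens < 0:
        raise ValueError("prefix_tokens must be non-negative")
    if prefix_tokens >= minimum:
        return None
    # Paraphrase of the CLI warning, not its exact text.
    return (
        f"warning: prefix is {prefix_tokens} tok (< {minimum} min); "
        "savings estimate is best-case only."
    )


print(prefix_warning(400))
print(prefix_warning(3_944))
```

Returning the message instead of printing it keeps the check easy to test and leaves the output channel to the CLI layer.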

## Batch submission

Large backfills can go through Anthropic's `message_batches` endpoint
(up to 50 % cheaper and no per-request rate limit). The scaffolding
tracks in-flight batches on disk:

```python
from pathlib import Path
from llmwiki.cache import (
    BatchJob,
    BatchState,
    add_pending,
    load_batch_state,
    mark_completed,
    save_batch_state,
)

repo = Path("/path/to/llm-wiki")
state = load_batch_state(repo)

add_pending(state, BatchJob(
    batch_id="batch_abc",
    source_slugs=["sess-1", "sess-2"],
    submitted_at="2026-04-17T10:00:00Z",
))
save_batch_state(repo, state)

# ... later, when you poll and find it done:
mark_completed(state, "batch_abc")
save_batch_state(repo, state)
```

The state file (`.llmwiki-batch-state.json`) is small JSON, safe to
grep and diff. It is gitignored by default, so committing it to audit
what's been submitted requires `git add -f`.
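
For readers who want to see the round trip end to end, here is a self-contained sketch of a batch-state store with the same function names. The JSON layout (`pending` / `completed` lists) and the return values are assumptions; the real schema in `llmwiki/cache.py` may differ.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path

STATE_FILE = ".llmwiki-batch-state.json"


@dataclass
class BatchJob:
    batch_id: str
    source_slugs: list[str]
    submitted_at: str


@dataclass
class BatchState:
    # Assumed layout: two lists keyed by job status.
    pending: list[BatchJob] = field(default_factory=list)
    completed: list[BatchJob] = field(default_factory=list)


def add_pending(state: BatchState, job: BatchJob) -> bool:
    """Add a job unless its batch_id is already pending (dedup by batch_id)."""
    if any(j.batch_id == job.batch_id for j in state.pending):
        return False
    state.pending.append(job)
    return True


def mark_completed(state: BatchState, batch_id: str) -> bool:
    """Move the matching job from pending to completed."""
    for job in state.pending:
        if job.batch_id == batch_id:
            state.pending.remove(job)
            state.completed.append(job)
            return True
    return False


def save_batch_state(repo: Path, state: BatchState) -> None:
    (repo / STATE_FILE).write_text(json.dumps(asdict(state), indent=2), encoding="utf-8")


def load_batch_state(repo: Path) -> BatchState:
    path = repo / STATE_FILE
    if not path.exists():
        return BatchState()  # no state yet: start empty
    raw = json.loads(path.read_text(encoding="utf-8"))
    return BatchState(
        pending=[BatchJob(**j) for j in raw.get("pending", [])],
        completed=[BatchJob(**j) for j in raw.get("completed", [])],
    )


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    state = load_batch_state(repo)
    job = BatchJob("batch_abc", ["sess-1", "sess-2"], "2026-04-17T10:00:00Z")
    add_pending(state, job)
    add_pending(state, job)  # duplicate batch_id is ignored
    save_batch_state(repo, state)
    reloaded = load_batch_state(repo)

print(len(reloaded.pending), len(reloaded.completed))
```

Keeping the store as plain dataclasses plus `json` means the file stays human-readable and needs no extra dependency.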

## Rate card

From `llmwiki/cache.py :: MODEL_PRICING` (USD per 1 M tokens, as of
v1.1.0):

| Model             | input | cached_input | cache_write | output |
|-------------------|------:|-------------:|------------:|-------:|
| claude-sonnet-4-6 |  3.00 |         0.30 |        3.75 |  15.00 |
| claude-haiku-4    |  0.80 |         0.08 |        1.00 |   4.00 |
| claude-opus-4     | 15.00 |         1.50 |       18.75 |  75.00 |

These are the rates `estimate_cost()` uses. Update them in one place
(`MODEL_PRICING`) when Anthropic publishes new ones.
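
A single rate table also makes the pricing rules easy to check mechanically. The sketch below copies the table above into a dict and prices one request as either a cache write or a cache hit. The function `request_cost` and its parameters are hypothetical; the real `estimate_cost()` returns a `CostEstimate` whose exact fields are not shown in this document.

```python
# Copied from the rate card above (USD per 1M tokens).
MODEL_PRICING = {
    "claude-sonnet-4-6": {"input": 3.00, "cached_input": 0.30, "cache_write": 3.75, "output": 15.00},
    "claude-haiku-4": {"input": 0.80, "cached_input": 0.08, "cache_write": 1.00, "output": 4.00},
    "claude-opus-4": {"input": 15.00, "cached_input": 1.50, "cache_write": 18.75, "output": 75.00},
}


def request_cost(model: str, prefix_tokens: int, fresh_tokens: int,
                 output_tokens: int, cache_hit: bool) -> float:
    """Price one request in USD; the prefix is billed as a cache hit or a cache write."""
    if model not in MODEL_PRICING:
        raise ValueError(f"unknown model: {model}")
    if min(prefix_tokens, fresh_tokens, output_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    rates = MODEL_PRICING[model]
    prefix_rate = rates["cached_input"] if cache_hit else rates["cache_write"]
    return (
        prefix_tokens * prefix_rate
        + fresh_tokens * rates["input"]
        + output_tokens * rates["output"]
    ) / 1_000_000


# Invariant from the changelog: a cache hit is always cheaper than fresh input.
for name, rates in MODEL_PRICING.items():
    assert rates["cached_input"] < rates["input"], name

write = request_cost("claude-sonnet-4-6", 3_944, 1_274, 1_000, cache_hit=False)
hit = request_cost("claude-sonnet-4-6", 3_944, 1_274, 1_000, cache_hit=True)
print(f"write: ${write:.4f}  hit: ${hit:.4f}")
```

Rejecting unknown models and negative counts up front mirrors the behaviour the test list in the changelog describes.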

## What's still to do (v1.2)

- The actual Anthropic backend that wires `CachedPrompt` into
  `client.messages.create(...)`.
- `llmwiki sync --batch` that submits through `message_batches` and
  polls for completion.
- Write-through updating of `MODEL_PRICING` from Anthropic's pricing
  JSON.
- Gemini / OpenAI cache header mapping (separate PR; different
  semantics).

See #50 for the tracking issue.
