Pratiyush
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 2 additions & 0 deletions
diff --git a/docs/reference/prompt-caching.md b/docs/reference/prompt-caching.md
Lines changed: 137 additions & 0 deletions
@@ -68,6 +68,7 @@ dist-site/
 .llmwiki-synth-state.json
 .llmwiki-queue.json
 .llmwiki-dream-state.json
+.llmwiki-batch-state.json
 
 # v1.0 (#160): multi-agent skill mirrors are derived from .claude/skills/
 # by `llmwiki install-skills`. The canonical source ships with the repo;
 
@@ -15,6 +15,8 @@ Versions below 1.0 are pre-production — API and file formats may change.
 
 ### Added
 
+- **Prompt caching + batch API scaffold** (#50) — new `llmwiki/cache.py` module lands the plumbing for Anthropic `cache_control: {type: "ephemeral"}` usage on the stable ingest prefix (CLAUDE.md schema + `wiki/index.md` + `wiki/overview.md`). Public surface: `make_cached_block()`, `make_plain_block()`, `CachedPrompt` (frozen dataclass with `stable_prefix` / `dynamic_suffix`), `build_messages()` that emits the Anthropic-shaped message array with the header on the prefix block only. Cost preview: `estimate_tokens()` (char/4 heuristic, stdlib-only — no tokenizer dep), `estimate_cost()` returning a `CostEstimate` with per-bucket (prefix / fresh / output) breakdown, `format_estimate()` for the `--estimate` CLI output, `warn_prefix_too_small()` that flags prefixes below the 1024-token cache floor, `MODEL_PRICING` rate card for Sonnet 4.6 / Haiku 4 / Opus 4 (input, cached_input, cache_write, output USD/MTok). Batch state persistence: `BatchJob`, `BatchState`, `load_batch_state()`, `save_batch_state()`, `add_pending()` (dedup by batch_id), `mark_completed()` — all round-tripped through `.llmwiki-batch-state.json` (gitignored). New `llmwiki synthesize --estimate` CLI flag walks the discovered raw sessions, prices the batch assuming the first call is a cache write and the rest are hits, prints a line-item breakdown plus total. Docs: `docs/reference/prompt-caching.md`. 49 tests cover: cache-block shape, CachedPrompt empty-edge cases, build_messages structure, token/cost math (invariant: cached_input < input for every model, breakdown sums to total, rejects unknown models + negative tokens), batch-state round-trip, `add_pending` dedup, CLI wiring.
+
 - **Ollama backend scaffold for local LLM synthesis** (#35) — new `llmwiki/synth/ollama.py` delivers the `OllamaSynthesizer` backend against the existing `BaseSynthesizer` contract. Stdlib-only HTTP via `urllib` (no new dependency). Configurable through `sessions_config.json` → `synthesis.backend = "ollama"` with `model` / `base_url` / `timeout` / `max_retries` fields (defaults: `llama3.1:8b` at `http://127.0.0.1:11434`, 60s timeout, 3 retries with exponential backoff). Privacy-by-default: loopback host only; a warning logs once if the user points the backend at a non-local host. `is_available()` probes `/api/tags` so callers can branch before long synthesis runs. Graceful error handling: `OllamaUnavailableError` (connection refused / DNS failure — no retries, caller skips), `OllamaHTTPError` (non-2xx after retries), `OllamaError` (non-JSON body, non-string response field). New `resolve_backend()` in `pipeline.py` selects backend from config (`dummy` | `ollama`); unknown names fall back to dummy with a warning. New `llmwiki synthesize [--check | --dry-run | --force]` CLI subcommand surfaces backend status without running synthesis. 43 tests (mocked HTTP — no network in CI): config parsing, URL construction, availability probing, retry + backoff on 5xx and socket timeout, no-retry on 4xx / connection refused, non-JSON response handling, unicode round-trip, curly-brace-safe prompt rendering, CLI registration, resolver fallback.
 
 - **`wiki/candidates/` approval workflow** (#51) — new `llmwiki/candidates.py` module with `list`, `promote`, `merge`, `discard`, and `stale_candidates` primitives. New pages from `/wiki-ingest` that represent brand-new entities/concepts can now land in `wiki/candidates/<kind>/<slug>.md` with `status: candidate` instead of going straight into the trusted wiki. `/wiki-review` slash command (`.claude/commands/wiki-review.md`) + `llmwiki candidates <action>` CLI walk through the queue. Merge folds the candidate's body under a `## Candidate merge — <date>` heading in the target and archives the source. Discard moves to `wiki/archive/candidates/<timestamp>/` with a timestamped `.reason.txt` audit file. New `stale_candidates` lint rule (12th overall) flags candidates sitting idle > 30 days. 34 tests cover: all 4 action paths, frontmatter status rewrite, staleness computation, kind inference, error handling.
 
@@ -0,0 +1,137 @@
+# Prompt caching + batch API
+
+> Status: scaffold (v1.1.0 · #50). The plumbing — cache-block
+> construction, token estimator, batch-state store — lives in
+> `llmwiki/cache.py`. The actual Anthropic backend that consumes it
+> lands in v1.2 behind a separate PR.
+
+## Why cache prompts?
+
+Every `/wiki-sync` and `/wiki-ingest` run bundles the same stable prefix
+with each source file it asks the model to summarize:
+
+- The `CLAUDE.md` schema (~3 k tokens)
+- The current `wiki/index.md` (grows with the wiki)
+- The current `wiki/overview.md`
+
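+On a 500-page wiki that prefix is ≈ 30 k tokens **per request**. Marking
+the prefix with `cache_control: { type: "ephemeral" }` tells Anthropic
+to cache it server-side. The first call pays the `cache_write` rate
+(125 % of the fresh `input` rate); later calls that hit the cache pay
+the `cached_input` rate (10 % of `input`) instead. Ephemeral entries
+expire after about five minutes without a hit by default, so the saving
+only applies to calls made in quick succession.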
+
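To make the saving concrete, the sketch below prices only the prefix for a bulk ingest, using the Sonnet rates from the rate card in this document. The request count (500), the helper name `prefix_cost_usd`, and the assumption that every call after the first is a cache hit are illustrative, not measurements. Fresh and output tokens are billed the same either way, so the overall saving on a real run is smaller than the prefix-only figure.

```python
# Illustrative arithmetic only. Rates are the claude-sonnet-4-6 row of the
# rate card in this document (USD per 1M tokens).
INPUT = 3.00
CACHED_INPUT = 0.30
CACHE_WRITE = 3.75


def prefix_cost_usd(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Cost of sending the same prefix `requests` times (prefix tokens only)."""
    if requests <= 0:
        return 0.0
    if not cached:
        return requests * prefix_tokens * INPUT / 1_000_000
    # Assumption: one cache write, then every later call is a cache hit.
    first = prefix_tokens * CACHE_WRITE / 1_000_000
    rest = (requests - 1) * prefix_tokens * CACHED_INPUT / 1_000_000
    return first + rest


uncached = prefix_cost_usd(30_000, 500, cached=False)
cached = prefix_cost_usd(30_000, 500, cached=True)
saving = 1 - cached / uncached
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, saving {saving:.1%}")
```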
+sage-wiki reports **50–90 % savings** on bulk ingest with this pattern.
+
+## Build a cached prompt
+
+```python
+from llmwiki.cache import CachedPrompt, build_messages
+
+prompt = CachedPrompt(
+    stable_prefix=claude_md_schema + current_index + current_overview,
+    dynamic_suffix=session_body,
+)
+
+messages = build_messages(prompt)
+# [
+#   {
+#     "role": "user",
+#     "content": [
+#       {"type": "text", "text": "...schema + index + overview...",
+#        "cache_control": {"type": "ephemeral"}},
+#       {"type": "text", "text": "...session body..."},
+#     ],
+#   },
+# ]
+```
+
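+The `cache_control` marker goes on the *last* block you want cached, and
+everything up to and including that block is cached. That is why
+`build_messages()` always emits the prefix block before the dynamic
+suffix and marks only the prefix.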
+
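For readers who want to see how little is involved, here is a minimal stand-alone sketch that produces the message shape shown above. It is not the code in `llmwiki/cache.py`: the names `PromptParts`, `cached_block`, `plain_block`, and `messages_for` are hypothetical stand-ins, and dropping empty parts is an assumption about edge-case behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptParts:
    """Hypothetical stand-in for CachedPrompt."""
    stable_prefix: str
    dynamic_suffix: str


def cached_block(text: str) -> dict:
    """Text block that carries the cache marker."""
    return {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}


def plain_block(text: str) -> dict:
    """Text block without a cache marker."""
    return {"type": "text", "text": text}


def messages_for(parts: PromptParts) -> list[dict]:
    content = []
    # Assumption: empty parts are dropped rather than sent as empty blocks.
    if parts.stable_prefix:
        content.append(cached_block(parts.stable_prefix))
    if parts.dynamic_suffix:
        content.append(plain_block(parts.dynamic_suffix))
    # Prefix first, suffix second, so only the prefix falls inside the cache.
    return [{"role": "user", "content": content}]


msgs = messages_for(PromptParts("schema + index + overview", "session body"))
print(msgs[0]["content"][0]["cache_control"])
```

Keeping the marker on the prefix block only means a change to the session body never invalidates the cached portion.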
+## Estimate cost before you spend
+
+```
+$ llmwiki synthesize --estimate
+627 new sessions, prefix 3,944 tok
+Model: claude-sonnet-4-6 (first write)
+  Prefix:   3,944 tok  $0.0148
+  Fresh:    1,274 tok  $0.0038
+  Output:   1,000 tok  $0.0150
+  Total:                $0.0336
+  + 626 subsequent sessions (cache hit):  $17.9484
+
+Batch total: $17.9820 (model claude-sonnet-4-6)
+```
+
+`--estimate` never calls the API — it uses the `char / 4` heuristic
+from `estimate_tokens()` and the rate table in `MODEL_PRICING`. Treat
+it as ± 20 %; the real numbers come back in `usage` on each response.
+
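The first-write lines of the sample output can be re-derived by hand from the rate card. The sketch below does that with hypothetical helpers (`rough_tokens`, `call_cost`), not the real `estimate_tokens()` / `estimate_cost()` signatures; rounding the heuristic up is also an assumption. It does not attempt the 626-session line, which depends on per-session sizes that the sample output does not show.

```python
import math

# claude-sonnet-4-6 row of the rate card in this document (USD per 1M tokens).
RATES = {"input": 3.00, "cached_input": 0.30, "cache_write": 3.75, "output": 15.00}


def rough_tokens(text: str) -> int:
    """char / 4 heuristic; rounding up is an assumption of this sketch."""
    return math.ceil(len(text) / 4)


def call_cost(prefix_tok: int, fresh_tok: int, output_tok: int, cache_hit: bool) -> dict:
    """Per-bucket cost of one call; the prefix rate depends on hit vs. write."""
    prefix_rate = RATES["cached_input"] if cache_hit else RATES["cache_write"]
    parts = {
        "prefix": prefix_tok * prefix_rate / 1_000_000,
        "fresh": fresh_tok * RATES["input"] / 1_000_000,
        "output": output_tok * RATES["output"] / 1_000_000,
    }
    parts["total"] = parts["prefix"] + parts["fresh"] + parts["output"]
    return parts


first = call_cost(3_944, 1_274, 1_000, cache_hit=False)
for name, usd in first.items():
    print(f"{name:<7} ${usd:.4f}")
```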
+If the prefix is below the minimum cacheable size (1 024 tokens in
+`llmwiki/cache.py`; Anthropic's actual floor is model-dependent and
+higher for some models), `--estimate` prints a warning:
+
+```
+warning: prefix is 400 tok (< 1024 min) — Anthropic will not cache it;
+savings estimate is best-case only.
+```
+
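The check behind that warning is a single comparison. A minimal sketch, assuming the 1 024-token floor quoted above; `prefix_warning` is a hypothetical name and its message wording is illustrative, not the output of `warn_prefix_too_small()`:

```python
from typing import Optional

MIN_CACHEABLE_TOKENS = 1024  # floor quoted in this document


def prefix_warning(prefix_tokens: int, floor: int = MIN_CACHEABLE_TOKENS) -> Optional[str]:
    """Return a warning string when the prefix is too small to cache, else None."""
    if prefix_tokens >= floor:
        return None
    return (
        f"warning: prefix is {prefix_tokens:,} tok (< {floor} min); "
        "it will not be cached, so the savings estimate is best-case only."
    )


print(prefix_warning(400))
```

Taking the floor as a parameter leaves room for a per-model minimum without changing callers.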
+## Batch submission
+
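+Large backfills can go through Anthropic's Message Batches API
+(`message_batches`), which bills requests at a 50 % discount and
+processes them asynchronously under its own rate limits rather than the
+per-request limits of the standard endpoint. The scaffolding
+tracks in-flight batches on disk: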
+
+```python
+from pathlib import Path
+from llmwiki.cache import (
+    BatchJob,
+    BatchState,
+    add_pending,
+    load_batch_state,
+    mark_completed,
+    save_batch_state,
+)
+
+repo = Path("/path/to/llm-wiki")
+state = load_batch_state(repo)
+
+add_pending(state, BatchJob(
+    batch_id="batch_abc",
+    source_slugs=["sess-1", "sess-2"],
+    submitted_at="2026-04-17T10:00:00Z",
+))
+save_batch_state(repo, state)
+
+# ... later, when you poll and find it done:
+mark_completed(state, "batch_abc")
+save_batch_state(repo, state)
+```
+
+The state file (`.llmwiki-batch-state.json`) is small JSON, so it is
+safe to grep and diff. It is gitignored by default; remove the
+`.gitignore` entry if you want to commit it as an audit trail of what
+has been submitted.
+
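The state store above records batches but does not build them. As a rough sketch of what a submission payload could look like, the function below emits one entry per session, each a `custom_id` plus a `params` object holding an ordinary Messages request, which is the entry shape the Message Batches API documents. The helper name `batch_requests`, the `max_tokens` default, and the use of the source slug as `custom_id` are assumptions for illustration; check them against the current API reference.

```python
def batch_requests(prefix: str, sessions: dict[str, str],
                   model: str = "claude-sonnet-4-6",
                   max_tokens: int = 1000) -> list[dict]:
    """Build one batch entry per session; every entry shares the marked prefix."""
    requests = []
    for slug, body in sessions.items():
        content = [
            {"type": "text", "text": prefix, "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": body},
        ]
        requests.append({
            "custom_id": slug,  # assumption: slugs are unique and ID-safe
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": content}],
            },
        })
    return requests


payload = batch_requests(
    "schema + index + overview",
    {"sess-1": "body one", "sess-2": "body two"},
)
print([entry["custom_id"] for entry in payload])
```

Using the slug as `custom_id` would let results be matched back to the `source_slugs` recorded in `BatchJob` once the batch completes.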
+## Rate card
+
+From `llmwiki/cache.py :: MODEL_PRICING` (USD per 1 M tokens, as of
+v1.1.0):
+
+| Model             | input | cached_input | cache_write | output |
+|-------------------|------:|-------------:|------------:|-------:|
+| claude-sonnet-4-6 |  3.00 |         0.30 |        3.75 |  15.00 |
+| claude-haiku-4    |  0.80 |         0.08 |        1.00 |   4.00 |
+| claude-opus-4     | 15.00 |         1.50 |       18.75 |  75.00 |
+
+These are the rates `estimate_cost()` uses. Update them in one place
+(`MODEL_PRICING`) when Anthropic publishes new ones.
+
+## What's still to do (v1.2)
+
+- The actual Anthropic backend that wires `CachedPrompt` into
+  `client.messages.create(...)`.
+- `llmwiki sync --batch` that submits through `message_batches` and
+  polls for completion.
+- Write-through updating of `MODEL_PRICING` from Anthropic's pricing
+  JSON.
+- Gemini / OpenAI cache header mapping (separate PR — different
+  semantics).
+
+See #50 for the tracking issue.