
feat: enable Anthropic prompt caching via automatic cache_control injection#660

Merged: ilblackdragon merged 16 commits into main from takeover/291-anthropic-prompt-caching on Mar 7, 2026

Conversation

@ilblackdragon (Member)

Summary

Continuation of #291 by @Canvinus.

Enable Anthropic prompt caching for the direct Anthropic backend, with configurable cache retention and accurate write-surcharge tracking. Uses Anthropic's automatic caching: a top-level cache_control field that tells the API to auto-place cache breakpoints at the last cacheable block.
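As a rough sketch of what that injection produces (the helper and enum here are illustrative, not the PR's actual code; the JSON shapes match the test plan below):

```rust
use serde_json::{json, Value};

/// Retention options mirroring the PR's none/short/long settings.
#[derive(Clone, Copy)]
enum CacheRetention {
    None,  // caching disabled
    Short, // 5-minute TTL
    Long,  // 1-hour TTL
}

/// Hypothetical helper: build the top-level `cache_control` object that
/// gets merged into the request via rig-core's `additional_params`.
fn cache_control_param(retention: CacheRetention) -> Option<Value> {
    match retention {
        CacheRetention::None => None,
        CacheRetention::Short => Some(json!({"cache_control": {"type": "ephemeral"}})),
        CacheRetention::Long => Some(json!({"cache_control": {"type": "ephemeral", "ttl": "1h"}})),
    }
}

fn main() {
    let params = cache_control_param(CacheRetention::Long);
    println!("{}", serde_json::to_string_pretty(&params).unwrap());
}
```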

Configuration

```
# .env
ANTHROPIC_CACHE_RETENTION=short   # default
```

| Value | TTL | Write cost | Read discount |
|-------|-----|------------|---------------|
| none | caching disabled | n/a | n/a |
| short | 5 min | 1.25× (125%) | 0.1× (90% off) |
| long | 1 hour | 2.0× (200%) | 0.1× (90% off) |

Only the direct Anthropic backend (LLM_BACKEND=anthropic) benefits. Other backends pass through zeroed cache fields.
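A minimal sketch of the env lookup, assuming the documented default of short when the variable is unset (the real parser, with aliases such as 5m and 1h, appears verbatim in the review thread below):

```rust
/// Hypothetical sketch: resolve the retention setting from the environment,
/// falling back to "short" when ANTHROPIC_CACHE_RETENTION is unset.
fn cache_retention_setting() -> String {
    std::env::var("ANTHROPIC_CACHE_RETENTION")
        .unwrap_or_else(|_| "short".to_string())
        .to_lowercase()
}
```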

Changes from original

  • Merged with latest main (resolved 3 conflicts from the declarative provider registry refactor in #618)
  • Adapted CacheRetention config and cache injection to work with RegistryProviderConfig (replaces removed AnthropicDirectConfig, LlmBackend enum, and per-backend config types)
  • ANTHROPIC_CACHE_RETENTION env var parsed in create_anthropic_from_registry() instead of the removed LlmConfig::resolve() Anthropic branch
  • Added missing cache_read_input_tokens / cache_creation_input_tokens fields to mock providers added on main after PR #291 branched (response_cache.rs, dispatcher.rs, provider_chaos.rs, trace_llm.rs)
  • Suppressed clippy::too_many_arguments on record_llm_call and build_rig_request

Original PR

#291 — feat: enable Anthropic prompt caching via cache_control injection

Review comments addressed

All 11 review comments from Copilot and Gemini were already resolved by @Canvinus in the original PR's follow-up commits (model validation, cost tracking, proxy passthrough, overflow protection, etc.).

Test plan

  • All 2,101 lib tests pass (2 new cache injection tests + 1 cache_write_multiplier test)
  • cargo clippy --all --all-features zero warnings
  • cargo fmt clean
  • Short TTL injects cache_control: {"type": "ephemeral"} via additional_params
  • Long TTL injects cache_control: {"type": "ephemeral", "ttl": "1h"}
  • None retention skips cache_control entirely
  • Write surcharge 5m: 25% increase verified
  • Write surcharge 1h: 100% increase verified
  • Cache read discount: 90% savings verified
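For concreteness, a sketch of the arithmetic those surcharge and discount checks verify, using rust_decimal as the PR does. The function shape is illustrative, not the CostGuard::record_llm_call signature, and the hardcoded divide-by-10 reflects the state before the later review change that made the read discount provider-configurable:

```rust
use rust_decimal::Decimal;
use rust_decimal_macros::dec;

/// Illustrative cost formula: uncached input at the base rate, cache reads
/// at 10% of it, cache writes at the TTL-dependent multiplier.
fn input_cost(
    input_tokens: u32,
    cache_read_tokens: u32,
    cache_write_tokens: u32,
    input_rate: Decimal,       // cost per input token
    write_multiplier: Decimal, // 1.25 for 5m, 2.0 for 1h
) -> Decimal {
    let cached = cache_read_tokens.saturating_add(cache_write_tokens);
    let uncached = input_tokens.saturating_sub(cached);
    input_rate * Decimal::from(uncached)
        + input_rate * Decimal::from(cache_read_tokens) / dec!(10)
        + input_rate * Decimal::from(cache_write_tokens) * write_multiplier
}

fn main() {
    let rate = dec!(0.000003);
    // A fully cached read costs 10% of billing the same 1,000 tokens uncached.
    assert_eq!(
        input_cost(1_000, 1_000, 0, rate, dec!(1.25)),
        input_cost(1_000, 0, 0, rate, dec!(1.25)) / dec!(10)
    );
}
```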

Co-Authored-By: Canvinus <44225021+Canvinus@users.noreply.github.com>

Generated with Claude Code

Canvinus and others added 13 commits February 21, 2026 15:45

feat(llm): add Anthropic prompt caching and cache token tracking
- Inject cache_control via additional_params for Claude models in rig_adapter
- Add cache_read_input_tokens and cache_creation_input_tokens to
  CompletionResponse and ToolCompletionResponse
- Extract cached_input_tokens from rig-core unified Usage
- Add is_anthropic_model() detection helper with provider prefix support
- Log prompt cache hits at debug level (consistent with response_cache)
- Add 7 unit tests for cache injection and model detection
- Update all mock providers and test fixtures with new fields
feat(cost): apply 90% cache discount to prompt-cached tokens in CostGuard

- Add cache_read_input_tokens to TokenUsage so cache counts flow from
  CompletionResponse through the reasoning layer to the dispatcher
- Update CostGuard::record_llm_call() to accept cache_read_input_tokens:
  cached tokens are billed at 10% of the normal input rate
- Thread cache_read_input_tokens from dispatcher into CostGuard
- Add test_cache_discount_reduces_cost verifying exact savings match
  90% of input cost for fully-cached requests
- Update all existing test callers with zero-cache parameter
refactor(cache): scope cache_control to Anthropic backend and validate model support

- Replace model-name-based is_anthropic_model() with explicit
  enable_prompt_cache flag on RigAdapter, set only for the direct
  Anthropic backend via with_prompt_cache(true)
- Add supports_prompt_cache() to validate model names per Anthropic
  docs: only Claude 3+ models support caching; claude-2 and
  claude-instant are excluded to prevent 400 errors
- Warn when caching is enabled but model does not support it
- Replace is_anthropic_model tests with flag-based and model
  validation tests
fix(cache): validate model at construction and propagate cache metrics through proxy

- Move supports_prompt_cache() check into with_prompt_cache() so
  unsupported models are detected once at construction, not per request
- Add cache_read_input_tokens and cache_creation_input_tokens to
  ProxyCompletionResponse and ProxyToolCompletionResponse with
  serde(default) for backward compatibility
- Pass cache metrics through orchestrator proxy instead of zeroing
- Use claude-opus-4-6 in cache discount test to match Anthropic
  semantics
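The serde(default) pattern from the commit above can be sketched as follows; the struct is a hypothetical reduction of ProxyCompletionResponse, showing why older peers that omit the new fields still deserialize cleanly:

```rust
use serde::{Deserialize, Serialize};

/// Illustrative subset of a proxy response: #[serde(default)] zero-fills
/// the cache counters when an older sender omits them, instead of failing.
#[derive(Serialize, Deserialize)]
struct ProxyCompletionResponse {
    content: String,
    input_tokens: u32,
    output_tokens: u32,
    #[serde(default)]
    cache_read_input_tokens: u32,
    #[serde(default)]
    cache_creation_input_tokens: u32,
}

fn main() {
    // A payload from before this PR, without the cache fields, still parses.
    let old = r#"{"content":"ok","input_tokens":10,"output_tokens":5}"#;
    let resp: ProxyCompletionResponse = serde_json::from_str(old).unwrap();
    assert_eq!(resp.cache_read_input_tokens, 0);
    assert_eq!(resp.cache_creation_input_tokens, 0);
}
```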
feat(llm): add configurable cache retention with write surcharge

- Add CacheRetention enum (none/short/long) to AnthropicDirectConfig
- Parse ANTHROPIC_CACHE_RETENTION env var (default: short)
- Inject TTL-aware cache_control (short=5m ephemeral, long=1h)
- Extract cache_creation_input_tokens from raw Anthropic response
- Add cache_write_multiplier() to LlmProvider trait (1.25x short, 2.0x long)
- Pipe dynamic write multiplier through dispatcher to CostGuard
- Add TokenUsage.cache_creation_input_tokens field
- Add tests for Long TTL injection, 5m and 1h write surcharges
- Document ANTHROPIC_CACHE_RETENTION in .env.example
fix: resolve CI failures after upstream merge

- Add missing cost_per_token arg to cache test callsites
- Apply cargo fmt to long lines in tests and tracing macros
fix: address Copilot review feedback

- Use saturating_add for cache token sum to prevent u32 overflow
- Tighten supports_prompt_cache to explicitly match claude-3+/claude-4+
  and named families (claude-sonnet/claude-opus/claude-haiku)
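The model gate those two validation commits describe can be approximated like this (a sketch, not the rig_adapter.rs implementation; the prefix list comes from the commit messages above):

```rust
/// Illustrative supports_prompt_cache: accept claude-3+/claude-4+ and the
/// named families, reject claude-2 and claude-instant to avoid 400 errors.
fn supports_prompt_cache(model: &str) -> bool {
    // Strip an optional provider prefix such as "anthropic/".
    let m = model.rsplit('/').next().unwrap_or(model);
    if m.starts_with("claude-2") || m.starts_with("claude-instant") {
        return false;
    }
    m.starts_with("claude-3")
        || m.starts_with("claude-4")
        || m.starts_with("claude-sonnet")
        || m.starts_with("claude-opus")
        || m.starts_with("claude-haiku")
}

fn main() {
    assert!(supports_prompt_cache("anthropic/claude-opus-4-6"));
    assert!(supports_prompt_cache("claude-sonnet-4-5"));
    assert!(!supports_prompt_cache("claude-instant-1.2"));
    assert!(!supports_prompt_cache("claude-2.1"));
}
```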
Merge the original prompt caching work from PR #291 by @Canvinus,
resolving 3 conflicts (config/llm.rs, config/mod.rs, llm/mod.rs)
caused by the declarative provider registry refactor on main.

Co-Authored-By: Canvinus <44225021+Canvinus@users.noreply.github.com>

[skip-regression-check]
fix: adapt prompt caching to registry architecture and add missing cache fields

- Resolve merge conflicts: adapt CacheRetention and cache injection to
  the declarative provider registry (RegistryProviderConfig replaces
  AnthropicDirectConfig)
- Parse ANTHROPIC_CACHE_RETENTION env var in create_anthropic_from_registry()
- Use Anthropic automatic caching via top-level cache_control in
  additional_params (rig-core #[serde(flatten)] places it at request root)
- Add cache_read/creation_input_tokens fields to all mock LlmProviders
  added on main after PR #291 branched (response_cache, dispatcher,
  provider_chaos, trace_llm)
- Suppress clippy::too_many_arguments on record_llm_call and
  build_rig_request
- Add regression tests for cache injection (short/long/none) and
  cache_write_multiplier values

Co-Authored-By: Canvinus <44225021+Canvinus@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 7, 2026 07:55
@gemini-code-assist (Contributor)

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

github-actions bot added labels on Mar 7, 2026: scope: agent (Agent core: agent loop, router, scheduler), scope: llm (LLM integration), scope: orchestrator (Container orchestrator), scope: worker (Container worker), size: XL (500+ changed lines), risk: medium (Business logic, config, or moderate-risk modules), contributor: core (20+ merged PRs)
Copilot AI left a comment (Contributor)

Pull request overview

This PR enables Anthropic prompt caching for the direct Anthropic backend by injecting cache_control into requests via rig-core's additional_params, with configurable cache retention (none/short/long) and accurate cost tracking for cache write surcharges and read discounts. It continues work from PR #291 and adapts it to the new declarative provider registry from PR #618.

Changes:

  • Adds CacheRetention enum with FromStr/Display and a cache_write_multiplier method on LlmProvider trait, configurable via ANTHROPIC_CACHE_RETENTION env var
  • Extends CompletionResponse/ToolCompletionResponse/TokenUsage with cache_read_input_tokens and cache_creation_input_tokens fields, with proper cost accounting in CostGuard
  • Updates all mock/test providers and proxy response types to include the new cache token fields

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.

Summary per file:

| File | Description |
|------|-------------|
| src/config/llm.rs | New CacheRetention enum with None/Short/Long variants, FromStr, Display |
| src/config/mod.rs | Re-exports CacheRetention |
| src/llm/provider.rs | Adds cache_read_input_tokens/cache_creation_input_tokens to response types and cache_write_multiplier() to the LlmProvider trait |
| src/llm/rig_adapter.rs | Core implementation: cache injection via additional_params, extract_cache_creation, supports_prompt_cache, cache debug logging, new tests |
| src/llm/mod.rs | Anthropic provider factory reads ANTHROPIC_CACHE_RETENTION and calls with_cache_retention() |
| src/llm/reasoning.rs | Propagates cache fields through TokenUsage |
| src/agent/dispatcher.rs | Passes cache fields and write multiplier to CostGuard::record_llm_call |
| src/agent/cost_guard.rs | Updated cost formula with cache read discount (10%) and write surcharge, new tests |
| src/llm/nearai_chat.rs | Zero-fills cache fields for non-Anthropic provider |
| src/worker/api.rs | Adds cache fields to proxy response types with #[serde(default)] |
| src/orchestrator/api.rs | Passes cache fields through proxy responses |
| .env.example | Documents ANTHROPIC_CACHE_RETENTION configuration |
| src/llm/failover.rs, src/llm/smart_routing.rs, src/llm/response_cache.rs | Zero-fills cache fields in test mock providers |
| tests/support/trace_llm.rs, tests/provider_chaos.rs, tests/openai_compat_integration.rs, src/testing.rs | Zero-fills cache fields in test providers |


Comment thread src/llm/provider.rs
Comment on lines +344 to 351
```rust
    /// Returns `1.0` by default (no surcharge). Anthropic providers return
    /// `1.25` for 5-minute TTL or `2.0` for 1-hour TTL.
    fn cache_write_multiplier(&self) -> Decimal {
        Decimal::ONE
    }
}

/// Sanitize a message list to ensure tool_use / tool_result integrity.
```
Copilot AI (Mar 7, 2026):

The cache_write_multiplier() method has a default implementation returning Decimal::ONE, but none of the wrapper providers (FailoverProvider, SmartRoutingProvider, CachedProvider, CircuitBreakerProvider, RetryProvider, RecordingLlm) delegate it to their inner provider. Since build_provider_chain() wraps the base Anthropic RigAdapter in up to 6 decorator layers, calling self.llm().cache_write_multiplier() in the dispatcher (line 279) will always return Decimal::ONE instead of the actual 1.25× or 2.0× multiplier from the RigAdapter.

Each wrapper needs to delegate cache_write_multiplier like they already delegate cost_per_token. For example, in the RetryProvider impl: fn cache_write_multiplier(&self) -> Decimal { self.inner.cache_write_multiplier() }. This applies to all 6 wrapper types: RetryProvider, SmartRoutingProvider, FailoverProvider, CircuitBreakerProvider, CachedProvider, and RecordingLlm.

Suggested change (remove the default body so every wrapper must implement the method):

```rust
    /// Implementors should return `1.0` when there is no surcharge.
    /// Anthropic providers return `1.25` for 5-minute TTL or `2.0` for 1-hour TTL.
    fn cache_write_multiplier(&self) -> Decimal;
}

/// Sanitize a message list to ensure tool_use / tool_result integrity.
```
ilblackdragon (Member, Author) replied:

Fixed in 5797660. All 6 wrapper providers (RetryProvider, CircuitBreakerProvider, FailoverProvider, SmartRoutingProvider, CachedProvider, RecordingLlm) now delegate both cache_write_multiplier() and the new cache_read_discount() to their inner provider, matching how they already delegate cost_per_token().
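The delegation pattern from that fix, reduced to one wrapper (trait and type names mirror the PR; bodies are illustrative, and the real wrappers likely hold their inner provider differently):

```rust
use rust_decimal::Decimal;

trait LlmProvider {
    /// Default: no write surcharge.
    fn cache_write_multiplier(&self) -> Decimal {
        Decimal::ONE
    }
    /// Default: no read discount.
    fn cache_read_discount(&self) -> Decimal {
        Decimal::ONE
    }
}

struct RetryProvider<P: LlmProvider> {
    inner: P,
}

impl<P: LlmProvider> LlmProvider for RetryProvider<P> {
    // Without these overrides, each decorator layer would fall back to the
    // trait defaults and swallow the RigAdapter's 1.25x/2.0x multiplier.
    fn cache_write_multiplier(&self) -> Decimal {
        self.inner.cache_write_multiplier()
    }
    fn cache_read_discount(&self) -> Decimal {
        self.inner.cache_read_discount()
    }
}

struct Anthropic;
impl LlmProvider for Anthropic {
    fn cache_write_multiplier(&self) -> Decimal {
        Decimal::new(125, 2) // 1.25x for the 5-minute TTL
    }
}

fn main() {
    let wrapped = RetryProvider { inner: Anthropic };
    // The multiplier survives the decorator layer instead of reverting to 1.0.
    assert_eq!(wrapped.cache_write_multiplier(), Decimal::new(125, 2));
}
```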

Comment thread src/llm/rig_adapter.rs
Comment on lines +1060 to +1083
```rust
            cache_write_multiplier_for(CacheRetention::None),
            Decimal::ONE
        );
        // Short → 1.25× (25% surcharge)
        assert_eq!(
            cache_write_multiplier_for(CacheRetention::Short),
            Decimal::new(125, 2)
        );
        // Long → 2.0× (100% surcharge)
        assert_eq!(
            cache_write_multiplier_for(CacheRetention::Long),
            Decimal::TWO
        );
    }

    /// Helper to compute the multiplier without constructing a full RigAdapter.
    fn cache_write_multiplier_for(retention: CacheRetention) -> rust_decimal::Decimal {
        match retention {
            CacheRetention::None => rust_decimal::Decimal::ONE,
            CacheRetention::Short => rust_decimal::Decimal::new(125, 2),
            CacheRetention::Long => rust_decimal::Decimal::TWO,
        }
    }
```

Copilot AI (Mar 7, 2026):

The test_cache_write_multiplier_values test doesn't actually test the RigAdapter::cache_write_multiplier() method. It uses a standalone helper function cache_write_multiplier_for (lines 1076-1082) that duplicates the match logic. If the implementation in RigAdapter diverges from this helper, the test would still pass while the real code is wrong. Consider testing the actual trait method on a RigAdapter instance instead.

Suggested change (assert against the real method and drop the helper):

```rust
        RigAdapter::cache_write_multiplier(CacheRetention::None),
        Decimal::ONE
    );
    // Short → 1.25× (25% surcharge)
    assert_eq!(
        RigAdapter::cache_write_multiplier(CacheRetention::Short),
        Decimal::new(125, 2)
    );
    // Long → 2.0× (100% surcharge)
    assert_eq!(
        RigAdapter::cache_write_multiplier(CacheRetention::Long),
        Decimal::TWO
    );
}
```

ilblackdragon (Member, Author) replied:
Acknowledged in 5797660. Constructing a real RigAdapter requires a rig Model (which needs network/provider setup), so the test uses a standalone helper that mirrors the same match arms. Added a doc comment explaining this trade-off. The test_build_rig_request_* tests still exercise the full pipeline end-to-end as a safety net.

Comment thread src/agent/cost_guard.rs Outdated
```rust
        // Uncached tokens = total input - cache reads - cache writes.
        let cached_total = cache_read_input_tokens.saturating_add(cache_creation_input_tokens);
        let uncached_input = input_tokens.saturating_sub(cached_total);
        let cache_read_cost = input_rate * Decimal::from(cache_read_input_tokens) / dec!(10);
```
Copilot AI (Mar 7, 2026):

The cache read discount is hardcoded to 90% (dividing by 10) on line 179, which is Anthropic-specific. OpenAI also reports cached_input_tokens (via rig-core's Usage::cached_input_tokens field) but uses a 50% discount instead of 90%. Since the RigAdapter populates cache_read_input_tokens from response.usage.cached_input_tokens for all providers (line 539/619), this will miscalculate costs when an OpenAI-compatible provider reports cached tokens.

Consider making the cache read discount configurable per-provider (similar to cache_write_multiplier) rather than hardcoding Anthropic's 10% rate.

ilblackdragon (Member, Author) replied:

Fixed in 5797660. Added a cache_read_discount() method to the LlmProvider trait (default: Decimal::ONE = no discount). RigAdapter overrides it to 10 for Anthropic (90% off). OpenAI providers can override to 2 (50% off) when cache support is added. CostGuard now uses the provider-supplied discount instead of hardcoding /10.
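A quick sketch of the resulting arithmetic, with the discount expressed as a provider-supplied divisor (the function name is illustrative, not the CostGuard code):

```rust
use rust_decimal::Decimal;

/// Cached-read cost with a provider-supplied discount divisor:
/// 10 means 90% off (Anthropic), 2 means 50% off (OpenAI-style).
fn cache_read_cost(input_rate: Decimal, cache_read_tokens: u32, discount: Decimal) -> Decimal {
    input_rate * Decimal::from(cache_read_tokens) / discount
}

fn main() {
    let rate = Decimal::new(3, 6); // 0.000003 per input token
    let anthropic = cache_read_cost(rate, 1_000, Decimal::from(10));
    let openai = cache_read_cost(rate, 1_000, Decimal::from(2));
    // The 90%-off rate is one fifth of the 50%-off rate.
    assert_eq!(anthropic * Decimal::from(5), openai);
}
```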

ilblackdragon and others added 2 commits March 7, 2026 00:16
fix: delegate cache_write_multiplier through provider wrappers and make cache_read_discount configurable

The 6 decorator providers (Retry, CircuitBreaker, Failover, SmartRouting,
CachedProvider, RecordingLlm) did not delegate cache_write_multiplier()
to their inner provider, causing it to always return 1.0 instead of the
actual 1.25x/2.0x from RigAdapter. This fix adds delegation for both
cache_write_multiplier() and the new cache_read_discount() method.

Also makes the cache read discount per-provider instead of hardcoding
Anthropic's 90% discount (÷10). OpenAI uses 50% (÷2), so the discount
is now returned by each provider via the LlmProvider trait.

Addresses review feedback on PR #660.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
style: cargo fmt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 7, 2026 08:27
Copilot AI left a comment (Contributor)

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 1 comment.



Comment thread src/config/llm.rs
Comment on lines +30 to +43
```rust
impl std::str::FromStr for CacheRetention {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_lowercase().as_str() {
            "none" | "off" | "disabled" => Ok(Self::None),
            "short" | "5m" | "ephemeral" => Ok(Self::Short),
            "long" | "1h" => Ok(Self::Long),
            _ => Err(format!(
                "invalid cache retention '{}', expected one of: none, short, long",
                s
            )),
        }
    }
```
Copilot AI (Mar 7, 2026):

The CacheRetention enum implements FromStr and Display with several aliases (e.g., "off", "disabled", "5m", "ephemeral", "1h"), but there are no unit tests for the parsing logic. The analogous SslMode enum in src/config/database.rs:202-226 has tests for round-trip serialization, case-insensitivity, and invalid input. Consider adding similar tests for CacheRetention::from_str to verify all the accepted aliases and the error case.

ilblackdragon (Member, Author) replied:

Fixed in cddd796. Added 5 unit tests for CacheRetention::from_str: primary values, all aliases (off/disabled/5m/ephemeral/1h), case-insensitivity, invalid input error message, and Display round-trip.
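A sketch of what those tests could look like, assuming CacheRetention derives Debug, PartialEq, Clone, Copy and implements Display (test names here are illustrative, not the ones in cddd796):

```rust
#[cfg(test)]
mod tests {
    use super::CacheRetention;
    use std::str::FromStr;

    #[test]
    fn parses_primary_values_and_aliases() {
        for s in ["short", "5m", "ephemeral", "SHORT"] {
            assert_eq!(CacheRetention::from_str(s).unwrap(), CacheRetention::Short);
        }
        assert_eq!(CacheRetention::from_str("off").unwrap(), CacheRetention::None);
        assert_eq!(CacheRetention::from_str("disabled").unwrap(), CacheRetention::None);
        assert_eq!(CacheRetention::from_str("1h").unwrap(), CacheRetention::Long);
    }

    #[test]
    fn rejects_unknown_input_with_helpful_error() {
        let err = CacheRetention::from_str("medium").unwrap_err();
        assert!(err.contains("expected one of: none, short, long"));
    }

    #[test]
    fn display_round_trips_through_from_str() {
        for v in [CacheRetention::None, CacheRetention::Short, CacheRetention::Long] {
            assert_eq!(CacheRetention::from_str(&v.to_string()).unwrap(), v);
        }
    }
}
```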

test: add CacheRetention FromStr/Display unit tests

Tests cover primary values, aliases (off/disabled/5m/ephemeral/1h),
case-insensitivity, invalid input error, and Display round-trip.

Addresses Copilot review feedback on PR #660.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI left a comment (Contributor)

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated no new comments.



ilblackdragon merged commit 424a036 into main on Mar 7, 2026
26 checks passed
ilblackdragon deleted the takeover/291-anthropic-prompt-caching branch on March 7, 2026 at 09:10
bkutasi pushed a commit to bkutasi/ironclaw that referenced this pull request Mar 28, 2026
feat: enable Anthropic prompt caching via automatic cache_control injection (nearai#660)

drchirag1991 pushed a commit to drchirag1991/ironclaw that referenced this pull request Apr 8, 2026
feat: enable Anthropic prompt caching via automatic cache_control injection (nearai#660)

Labels

contributor: core (20+ merged PRs), risk: medium (Business logic, config, or moderate-risk modules), scope: agent (Agent core: agent loop, router, scheduler), scope: llm (LLM integration), scope: orchestrator (Container orchestrator), scope: worker (Container worker), size: XL (500+ changed lines)


3 participants