HNT-1890: shared curated recommendation cache#1306
Conversation
5894b87 to
8e632b5
Compare
|
Returning a 201 code in the scenario where we do not have any (stale) data would improve reliability, by avoiding connections piling up. TODO: Check how Firefox Desktop & Mobile would handle 201 status code? Does it retry? |
Q: Are we consistent, or does weather and other providers hit Redis on every request? Are the use-cases different? Q: What's the existing pattern for testing with Redis? Q: What are the existing Redis patterns? |
| Key settings: | ||
| - `cache` — `"redis"` to enable, `"none"` to disable (default: disabled) | ||
| - `soft_ttl_sec` — when a cached entry is considered stale and triggers revalidation | ||
| - `hard_ttl_sec` — when Redis evicts the key entirely (safety net) |
There was a problem hiding this comment.
We probably want this to be high: more like 1-24 hours? Days? To match our response time.
| await self.primary.delete(key) | ||
| except RedisError as exc: | ||
| raise CacheAdapterError(f"Failed to DELETE `{repr(key)}` with error: `{exc}`") from exc | ||
|
|
There was a problem hiding this comment.
Again, check if this is a new pattern, and whether it should be one.
There was a problem hiding this comment.
Redis locks are a new pattern because the curated recommendations use-case is unique, in that a lot of the same requests arrive at the same time.
|
|
||
| # MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__HARD_TTL_SEC | ||
| # Hard TTL in seconds. Redis evicts the key after this. Should be much longer than soft TTL. | ||
| hard_ttl_sec = 600 |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
|
|
||
| # MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__LOCK_TTL_SEC | ||
| # Distributed lock TTL in seconds. Auto-releases if the lock holder crashes. | ||
| lock_ttl_sec = 30 |
There was a problem hiding this comment.
TODO: Check if there's a client network timeout of 30s?
There was a problem hiding this comment.
Checked: the HTTP client for Corpus API requests uses httpx.AsyncClient with a 5-second request timeout (default from merino/utils/http_client.py:12). The lock_ttl_sec = 30 here is for the Redis distributed lock auto-expiry, not for network timeouts.
So the worst case for a single get_or_fetch call with Redis timing out is bounded by the Redis socket_timeout_sec (configured in [default.redis]), not 30 seconds per API call.
| raise ValueError( | ||
| f"hard_ttl_sec ({self.hard_ttl_sec}) must be greater than " | ||
| f"lock_ttl_sec ({self.lock_ttl_sec})" | ||
| ) |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
|
|
||
|
|
||
| def _build_data_key( | ||
| config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
| str(days_offset), | ||
| fetch_fn=lambda: self._backend.fetch(surface_id, days_offset), | ||
| serialize_fn=lambda items: [item.model_dump(mode="json") for item in items], | ||
| deserialize_fn=lambda data: [CorpusItem.model_validate(d) for d in data], |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
| return await self._redis_cache.get_or_fetch( | ||
| "scheduled", | ||
| surface_id.value, | ||
| str(days_offset), |
There was a problem hiding this comment.
Let's verify whether we still need this? Do we need to fetch yesterday's data for the schedule?
| config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str | ||
| ) -> str: | ||
| """Build the Redis key for cached corpus data.""" | ||
| parts = [config.key_prefix, backend_type, surface_id, *extra] |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
| "expires_at": time.time() + soft_ttl_sec, | ||
| "data": data, | ||
| } | ||
| return orjson.dumps(envelope) |
There was a problem hiding this comment.
Let's check if there's a pattern to create an envelope or have a separate expiration key?
There was a problem hiding this comment.
Discussed with Herraj: no existing pattern. Envelope is probably easiest.
| if await self._try_acquire_lock(lock_key): | ||
| return await self._revalidate(data_key, lock_key, fetch_fn, serialize_fn) | ||
| # Another pod is populating; wait briefly then retry Redis | ||
| await asyncio.sleep(0.1) |
There was a problem hiding this comment.
Return 201 if this happens?
| async def shutdown(self) -> None: | ||
| """Close resources owned by this provider.""" | ||
| if self._cache_adapter is not None: | ||
| await self._cache_adapter.close() |
There was a problem hiding this comment.
Why is this the first time we need to do a shutdown in the curated-recommendations provider?
There was a problem hiding this comment.
Because previous resources did not use external connections.
| return envelope["expires_at"], envelope["data"] | ||
|
|
||
|
|
||
| class _RedisCorpusCache: |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
| try: | ||
| await _provider.shutdown() | ||
| except NameError: | ||
| pass |
There was a problem hiding this comment.
Is there any benefit to severing GCS connections here? Other providers may do this. What happens if we don't?
|
|
||
| - `merino/curated_recommendations/corpus_backends/redis_cache.py` — cache logic | ||
| - `merino/curated_recommendations/__init__.py` — wiring (`_init_corpus_cache`) | ||
| - `merino/configs/default.toml` — config section with defaults and documentation |
There was a problem hiding this comment.
Key files is probably not useful; I'd remove it.
8a17c0d to
7791e1c
Compare
| domain_data_source = "remote" | ||
|
|
||
| [stage.curated_recommendations.corpus_cache] | ||
| cache = "redis" |
There was a problem hiding this comment.
Initially roll out to the staging environment for testing.
|
Verified: The parameter is part of the pre-existing |
| - [Test Failures in CI](./operations/testfailures.md) | ||
| - [Configs](./operations/configs.md) | ||
| - [Elasticsearch](./operations/elasticsearch.md) | ||
| - [Curated Recommendations](./operations/curated-recommendations/corpus-cache.md) |
There was a problem hiding this comment.
I'm curious whether we'd like to see more documentation around the Curated Recommendations endpoint. I think yes, but out-of-scope for this PR.
|
|
||
| T = TypeVar("T") | ||
|
|
||
| BackendType = Literal["scheduled", "sections"] |
There was a problem hiding this comment.
This could be an enum, although I don't know if it matters. The scheduled backend is going away in 2026Q2.
| if is_fresh and items_data is not None: | ||
| try: | ||
| result = deserialize_fn(items_data) | ||
| self._metrics.increment("corpus_cache.hit") |
There was a problem hiding this comment.
I didn't differentiate between the two backend endpoints, because scheduled is going away in Q2, and the main goal is to reduce total Apollo calls.
| # quickly and reducing backend load. With the L2 Redis cache, the total worst-case staleness | ||
| # is L1 TTL + L2 soft TTL (~120s), comparable to the previous L1-only TTL of ~120s. | ||
| cache_time_to_live_min = timedelta(seconds=50) | ||
| cache_time_to_live_max = timedelta(seconds=70) |
There was a problem hiding this comment.
This reverts #1305, and keeps the total TTL including Redis the same at about 2 minutes.
| - `corpus_cache.hit` - | ||
| A counter for Redis cache hits (fresh data returned without calling the Corpus API). | ||
| - `corpus_cache.stale` - | ||
| A counter for stale Redis cache entries that trigger revalidation. | ||
| - `corpus_cache.miss` - | ||
| A counter for Redis cache misses (no data in Redis). |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
jpetto
left a comment
There was a problem hiding this comment.
requesting splitting up this PR - i think focusing on the merino/cache files (and relevant docs) would be a good first step to break these changes into smaller, easier to review chunks. set the groundwork for the actual implementation in one PR, then implement in a service in a follow-up.
| # MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__CACHE | ||
| # Shared corpus cache backend. "redis" enables Redis as an L2 cache | ||
| # between in-memory SWR (L1) and the Corpus GraphQL API. "none" disables it. | ||
| cache = "none" |
There was a problem hiding this comment.
i think it's worth mentioning that a value of none will increase corpus load by the number of k8s pods (300).
|
|
||
| # MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__SOFT_TTL_SEC | ||
| # Soft TTL in seconds. Balances propagating time-sensitive content quickly vs backend load. | ||
| soft_ttl_sec = 60 |
There was a problem hiding this comment.
can you add a comment about the practical use of this value?
|
|
||
| # MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__HARD_TTL_SEC | ||
| # Hard TTL in seconds. Safety net so data survives extended API outages (1 day). | ||
| hard_ttl_sec = 86400 |
There was a problem hiding this comment.
same here - add a comment on the practical use.
Introduce a shared Redis cache layer between the in-memory SWR cache (L1) and the Corpus GraphQL API to reduce API request volume by ~300x across Merino pods. Key design: - Distributed stale-while-revalidate: one pod revalidates while others serve stale data, using SET NX EX for distributed locking - Embedded soft TTL envelope (2min soft / 10min hard) stored as orjson - All Redis errors gracefully fall through to the wrapped backend - Gated behind config flag (corpus_cache.backend = "redis" / "none") - Adds set_nx and delete to CacheAdapter protocol Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix lock release on serialize_fn failure: wrap both fetch and serialize in the try/except that releases the distributed lock - Add Redis adapter cleanup on shutdown: store adapter in module-level var, close it in new shutdown() function called from main.py lifespan - Add config validation: CorpusCacheConfig.__post_init__ ensures hard_ttl > soft_ttl and hard_ttl > lock_ttl - Add tests: config validation, serialize error lock release, empty list caching behavior Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix TypeError crash on corrupted envelope with non-numeric expires_at: now caught and treated as cache miss instead of 500ing the request - Fix lock leak on asyncio.CancelledError: use try/finally in _revalidate so lock is released even for BaseException subclasses - Broaden _redis_set exception catch to all Exception (not just CacheAdapterError) so serialization errors are also suppressed - Add graceful fallback in _init_corpus_cache: Redis init failure no longer aborts provider initialization - Guard shutdown() against uninitialized provider (NameError on startup failure) - Add defensive get_provider()/get_legacy_provider() with RuntimeError instead of silent NameError - Add pragma: no cover to new protocol methods for consistency - Add tests for non-numeric expires_at and CancelledError lock release Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace _make_corpus_item helper with generate_corpus_item from test_sections.py. Keep _make_corpus_section as a thin wrapper since no shared CorpusSection fixture exists yet. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- docs/operations/curated-recommendations/corpus-cache.md: operational doc with inline mermaid flow diagram, design decisions table, config reference, and rollout steps - docs/SUMMARY.md: add curated recommendations section to nav - PR_DESCRIPTION.md: PR description following repo template Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On cache miss when the lock is held by another pod, instead of calling the Corpus API directly (which defeats cross-pod coalescing), wait 0.1s and retry Redis. If the lock winner has written data, return it. If not, raise BackendError — the L1 SWR layer will handle the error and retry on the next request. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When Redis is down or the lock holder is slow, the L2 cache now falls back to calling the Corpus API directly instead of raising BackendError (which would surface as empty NewTab results). This ensures the optional L2 cache never degrades availability below what L1+API provide. The L1 SWR cache's per-pod asyncio.Lock coalescing prevents stampede. Also narrows the NameError catch in shutdown() to avoid masking bugs, and removes trailing blank line at EOF in tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Circuit breaker skips Redis after consecutive failures, recovers after timeout - Cold-miss lock losers get 503 + Retry-After instead of piling up connections - Replace dataclass validation with Dynaconf validators - Use typed BackendType and SurfaceId enum for cache keys - Increase hard TTL to 1 day (86400s) to survive extended API outages - Update corpus-cache docs with new behaviors and config table Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove redundant L2 stale-while-revalidate paragraph - Fix circuit breaker description: full close, not half-open - Fix design decisions: Redis consulted on L1 miss or stale - Write out Stale-While-Revalidate fully (no SWR abbreviation) - Describe both L1 and L2 as stale-while-revalidate - Move retry/503 nodes inside background task in diagram - Add motivation for cacheability of corpus requests - Remove rollout section (ephemeral, now in PR description) - Fix ruff formatting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ker_skip) Emits counters at key decision points in the cache lookup: - corpus_cache.hit: fresh Redis hit, no API call needed - corpus_cache.stale: stale entry triggered revalidation - corpus_cache.miss: no data in Redis - corpus_cache.circuit_breaker_skip: Redis skipped due to open circuit Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Redis timeout caveat - Show L1 asyncio.Lock in diagram between L1 miss and L2 check - Increase cold-miss retry delay from 100ms to 500ms (P95 API latency is ~217ms) - Document intra-pod coalescing via asyncio.Lock in cold-miss prose - Document Redis timeout caveat: lock holder blocks, slowing circuit breaker Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move asyncio.Lock inside L1 subgraph for better layout - Rename terminal nodes to "Return data + update L1" consistently - Simplify 503 label to "503 Service Unavailable" - Pre-declare BackendError node near response nodes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Merge FRESH/STALE L1 responses into single "Respond with data" node - Consolidate three "Return data + update L1" terminal nodes into one - Combine "STALE" and "MISS" L2 branches into "STALE or MISS" - Reduces node count from 18 to 14, producing a much more compact layout Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… API fallthrough - Remove Retry-After: 60 header from 503 response (Firefox doesn't parse it) - Circuit breaker open now raises CorpusCacheUnavailable (503) instead of falling through to the Corpus API. L1 serves stale data in steady state, so this only affects cold starts with a degraded Redis. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Uses testcontainers AsyncRedisContainer following existing patterns (weather, finance, yelp). Tests: - Cache miss then hit: verifies backend is called once, second call served from Redis - Distributed lock: concurrent callers on same key, only one fetches Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduces chance of sequential execution masking concurrency bugs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The L1 asyncio.Lock already limits Redis traffic to one coroutine per cache entry per pod, making a circuit breaker unnecessary. In steady state, L1 serves stale data and only the background revalidation task hits Redis. On cold starts, the lock coalesces concurrent requests. Removes ~180 lines of circuit breaker implementation, configuration, tests, and documentation. Adds a "No circuit breaker" design decision explaining the rationale. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Staging has a Redis 7.2 instance (merino-stagepy in moz-fx-merino-nonprod-ee93). This enables the L2 cache there so we can validate before production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- All logger.warning in redis_cache.py → logger.error so Sentry captures data corruption, serialization bugs, and Redis connectivity issues. - L2 soft TTL: 120s → 60s. L1 TTL: 110s → 60s. Total worst-case staleness stays comparable (~120s vs ~110s), but average staleness drops because any pod's revalidation refreshes L2 for all pods. - Improved TTL comments in default.toml. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move CorpusCacheUnavailable from merino/exceptions.py to redis_cache.py (co-located with the code that raises it, matching CorpusGraphQLError pattern) - Remove Configuration section from corpus-cache.md (duplicates default.toml) - Parametrize key builder tests, add log assertions for silent error paths - Expand "SWR" to "stale-while-revalidate" in docstrings - Tighten concurrent lock integration test to assert all callers succeed - Clarify test docstrings Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wiring-only module that creates GCS clients, HTTP clients, and Redis connections. Not unit-testable in a meaningful way; exercised by integration tests and the running app. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This reverts commit ea36cb4.
Cover get, set, sadd, sismember, scard, close (primary != replica), register_script, and run_script. Brings merino/cache/redis.py from 65% to 97% coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace opportunistic RedisAdapter coverage (sadd, sismember, scard, register_script, run_script, get, set, close) with targeted tests for _init_corpus_cache which this PR introduced. Covers all 3 paths: disabled, redis-enabled, and error fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace corpus_cache.hit/stale/miss with a single `cache` metric using tags: name=corpus, result=hit|stale|miss. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9e644e4 to
71315de
Compare
…tings Adds one-line annotations to cache, soft_ttl_sec, and hard_ttl_sec describing their real-world effects (Apollo volume / Corpus API load, avg propagation time, max stale-data age during outages). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
References
JIRA: HNT-1890
Description
Adds a shared Redis cache between the per-pod in-memory cache and the Corpus GraphQL API. ~300 pods currently hit the API independently every ~60s. With this, one pod fetches and the rest read from Redis.
How it works
Requests are always served from the in-memory cache (L1) — either fresh or stale. Stale data is served whenever possible so that the Redis and API calls never block the request. Revalidation happens in a background task: it checks Redis (L2) first, and only the pod that acquires the distributed lock fetches from the API.
flowchart TB req["Firefox NewTab Request"] subgraph L1 ["L1 — Per-Pod In-Memory Cache"] check_l1{{"Check in-memory cache"}} l1_lock{{"asyncio.Lock (per entry)"}} end respond["Respond with data"] return_error["BackendError"] subgraph bg ["Background Revalidation Task"] subgraph L2 ["L2 — Shared Redis"] check_l2{{"Check Redis cache"}} acquire_lock{{"Try distributed lock"}} end api["Fetch from Corpus GraphQL API"] write["Write to Redis + release lock"] retry_redis["Wait 500ms, retry Redis"] serve_stale["Serve stale + update L1"] return_data["Return data + update L1"] return_503["503 Service Unavailable"] end req --> check_l1 check_l1 -- "FRESH or STALE" --> respond check_l1 -. "MISS (cold start)" .-> l1_lock respond -. "if STALE, spawns task" .-> l1_lock l1_lock -- "acquired" --> check_l2 l1_lock -. "waited, value populated" .-> respond l1_lock -. "waited, fetch failed" .-> return_error check_l2 -- "FRESH HIT" --> return_data check_l2 -. "STALE or MISS" .-> acquire_lock acquire_lock -- "LOCK ACQUIRED" --> api acquire_lock -. "LOCK HELD + stale exists" .-> serve_stale acquire_lock -. "LOCK HELD + no data" .-> retry_redis retry_redis -- "HIT" --> return_data retry_redis -. "MISS" .-> return_503 api --> write --> return_data style req fill:#2c3e50,stroke:#1a252f,color:#ecf0f1,stroke-width:2px style check_l1 fill:#2980b9,stroke:#1f6da0,color:#fff,stroke-width:2px style l1_lock fill:#2980b9,stroke:#1f6da0,color:#fff,stroke-width:2px style check_l2 fill:#d35400,stroke:#a04000,color:#fff,stroke-width:2px style acquire_lock fill:#e67e22,stroke:#bf6516,color:#fff,stroke-width:2px style api fill:#1e8449,stroke:#145a32,color:#fff,stroke-width:2px style write fill:#1e8449,stroke:#145a32,color:#fff,stroke-width:2px style respond fill:#27ae60,stroke:#1e8449,color:#fff,stroke-width:2px style serve_stale fill:#f4d03f,stroke:#d4ac0f,color:#333 style retry_redis fill:#e67e22,stroke:#bf6516,color:#fff,stroke-width:2px style return_data fill:#27ae60,stroke:#1e8449,color:#fff style return_503 fill:#e74c3c,stroke:#c0392b,color:#fff style return_error fill:#e74c3c,stroke:#c0392b,color:#fff style L1 fill:#eaf2f8,stroke:#2980b9,stroke-width:2px,color:#2c3e50 style L2 fill:#fef5e7,stroke:#d35400,stroke-width:2px,color:#2c3e50 style bg fill:#f4f6f7,stroke:#95a5a6,stroke-width:2px,stroke-dasharray: 8 4,color:#2c3e50Design decisions
asyncio.Lockper cache entrySET NX EXwith TTLasyncio.Locklimits Redis trafficRollout
stage.toml. Monitor metrics, validate API call reduction.cache = "redis"toproduction.tomlor setMERINO_CURATED_RECOMMENDATIONS__CORPUS_CACHE__CACHE=redis.PR Review Checklist
Put an
xin the boxes that apply[DISCO-####], and has the same title (if applicable)[load test: (abort|skip|warn)]keywords are applied to the last commit message (if applicable)