HNT-1890: shared curated recommendation cache by mmiermans · Pull Request #1306 · mozilla-services/merino-py

mmiermans · 2026-03-12T01:08:45Z

References

Description

Adds a shared Redis cache between the per-pod in-memory cache and the Corpus GraphQL API. ~300 pods currently hit the API independently every ~60s. With this, one pod fetches and the rest read from Redis.

How it works

Requests are always served from the in-memory cache (L1) — either fresh or stale. Stale data is served whenever possible so that the Redis and API calls never block the request. Revalidation happens in a background task: it checks Redis (L2) first, and only the pod that acquires the distributed lock fetches from the API.

flowchart TB
    req["Firefox NewTab Request"]

    subgraph L1 ["L1 — Per-Pod In-Memory Cache"]
        check_l1{{"Check in-memory cache"}}
        l1_lock{{"asyncio.Lock (per entry)"}}
    end

    respond["Respond with data"]
    return_error["BackendError"]

    subgraph bg ["Background Revalidation Task"]

        subgraph L2 ["L2 — Shared Redis"]
            check_l2{{"Check Redis cache"}}
            acquire_lock{{"Try distributed lock"}}
        end

        api["Fetch from Corpus GraphQL API"]
        write["Write to Redis + release lock"]
        retry_redis["Wait 500ms, retry Redis"]
        serve_stale["Serve stale + update L1"]
        return_data["Return data + update L1"]
        return_503["503 Service Unavailable"]
    end

    req --> check_l1

    check_l1 -- "FRESH or STALE" --> respond
    check_l1 -. "MISS (cold start)" .-> l1_lock

    respond -. "if STALE, spawns task" .-> l1_lock
    l1_lock -- "acquired" --> check_l2
    l1_lock -. "waited, value populated" .-> respond
    l1_lock -. "waited, fetch failed" .-> return_error

    check_l2 -- "FRESH HIT" --> return_data
    check_l2 -. "STALE or MISS" .-> acquire_lock

    acquire_lock -- "LOCK ACQUIRED" --> api
    acquire_lock -. "LOCK HELD + stale exists" .-> serve_stale
    acquire_lock -. "LOCK HELD + no data" .-> retry_redis

    retry_redis -- "HIT" --> return_data
    retry_redis -. "MISS" .-> return_503

    api --> write --> return_data

    style req fill:#2c3e50,stroke:#1a252f,color:#ecf0f1,stroke-width:2px
    style check_l1 fill:#2980b9,stroke:#1f6da0,color:#fff,stroke-width:2px
    style l1_lock fill:#2980b9,stroke:#1f6da0,color:#fff,stroke-width:2px
    style check_l2 fill:#d35400,stroke:#a04000,color:#fff,stroke-width:2px
    style acquire_lock fill:#e67e22,stroke:#bf6516,color:#fff,stroke-width:2px
    style api fill:#1e8449,stroke:#145a32,color:#fff,stroke-width:2px
    style write fill:#1e8449,stroke:#145a32,color:#fff,stroke-width:2px
    style respond fill:#27ae60,stroke:#1e8449,color:#fff,stroke-width:2px
    style serve_stale fill:#f4d03f,stroke:#d4ac0f,color:#333
    style retry_redis fill:#e67e22,stroke:#bf6516,color:#fff,stroke-width:2px
    style return_data fill:#27ae60,stroke:#1e8449,color:#fff
    style return_503 fill:#e74c3c,stroke:#c0392b,color:#fff
    style return_error fill:#e74c3c,stroke:#c0392b,color:#fff
    style L1 fill:#eaf2f8,stroke:#2980b9,stroke-width:2px,color:#2c3e50
    style L2 fill:#fef5e7,stroke:#d35400,stroke-width:2px,color:#2c3e50
    style bg fill:#f4f6f7,stroke:#95a5a6,stroke-width:2px,stroke-dasharray: 8 4,color:#2c3e50

Design decisions

Decision	Choice	Why
Cache layer	Redis L2 behind existing in-memory L1	Keeps per-pod latency low, Redis only consulted on L1 miss or stale
Write pattern	Distributed stale-while-revalidate	One pod revalidates, others serve stale. Avoids thundering herd
L1 lock	`asyncio.Lock` per cache entry	Coordinates concurrent revalidation within a single pod
L2 lock	`SET NX EX` with TTL	Distributed, self-expiring. Worst case on timeout: one extra API call
Soft/hard TTL	60s soft, 1 day hard	Soft triggers revalidation. Hard is a safety net for extended API outages
Cache format	Pydantic model dicts via orjson	Processed models cached, not raw GraphQL. Saves CPU across pods
Redis cluster	Same shared cluster as other providers	~36 keys, ~4MB total. Not worth a separate instance
Cold miss (lock held)	503 Service Unavailable	Prevents connection pile-up; Firefox shows cached NewTab content
No circuit breaker	L1 `asyncio.Lock` limits Redis traffic	Only one coroutine per cache entry per pod reaches Redis. In steady state, L1 serves stale data and only the background task hits Redis
Rollout	Global config flag, off by default	Enable in staging first, then production

Rollout

Checked with Nan; 5 MB storage and 40 IOPS are fine.
Enabled in staging via stage.toml. Monitor metrics, validate API call reduction.
Enable in production: add cache = "redis" to production.toml or set MERINO_CURATED_RECOMMENDATIONS__CORPUS_CACHE__CACHE=redis.
After production rollout: add a request volume alert in Apollo based on the new (lower) baseline, to detect when the Redis cache fails and pods fall back to direct API calls.

PR Review Checklist

Put an x in the boxes that apply

This PR conforms to the Contribution Guidelines
The PR title starts with the JIRA issue reference, format example [DISCO-####], and has the same title (if applicable)
[load test: (abort|skip|warn)] keywords are applied to the last commit message (if applicable)
Documentation has been updated (if applicable)
Functional and performance test coverage has been expanded and maintained (if applicable)

mmiermans · 2026-03-23T19:16:46Z

Returning a 201 code in the scenario where we do not have any (stale) data would improve reliability, by avoiding connections piling up.

TODO: Check how Firefox Desktop & Mobile would handle 201 status code? Does it retry?

mmiermans · 2026-03-23T19:29:34Z

Redis L2 behind existing in-memory L1

Q: Are we consistent, or does weather and other providers hit Redis on every request? Are the use-cases different?

Q: What's the existing pattern for testing with Redis?

Q: What are the existing Redis patterns?

mmiermans · 2026-03-23T19:33:29Z

+Key settings:
+- `cache` — `"redis"` to enable, `"none"` to disable (default: disabled)
+- `soft_ttl_sec` — when a cached entry is considered stale and triggers revalidation
+- `hard_ttl_sec` — when Redis evicts the key entirely (safety net)


We probably want this to be high: more like 1-24 hours? Days? To match our response time.

mmiermans · 2026-03-23T19:35:15Z

+            await self.primary.delete(key)
+        except RedisError as exc:
+            raise CacheAdapterError(f"Failed to DELETE `{repr(key)}` with error: `{exc}`") from exc
+


Again, check if this is a new pattern, and whether it should be one.

Redis locks are a new pattern because the curated recommendations use-case is unique, in that a lot of the same requests arrive at the same time.

Sign in to view

+
+# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__HARD_TTL_SEC
+# Hard TTL in seconds. Redis evicts the key after this. Should be much longer than soft TTL.
+hard_ttl_sec = 600


mmiermans · 2026-03-23T19:38:38Z

+
+# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__LOCK_TTL_SEC
+# Distributed lock TTL in seconds. Auto-releases if the lock holder crashes.
+lock_ttl_sec = 30


TODO: Check if there's a client network timeout of 30s?

Checked: the HTTP client for Corpus API requests uses httpx.AsyncClient with a 5-second request timeout (default from merino/utils/http_client.py:12). The lock_ttl_sec = 30 here is for the Redis distributed lock auto-expiry, not for network timeouts.

So the worst case for a single get_or_fetch call with Redis timing out is bounded by the Redis socket_timeout_sec (configured in [default.redis]), not 30 seconds per API call.

Sign in to view

+            raise ValueError(
+                f"hard_ttl_sec ({self.hard_ttl_sec}) must be greater than "
+                f"lock_ttl_sec ({self.lock_ttl_sec})"
+            )


Sign in to view

+
+
+def _build_data_key(
+    config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str


Sign in to view

+            str(days_offset),
+            fetch_fn=lambda: self._backend.fetch(surface_id, days_offset),
+            serialize_fn=lambda items: [item.model_dump(mode="json") for item in items],
+            deserialize_fn=lambda data: [CorpusItem.model_validate(d) for d in data],


mmiermans · 2026-03-23T19:45:03Z

+        return await self._redis_cache.get_or_fetch(
+            "scheduled",
+            surface_id.value,
+            str(days_offset),


Let's verify whether we still need this? Do we need to fetch yesterday's data for the schedule?

Sign in to view

+    config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str
+) -> str:
+    """Build the Redis key for cached corpus data."""
+    parts = [config.key_prefix, backend_type, surface_id, *extra]


mmiermans · 2026-03-23T19:47:40Z

+        "expires_at": time.time() + soft_ttl_sec,
+        "data": data,
+    }
+    return orjson.dumps(envelope)


Let's check if there's a pattern to create an envelope or have a separate expiration key?

Discussed with Herraj: no existing pattern. Envelope is probably easiest.

mmiermans · 2026-03-23T19:53:49Z

+        if await self._try_acquire_lock(lock_key):
+            return await self._revalidate(data_key, lock_key, fetch_fn, serialize_fn)
+        # Another pod is populating; wait briefly then retry Redis
+        await asyncio.sleep(0.1)


Return 201 if this happens?

mmiermans · 2026-03-23T19:57:35Z

+    async def shutdown(self) -> None:
+        """Close resources owned by this provider."""
+        if self._cache_adapter is not None:
+            await self._cache_adapter.close()


Why is this the first time we need to do a shutdown in the curated-recommendations provider?

Because previous resources did not use external connections.

Sign in to view

+    return envelope["expires_at"], envelope["data"]
+
+
+class _RedisCorpusCache:


mmiermans · 2026-03-30T18:40:48Z

+    try:
+        await _provider.shutdown()
+    except NameError:
+        pass


Is there any benefit to severing GCS connections here? Other providers may do this. What happens if we don't?

mmiermans · 2026-03-30T19:16:57Z

+
+- `merino/curated_recommendations/corpus_backends/redis_cache.py` — cache logic
+- `merino/curated_recommendations/__init__.py` — wiring (`_init_corpus_cache`)
+- `merino/configs/default.toml` — config section with defaults and documentation


Key files is probably not useful; I'd remove it.

mmiermans · 2026-04-10T18:44:16Z

 domain_data_source = "remote"

+[stage.curated_recommendations.corpus_cache]
+cache = "redis"


Initially roll out to the staging environment for testing.

mmiermans · 2026-04-10T19:47:34Z

Re: days_offset comment

Verified: days_offset is never called with a non-zero value. The only call site is provider.py:195 which uses the default (days_offset=0).

The parameter is part of the pre-existing ScheduledSurfaceProtocol interface, so the cache wrapper implements the same signature. Since it's always 0, the cache key doesn't need to include it.

mmiermans · 2026-04-10T19:13:10Z

  - [Test Failures in CI](./operations/testfailures.md)
  - [Configs](./operations/configs.md)
  - [Elasticsearch](./operations/elasticsearch.md)
+  - [Curated Recommendations](./operations/curated-recommendations/corpus-cache.md)


I'm curious whether we'd like to see more documentation around the Curated Recommendations endpoint. I think yes, but out-of-scope for this PR.

mmiermans · 2026-04-10T19:22:36Z

+
+T = TypeVar("T")
+
+BackendType = Literal["scheduled", "sections"]


This could be an enum, although I don't know if it matters. The scheduled backend is going away in 2026Q2.

mmiermans · 2026-04-10T19:32:15Z

+            if is_fresh and items_data is not None:
+                try:
+                    result = deserialize_fn(items_data)
+                    self._metrics.increment("corpus_cache.hit")


I didn't differentiate between the two backend endpoints, because scheduled is going away in Q2, and the main goal is to reduce total Apollo calls.

mmiermans · 2026-04-10T19:39:21Z

+    # quickly and reducing backend load. With the L2 Redis cache, the total worst-case staleness
+    # is L1 TTL + L2 soft TTL (~120s), comparable to the previous L1-only TTL of ~120s.
+    cache_time_to_live_min = timedelta(seconds=50)
+    cache_time_to_live_max = timedelta(seconds=70)


This reverts #1305, and keeps the total TTL including Redis the same at about 2 minutes.

Sign in to view

+- `corpus_cache.hit` -
+ A counter for Redis cache hits (fresh data returned without calling the Corpus API).
+- `corpus_cache.stale` -
+ A counter for stale Redis cache entries that trigger revalidation.
+- `corpus_cache.miss` -
+ A counter for Redis cache misses (no data in Redis).


jpetto

requesting splitting up this PR - i think focusing on the merino/cache files (and relevant docs) would be a good first step to break these changes into smaller, easier to review chunks. set the groundwork for the actual implementation in one PR, then implement in a service in a follow-up.

jpetto · 2026-04-23T15:17:41Z

+# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__CACHE
+# Shared corpus cache backend. "redis" enables Redis as an L2 cache
+# between in-memory SWR (L1) and the Corpus GraphQL API. "none" disables it.
+cache = "none"


i think it's worth mentioning that a value of none will increase corpus load by the number of k8s pods (300).

jpetto · 2026-04-23T15:18:15Z

+
+# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__SOFT_TTL_SEC
+# Soft TTL in seconds. Balances propagating time-sensitive content quickly vs backend load.
+soft_ttl_sec = 60


can you add a comment about the practical use of this value?

jpetto · 2026-04-23T15:19:44Z

+
+# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__HARD_TTL_SEC
+# Hard TTL in seconds. Safety net so data survives extended API outages (1 day).
+hard_ttl_sec = 86400


same here - add a comment on the practical use.

Introduce a shared Redis cache layer between the in-memory SWR cache (L1) and the Corpus GraphQL API to reduce API request volume by ~300x across Merino pods. Key design: - Distributed stale-while-revalidate: one pod revalidates while others serve stale data, using SET NX EX for distributed locking - Embedded soft TTL envelope (2min soft / 10min hard) stored as orjson - All Redis errors gracefully fall through to the wrapped backend - Gated behind config flag (corpus_cache.backend = "redis" / "none") - Adds set_nx and delete to CacheAdapter protocol Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Fix lock release on serialize_fn failure: wrap both fetch and serialize in the try/except that releases the distributed lock - Add Redis adapter cleanup on shutdown: store adapter in module-level var, close it in new shutdown() function called from main.py lifespan - Add config validation: CorpusCacheConfig.__post_init__ ensures hard_ttl > soft_ttl and hard_ttl > lock_ttl - Add tests: config validation, serialize error lock release, empty list caching behavior Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Fix TypeError crash on corrupted envelope with non-numeric expires_at: now caught and treated as cache miss instead of 500ing the request - Fix lock leak on asyncio.CancelledError: use try/finally in _revalidate so lock is released even for BaseException subclasses - Broaden _redis_set exception catch to all Exception (not just CacheAdapterError) so serialization errors are also suppressed - Add graceful fallback in _init_corpus_cache: Redis init failure no longer aborts provider initialization - Guard shutdown() against uninitialized provider (NameError on startup failure) - Add defensive get_provider()/get_legacy_provider() with RuntimeError instead of silent NameError - Add pragma: no cover to new protocol methods for consistency - Add tests for non-numeric expires_at and CancelledError lock release Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace _make_corpus_item helper with generate_corpus_item from test_sections.py. Keep _make_corpus_section as a thin wrapper since no shared CorpusSection fixture exists yet. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- docs/operations/curated-recommendations/corpus-cache.md: operational doc with inline mermaid flow diagram, design decisions table, config reference, and rollout steps - docs/SUMMARY.md: add curated recommendations section to nav - PR_DESCRIPTION.md: PR description following repo template Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

On cache miss when the lock is held by another pod, instead of calling the Corpus API directly (which defeats cross-pod coalescing), wait 0.1s and retry Redis. If the lock winner has written data, return it. If not, raise BackendError — the L1 SWR layer will handle the error and retry on the next request. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When Redis is down or the lock holder is slow, the L2 cache now falls back to calling the Corpus API directly instead of raising BackendError (which would surface as empty NewTab results). This ensures the optional L2 cache never degrades availability below what L1+API provide. The L1 SWR cache's per-pod asyncio.Lock coalescing prevents stampede. Also narrows the NameError catch in shutdown() to avoid masking bugs, and removes trailing blank line at EOF in tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Circuit breaker skips Redis after consecutive failures, recovers after timeout - Cold-miss lock losers get 503 + Retry-After instead of piling up connections - Replace dataclass validation with Dynaconf validators - Use typed BackendType and SurfaceId enum for cache keys - Increase hard TTL to 1 day (86400s) to survive extended API outages - Update corpus-cache docs with new behaviors and config table Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove redundant L2 stale-while-revalidate paragraph - Fix circuit breaker description: full close, not half-open - Fix design decisions: Redis consulted on L1 miss or stale - Write out Stale-While-Revalidate fully (no SWR abbreviation) - Describe both L1 and L2 as stale-while-revalidate - Move retry/503 nodes inside background task in diagram - Add motivation for cacheability of corpus requests - Remove rollout section (ephemeral, now in PR description) - Fix ruff formatting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ker_skip) Emits counters at key decision points in the cache lookup: - corpus_cache.hit: fresh Redis hit, no API call needed - corpus_cache.stale: stale entry triggered revalidation - corpus_cache.miss: no data in Redis - corpus_cache.circuit_breaker_skip: Redis skipped due to open circuit Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…Redis timeout caveat - Show L1 asyncio.Lock in diagram between L1 miss and L2 check - Increase cold-miss retry delay from 100ms to 500ms (P95 API latency is ~217ms) - Document intra-pod coalescing via asyncio.Lock in cold-miss prose - Document Redis timeout caveat: lock holder blocks, slowing circuit breaker Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Move asyncio.Lock inside L1 subgraph for better layout - Rename terminal nodes to "Return data + update L1" consistently - Simplify 503 label to "503 Service Unavailable" - Pre-declare BackendError node near response nodes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Merge FRESH/STALE L1 responses into single "Respond with data" node - Consolidate three "Return data + update L1" terminal nodes into one - Combine "STALE" and "MISS" L2 branches into "STALE or MISS" - Reduces node count from 18 to 14, producing a much more compact layout Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… API fallthrough - Remove Retry-After: 60 header from 503 response (Firefox doesn't parse it) - Circuit breaker open now raises CorpusCacheUnavailable (503) instead of falling through to the Corpus API. L1 serves stale data in steady state, so this only affects cold starts with a degraded Redis. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Uses testcontainers AsyncRedisContainer following existing patterns (weather, finance, yelp). Tests: - Cache miss then hit: verifies backend is called once, second call served from Redis - Distributed lock: concurrent callers on same key, only one fetches Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reduces chance of sequential execution masking concurrency bugs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The L1 asyncio.Lock already limits Redis traffic to one coroutine per cache entry per pod, making a circuit breaker unnecessary. In steady state, L1 serves stale data and only the background revalidation task hits Redis. On cold starts, the lock coalesces concurrent requests. Removes ~180 lines of circuit breaker implementation, configuration, tests, and documentation. Adds a "No circuit breaker" design decision explaining the rationale. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Staging has a Redis 7.2 instance (merino-stagepy in moz-fx-merino-nonprod-ee93). This enables the L2 cache there so we can validate before production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- All logger.warning in redis_cache.py → logger.error so Sentry captures data corruption, serialization bugs, and Redis connectivity issues. - L2 soft TTL: 120s → 60s. L1 TTL: 110s → 60s. Total worst-case staleness stays comparable (~120s vs ~110s), but average staleness drops because any pod's revalidation refreshes L2 for all pods. - Improved TTL comments in default.toml. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Move CorpusCacheUnavailable from merino/exceptions.py to redis_cache.py (co-located with the code that raises it, matching CorpusGraphQLError pattern) - Remove Configuration section from corpus-cache.md (duplicates default.toml) - Parametrize key builder tests, add log assertions for silent error paths - Expand "SWR" to "stale-while-revalidate" in docstrings - Tighten concurrent lock integration test to assert all callers succeed - Clarify test docstrings Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Wiring-only module that creates GCS clients, HTTP clients, and Redis connections. Not unit-testable in a meaningful way; exercised by integration tests and the running app. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

This reverts commit ea36cb4.

Cover get, set, sadd, sismember, scard, close (primary != replica), register_script, and run_script. Brings merino/cache/redis.py from 65% to 97% coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace opportunistic RedisAdapter coverage (sadd, sismember, scard, register_script, run_script, get, set, close) with targeted tests for _init_corpus_cache which this PR introduced. Covers all 3 paths: disabled, redis-enabled, and error fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace corpus_cache.hit/stale/miss with a single `cache` metric using tags: name=corpus, result=hit|stale|miss. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tings Adds one-line annotations to cache, soft_ttl_sec, and hard_ttl_sec describing their real-world effects (Apollo volume / Corpus API load, avg propagation time, max stale-data age during outages). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mmiermans requested a review from jpetto March 12, 2026 01:19

mmiermans changed the title ~~Hnt 1890 shared curated recommendation cache~~ HNT-1890: shared curated recommendation cache Mar 12, 2026

mmiermans force-pushed the hnt-1890-shared-curated-recommendation-cache branch from 5894b87 to 8e632b5 Compare March 12, 2026 21:07

mmiermans commented Mar 23, 2026

View reviewed changes

Comment thread merino/configs/default.toml Outdated

# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__HARD_TTL_SEC

# Hard TTL in seconds. Redis evicts the key after this. Should be much longer than soft TTL.

hard_ttl_sec = 600

This comment was marked as resolved.

Sign in to view

mmiermans commented Mar 23, 2026

View reviewed changes

Comment thread merino/curated_recommendations/corpus_backends/redis_cache.py Outdated

def _build_data_key(

config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str

This comment was marked as resolved.

Sign in to view

mmiermans commented Mar 23, 2026

View reviewed changes

mmiermans commented Mar 30, 2026

View reviewed changes

Comment thread merino/curated_recommendations/corpus_backends/redis_cache.py

return envelope["expires_at"], envelope["data"]

class _RedisCorpusCache:

This comment was marked as resolved.

Sign in to view

mmiermans commented Mar 30, 2026

View reviewed changes

mmiermans force-pushed the hnt-1890-shared-curated-recommendation-cache branch from 8a17c0d to 7791e1c Compare April 10, 2026 18:03

mmiermans commented Apr 10, 2026

View reviewed changes

mmiermans requested a review from Herraj April 10, 2026 21:21

mmiermans marked this pull request as ready for review April 10, 2026 22:14

mmiermans commented Apr 10, 2026

View reviewed changes

kschelonka reviewed Apr 13, 2026

View reviewed changes

jpetto requested changes Apr 23, 2026

View reviewed changes

mmiermans and others added 2 commits April 27, 2026 12:21

mmiermans and others added 28 commits April 27, 2026 12:21

Review iteration

544f78c

Add shutdown error handling and stale deserialization test case

9ca2bb3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add BackendError path from asyncio.Lock when fetch fails

7da4c50

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix stale Retry-After references in docs and exception docstring

ada9124

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Increase concurrent lock test from 2 to 10 callers

2a618ea

Reduces chance of sequential execution masking concurrency bugs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix E731: replace lambda assignments with def in integration tests

5dcbeed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Enable Redis corpus cache in staging

233688a

Staging has a Redis 7.2 instance (merino-stagepy in moz-fx-merino-nonprod-ee93). This enables the L2 cache there so we can validate before production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Revert "Exclude curated_recommendations/__init__.py from coverage"

9e99fd4

This reverts commit ea36cb4.

Add unit tests for RedisAdapter methods to improve coverage

8631df6

Cover get, set, sadd, sismember, scard, close (primary != replica), register_script, and run_script. Brings merino/cache/redis.py from 65% to 97% coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reduce metrics cardinality: use tags instead of metric name hierarchy

71315de

Replace corpus_cache.hit/stale/miss with a single `cache` metric using tags: name=corpus, result=hit|stale|miss. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mmiermans force-pushed the hnt-1890-shared-curated-recommendation-cache branch from 9e644e4 to 71315de Compare April 27, 2026 19:21



		def _build_data_key(
		config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str

		return envelope["expires_at"], envelope["data"]


		class _RedisCorpusCache:


		T = TypeVar("T")

		BackendType = Literal["scheduled", "sections"]

Conversation

mmiermans commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

References

Description

How it works

Design decisions

Rollout

PR Review Checklist

Uh oh!

mmiermans commented Mar 23, 2026

Uh oh!

mmiermans commented Mar 23, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mmiermans commented Apr 10, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

Uh oh!

jpetto left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

mmiermans commented Mar 12, 2026 •

edited

Loading