Skip to content

HNT-1890: shared curated recommendation cache#1306

Open
mmiermans wants to merge 35 commits into
mainfrom
hnt-1890-shared-curated-recommendation-cache
Open

HNT-1890: shared curated recommendation cache#1306
mmiermans wants to merge 35 commits into
mainfrom
hnt-1890-shared-curated-recommendation-cache

Conversation

@mmiermans

@mmiermans mmiermans commented Mar 12, 2026

Copy link
Copy Markdown
Collaborator

References

JIRA: HNT-1890

Description

Adds a shared Redis cache between the per-pod in-memory cache and the Corpus GraphQL API. ~300 pods currently hit the API independently every ~60s. With this, one pod fetches and the rest read from Redis.

How it works

Requests are always served from the in-memory cache (L1) — either fresh or stale. Stale data is served whenever possible so that the Redis and API calls never block the request. Revalidation happens in a background task: it checks Redis (L2) first, and only the pod that acquires the distributed lock fetches from the API.

flowchart TB
    req["Firefox NewTab Request"]

    subgraph L1 ["L1 — Per-Pod In-Memory Cache"]
        check_l1{{"Check in-memory cache"}}
        l1_lock{{"asyncio.Lock (per entry)"}}
    end

    respond["Respond with data"]
    return_error["BackendError"]

    subgraph bg ["Background Revalidation Task"]

        subgraph L2 ["L2 — Shared Redis"]
            check_l2{{"Check Redis cache"}}
            acquire_lock{{"Try distributed lock"}}
        end

        api["Fetch from Corpus GraphQL API"]
        write["Write to Redis + release lock"]
        retry_redis["Wait 500ms, retry Redis"]
        serve_stale["Serve stale + update L1"]
        return_data["Return data + update L1"]
        return_503["503 Service Unavailable"]
    end

    req --> check_l1

    check_l1 -- "FRESH or STALE" --> respond
    check_l1 -. "MISS (cold start)" .-> l1_lock

    respond -. "if STALE, spawns task" .-> l1_lock
    l1_lock -- "acquired" --> check_l2
    l1_lock -. "waited, value populated" .-> respond
    l1_lock -. "waited, fetch failed" .-> return_error

    check_l2 -- "FRESH HIT" --> return_data
    check_l2 -. "STALE or MISS" .-> acquire_lock

    acquire_lock -- "LOCK ACQUIRED" --> api
    acquire_lock -. "LOCK HELD + stale exists" .-> serve_stale
    acquire_lock -. "LOCK HELD + no data" .-> retry_redis

    retry_redis -- "HIT" --> return_data
    retry_redis -. "MISS" .-> return_503

    api --> write --> return_data

    style req fill:#2c3e50,stroke:#1a252f,color:#ecf0f1,stroke-width:2px
    style check_l1 fill:#2980b9,stroke:#1f6da0,color:#fff,stroke-width:2px
    style l1_lock fill:#2980b9,stroke:#1f6da0,color:#fff,stroke-width:2px
    style check_l2 fill:#d35400,stroke:#a04000,color:#fff,stroke-width:2px
    style acquire_lock fill:#e67e22,stroke:#bf6516,color:#fff,stroke-width:2px
    style api fill:#1e8449,stroke:#145a32,color:#fff,stroke-width:2px
    style write fill:#1e8449,stroke:#145a32,color:#fff,stroke-width:2px
    style respond fill:#27ae60,stroke:#1e8449,color:#fff,stroke-width:2px
    style serve_stale fill:#f4d03f,stroke:#d4ac0f,color:#333
    style retry_redis fill:#e67e22,stroke:#bf6516,color:#fff,stroke-width:2px
    style return_data fill:#27ae60,stroke:#1e8449,color:#fff
    style return_503 fill:#e74c3c,stroke:#c0392b,color:#fff
    style return_error fill:#e74c3c,stroke:#c0392b,color:#fff
    style L1 fill:#eaf2f8,stroke:#2980b9,stroke-width:2px,color:#2c3e50
    style L2 fill:#fef5e7,stroke:#d35400,stroke-width:2px,color:#2c3e50
    style bg fill:#f4f6f7,stroke:#95a5a6,stroke-width:2px,stroke-dasharray: 8 4,color:#2c3e50
Loading

Design decisions

Decision Choice Why
Cache layer Redis L2 behind existing in-memory L1 Keeps per-pod latency low, Redis only consulted on L1 miss or stale
Write pattern Distributed stale-while-revalidate One pod revalidates, others serve stale. Avoids thundering herd
L1 lock asyncio.Lock per cache entry Coordinates concurrent revalidation within a single pod
L2 lock SET NX EX with TTL Distributed, self-expiring. Worst case on timeout: one extra API call
Soft/hard TTL 60s soft, 1 day hard Soft triggers revalidation. Hard is a safety net for extended API outages
Cache format Pydantic model dicts via orjson Processed models cached, not raw GraphQL. Saves CPU across pods
Redis cluster Same shared cluster as other providers ~36 keys, ~4MB total. Not worth a separate instance
Cold miss (lock held) 503 Service Unavailable Prevents connection pile-up; Firefox shows cached NewTab content
No circuit breaker L1 asyncio.Lock limits Redis traffic Only one coroutine per cache entry per pod reaches Redis. In steady state, L1 serves stale data and only the background task hits Redis
Rollout Global config flag, off by default Enable in staging first, then production

Rollout

  1. Checked with Nan; 5 MB storage and 40 IOPS are fine.
  2. Enabled in staging via stage.toml. Monitor metrics, validate API call reduction.
  3. Enable in production: add cache = "redis" to production.toml or set MERINO_CURATED_RECOMMENDATIONS__CORPUS_CACHE__CACHE=redis.
  4. After production rollout: add a request volume alert in Apollo based on the new (lower) baseline, to detect when the Redis cache fails and pods fall back to direct API calls.

PR Review Checklist

Put an x in the boxes that apply

  • This PR conforms to the Contribution Guidelines
  • The PR title starts with the JIRA issue reference, format example [DISCO-####], and has the same title (if applicable)
  • [load test: (abort|skip|warn)] keywords are applied to the last commit message (if applicable)
  • Documentation has been updated (if applicable)
  • Functional and performance test coverage has been expanded and maintained (if applicable)

@mmiermans mmiermans requested a review from jpetto March 12, 2026 01:19
@mmiermans mmiermans changed the title Hnt 1890 shared curated recommendation cache HNT-1890: shared curated recommendation cache Mar 12, 2026
@mmiermans mmiermans force-pushed the hnt-1890-shared-curated-recommendation-cache branch from 5894b87 to 8e632b5 Compare March 12, 2026 21:07
@mmiermans

Copy link
Copy Markdown
Collaborator Author

Returning a 201 code in the scenario where we do not have any (stale) data would improve reliability, by avoiding connections piling up.

TODO: Check how Firefox Desktop & Mobile would handle 201 status code? Does it retry?

@mmiermans

Copy link
Copy Markdown
Collaborator Author

Redis L2 behind existing in-memory L1

Q: Are we consistent, or does weather and other providers hit Redis on every request? Are the use-cases different?

Q: What's the existing pattern for testing with Redis?

Q: What are the existing Redis patterns?

Key settings:
- `cache` — `"redis"` to enable, `"none"` to disable (default: disabled)
- `soft_ttl_sec` — when a cached entry is considered stale and triggers revalidation
- `hard_ttl_sec` — when Redis evicts the key entirely (safety net)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want this to be high: more like 1-24 hours? Days? To match our response time.

Comment thread merino/cache/redis.py
await self.primary.delete(key)
except RedisError as exc:
raise CacheAdapterError(f"Failed to DELETE `{repr(key)}` with error: `{exc}`") from exc

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, check if this is a new pattern, and whether it should be one.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redis locks are a new pattern because the curated recommendations use-case is unique, in that a lot of the same requests arrive at the same time.

Comment thread merino/configs/default.toml Outdated

# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__HARD_TTL_SEC
# Hard TTL in seconds. Redis evicts the key after this. Should be much longer than soft TTL.
hard_ttl_sec = 600

This comment was marked as resolved.


# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__LOCK_TTL_SEC
# Distributed lock TTL in seconds. Auto-releases if the lock holder crashes.
lock_ttl_sec = 30

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Check if there's a client network timeout of 30s?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked: the HTTP client for Corpus API requests uses httpx.AsyncClient with a 5-second request timeout (default from merino/utils/http_client.py:12). The lock_ttl_sec = 30 here is for the Redis distributed lock auto-expiry, not for network timeouts.

So the worst case for a single get_or_fetch call with Redis timing out is bounded by the Redis socket_timeout_sec (configured in [default.redis]), not 30 seconds per API call.

raise ValueError(
f"hard_ttl_sec ({self.hard_ttl_sec}) must be greater than "
f"lock_ttl_sec ({self.lock_ttl_sec})"
)

This comment was marked as resolved.

This comment was marked as resolved.



def _build_data_key(
config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str

This comment was marked as resolved.

str(days_offset),
fetch_fn=lambda: self._backend.fetch(surface_id, days_offset),
serialize_fn=lambda items: [item.model_dump(mode="json") for item in items],
deserialize_fn=lambda data: [CorpusItem.model_validate(d) for d in data],

This comment was marked as resolved.

return await self._redis_cache.get_or_fetch(
"scheduled",
surface_id.value,
str(days_offset),

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's verify whether we still need this? Do we need to fetch yesterday's data for the schedule?

config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str
) -> str:
"""Build the Redis key for cached corpus data."""
parts = [config.key_prefix, backend_type, surface_id, *extra]

This comment was marked as resolved.

"expires_at": time.time() + soft_ttl_sec,
"data": data,
}
return orjson.dumps(envelope)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check if there's a pattern to create an envelope or have a separate expiration key?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with Herraj: no existing pattern. Envelope is probably easiest.

if await self._try_acquire_lock(lock_key):
return await self._revalidate(data_key, lock_key, fetch_fn, serialize_fn)
# Another pod is populating; wait briefly then retry Redis
await asyncio.sleep(0.1)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return 201 if this happens?

async def shutdown(self) -> None:
"""Close resources owned by this provider."""
if self._cache_adapter is not None:
await self._cache_adapter.close()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this the first time we need to do a shutdown in the curated-recommendations provider?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because previous resources did not use external connections.

return envelope["expires_at"], envelope["data"]


class _RedisCorpusCache:

This comment was marked as resolved.

try:
await _provider.shutdown()
except NameError:
pass

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any benefit to severing GCS connections here? Other providers may do this. What happens if we don't?


- `merino/curated_recommendations/corpus_backends/redis_cache.py` — cache logic
- `merino/curated_recommendations/__init__.py` — wiring (`_init_corpus_cache`)
- `merino/configs/default.toml` — config section with defaults and documentation

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key files is probably not useful; I'd remove it.

@mmiermans mmiermans force-pushed the hnt-1890-shared-curated-recommendation-cache branch from 8a17c0d to 7791e1c Compare April 10, 2026 18:03
Comment thread merino/configs/stage.toml
domain_data_source = "remote"

[stage.curated_recommendations.corpus_cache]
cache = "redis"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially roll out to the staging environment for testing.

@mmiermans

Copy link
Copy Markdown
Collaborator Author

Re: days_offset comment

Verified: days_offset is never called with a non-zero value. The only call site is provider.py:195 which uses the default (days_offset=0).

The parameter is part of the pre-existing ScheduledSurfaceProtocol interface, so the cache wrapper implements the same signature. Since it's always 0, the cache key doesn't need to include it.

@mmiermans mmiermans requested a review from Herraj April 10, 2026 21:21
@mmiermans mmiermans marked this pull request as ready for review April 10, 2026 22:14
Comment thread docs/SUMMARY.md
- [Test Failures in CI](./operations/testfailures.md)
- [Configs](./operations/configs.md)
- [Elasticsearch](./operations/elasticsearch.md)
- [Curated Recommendations](./operations/curated-recommendations/corpus-cache.md)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious whether we'd like to see more documentation around the Curated Recommendations endpoint. I think yes, but out-of-scope for this PR.


T = TypeVar("T")

BackendType = Literal["scheduled", "sections"]

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be an enum, although I don't know if it matters. The scheduled backend is going away in 2026Q2.

if is_fresh and items_data is not None:
try:
result = deserialize_fn(items_data)
self._metrics.increment("corpus_cache.hit")

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't differentiate between the two backend endpoints, because scheduled is going away in Q2, and the main goal is to reduce total Apollo calls.

# quickly and reducing backend load. With the L2 Redis cache, the total worst-case staleness
# is L1 TTL + L2 soft TTL (~120s), comparable to the previous L1-only TTL of ~120s.
cache_time_to_live_min = timedelta(seconds=50)
cache_time_to_live_max = timedelta(seconds=70)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reverts #1305, and keeps the total TTL including Redis the same at about 2 minutes.

Comment thread docs/data.md Outdated
Comment on lines +198 to +203
- `corpus_cache.hit` -
A counter for Redis cache hits (fresh data returned without calling the Corpus API).
- `corpus_cache.stale` -
A counter for stale Redis cache entries that trigger revalidation.
- `corpus_cache.miss` -
A counter for Redis cache misses (no data in Redis).

This comment was marked as resolved.

@jpetto jpetto left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requesting splitting up this PR - i think focusing on the merino/cache files (and relevant docs) would be a good first step to break these changes into smaller, easier to review chunks. set the groundwork for the actual implementation in one PR, then implement in a service in a follow-up.

# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__CACHE
# Shared corpus cache backend. "redis" enables Redis as an L2 cache
# between in-memory SWR (L1) and the Corpus GraphQL API. "none" disables it.
cache = "none"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it's worth mentioning that a value of none will increase corpus load by the number of k8s pods (300).


# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__SOFT_TTL_SEC
# Soft TTL in seconds. Balances propagating time-sensitive content quickly vs backend load.
soft_ttl_sec = 60

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment about the practical use of this value?


# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__HARD_TTL_SEC
# Hard TTL in seconds. Safety net so data survives extended API outages (1 day).
hard_ttl_sec = 86400

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here - add a comment on the practical use.

mmiermans and others added 2 commits April 27, 2026 12:21
Introduce a shared Redis cache layer between the in-memory SWR cache (L1)
and the Corpus GraphQL API to reduce API request volume by ~300x across
Merino pods.

Key design:
- Distributed stale-while-revalidate: one pod revalidates while others
  serve stale data, using SET NX EX for distributed locking
- Embedded soft TTL envelope (2min soft / 10min hard) stored as orjson
- All Redis errors gracefully fall through to the wrapped backend
- Gated behind config flag (corpus_cache.backend = "redis" / "none")
- Adds set_nx and delete to CacheAdapter protocol

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix lock release on serialize_fn failure: wrap both fetch and serialize
  in the try/except that releases the distributed lock
- Add Redis adapter cleanup on shutdown: store adapter in module-level var,
  close it in new shutdown() function called from main.py lifespan
- Add config validation: CorpusCacheConfig.__post_init__ ensures
  hard_ttl > soft_ttl and hard_ttl > lock_ttl
- Add tests: config validation, serialize error lock release, empty list
  caching behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mmiermans and others added 28 commits April 27, 2026 12:21
- Fix TypeError crash on corrupted envelope with non-numeric expires_at:
  now caught and treated as cache miss instead of 500ing the request
- Fix lock leak on asyncio.CancelledError: use try/finally in _revalidate
  so lock is released even for BaseException subclasses
- Broaden _redis_set exception catch to all Exception (not just
  CacheAdapterError) so serialization errors are also suppressed
- Add graceful fallback in _init_corpus_cache: Redis init failure no
  longer aborts provider initialization
- Guard shutdown() against uninitialized provider (NameError on startup
  failure)
- Add defensive get_provider()/get_legacy_provider() with RuntimeError
  instead of silent NameError
- Add pragma: no cover to new protocol methods for consistency
- Add tests for non-numeric expires_at and CancelledError lock release

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace _make_corpus_item helper with generate_corpus_item from
test_sections.py. Keep _make_corpus_section as a thin wrapper since
no shared CorpusSection fixture exists yet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- docs/operations/curated-recommendations/corpus-cache.md: operational
  doc with inline mermaid flow diagram, design decisions table, config
  reference, and rollout steps
- docs/SUMMARY.md: add curated recommendations section to nav
- PR_DESCRIPTION.md: PR description following repo template

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On cache miss when the lock is held by another pod, instead of calling
the Corpus API directly (which defeats cross-pod coalescing), wait 0.1s
and retry Redis. If the lock winner has written data, return it. If not,
raise BackendError — the L1 SWR layer will handle the error and retry
on the next request.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When Redis is down or the lock holder is slow, the L2 cache now falls
back to calling the Corpus API directly instead of raising BackendError
(which would surface as empty NewTab results). This ensures the optional
L2 cache never degrades availability below what L1+API provide. The L1
SWR cache's per-pod asyncio.Lock coalescing prevents stampede.

Also narrows the NameError catch in shutdown() to avoid masking bugs,
and removes trailing blank line at EOF in tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Circuit breaker skips Redis after consecutive failures, recovers after timeout
- Cold-miss lock losers get 503 + Retry-After instead of piling up connections
- Replace dataclass validation with Dynaconf validators
- Use typed BackendType and SurfaceId enum for cache keys
- Increase hard TTL to 1 day (86400s) to survive extended API outages
- Update corpus-cache docs with new behaviors and config table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove redundant L2 stale-while-revalidate paragraph
- Fix circuit breaker description: full close, not half-open
- Fix design decisions: Redis consulted on L1 miss or stale
- Write out Stale-While-Revalidate fully (no SWR abbreviation)
- Describe both L1 and L2 as stale-while-revalidate
- Move retry/503 nodes inside background task in diagram
- Add motivation for cacheability of corpus requests
- Remove rollout section (ephemeral, now in PR description)
- Fix ruff formatting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ker_skip)

Emits counters at key decision points in the cache lookup:
- corpus_cache.hit: fresh Redis hit, no API call needed
- corpus_cache.stale: stale entry triggered revalidation
- corpus_cache.miss: no data in Redis
- corpus_cache.circuit_breaker_skip: Redis skipped due to open circuit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Redis timeout caveat

- Show L1 asyncio.Lock in diagram between L1 miss and L2 check
- Increase cold-miss retry delay from 100ms to 500ms (P95 API latency is ~217ms)
- Document intra-pod coalescing via asyncio.Lock in cold-miss prose
- Document Redis timeout caveat: lock holder blocks, slowing circuit breaker

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move asyncio.Lock inside L1 subgraph for better layout
- Rename terminal nodes to "Return data + update L1" consistently
- Simplify 503 label to "503 Service Unavailable"
- Pre-declare BackendError node near response nodes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Merge FRESH/STALE L1 responses into single "Respond with data" node
- Consolidate three "Return data + update L1" terminal nodes into one
- Combine "STALE" and "MISS" L2 branches into "STALE or MISS"
- Reduces node count from 18 to 14, producing a much more compact layout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… API fallthrough

- Remove Retry-After: 60 header from 503 response (Firefox doesn't parse it)
- Circuit breaker open now raises CorpusCacheUnavailable (503) instead of
  falling through to the Corpus API. L1 serves stale data in steady state,
  so this only affects cold starts with a degraded Redis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Uses testcontainers AsyncRedisContainer following existing patterns
(weather, finance, yelp). Tests:
- Cache miss then hit: verifies backend is called once, second call served from Redis
- Distributed lock: concurrent callers on same key, only one fetches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduces chance of sequential execution masking concurrency bugs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The L1 asyncio.Lock already limits Redis traffic to one coroutine per
cache entry per pod, making a circuit breaker unnecessary. In steady
state, L1 serves stale data and only the background revalidation task
hits Redis. On cold starts, the lock coalesces concurrent requests.

Removes ~180 lines of circuit breaker implementation, configuration,
tests, and documentation. Adds a "No circuit breaker" design decision
explaining the rationale.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Staging has a Redis 7.2 instance (merino-stagepy in moz-fx-merino-nonprod-ee93).
This enables the L2 cache there so we can validate before production.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- All logger.warning in redis_cache.py → logger.error so Sentry captures
  data corruption, serialization bugs, and Redis connectivity issues.
- L2 soft TTL: 120s → 60s. L1 TTL: 110s → 60s. Total worst-case
  staleness stays comparable (~120s vs ~110s), but average staleness
  drops because any pod's revalidation refreshes L2 for all pods.
- Improved TTL comments in default.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move CorpusCacheUnavailable from merino/exceptions.py to redis_cache.py
  (co-located with the code that raises it, matching CorpusGraphQLError pattern)
- Remove Configuration section from corpus-cache.md (duplicates default.toml)
- Parametrize key builder tests, add log assertions for silent error paths
- Expand "SWR" to "stale-while-revalidate" in docstrings
- Tighten concurrent lock integration test to assert all callers succeed
- Clarify test docstrings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wiring-only module that creates GCS clients, HTTP clients, and Redis
connections. Not unit-testable in a meaningful way; exercised by
integration tests and the running app.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cover get, set, sadd, sismember, scard, close (primary != replica),
register_script, and run_script. Brings merino/cache/redis.py from
65% to 97% coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace opportunistic RedisAdapter coverage (sadd, sismember, scard,
register_script, run_script, get, set, close) with targeted tests for
_init_corpus_cache which this PR introduced. Covers all 3 paths:
disabled, redis-enabled, and error fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace corpus_cache.hit/stale/miss with a single `cache` metric
using tags: name=corpus, result=hit|stale|miss.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mmiermans mmiermans force-pushed the hnt-1890-shared-curated-recommendation-cache branch from 9e644e4 to 71315de Compare April 27, 2026 19:21
…tings

Adds one-line annotations to cache, soft_ttl_sec, and hard_ttl_sec describing
their real-world effects (Apollo volume / Corpus API load, avg propagation
time, max stale-data age during outages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants