HNT-1890 (4/4): wire RedisCorpusCache into providers + enable in staging by mmiermans · Pull Request #1441 · mozilla-services/merino-py

mmiermans · 2026-04-27T19:52:13Z

PR 4/4 to implement shared caching of curated recommendations between pods. This PR makes the shared cache live in staging.

Stack: 1/4 infra → 2/4 impl → 3/4 integration tests → 4/4 (this).

⚠️ Base of this PR is hnt-1890-cache-3-integration-tests (PR 3/4), not main. Merge order: 1 → 2 → 3 → 4.

References

JIRA: HNT-1890

Description

Wires RedisCorpusCache into the curated-recommendations subsystem and turns it on in staging. Once the cache is also enabled in production, the ~300 production pods stop hitting the Pocket Corpus GraphQL API independently: one pod fetches and the rest read from Redis. This PR enables it in staging only; production is a separate config change.
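The "one pod fetches, the rest read from Redis" idea can be sketched as a read-through cache. This is a minimal illustration, not the RedisCorpusCache implementation: FakeSharedCache, CorpusApi, read_through, and the "NEW_TAB_EN_US" key are hypothetical stand-ins, and pods are simulated one after another, so the coordination needed for concurrent cold misses is left out.

```python
import asyncio
from typing import Optional


class FakeSharedCache:
    """In-memory stand-in for the shared Redis cache (hypothetical)."""

    def __init__(self) -> None:
        self._data: dict = {}

    async def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    async def set(self, key: str, value: str) -> None:
        self._data[key] = value


class CorpusApi:
    """Stand-in for the Corpus GraphQL API that counts upstream calls."""

    def __init__(self) -> None:
        self.calls = 0

    async def fetch(self, key: str) -> str:
        self.calls += 1
        return f"recommendations for {key}"


async def read_through(cache: FakeSharedCache, api: CorpusApi, key: str) -> str:
    """Return the shared value, fetching upstream only when the cache is empty."""
    cached = await cache.get(key)
    if cached is not None:
        return cached
    value = await api.fetch(key)
    await cache.set(key, value)
    return value


async def simulate(pod_count: int) -> int:
    """Run pod_count sequential reads and report how many hit the upstream API."""
    cache, api = FakeSharedCache(), CorpusApi()
    for _ in range(pod_count):
        await read_through(cache, api, "NEW_TAB_EN_US")
    return api.calls
```

With 300 simulated pods, only the first read reaches the upstream stand-in; the remaining 299 are served from the shared cache.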

What's wired

merino/curated_recommendations/__init__.py: when cache = "redis", builds a RedisAdapter via create_redis_clients and wraps ScheduledSurfaceBackend / SectionsBackend with their RedisCached* counterparts. Adds shutdown() to close the adapter cleanly.
merino/main.py: lifespan registration, plus a global FastAPI exception handler that converts CorpusCacheUnavailable into HTTP 503.
corpus_backends/{scheduled_surface,sections}_backend.py: hooks the existing L1 stale-while-revalidate (SWR) cache to consult the L2 cache during revalidation. L1 SWR TTLs are cut roughly in half (110–130s → 50–70s), because L2 now absorbs the load that the shorter TTLs would otherwise generate. Reviewers: this is the only behavior change for clients running cache = "none", so confirm it's acceptable.
merino/configs/stage.toml: sets cache = "redis" in staging.
tests/unit/curated_recommendations/test_init.py: covers the cache="none", cache="redis", and missing-Redis-config paths.
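A rough sketch of how an L1 stale-while-revalidate cache might consult L2 during revalidation, with a jittered 50–70s TTL. L1SwrCache, l2_get, and the injectable clock are assumptions made for illustration; the real backends may structure this differently, and this sketch does not coordinate concurrent cold starts.

```python
import asyncio
import random
import time


class L1SwrCache:
    """Single-entry L1 cache that refreshes from L2 in the background (sketch)."""

    def __init__(self, l2_get, ttl_range=(50.0, 70.0), clock=time.monotonic):
        self._l2_get = l2_get        # hypothetical coroutine function that reads L2
        self._ttl_range = ttl_range  # jittered so entries don't all expire together
        self._clock = clock
        self._value = None
        self._expires_at = 0.0
        self._refresh_task = None

    async def _revalidate(self):
        self._value = await self._l2_get()
        self._expires_at = self._clock() + random.uniform(*self._ttl_range)

    async def get(self):
        if self._value is None:
            # Cold start: nothing to serve yet, so wait for L2.
            await self._revalidate()
        elif self._clock() >= self._expires_at and (
            self._refresh_task is None or self._refresh_task.done()
        ):
            # Stale: schedule a background refresh and serve the old value now.
            self._refresh_task = asyncio.create_task(self._revalidate())
        return self._value


async def demo():
    now = [0.0]
    versions = iter(["v1", "v2"])

    async def l2_get():
        return next(versions)

    cache = L1SwrCache(l2_get, clock=lambda: now[0])
    first = await cache.get()   # cold read waits for L2
    now[0] = 100.0              # move past any TTL in the 50-70s range
    stale = await cache.get()   # stale value is served, refresh is scheduled
    await cache._refresh_task   # let the background refresh finish
    fresh = await cache.get()
    return first, stale, fresh
```

The caller that triggers revalidation still gets the stale value immediately; only later reads see the refreshed one.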

Implementation decisions

| Decision | Choice | Why |
| --- | --- | --- |
| L1 lock | `asyncio.Lock` per cache entry | Coordinates concurrent revalidation within a single pod and keeps Redis traffic to one coroutine per entry per pod |
| L1 TTL change | 110–130s → 50–70s | Tighter L1 surfaces fresh content faster; L2 prevents this from becoming Corpus API load |
| Staging rollout | `cache = "redis"` in `stage.toml` | Soaks in staging before production. Capacity confirmed with Nan: 5 MB storage, 40 IOPS |
| Cold miss on lock-held | 503 Service Unavailable | Prevents connection pile-up; Firefox shows cached NewTab content |
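The L1 lock and cold-miss rows can be illustrated together. This is a sketch under stated assumptions, not the code in this PR: FakeL2, PodCache, and status_for are hypothetical names, the CorpusCacheUnavailable class is redefined locally for the example, and the L2 stand-in treats every missing key as a cold miss whose fetch lock is held elsewhere.

```python
import asyncio
from collections import defaultdict


class CorpusCacheUnavailable(Exception):
    """Raised on a cold miss (local stand-in for the exception named in the PR)."""


class FakeL2:
    """In-memory stand-in for the Redis-backed L2 cache (hypothetical)."""

    def __init__(self, data=None):
        self.data = dict(data or {})
        self.reads = 0

    async def get(self, key):
        self.reads += 1
        await asyncio.sleep(0)  # yield control, as a network round trip would
        if key not in self.data:
            # Assumption: a missing key means another pod is still fetching it.
            raise CorpusCacheUnavailable(key)
        return self.data[key]


class PodCache:
    """Per-pod cache with one asyncio.Lock per entry."""

    def __init__(self, l2):
        self._l2 = l2
        self._values = {}
        self._locks = defaultdict(asyncio.Lock)

    async def get(self, key):
        if key in self._values:
            return self._values[key]
        async with self._locks[key]:
            # Re-check: another coroutine may have filled the entry while we waited.
            if key in self._values:
                return self._values[key]
            value = await self._l2.get(key)
            self._values[key] = value
            return value


def status_for(exc):
    """Map the cold-miss exception to HTTP 503; anything else becomes 500."""
    return 503 if isinstance(exc, CorpusCacheUnavailable) else 500


async def concurrent_reads(n):
    l2 = FakeL2({"sections": "payload"})
    pod = PodCache(l2)
    results = await asyncio.gather(*(pod.get("sections") for _ in range(n)))
    return results, l2.reads


async def cold_miss_status():
    pod = PodCache(FakeL2())
    try:
        await pod.get("sections")
    except CorpusCacheUnavailable as exc:
        return status_for(exc)
    return 200
```

Ten concurrent reads of the same entry produce a single L2 read because the waiting coroutines find the value on the re-check. A cold miss raises quickly instead of queueing callers behind a slow fetch.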

Rollout plan (post-merge)

Staging: enabled by this PR's stage.toml change. Monitor corpus API QPS drop and Redis hit rate.
Production: separate config change after the staging soak. Either set cache = "redis" in production.toml or override via the MERINO_CURATED_RECOMMENDATIONS__CORPUS_CACHE__CACHE=redis env var.
Alerting: add a request-volume alert in Apollo based on the new (lower) baseline so we detect cache failures (pods falling back to direct API calls).
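To show how the production override name relates to the config key, here is a small sketch of the double-underscore nesting convention the env var appears to follow (as used by Dynaconf-style settings loaders). env_to_config_path and apply_override are hypothetical helpers, not Merino or Dynaconf functions, and the nested key layout is inferred from the variable name.

```python
def env_to_config_path(name, prefix="MERINO"):
    """Split PREFIX_A__B__C into the nested key path ['a', 'b', 'c']."""
    head = f"{prefix}_"
    if not name.startswith(head):
        raise ValueError(f"{name!r} does not start with {head!r}")
    return [part.lower() for part in name[len(head):].split("__")]


def apply_override(config, name, value, prefix="MERINO"):
    """Write value into the nested dict at the path encoded by the env var name."""
    *parents, leaf = env_to_config_path(name, prefix)
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Assumed layout: cache lives under curated_recommendations.corpus_cache.
config = {"curated_recommendations": {"corpus_cache": {"cache": "none"}}}
apply_override(
    config, "MERINO_CURATED_RECOMMENDATIONS__CORPUS_CACHE__CACHE", "redis"
)
```

Under that assumption, the override and the production.toml edit would target the same setting, so only one of them is needed.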

PR Review Checklist

Conforms to Contribution Guidelines
PR title starts with JIRA reference
[load test: (abort|skip|warn)] keywords applied (consider for the wire-up commit)
Documentation updated (landed in PR 2/4)
Test coverage expanded

┆Issue is synchronized with this Jira Task

This PR makes the shared cache live:

- merino/curated_recommendations/__init__.py: lifecycle wiring. When cache="redis", builds a RedisAdapter via create_redis_clients and wraps the ScheduledSurfaceBackend / SectionsBackend with their RedisCached* counterparts. Adds shutdown() to close the adapter.
- merino/main.py: lifespan integration + a global FastAPI exception handler for CorpusCacheUnavailable -> 503.
- corpus_backends/{scheduled_surface,sections}_backend.py: hook the L1 SWR cache to invoke the L2 cache. L1 SWR TTLs are cut roughly in half (110-130s -> 50-70s) on the assumption L2 absorbs the load this would otherwise generate.
- merino/configs/stage.toml: cache = "redis", enabling the cache in staging. Production is enabled separately after staging soak.
- tests/unit/curated_recommendations/test_init.py: covers the new init paths (cache="none", cache="redis", missing Redis config).

Rollout (post-merge):

1. Stage: enabled by stage.toml in this PR. Watch corpus QPS drop.
2. Prod: separate config change after stage soak (cache = "redis" in production.toml or MERINO_..._CACHE=redis env var).
3. Add a request-volume alert at the new (lower) baseline so we detect Redis cache failures via Apollo metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mmiermans force-pushed the hnt-1890-cache-3-integration-tests branch from 2d9be74 to 46ee57c on April 27, 2026 20:45

mmiermans force-pushed the hnt-1890-cache-4-wireup branch from c323f09 to 40e7bf9 on April 27, 2026 20:46

This was referenced Apr 27, 2026

HNT-1890 (1/4): cache adapter primitives + architecture doc #1438

Open

HNT-1890 (2/4): RedisCorpusCache + corpus_cache config + unit tests #1439

Open

mmiermans wants to merge 1 commit into hnt-1890-cache-3-integration-tests from hnt-1890-cache-4-wireup
