HNT-1890 (4/4): wire RedisCorpusCache into providers + enable in staging #1441

Open
mmiermans wants to merge 1 commit into hnt-1890-cache-3-integration-tests from hnt-1890-cache-4-wireup

mmiermans commented Apr 27, 2026

PR 4/4 to implement shared caching of curated recommendations between pods. This PR makes the shared cache live in staging.

Stack: 1/4 infra → 2/4 impl → 3/4 integration tests → 4/4 (this).

⚠️ Base of this PR is hnt-1890-cache-3-integration-tests (PR 3/4), not main. Merge order: 1 → 2 → 3 → 4.

References

JIRA: HNT-1890

Description

Wires RedisCorpusCache into the curated-recommendations subsystem and turns it on in staging. Once the cache is also enabled in production, ~300 pods stop querying the Pocket Corpus GraphQL API independently: one pod fetches and the rest read from Redis. This PR enables it in staging only; production is a separate config change.

What's wired

  • merino/curated_recommendations/__init__.py: when cache = "redis", builds a RedisAdapter via create_redis_clients and wraps ScheduledSurfaceBackend / SectionsBackend with their RedisCached* counterparts. Adds shutdown() to close the adapter cleanly.
  • merino/main.py: lifespan registration, plus a global FastAPI exception handler that converts CorpusCacheUnavailable into HTTP 503.
  • corpus_backends/{scheduled_surface,sections}_backend.py: hooks the existing L1 SWR cache so that it consults the L2 cache during revalidation. L1 SWR TTLs are cut roughly in half (110–130s → 50–70s), because L2 now absorbs the extra load the shorter TTLs would otherwise generate. Reviewers: this is the only behavior change for deployments running cache = "none", where there is no L2 and the shorter TTLs mean more frequent Corpus API fetches, so please confirm it's acceptable.
  • merino/configs/stage.toml: sets cache = "redis" in staging.
  • tests/unit/curated_recommendations/test_init.py: covers the cache="none", cache="redis", and missing-Redis-config paths.
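The wrapping described in the first bullet can be pictured with a minimal sketch. `CorpusBackend`, `RedisCachedBackend`, and `build_backend` are hypothetical stand-ins, not the names used in merino, and raising on a missing Redis URL is an assumption: the PR only says that the missing-Redis-config path is tested, not how it behaves.

```python
from dataclasses import dataclass
from typing import Optional, Union


class CorpusBackend:
    """Hypothetical stand-in for ScheduledSurfaceBackend / SectionsBackend."""

    def fetch(self) -> str:
        return "fetched from Corpus API"


@dataclass
class RedisCachedBackend:
    """Hypothetical stand-in for a RedisCached* wrapper."""

    inner: CorpusBackend
    redis_url: str

    def fetch(self) -> str:
        # A real wrapper would consult Redis first; this sketch only
        # demonstrates the wrapping, so it delegates straight through.
        return self.inner.fetch()


def build_backend(
    cache: str, redis_url: Optional[str]
) -> Union[CorpusBackend, RedisCachedBackend]:
    """Pick the backend for a cache mode ("none" or "redis")."""
    backend = CorpusBackend()
    if cache == "none":
        return backend
    if cache == "redis":
        if not redis_url:
            # Assumption: fail loudly when Redis is requested but not configured.
            raise ValueError('cache = "redis" requires a Redis URL')
        return RedisCachedBackend(inner=backend, redis_url=redis_url)
    raise ValueError(f"unknown cache mode: {cache!r}")
```

Because the wrapper exposes the same `fetch()` method as the backend it wraps, callers do not need to know which cache mode is active.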

Implementation decisions

| Decision | Choice | Why |
| --- | --- | --- |
| L1 lock | asyncio.Lock per cache entry | Coordinates concurrent revalidation within a single pod and keeps Redis traffic to one coroutine per entry per pod |
| L1 TTL change | 110–130s → 50–70s | Tighter L1 surfaces fresh content faster; L2 prevents this from becoming Corpus API load |
| Staging rollout | cache = "redis" in stage.toml | Soaks in staging before production. Capacity confirmed with Nan: 5 MB storage, 40 IOPS |
| Cold miss on lock-held | 503 Service Unavailable | Prevents connection pile-up; Firefox shows cached NewTab content |
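To illustrate the "L1 lock" and "L1 TTL change" decisions, here is a simplified sketch of a per-entry `asyncio.Lock` combined with a jittered 50–70s TTL. `L1Entry` and its `revalidate` callback are hypothetical names, and the sketch leaves out the stale-serving half of SWR: callers simply wait for the refreshed value. It is not the merino implementation.

```python
import asyncio
import random
import time
from typing import Awaitable, Callable, Optional, Tuple


class L1Entry:
    """One in-process cache entry guarded by its own lock (hypothetical)."""

    def __init__(
        self,
        revalidate: Callable[[], Awaitable[str]],
        ttl_range: Tuple[float, float] = (50.0, 70.0),
    ) -> None:
        self._revalidate = revalidate  # e.g. a read from the L2 cache
        self._ttl_range = ttl_range
        self._lock = asyncio.Lock()
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def _fresh(self) -> bool:
        return self._value is not None and time.monotonic() < self._expires_at

    async def get(self) -> str:
        if self._fresh():
            return self._value
        async with self._lock:
            # Re-check after acquiring: another coroutine may have refreshed
            # the entry while this one was waiting on the lock.
            if self._fresh():
                return self._value
            self._value = await self._revalidate()
            # A random TTL within the range spreads expiry times out.
            self._expires_at = time.monotonic() + random.uniform(*self._ttl_range)
            return self._value
```

With this shape, any number of concurrent `get()` calls on an expired entry produce a single `revalidate()` call, which matches the "one coroutine per entry per pod" goal stated in the table.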

Rollout plan (post-merge)

  1. Staging: enabled by this PR's stage.toml change. Monitor corpus API QPS drop and Redis hit rate.
  2. Production: separate config change after staging soak. Either set cache = "redis" in production.toml or override via the MERINO_CURATED_RECOMMENDATIONS__CORPUS_CACHE__CACHE=redis env var.
  3. Alerting: add a request-volume alert in Apollo based on the new (lower) baseline so we detect cache failures (pods falling back to direct API calls).
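The env var in step 2 can be read as a path into nested settings. The sketch below assumes that the `MERINO_` prefix is stripped and that each double underscore separates one nesting level (the convention Dynaconf uses); `apply_env_override` is a hypothetical helper for illustration, not part of merino.

```python
from typing import Any, Dict


def apply_env_override(
    config: Dict[str, Any], name: str, value: str, prefix: str = "MERINO_"
) -> None:
    """Write `value` into `config` at the path encoded in env var `name`."""
    if not name.startswith(prefix):
        return  # not one of our settings; leave config untouched
    # "CURATED_RECOMMENDATIONS__CORPUS_CACHE__CACHE" becomes
    # ["curated_recommendations", "corpus_cache", "cache"]
    *parents, leaf = name[len(prefix):].lower().split("__")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


settings = {"curated_recommendations": {"corpus_cache": {"cache": "none"}}}
apply_env_override(
    settings, "MERINO_CURATED_RECOMMENDATIONS__CORPUS_CACHE__CACHE", "redis"
)
```

After the call, `settings["curated_recommendations"]["corpus_cache"]["cache"]` holds `"redis"`, the same value the production.toml option would set.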

PR Review Checklist

  • Conforms to Contribution Guidelines
  • PR title starts with JIRA reference
  • [load test: (abort|skip|warn)] keywords applied (consider for the wire-up commit)
  • Documentation updated (landed in PR 2/4)
  • Test coverage expanded


mmiermans force-pushed the hnt-1890-cache-3-integration-tests branch from 2d9be74 to 46ee57c on April 27, 2026 20:45

This PR makes the shared cache live:

- merino/curated_recommendations/__init__.py: lifecycle wiring. When
  cache="redis", builds a RedisAdapter via create_redis_clients and
  wraps the ScheduledSurfaceBackend / SectionsBackend with their
  RedisCached* counterparts. Adds shutdown() to close the adapter.
- merino/main.py: lifespan integration + a global FastAPI exception
  handler for CorpusCacheUnavailable -> 503.
- corpus_backends/{scheduled_surface,sections}_backend.py: hook the
  L1 SWR cache to invoke the L2 cache. L1 SWR TTLs are cut roughly
  in half (110-130s -> 50-70s) on the assumption L2 absorbs the load
  this would otherwise generate.
- merino/configs/stage.toml: cache = "redis", enabling the cache in
  staging. Production is enabled separately after staging soak.
- tests/unit/curated_recommendations/test_init.py: covers the new
  init paths (cache="none", cache="redis", missing Redis config).

Rollout (post-merge):
1. Stage: enabled by stage.toml in this PR. Watch corpus QPS drop.
2. Prod: separate config change after stage soak (cache = "redis"
   in production.toml or MERINO_..._CACHE=redis env var).
3. Add a request-volume alert at the new (lower) baseline so we
   detect Redis cache failures via Apollo metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>