
fix: revert buggy recommendation caching logic causing latency spike#11

Closed
AlexanderWert wants to merge 1 commit into ai-demo from fix/revert-recommendation-cache-bug

Conversation

@AlexanderWert
Owner

Problem

An active alert was triggered: Avg. latency for GET /api/recommendations on frontend-proxy reached 435 ms (threshold: 200 ms).

Root Cause Analysis

Automated investigation traced the latency spike through the service topology:

Call chain: load-generator → frontend-proxy → frontend → recommendation → product-catalog

Change point detected at: 2026-03-23T10:08:00Z

  • frontend-proxy latency spiked to ~647 ms (from ~42 ms baseline)
  • recommendation latency spiked to ~299 ms (from ~5 ms baseline)
  • product-catalog latency spiked to ~4 ms (from ~1 ms baseline)

Correlation analysis identified git.sha: 72f15750cce77d4414888363ed52087b0b0ee4b4 as the root cause, deployed on pod recommendation-757f645fb6-tkn9d (started 2026-03-23T10:06:55Z).

Bug in Commit 72f1575

The introduced get_recommendations_ids() function contains a critical memory-growth bug:

for x in cat_response.products:
    ids_to_add.extend(cached_ids)   # ← BUG: appends the ENTIRE cache once per product
    ids_to_add.append(x.id)
    if len(ids_to_add) + len(cached_ids) < MAX_CACHED_IDS:
        cached_ids = cached_ids + ids_to_add   # cache compounds on every iteration

On every cache miss, ids_to_add is extended with the full cached_ids list for each product in the catalog. This causes exponential memory growth up to MAX_CACHED_IDS = 2,000,000 entries, making every subsequent call increasingly slow.
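To make the growth rate concrete, here is a minimal, self-contained reproduction of the loop's pattern. The function name, the stand-in product IDs, and the three-call driver loop are illustrative only (the real code operates on gRPC catalog responses); the loop body mirrors the snippet above, and MAX_CACHED_IDS is the cap named in this PR.

```python
# Hypothetical reproduction of the compounding-cache bug from commit 72f1575.
MAX_CACHED_IDS = 2_000_000

def buggy_cache_update(cached_ids, product_ids):
    """Mirrors the buggy loop: each product re-appends the entire cache,
    so the cache size compounds per product, per call."""
    ids_to_add = []
    for pid in product_ids:
        ids_to_add.extend(cached_ids)   # entire cache appended per product
        ids_to_add.append(pid)
        if len(ids_to_add) + len(cached_ids) < MAX_CACHED_IDS:
            cached_ids = cached_ids + ids_to_add
    return cached_ids

# Simulate three cache misses against a tiny 5-product catalog.
cached = []
sizes = []
for call in range(3):
    cached = buggy_cache_update(cached, list(range(5)))
    sizes.append(len(cached))
    print(f"after call {call + 1}: cache holds {len(cached)} ids")
```

Even with only 5 products, the cache explodes from 88 entries after one call to roughly 700,000 after three, which is why latency degraded on every subsequent request until the cap was hit.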

Fix

Reverts to the original, correct implementation that directly calls product_catalog_stub.ListProducts() and extracts product IDs without any caching.
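A sketch of the shape of the reverted logic, for reference. The helper name get_product_ids and the FakeCatalogStub are stand-ins for illustration (the real service uses the demo's generated gRPC stub and message types); the point is that IDs are rebuilt from the catalog response on every call, with no cache to grow.

```python
from types import SimpleNamespace

def get_product_ids(product_catalog_stub):
    # One catalog call per request; product IDs are extracted directly
    # from the response, with no accumulating state.
    cat_response = product_catalog_stub.ListProducts()
    return [x.id for x in cat_response.products]

# Fake stub standing in for the generated gRPC client, illustration only.
class FakeCatalogStub:
    def ListProducts(self):
        return SimpleNamespace(
            products=[SimpleNamespace(id=f"P{i}") for i in range(5)]
        )

ids = get_product_ids(FakeCatalogStub())
print(ids)
```

Because the function holds no state between calls, repeated requests do the same bounded amount of work, which is what restores the flat baseline latency.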

Impact

  • Resolves latency spike on recommendation service
  • Restores frontend-proxy GET /api/recommendations latency to baseline (~42 ms)

…and latency spike

The get_recommendations_ids function introduced in commit 72f1575 contains
a critical bug: on every cache miss, it extends ids_to_add with the entire
cached_ids list for each product, causing exponential memory growth up to
MAX_CACHED_IDS (2,000,000 entries). This results in massive latency spikes
on /oteldemo.RecommendationService/ListRecommendations.

Reverts to the original direct product catalog call.