
fix: revert buggy recommendation caching logic causing latency spike#11

Closed
AlexanderWert wants to merge 1 commit into ai-demo from fix/revert-recommendation-cache-bug

Conversation

@AlexanderWert
Owner

Problem

An active alert was triggered: Avg. latency for GET /api/recommendations on frontend-proxy reached 435 ms (threshold: 200 ms).

Root Cause Analysis

Automated investigation traced the latency spike through the service topology:

Call chain: load-generator → frontend-proxy → frontend → recommendation → product-catalog

Change point detected at: 2026-03-23T10:08:00Z

  • frontend-proxy latency spiked to ~647 ms (from ~42 ms baseline)
  • recommendation latency spiked to ~299 ms (from ~5 ms baseline)
  • product-catalog latency spiked to ~4 ms (from ~1 ms baseline)

Correlation analysis identified git.sha: 72f15750cce77d4414888363ed52087b0b0ee4b4 as the root cause, deployed on pod recommendation-757f645fb6-tkn9d (started 2026-03-23T10:06:55Z).

Bug in Commit 72f1575

The introduced get_recommendations_ids() function contains a critical memory-growth bug:

for x in cat_response.products:
    ids_to_add.extend(cached_ids)   # ← BUG: appends the ENTIRE cache once per product
    ids_to_add.append(x.id)
    if len(ids_to_add) + len(cached_ids) < MAX_CACHED_IDS:
        cached_ids = cached_ids + ids_to_add   # cache compounds on every iteration

On every cache miss, ids_to_add is extended with the full cached_ids list for each product in the catalog. This causes exponential memory growth up to MAX_CACHED_IDS = 2,000,000 entries, making every subsequent call increasingly slow.
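To make the growth rate concrete, here is a minimal, self-contained reproduction of the loop's pattern. The function name, the stand-in product IDs, and the three-call driver loop are illustrative only (the real code operates on gRPC catalog responses); the loop body mirrors the snippet above, and MAX_CACHED_IDS is the cap named in this PR.

```python
# Hypothetical reproduction of the compounding-cache bug from commit 72f1575.
MAX_CACHED_IDS = 2_000_000

def buggy_cache_update(cached_ids, product_ids):
    """Mirrors the buggy loop: each product re-appends the entire cache,
    so the cache size compounds per product, per call."""
    ids_to_add = []
    for pid in product_ids:
        ids_to_add.extend(cached_ids)   # entire cache appended per product
        ids_to_add.append(pid)
        if len(ids_to_add) + len(cached_ids) < MAX_CACHED_IDS:
            cached_ids = cached_ids + ids_to_add
    return cached_ids

# Simulate three cache misses against a tiny 5-product catalog.
cached = []
sizes = []
for call in range(3):
    cached = buggy_cache_update(cached, list(range(5)))
    sizes.append(len(cached))
    print(f"after call {call + 1}: cache holds {len(cached)} ids")
```

Even with only 5 products, the cache explodes from 88 entries after one call to roughly 700,000 after three, which is why latency degraded on every subsequent request until the cap was hit.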

Fix

Reverts to the original, correct implementation that directly calls product_catalog_stub.ListProducts() and extracts product IDs without any caching.
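A sketch of the shape of the reverted logic, for reference. The helper name get_product_ids and the FakeCatalogStub are stand-ins for illustration (the real service uses the demo's generated gRPC stub and message types); the point is that IDs are rebuilt from the catalog response on every call, with no cache to grow.

```python
from types import SimpleNamespace

def get_product_ids(product_catalog_stub):
    # One catalog call per request; product IDs are extracted directly
    # from the response, with no accumulating state.
    cat_response = product_catalog_stub.ListProducts()
    return [x.id for x in cat_response.products]

# Fake stub standing in for the generated gRPC client, illustration only.
class FakeCatalogStub:
    def ListProducts(self):
        return SimpleNamespace(
            products=[SimpleNamespace(id=f"P{i}") for i in range(5)]
        )

ids = get_product_ids(FakeCatalogStub())
print(ids)
```

Because the function holds no state between calls, repeated requests do the same bounded amount of work, which is what restores the flat baseline latency.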

Impact

  • Resolves latency spike on recommendation service
  • Restores frontend-proxy GET /api/recommendations latency to baseline (~42 ms)

…and latency spike

The get_recommendations_ids function introduced in commit 72f1575 contains
a critical bug: on every cache miss, it extends ids_to_add with the entire
cached_ids list for each product, causing exponential memory growth up to
MAX_CACHED_IDS (2,000,000 entries). This results in massive latency spikes
on /oteldemo.RecommendationService/ListRecommendations.

Reverts to the original direct product catalog call.