fix: revert buggy recommendation caching logic causing latency spike#11
Closed
AlexanderWert wants to merge 1 commit into ai-demo from
…and latency spike

The `get_recommendations_ids` function introduced in commit 72f1575 contains a critical bug: on every cache miss, it extends `ids_to_add` with the entire `cached_ids` list for each product, causing exponential memory growth up to `MAX_CACHED_IDS` (2,000,000 entries). This results in massive latency spikes on `/oteldemo.RecommendationService/ListRecommendations`. Reverts to the original direct product-catalog call.
## Problem

An active alert was triggered: avg. latency for `GET /api/recommendations` on `frontend-proxy` reached 435 ms (threshold: 200 ms).

## Root Cause Analysis

Automated investigation traced the latency spike through the service topology.

Call chain: `load-generator` → `frontend-proxy` → `frontend` → `recommendation` → `product-catalog`

Change point detected at 2026-03-23T10:08:00Z:

- `frontend-proxy` latency spiked to ~647 ms (from ~42 ms baseline)
- `recommendation` latency spiked to ~299 ms (from ~5 ms baseline)
- `product-catalog` latency spiked to ~4 ms (from ~1 ms baseline)

Correlation analysis identified `git.sha: 72f15750cce77d4414888363ed52087b0b0ee4b4` as the root cause, deployed on pod `recommendation-757f645fb6-tkn9d` (started 2026-03-23T10:06:55Z).

## Bug in Commit 72f1575

The introduced `get_recommendations_ids()` function contains a critical memory-growth bug: on every cache miss, `ids_to_add` is extended with the full `cached_ids` list for each product in the catalog. This causes exponential memory growth up to `MAX_CACHED_IDS = 2,000,000` entries, making every subsequent call increasingly slow.
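The failure mode can be sketched as follows. This is a hypothetical reconstruction from the names in the report (`get_recommendations_ids`, `ids_to_add`, `cached_ids`, `MAX_CACHED_IDS`), not the actual diff; the product catalog is faked as a static list instead of the real gRPC call:

```python
import random

MAX_CACHED_IDS = 2_000_000  # cap named in the report

# Hypothetical stand-in for the catalog returned by product-catalog.
PRODUCT_CATALOG = [f"PRODUCT_{i}" for i in range(10)]

cached_ids: list[str] = []

def get_recommendations_ids_buggy(request_product_ids):
    global cached_ids
    ids_to_add = []
    for product_id in PRODUCT_CATALOG:
        if product_id not in request_product_ids:  # cache "miss"
            # BUG: the ENTIRE cache is appended again for EVERY product,
            # so the cache size multiplies on each call instead of
            # gaining at most one entry per product.
            ids_to_add.extend(cached_ids)
            ids_to_add.append(product_id)
    cached_ids.extend(ids_to_add)
    if len(cached_ids) > MAX_CACHED_IDS:
        cached_ids = cached_ids[:MAX_CACHED_IDS]
    return random.sample(cached_ids, min(5, len(cached_ids)))
```

With 10 catalog products, each call multiplies the cache size by roughly 11 (sizes 10 → 120 → 1330 → …), so the cache reaches the 2,000,000-entry cap within a handful of requests, and every later call pays the cost of copying and sampling that list.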
## Fix

Reverts to the original, correct implementation that directly calls `product_catalog_stub.ListProducts()` and extracts product IDs without any caching.
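The reverted path can be sketched with a fake stub standing in for the generated gRPC client. Apart from `ListProducts`, which the PR text names, the types and field names here are illustrative assumptions:

```python
import random
from dataclasses import dataclass

# Hypothetical stand-ins for the generated gRPC message/stub classes.
@dataclass
class Product:
    id: str

@dataclass
class ListProductsResponse:
    products: list

class FakeProductCatalogStub:
    def ListProducts(self, request):
        # The real stub performs a gRPC call to product-catalog.
        return ListProductsResponse(
            products=[Product(id=f"PRODUCT_{i}") for i in range(10)]
        )

product_catalog_stub = FakeProductCatalogStub()

def get_recommendation_ids(request_product_ids):
    # Reverted behavior: one direct catalog call per request, no cache.
    cat_response = product_catalog_stub.ListProducts(None)
    product_ids = [p.id for p in cat_response.products]
    # Exclude products already in the request, then pick up to 5 at random.
    filtered = [pid for pid in product_ids if pid not in request_product_ids]
    return random.sample(filtered, min(5, len(filtered)))
```

Because nothing is accumulated between requests, each call does a bounded amount of work proportional to the catalog size, which matches the ~5 ms baseline latency cited in the analysis.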
## Impact

Restores the `recommendation` service and the `frontend-proxy` `GET /api/recommendations` latency to baseline (~42 ms).