Memory leak in long-running Node.js service, heap grows indefinitely despite manual cleanup #196856
Discussion Type: Question

We're running a Node.js 20 microservice that processes a high-volume event stream (~5k events/sec). After about 6–8 hours in production, heap memory climbs from ~180MB to over 2GB and the process eventually crashes. We're not seeing this in staging (lower traffic), which makes it harder to reproduce locally. We've already:

Heap snapshot diffs in Chrome DevTools point to the rate limiter's `Map`, which a cleanup interval is supposed to prune:

```javascript
setInterval(() => {
  const now = Date.now();
  // Drop every entry whose timestamp is older than the TTL.
  for (const [key, ts] of rateLimiter) {
    if (now - ts > TTL) rateLimiter.delete(key);
  }
}, 60_000);
```

Suspected culprit is key cardinality: client IPs under load are unique enough that the `Map` keeps growing faster than the interval can evict. We're considering switching to a sliding window with a bounded structure or offloading to Redis entirely, but wanted to check whether anyone has hit this pattern before and found a cleaner in-process solution.

Stack: Node.js 20, Fastify 4, running on AWS ECS with a 2GB memory limit.
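For reference, here is a minimal sketch of what a "bounded structure" could look like while staying in-process. It relies on the fact that a JavaScript `Map` iterates in insertion order, so the first key is always the least recently touched one. The class name `BoundedTtlMap`, its options, and the injectable `now` clock are hypothetical and only for illustration; this is not code from the service above.

```javascript
// Hypothetical helper: a Map with a TTL and a hard size cap.
class BoundedTtlMap {
  constructor({ max, ttlMs, now = Date.now }) {
    this.max = max;           // hard cap on the number of keys
    this.ttlMs = ttlMs;       // how long a key counts as "fresh"
    this.now = now;           // injectable clock, handy for tests
    this.entries = new Map(); // key -> last-seen timestamp
  }

  // Record a hit. Returns true if the key was already present and still fresh.
  touch(key) {
    const now = this.now();
    const ts = this.entries.get(key);
    const fresh = ts !== undefined && now - ts <= this.ttlMs;

    // Delete before set so the key moves to the end of the iteration order.
    this.entries.delete(key);
    this.entries.set(key, now);

    // Evict from the front (least recently touched) until under the cap.
    while (this.entries.size > this.max) {
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
    return fresh;
  }

  get size() {
    return this.entries.size;
  }
}
```

With a cap in place, memory is bounded by `max` regardless of how many unique IPs arrive, at the cost of forgetting the least recently seen keys under pressure.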
Replies: 1 comment
Hey, I've dealt with almost the exact same issue on a high-traffic service. The raw `Map` with a cleanup interval is a losing battle at that scale; you're inserting faster than you can evict.

The fix that worked for me was two things:

First, ditch the `Map` entirely and use an LRU cache with a hard size cap. I switched to `lru-cache` and the heap flatlined immediately. It evicts lazily on access instead of scanning on every tick, so it's O(1) and memory-bounded by design:

```javascript
import { LRUCache } from 'lru-cache';

const rateLimiter = new LRUCache({
  max: 100_000,        // hard cap; least recently used entries are evicted automatically
  ttl: 60_000,         // entries go stale 60 s after being set
  ttlAutopurge: false, // no per-entry timers; stale entries are dropped lazily
});
```

Second, and this matters since you're on ECS: each container instance holds its own in-process state, so a single client can bypass the limit by hitting different instances. I moved the authoritative counter to Redis using a sorted set as a sliding window, batched into a single pipeline round-trip to keep latency low.

But Redis on every request at 5k/sec adds up, so I layered it: check the local LRU first, and only hit Redis if the local check passes. Blocked IPs get cached locally for ~10s, which cuts Redis calls by 60–80% in practice since repeat offenders dominate real traffic.

Heap stayed flat after that, even after 24h+ uptime. Hope that helps point you in the right direction!
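The reply doesn't include the layered check itself, so here is a rough sketch of the idea under some loud assumptions. `createLayeredLimiter`, `createSlidingWindow`, and all option names are hypothetical. The local block cache is a plain capped `Map` rather than `lru-cache`, to keep the example dependency-free. The "remote" check is an in-memory stand-in for the Redis sorted-set window. A real version would more likely pipeline commands such as `ZREMRANGEBYSCORE`, `ZADD`, and `ZCARD`, but the exact commands used in the reply are not shown.

```javascript
// In-memory stand-in for the shared (Redis-backed) sliding-window check.
// Assumption: at most `limit` requests per key within `windowMs`.
// Note: this stand-in is itself unbounded; it only exists to make the sketch runnable.
function createSlidingWindow({ limit, windowMs, now = Date.now }) {
  const hits = new Map(); // key -> request timestamps, oldest first
  return async function remoteIsAllowed(key) {
    const t = now();
    // Keep only timestamps that are still inside the window, then add this one.
    const recent = (hits.get(key) ?? []).filter((ts) => t - ts < windowMs);
    recent.push(t);
    hits.set(key, recent);
    return recent.length <= limit;
  };
}

// Layered check: consult the local block cache first, the shared store second.
function createLayeredLimiter({
  remoteIsAllowed,
  blockTtlMs = 10_000,  // how long a blocked IP is remembered locally
  maxBlocked = 100_000, // hard cap on the local block cache
  now = Date.now,
}) {
  const blockedUntil = new Map(); // ip -> expiry timestamp

  return async function isAllowed(ip) {
    const t = now();
    const expiry = blockedUntil.get(ip);
    if (expiry !== undefined) {
      if (t < expiry) return false; // known offender: skip the remote call
      blockedUntil.delete(ip);      // local block has expired
    }

    const allowed = await remoteIsAllowed(ip);
    if (!allowed) {
      // Keep the local cache bounded by dropping the oldest inserted entry.
      if (blockedUntil.size >= maxBlocked) {
        blockedUntil.delete(blockedUntil.keys().next().value);
      }
      blockedUntil.set(ip, t + blockTtlMs);
    }
    return allowed;
  };
}
```

The trade-off in this shape is that a blocked client stays blocked locally for up to `blockTtlMs` even if the shared window would already let it through again, which is usually acceptable for a short TTL like the ~10s mentioned above.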