Observed behavior
TL;DR
We've been facing CPU throttling, high memory consumption, and increasing latency in the hub clusters.
After we turned compression off on the leafnode connections, all of these problems disappeared.
From our initial analysis, we think a snowball effect occurred: once memory usage reached a certain threshold, the garbage collector was triggered, which in turn increased CPU usage. As the CPU became throttled, RTT started to grow for some connections. This led to an automatic upgrade to a higher compression level, which further increased CPU load and RTT, perpetuating the cycle.
We encountered this issue on our production servers, but were able to reproduce it in our development setup.
All the data below refers to the development setup.
Note: In the development setup, tracing and debugging are enabled. We repeated the same test with trace and debug set to false, without noticing any changes in behavior.
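For reference, the workaround mentioned above (disabling compression on the leafnode connections) can be sketched roughly as follows on the leaf side. This is a minimal fragment rather than our actual configuration: the remote URL is a hypothetical placeholder, and the behavior described in the comments is our understanding of the leafnode `compression` option, which defaults to `s2_auto`.

```
# Leaf-side fragment (sketch). The URL below is a hypothetical placeholder.
leafnodes {
  remotes [
    {
      url: "nats-leaf://hub.example.internal:7422"
      # Disable compression so the level is never raised based on RTT.
      # With the default (s2_auto), the server picks the S2 level from the
      # measured RTT, which is the feedback loop we suspect.
      compression: off
    }
  ]
}
```

Pinning a fixed mode such as `s2_fast` instead of `off` may be another way to avoid the RTT-driven level changes, but we have not tested that.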
Observability
CPU:

Memory:
Observation: We stopped the bench and started it again. During the period when the bench was stopped, memory usage never dropped, while CPU usage showed a noticeable decrease.

Latency:

Throttling:

Profiling
Development: profiles.zip
Production: profiling_prod.zip
Expected behavior
The server should sustain this load without CPU throttling, growing memory usage, or increasing latency.
Server and client version
~ $ nats --version
v0.1.2-0.20250310115758-f4eda5b1b7a3
~ $ nats-server --version
nats-server: v2.11.4
Host environment
EKS at AWS, nodes running Bottlerocket.

~ $ cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.22.0
PRETTY_NAME="Alpine Linux v3.22"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"
Steps to reproduce
This is our setup.

We generated the load with:
nats bench pub -s nats://$T@localhost:4221 test --msgs 1000000000 --clients=10 --multisubject --multisubjectmax 100000
We consumed messages with:
nats sub -s nats://$T@localhost:4220 "test.*"