
Automatic compression selection issues #7037

@renato0307

Observed behavior

TL;DR
We've been facing CPU throttling, high memory consumption, and increasing latency in the hub clusters.

After we turned compression off on the leafnode connections, all of these problems disappeared.
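
For readers who want to try the same workaround, the change might look roughly like the fragment below. This is a sketch, not our exact configuration: the port and the hub URL are placeholders, and it assumes the `compression` option of the `leafnodes` block and of each remote as described in the nats-server documentation.

```
# Hub side (accepts leafnode connections). Port is a placeholder.
leafnodes {
  port: 7422
  compression: off   # instead of the automatic (s2_auto) mode
}

# Leaf side (separate config file). URL is a hypothetical hub address.
leafnodes {
  remotes: [
    {
      url: "nats-leaf://hub.example.internal:7422"
      compression: off
    }
  ]
}
```

Pinning a fixed mode such as `s2_fast` instead of `off` might also avoid the automatic level changes, but we have not tested that.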

From our initial analysis, we think a snowball effect occurred: once memory usage reached a certain threshold, the garbage collector was triggered, which in turn increased CPU usage. As the CPU became throttled, RTT started to grow for some connections. This led to an automatic upgrade to a higher compression level, which further increased CPU load and RTT, perpetuating the cycle.
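
To make the suspected loop concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the 10/50/100 ms thresholds are what we understand the `s2_auto` defaults to be, and the per-level RTT penalties are invented numbers for illustration, not measurements from our clusters. `select_level` and `simulate` are hypothetical helper names, not server APIs.

```python
from bisect import bisect_left

# Assumed s2_auto RTT thresholds in milliseconds (our understanding of the defaults).
THRESHOLDS_MS = (10, 50, 100)
LEVELS = ("uncompressed", "s2_fast", "s2_better", "s2_best")

# Invented RTT penalty (ms) that each level adds on a CPU-throttled server.
PENALTY_MS = {"uncompressed": 0, "s2_fast": 45, "s2_better": 95, "s2_best": 120}


def select_level(rtt_ms, thresholds=THRESHOLDS_MS):
    """Pick a compression level: the first threshold with rtt <= threshold wins."""
    return LEVELS[bisect_left(thresholds, rtt_ms)]


def simulate(base_rtt_ms, penalty_ms=PENALTY_MS, steps=5):
    """Toy feedback loop: the chosen level raises RTT, which drives the next choice."""
    rtt = base_rtt_ms
    history = []
    for _ in range(steps):
        level = select_level(rtt)
        history.append(level)
        # Next RTT is the baseline plus the cost of the level currently in use.
        rtt = base_rtt_ms + penalty_ms[level]
    return history


if __name__ == "__main__":
    print(simulate(5))   # stays below the first threshold
    print(simulate(12))  # crosses the first threshold and keeps escalating
```

In this model a connection that starts just above the first threshold climbs to the highest level and stays there, while one that starts below it never begins compressing. That matches the pattern we think we saw.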

We encountered this issue on our production servers, but were able to reproduce it in our development setup.

All the data below refers to the development setup.

Note: In the development setup, tracing and debugging are enabled. We repeated the same test with trace and debug set to false, without noticing any changes in behavior.

Observability

CPU:

Image

Memory:

Observation: We stopped the bench and started it again. During the period when the bench was stopped, memory usage never dropped, while CPU usage showed a noticeable decrease.

Image

Latency:

Image

Throttling:

Image

Profiling

Development: profiles.zip

Production: profiling_prod.zip

Expected behavior

Normal behavior: the server should be able to sustain this load without CPU throttling, growing memory usage, or rising latency.

Server and client version

~ $ nats --version
v0.1.2-0.20250310115758-f4eda5b1b7a3
~ $ nats-server --version
nats-server: v2.11.4

Host environment

EKS at AWS, nodes running Bottlerocket.

Image

~ $ cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.22.0
PRETTY_NAME="Alpine Linux v3.22"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"

Steps to reproduce

This is our setup.

Image

We generated the load with:

nats bench pub -s nats://$T@localhost:4221 test --msgs 1000000000 --clients=10 --multisubject --multisubjectmax 100000

We consumed messages with:

nats sub -s nats://$T@localhost:4220 "test.*"
