
CPU spikes upon broker replacement #2400


Closed
antonakospanos opened this issue Oct 9, 2023 · 3 comments

Hello @dpkp @wbarnha,

We have used kafka-python to interact with our Kafka brokers since 2020.

However, during the latest Kafka broker upgrades (i.e. 3.3 => 3.5) we experienced spikes in the CPU utilization of the broker nodes. The same happens upon a hardware failure in the Kafka cluster that results in a broker being replaced with another node.
[screenshot of production CPU utilization]

The issue is mitigated by restarting the consumers in our Python apps. Aiven responded that this is a known issue in apps relying on kafka-python and that we should switch to https://github.com/confluentinc/confluent-kafka-python.

Here is their whole response:

We also noticed that the impacted consumers are those based on the client library kafka-python-2.0.2, which was last updated 3 years ago and is based on an older version of the Kafka protocol. In Kafka 3.3.1, we are aware that this older protocol has lost efficiency, which generally leads to more latency and consumer rebalances. The short-term solution is to relax your configurations as follows:

The long-term solution is to migrate these consumers to the client library confluent-kafka-python-2.2.0, which offers a similar SDK and is up to date.
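
For context, here is roughly what such a migrated consumer would look like with confluent-kafka-python (a minimal sketch; the broker addresses, topic, and group id below are placeholders, not our real configuration):

```python
from confluent_kafka import Consumer

# Placeholder configuration; real bootstrap servers, group id, and security
# settings would come from our deployment.
conf = {
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "group.id": "example-consumer-group",
    "auto.offset.reset": "earliest",
}

consumer = Consumer(conf)
consumer.subscribe(["example-topic"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # returns None if no message arrives within the timeout
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(f"{msg.topic()}[{msg.partition()}] @ {msg.offset()}: {msg.value()}")
finally:
    consumer.close()  # commit final offsets and leave the group cleanly
```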

Is there any development in progress to tackle this issue?

If not, we will have to switch to another Kafka library.

wbarnha (Contributor) commented Oct 9, 2023

Sorry to hear that you're having issues. Unfortunately I'm not in a position to do new releases, so even if I did have a solution, I couldn't get it pushed out into the mainstream.

I do acknowledge that this project still implements an older version of the Kafka protocol. To be honest, aiokafka and confluent-kafka-python are more up to date with the current state of the Kafka protocol and have more underlying optimizations. I'm not surprised that kafka-python isn't handling the newer versions of Kafka so well, since this project is currently geared towards supporting older implementations.

In the meantime, I do recommend switching to a newer Kafka library, if possible. In my personal experience, aiokafka has had the best results for me, but I understand your application(s) may not be geared towards asynchronous implementations. Food for thought.
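
If it helps, here is a minimal aiokafka consumer sketch (topic, group id, and broker address are placeholders):

```python
import asyncio

from aiokafka import AIOKafkaConsumer


async def consume():
    # Placeholder topic, broker, and group id.
    consumer = AIOKafkaConsumer(
        "example-topic",
        bootstrap_servers="broker1:9092",
        group_id="example-consumer-group",
    )
    await consumer.start()
    try:
        # Messages are fetched in the background and yielded as they arrive.
        async for msg in consumer:
            print(msg.topic, msg.partition, msg.offset, msg.value)
    finally:
        await consumer.stop()  # leave the group and release connections


asyncio.run(consume())
```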

dpkp (Owner) commented Feb 13, 2025

Did you see similar CPU spikes in prior broker upgrades? If so, then perhaps client-side protocol changes would be helpful. But assuming you did not, it doesn't make sense to me that this is caused by the client side (still) using an older protocol. Nor does it align with your comment that the problem was "mitigated by restarting the consumers in our python apps." That suggests it is related to internal kafka-python client state, and most likely how we handle backoff/retry.
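
For reference, the client-side knobs I have in mind are the KafkaConsumer reconnect/retry backoff settings. A sketch with illustrative values only (topic, broker, and group id are placeholders, and the numbers are not a recommendation):

```python
from kafka import KafkaConsumer

# Placeholder topic/broker/group; backoff values are illustrative only.
consumer = KafkaConsumer(
    "example-topic",
    bootstrap_servers="broker1:9092",
    group_id="example-consumer-group",
    reconnect_backoff_ms=50,          # initial wait before reconnecting to a failed broker
    reconnect_backoff_max_ms=10000,   # cap on the exponentially increasing reconnect backoff
    retry_backoff_ms=100,             # wait between retries of failed requests
)

for msg in consumer:
    print(msg.topic, msg.partition, msg.offset, msg.value)
```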

dpkp (Owner) commented Feb 14, 2025

I made a number of improvements to backoff/retry here: #2480
