bugfix: fix infinite loop on KafkaAdminClient #2194

hackaugusto · 2021-01-14T14:27:59Z

An infinite loop may happen with the following pattern:

self._send_request_to_node(self._client.least_loaded_node(), request)

The problem happens when self._client's cluster metadata is out of date and
least_loaded_node() returns a node that has been removed from the cluster without
the client being aware of it. When this happens, _send_request_to_node waits
indefinitely for the chosen node to become available, which never happens.

This commit introduces a new method named _send_request_to_least_loaded_node which
handles the case above. It regularly checks whether the target node is still present
in the cluster metadata and, if it is not, chooses a new node.

Notes:

- This does not yet cover every call site of _send_request_to_node; there are some
  other places where similar race conditions may happen.
- The code above does not guarantee that the request itself will be successful, since
  it is still possible for the target node to exit. It does, however, remove the
  infinite loop, which can render client code unusable.
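
The re-selection idea described above can be sketched as follows. This is a minimal, self-contained illustration and not the code in this PR: `FakeClient`, its `known_nodes`/`live_nodes` fields, `send_to_least_loaded_node`, and the `max_polls` bound are hypothetical names introduced here, and the `ready`/`poll` methods only imitate the general shape of a client.

```python
class FakeClient:
    """Hypothetical stand-in for the network client (not kafka-python's API)."""

    def __init__(self, known_nodes, live_nodes):
        self.known_nodes = set(known_nodes)  # cached, possibly stale metadata
        self.live_nodes = set(live_nodes)    # nodes that actually exist

    def least_loaded_node(self):
        # Simplification: pick the lowest id from the cached metadata.
        return min(self.known_nodes) if self.known_nodes else None

    def ready(self, node_id):
        # A node that was removed from the cluster never becomes ready.
        return node_id in self.live_nodes

    def poll(self):
        # Simulates a metadata refresh arriving while we wait.
        self.known_nodes = set(self.live_nodes)


def send_to_least_loaded_node(client, request, max_polls=10):
    """Wait for a node, re-choosing it if it drops out of the metadata."""
    node_id = client.least_loaded_node()
    for _ in range(max_polls):
        # Re-select when the chosen node is no longer in the metadata.
        if node_id is None or node_id not in client.known_nodes:
            node_id = client.least_loaded_node()
        if node_id is not None and client.ready(node_id):
            return node_id, request  # stand-in for actually sending
        client.poll()
    # Assumption of this sketch: give up after a bounded number of polls.
    raise RuntimeError("no node became ready")
```

With `FakeClient(known_nodes={1, 2}, live_nodes={2})`, the first choice is the removed node 1; after one poll the metadata no longer lists it, so node 2 is chosen instead of waiting on node 1 forever.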

The same failure applies to requests sent to the controller: if the cached `_controller_id` is out of date and that node has been removed from the cluster, `_send_request_to_node` would enter an infinite loop.
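
One way to guard against a stale controller id, shown here only as a hedged sketch and not necessarily what the commit does, is to validate the cached id against the current metadata before waiting on it. The function `resolve_controller` and both callables passed to it are hypothetical names for illustration.

```python
def resolve_controller(cached_id, known_broker_ids, lookup_controller):
    """Return a controller id that is present in the current metadata.

    cached_id: previously cached controller id, possibly stale (or None)
    known_broker_ids: callable returning the set of broker ids in the metadata
    lookup_controller: callable that asks the cluster for the current controller
    """
    if cached_id is not None and cached_id in known_broker_ids():
        return cached_id  # cached value is still a known broker
    # Cached controller is gone from the metadata: discover it again
    # instead of waiting on a node that will never become available.
    return lookup_controller()
```

For example, `resolve_controller(1, lambda: {2, 3}, lambda: 3)` falls back to the lookup because broker 1 is no longer listed, while a cached id of 2 would be returned unchanged.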

hackaugusto added 2 commits January 14, 2021 15:18

bugfix: infinite loop when send msgs to controller

e1b8d42

hackaugusto mentioned this pull request Jan 18, 2021

fix infinite loop with kafka admin aiven/kafka-python#20

Merged

hackaugusto mentioned this pull request Feb 5, 2021

SASL_SSL / SCRAM support Aiven-Open/karapace#153

Closed

hackaugusto mentioned this pull request Apr 27, 2021

Implement endpoint for health check Aiven-Open/karapace#143

Closed

Merge branch 'master' into fix-infinite-loop-with-kafka-admin

baf58d1

hackaugusto closed this Aug 7, 2023

hackaugusto deleted the fix-infinite-loop-with-kafka-admin branch August 7, 2023 09:26
