Fix infinite retry loops in flb_tls_net_read/write#11547
Fix infinite retry loops in flb_tls_net_read/write#11547spstack wants to merge 1 commit intofluent:masterfrom
Conversation
📝 WalkthroughWalkthroughRefactors TLS non-blocking I/O in Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip CodeRabbit can use Clang for C/C++ static analysis and code quality checks.Clang provides comprehensive static analysis for C and C++ code, including syntax checking, type checking, and various warning diagnostics. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/tls/flb_tls.c (1)
347-357: Tight CPU spin remains whenio_timeout > 0.The fix correctly prevents infinite loops when
io_timeout == 0, but when a timeout is configured and not yet expired, thegoto retry_readat line 356 still creates a tight busy-wait loop until either:
- The TLS operation succeeds, or
- The timeout expires
Compare this to the handshake retry logic at line 699 which uses
flb_time_msleep(500)to avoid CPU spin, and the async variant at line 392 which properly yields viaflb_coro_yield().Consider adding a small sleep before retrying to reduce CPU consumption:
♻️ Suggested improvement to add sleep before retry
if (ret == FLB_TLS_WANT_READ || ret == FLB_TLS_WANT_WRITE) { /* If no timeout is configured OR timeout expired, return immediately * to let the event loop wait for the socket to be ready. * Without this check, we loop forever at 100% CPU when timeout is 0. */ if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) { return ret; } + flb_time_msleep(10); goto retry_read; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tls/flb_tls.c` around lines 347 - 357, The FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE handling spins when io_timeout > 0 because it immediately goto retry_read; modify the branch that currently checks timeout_timestamp/current_timestamp so that, if the timeout hasn't expired, it yields/sleeps before retrying: call flb_coro_yield() when in async mode (same context used by the async variant) or call flb_time_msleep(50) (or a small configurable msleep) for synchronous code, then goto retry_read; reference the retry_read label, FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE, timeout_timestamp/current_timestamp, flb_coro_yield and flb_time_msleep when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/tls/flb_tls.c`:
- Around line 461-471: flb_tls_net_write currently returns
FLB_TLS_WANT_WRITE/READ directly which sync-mode callers (flb_io_net_write and
plugins/out_websocket/websocket.c, plugins/out_tcp/tcp.c,
plugins/out_syslog/syslog.c) treat only -1 as error; change behavior so WANT
codes are translated to a standard -1 return with errno set to
EAGAIN/EWOULDBLOCK when the timeout has expired or timeout_timestamp == 0,
ensuring callers' error handling runs; additionally, remove the tight busy-loop
on WANT (the goto retry_write) and instead wait for the socket to become
writable/readable using the existing IO wait primitive (e.g., flb_io_wait or a
select/poll wrapper) until timeout before retrying, then either retry or return
-1/EAGAIN on timeout.
---
Nitpick comments:
In `@src/tls/flb_tls.c`:
- Around line 347-357: The FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE handling spins
when io_timeout > 0 because it immediately goto retry_read; modify the branch
that currently checks timeout_timestamp/current_timestamp so that, if the
timeout hasn't expired, it yields/sleeps before retrying: call flb_coro_yield()
when in async mode (same context used by the async variant) or call
flb_time_msleep(50) (or a small configurable msleep) for synchronous code, then
goto retry_read; reference the retry_read label,
FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE, timeout_timestamp/current_timestamp,
flb_coro_yield and flb_time_msleep when making the change.
| if (ret == FLB_TLS_WANT_WRITE || ret == FLB_TLS_WANT_READ) { | ||
| /* If no timeout is configured OR timeout expired, return immediately | ||
| * to let the event loop wait for the socket to be ready. | ||
| * Without this check, we loop forever at 100% CPU when timeout is 0. | ||
| */ | ||
| if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) { | ||
| *out_len = total; | ||
| return ret; | ||
| } | ||
|
|
||
| goto retry_write; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find callers of flb_io_net_write and flb_tls_net_write to verify WANT code handling
echo "=== Callers of flb_io_net_write ==="
rg -nP --type=c -C5 '\bflb_io_net_write\s*\(' -g '!*test*'
echo ""
echo "=== Check for FLB_TLS_WANT handling patterns ==="
rg -nP --type=c 'FLB_TLS_WANT_(READ|WRITE)' -C3Repository: fluent/fluent-bit
Length of output: 30899
🏁 Script executed:
# Get the full implementation of flb_io_net_write
cat -n src/flb_io.c | sed -n '675,750p'Repository: fluent/fluent-bit
Length of output: 2689
🏁 Script executed:
# Get the full flb_tls_net_write function
cat -n src/tls/flb_tls.c | sed -n '430,480p'Repository: fluent/fluent-bit
Length of output: 1784
Verify caller handling of FLB_TLS_WANT_* return codes.
When timeout_timestamp == 0 or timeout expires, flb_tls_net_write returns FLB_TLS_WANT_WRITE (-2022) or FLB_TLS_WANT_READ (-2020) directly to callers through flb_io_net_write. However, sync-mode callers such as plugins/out_websocket/websocket.c:300, plugins/out_tcp/tcp.c:189, and plugins/out_syslog/syslog.c:869 only check ret == -1 for errors. These negative WANT codes will bypass all error handling, potentially causing incorrect behavior where a WANT code (partial write state) is silently treated as neither success nor error.
Additionally, when io_timeout > 0, the code enters a tight loop at line 471 (goto retry_write) for the duration of the timeout, consuming CPU cycles while waiting for the socket to be ready. This repeats the same concern as flb_tls_net_read.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/tls/flb_tls.c` around lines 461 - 471, flb_tls_net_write currently
returns FLB_TLS_WANT_WRITE/READ directly which sync-mode callers
(flb_io_net_write and plugins/out_websocket/websocket.c, plugins/out_tcp/tcp.c,
plugins/out_syslog/syslog.c) treat only -1 as error; change behavior so WANT
codes are translated to a standard -1 return with errno set to
EAGAIN/EWOULDBLOCK when the timeout has expired or timeout_timestamp == 0,
ensuring callers' error handling runs; additionally, remove the tight busy-loop
on WANT (the goto retry_write) and instead wait for the socket to become
writable/readable using the existing IO wait primitive (e.g., flb_io_wait or a
select/poll wrapper) until timeout before retrying, then either retry or return
-1/EAGAIN on timeout.
Fixes 100% CPU hang in flb_tls_net_read and flb_tls_net_write when
SSL_read/SSL_write returns WANT_READ/WANT_WRITE and no io_timeout
is configured.
Both functions had retry loops that would exit when a timeout expires:
if (timeout_timestamp > 0 && timeout_timestamp <= current_timestamp)
return ret;
However, when io_timeout is 0 or not set, timeout_timestamp is 0, and
this condition is always false, causing an infinite loop at 100% CPU.
flb_tls_net_write() didn't even check timeout at all - it just looped
forever on WANT_READ/WANT_WRITE.
Both functions now explicitly check: if timeout_timestamp == 0 OR
timeout has expired, return immediately instead of looping. This lets
the event loop wait for the socket to become readable/writable,
preventing the CPU spin.
When a timeout IS configured, the existing time-based retry behavior
is preserved for backwards compatibility.
The async versions (_async suffix) don't have this bug because they
properly yield control via flb_coro_yield() instead of spinning.
Signed-off-by: Scott Stack <scottstack14@gmail.com>
62b036d to
aef9239
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62b036d41e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) { | ||
| return ret; |
There was a problem hiding this comment.
Do not return TLS_WANT to blocking read callers
When io_timeout is unset (default 0, per src/flb_network.c), this branch now immediately returns FLB_TLS_WANT_READ/WRITE. Blocking read call sites treat any <= 0 as a hard failure (for example flb_http_client_session_read and flb_http_server_session_read), so a transient WANT condition now closes the connection instead of waiting/retrying. That turns the previous spin into immediate read failures for default configurations.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This is a good point, I have to rethink how we want this to work.... I still don't think the default behavior (io_timeout = 0) should be able to hang the input forever in any situation, but the real solution is to bump io_timeout to a non-zero value.
I'm wondering if the approach might be to use a high timeout value when the io_timeout is set to zero only for this particular set of functions. This does potentially change the expected user behavior, but I can't imagine a scenario where the user would want to wait forever for a connection that is almost certainly broken.
| if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) { | ||
| *out_len = total; | ||
| return ret; |
There was a problem hiding this comment.
Avoid surfacing TLS_WANT as successful blocking writes
This path now returns FLB_TLS_WANT_READ/WRITE when no timeout is configured, but multiple blocking write callers only treat -1 as failure (e.g. src/flb_http_client.c and src/http_server/flb_http_server.c). Since WANT codes are negative but not -1, those callers can continue as if the write succeeded while *out_len is partial or zero, which can silently leave requests/responses unsent or truncated.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
src/tls/flb_tls.c (2)
347-356:⚠️ Potential issue | 🟠 MajorFinite
io_timeoutstill busy-spins until the deadline expires.Line 356 and Line 471 jump straight back into
SSL_read/SSL_writewhile the deadline is still in the future. That fixes the unbounded loop forio_timeout == 0, but a positive timeout can still pin a core for the full timeout window. Please wait for socket readiness between retries instead of polling the clock in a tight loop.Also applies to: 461-471
347-353:⚠️ Potential issue | 🟠 Major
io_timeout == 0now changes blocking I/O into an immediate WANT return.
src/flb_network.c:151-154documentsio_timeout == 0as infinite wait, but Line 352 and Line 466 now returnFLB_TLS_WANT_*on the first retryable condition without any readiness wait. On the write side,src/flb_io.c:675-715forwards that value unchanged, so sync callers that only treat-1as failure can leak a WANT code instead of either blocking or entering normal error handling. Please keep the blocking contract here by waiting for readiness, or normalize the sync return path before it reachesflb_io_net_*.#!/bin/bash set -euo pipefail echo "=== io_timeout default semantics ===" sed -n '145,160p' src/flb_network.c | nl -ba echo echo "=== flb_io_net_write passthrough ===" sed -n '675,715p' src/flb_io.c | nl -ba echo echo "=== flb_io_net_read / flb_io_net_write definitions ===" rg -n -C3 --type=c '\bint\s+flb_io_net_(read|write)\s*\(' src echo echo "=== direct call sites of flb_io_net_write ===" rg -n -C4 --type=c '\bflb_io_net_write\s*\(' src plugins echo echo "=== nearby handling in known sync output plugins ===" for f in plugins/out_websocket/websocket.c plugins/out_tcp/tcp.c plugins/out_syslog/syslog.c; do if [ -f "$f" ]; then echo "--- $f ---" rg -n -C4 'flb_io_net_write\s*\(|==\s*-1|ret\s*<\s*0' "$f" fi done echo echo "=== direct call sites of flb_io_net_read ===" rg -n -C4 --type=c '\bflb_io_net_read\s*\(' src pluginsAlso applies to: 461-468
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tls/flb_tls.c` around lines 347 - 353, The code returns FLB_TLS_WANT_READ/WRITE immediately when timeout_timestamp == 0, which breaks the documented io_timeout==0 (infinite wait) contract; change the condition so we only return WANT when a timeout is configured and has actually expired (i.e., replace the current check that uses "timeout_timestamp == 0 || timeout_timestamp <= current_timestamp" with a check like "timeout_timestamp != 0 && timeout_timestamp <= current_timestamp") for both retryable branches that return FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE (use the timeout_timestamp/current_timestamp checks in flb_tls.c where those WANT codes are returned), ensuring blocking callers see normal blocking behavior instead of a stray WANT code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/tls/flb_tls.c`:
- Around line 347-353: The code returns FLB_TLS_WANT_READ/WRITE immediately when
timeout_timestamp == 0, which breaks the documented io_timeout==0 (infinite
wait) contract; change the condition so we only return WANT when a timeout is
configured and has actually expired (i.e., replace the current check that uses
"timeout_timestamp == 0 || timeout_timestamp <= current_timestamp" with a check
like "timeout_timestamp != 0 && timeout_timestamp <= current_timestamp") for
both retryable branches that return FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE (use
the timeout_timestamp/current_timestamp checks in flb_tls.c where those WANT
codes are returned), ensuring blocking callers see normal blocking behavior
instead of a stray WANT code.
Attempts to fix 100% CPU hang in flb_tls_net_read and flb_tls_net_write when SSL_read/SSL_write returns WANT_READ/WANT_WRITE and no io_timeout is configured.
This seems to happen when connections are closed unexpectedly. It causes a condition where open SSL just repeatedly returns
WANT_READ, but there's no new data being received as we've lost the other side, so it just hangs forever and consumes 100% CPU.If an
io_timeoutis configured, then everything should be ok as it will eventually give up and return, but the problem arises when the default value of0is used forio_timeoutas it will spin forever in a tight loop.Proposing changing this to immediately return if there is no io_timeout configured.
Intended to address this issue, but may address others instances where CPU was 100% consumed
#11551
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
I've had to test this change by first getting a test to reproduce this, then applying this fix and trying again.
I've at least confirmed that I'm able to reproduce the issue without this change, and not reproduce it with this change.
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit