Fix infinite retry loops in flb_tls_net_read/write by spstack · Pull Request #11547 · fluent/fluent-bit

spstack · 2026-03-12T17:01:08Z

Attempts to fix 100% CPU hang in flb_tls_net_read and flb_tls_net_write when SSL_read/SSL_write returns WANT_READ/WANT_WRITE and no io_timeout is configured.

This seems to happen when connections are closed unexpectedly. It causes a condition where open SSL just repeatedly returns WANT_READ, but there's no new data being received as we've lost the other side, so it just hangs forever and consumes 100% CPU.

If an io_timeout is configured, then everything should be ok as it will eventually give up and return, but the problem arises when the default value of 0 is used for io_timeout as it will spin forever in a tight loop.

Proposing changing this to immediately return if there is no io_timeout configured.

Intended to address this issue, but may address others instances where CPU was 100% consumed
#11551

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

[N/A] Example configuration file for the change
Debug log output from testing the change

I've had to test this change by first getting a test to reproduce this, then applying this fix and trying again.
I've at least confirmed that I'm able to reproduce the issue without this change, and not reproduce it with this change.

Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

Run local packaging test showing all targets (including any new ones) build.
Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

[N/A] Documentation required for this feature

Backporting

Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

Bug Fixes
- Improved TLS I/O handling to prevent excessive CPU usage when timeouts are unset or expired.
- Unified read/write readiness handling to reduce unnecessary retries and spin loops.
- Ensure progress is reported on interrupted writes (outgoing bytes updated) when operations return due to timeout.

coderabbitai · 2026-03-12T17:01:39Z

📝 Walkthrough

Walkthrough

Refactors TLS non-blocking I/O in flb_tls_net_read and flb_tls_net_write: unifies handling of WANT_READ/WANT_WRITE, adds upfront timeout_timestamp calculation, and adds immediate-return checks to avoid busy-looping when timeouts are zero or expired.

Changes

Cohort / File(s)	Summary
TLS I/O timeout & readiness handling `src/tls/flb_tls.c`	Merged WANT_READ/WANT_WRITE branches into unified paths for read/write. Initialize `timeout_timestamp` early, check current time to return immediately when timeout is zero or expired, and update `out_len` on timeout returns in write path.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

backport to v4.2.x

Suggested reviewers

edsiper
cosmo0920
fujimotos

Poem

🐰 I hopped in the TLS glade at night,
Merged paths so reads and writes unite,
Timers set early, no frantic spin,
Packets rest easy, calm within,
I nibble code — now everything's light ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main change: fixing infinite retry loops in two TLS functions by implementing proper timeout handling.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

CodeRabbit can use Clang for C/C++ static analysis and code quality checks.

Clang provides comprehensive static analysis for C and C++ code, including syntax checking, type checking, and various warning diagnostics.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/tls/flb_tls.c (1)
347-357: Tight CPU spin remains when io_timeout > 0.

The fix correctly prevents infinite loops when io_timeout == 0, but when a timeout is configured and not yet expired, the goto retry_read at line 356 still creates a tight busy-wait loop until either:

The TLS operation succeeds, or

The timeout expires

Compare this to the handshake retry logic at line 699 which uses flb_time_msleep(500) to avoid CPU spin, and the async variant at line 392 which properly yields via flb_coro_yield().

Consider adding a small sleep before retrying to reduce CPU consumption:
♻️ Suggested improvement to add sleep before retry
     if (ret == FLB_TLS_WANT_READ || ret == FLB_TLS_WANT_WRITE) {
         /* If no timeout is configured OR timeout expired, return immediately
          * to let the event loop wait for the socket to be ready.
          * Without this check, we loop forever at 100% CPU when timeout is 0.
          */
         if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) {
             return ret;
         }
 
+        flb_time_msleep(10);
         goto retry_read;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tls/flb_tls.c` around lines 347 - 357, The
FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE handling spins when io_timeout > 0 because
it immediately goto retry_read; modify the branch that currently checks
timeout_timestamp/current_timestamp so that, if the timeout hasn't expired, it
yields/sleeps before retrying: call flb_coro_yield() when in async mode (same
context used by the async variant) or call flb_time_msleep(50) (or a small
configurable msleep) for synchronous code, then goto retry_read; reference the
retry_read label, FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE,
timeout_timestamp/current_timestamp, flb_coro_yield and flb_time_msleep when
making the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/tls/flb_tls.c`:
- Around line 461-471: flb_tls_net_write currently returns
FLB_TLS_WANT_WRITE/READ directly which sync-mode callers (flb_io_net_write and
plugins/out_websocket/websocket.c, plugins/out_tcp/tcp.c,
plugins/out_syslog/syslog.c) treat only -1 as error; change behavior so WANT
codes are translated to a standard -1 return with errno set to
EAGAIN/EWOULDBLOCK when the timeout has expired or timeout_timestamp == 0,
ensuring callers' error handling runs; additionally, remove the tight busy-loop
on WANT (the goto retry_write) and instead wait for the socket to become
writable/readable using the existing IO wait primitive (e.g., flb_io_wait or a
select/poll wrapper) until timeout before retrying, then either retry or return
-1/EAGAIN on timeout.

---

Nitpick comments:
In `@src/tls/flb_tls.c`:
- Around line 347-357: The FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE handling spins
when io_timeout > 0 because it immediately goto retry_read; modify the branch
that currently checks timeout_timestamp/current_timestamp so that, if the
timeout hasn't expired, it yields/sleeps before retrying: call flb_coro_yield()
when in async mode (same context used by the async variant) or call
flb_time_msleep(50) (or a small configurable msleep) for synchronous code, then
goto retry_read; reference the retry_read label,
FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE, timeout_timestamp/current_timestamp,
flb_coro_yield and flb_time_msleep when making the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 524eb97e-3b3c-4d74-b785-d073426424f7

📥 Commits

Reviewing files that changed from the base of the PR and between 405fed0 and 62b036d.

📒 Files selected for processing (1)

src/tls/flb_tls.c

coderabbitai · 2026-03-12T17:06:49Z

src/tls/flb_tls.c

+    if (ret == FLB_TLS_WANT_WRITE || ret == FLB_TLS_WANT_READ) {
+        /* If no timeout is configured OR timeout expired, return immediately
+         * to let the event loop wait for the socket to be ready.
+         * Without this check, we loop forever at 100% CPU when timeout is 0.
+         */
+        if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) {
+            *out_len = total;
+            return ret;
+        }
+
        goto retry_write;


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Find callers of flb_io_net_write and flb_tls_net_write to verify WANT code handling echo "=== Callers of flb_io_net_write ===" rg -nP --type=c -C5 '\bflb_io_net_write\s*\(' -g '!*test*' echo "" echo "=== Check for FLB_TLS_WANT handling patterns ===" rg -nP --type=c 'FLB_TLS_WANT_(READ|WRITE)' -C3

Repository: fluent/fluent-bit

Length of output: 30899

🏁 Script executed:

# Get the full implementation of flb_io_net_write cat -n src/flb_io.c | sed -n '675,750p'

Repository: fluent/fluent-bit

Length of output: 2689

🏁 Script executed:

# Get the full flb_tls_net_write function cat -n src/tls/flb_tls.c | sed -n '430,480p'

Repository: fluent/fluent-bit

Length of output: 1784

Verify caller handling of FLB_TLS_WANT_* return codes.

When timeout_timestamp == 0 or timeout expires, flb_tls_net_write returns FLB_TLS_WANT_WRITE (-2022) or FLB_TLS_WANT_READ (-2020) directly to callers through flb_io_net_write. However, sync-mode callers such as plugins/out_websocket/websocket.c:300, plugins/out_tcp/tcp.c:189, and plugins/out_syslog/syslog.c:869 only check ret == -1 for errors. These negative WANT codes will bypass all error handling, potentially causing incorrect behavior where a WANT code (partial write state) is silently treated as neither success nor error.

Additionally, when io_timeout > 0, the code enters a tight loop at line 471 (goto retry_write) for the duration of the timeout, consuming CPU cycles while waiting for the socket to be ready. This repeats the same concern as flb_tls_net_read.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/tls/flb_tls.c` around lines 461 - 471, flb_tls_net_write currently returns FLB_TLS_WANT_WRITE/READ directly which sync-mode callers (flb_io_net_write and plugins/out_websocket/websocket.c, plugins/out_tcp/tcp.c, plugins/out_syslog/syslog.c) treat only -1 as error; change behavior so WANT codes are translated to a standard -1 return with errno set to EAGAIN/EWOULDBLOCK when the timeout has expired or timeout_timestamp == 0, ensuring callers' error handling runs; additionally, remove the tight busy-loop on WANT (the goto retry_write) and instead wait for the socket to become writable/readable using the existing IO wait primitive (e.g., flb_io_wait or a select/poll wrapper) until timeout before retrying, then either retry or return -1/EAGAIN on timeout.

Fixes 100% CPU hang in flb_tls_net_read and flb_tls_net_write when SSL_read/SSL_write returns WANT_READ/WANT_WRITE and no io_timeout is configured. Both functions had retry loops that would exit when a timeout expires: if (timeout_timestamp > 0 && timeout_timestamp <= current_timestamp) return ret; However, when io_timeout is 0 or not set, timeout_timestamp is 0, and this condition is always false, causing an infinite loop at 100% CPU. flb_tls_net_write() didn't even check timeout at all - it just looped forever on WANT_READ/WANT_WRITE. Both functions now explicitly check: if timeout_timestamp == 0 OR timeout has expired, return immediately instead of looping. This lets the event loop wait for the socket to become readable/writable, preventing the CPU spin. When a timeout IS configured, the existing time-based retry behavior is preserved for backwards compatibility. The async versions (_async suffix) don't have this bug because they properly yield control via flb_coro_yield() instead of spinning. Signed-off-by: Scott Stack <scottstack14@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62b036d41e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-12T17:08:51Z

src/tls/flb_tls.c

+        if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) {
            return ret;


Do not return TLS_WANT to blocking read callers

When io_timeout is unset (default 0, per src/flb_network.c), this branch now immediately returns FLB_TLS_WANT_READ/WRITE. Blocking read call sites treat any <= 0 as a hard failure (for example flb_http_client_session_read and flb_http_server_session_read), so a transient WANT condition now closes the connection instead of waiting/retrying. That turns the previous spin into immediate read failures for default configurations.

Useful? React with 👍 / 👎.

This is a good point, I have to rethink how we want this to work.... I still don't think the default behavior (io_timeout = 0) should be able to hang the input forever in any situation, but the real solution is to bump io_timeout to a non-zero value.

I'm wondering if the approach might be to use a high timeout value when the io_timeout is set to zero only for this particular set of functions. This does potentially change the expected user behavior, but I can't imagine a scenario where the user would want to wait forever for a connection that is almost certainly broken.

chatgpt-codex-connector · 2026-03-12T17:08:51Z

src/tls/flb_tls.c

+        if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) {
+            *out_len = total;
+            return ret;


Avoid surfacing TLS_WANT as successful blocking writes

This path now returns FLB_TLS_WANT_READ/WRITE when no timeout is configured, but multiple blocking write callers only treat -1 as failure (e.g. src/flb_http_client.c and src/http_server/flb_http_server.c). Since WANT codes are negative but not -1, those callers can continue as if the write succeeded while *out_len is partial or zero, which can silently leave requests/responses unsent or truncated.

Useful? React with 👍 / 👎.

coderabbitai

♻️ Duplicate comments (2)

src/tls/flb_tls.c (2)
347-356: ⚠️ Potential issue | 🟠 Major

Finite io_timeout still busy-spins until the deadline expires.

Line 356 and Line 471 jump straight back into SSL_read/SSL_write while the deadline is still in the future. That fixes the unbounded loop for io_timeout == 0, but a positive timeout can still pin a core for the full timeout window. Please wait for socket readiness between retries instead of polling the clock in a tight loop.

Also applies to: 461-471

347-353: ⚠️ Potential issue | 🟠 Major

io_timeout == 0 now changes blocking I/O into an immediate WANT return.

src/flb_network.c:151-154 documents io_timeout == 0 as infinite wait, but Line 352 and Line 466 now return FLB_TLS_WANT_* on the first retryable condition without any readiness wait. On the write side, src/flb_io.c:675-715 forwards that value unchanged, so sync callers that only treat -1 as failure can leak a WANT code instead of either blocking or entering normal error handling. Please keep the blocking contract here by waiting for readiness, or normalize the sync return path before it reaches flb_io_net_*.
#!/bin/bash
set -euo pipefail

echo "=== io_timeout default semantics ==="
sed -n '145,160p' src/flb_network.c | nl -ba

echo
echo "=== flb_io_net_write passthrough ==="
sed -n '675,715p' src/flb_io.c | nl -ba

echo
echo "=== flb_io_net_read / flb_io_net_write definitions ==="
rg -n -C3 --type=c '\bint\s+flb_io_net_(read|write)\s*\(' src

echo
echo "=== direct call sites of flb_io_net_write ==="
rg -n -C4 --type=c '\bflb_io_net_write\s*\(' src plugins

echo
echo "=== nearby handling in known sync output plugins ==="
for f in plugins/out_websocket/websocket.c plugins/out_tcp/tcp.c plugins/out_syslog/syslog.c; do
  if [ -f "$f" ]; then
    echo "--- $f ---"
    rg -n -C4 'flb_io_net_write\s*\(|==\s*-1|ret\s*<\s*0' "$f"
  fi
done

echo
echo "=== direct call sites of flb_io_net_read ==="
rg -n -C4 --type=c '\bflb_io_net_read\s*\(' src plugins
Also applies to: 461-468
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tls/flb_tls.c` around lines 347 - 353, The code returns
FLB_TLS_WANT_READ/WRITE immediately when timeout_timestamp == 0, which breaks
the documented io_timeout==0 (infinite wait) contract; change the condition so
we only return WANT when a timeout is configured and has actually expired (i.e.,
replace the current check that uses "timeout_timestamp == 0 || timeout_timestamp
<= current_timestamp" with a check like "timeout_timestamp != 0 &&
timeout_timestamp <= current_timestamp") for both retryable branches that return
FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE (use the
timeout_timestamp/current_timestamp checks in flb_tls.c where those WANT codes
are returned), ensuring blocking callers see normal blocking behavior instead of
a stray WANT code.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/tls/flb_tls.c`:
- Around line 347-353: The code returns FLB_TLS_WANT_READ/WRITE immediately when
timeout_timestamp == 0, which breaks the documented io_timeout==0 (infinite
wait) contract; change the condition so we only return WANT when a timeout is
configured and has actually expired (i.e., replace the current check that uses
"timeout_timestamp == 0 || timeout_timestamp <= current_timestamp" with a check
like "timeout_timestamp != 0 && timeout_timestamp <= current_timestamp") for
both retryable branches that return FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE (use
the timeout_timestamp/current_timestamp checks in flb_tls.c where those WANT
codes are returned), ensuring blocking callers see normal blocking behavior
instead of a stray WANT code.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6406de29-357d-472d-8731-a76662d94f3b

📥 Commits

Reviewing files that changed from the base of the PR and between 62b036d and aef9239.

📒 Files selected for processing (1)

src/tls/flb_tls.c

spstack requested review from cosmo0920 and edsiper as code owners March 12, 2026 17:01

github-actions bot added the docs-required label Mar 12, 2026

coderabbitai bot reviewed Mar 12, 2026

View reviewed changes

spstack force-pushed the fix-tls-infinite-loop-hang branch from 62b036d to aef9239 Compare March 12, 2026 17:07

chatgpt-codex-connector bot reviewed Mar 12, 2026

View reviewed changes

coderabbitai bot reviewed Mar 12, 2026

View reviewed changes

spstack temporarily deployed to pr March 13, 2026 08:26 — with GitHub Actions Inactive

spstack temporarily deployed to pr March 13, 2026 08:46 — with GitHub Actions Inactive

spstack temporarily deployed to pr March 13, 2026 08:47 — with GitHub Actions Inactive

spstack marked this pull request as draft March 13, 2026 13:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix infinite retry loops in flb_tls_net_read/write#11547

Fix infinite retry loops in flb_tls_net_read/write#11547
spstack wants to merge 1 commit intofluent:masterfrom
spstack:fix-tls-infinite-loop-hang

spstack commented Mar 12, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Mar 12, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 12, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 12, 2026

Uh oh!

spstack Mar 13, 2026

Uh oh!

chatgpt-codex-connector bot Mar 12, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		if (timeout_timestamp == 0 \|\| timeout_timestamp <= current_timestamp) {
		return ret;

Conversation

spstack commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

spstack Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

spstack commented Mar 12, 2026 •

edited

Loading

coderabbitai bot commented Mar 12, 2026 •

edited

Loading