37 changes: 27 additions & 10 deletions src/tls/flb_tls.c
@@ -344,17 +344,17 @@ int flb_tls_net_read(struct flb_tls_session *session, void *buf, size_t len)

     current_timestamp = time(NULL);

-    if (ret == FLB_TLS_WANT_READ) {
-        if (timeout_timestamp > 0 &&
-            timeout_timestamp <= current_timestamp) {
+    if (ret == FLB_TLS_WANT_READ || ret == FLB_TLS_WANT_WRITE) {
+        /* If no timeout is configured OR timeout expired, return immediately
+         * to let the event loop wait for the socket to be ready.
+         * Without this check, we loop forever at 100% CPU when timeout is 0.
+         */
+        if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) {
             return ret;
Comment on lines +352 to 353

[P1] Do not return TLS_WANT to blocking read callers

When io_timeout is unset (default 0, per src/flb_network.c), this branch now immediately returns FLB_TLS_WANT_READ/WRITE. Blocking read call sites treat any return value <= 0 as a hard failure (for example flb_http_client_session_read and flb_http_server_session_read), so a transient WANT condition now closes the connection instead of waiting and retrying. That turns the previous spin into immediate read failures for default configurations.

Author

This is a good point; I need to rethink how we want this to work. I still don't think the default behavior (io_timeout = 0) should be able to hang the input forever in any situation, but the real solution is to bump io_timeout to a non-zero value.

I'm wondering whether the approach should be to use a high timeout value when io_timeout is set to zero, only for this particular set of functions. This potentially changes the behavior users expect, but I can't imagine a scenario where a user would want to wait forever on a connection that is almost certainly broken.
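
A minimal sketch of the fallback idea described above, assuming a hypothetical constant `EXAMPLE_TLS_FALLBACK_TIMEOUT` and hypothetical helpers `example_tls_deadline` and `example_tls_deadline_expired` (none of these exist in the PR). The `io_timeout` parameter stands in for `session->connection->net->io_timeout`, and the 60-second value is only a placeholder, not a proposed default:

```c
#include <time.h>

/* Hypothetical fallback, in seconds, applied only when io_timeout is 0.
 * The value is a placeholder chosen for illustration. */
#define EXAMPLE_TLS_FALLBACK_TIMEOUT 60

/*
 * Compute the absolute deadline for a blocking TLS read or write.
 * io_timeout mirrors net->io_timeout, where 0 means "not configured".
 * Unlike the code in the diff, the result is never 0, so the retry loop
 * always has a point at which it gives up.
 */
time_t example_tls_deadline(time_t now, int io_timeout)
{
    if (io_timeout > 0) {
        return now + io_timeout;
    }
    return now + EXAMPLE_TLS_FALLBACK_TIMEOUT;
}

/* Nonzero once the deadline has been reached or passed. */
int example_tls_deadline_expired(time_t deadline, time_t now)
{
    return deadline <= now;
}
```

With a helper like this, the retry loop would only need the expiry check, since the "no timeout configured" case no longer produces a zero deadline.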

         }

         goto retry_read;
     }
-    else if (ret == FLB_TLS_WANT_WRITE) {
-        goto retry_read;
-    }
     else if (ret < 0) {
         return -1;
     }
@@ -435,22 +435,39 @@ int flb_tls_net_read_async(struct flb_coro *co,
 int flb_tls_net_write(struct flb_tls_session *session,
                       const void *data, size_t len, size_t *out_len)
 {
+    time_t timeout_timestamp;
+    time_t current_timestamp;
     size_t total;
     int ret;
     struct flb_tls *tls;

     total = 0;
     tls = session->tls;

+    if (session->connection->net->io_timeout > 0) {
+        timeout_timestamp = time(NULL) + session->connection->net->io_timeout;
+    }
+    else {
+        timeout_timestamp = 0;
+    }
+
 retry_write:
     ret = tls->api->net_write(session,
                               (unsigned char *) data + total,
                               len - total);

-    if (ret == FLB_TLS_WANT_WRITE) {
-        goto retry_write;
-    }
-    else if (ret == FLB_TLS_WANT_READ) {
+    current_timestamp = time(NULL);
+
+    if (ret == FLB_TLS_WANT_WRITE || ret == FLB_TLS_WANT_READ) {
+        /* If no timeout is configured OR timeout expired, return immediately
+         * to let the event loop wait for the socket to be ready.
+         * Without this check, we loop forever at 100% CPU when timeout is 0.
+         */
+        if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) {
+            *out_len = total;
+            return ret;
Comment on lines +466 to +468

[P1] Avoid surfacing TLS_WANT as successful blocking writes

This path now returns FLB_TLS_WANT_READ/WRITE when no timeout is configured, but multiple blocking write callers only treat -1 as failure (e.g. src/flb_http_client.c and src/http_server/flb_http_server.c). Since WANT codes are negative but not -1, those callers can continue as if the write succeeded while *out_len is partial or zero, which can silently leave requests/responses unsent or truncated.
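
The gap described here can be illustrated with a small sketch. The names `example_naive_is_error`, `example_classify`, and the `EXAMPLE_*` constants are hypothetical; the two numeric values are the ones quoted elsewhere in this review for FLB_TLS_WANT_READ and FLB_TLS_WANT_WRITE, redefined locally so the snippet compiles on its own. It assumes any non-negative return counts as success:

```c
/* Values as quoted in this review thread for the WANT codes, under
 * example names so the sketch does not depend on Fluent Bit headers. */
#define EXAMPLE_WANT_READ  (-2020)
#define EXAMPLE_WANT_WRITE (-2022)

enum example_io_status {
    EXAMPLE_IO_OK,     /* the write completed */
    EXAMPLE_IO_RETRY,  /* transient: wait for the socket, then retry */
    EXAMPLE_IO_ERROR   /* hard failure */
};

/* The pattern the review describes: only -1 is treated as failure,
 * so a WANT code is not recognized as an error. */
int example_naive_is_error(int ret)
{
    return ret == -1;
}

/* A stricter classification: WANT codes are reported as "retry",
 * every other negative value as an error. */
enum example_io_status example_classify(int ret)
{
    if (ret == EXAMPLE_WANT_READ || ret == EXAMPLE_WANT_WRITE) {
        return EXAMPLE_IO_RETRY;
    }
    if (ret < 0) {
        return EXAMPLE_IO_ERROR;
    }
    return EXAMPLE_IO_OK;
}
```

A caller using the naive check would continue after a WANT return even though `*out_len` may be partial or zero, which is the truncation risk raised above.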

+        }
+
         goto retry_write;
Comment on lines +461 to 471
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find callers of flb_io_net_write and flb_tls_net_write to verify WANT code handling
echo "=== Callers of flb_io_net_write ==="
rg -nP --type=c -C5 '\bflb_io_net_write\s*\(' -g '!*test*'

echo ""
echo "=== Check for FLB_TLS_WANT handling patterns ==="
rg -nP --type=c 'FLB_TLS_WANT_(READ|WRITE)' -C3

Repository: fluent/fluent-bit

Length of output: 30899


🏁 Script executed:

# Get the full implementation of flb_io_net_write
cat -n src/flb_io.c | sed -n '675,750p'

Repository: fluent/fluent-bit

Length of output: 2689


🏁 Script executed:

# Get the full flb_tls_net_write function
cat -n src/tls/flb_tls.c | sed -n '430,480p'

Repository: fluent/fluent-bit

Length of output: 1784


Verify caller handling of FLB_TLS_WANT_* return codes.

When timeout_timestamp == 0 or timeout expires, flb_tls_net_write returns FLB_TLS_WANT_WRITE (-2022) or FLB_TLS_WANT_READ (-2020) directly to callers through flb_io_net_write. However, sync-mode callers such as plugins/out_websocket/websocket.c:300, plugins/out_tcp/tcp.c:189, and plugins/out_syslog/syslog.c:869 only check ret == -1 for errors. These negative WANT codes will bypass all error handling, potentially causing incorrect behavior where a WANT code (partial write state) is silently treated as neither success nor error.

Additionally, when io_timeout > 0, the code enters a tight loop at line 471 (goto retry_write) for the duration of the timeout, consuming CPU cycles while waiting for the socket to be ready. This repeats the same concern as flb_tls_net_read.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tls/flb_tls.c` around lines 461 - 471, flb_tls_net_write currently
returns FLB_TLS_WANT_WRITE/READ directly which sync-mode callers
(flb_io_net_write and plugins/out_websocket/websocket.c, plugins/out_tcp/tcp.c,
plugins/out_syslog/syslog.c) treat only -1 as error; change behavior so WANT
codes are translated to a standard -1 return with errno set to
EAGAIN/EWOULDBLOCK when the timeout has expired or timeout_timestamp == 0,
ensuring callers' error handling runs; additionally, remove the tight busy-loop
on WANT (the goto retry_write) and instead wait for the socket to become
writable/readable using the existing IO wait primitive (e.g., flb_io_wait or a
select/poll wrapper) until timeout before retrying, then either retry or return
-1/EAGAIN on timeout.
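
One way the suggested wait could look, as a rough sketch rather than a patch. It uses plain POSIX `poll(2)` instead of any Fluent Bit wait primitive, and `example_wait_for_tls_io` plus the `EXAMPLE_*` constants are hypothetical names; the constant values are the ones quoted above for the WANT codes:

```c
#include <errno.h>
#include <poll.h>
#include <time.h>

/* Values as quoted in this review for the WANT codes (example names). */
#define EXAMPLE_WANT_READ  (-2020)
#define EXAMPLE_WANT_WRITE (-2022)

/*
 * Sleep until fd is ready for the direction implied by `want`, or until
 * the absolute `deadline` (in seconds) passes.
 *
 * Returns 0 when poll() reports an event on fd (the caller then retries
 * the TLS operation, which surfaces any socket error itself), or -1 with
 * errno set to EAGAIN on timeout or to poll()'s errno on failure.
 */
int example_wait_for_tls_io(int fd, int want, time_t deadline)
{
    struct pollfd pfd;
    time_t now;
    time_t remaining;
    int ret;

    pfd.fd = fd;
    pfd.events = (want == EXAMPLE_WANT_READ) ? POLLIN : POLLOUT;

    for (;;) {
        now = time(NULL);
        if (deadline <= now) {
            errno = EAGAIN;
            return -1;
        }

        /* Poll in slices of at most 60 s so the millisecond value
         * passed to poll() cannot overflow an int. */
        remaining = deadline - now;
        if (remaining > 60) {
            remaining = 60;
        }

        pfd.revents = 0;
        ret = poll(&pfd, 1, (int) remaining * 1000);
        if (ret > 0) {
            return 0;
        }
        if (ret < 0 && errno != EINTR) {
            return -1;
        }
        /* Slice elapsed or interrupted by a signal: re-check the deadline. */
    }
}
```

In a retry loop, a WANT result would call this helper before jumping back to the retry label, and a -1 from the helper would be returned to the caller as an ordinary error. That keeps the thread asleep while waiting and gives sync-mode callers the -1 they already check for.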

     }
     else if (ret < 0) {