-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Fix infinite retry loops in flb_tls_net_read/write #11547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -344,17 +344,17 @@ int flb_tls_net_read(struct flb_tls_session *session, void *buf, size_t len) | |
|
|
||
| current_timestamp = time(NULL); | ||
|
|
||
| if (ret == FLB_TLS_WANT_READ) { | ||
| if (timeout_timestamp > 0 && | ||
| timeout_timestamp <= current_timestamp) { | ||
| if (ret == FLB_TLS_WANT_READ || ret == FLB_TLS_WANT_WRITE) { | ||
| /* If no timeout is configured OR timeout expired, return immediately | ||
| * to let the event loop wait for the socket to be ready. | ||
| * Without this check, we loop forever at 100% CPU when timeout is 0. | ||
| */ | ||
| if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) { | ||
| return ret; | ||
| } | ||
|
|
||
| goto retry_read; | ||
| } | ||
| else if (ret == FLB_TLS_WANT_WRITE) { | ||
| goto retry_read; | ||
| } | ||
| else if (ret < 0) { | ||
| return -1; | ||
| } | ||
|
|
@@ -435,22 +435,39 @@ int flb_tls_net_read_async(struct flb_coro *co, | |
| int flb_tls_net_write(struct flb_tls_session *session, | ||
| const void *data, size_t len, size_t *out_len) | ||
| { | ||
| time_t timeout_timestamp; | ||
| time_t current_timestamp; | ||
| size_t total; | ||
| int ret; | ||
| struct flb_tls *tls; | ||
|
|
||
| total = 0; | ||
| tls = session->tls; | ||
|
|
||
| if (session->connection->net->io_timeout > 0) { | ||
| timeout_timestamp = time(NULL) + session->connection->net->io_timeout; | ||
| } | ||
| else { | ||
| timeout_timestamp = 0; | ||
| } | ||
|
|
||
| retry_write: | ||
| ret = tls->api->net_write(session, | ||
| (unsigned char *) data + total, | ||
| len - total); | ||
|
|
||
| if (ret == FLB_TLS_WANT_WRITE) { | ||
| goto retry_write; | ||
| } | ||
| else if (ret == FLB_TLS_WANT_READ) { | ||
| current_timestamp = time(NULL); | ||
|
|
||
| if (ret == FLB_TLS_WANT_WRITE || ret == FLB_TLS_WANT_READ) { | ||
| /* If no timeout is configured OR timeout expired, return immediately | ||
| * to let the event loop wait for the socket to be ready. | ||
| * Without this check, we loop forever at 100% CPU when timeout is 0. | ||
| */ | ||
| if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) { | ||
| *out_len = total; | ||
| return ret; | ||
|
Comment on lines
+466
to
+468
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This path now returns Useful? React with 👍 / 👎. |
||
| } | ||
|
|
||
| goto retry_write; | ||
|
Comment on lines
+461
to
471
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
# Find callers of flb_io_net_write and flb_tls_net_write to verify WANT code handling
echo "=== Callers of flb_io_net_write ==="
rg -nP --type=c -C5 '\bflb_io_net_write\s*\(' -g '!*test*'
echo ""
echo "=== Check for FLB_TLS_WANT handling patterns ==="
rg -nP --type=c 'FLB_TLS_WANT_(READ|WRITE)' -C3Repository: fluent/fluent-bit Length of output: 30899 🏁 Script executed: # Get the full implementation of flb_io_net_write
cat -n src/flb_io.c | sed -n '675,750p'Repository: fluent/fluent-bit Length of output: 2689 🏁 Script executed: # Get the full flb_tls_net_write function
cat -n src/tls/flb_tls.c | sed -n '430,480p'Repository: fluent/fluent-bit Length of output: 1784 Verify caller handling of When Additionally, when 🤖 Prompt for AI Agents |
||
| } | ||
| else if (ret < 0) { | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When
io_timeoutis unset (default0, persrc/flb_network.c), this branch now immediately returnsFLB_TLS_WANT_READ/WRITE. Blocking read call sites treat any<= 0as a hard failure (for exampleflb_http_client_session_readandflb_http_server_session_read), so a transient WANT condition now closes the connection instead of waiting/retrying. That turns the previous spin into immediate read failures for default configurations.Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, I have to rethink how we want this to work.... I still don't think the default behavior (io_timeout = 0) should be able to hang the input forever in any situation, but the real solution is to bump io_timeout to a non-zero value.
I'm wondering if the approach might be to use a high timeout value when the io_timeout is set to zero only for this particular set of functions. This does potentially change the expected user behavior, but I can't imagine a scenario where the user would want to wait forever for a connection that is almost certainly broken.