Skip to content

Comments

fix: deflake //rs/tests/networking:firewall_max_connections_test#8983

Open
basvandijk wants to merge 2 commits intomasterfrom
ai/deflake-firewall_max_connections_test
Open

fix: deflake //rs/tests/networking:firewall_max_connections_test#8983
basvandijk wants to merge 2 commits intomasterfrom
ai/deflake-firewall_max_connections_test

Conversation

@basvandijk
Copy link
Collaborator

@basvandijk basvandijk commented Feb 21, 2026

The //rs/tests/networking:firewall_max_connections_test is slightly flaky:

$ bazel run //ci/githubstats:query -- top 50 flaky% --week --include //rs/tests/networking:firewall_max_connections_test%
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━━┑
│    │ label                                                        │   total │   non_success │   flaky │   timeout │   fail │   non_success% │   flaky% │   timeout% │   fail% │   impact │   total duration │   duration_p90 │ owners    │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━━┥
│  0 │ //rs/tests/networking:firewall_max_connections_test_head_nns │     126 │             4 │       4 │         0 │      0 │            3.2 │      3.2 │          0 │       0 │    18:32 │          9:43:48 │           4:38 │ consensus │
│  1 │ //rs/tests/networking:firewall_max_connections_test          │     126 │             3 │       3 │         0 │      0 │            2.4 │      2.4 │          0 │       0 │    14:06 │          9:52:12 │           4:42 │ consensus │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━━┙

Always failing in a similar way:

$ bazel run //ci/githubstats:query -- last --flaky --week //rs/tests/networking:firewall_max_connections_test --columns=last_started_at,errors
...
╒════╤═════════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│    │   last started at (UTC) │ errors per attempt                                                                                            │
╞════╪═════════════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ Fri 2026-02-20 12:18:57 │ 1: connection_count_test: Could not create connection 56#. Connection is below the limit of active connect... │
├────┼─────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  1 │ Thu 2026-02-19 19:34:41 │ 1: connection_count_test: Could not create connection 878#. Connection is below the limit of active connec... │
├────┼─────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  2 │ Mon 2026-02-16 19:21:41 │ 1: connection_count_test: Could not create connection 745#. Connection is below the limit of active connec... │
╘════╧═════════════════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════════════════════╛

Claude Opus 4.6 provided the following Root Cause Analysis and accompanying fix:

Root Cause

The test creates 1000 TCP connections sequentially to a node, each with a 2-second handshake timeout. Under load, individual TCP handshakes can transiently take longer than 2 seconds, causing the test to fail at varying connection numbers (observed at connections 56, 745, and 878 in recent CI runs).

Fix

  • Increase TCP_HANDSHAKE_TIMEOUT from 2s to 5s to give more headroom for individual handshakes.
  • Wrap each connection attempt with retry_with_msg! (30s timeout, 1s backoff) so transient timeouts are retried instead of immediately panicking.

Affected Tests

  • //rs/tests/networking:firewall_max_connections_test
  • //rs/tests/networking:firewall_max_connections_test_head_nns

Skill: .claude/skills/fix-flaky-tests/SKILL.md.

Root cause: The test creates 1000 TCP connections sequentially to a node,
each with a 2-second handshake timeout. Under load, individual TCP
handshakes can transiently take longer than 2 seconds, causing the test
to fail at varying connection numbers (observed at connections 56, 745,
and 878 in recent CI runs).

Fix:
- Increase TCP_HANDSHAKE_TIMEOUT from 2s to 5s.
- Use retry_with_msg! macro (30s timeout, 1s backoff) around each
  connection attempt, so transient timeouts are retried instead of
  immediately panicking.

Skill: .claude/skills/fix-flaky-tests/SKILL.md
@github-actions github-actions bot added the fix label Feb 21, 2026
@basvandijk basvandijk marked this pull request as ready for review February 21, 2026 19:17
@basvandijk basvandijk requested a review from a team as a code owner February 21, 2026 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant