Skip to content

Comments

fix: deflake //rs/tests/consensus/tecdsa:tecdsa_add_nodes_test#8991

Open
basvandijk wants to merge 1 commit intomasterfrom
ai/deflake-tecdsa_add_nodes_test
Open

fix: deflake //rs/tests/consensus/tecdsa:tecdsa_add_nodes_test#8991
basvandijk wants to merge 1 commit intomasterfrom
ai/deflake-tecdsa_add_nodes_test

Conversation

@basvandijk
Copy link
Collaborator

@basvandijk basvandijk commented Feb 22, 2026

The //rs/tests/consensus/tecdsa:tecdsa_add_nodes_test is slightly flaky:

$ bazel run //ci/githubstats:query -- top 1 flaky% --week --include //rs/tests/consensus/tecdsa:tecdsa_add_nodes_test
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━━┑
│    │ label                                             │   total │   non_success │   flaky │   timeout │   fail │   non_success% │   flaky% │   timeout% │   fail% │   impact │   total duration │   duration_p90 │ owners    │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━━┥
│  0 │ //rs/tests/consensus/tecdsa:tecdsa_add_nodes_test │     121 │             3 │       3 │         0 │      0 │            2.5 │      2.5 │          0 │       0 │    18:30 │         12:26:10 │           6:10 │ consensus │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━━┙

Failing for different reasons:

$ bazel run //ci/githubstats:query -- last --flaky --week //rs/tests/consensus/tecdsa:tecdsa_add_nodes_test  --columns=last_started_at,errors
...
╒════╤═════════════════════════╤════════════════════════════════════════════════════════════════════════╕
│    │   last started at (UTC) │ errors per attempt                                                     │
╞════╪═════════════════════════╪════════════════════════════════════════════════════════════════════════╡
│  0 │ Fri 2026-02-20 11:36:51 │ 1: test: Timeout after 600s                                            │
├────┼─────────────────────────┼────────────────────────────────────────────────────────────────────────┤
│  1 │ Thu 2026-02-19 15:32:17 │ 1: test: Failed to find key rotation of ecdsa:Secp256k1:some_ecdsa_key │
├────┼─────────────────────────┼────────────────────────────────────────────────────────────────────────┤
│  2 │ Mon 2026-02-16 11:39:59 │ 1: test: nns canisters were not installed within timeout of 160 sec    │
╘════╧═════════════════════════╧════════════════════════════════════════════════════════════════════════╛

Claude Opus 4.6 made the following Root Cause Analysis and accompanying fixes:

Root Cause

The test's per-task timeout of 600s (default DEFAULT_TIMEOUT_PER_TEST) is too tight for the operations performed in the test function. The test does NNS canister installation, chain key operations, node addition, topology propagation, key rotation verification, progress checks, and signature tests — all within a single test function that gets 600s.

Three failure modes were observed across 3 flaky runs in the past week:

  1. NNS canister install timeout — NNS install within the test function consumed most of the time budget (Feb 16)
  2. Key rotation assertion failureassert_metric_sum failed after only 30 retries × 3s = 90s, which was not enough time for key rotation to complete after adding 3 nodes (Feb 19)
  3. Overall 600s per-task timeout — Test hit the 600s timeout while waiting for topology propagation via await_subnet_earliest_topology_version (Feb 20)

Changes

  • The NNS_CANISTER_INSTALL_TIMEOUT was already increased from 160s to 300s by:
    fix: deflake //rs/tests/nns:bitcoin_set_config_proposal_test #8951.
  • Set overall_timeout and timeout_per_test to 900s (matching Bazel's long test timeout), up from the 600s default
  • Increase assert_metric_sum retries from 30 to 60 (180s max wait for key rotation, up from 90s)

Skill: .claude/skills/fix-flaky-tests/SKILL.md.

Increase timeouts and retry counts to fix flakiness.

Root cause: The test's per-task timeout of 600s (default) is too tight for the
operations performed by the test function:
- NNS canister installation (19-139s observed, up to 300s timeout)
- Chain key enabling via proposals (~60s)
- Getting public keys for 4 key types (each 30-120s)
- Pre-signature stash assertions (up to 500s READY_WAIT_TIMEOUT)
- Node addition proposals
- await_subnet_earliest_topology_version (up to 500s READY_WAIT_TIMEOUT)
- Key rotation metric checks with only 30 retries x 3s = 90s max
- Progress checks on all 7 nodes (up to 100s each)

Three failure modes were observed across 3 flaky runs in the past week:
1. NNS canister install timing out within the test function
2. assert_metric_sum failing after 30 retries (90s too short for key rotation)
3. Overall 600s per-task timeout hit while waiting for topology propagation

Changes:
- Set overall_timeout and timeout_per_test to 900s (matching Bazel long timeout)
- Increase assert_metric_sum retries from 30 to 60 (180s max for key rotation)

Skill: .claude/skills/fix-flaky-tests/SKILL.md
@github-actions github-actions bot added the fix label Feb 22, 2026
@basvandijk basvandijk marked this pull request as ready for review February 22, 2026 13:29
@basvandijk basvandijk requested a review from a team as a code owner February 22, 2026 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant