Skip to content

Comments

fix: deflake //rs/tests/nested/nns_recovery:{nr_local,nr_no_bless_fix_like_np}#8987

Open
basvandijk wants to merge 2 commits intomasterfrom
ai/deflake-nr_local
Open

fix: deflake //rs/tests/nested/nns_recovery:{nr_local,nr_no_bless_fix_like_np}#8987
basvandijk wants to merge 2 commits intomasterfrom
ai/deflake-nr_local

Conversation

@basvandijk
Copy link
Collaborator

@basvandijk basvandijk commented Feb 21, 2026

//rs/tests/nested/nns_recovery:nr_local and //rs/tests/nested/nns_recovery:nr_no_bless_fix_like_np were both flaky in the last week:

$ bazel run //ci/githubstats:query -- top 1 flaky% --week --include //rs/tests/nested/nns_recovery:nr_local
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━┑
│    │ label                                   │   total │   non_success │   flaky │   timeout │   fail │   non_success% │   flaky% │   timeout% │   fail% │   impact │   total duration │   duration_p90 │ owners          │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━┥
│  0 │ //rs/tests/nested/nns_recovery:nr_local │     145 │            11 │       4 │         0 │      7 │            7.6 │      2.8 │          0 │     4.8 │  4:44:43 │  2 days 14:33:05 │          25:53 │ consensus, node │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━┙

Due to several different reasons:

$ bazel run //ci/githubstats:query -- last --flaky --week //rs/tests/nested/nns_recovery:nr_local --columns=last_started_at,errors
...
╒════╤═════════════════════════╤═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│    │   last started at (UTC) │ errors per attempt                                                                                                                      │
╞════╪═════════════════════════╪═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ Mon 2026-02-16 21:17:05 │ 1: test...: false, sequential_np_actions: false }): called `Result::unwrap()` on an `Err` value: Func="check if latest registry vers... ┤
│  1 │ Mon 2026-02-16 16:05:58 │ 1: test...: false, sequential_np_actions: false }): Failed to read new message on 2602:fb2b:110:10:6801:dcff:fe23:9b23                  │
├────┼─────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  2 │ Mon 2026-02-16 10:25:54 │ 1: test...: false, sequential_np_actions: false }): called `Result::unwrap()` on an `Err` value: Func="check if latest registry vers... │
├────┼─────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  3 │ Mon 2026-02-16 10:18:33 │ 1: test...: false, sequential_np_actions: false }): No healthy node found                                                               │
╘════╧═════════════════════════╧═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛
$ bazel run //ci/githubstats:query -- last --flaky --week //rs/tests/nested/nns_recovery:nr_no_bless_fix_like_np --columns=last_started_at,errors
...
╒════╤═════════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│    │   last started at (UTC) │ errors per attempt                                                                                                                     │
╞════╪═════════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ Fri 2026-02-20 13:33:11 │ 1: test...: true, sequential_np_actions: false }): called `Result::unwrap()` on an `Err` value: Func="check if latest registry vers... │
├────┼─────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  1 │ Thu 2026-02-19 15:49:23 │ 1: test...: false, fix_dfinity_owned_node_like_np: true, sequential_np_actions: false }): Timeout after 1800s                          │
├────┼─────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  2 │ Thu 2026-02-19 14:56:38 │ 1: test...: called `Result::unwrap()` on an `Err` value: Func="check if latest registry vers...                                        │
├────┼─────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  3 │ Mon 2026-02-16 14:58:14 │ 1: test...: Timeout after 1800s                                                                                                        │
╘════╧═════════════════════════╧════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛

Claude Opus 4.6 determined the following root causes and proposed the accompanying fixes:

Root Cause Analysis

The nr_local test flaked 4 times in the past week with three distinct failure modes:

1. Registration timeout (2/4 failures)

In registration() (rs/tests/nested/src/lib.rs), the inner block_for_newer_registry_version_within_duration().unwrap() panics on timeout instead of allowing the outer retry loop to handle the error. The inner timeout (10 min) equals the outer timeout, so a single slow node registration exhausts the entire budget.

Errors:

  • check if latest registry version >= 2 timed out after 603s
  • check if latest registry version >= 5 timed out after 600s

2. No healthy node found (1/4 failures)

node_with_highest_certification_share_height() calls .expect() without retry logic. If metrics endpoints aren't immediately available after a subnet membership change, the test panics.

3. Failed to read new message (1/4 failures)

assert_subnet_is_healthy() uses can_read_msg_with_retries with only 5 retries (25 seconds), which is insufficient for state propagation after subnet membership changes.

Fixes

  1. Registration: Replace .unwrap() with ? to propagate errors to outer retry. Reduce inner timeout from 10 min to 60 sec. Increase outer timeout from 10 min to 15 min.
  2. Metrics gathering: Add retry logic (120s timeout, 10s backoff) around node_with_highest_certification_share_height in common.rs.
  3. Message read retries: Increase from 5 to 20 retries in assert_subnet_is_healthy().

Generated following .claude/skills/fix-flaky-tests/SKILL.md.

Fix three sources of flakiness in the nr_local test:

1. Registration timeout (2/4 failures): In registration(), the inner
   block_for_newer_registry_version_within_duration().unwrap() would
   panic on timeout instead of allowing the outer retry loop to handle
   the error. The inner timeout (10 min) equaled the outer timeout, so
   one slow node registration exhausted the entire budget. Fixed by:
   - Replacing .unwrap() with ? to propagate errors to the outer retry
   - Reducing the inner timeout from 10 min to 60 sec
   - Increasing the outer timeout from 10 min to 15 min

2. No healthy node found (1/4 failures): node_with_highest_certification_share_height()
   calls .expect() without any retry logic. If metrics endpoints aren't
   immediately available after a subnet membership change, the test
   panics. Fixed by adding retry logic (120s timeout, 10s backoff)
   around the metrics gathering in common.rs.

3. Failed to read new message (1/4 failures): assert_subnet_is_healthy()
   used can_read_msg_with_retries with only 5 retries (25 seconds),
   which is insufficient for state propagation after subnet membership
   changes. Increased to 20 retries (100 seconds).
@github-actions github-actions bot added the fix label Feb 21, 2026
@basvandijk basvandijk added the CI_ALL_BAZEL_TARGETS Runs all bazel targets and uploads them to S3 label Feb 21, 2026
@basvandijk basvandijk marked this pull request as ready for review February 22, 2026 00:09
@basvandijk basvandijk requested review from a team as code owners February 22, 2026 00:09
@basvandijk basvandijk changed the title fix: deflake //rs/tests/nested/nns_recovery:nr_local fix: deflake //rs/tests/nested/nns_recovery:(nr_local|nr_no_bless_fix_like_np) Feb 22, 2026
@basvandijk basvandijk changed the title fix: deflake //rs/tests/nested/nns_recovery:(nr_local|nr_no_bless_fix_like_np) fix: deflake //rs/tests/nested/nns_recovery:{nr_local,nr_no_bless_fix_like_np} Feb 22, 2026
// Download pool from the node with the highest certification share height.
// Retry because metrics endpoints may not be immediately available after
// the subnet membership change.
let (download_pool_node, highest_cert_share) = retry_with_msg!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better to make node_with_highest_certification_share_height return Result instead and update the rest of the users to unwrap the returned Result. Or alternatively, we could put the retry logic directly into node_with_highest_certification_share_height. Perhaps other tests would also benefit from it.

new_topology = block_on(
new_topology.block_for_newer_registry_version_within_duration(
NODE_REGISTRATION_TIMEOUT,
NODE_REGISTRATION_VERSION_BACKOFF,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how using a lower timeout works here (assuming that block_for_newer_registry_version_within_duration works correctly). The code waits for NODE_REGISTRATION_VERSION_BACKOFF (1 minute) for the newer registry. The newer registry is not yet available after 1 minute, the call fails, the error is propagated back to retry_with_msg! on line 108. retry_with_msg! waits NODE_REGISTRATION_BACKOFF and runs block_for_newer_registry_version_within_duration again which waits for another minute. Then the whole wait - error - retry loop keeps repeating for NODE_REGISTRATION_TIMEOUT (was 10 min, is now 15 minutes).
How is this not equiavalent to just running block_for_newer_registry_version_within_duration for NODE_REGISTRATION_TIMEOUT? The only reason this could work differently if block_for_newer_registry_version_within_duration sporadically does not return even though a newer registry version is available but I don't think that's the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI_ALL_BAZEL_TARGETS Runs all bazel targets and uploads them to S3 @consensus fix @node

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants