fix: deflake //rs/tests/nested/nns_recovery:{nr_local,nr_no_bless_fix_like_np}#8987
basvandijk wants to merge 2 commits into master
Conversation
Fix three sources of flakiness in the `nr_local` test:

1. Registration timeout (2/4 failures): In `registration()`, the inner `block_for_newer_registry_version_within_duration().unwrap()` would panic on timeout instead of allowing the outer retry loop to handle the error. The inner timeout (10 min) equaled the outer timeout, so one slow node registration exhausted the entire budget. Fixed by:
   - Replacing `.unwrap()` with `?` to propagate errors to the outer retry
   - Reducing the inner timeout from 10 min to 60 sec
   - Increasing the outer timeout from 10 min to 15 min
2. No healthy node found (1/4 failures): `node_with_highest_certification_share_height()` calls `.expect()` without any retry logic. If metrics endpoints aren't immediately available after a subnet membership change, the test panics. Fixed by adding retry logic (120s timeout, 10s backoff) around the metrics gathering in common.rs.
3. Failed to read new message (1/4 failures): `assert_subnet_is_healthy()` used `can_read_msg_with_retries` with only 5 retries (25 seconds), which is insufficient for state propagation after subnet membership changes. Increased to 20 retries (100 seconds).
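The error-propagation pattern behind fix 1 can be sketched in a few lines. This is a minimal stand-alone model, not the test's actual code: `block_for_newer_version`, `registration_step`, and the attempt-counting `registration` loop are all hypothetical stand-ins for the inner wait, the `?` call site, and the outer `retry_with_msg!` loop.

```rust
// Hypothetical sketch of fix 1: the inner wait returns Err on timeout, and `?`
// hands that error to the surrounding retry loop instead of panicking.
fn block_for_newer_version(attempt: u32, ready_on_attempt: u32) -> Result<u32, String> {
    if attempt >= ready_on_attempt {
        Ok(attempt) // registry version advanced within the inner timeout
    } else {
        Err(format!("inner wait timed out on attempt {attempt}"))
    }
}

// One step of registration: with `?` a timeout becomes a retryable error;
// with .unwrap() it would abort the whole test run.
fn registration_step(attempt: u32, ready_on_attempt: u32) -> Result<u32, String> {
    let version = block_for_newer_version(attempt, ready_on_attempt)?;
    Ok(version)
}

// Minimal stand-in for the outer retry loop (retry_with_msg! in the PR).
fn registration(max_attempts: u32, ready_on_attempt: u32) -> Result<u32, String> {
    let mut last_err = String::new();
    for attempt in 0..max_attempts {
        match registration_step(attempt, ready_on_attempt) {
            Ok(v) => return Ok(v),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

With `.unwrap()` in place of `?`, the first timed-out attempt would panic and the remaining outer-loop budget would be wasted; with `?`, later attempts still get a chance to succeed.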
```rust
// Download pool from the node with the highest certification share height.
// Retry because metrics endpoints may not be immediately available after
// the subnet membership change.
let (download_pool_node, highest_cert_share) = retry_with_msg!(
```
It would be better to make `node_with_highest_certification_share_height` return a `Result` instead, and update the rest of the callers to unwrap the returned `Result`. Alternatively, we could put the retry logic directly into `node_with_highest_certification_share_height`; other tests might also benefit from it.
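The reviewer's suggestion could look roughly like this. All names here are hypothetical (the real helper lives in common.rs and queries node metrics); the sketch only shows the shape: the helper retries internally with a timeout and backoff, and returns a `Result` that callers may unwrap or propagate.

```rust
use std::time::{Duration, Instant};

// Generic retry helper: re-run a fallible operation until it succeeds or the
// time budget is spent, returning Result instead of panicking via .expect().
fn retry_with_timeout<T, E: std::fmt::Debug>(
    timeout: Duration,
    backoff: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, String> {
    let start = Instant::now();
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Keep retrying while another backoff still fits in the budget.
            Err(e) if start.elapsed() + backoff < timeout => {
                let _ = e; // the real code would log the error here
                std::thread::sleep(backoff);
            }
            Err(e) => return Err(format!("timed out after {:?}: {:?}", start.elapsed(), e)),
        }
    }
}

// Hypothetical wrapper mirroring node_with_highest_certification_share_height:
// metrics gathering is retried (120s timeout, 10s backoff in the PR; shortened
// here so the sketch runs quickly) rather than expected to succeed first try.
fn node_with_highest_share(
    mut gather: impl FnMut() -> Result<u64, &'static str>,
) -> Result<u64, String> {
    retry_with_timeout(Duration::from_millis(500), Duration::from_millis(10), &mut gather)
}
```

A caller that wants the old panicking behavior can still write `node_with_highest_share(...).expect("no healthy node found")`, while tests that expect transient unavailability get a retry for free.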
```rust
new_topology = block_on(
    new_topology.block_for_newer_registry_version_within_duration(
        NODE_REGISTRATION_TIMEOUT,
        NODE_REGISTRATION_VERSION_BACKOFF,
```
I don't understand how using a lower timeout works here (assuming that `block_for_newer_registry_version_within_duration` works correctly). The code waits for NODE_REGISTRATION_VERSION_BACKOFF (1 minute) for the newer registry. If the newer registry is not yet available after 1 minute, the call fails and the error is propagated back to `retry_with_msg!` on line 108. `retry_with_msg!` waits NODE_REGISTRATION_BACKOFF and runs `block_for_newer_registry_version_within_duration` again, which waits for another minute. Then the whole wait-error-retry loop keeps repeating for NODE_REGISTRATION_TIMEOUT (was 10 min, is now 15 min).
How is this not equivalent to just running `block_for_newer_registry_version_within_duration` for NODE_REGISTRATION_TIMEOUT? The only way this could work differently is if `block_for_newer_registry_version_within_duration` sporadically failed to return even though a newer registry version is available, but I don't think that's the case.
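The equivalence claim can be modeled in a few lines. This sketch is hypothetical: time is integer ticks, `wait_flat` stands in for one long `block_for_newer_registry_version_within_duration` call, and `wait_nested` stands in for shorter inner waits inside an outer retry loop (the outer backoff between retries is omitted; including it would only make the nested variant slower). Assuming the inner wait reliably notices the condition, both succeed at the same tick.

```rust
// Time is modeled in ticks; the condition becomes observable at `ready_at`.
// `wait_flat` polls once per tick for up to `timeout` ticks total.
fn wait_flat(ready_at: u32, timeout: u32) -> Option<u32> {
    (0..timeout).find(|&t| t >= ready_at)
}

// `wait_nested` runs inner waits of `inner` ticks inside an outer loop bounded
// by `outer` total ticks, mirroring retry_with_msg! wrapped around
// block_for_newer_registry_version_within_duration.
fn wait_nested(ready_at: u32, inner: u32, outer: u32) -> Option<u32> {
    let mut elapsed = 0;
    while elapsed < outer {
        let end = (elapsed + inner).min(outer);
        if let Some(t) = (elapsed..end).find(|&t| t >= ready_at) {
            return Some(t); // inner wait observed the condition
        }
        elapsed = end; // inner timeout: error propagates, outer loop retries
    }
    None // overall timeout exhausted
}
```

Both variants find the condition at the same tick, or both time out, which is the reviewer's point: splitting one long wait into retried shorter waits changes nothing unless the inner wait itself is unreliable.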
`//rs/tests/nested/nns_recovery:nr_local` and `//rs/tests/nested/nns_recovery:nr_no_bless_fix_like_np` were both flaky in the last week, due to several different reasons.
Claude Opus 4.6 determined the following root causes and proposed the accompanying fixes:
Root Cause Analysis
The `nr_local` test flaked 4 times in the past week with three distinct failure modes:

1. Registration timeout (2/4 failures)

In `registration()` (rs/tests/nested/src/lib.rs), the inner `block_for_newer_registry_version_within_duration().unwrap()` panics on timeout instead of allowing the outer retry loop to handle the error. The inner timeout (10 min) equals the outer timeout, so a single slow node registration exhausts the entire budget.

Errors:
- `check if latest registry version >= 2` timed out after 603s
- `check if latest registry version >= 5` timed out after 600s

2. No healthy node found (1/4 failures)
`node_with_highest_certification_share_height()` calls `.expect()` without retry logic. If metrics endpoints aren't immediately available after a subnet membership change, the test panics.

3. Failed to read new message (1/4 failures)
`assert_subnet_is_healthy()` uses `can_read_msg_with_retries` with only 5 retries (25 seconds), which is insufficient for state propagation after subnet membership changes.

Fixes
1. Replace `.unwrap()` with `?` to propagate errors to the outer retry. Reduce the inner timeout from 10 min to 60 sec. Increase the outer timeout from 10 min to 15 min.
2. Add retry logic (120s timeout, 10s backoff) around the metrics gathering in `node_with_highest_certification_share_height` in common.rs.
3. Increase `can_read_msg_with_retries` from 5 to 20 retries in `assert_subnet_is_healthy()`.

Generated following .claude/skills/fix-flaky-tests/SKILL.md.
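The retry budget behind fix 3 is simple arithmetic. Assuming `can_read_msg_with_retries` waits a fixed backoff between attempts (the 5-retries-to-25-seconds figure in the description implies a 5 s backoff), the total budget is just retries times backoff:

```rust
// Total wait budget when each retry is separated by a fixed backoff.
// The 5 s backoff is inferred from the figures in the PR description.
fn retry_budget_secs(retries: u32, backoff_secs: u32) -> u32 {
    retries * backoff_secs
}
```

So going from 5 to 20 retries quadruples the budget from 25 s to 100 s, giving state propagation after a membership change four times as long to complete.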