fix: deflake //rs/tests/nested/nns_recovery:{nr_local,nr_no_bless_fix_like_np}#8987
basvandijk wants to merge 2 commits into master
Conversation
Fix three sources of flakiness in the `nr_local` test:

1. Registration timeout (2/4 failures): In `registration()`, the inner `block_for_newer_registry_version_within_duration().unwrap()` would panic on timeout instead of allowing the outer retry loop to handle the error. The inner timeout (10 min) equaled the outer timeout, so one slow node registration exhausted the entire budget. Fixed by:
   - Replacing `.unwrap()` with `?` to propagate errors to the outer retry
   - Reducing the inner timeout from 10 min to 60 sec
   - Increasing the outer timeout from 10 min to 15 min
2. No healthy node found (1/4 failures): `node_with_highest_certification_share_height()` calls `.expect()` without any retry logic. If metrics endpoints aren't immediately available after a subnet membership change, the test panics. Fixed by adding retry logic (120s timeout, 10s backoff) around the metrics gathering in common.rs.
3. Failed to read new message (1/4 failures): `assert_subnet_is_healthy()` used `can_read_msg_with_retries` with only 5 retries (25 seconds), which is insufficient for state propagation after subnet membership changes. Increased to 20 retries (100 seconds).
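The error-propagation pattern behind fix 1 can be sketched in a few lines. This is a minimal stand-alone model, not the test's actual code: `block_for_newer_version`, `registration_step`, and the attempt-counting `registration` loop are all hypothetical stand-ins for the inner wait, the `?` call site, and the outer `retry_with_msg!` loop.

```rust
// Hypothetical sketch of fix 1: the inner wait returns Err on timeout, and `?`
// hands that error to the surrounding retry loop instead of panicking.
fn block_for_newer_version(attempt: u32, ready_on_attempt: u32) -> Result<u32, String> {
    if attempt >= ready_on_attempt {
        Ok(attempt) // registry version advanced within the inner timeout
    } else {
        Err(format!("inner wait timed out on attempt {attempt}"))
    }
}

// One step of registration: with `?` a timeout becomes a retryable error;
// with .unwrap() it would abort the whole test run.
fn registration_step(attempt: u32, ready_on_attempt: u32) -> Result<u32, String> {
    let version = block_for_newer_version(attempt, ready_on_attempt)?;
    Ok(version)
}

// Minimal stand-in for the outer retry loop (retry_with_msg! in the PR).
fn registration(max_attempts: u32, ready_on_attempt: u32) -> Result<u32, String> {
    let mut last_err = String::new();
    for attempt in 0..max_attempts {
        match registration_step(attempt, ready_on_attempt) {
            Ok(v) => return Ok(v),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

With `.unwrap()` in place of `?`, the first timed-out attempt would panic and the remaining outer-loop budget would be wasted; with `?`, later attempts still get a chance to succeed.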
```rust
// Download pool from the node with the highest certification share height.
// Retry because metrics endpoints may not be immediately available after
// the subnet membership change.
let (download_pool_node, highest_cert_share) = retry_with_msg!(
```
It would be better to make `node_with_highest_certification_share_height` return a `Result` instead, and update the rest of the callers to unwrap the returned `Result`. Alternatively, we could put the retry logic directly into `node_with_highest_certification_share_height`; other tests might also benefit from it.
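The reviewer's suggestion could look roughly like this. All names here are hypothetical (the real helper lives in common.rs and queries node metrics); the sketch only shows the shape: the helper retries internally with a timeout and backoff, and returns a `Result` that callers may unwrap or propagate.

```rust
use std::time::{Duration, Instant};

// Generic retry helper: re-run a fallible operation until it succeeds or the
// time budget is spent, returning Result instead of panicking via .expect().
fn retry_with_timeout<T, E: std::fmt::Debug>(
    timeout: Duration,
    backoff: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, String> {
    let start = Instant::now();
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Keep retrying while another backoff still fits in the budget.
            Err(e) if start.elapsed() + backoff < timeout => {
                let _ = e; // the real code would log the error here
                std::thread::sleep(backoff);
            }
            Err(e) => return Err(format!("timed out after {:?}: {:?}", start.elapsed(), e)),
        }
    }
}

// Hypothetical wrapper mirroring node_with_highest_certification_share_height:
// metrics gathering is retried (120s timeout, 10s backoff in the PR; shortened
// here so the sketch runs quickly) rather than expected to succeed first try.
fn node_with_highest_share(
    mut gather: impl FnMut() -> Result<u64, &'static str>,
) -> Result<u64, String> {
    retry_with_timeout(Duration::from_millis(500), Duration::from_millis(10), &mut gather)
}
```

A caller that wants the old panicking behavior can still write `node_with_highest_share(...).expect("no healthy node found")`, while tests that expect transient unavailability get a retry for free.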
```rust
new_topology = block_on(
    new_topology.block_for_newer_registry_version_within_duration(
        NODE_REGISTRATION_TIMEOUT,
        NODE_REGISTRATION_VERSION_BACKOFF,
```
I don't understand how using a lower timeout works here (assuming that `block_for_newer_registry_version_within_duration` works correctly). The code waits for NODE_REGISTRATION_VERSION_BACKOFF (1 minute) for the newer registry. If the newer registry is not yet available after 1 minute, the call fails and the error is propagated back to `retry_with_msg!` on line 108. `retry_with_msg!` waits NODE_REGISTRATION_BACKOFF and runs `block_for_newer_registry_version_within_duration` again, which waits for another minute. Then the whole wait-error-retry loop keeps repeating for NODE_REGISTRATION_TIMEOUT (was 10 min, is now 15 min).
How is this not equivalent to just running `block_for_newer_registry_version_within_duration` for NODE_REGISTRATION_TIMEOUT? The only way this could work differently is if `block_for_newer_registry_version_within_duration` sporadically failed to return even though a newer registry version is available, but I don't think that's the case.
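The equivalence claim can be modeled in a few lines. This sketch is hypothetical: time is integer ticks, `wait_flat` stands in for one long `block_for_newer_registry_version_within_duration` call, and `wait_nested` stands in for shorter inner waits inside an outer retry loop (the outer backoff between retries is omitted; including it would only make the nested variant slower). Assuming the inner wait reliably notices the condition, both succeed at the same tick.

```rust
// Time is modeled in ticks; the condition becomes observable at `ready_at`.
// `wait_flat` polls once per tick for up to `timeout` ticks total.
fn wait_flat(ready_at: u32, timeout: u32) -> Option<u32> {
    (0..timeout).find(|&t| t >= ready_at)
}

// `wait_nested` runs inner waits of `inner` ticks inside an outer loop bounded
// by `outer` total ticks, mirroring retry_with_msg! wrapped around
// block_for_newer_registry_version_within_duration.
fn wait_nested(ready_at: u32, inner: u32, outer: u32) -> Option<u32> {
    let mut elapsed = 0;
    while elapsed < outer {
        let end = (elapsed + inner).min(outer);
        if let Some(t) = (elapsed..end).find(|&t| t >= ready_at) {
            return Some(t); // inner wait observed the condition
        }
        elapsed = end; // inner timeout: error propagates, outer loop retries
    }
    None // overall timeout exhausted
}
```

Both variants find the condition at the same tick, or both time out, which is the reviewer's point: splitting one long wait into retried shorter waits changes nothing unless the inner wait itself is unreliable.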
`//rs/tests/nested/nns_recovery:nr_local` and `//rs/tests/nested/nns_recovery:nr_no_bless_fix_like_np` were both flaky in the last week, due to several different reasons.
Claude Opus 4.6 determined the following root causes and proposed the accompanying fixes:
Root Cause Analysis
The `nr_local` test flaked 4 times in the past week with three distinct failure modes:

1. Registration timeout (2/4 failures)

In `registration()` (rs/tests/nested/src/lib.rs), the inner `block_for_newer_registry_version_within_duration().unwrap()` panics on timeout instead of allowing the outer retry loop to handle the error. The inner timeout (10 min) equals the outer timeout, so a single slow node registration exhausts the entire budget.

Errors:
- `check if latest registry version >= 2` timed out after 603s
- `check if latest registry version >= 5` timed out after 600s

2. No healthy node found (1/4 failures)
`node_with_highest_certification_share_height()` calls `.expect()` without retry logic. If metrics endpoints aren't immediately available after a subnet membership change, the test panics.

3. Failed to read new message (1/4 failures)
`assert_subnet_is_healthy()` uses `can_read_msg_with_retries` with only 5 retries (25 seconds), which is insufficient for state propagation after subnet membership changes.

Fixes
1. Replace `.unwrap()` with `?` to propagate errors to the outer retry. Reduce the inner timeout from 10 min to 60 sec. Increase the outer timeout from 10 min to 15 min.
2. Add retry logic (120s timeout, 10s backoff) around the metrics gathering in `node_with_highest_certification_share_height` in common.rs.
3. Increase `can_read_msg_with_retries` from 5 to 20 retries in `assert_subnet_is_healthy()`.

Generated following .claude/skills/fix-flaky-tests/SKILL.md.
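The retry budget behind fix 3 is simple arithmetic. Assuming `can_read_msg_with_retries` waits a fixed backoff between attempts (the 5-retries-to-25-seconds figure in the description implies a 5 s backoff), the total budget is just retries times backoff:

```rust
// Total wait budget when each retry is separated by a fixed backoff.
// The 5 s backoff is inferred from the figures in the PR description.
fn retry_budget_secs(retries: u32, backoff_secs: u32) -> u32 {
    retries * backoff_secs
}
```

So going from 5 to 20 retries quadruples the budget from 25 s to 100 s, giving state propagation after a membership change four times as long to complete.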