fix: deflake //rs/tests/networking:network_reliability_test_colocate #8986
Open
basvandijk wants to merge 1 commit into master from
The test intermittently fails with "Certificate is stale (over 270s)" when installing the counter canister in Step 2. This happens because the IC agent's certificate expires when Steps 0 (readiness check) and 1 (NNS install) take longer than usual, making the certificate >270s old by the time canister creation is attempted. Fix: Add a try_create_and_install_canister_with_arg_and_cycles method to IcNodeSnapshot that returns Result instead of panicking, and wrap the canister installation in retry_with_msg! with a 120s timeout and 5s backoff. Each retry creates a fresh agent with a new root key, avoiding the stale certificate issue.
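The retry behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the driver's code: `retry_with_timeout` is a hypothetical stand-in for `retry_with_msg!`, `try_install_counter_canister` is a hypothetical stub that simulates two stale-certificate failures, and the durations are shortened so the sketch runs quickly (the PR uses a 120s timeout and 5s backoff).

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the test driver's `retry_with_msg!` macro:
/// calls `op` until it succeeds or `timeout` has elapsed, sleeping
/// `backoff` between attempts. `op` receives the 1-based attempt number.
fn retry_with_timeout<T, E, F>(timeout: Duration, backoff: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    let start = Instant::now();
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Deadline passed: give up and surface the last error.
            Err(err) if start.elapsed() >= timeout => return Err(err),
            Err(_) => {
                sleep(backoff);
                attempt += 1;
            }
        }
    }
}

/// Hypothetical fallible install (returns `Result` instead of panicking).
/// It fails with the stale-certificate error on the first two attempts,
/// standing in for an agent that is recreated on each retry.
fn try_install_counter_canister(attempt: u32) -> Result<String, String> {
    if attempt < 3 {
        Err("Certificate is stale (over 270s)".to_string())
    } else {
        Ok(format!("installed on attempt {attempt}"))
    }
}

fn main() {
    let result = retry_with_timeout(
        Duration::from_millis(500),
        Duration::from_millis(10),
        try_install_counter_canister,
    );
    println!("{result:?}"); // prints: Ok("installed on attempt 3")
}
```

The key design point is that the operation must return an error rather than panic; otherwise the retry loop never gets a chance to run a second attempt.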
Claude Opus 4.6 produced the following root cause analysis of the single flake of `//rs/tests/networking:network_reliability_test_colocate` in the last week, along with the accompanying fix. I'm slightly doubtful about the analysis, but the fix to retry canister installation makes sense.
Root Cause
The test intermittently fails with `Certificate is stale (over 270s)` when installing the counter canister in Step 2. This happens because the IC agent's certificate expires when Steps 0 (readiness check) and 1 (NNS install) take longer than usual, making the certificate >270s old by the time canister creation is attempted.
In the failed run, Step 0 took ~2.5 min and Step 1 took ~2 min, so by the time Step 2 tried to create a canister (~4 min elapsed), the agent's root key certificate was stale.
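As a rough check of the timing argument, the sketch below adds up the approximate step durations and compares them with the 270s threshold from the error message. The `is_stale` helper and the assumption that the certificate's age is counted from the start of Step 0 are illustrative only; the real check lives in the agent and may measure age differently.

```rust
use std::time::Duration;

/// Staleness threshold quoted in the error message.
const MAX_CERT_AGE: Duration = Duration::from_secs(270);

/// Hypothetical check: a certificate counts as stale once its age
/// exceeds the threshold.
fn is_stale(cert_age: Duration) -> bool {
    cert_age > MAX_CERT_AGE
}

fn main() {
    // Approximate durations reported for the failed run.
    let step0 = Duration::from_secs(150); // ~2.5 min readiness check
    let step1 = Duration::from_secs(120); // ~2 min NNS install
    let elapsed = step0 + step1;

    println!("elapsed before Step 2: {}s", elapsed.as_secs());
    println!("stale at exactly that age: {}", is_stale(elapsed));
    // Any additional delay pushes the age over the threshold.
    println!(
        "stale one second later: {}",
        is_stale(elapsed + Duration::from_secs(1))
    );
}
```

Taken at face value, the two steps sum to about 270s, which sits right at the threshold, so a small amount of extra latency would be enough to trigger the error under this reading.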
Fix
Added `try_create_and_install_canister_with_arg_and_cycles` to `IcNodeSnapshot` in `rs/tests/driver/src/driver/test_env_api.rs`: a fallible variant that returns `Result<Principal, String>` instead of panicking.
Wrapped the canister installation in `retry_with_msg!` with a 120s timeout and 5s backoff. Each retry creates a fresh agent with a new root key, avoiding the stale certificate issue.
Automation
This fix was generated following the workflow in `.claude/skills/fix-flaky-tests/SKILL.md`.