fix: deflake //rs/tests/networking:network_reliability_test_colocate #8986

Open
basvandijk wants to merge 1 commit into master from ai/deflake-network_reliability_test

@basvandijk basvandijk commented Feb 21, 2026

Claude Opus 4.6 produced the following root cause analysis, and the accompanying fix, for the single flake of //rs/tests/networking:network_reliability_test_colocate in the last week. I'm slightly doubtful about the analysis, but the fix to retry canister installation makes sense.

Root Cause

The test intermittently fails with "Certificate is stale (over 270s)" when installing the counter canister in Step 2. This happens because the IC agent's certificate expires when Steps 0 (readiness check) and 1 (NNS install) take longer than usual, making the certificate >270s old by the time canister creation is attempted.

In the failed run, Step 0 took ~2.5 min and Step 1 took ~2 min, so by the time Step 2 tried to create a canister (~4.5 min, i.e. roughly 270s, elapsed), the agent's root key certificate was stale.

Fix

  1. Added try_create_and_install_canister_with_arg_and_cycles to IcNodeSnapshot in rs/tests/driver/src/driver/test_env_api.rs — a fallible variant that returns Result<Principal, String> instead of panicking.

  2. Wrapped the canister installation in retry_with_msg! with a 120s timeout and 5s backoff. Each retry creates a fresh agent with a new root key, avoiding the stale certificate issue.
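
The retry pattern described in the two steps above can be sketched roughly as follows. This is a minimal, standalone illustration and not the actual change in this PR: `retry_with_timeout` and `try_install_stub` are hypothetical names standing in for `retry_with_msg!` and the new fallible install method, and the stub only simulates a "Certificate is stale" failure rather than talking to an agent. The point it shows is that a fallible operation returning `Result` can be retried until a deadline, whereas a panicking one cannot.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper (NOT the real `retry_with_msg!` macro): calls `op`
/// until it returns `Ok`, or until another backoff sleep would exceed `timeout`.
fn retry_with_timeout<T, F>(timeout: Duration, backoff: Duration, mut op: F) -> Result<T, String>
where
    F: FnMut(u32) -> Result<T, String>,
{
    let start = Instant::now();
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Give up if sleeping once more would pass the deadline.
                if start.elapsed() + backoff > timeout {
                    return Err(format!("gave up after {} attempts: {}", attempt, err));
                }
                sleep(backoff);
            }
        }
    }
}

/// Hypothetical stand-in for a fallible canister install: fails with a
/// stale-certificate error on the first `failures` attempts, then succeeds.
fn try_install_stub(attempt: u32, failures: u32) -> Result<String, String> {
    if attempt <= failures {
        Err("Certificate is stale (over 270s)".to_string())
    } else {
        Ok(format!("canister-installed-on-attempt-{}", attempt))
    }
}
```

With the values from this PR, the call would use `Duration::from_secs(120)` as the timeout and `Duration::from_secs(5)` as the backoff, and the closure would build a fresh agent on every attempt before installing.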

Automation

This fix was generated following the workflow in .claude/skills/fix-flaky-tests/SKILL.md.

@github-actions github-actions bot added the fix label Feb 21, 2026
@basvandijk basvandijk marked this pull request as ready for review February 21, 2026 22:39
@basvandijk basvandijk requested review from a team as code owners February 21, 2026 22:39