fix: deflake //rs/tests/consensus:replica_determinism_test by basvandijk · Pull Request #8982 · dfinity/ic

basvandijk · 2026-02-21T16:17:05Z

The //rs/tests/consensus:replica_determinism_test is slightly flaky:

$ bazel run //ci/githubstats:query -- top 1 flaky --include //rs/tests/consensus:replica_determinism_test
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━━┑
│    │ label                                         │   total │   non_success │   flaky │   timeout │   fail │   non_success% │   flaky% │   timeout% │   fail% │   impact │   total duration │   duration_p90 │ owners    │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━━┥
│  0 │ //rs/tests/consensus:replica_determinism_test │     412 │             6 │       5 │         0 │      1 │            1.5 │      1.2 │          0 │     0.2 │    18:36 │         21:17:12 │           3:06 │ consensus │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━━┙

Usually because of the same failure mode:

$ bazel run //ci/githubstats:query -- last --flaky --week //rs/tests/consensus:replica_determinism_test --columns=last_started_at,errors
...
╒════╤═════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕
│    │   last started at (UTC) │ errors per attempt                                                                           │
╞════╪═════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ Thu 2026-02-19 14:49:33 │ 1: test: failed to update: TransportError(Reqwest(reqwest::Error { kind: Request, url: "h... │
├────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│  1 │ Thu 2026-02-19 14:13:55 │ 1: test: failed to update: TransportError(reqwest::Error { kind: Request, url: "http://[2... │
├────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│  2 │ Thu 2026-02-19 10:54:09 │ 1: test: failed to update: TransportError(reqwest::Error { kind: Request, url: "http://[2... │
├────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│  3 │ Wed 2026-02-18 12:34:03 │ 1: test: failed to update: UncertifiedReject { reject: RejectResponse { reject_code: Cani... │
├────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│  4 │ Tue 2026-02-17 17:14:02 │ 1: test: failed to update: TransportError(Reqwest(reqwest::Error { kind: Request, url: "h... │
╘════╧═════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛

Claude Opus 4.6 provides the following Root Cause Analysis and accompanying fix:

Root Cause

After the malicious node restarts and reports healthy via await_status_is_healthy_async(), it may still need time to catch up to the latest state. The canister's wasm module may not yet be available, or the HTTP endpoint may not be fully ready for call processing.

The second loop of update calls (after node restart) used .expect("failed to update") with no retry logic. When the restarted node was healthy but not yet caught up, the first update call would fail with:

TransportError — BrokenPipe, ConnectionRefused, or TimedOut (4 out of 5 flaky runs)
UncertifiedReject — "Requested canister has no wasm module" IC0537 (1 out of 5 flaky runs)

Fix

Wrap the first update call after restart in retry_with_msg_async! (120s timeout, 5s backoff) to allow the node time to fully catch up before asserting on subsequent calls. The remaining iterations proceed without retry since after the first succeeds, the node is confirmed to be fully caught up.

Skill: .claude/skills/fix-flaky-tests/SKILL.md.

After the malicious node restarts and reports healthy, it may still need time to catch up to the latest state. The canister's wasm module may not yet be available, or the HTTP endpoint may not be fully ready for call processing. Root cause: The second loop of update calls (after node restart) used .expect("failed to update") with no retry logic. When the restarted node was healthy but not yet caught up, the first update call would fail with TransportError (BrokenPipe, ConnectionRefused, TimedOut) or UncertifiedReject ("Requested canister has no wasm module" IC0537). Fix: Wrap the first update call after restart in retry_with_msg_async! (120s timeout, 5s backoff) to allow the node time to fully catch up before asserting on subsequent calls. Skill: .claude/skills/fix-flaky-tests/SKILL.md

github-actions bot added the fix label Feb 21, 2026

basvandijk marked this pull request as ready for review February 21, 2026 16:30

basvandijk requested a review from a team as a code owner February 21, 2026 16:30

github-actions bot added the @consensus label Feb 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

fix: deflake //rs/tests/consensus:replica_determinism_test#8982

fix: deflake //rs/tests/consensus:replica_determinism_test#8982
basvandijk wants to merge 1 commit intomasterfrom
ai/deflake-replica_determinism_test

basvandijk commented Feb 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

basvandijk commented Feb 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root Cause

Fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

basvandijk commented Feb 21, 2026 •

edited

Loading