Skip to content

Comments

fix: deflake //rs/tests/consensus:replica_determinism_test#8982

Open
basvandijk wants to merge 1 commit intomasterfrom
ai/deflake-replica_determinism_test
Open

fix: deflake //rs/tests/consensus:replica_determinism_test#8982
basvandijk wants to merge 1 commit intomasterfrom
ai/deflake-replica_determinism_test

Conversation

@basvandijk
Copy link
Collaborator

@basvandijk basvandijk commented Feb 21, 2026

The //rs/tests/consensus:replica_determinism_test is slightly flaky:

$ bazel run //ci/githubstats:query -- top 1 flaky --include //rs/tests/consensus:replica_determinism_test
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━━┑
│    │ label                                         │   total │   non_success │   flaky │   timeout │   fail │   non_success% │   flaky% │   timeout% │   fail% │   impact │   total duration │   duration_p90 │ owners    │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━━┥
│  0 │ //rs/tests/consensus:replica_determinism_test │     412 │             6 │       5 │         0 │      1 │            1.5 │      1.2 │          0 │     0.2 │    18:36 │         21:17:12 │           3:06 │ consensus │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━━┙

Usually because of the same failure mode:

$ bazel run //ci/githubstats:query -- last --flaky --week //rs/tests/consensus:replica_determinism_test --columns=last_started_at,errors
...
╒════╤═════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕
│    │   last started at (UTC) │ errors per attempt                                                                           │
╞════╪═════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ Thu 2026-02-19 14:49:33 │ 1: test: failed to update: TransportError(Reqwest(reqwest::Error { kind: Request, url: "h... │
├────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│  1 │ Thu 2026-02-19 14:13:55 │ 1: test: failed to update: TransportError(reqwest::Error { kind: Request, url: "http://[2... │
├────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│  2 │ Thu 2026-02-19 10:54:09 │ 1: test: failed to update: TransportError(reqwest::Error { kind: Request, url: "http://[2... │
├────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│  3 │ Wed 2026-02-18 12:34:03 │ 1: test: failed to update: UncertifiedReject { reject: RejectResponse { reject_code: Cani... │
├────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│  4 │ Tue 2026-02-17 17:14:02 │ 1: test: failed to update: TransportError(Reqwest(reqwest::Error { kind: Request, url: "h... │
╘════╧═════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛

Claude Opus 4.6 provides the following Root Cause Analysis and accompanying fix:

Root Cause

After the malicious node restarts and reports healthy via await_status_is_healthy_async(), it may still need time to catch up to the latest state. The canister's wasm module may not yet be available, or the HTTP endpoint may not be fully ready for call processing.

The second loop of update calls (after node restart) used .expect("failed to update") with no retry logic. When the restarted node was healthy but not yet caught up, the first update call would fail with:

  • TransportErrorBrokenPipe, ConnectionRefused, or TimedOut (4 out of 5 flaky runs)
  • UncertifiedReject"Requested canister has no wasm module" IC0537 (1 out of 5 flaky runs)

Fix

Wrap the first update call after restart in retry_with_msg_async! (120s timeout, 5s backoff) to allow the node time to fully catch up before asserting on subsequent calls. The remaining iterations proceed without retry since after the first succeeds, the node is confirmed to be fully caught up.

Skill: .claude/skills/fix-flaky-tests/SKILL.md.

After the malicious node restarts and reports healthy, it may still need
time to catch up to the latest state. The canister's wasm module may not
yet be available, or the HTTP endpoint may not be fully ready for call
processing.

Root cause: The second loop of update calls (after node restart) used
.expect("failed to update") with no retry logic. When the restarted
node was healthy but not yet caught up, the first update call would fail
with TransportError (BrokenPipe, ConnectionRefused, TimedOut) or
UncertifiedReject ("Requested canister has no wasm module" IC0537).

Fix: Wrap the first update call after restart in retry_with_msg_async!
(120s timeout, 5s backoff) to allow the node time to fully catch up
before asserting on subsequent calls.

Skill: .claude/skills/fix-flaky-tests/SKILL.md
@github-actions github-actions bot added the fix label Feb 21, 2026
@basvandijk basvandijk marked this pull request as ready for review February 21, 2026 16:30
@basvandijk basvandijk requested a review from a team as a code owner February 21, 2026 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant