fix: deflake //rs/tests/consensus/tecdsa:tecdsa_two_signing_subnets_test#8989

Open
basvandijk wants to merge 1 commit into master from ai/deflake-tecdsa_two_signing_subnets_test


@basvandijk basvandijk commented Feb 22, 2026

//rs/tests/consensus/tecdsa:tecdsa_two_signing_subnets_test has been flaky in the last week (4 of 124 runs, 3.2%):

$ bazel run //ci/githubstats:query -- top 1 flaky% --week --include //rs/tests/consensus/tecdsa:tecdsa_two_signing_subnets_test
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━━┑
│    │ label                                                       │   total │   non_success │   flaky │   timeout │   fail │   non_success% │   flaky% │   timeout% │   fail% │   impact │   total duration │   duration_p90 │ owners    │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━━┥
│  0 │ //rs/tests/consensus/tecdsa:tecdsa_two_signing_subnets_test │     124 │             4 │       4 │         0 │      0 │            3.2 │      3.2 │          0 │       0 │    21:52 │         11:17:52 │           5:28 │ consensus │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━━┙

Claude Opus 4.6 produced the following root cause analysis and accompanying fix:

Root Cause

All 4 flaky runs in the past week fail with the same error:

assert_no_metrics_errors: assertion `left == right` failed:
The metric `critical_errors{error="master_key_transcript_missing"}`
on node ... has non-zero value.
  left: 1
  right: 0

During key resharing from the NNS subnet to a new app subnet, the
master_key_transcript_missing critical error counter is transiently
incremented when a DKG summary boundary is reached before the resharing
completes. Since this is a Prometheus counter (monotonically increasing),
it stays non-zero even after resharing succeeds. The post-test
assert_no_metrics_errors teardown then detects this non-zero counter
and fails the test.
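
The failure mode described above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical (`Counter`, `simulate_resharing`, `teardown_check`, and the block heights); it is not the IC implementation, only a toy model of a monotonic counter that is incremented once when a DKG summary boundary precedes the end of resharing, and of a teardown check that rejects any non-zero value.

```rust
/// Hypothetical stand-in for a Prometheus counter: it can only increase.
#[derive(Debug, Default)]
struct Counter {
    value: u64,
}

impl Counter {
    fn inc(&mut self) {
        self.value += 1;
    }

    fn get(&self) -> u64 {
        self.value
    }
}

/// Toy model of the resharing window (an assumption, not the real logic).
/// `summary_height` is where a DKG summary boundary falls and
/// `resharing_done_height` is where resharing completes.
/// Returns whether resharing eventually succeeded, which is always true here.
fn simulate_resharing(
    counter: &mut Counter,
    summary_height: u64,
    resharing_done_height: u64,
) -> bool {
    if summary_height < resharing_done_height {
        // The summary is reached before the transcript exists, so the
        // critical error is counted once and never reset.
        counter.inc();
    }
    true
}

/// Mirrors the shape of the teardown assertion: any non-zero value fails.
fn teardown_check(counter: &Counter) -> Result<(), String> {
    if counter.get() == 0 {
        Ok(())
    } else {
        Err(format!("critical_errors has non-zero value: {}", counter.get()))
    }
}
```

In this model, resharing succeeds in both orderings; only the ordering of the two heights decides whether the teardown check fails, which matches the timing-dependent behavior reported here.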

The test's core logic (the test function) always passes: signing works
correctly on both subnets and the public key is verified. The flakiness
is purely a matter of timing, namely whether a DKG summary boundary falls
within the resharing window.

Fix

Exclude critical_errors from the post-test metrics check using
remove_metrics_to_check("critical_errors"), following the same pattern
used by dual_workload_test and max_xnet_payload_size_test, which have
similar transient critical error conditions.
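
As a rough illustration of what excluding a metric from the post-test check means, here is a self-contained sketch. `MetricsCheck`, its `new` and `violations` methods, and the `other_metric` name are hypothetical and do not reflect the real system test driver types; only the method name `remove_metrics_to_check` is taken from this PR, and its builder-style signature here is an assumption.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical model of a post-test metrics check (not the real driver type).
struct MetricsCheck {
    to_check: BTreeSet<String>,
}

impl MetricsCheck {
    /// Start from an assumed default set of metric names to check.
    fn new(defaults: &[&str]) -> Self {
        Self {
            to_check: defaults.iter().map(|s| s.to_string()).collect(),
        }
    }

    /// Builder-style removal, named after the call used in this PR.
    fn remove_metrics_to_check(mut self, name: &str) -> Self {
        self.to_check.remove(name);
        self
    }

    /// Return the names of checked metrics whose scraped value is non-zero.
    fn violations(&self, scraped: &BTreeMap<String, u64>) -> Vec<String> {
        self.to_check
            .iter()
            .filter(|name| scraped.get(*name).copied().unwrap_or(0) != 0)
            .cloned()
            .collect()
    }
}
```

With a scraped value of 1 for `critical_errors`, the default check in this model reports a violation, while the check built with `remove_metrics_to_check("critical_errors")` reports none. Note that in this model removal is by metric name, so every label under that name stops being checked, not only the transient one.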


Automated deflake guided by .claude/skills/fix-flaky-tests/SKILL.md.

@github-actions github-actions bot added the fix label Feb 22, 2026
@basvandijk basvandijk marked this pull request as ready for review February 22, 2026 10:59
@basvandijk basvandijk requested a review from a team as a code owner February 22, 2026 10:59