Skip to content

Comments

fix: deflake //rs/nns/integration_tests:integration_tests_src/node_provider_remuneration#8981

Open
basvandijk wants to merge 2 commits intomasterfrom
ai/deflake-node_provider_remuneration
Open

fix: deflake //rs/nns/integration_tests:integration_tests_src/node_provider_remuneration#8981
basvandijk wants to merge 2 commits intomasterfrom
ai/deflake-node_provider_remuneration

Conversation

@basvandijk
Copy link
Collaborator

@basvandijk basvandijk commented Feb 21, 2026

The test //rs/nns/integration_tests:integration_tests_src/node_provider_remuneration is slightly flaky:

$ bazel run //ci/githubstats:query -- top 1 flaky --include //rs/nns/integration_tests:integration_tests_src/node_provider_remuneration
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━┑
│    │ label                                                                       │   total │   non_success │   flaky │   timeout │   fail │   non_success% │   flaky% │   timeout% │   fail% │   impact │   total duration │   duration_p90 │ owners          │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━┥
│  0 │ //rs/nns/integration_tests:integration_tests_src/node_provider_remuneration │     402 │             5 │       5 │         0 │      0 │            1.2 │      1.2 │          0 │       0 │    21:45 │  1 days 05:08:42 │           4:21 │ governance-team │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━┙

Claude Opus 4.6 made the follow Root Cause Analysis and accompanying fix:

Root Cause

The test was flaky because governance's automated reward minting flow requires many inter-canister calls (NRC, CMC, Ledger, self-call) spanning 8+ ticks, but only 2 ticks were provided after each monthly metrics loop. Additionally, the retry loops used advance_time() without tick(), so governance's run_periodic_tasks never actually ran during retries.

Failure Patterns

Detailed Flow

  • After feeding 31 days of daily node metrics, governance detects it is time to mint rewards during the final wait_for_nrc_metrics_sync ticks
  • The NRC may not have synced all needed days yet, causing the mint to fail with No metrics found for day X
  • After wait_for_nrc_metrics_sync exits, only 2 explicit ticks were provided - insufficient for the full inter-canister call chain
  • The retry loops called advance_time() but never tick(), so governance never ran periodic tasks during retries

Fix

  • Added a wait_for_rewards_minting() helper that keeps calling tick() until governance completes the full reward minting flow
  • Replaced all insufficient 2-tick + non-ticking-retry patterns with calls to this helper
  • Removed seconds_advanced_test_2 tracking that compensated for the now-removed retry loops

Skill: .claude/skills/fix-flaky-tests/SKILL.md.

…ovider_remuneration

The test was flaky because governance's automated reward minting flow
requires many inter-canister calls (NRC, CMC, Ledger, self-call) spanning
8+ ticks, but only 2 ticks were provided after each monthly metrics loop.
Additionally, the retry loops used advance_time() without tick(), so
governance's run_periodic_tasks never actually ran during retries.

Root cause:
- After feeding 31 days of daily node metrics in the inner loop,
  governance detects it's time to mint rewards during the final
  wait_for_nrc_metrics_sync ticks
- The NRC may not have synced all needed days yet at that point,
  causing the mint to fail with "No metrics found for day X"
- After wait_for_nrc_metrics_sync exits, only 2 explicit ticks were
  provided - insufficient for the full inter-canister call chain
- The subsequent retry loops called advance_time() and query() but
  never tick(), so governance never got another chance to run periodic
  tasks and complete the minting

Fix:
- Add a wait_for_rewards_minting() helper that keeps calling tick()
  until governance successfully completes the full reward minting flow
  (detected by a change in the most_recent_monthly_node_provider_rewards
  timestamp)
- Replace all insufficient 2-tick + non-ticking-retry patterns with
  calls to this helper
- Remove the seconds_advanced_test_2 tracking that compensated for
  time advanced in the now-removed retry loops
@github-actions github-actions bot added the fix label Feb 21, 2026
@basvandijk basvandijk marked this pull request as ready for review February 21, 2026 15:58
@basvandijk basvandijk requested a review from a team as a code owner February 21, 2026 15:58
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request changes code owned by the Governance team. Therefore, make sure that
you have considered the following (for Governance-owned code):

  1. Update unreleased_changelog.md (if there are behavior changes, even if they are
    non-breaking).

  2. Are there BREAKING changes?

  3. Is a data migration needed?

  4. Security review?

How to Satisfy This Automatic Review

  1. Go to the bottom of the pull request page.

  2. Look for where it says this bot is requesting changes.

  3. Click the three dots to the right.

  4. Select "Dismiss review".

  5. In the text entry box, respond to each of the numbered items in the previous
    section, declare one of the following:

  • Done.

  • $REASON_WHY_NO_NEED. E.g. for unreleased_changelog.md, "No
    canister behavior changes.", or for item 2, "Existing APIs
    behave as before.".

Brief Guide to "Externally Visible" Changes

"Externally visible behavior change" is very often due to some NEW canister API.

Changes to EXISTING APIs are more likely to be "breaking".

If these changes are breaking, make sure that clients know how to migrate, how to
maintain their continuity of operations.

If your changes are behind a feature flag, then, do NOT add entrie(s) to
unreleased_changelog.md in this PR! But rather, add entrie(s) later, in the PR
that enables these changes in production.

Reference(s)

For a more comprehensive checklist, see here.

GOVERNANCE_CHECKLIST_REMINDER_DEDUP

previous_timestamp: u64,
) -> MonthlyNodeProviderRewards {
for i in 0..500 {
state_machine.tick();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the fact that it ticks explicitly, but I also think it would be good to keep the advance_time (perhaps 1s for each tick).

The point the LLM made - the previous logic only advance_time without ticking, might not fully explain flakiness since nns_get_most_recent_monthly_node_provider_rewards also ticks (implicitly). However, I do think it's better to tick explicitly. On top of that, I think keeping the advance_time is more robust - for example if we switch the minting from heartbeat to timer, not advancing time wouldn't have the timer triggered. Ticking while advancing time is also more realistic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jasonz-dfinity! Would you be able to push your suggestions directly to the PR? I feel uncomfortable making these changes in tests I know very little about 😉

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also feel free to update the PR description of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants