fix: stabilize //rs/dogecoin/ckdoge/minter:integration_tests #8978

Open
basvandijk wants to merge 3 commits into master from basvandijk/deflake-ckdoge-minter-integration-tests

@basvandijk basvandijk commented Feb 21, 2026

The //rs/dogecoin/ckdoge/minter:integration_tests target often fails or times out:

$ bazel run //ci/githubstats:query -- top 1 non_success --week --include //rs/dogecoin/ckdoge/minter:integration_tests
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┑
│    │ label                                         │   total │   non_success │   flaky │   timeout │   fail │   non_success% │   flaky% │   timeout% │   fail% │   impact │   total duration │   duration_p90 │ owners   │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┥
│  0 │ //rs/dogecoin/ckdoge/minter:integration_tests │     522 │            64 │       5 │        14 │     45 │           12.3 │        1 │        2.7 │     8.6 │  2:58:08 │  1 days 00:12:54 │           2:47 │ defi     │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┙

Since the test times out so often, this PR increases its timeout to long (15m).

In addition, we downloaded the logs of its most recent non-successful runs:

$ bazel run //ci/githubstats:query -- last //rs/dogecoin/ckdoge/minter:integration_tests --non_success --week
...
Downloading logs to: /ic/logs/integration_tests/2026-02-21T09:59:02
...

We then prompted Claude Opus 4.6 with:

Figure out why //rs/dogecoin/ckdoge/minter:integration_tests implemented in rs/dogecoin/ckdoge/minter/tests/tests.rs is often failing or timing out. The logs of its non-successful runs can be found in: logs/integration_tests/2026-02-21T09:59:02.

It responded with the following Root Cause Analysis and accompanying fix:

fix: deflake //rs/dogecoin/ckdoge/minter:integration_tests

Fixes two independent root causes that made this test target fail or time out on nearly every CI run.

Root Cause Analysis

Analysis based on non-successful runs from the last week (logs/integration_tests/2026-02-21T09:59:02/README.md) (65 entries). Three failure modes were observed:

1. Timeout — should_cancel_and_reimburse_large_withdrawal (most impactful)

The test created 1,900 UTXOs to trigger a TooManyInputs reimbursement (limit: 1,000 inputs).
Each UTXO requires multiple cross-canister calls in PocketIC during minter_update_balance() (KYT check + ledger mint), resulting in ~3,800+ inter-canister messages. This consistently exceeded the 5-minute Bazel test timeout.

The test intended to set max_num_inputs_in_transaction: Some(500) via the init args to use fewer UTXOs, but the ckBTC minter's From<InitArgs> for CkBtcMinterState implementation silently ignores this field and always uses DEFAULT_MAX_NUM_INPUTS_IN_TRANSACTION (1,000). The field is only respected during upgrades (via reinit). This meant the test needed >1,000 UTXOs to trigger the error, making it infeasibly slow.

Fix: Set max_num_inputs_in_transaction to 100 via a canister upgrade (which does respect the arg), then use only 120 UTXOs. The test exercises the exact same TooManyInputs error path with 101 inputs > 100 max. Runtime dropped from >5 minutes (timeout) to ~50 seconds.
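
The init-versus-upgrade behavior described above can be illustrated with a minimal sketch. It is not the minter's actual code: `Args`, `State`, `upgrade`, `check_inputs`, and the `TooManyInputs` struct below are simplified, hypothetical stand-ins, and only the default of 1,000 and the 101 > 100 comparison are taken from the analysis.

```rust
/// Default limit used on the init path (1,000 according to the analysis above).
const DEFAULT_MAX_NUM_INPUTS_IN_TRANSACTION: usize = 1_000;

/// Hypothetical stand-in for the minter's init/upgrade arguments.
pub struct Args {
    pub max_num_inputs_in_transaction: Option<usize>,
}

/// Hypothetical stand-in for the minter state.
pub struct State {
    pub max_num_inputs_in_transaction: usize,
}

/// Hypothetical error mirroring the TooManyInputs failure mode.
#[derive(Debug, PartialEq)]
pub struct TooManyInputs {
    pub num_inputs: usize,
    pub max_num_inputs: usize,
}

impl From<Args> for State {
    /// Init path: the caller's value is silently dropped and the default is used.
    fn from(_args: Args) -> Self {
        State {
            max_num_inputs_in_transaction: DEFAULT_MAX_NUM_INPUTS_IN_TRANSACTION,
        }
    }
}

impl State {
    /// Upgrade path: the field is respected when it is present.
    pub fn upgrade(&mut self, args: Args) {
        if let Some(max) = args.max_num_inputs_in_transaction {
            self.max_num_inputs_in_transaction = max;
        }
    }

    /// Rejects a transaction that would use more inputs than the configured limit.
    pub fn check_inputs(&self, num_inputs: usize) -> Result<(), TooManyInputs> {
        if num_inputs > self.max_num_inputs_in_transaction {
            return Err(TooManyInputs {
                num_inputs,
                max_num_inputs: self.max_num_inputs_in_transaction,
            });
        }
        Ok(())
    }
}
```

In this sketch, constructing the state with `Some(500)` still yields a limit of 1,000, while a later `upgrade` with `Some(100)` lowers it, so 101 inputs are enough to hit the error path.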

2. Port conflict — dogecoind fails to bind (intermittent)

Port allocation in Daemon::new has a TOCTOU race: it binds port 0 to get a free port, records the number, drops the listener, then starts the daemon on that port. When 15 tests run in parallel, another process can grab the port in between, causing:

Error: Unable to bind to 0.0.0.0:38953 on this computer.
Error: Failed to listen on any port. Use -listen=0 if you want this.

This caused a random test (whichever happened to set up last) to panic at Daemon::new.
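
The racy allocation pattern can be sketched as follows. This is an illustration of the bind-port-0-then-drop technique, not the code in Daemon::new; `pick_free_port` and `start_daemon_on` are hypothetical names, and the latter merely binds the port as a stand-in for the daemon doing so.

```rust
use std::io;
use std::net::TcpListener;

/// Asks the OS for a free port by binding port 0, then releases it again.
/// The returned number is only a hint: nothing reserves the port after
/// the listener is dropped at the end of this function.
pub fn pick_free_port() -> io::Result<u16> {
    let listener = TcpListener::bind(("127.0.0.1", 0))?;
    let port = listener.local_addr()?.port();
    Ok(port) // `listener` is dropped here, so the port is free for anyone
}

/// Hypothetical stand-in for the daemon binding its port at startup.
/// Fails if another process grabbed the port between the check and the use.
pub fn start_daemon_on(port: u16) -> io::Result<TcpListener> {
    TcpListener::bind(("127.0.0.1", port))
}
```

The window between `pick_free_port` returning and `start_daemon_on` binding is where a parallel test can take the port.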

Fix: Added retry logic (up to 3 attempts) to Daemon::new. On startup failure (early exit or timeout waiting for "Done loading"), it kills the process, cleans up the data directory, allocates fresh ports, and retries.
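
A minimal sketch of such a retry loop, under stated assumptions: `start_with_retry` and `allocate_port` are hypothetical names, and the `start` closure stands in for spawning the daemon, waiting for "Done loading", and cleaning up on failure. The real Daemon::new may be structured differently.

```rust
use std::io;
use std::net::TcpListener;

/// Number of startup attempts (3 according to the description above).
const MAX_ATTEMPTS: usize = 3;

/// Picks a currently free port by binding port 0 and releasing it again.
fn allocate_port() -> io::Result<u16> {
    let listener = TcpListener::bind(("127.0.0.1", 0))?;
    Ok(listener.local_addr()?.port())
}

/// Calls `start` with a freshly allocated port, retrying on failure.
/// `start` is a hypothetical closure that is expected to kill the process
/// and clean up its data directory itself before returning an error.
pub fn start_with_retry<T, F>(mut start: F) -> io::Result<T>
where
    F: FnMut(u16) -> io::Result<T>,
{
    let mut last_err = None;
    for attempt in 1..=MAX_ATTEMPTS {
        // Allocate a fresh port for every attempt instead of reusing the old one.
        let port = allocate_port()?;
        match start(port) {
            Ok(daemon) => return Ok(daemon),
            Err(err) => {
                eprintln!(
                    "attempt {}/{} on port {} failed: {}",
                    attempt, MAX_ATTEMPTS, port, err
                );
                last_err = Some(err);
            }
        }
    }
    Err(last_err.expect("MAX_ATTEMPTS is at least 1"))
}
```

Retrying narrows the race rather than removing it: each attempt can still lose the port, but three consecutive collisions are much less likely than one.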

3. PocketIC server panics (external, not addressed here)

Some runs from the ic-nervous-system-wasms branch showed 14/15 tests failing with panics inside pocket_ic_server. These were caused by an unrelated broken PocketIC build on that branch and are not addressed by this PR.

Changes

| File | Change |
| --- | --- |
| rs/dogecoin/ckdoge/minter/tests/tests.rs | Rewrote should_cancel_and_reimburse_large_withdrawal to use 120 UTXOs with max_num_inputs=100 (set via upgrade) instead of 1,900 UTXOs with max_num_inputs=1,000 |
| rs/bitcoin/adapter/test_utils/src/bitcoind.rs | Added retry loop (3 attempts) to Daemon::new for resilience against transient port conflicts |

@github-actions github-actions bot added the fix label Feb 21, 2026
@basvandijk basvandijk marked this pull request as ready for review February 21, 2026 11:54
@basvandijk basvandijk requested a review from a team as a code owner February 21, 2026 11:54
@github-actions github-actions bot added the @defi label Feb 21, 2026
