Fix flaky artifact downloads in setup-universe CI #7001

@xmfcx

Description

Problem

The setup-universe CI workflow frequently fails due to transient network timeouts when downloading ML model artifacts. The Ansible artifacts role (ansible/roles/artifacts/tasks/main.yaml) has 77 ansible.builtin.get_url tasks, all using the default 10-second socket timeout with zero retries.

These artifacts are only useful for end users -- they are not used anywhere in CI. Despite this, every PR run downloads them, causing frequent flaky failures.

Example failure from this run:

```
TimeoutError: The read operation timed out
url: https://s3.ap-northeast-2.wasabisys.com/pinto-model-zoo/136_road-segmentation-adas-0001/resources.tar.gz
timeout: 10
```

The external servers are healthy (the files download fine manually), but CI runners see variable network performance, which makes the 10-second timeout too tight.

Analysis

Current Architecture

  1. setup-universe.yaml runs ./setup-dev-env.sh --ros-distro ${{ matrix.ros_distro }} -y -v universe
  2. The -y flag unconditionally sets prompt_download_artifacts=y
  3. The universe.yaml playbook conditionally includes the artifacts role
  4. ansible/roles/artifacts/tasks/main.yaml (~776 lines) contains 77 ansible.builtin.get_url tasks

Download sources (77 total):

  • 74 files from awf.ml.dev.web.auto
  • 2 files from autoware-files.s3.us-west-2.amazonaws.com
  • 1 file from s3.ap-northeast-2.wasabisys.com (Wasabi S3, Korea region)

Per-task configuration:

  • timeout: not set (defaults to 10 seconds -- root cause)
  • retries: not set (defaults to 0)
  • force: not set (defaults to false -- skips if file exists with correct checksum)
  • checksum: set on every task (SHA256)

CI runner context:

  • Self-hosted runners ([self-hosted, Linux, X64])
  • Runs inside ephemeral containers (ubuntu:22.04, ubuntu:24.04)
  • No caching between runs, no persistent volume for autoware_data

Options Considered

Option 1: Add timeout and retries to all get_url tasks

Add timeout: 300, retries: 3, delay: 10, until: result is not failed to each of the 77 tasks.
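Applied to a single task, the pattern would look roughly like this (URL taken from the failing example above; `data_dir` and the checksum value are placeholders, not the role's actual variables):

```yaml
- name: Download road-segmentation-adas-0001 resources
  ansible.builtin.get_url:
    url: https://s3.ap-northeast-2.wasabisys.com/pinto-model-zoo/136_road-segmentation-adas-0001/resources.tar.gz
    dest: "{{ data_dir }}/road-segmentation-adas-0001/resources.tar.gz"  # data_dir is illustrative
    checksum: sha256:placeholder  # each task already carries its real SHA256
    timeout: 300                  # raise the 10 s default socket timeout
  register: result
  retries: 3
  delay: 10
  until: result is not failed
```

Multiplying this boilerplate across all 77 tasks is what makes the option tedious.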

Pros: Directly addresses the root cause. Improves reliability for both CI and local developers. No infrastructure changes.
Cons: Extremely repetitive -- 77 tasks need modification (~308 lines added). Every new artifact task must remember to include these parameters.
Effort: Small (mechanical, but tedious)
Risk: Very low

Option 2: Refactor downloads into a data-driven loop

Replace 77 individual tasks with a variable list of {url, dest, checksum} items and a single get_url task with loop:. Timeout and retry configured once.
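A sketch of the data-driven shape (variable names, `data_dir`, and the sample entry are assumptions, not the role's current layout):

```yaml
# vars: one list entry per artifact (sample entry is illustrative)
artifacts:
  - url: https://s3.ap-northeast-2.wasabisys.com/pinto-model-zoo/136_road-segmentation-adas-0001/resources.tar.gz
    dest: road-segmentation-adas-0001/resources.tar.gz
    checksum: sha256:placeholder

# tasks: a single get_url with timeout/retry configured once
- name: Download artifacts
  ansible.builtin.get_url:
    url: "{{ item.url }}"
    dest: "{{ data_dir }}/{{ item.dest }}"
    checksum: "{{ item.checksum }}"
    timeout: 300
  loop: "{{ artifacts }}"
  register: result
  retries: 3
  delay: 10
  until: result is not failed
```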

Pros: File shrinks from ~776 to ~100 lines. Adding new artifacts is a single list entry. Timeout/retry configured once.
Cons: Larger refactor. The 3 unarchive tasks need separate handling. Migration could introduce subtle errors (checksums provide a safety net).
Effort: Medium
Risk: Medium (migration errors, but checksums catch corruption)

Option 3: Skip artifact downloads in PR CI, validate separately

Since the artifacts are only for end users and not used in CI at all, skip them in PR runs. Validate artifact URLs/checksums on a daily schedule and when the artifact task file is modified.

Pros: Eliminates the problem entirely. CI runs much faster. Broken URLs are still caught within a day or on the PR that changes them.
Cons: None significant -- the artifacts are not used in CI anyway.
Effort: Trivial (a few lines in setup-dev-env.sh + a scheduled workflow)
Risk: Very low

Option 4: GitHub Actions cache for autoware_data

Use actions/cache to persist the autoware_data directory between runs.
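The step might look like this (the `path` is an assumption; keying on the task file means the cache invalidates whenever the artifact list changes):

```yaml
- name: Cache downloaded artifacts
  uses: actions/cache@v4
  with:
    path: ~/autoware_data
    key: autoware-data-${{ hashFiles('ansible/roles/artifacts/tasks/main.yaml') }}
```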

Pros: Downloads happen once, then are cached.
Cons: GitHub Actions cache has a 10 GB limit, and 77 ML model files may exceed it. The initial run still has the timeout problem.
Effort: Small
Risk: Medium (cache eviction causes surprise full-download runs)

Option 5: Use module_defaults for timeout only

Ansible module_defaults at the play level applies to all get_url tasks:

```yaml
module_defaults:
  ansible.builtin.get_url:
    timeout: 300
```

Pros: A single small change fixes the timeout for all 77 tasks.
Cons: Only fixes the timeout, not retries. Downloads still run on every PR (slow, wasteful).
Effort: Trivial
Risk: Very low

Option 6: Pre-download artifacts on self-hosted runner host

Maintain autoware_data on the runner host and bind-mount into containers.
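If the runner host kept a warm copy, the workflow's container job could bind-mount it via Docker options (the host path is an assumption about the runner setup):

```yaml
container:
  image: ubuntu:22.04
  options: --volume /opt/autoware_data:/root/autoware_data:ro
```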

Pros: Zero code changes.
Cons: Requires runner infrastructure access. Stale files if the task list changes. Operational burden.
Effort: External ops work
Risk: Medium (ops drift)

Rejected Alternatives

  • Mirror to GitHub Releases: 2GB per-file limit, mirroring maintenance burden. Existing hosts are fine; the issue is timeout config.
  • Use aria2/wget via ansible.builtin.command: Loses checksum verification, adds complexity.
  • Bake artifacts into Docker images: Would increase image size by hundreds of MB to GB. Project deliberately keeps them out of images.
  • GitHub LFS repo: Expensive bandwidth limits for large binary models.

Tradeoff Summary

| Option | Correctness | Complexity | Maintainability | CI Performance |
|---|---|---|---|---|
| Opt 1: Per-task timeout | Full | Low | Poor (77x) | Slower on failure |
| Opt 2: Loop refactor | Full | Medium | Good | Same as Opt 1 |
| Opt 3: Skip + schedule | Full (daily + on change) | Low | Good | Much faster |
| Opt 4: Cache | Partial | Low-Med | Good | Fast after 1st run |
| Opt 5: module_defaults | Partial (no retry) | Very low | Good | Slower on failure |
| Opt 6: Runner pre-download | Full | High (ops) | Poor | Fast |

Proposed Fix

Phase 1: Unblock PR CI

  1. Skip artifact downloads in PR CI: Add a --no-download-artifacts flag to setup-dev-env.sh and use it in the setup-universe workflow. These artifacts are only for end users and serve no purpose in CI.
  2. Increase timeout globally: Add module_defaults for ansible.builtin.get_url with timeout: 300 in ansible/playbooks/universe.yaml. This makes the download more robust for end users and for the scheduled validation.
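A minimal sketch of both Phase 1 changes (the flag name comes from this proposal; the play structure of universe.yaml and the workflow step are assumed):

```yaml
# ansible/playbooks/universe.yaml -- play-level default for all get_url tasks
- hosts: localhost
  module_defaults:
    ansible.builtin.get_url:
      timeout: 300
  # roles/tasks unchanged below ...

# .github/workflows/setup-universe.yaml -- skip artifact downloads in PR CI
- run: ./setup-dev-env.sh --ros-distro ${{ matrix.ros_distro }} -y -v --no-download-artifacts universe
```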

Phase 2: Scheduled validation

  1. Add a scheduled workflow: A daily workflow (and on changes to ansible/roles/artifacts/tasks/main.yaml) that runs only artifact downloads to validate URLs and checksums remain correct.
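The trigger portion of such a workflow could look like this (workflow name, cron time, and the artifacts-only invocation are illustrative, not existing entry points):

```yaml
name: validate-artifact-downloads
on:
  schedule:
    - cron: "0 12 * * *"  # daily
  pull_request:
    paths:
      - ansible/roles/artifacts/tasks/main.yaml
jobs:
  validate:
    runs-on: [self-hosted, Linux, X64]
    steps:
      - uses: actions/checkout@v4
      # hypothetical invocation that runs only the artifacts role
      - run: ./setup-dev-env.sh -y -v universe
```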

Phase 3: Improve maintainability (optional follow-up)

  1. Refactor into a loop: Replace 77 individual get_url tasks with a data-driven loop and centralized retry/timeout config.
