Problem
The `setup-universe` CI workflow frequently fails due to transient network timeouts when downloading ML model artifacts. The Ansible `artifacts` role (`ansible/roles/artifacts/tasks/main.yaml`) has 77 `ansible.builtin.get_url` tasks, all using the default 10-second socket timeout with zero retries.

These artifacts are only useful for end users -- they are not used anywhere in CI. Despite this, every PR run downloads them, causing frequent flaky failures.
Example failure from this run:

```text
TimeoutError: The read operation timed out
url: https://s3.ap-northeast-2.wasabisys.com/pinto-model-zoo/136_road-segmentation-adas-0001/resources.tar.gz
timeout: 10
```
The external servers are healthy (the files download fine manually), but CI runners see variable network performance, making the 10-second timeout too tight.
Analysis
Current Architecture
- `setup-universe.yaml` runs `./setup-dev-env.sh --ros-distro ${{ matrix.ros_distro }} -y -v universe`
- The `-y` flag unconditionally sets `prompt_download_artifacts=y`
- The `universe.yaml` playbook conditionally includes the `artifacts` role
- `ansible/roles/artifacts/tasks/main.yaml` (~776 lines) contains 77 `ansible.builtin.get_url` tasks
Download sources (77 total):
- 74 files from `awf.ml.dev.web.auto`
- 2 files from `autoware-files.s3.us-west-2.amazonaws.com`
- 1 file from `s3.ap-northeast-2.wasabisys.com` (Wasabi S3, Korea region)
Per-task configuration:
- `timeout`: not set (defaults to 10 seconds -- root cause)
- `retries`: not set (defaults to 0)
- `force`: not set (defaults to false -- skips if the file already exists with the correct checksum)
- `checksum`: set on every task (SHA256)
CI runner context:
- Self-hosted runners (`[self-hosted, Linux, X64]`)
- Runs inside ephemeral containers (`ubuntu:22.04`, `ubuntu:24.04`)
- No caching between runs, no persistent volume for `autoware_data`
Options Considered
Option 1: Add timeout and retries to all `get_url` tasks

Add `timeout: 300`, `retries: 3`, `delay: 10`, `until: result is not failed` to each of the 77 tasks.

| | |
|---|---|
| Pros | Directly addresses the root cause. Improves reliability for both CI and local developers. No infrastructure changes. |
| Cons | Extremely repetitive -- 77 tasks need modification (~308 lines added). Every new artifact must remember to include these parameters. |
| Effort | Small (mechanical, but tedious) |
| Risk | Very low |
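As a sketch, one modified task would look like this. The URL is the one from the failure above; the task name, `dest` path, and `data_dir` variable are illustrative, and the checksum is elided:

```yaml
- name: Download road-segmentation-adas-0001 resources  # name is illustrative
  ansible.builtin.get_url:
    url: https://s3.ap-northeast-2.wasabisys.com/pinto-model-zoo/136_road-segmentation-adas-0001/resources.tar.gz
    dest: "{{ data_dir }}/resources.tar.gz"  # data_dir is a placeholder
    checksum: "sha256:..."  # real tasks pin a SHA256 here
    timeout: 300            # raise the 10 s default socket timeout
  register: result
  retries: 3
  delay: 10
  until: result is not failed
```

Multiplied across 77 tasks, the `register`/`retries`/`delay`/`until` boilerplate is exactly what makes this option tedious.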
Option 2: Refactor downloads into a data-driven loop
Replace 77 individual tasks with a variable list of `{url, dest, checksum}` items and a single `get_url` task with `loop:`. Timeout and retry are configured once.

| | |
|---|---|
| Pros | File shrinks from ~776 to ~100 lines. Adding a new artifact is a single list entry. Timeout/retry configured once. |
| Cons | Larger refactor. The 3 `unarchive` tasks need separate handling. Migration could introduce subtle errors (checksums provide a safety net). |
| Effort | Medium |
| Risk | Medium (migration errors, but checksums catch corruption) |
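A minimal sketch of the refactor, assuming the list lives in the role's defaults (file locations and the `data_dir` variable are illustrative; `retries`/`until` apply per loop item):

```yaml
# roles/artifacts/defaults/main.yaml (illustrative location)
artifacts:
  - url: https://s3.ap-northeast-2.wasabisys.com/pinto-model-zoo/136_road-segmentation-adas-0001/resources.tar.gz
    dest: resources.tar.gz
    checksum: "sha256:..."  # checksum elided
  # ...76 more entries

# roles/artifacts/tasks/main.yaml (illustrative)
- name: Download artifact
  ansible.builtin.get_url:
    url: "{{ item.url }}"
    dest: "{{ data_dir }}/{{ item.dest }}"  # data_dir is a placeholder
    checksum: "{{ item.checksum }}"
    timeout: 300
  loop: "{{ artifacts }}"
  register: result
  retries: 3
  delay: 10
  until: result is not failed
```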
Option 3: Skip artifact downloads in PR CI, validate separately
Since the artifacts are only for end users and not used in CI at all, skip them in PR runs. Validate artifact URLs/checksums on a daily schedule and when the artifact task file is modified.
| | |
|---|---|
| Pros | Eliminates the problem entirely. CI runs much faster. Broken URLs are still caught within a day, or on the PR that changes them. |
| Cons | None significant -- the artifacts are not used in CI anyway. |
| Effort | Trivial (a few lines in `setup-dev-env.sh` + a scheduled workflow) |
| Risk | Very low |
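The validation workflow's triggers could look like the following sketch -- the cron time is an assumption; the path is the real task file:

```yaml
on:
  schedule:
    - cron: "0 0 * * *"  # daily; exact time is an assumption
  pull_request:
    paths:
      - ansible/roles/artifacts/tasks/main.yaml
```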
Option 4: GitHub Actions cache for autoware_data
Use `actions/cache` to persist the `autoware_data` directory between runs.

| | |
|---|---|
| Pros | Downloads happen once, then are cached. |
| Cons | GitHub Actions cache has a 10 GB limit; 77 ML model files may exceed it. The initial run still has the timeout problem. |
| Effort | Small |
| Risk | Medium (cache eviction causes surprise full-download runs) |
Option 5: Use `module_defaults` for timeout only

Ansible `module_defaults` at the play level applies to all `get_url` tasks:

```yaml
module_defaults:
  ansible.builtin.get_url:
    timeout: 300
```

| | |
|---|---|
| Pros | A single small change fixes the timeout for all 77 tasks. |
| Cons | Only fixes the timeout, not retries (`retries`/`until` are task keywords, not module arguments, so `module_defaults` cannot set them). Downloads still run on every PR (slow, wasteful). |
| Effort | Trivial |
| Risk | Very low |
Option 6: Pre-download artifacts on self-hosted runner host
Maintain `autoware_data` on the runner host and bind-mount it into the containers.

| | |
|---|---|
| Pros | Zero code changes. |
| Cons | Requires runner infrastructure access. Stale files if the task list changes. Operational burden. |
| Effort | External ops work |
| Risk | Medium (ops drift) |
Rejected Alternatives
- Mirror to GitHub Releases: 2 GB per-file limit, plus mirroring maintenance burden. The existing hosts are fine; the issue is timeout configuration.
- Use `aria2`/`wget` via `ansible.builtin.command`: loses checksum verification, adds complexity.
- Bake artifacts into Docker images: would increase image size by hundreds of MB to GB. The project deliberately keeps them out of images.
- Git LFS repo: expensive GitHub bandwidth limits for large binary models.
Tradeoff Summary
| | Correctness | Complexity | Maintainability | CI Performance |
|---|---|---|---|---|
| Opt 1: Per-task timeout | Full | Low | Poor (77x) | Slower on failure |
| Opt 2: Loop refactor | Full | Medium | Good | Same as Opt 1 |
| Opt 3: Skip + schedule | Full (daily + on change) | Low | Good | Much faster |
| Opt 4: Cache | Partial | Low-Med | Good | Fast after 1st run |
| Opt 5: `module_defaults` | Partial (no retry) | Very low | Good | Slower on failure |
| Opt 6: Runner pre-download | Full | High (ops) | Poor | Fast |
Proposed Fix
Phase 1: Unblock PR CI
- Skip artifact downloads in PR CI: add a `--no-download-artifacts` flag to `setup-dev-env.sh` and use it in the `setup-universe` workflow. These artifacts are only for end users and serve no purpose in CI.
- Increase the timeout globally: add `module_defaults` for `ansible.builtin.get_url` with `timeout: 300` in `ansible/playbooks/universe.yaml`. This makes downloads more robust for end users and for the scheduled validation.
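For illustration, the workflow invocation might become the following -- a sketch assuming the proposed flag; the step structure and name are illustrative:

```yaml
- name: Set up development environment  # step name is illustrative
  run: ./setup-dev-env.sh --ros-distro ${{ matrix.ros_distro }} -y -v --no-download-artifacts universe
```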
Phase 2: Scheduled validation
- Add a scheduled workflow: a daily workflow (also triggered on changes to `ansible/roles/artifacts/tasks/main.yaml`) that runs only the artifact downloads, validating that URLs and checksums remain correct.
Phase 3: Improve maintainability (optional follow-up)
- Refactor into a loop: replace the 77 individual `get_url` tasks with a data-driven loop and centralized retry/timeout config.