Fix flaky artifact downloads in setup-universe CI #7001

@xmfcx

Description

Problem

The setup-universe CI workflow frequently fails due to transient network timeouts when downloading ML model artifacts. The Ansible artifacts role (ansible/roles/artifacts/tasks/main.yaml) has 77 ansible.builtin.get_url tasks, all using the default 10-second socket timeout with zero retries.

These artifacts are only useful for end users -- they are not used anywhere in CI. Despite this, every PR run downloads them, causing frequent flaky failures.

Example failure from this run:

```
TimeoutError: The read operation timed out
url: https://s3.ap-northeast-2.wasabisys.com/pinto-model-zoo/136_road-segmentation-adas-0001/resources.tar.gz
timeout: 10
```

The external servers are healthy (the files download fine manually), but CI runners see variable network performance, which makes the 10-second timeout too tight.

Analysis

Current Architecture

  1. setup-universe.yaml runs ./setup-dev-env.sh --ros-distro ${{ matrix.ros_distro }} -y -v universe
  2. The -y flag unconditionally sets prompt_download_artifacts=y
  3. The universe.yaml playbook conditionally includes the artifacts role
  4. ansible/roles/artifacts/tasks/main.yaml (~776 lines) contains 77 ansible.builtin.get_url tasks

Download sources (77 total):

  • 74 files from awf.ml.dev.web.auto
  • 2 files from autoware-files.s3.us-west-2.amazonaws.com
  • 1 file from s3.ap-northeast-2.wasabisys.com (Wasabi S3, Korea region)

Per-task configuration:

  • timeout: not set (defaults to 10 seconds -- root cause)
  • retries: not set (defaults to 0)
  • force: not set (defaults to false -- skips if file exists with correct checksum)
  • checksum: set on every task (SHA256)

CI runner context:

  • Self-hosted runners ([self-hosted, Linux, X64])
  • Runs inside ephemeral containers (ubuntu:22.04, ubuntu:24.04)
  • No caching between runs, no persistent volume for autoware_data

Options Considered

Option 1: Add timeout and retries to all get_url tasks

Add timeout: 300, retries: 3, delay: 10, until: result is not failed to each of the 77 tasks.
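Applied to a single task, the pattern would look roughly like this (URL taken from the failing example above; `data_dir` and the checksum value are placeholders, not the role's actual variables):

```yaml
- name: Download road-segmentation-adas-0001 resources
  ansible.builtin.get_url:
    url: https://s3.ap-northeast-2.wasabisys.com/pinto-model-zoo/136_road-segmentation-adas-0001/resources.tar.gz
    dest: "{{ data_dir }}/road-segmentation-adas-0001/resources.tar.gz"  # data_dir is illustrative
    checksum: sha256:placeholder  # each task already carries its real SHA256
    timeout: 300                  # raise the 10 s default socket timeout
  register: result
  retries: 3
  delay: 10
  until: result is not failed
```

Multiplying this boilerplate across all 77 tasks is what makes the option tedious.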

Pros: Directly addresses the root cause. Improves reliability for both CI and local developers. No infrastructure changes.
Cons: Extremely repetitive -- 77 tasks need modification (~308 lines added). Every new artifact task must remember to include these parameters.
Effort: Small (mechanical, but tedious)
Risk: Very low

Option 2: Refactor downloads into a data-driven loop

Replace 77 individual tasks with a variable list of {url, dest, checksum} items and a single get_url task with loop:. Timeout and retry configured once.
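A sketch of the data-driven shape (variable names, `data_dir`, and the sample entry are assumptions, not the role's current layout):

```yaml
# vars: one list entry per artifact (sample entry is illustrative)
artifacts:
  - url: https://s3.ap-northeast-2.wasabisys.com/pinto-model-zoo/136_road-segmentation-adas-0001/resources.tar.gz
    dest: road-segmentation-adas-0001/resources.tar.gz
    checksum: sha256:placeholder

# tasks: a single get_url with timeout/retry configured once
- name: Download artifacts
  ansible.builtin.get_url:
    url: "{{ item.url }}"
    dest: "{{ data_dir }}/{{ item.dest }}"
    checksum: "{{ item.checksum }}"
    timeout: 300
  loop: "{{ artifacts }}"
  register: result
  retries: 3
  delay: 10
  until: result is not failed
```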

Pros: File shrinks from ~776 to ~100 lines. Adding new artifacts is a single list entry. Timeout/retry configured once.
Cons: Larger refactor. The 3 unarchive tasks need separate handling. Migration could introduce subtle errors (checksums provide a safety net).
Effort: Medium
Risk: Medium (migration errors, but checksums catch corruption)

Option 3: Skip artifact downloads in PR CI, validate separately

Since the artifacts are only for end users and not used in CI at all, skip them in PR runs. Validate artifact URLs/checksums on a daily schedule and when the artifact task file is modified.

Pros: Eliminates the problem entirely. CI runs much faster. Broken URLs are still caught within a day or on the PR that changes them.
Cons: None significant -- the artifacts are not used in CI anyway.
Effort: Trivial (a few lines in setup-dev-env.sh + a scheduled workflow)
Risk: Very low

Option 4: GitHub Actions cache for autoware_data

Use actions/cache to persist the autoware_data directory between runs.
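The step might look like this (the `path` is an assumption; keying on the task file means the cache invalidates whenever the artifact list changes):

```yaml
- name: Cache downloaded artifacts
  uses: actions/cache@v4
  with:
    path: ~/autoware_data
    key: autoware-data-${{ hashFiles('ansible/roles/artifacts/tasks/main.yaml') }}
```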

Pros: Downloads happen once, then are cached.
Cons: GitHub Actions cache has a 10 GB limit, and 77 ML model files may exceed it. The initial run still has the timeout problem.
Effort: Small
Risk: Medium (cache eviction causes surprise full-download runs)

Option 5: Use module_defaults for timeout only

Ansible module_defaults at the play level applies to all get_url tasks:

```yaml
module_defaults:
  ansible.builtin.get_url:
    timeout: 300
```

Pros: A single small change fixes the timeout for all 77 tasks.
Cons: Only fixes the timeout, not retries. Downloads still run on every PR (slow, wasteful).
Effort: Trivial
Risk: Very low

Option 6: Pre-download artifacts on self-hosted runner host

Maintain autoware_data on the runner host and bind-mount into containers.
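If the runner host kept a warm copy, the workflow's container job could bind-mount it via Docker options (the host path is an assumption about the runner setup):

```yaml
container:
  image: ubuntu:22.04
  options: --volume /opt/autoware_data:/root/autoware_data:ro
```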

Pros: Zero code changes.
Cons: Requires runner infrastructure access. Stale files if the task list changes. Operational burden.
Effort: External ops work
Risk: Medium (ops drift)

Rejected Alternatives

  • Mirror to GitHub Releases: 2GB per-file limit, mirroring maintenance burden. Existing hosts are fine; the issue is timeout config.
  • Use aria2/wget via ansible.builtin.command: Loses checksum verification, adds complexity.
  • Bake artifacts into Docker images: Would increase image size by hundreds of MB to GB. Project deliberately keeps them out of images.
  • GitHub LFS repo: Expensive bandwidth limits for large binary models.

Tradeoff Summary

| Option | Correctness | Complexity | Maintainability | CI Performance |
|---|---|---|---|---|
| Opt 1: Per-task timeout | Full | Low | Poor (77x) | Slower on failure |
| Opt 2: Loop refactor | Full | Medium | Good | Same as Opt 1 |
| Opt 3: Skip + schedule | Full (daily + on change) | Low | Good | Much faster |
| Opt 4: Cache | Partial | Low-Med | Good | Fast after 1st run |
| Opt 5: module_defaults | Partial (no retry) | Very low | Good | Slower on failure |
| Opt 6: Runner pre-download | Full | High (ops) | Poor | Fast |

Proposed Fix

Phase 1: Unblock PR CI

  1. Skip artifact downloads in PR CI: Add a --no-download-artifacts flag to setup-dev-env.sh and use it in the setup-universe workflow. These artifacts are only for end users and serve no purpose in CI.
  2. Increase timeout globally: Add module_defaults for ansible.builtin.get_url with timeout: 300 in ansible/playbooks/universe.yaml. This makes the download more robust for end users and for the scheduled validation.
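A minimal sketch of both Phase 1 changes (the flag name comes from this proposal; the play structure of universe.yaml and the workflow step are assumed):

```yaml
# ansible/playbooks/universe.yaml -- play-level default for all get_url tasks
- hosts: localhost
  module_defaults:
    ansible.builtin.get_url:
      timeout: 300
  # roles/tasks unchanged below ...

# .github/workflows/setup-universe.yaml -- skip artifact downloads in PR CI
- run: ./setup-dev-env.sh --ros-distro ${{ matrix.ros_distro }} -y -v --no-download-artifacts universe
```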

Phase 2: Scheduled validation

  1. Add a scheduled workflow: A daily workflow (and on changes to ansible/roles/artifacts/tasks/main.yaml) that runs only artifact downloads to validate URLs and checksums remain correct.
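The trigger portion of such a workflow could look like this (workflow name, cron time, and the artifacts-only invocation are illustrative, not existing entry points):

```yaml
name: validate-artifact-downloads
on:
  schedule:
    - cron: "0 12 * * *"  # daily
  pull_request:
    paths:
      - ansible/roles/artifacts/tasks/main.yaml
jobs:
  validate:
    runs-on: [self-hosted, Linux, X64]
    steps:
      - uses: actions/checkout@v4
      # hypothetical invocation that runs only the artifacts role
      - run: ./setup-dev-env.sh -y -v universe
```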

Phase 3: Improve maintainability (optional follow-up)

  1. Refactor into a loop: Replace 77 individual get_url tasks with a data-driven loop and centralized retry/timeout config.
