port special tests from CircleCI to GHA #7396

Merged: 15 commits, Mar 8, 2023
8 changes: 0 additions & 8 deletions .github/unittest.sh → .github/scripts/setup-env.sh
@@ -95,11 +95,3 @@ echo '::endgroup::'
echo '::group::Collect PyTorch environment information'
python -m torch.utils.collect_env
echo '::endgroup::'

echo '::group::Install testing utilities'
pip install --progress-bar=off pytest pytest-mock pytest-cov
echo '::endgroup::'

echo '::group::Run tests'
pytest --durations=25
echo '::endgroup::'
18 changes: 18 additions & 0 deletions .github/scripts/unittest.sh
@@ -0,0 +1,18 @@
#!/usr/bin/env bash

set -euo pipefail

./.github/scripts/setup-env.sh

# Prepare conda
CONDA_PATH=$(which conda)
eval "$(${CONDA_PATH} shell.bash hook)"
conda activate ci
Comment on lines +7 to +10
Collaborator Author
@osalpekar #7189 (comment) becomes even more relevant now. Without it, we need to repeat the top two lines everywhere. I'll get on it.


echo '::group::Install testing utilities'
pip install --progress-bar=off pytest pytest-mock pytest-cov
echo '::endgroup::'

echo '::group::Run unittests'
pytest --durations=25
echo '::endgroup::'
68 changes: 65 additions & 3 deletions .github/workflows/test-linux.yml
@@ -1,4 +1,4 @@
name: Unit-tests on Linux
name: Tests on Linux

on:
pull_request:
@@ -10,7 +10,7 @@ on:
workflow_dispatch:

jobs:
tests:
unittests:
strategy:
matrix:
python-version:
@@ -34,8 +34,70 @@ jobs:
gpu-arch-version: ${{ matrix.gpu-arch-version }}
timeout: 120
script: |
set -euo pipefail

export PYTHON_VERSION=${{ matrix.python-version }}
export GPU_ARCH_TYPE=${{ matrix.gpu-arch-type }}
export GPU_ARCH_VERSION=${{ matrix.gpu-arch-version }}

./.github/unittest.sh
./.github/scripts/unittest.sh

onnx:
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
repository: pytorch/vision
script: |
set -euo pipefail

export PYTHON_VERSION=3.8
export GPU_ARCH_TYPE=cpu

./.github/scripts/setup-env.sh

# Prepare conda
CONDA_PATH=$(which conda)
eval "$(${CONDA_PATH} shell.bash hook)"
conda activate ci

echo '::group::Install ONNX'
pip install --progress-bar=off onnx onnxruntime
echo '::endgroup::'

echo '::group::Install testing utilities'
pip install --progress-bar=off pytest
echo '::endgroup::'

echo '::group::Run ONNX tests'
pytest --durations=25 -v test/test_onnx.py
echo '::endgroup::'

unittests-extended:
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
repository: pytorch/vision
script: |
set -euo pipefail

export PYTHON_VERSION=3.8
export GPU_ARCH_TYPE=cpu

./.github/scripts/setup-env.sh

# Prepare conda
CONDA_PATH=$(which conda)
eval "$(${CONDA_PATH} shell.bash hook)"
conda activate ci

echo '::group::Pre-download model weights'
pip install --progress-bar=off aiohttp aiofiles tqdm
python scripts/download_model_urls.py
echo '::endgroup::'

echo '::group::Install testing utilities'
pip install --progress-bar=off pytest
echo '::endgroup::'

echo '::group::Run extended unittests'
export PYTORCH_TEST_WITH_EXTENDED=1
pytest --durations=25 -v test/test_extended_*.py
echo '::endgroup::'
8 changes: 5 additions & 3 deletions .github/workflows/test-macos.yml
@@ -1,4 +1,4 @@
name: Unit-tests on macOS
name: Tests on macOS

on:
pull_request:
@@ -10,7 +10,7 @@ on:
workflow_dispatch:

jobs:
tests:
unittests:
strategy:
matrix:
python-version:
@@ -31,7 +31,9 @@ jobs:
timeout: 240
runner: ${{ matrix.runner }}
script: |
set -euo pipefail

export PYTHON_VERSION=${{ matrix.python-version }}
export GPU_ARCH_TYPE=cpu

./.github/unittest.sh
./.github/scripts/unittest.sh
41 changes: 41 additions & 0 deletions scripts/download_model_urls.py
@@ -0,0 +1,41 @@
import asyncio
Collaborator Author

This file is an implementation of

vision/.circleci/config.yml

Lines 171 to 189 in 5850f37

download_model_weights:
  parameters:
    extract_roots:
      type: string
      default: "torchvision/models"
    background:
      type: boolean
      default: true
  steps:
    - apt_install:
        args: parallel wget
        descr: Install download utilitites
    - run:
        name: Download model weights
        background: << parameters.background >>
        command: |
          mkdir -p ~/.cache/torch/hub/checkpoints
          python scripts/collect_model_urls.py << parameters.extract_roots >> \
            | parallel -j0 'wget --no-verbose -O ~/.cache/torch/hub/checkpoints/`basename {}` {}\?source=ci'

in Python. The old version relied on wget and parallel installed through apt, but they are not available through conda.

Collaborator Author

One difference is that this PR uses async downloads, while the old version used multiprocessing. It seems async is roughly 5x slower.

I'll try multiprocessing and see whether that actually is the root cause or whether the slowdown just comes from the environment change between CircleCI and GHA.
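For reference, a thread-pool variant along the lines of what was tried might look like the sketch below. This is my own illustration, not the actual commit; the function names, worker count, and chunk size are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import urlopen


def download(download_root, url):
    # Derive the file name from the URL path, mirroring the async version
    file_name = Path(urlsplit(url).path).name
    with urlopen(f"{url}?source=ci") as response, open(download_root / file_name, "wb") as f:
        # Stream in 1 MiB chunks so large checkpoints don't have to fit in memory
        while chunk := response.read(1 << 20):
            f.write(chunk)


def download_all(download_root, urls, max_workers=16):
    download_root.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Consuming the iterator re-raises any exception from the worker threads
        list(executor.map(lambda url: download(download_root, url), urls))
```

Streaming to disk in bounded chunks is the main difference from buffering whole responses, which is one plausible source of a MemoryError when many large checkpoints download at once.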

Collaborator Author

I've tried multiprocessing with threads in 5d6f391. The run aborted with a MemoryError. The logs show, though, that it also took over 5 minutes: https://github.com/pytorch/vision/actions/runs/4364016074/jobs/7630816354#step:10:894

Thus, I would go with the async solution, since that worked. I'm no expert in async / multiprocessing though, so if someone sees possible perf improvements for either implementation, feel free to suggest them.
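One knob that might help (my suggestion, not something benchmarked in this PR) is capping the number of concurrent downloads instead of launching all of them at once, e.g. with an asyncio.Semaphore. The helper name and the limit of 16 below are assumptions:

```python
import asyncio


async def bounded_gather(coros, limit=16):
    # Allow at most `limit` of the given coroutines to run at the same time
    semaphore = asyncio.Semaphore(limit)

    async def bounded(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*[bounded(coro) for coro in coros])
```

In the script, the wrapped coroutines could presumably be passed to tqdm.gather the same way, keeping the progress bar intact.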

Collaborator Author

I've tried the solution with wget and parallel on GHA, and it seems it really is the environment that is causing the slowdown:

$ time python scripts/collect_model_urls.py torchvision/models/ | parallel -j0 'wget --no-verbose -O foo/`basename {}` {}\?source=ci'
[...]
real    5m0.152s
user    0m49.044s
sys     1m10.467s

Meaning, I'm totally fine using the async solution.

import sys
from pathlib import Path
from time import perf_counter
from urllib.parse import urlsplit

import aiofiles
import aiohttp
from torchvision import models
from tqdm.asyncio import tqdm


async def main(download_root):
    download_root.mkdir(parents=True, exist_ok=True)
    urls = {weight.url for name in models.list_models() for weight in iter(models.get_model_weights(name))}

    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=None)) as session:
        await tqdm.gather(*[download(download_root, session, url) for url in urls])


async def download(download_root, session, url):
    response = await session.get(url, params=dict(source="ci"))

    assert response.ok

    file_name = Path(urlsplit(url).path).name
    async with aiofiles.open(download_root / file_name, "wb") as f:
        async for data in response.content.iter_any():
            await f.write(data)


if __name__ == "__main__":
    download_root = (
        (Path(sys.argv[1]) if len(sys.argv) > 1 else Path("~/.cache/torch/hub/checkpoints")).expanduser().resolve()
    )
    print(f"Downloading model weights to {download_root}")
    start = perf_counter()
    asyncio.get_event_loop().run_until_complete(main(download_root))
    stop = perf_counter()
    minutes, seconds = divmod(stop - start, 60)
    print(f"Download took {minutes:2.0f}m {seconds:2.0f}s")