Conversation
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
.github/workflows/gpu_ci.yaml
Outdated
| push: | ||
| branches: | ||
| - main | ||
| - gpu_ci # TESTING |
There was a problem hiding this comment.
Will be removed before merging the PR.
There was a problem hiding this comment.
Pull Request Overview
This PR initializes GPU CI infrastructure on Anyscale for the SkyRL project. It sets up automated testing on 4xL4 GPU instances to run GPU-specific tests with proper hardware configuration.
- Creates a GitHub Actions workflow for GPU CI that triggers on main branch pushes and manual dispatch
- Adds Anyscale job configuration and execution scripts for running GPU tests in the cloud
- Implements a Ray initialization fixture with NCCL configuration optimized for L4 GPU instances
- Ports one existing GPU test (
test_models.py) to run in the new CI environment
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
.github/workflows/gpu_ci.yaml |
GitHub Actions workflow definition for GPU CI |
skyrl-train/ci/anyscale_gpu_ci.yaml |
Anyscale job configuration for GPU test execution |
skyrl-train/ci/gpu_ci_run.sh |
Shell script to execute GPU tests with proper environment setup |
skyrl-train/tests/gpu/gpu_ci/conftest.py |
Pytest fixture for Ray initialization with L4-optimized NCCL settings |
skyrl-train/tests/gpu/gpu_ci/test_models.py |
Updated test function to use the new Ray initialization fixture |
Comments suppressed due to low confidence (1)
skyrl-train/ci/anyscale_gpu_ci.yaml:1
- The job name 'my-job' is generic and unclear. Consider using a more descriptive name like 'skyrl-gpu-ci' or 'gpu-tests' to better reflect its purpose.
name: my-job
.github/workflows/gpu_ci.yaml
Outdated
| - main | ||
| - gpu_ci # TESTING |
There was a problem hiding this comment.
The 'gpu_ci' branch trigger appears to be for testing purposes only. Consider removing this temporary branch trigger before merging to production to avoid unintended CI runs.
| - main | |
| - gpu_ci # TESTING | |
| - main |
| @@ -0,0 +1,2 @@ | |||
| export CI=true | |||
There was a problem hiding this comment.
The script lacks a shebang line (#!/bin/bash) at the beginning. Adding this would make the script more portable and explicit about the shell interpreter to use.
|
|
||
|
|
||
| def test_actor_model_fwd_with_sequence_parallelism(): | ||
| def test_actor_model_fwd_with_sequence_parallelism(ray_init_fixture): |
There was a problem hiding this comment.
In the future, will we move all GPU tests to take this ray_init_fixture? If so, do we want to run all GPU tests with NCCL_P2P_DISABLE and NCCL_SHM_DISABLE? I guess we need this for the L4s running CI, but setting these flags could miss some issues that we would otherwise catch while running tests in a non-CI environment.
Can we set these envvars only for CI, but otherwise leave them unset when we run GPU tests manually / locally?
There was a problem hiding this comment.
This is a good point!
One of the main purposes of the GPU CI is to catch all issues early for developers, and trying to make sure that we don't have to run E2E tests for every single PR.
With this goal, it would indeed be bad if there's some issue we couldn't find because of the flags, and the only way to see it is if you run an E2E test.
I've now guarded the flags with the standard CI env var. If there's a corner case where things break in CI but work locally due to this, we should be able to replicate CI setup with this CI=true flag.
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
# What does this PR do? FIrst PR in enabling GPU CI. This PR adds a `SkyRL-GPU` workflow that will run GPU CI on Anyscale. We will basic unit and integration tests running on a 4xL4 instance in this workflow. Currently, the CI will run post-merge on `main` . For pull requests, it requires manual trigger by a maintainer. (so it should not run on every commit even for maintainers, since that can be wasteful). In this PR, I've only ported over one of our existing gpu tests - `test_models.py` to run on CI . The plan is to gradually add more and more tests after we've made them more efficient and compatible to be run on a 4xL4 node. --------- Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

What does this PR do?
FIrst PR in enabling GPU CI.
This PR adds a
SkyRL-GPUworkflow that will run GPU CI on Anyscale. We will basic unit and integration tests running on a 4xL4 instance in this workflow. Currently, the CI will run post-merge onmain. For pull requests, it requires manual trigger by a maintainer. (so it should not run on every commit even for maintainers, since that can be wasteful).In this PR, I've only ported over one of our existing gpu tests -
test_models.pyto run on CI . The plan is to gradually add more and more tests after we've made them more efficient and compatible to be run on a 4xL4 node.