Skip to content

[GPU CI 1/N] Init GPU CI on Anyscale#102

Merged
SumanthRH merged 8 commits intomainfrom
gpu_ci
Jul 21, 2025
Merged

[GPU CI 1/N] Init GPU CI on Anyscale#102
SumanthRH merged 8 commits intomainfrom
gpu_ci

Conversation

@SumanthRH
Copy link
Member

What does this PR do?

FIrst PR in enabling GPU CI.

This PR adds a SkyRL-GPU workflow that will run GPU CI on Anyscale. We will basic unit and integration tests running on a 4xL4 instance in this workflow. Currently, the CI will run post-merge on main . For pull requests, it requires manual trigger by a maintainer. (so it should not run on every commit even for maintainers, since that can be wasteful).

In this PR, I've only ported over one of our existing gpu tests - test_models.py to run on CI . The plan is to gradually add more and more tests after we've made them more efficient and compatible to be run on a 4xL4 node.

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
push:
branches:
- main
- gpu_ci # TESTING
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be removed before merging the PR.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR initializes GPU CI infrastructure on Anyscale for the SkyRL project. It sets up automated testing on 4xL4 GPU instances to run GPU-specific tests with proper hardware configuration.

  • Creates a GitHub Actions workflow for GPU CI that triggers on main branch pushes and manual dispatch
  • Adds Anyscale job configuration and execution scripts for running GPU tests in the cloud
  • Implements a Ray initialization fixture with NCCL configuration optimized for L4 GPU instances
  • Ports one existing GPU test (test_models.py) to run in the new CI environment

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
.github/workflows/gpu_ci.yaml GitHub Actions workflow definition for GPU CI
skyrl-train/ci/anyscale_gpu_ci.yaml Anyscale job configuration for GPU test execution
skyrl-train/ci/gpu_ci_run.sh Shell script to execute GPU tests with proper environment setup
skyrl-train/tests/gpu/gpu_ci/conftest.py Pytest fixture for Ray initialization with L4-optimized NCCL settings
skyrl-train/tests/gpu/gpu_ci/test_models.py Updated test function to use the new Ray initialization fixture
Comments suppressed due to low confidence (1)

skyrl-train/ci/anyscale_gpu_ci.yaml:1

  • The job name 'my-job' is generic and unclear. Consider using a more descriptive name like 'skyrl-gpu-ci' or 'gpu-tests' to better reflect its purpose.
name: my-job

Comment on lines 6 to 7
- main
- gpu_ci # TESTING
Copy link

Copilot AI Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'gpu_ci' branch trigger appears to be for testing purposes only. Consider removing this temporary branch trigger before merging to production to avoid unintended CI runs.

Suggested change
- main
- gpu_ci # TESTING
- main

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, see: #102 (comment)

@@ -0,0 +1,2 @@
export CI=true
Copy link

Copilot AI Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script lacks a shebang line (#!/bin/bash) at the beginning. Adding this would make the script more portable and explicit about the shell interpreter to use.

Copilot uses AI. Check for mistakes.


def test_actor_model_fwd_with_sequence_parallelism():
def test_actor_model_fwd_with_sequence_parallelism(ray_init_fixture):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, will we move all GPU tests to take this ray_init_fixture? If so, do we want to run all GPU tests with NCCL_P2P_DISABLE and NCCL_SHM_DISABLE? I guess we need this for the L4s running CI, but setting these flags could miss some issues that we would otherwise catch while running tests in a non-CI environment.

Can we set these envvars only for CI, but otherwise leave them unset when we run GPU tests manually / locally?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point!

One of the main purposes of the GPU CI is to catch all issues early for developers, and trying to make sure that we don't have to run E2E tests for every single PR.

With this goal, it would indeed be bad if there's some issue we couldn't find because of the flags, and the only way to see it is if you run an E2E test.

I've now guarded the flags with the standard CI env var. If there's a corner case where things break in CI but work locally due to this, we should be able to replicate CI setup with this CI=true flag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GPU tests passing in CI and the P2P env vars are disabled:

image

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH SumanthRH merged commit 8a92b43 into main Jul 21, 2025
3 checks passed
@SumanthRH SumanthRH deleted the gpu_ci branch July 23, 2025 08:25
fannie1208 pushed a commit to vinid/SkyRL that referenced this pull request Aug 19, 2025
# What does this PR do?

FIrst PR in enabling GPU CI. 

This PR adds a `SkyRL-GPU` workflow that will run GPU CI on Anyscale. We
will basic unit and integration tests running on a 4xL4 instance in this
workflow. Currently, the CI will run post-merge on `main` . For pull
requests, it requires manual trigger by a maintainer. (so it should not
run on every commit even for maintainers, since that can be wasteful).


In this PR, I've only ported over one of our existing gpu tests -
`test_models.py` to run on CI . The plan is to gradually add more and
more tests after we've made them more efficient and compatible to be run
on a 4xL4 node.

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants