[GPU CI 1/N] Init GPU CI on Anyscale by SumanthRH · Pull Request #102 · NovaSky-AI/SkyRL

SumanthRH · 2025-07-21T01:07:28Z

What does this PR do?

FIrst PR in enabling GPU CI.

This PR adds a SkyRL-GPU workflow that will run GPU CI on Anyscale. We will basic unit and integration tests running on a 4xL4 instance in this workflow. Currently, the CI will run post-merge on main . For pull requests, it requires manual trigger by a maintainer. (so it should not run on every commit even for maintainers, since that can be wasteful).

In this PR, I've only ported over one of our existing gpu tests - test_models.py to run on CI . The plan is to gradually add more and more tests after we've made them more efficient and compatible to be run on a 4xL4 node.

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

SumanthRH · 2025-07-21T01:07:47Z

.github/workflows/gpu_ci.yaml

+  push: 
+    branches: 
+      - main 
+      - gpu_ci # TESTING


Will be removed before merging the PR.

Copilot

Pull Request Overview

This PR initializes GPU CI infrastructure on Anyscale for the SkyRL project. It sets up automated testing on 4xL4 GPU instances to run GPU-specific tests with proper hardware configuration.

Creates a GitHub Actions workflow for GPU CI that triggers on main branch pushes and manual dispatch
Adds Anyscale job configuration and execution scripts for running GPU tests in the cloud
Implements a Ray initialization fixture with NCCL configuration optimized for L4 GPU instances
Ports one existing GPU test (test_models.py) to run in the new CI environment

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`.github/workflows/gpu_ci.yaml`	GitHub Actions workflow definition for GPU CI
`skyrl-train/ci/anyscale_gpu_ci.yaml`	Anyscale job configuration for GPU test execution
`skyrl-train/ci/gpu_ci_run.sh`	Shell script to execute GPU tests with proper environment setup
`skyrl-train/tests/gpu/gpu_ci/conftest.py`	Pytest fixture for Ray initialization with L4-optimized NCCL settings
`skyrl-train/tests/gpu/gpu_ci/test_models.py`	Updated test function to use the new Ray initialization fixture

Comments suppressed due to low confidence (1)

skyrl-train/ci/anyscale_gpu_ci.yaml:1

The job name 'my-job' is generic and unclear. Consider using a more descriptive name like 'skyrl-gpu-ci' or 'gpu-tests' to better reflect its purpose.

name: my-job

Copilot · 2025-07-21T01:08:42Z

.github/workflows/gpu_ci.yaml

+      - main 
+      - gpu_ci # TESTING


The 'gpu_ci' branch trigger appears to be for testing purposes only. Consider removing this temporary branch trigger before merging to production to avoid unintended CI runs.

Suggested change

- main

- gpu_ci # TESTING

- main

Yep, see: #102 (comment)

Copilot · 2025-07-21T01:08:42Z

skyrl-train/ci/gpu_ci_run.sh

@@ -0,0 +1,2 @@
+export CI=true


The script lacks a shebang line (#!/bin/bash) at the beginning. Adding this would make the script more portable and explicit about the shell interpreter to use.

skyrl-train/ci/anyscale_gpu_ci.yaml

tyler-griggs · 2025-07-21T17:57:34Z

skyrl-train/tests/gpu/gpu_ci/test_models.py



-def test_actor_model_fwd_with_sequence_parallelism():
+def test_actor_model_fwd_with_sequence_parallelism(ray_init_fixture):


In the future, will we move all GPU tests to take this ray_init_fixture? If so, do we want to run all GPU tests with NCCL_P2P_DISABLE and NCCL_SHM_DISABLE? I guess we need this for the L4s running CI, but setting these flags could miss some issues that we would otherwise catch while running tests in a non-CI environment.

Can we set these envvars only for CI, but otherwise leave them unset when we run GPU tests manually / locally?

This is a good point!

One of the main purposes of the GPU CI is to catch all issues early for developers, and trying to make sure that we don't have to run E2E tests for every single PR.

With this goal, it would indeed be bad if there's some issue we couldn't find because of the flags, and the only way to see it is if you run an E2E test.

I've now guarded the flags with the standard CI env var. If there's a corner case where things break in CI but work locally due to this, we should be able to replicate CI setup with this CI=true flag.

GPU tests passing in CI and the P2P env vars are disabled:

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

# What does this PR do? FIrst PR in enabling GPU CI. This PR adds a `SkyRL-GPU` workflow that will run GPU CI on Anyscale. We will basic unit and integration tests running on a 4xL4 instance in this workflow. Currently, the CI will run post-merge on `main` . For pull requests, it requires manual trigger by a maintainer. (so it should not run on every commit even for maintainers, since that can be wasteful). In this PR, I've only ported over one of our existing gpu tests - `test_models.py` to run on CI . The plan is to gradually add more and more tests after we've made them more efficient and compatible to be run on a 4xL4 node. --------- Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

SumanthRH added 5 commits July 21, 2025 00:45

init gpu ci

200a380

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

minor

7b2cd11

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

x

bc16c42

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

switch back to workflow dispatch

7065b36

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

x

15ed385

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

SumanthRH commented Jul 21, 2025

View reviewed changes

.github/workflows/gpu_ci.yaml Outdated

push:

branches:

- main

- gpu_ci # TESTING

Copy link

Member Author

SumanthRH Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be removed before merging the PR.

SumanthRH requested review from Copilot and tyler-griggs July 21, 2025 01:08

Copilot AI reviewed Jul 21, 2025

View reviewed changes

tyler-griggs approved these changes Jul 21, 2025

View reviewed changes

SumanthRH added 3 commits July 21, 2025 18:28

guard nccl disable for ci env

376c896

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

name

09190db

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

remove gpu_ci from workflows yml

bb71037

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

SumanthRH merged commit 8a92b43 into main Jul 21, 2025
3 checks passed

SumanthRH deleted the gpu_ci branch July 23, 2025 08:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[GPU CI 1/N] Init GPU CI on Anyscale#102

[GPU CI 1/N] Init GPU CI on Anyscale#102
SumanthRH merged 8 commits intomainfrom
gpu_ci

SumanthRH commented Jul 21, 2025

Uh oh!

SumanthRH Jul 21, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jul 21, 2025

Uh oh!

SumanthRH Jul 21, 2025

Uh oh!

Copilot AI Jul 21, 2025

Uh oh!

Uh oh!

tyler-griggs Jul 21, 2025

Uh oh!

SumanthRH Jul 21, 2025

Uh oh!

SumanthRH Jul 21, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants



		def test_actor_model_fwd_with_sequence_parallelism():
		def test_actor_model_fwd_with_sequence_parallelism(ray_init_fixture):

Conversation

SumanthRH commented Jul 21, 2025

What does this PR do?

Uh oh!

SumanthRH Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

SumanthRH Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tyler-griggs Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

SumanthRH Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

SumanthRH Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants