feat: Add NCCL loopback tests in preflight test suite#808
feat: Add NCCL loopback tests in preflight test suite#808lalitadithya merged 14 commits intoNVIDIA:mainfrom
Conversation
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📝 WalkthroughWalkthroughAdds a new NCCL loopback preflight check: CI workflow matrix entries, a Helm initContainer entry, a multi-stage Docker image and Makefile, and a new Go module implementing env-driven config, an NCCL benchmark runner, health-event reporter, and unit tests. Changes
Sequence DiagramsequenceDiagram
participant M as Main
participant C as Config
participant GPU as nvidia-smi
participant B as Benchmark Runner
participant ER as all_reduce_perf
participant H as Health Reporter
participant GS as gRPC Socket
M->>C: FromEnv(ctx)
activate C
C->>GPU: detect GPU count (30s timeout)
GPU-->>C: GPU count
C-->>M: Config
deactivate C
M->>B: NewRunner(binaryPath)
B-->>M: Runner
M->>B: Run(ctx, numGPUs, testSizeMB)
activate B
B->>ER: execute all_reduce_perf (with timeout)
ER-->>B: stdout/stderr
B->>B: parseOutput() -> Result
B-->>M: Result
deactivate B
M->>M: compare bandwidth vs threshold
alt degraded or error
M->>H: SendEvent(isHealthy=false,isFatal=true)
activate H
H->>GS: HealthEventOccurredV1 (with retries)
GS-->>H: response
deactivate H
else healthy
M->>H: SendEvent(isHealthy=true,isFatal=false)
activate H
H->>GS: HealthEventOccurredV1
GS-->>H: response
deactivate H
end
M->>M: exit with appropriate code
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 109-119: The init container definition for preflight-nccl-loopback
is missing required env vars, causing config.FromEnv() to fail; update the
preflight-nccl-loopback container (name: preflight-nccl-loopback) to add
environment variables NODE_NAME and PLATFORM_CONNECTOR_SOCKET (sourcing
appropriate values or Kubernetes fieldRef/secret as needed) so the
config.FromEnv() call can read them at runtime; ensure the new env entries match
the names exactly (NODE_NAME, PLATFORM_CONNECTOR_SOCKET) and use the same
mounting/secret references used elsewhere in the chart.
In `@preflight-checks/nccl-loopback/Dockerfile`:
- Around line 29-31: ARG BUILD_DATE is using a shell substitution that won't be
executed at build time; change the Dockerfile's ARG BUILD_DATE to a static
placeholder (e.g. empty or "none") instead of "$(date ...)" and update your
CI/build invocation to pass the real timestamp via --build-arg BUILD_DATE=$(date
-u +%FT%TZ); reference the ARG lines (ARG VERSION, ARG GIT_COMMIT, ARG
BUILD_DATE) in the Dockerfile and ensure the pipeline sets BUILD_DATE when
calling docker build.
- Around line 78-85: The Dockerfile hardcodes NVCC_GENCODE including compute_100
which breaks builds on older CUDA/PyTorch; refactor the RUN that builds
nccl-tests to use a build ARG (e.g., ARG NVCC_GENCODE with a conservative
default that omits compute_100) and replace the literal NVCC_GENCODE value with
${NVCC_GENCODE} in the make invocation, and optionally add an ARG CUDA_VERSION
(or detect CUDA) to conditionally append compute_100 when CUDA >= 12.8; update
NCCL_TESTS_VERSION usage remains the same.
🧹 Nitpick comments (10)
preflight-checks/nccl-loopback/pkg/config/config.go (2)
15-15: Add package-level documentation.The
configpackage is missing a package-level godoc comment. As per coding guidelines, package-level godoc is required for all Go packages.📝 Proposed fix
+// Package config provides configuration loading and validation for the NCCL loopback preflight check. package config
211-215: Avoid shadowing thectxparameter.The function shadows the incoming
ctxparameter by reassigning it on line 214. While functionally correct here, this pattern can cause confusion in larger functions where readers might expect the original context to still be in scope.♻️ Proposed fix
func detectGPUCount(ctx context.Context) (int, error) { slog.Debug("Detecting GPU count using nvidia-smi") - ctx, cancel := context.WithTimeout(ctx, 30*time.Second) + timeoutCtx, cancel := context.WithTimeout(ctx, 30*time.Second) defer cancel() - cmd := exec.CommandContext(ctx, "nvidia-smi", "--query-gpu=name", "--format=csv,noheader") + cmd := exec.CommandContext(timeoutCtx, "nvidia-smi", "--query-gpu=name", "--format=csv,noheader") var stdout bytes.Buffer cmd.Stdout = &stdout if err := cmd.Run(); err != nil { - if ctx.Err() == context.DeadlineExceeded { + if timeoutCtx.Err() == context.DeadlineExceeded { slog.Error("nvidia-smi timed out", "timeout", "30s") return 0, fmt.Errorf("nvidia-smi timed out after 30s") }preflight-checks/nccl-loopback/pkg/health/reporter.go (3)
15-15: Add package-level documentation.The
healthpackage is missing a package-level godoc comment. As per coding guidelines, package-level godoc is required for all Go packages.📝 Proposed fix
+// Package health provides health event reporting for the NCCL loopback preflight check. package health
47-56: Add function documentation for exported constructor.The exported
NewReporterfunction should have a godoc comment describing its purpose and parameters. As per coding guidelines, function comments are required for all exported Go functions.📝 Proposed fix
+// NewReporter creates a new health event reporter. +// socketPath is the Unix socket path for the Platform Connector (with or without "unix://" prefix). +// nodeName is the Kubernetes node name to include in health events. +// strategy determines how downstream modules handle the event. func NewReporter(socketPath, nodeName string, strategy pb.ProcessingStrategy) *Reporter {
58-98: Add function documentation for exported method.The exported
SendEventmethod should have a godoc comment. As per coding guidelines, function comments are required for all exported Go functions.📝 Proposed fix
+// SendEvent sends a health event to the Platform Connector with retry logic. +// isHealthy indicates the health status, isFatal marks critical failures, +// message provides human-readable details, and errorCode is an optional error identifier. func (r *Reporter) SendEvent(ctx context.Context, isHealthy, isFatal bool, message string, errorCode string) error {preflight-checks/nccl-loopback/pkg/benchmark/benchmark.go (1)
15-15: Add package-level documentation.The
benchmarkpackage is missing a package-level godoc comment. As per coding guidelines, package-level godoc is required for all Go packages.📝 Proposed fix
+// Package benchmark provides NCCL all-reduce benchmark execution and result parsing. package benchmarkpreflight-checks/nccl-loopback/pkg/benchmark/benchmark_test.go (1)
103-136: Consider consolidating error cases into a table-driven test.The error test cases (WrongSize, NoGPUs, MalformedData) share a similar structure. As per coding guidelines, table-driven tests are preferred when testing multiple scenarios.
♻️ Proposed table-driven approach
+func TestParseOutput_ErrorCases(t *testing.T) { + tests := []struct { + name string + output string + testSizeMB int + wantErr string + }{ + { + name: "WrongSize", + output: sampleNCCLOutput, + testSizeMB: 128, // Looking for 128MB, but output has 256MB + wantErr: "could not find results", + }, + { + name: "NoGPUs", + output: `# Some header +# No GPU rank lines here + 268435456 67108864 float sum -1 2374.1 113.07 197.87 0 +`, + testSizeMB: 256, + wantErr: "could not determine number of GPUs", + }, + { + name: "MalformedData", + output: `# Using devices +# Rank 0 Group 0 Pid 12 on nccl-test-1 device 0 [0001:00:00] NVIDIA A100 + not_a_number 67108864 float sum -1 2374.1 113.07 197.87 0 +`, + testSizeMB: 256, + wantErr: "could not find results", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + _, err := parseOutput(tt.output, tt.testSizeMB) + if err == nil { + t.Errorf("expected error containing %q, got nil", tt.wantErr) + return + } + t.Logf("Got expected error: %v", err) + }) + } +}preflight-checks/nccl-loopback/main.go (2)
15-26: Missing package-level godoc comment.As per coding guidelines, package-level godoc is required for all Go packages. Add a comment describing the purpose of this package before the
package maindeclaration.📝 Suggested package documentation
+// Package main implements the NCCL loopback preflight check binary. +// It validates GPU interconnect bandwidth by running NCCL all-reduce benchmarks +// and reports health status to the platform connector. package main
56-63: Consider adding a context timeout for the benchmark execution.The
context.Background()has no timeout, which means the benchmark could run indefinitely if the NCCL test binary hangs. While Kubernetes init container timeouts provide a safety net, adding an explicit timeout improves observability and fail-fast behavior.🛡️ Optional: Add context timeout
func run() int { - ctx := context.Background() + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute) + defer cancel()You'll need to add
"time"to the imports.distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)
110-111: Consider adding securityContext and resource requests.Unlike the main preflight container (lines 31-38, 45-51), the init container lacks
securityContextandresourcesdefinitions. While the NCCL test requires elevated privileges for GPU access, explicitly defining the security context improves clarity and allows for controlled privilege escalation.📝 Suggested additions (similar to dcgm-diag pattern)
- name: preflight-nccl-loopback image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:latest # securityContext: # privileged: true # Required for GPU device access # resources: # requests: # cpu: 100m # memory: 256Mi # limits: # nvidia.com/gpu: 8 # Request all GPUs for loopback test
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@preflight-checks/nccl-loopback/main.go`:
- Line 15: Add a package-level godoc comment immediately above the `package
main` declaration that succinctly describes the purpose and responsibilities of
this binary (e.g., what the nccl-loopback preflight check does, intended usage,
and any important invariants); ensure the comment is a full sentence starting
with the package name `main` to satisfy Go doc conventions.
- Around line 53-54: The run function currently uses context.Background()
without signal handling; change it to a cancellable context that listens for
SIGINT and SIGTERM so runner.Run can be stopped gracefully. Import "os/signal"
and "syscall", create a context via signal.NotifyContext (or create a
context.WithCancel and a goroutine that calls cancel on signals), defer the
cancel, and pass this ctx into runner.Run (and any other downstream calls) so
the external benchmark process can be interrupted cleanly on OS signals.
🧹 Nitpick comments (2)
preflight-checks/nccl-loopback/main.go (2)
78-142: Consider extracting the repeatedSendEvent+ error-handling pattern.The same
SendEvent→ check error → log → returnexitSendEventErrorblock appears three times (lines 81–89, 111–119, 134–142). A small helper would reduce duplication and ensure consistent handling if the pattern evolves.Example helper
func sendEvent(ctx context.Context, reporter *health.Reporter, isHealthy, isFatal bool, message, code string) error { if err := reporter.SendEvent(ctx, isHealthy, isFatal, message, code); err != nil { slog.Error("Failed to send health event", "error", err) return err } return nil }Then each call site becomes:
if sendEvent(ctx, reporter, false, true, msg, "NCCL_BW_DEGRADED") != nil { return exitSendEventError } return exitTestFailed
134-138: Passing empty string""as event code for the success path.On the failure paths, meaningful codes like
"NCCL_TEST_ERROR"and"NCCL_BW_DEGRADED"are provided. The success path passes"". Consider using an explicit success code (e.g.,"NCCL_BW_OK") for consistent event taxonomy and easier downstream filtering/alerting.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
|
Do we also have a way to run this check without a BW test? I.e., I just want to see if the NCCL loopback passes and I don't care about bandwidth? |
A simple way to do this would be keep threshold very low. Are you thinking of some feature flag? |
I thought of some flag like |
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 113-124: Fix the typo and clarify the SKIP_BANDWIDTH_CHECK docs:
correct "GPU-interconnnect" to "GPU-interconnect" in the comment near
BW_THRESHOLD_GBPS/TEST_SIZE_MB, update the SKIP_BANDWIDTH_CHECK commented block
to document the exact truthy values accepted by strconv.ParseBool (accepted true
values: "1", "t", "T", "true", "TRUE", "True"), and remove the trailing
whitespace at the end of the SKIP_BANDWIDTH_CHECK comment line.
🧹 Nitpick comments (5)
preflight-checks/nccl-loopback/pkg/config/config.go (2)
15-15: Missing package-level godoc comment.As per coding guidelines, "Package-level godoc required for all Go packages".
Suggested fix
+// Package config provides environment-based configuration for the NCCL loopback preflight check. package config
236-257:nvidia-smistderr is not captured — error diagnostics are lost.When
cmd.Run()fails, onlyerris available. GPU driver or device errors are often written to stderr. Capturing stderr here (as done inbenchmark.go) would improve debuggability.Proposed fix
func detectGPUCount(ctx context.Context) (int, error) { slog.Debug("Detecting GPU count using nvidia-smi") ctx, cancel := context.WithTimeout(ctx, 30*time.Second) defer cancel() cmd := exec.CommandContext(ctx, "nvidia-smi", "--query-gpu=name", "--format=csv,noheader") - var stdout bytes.Buffer + var stdout, stderr bytes.Buffer cmd.Stdout = &stdout + cmd.Stderr = &stderr if err := cmd.Run(); err != nil { if ctx.Err() == context.DeadlineExceeded { slog.Error("nvidia-smi timed out", "timeout", "30s") return 0, fmt.Errorf("nvidia-smi timed out after 30s") } - slog.Error("nvidia-smi command failed", "error", err) + slog.Error("nvidia-smi command failed", "error", err, "stderr", stderr.String()) - return 0, fmt.Errorf("nvidia-smi failed: %w", err) + return 0, fmt.Errorf("nvidia-smi failed: %w\nstderr: %s", err, stderr.String()) }preflight-checks/nccl-loopback/pkg/benchmark/benchmark.go (1)
15-15: Missing package-level godoc comment.As per coding guidelines, "Package-level godoc required for all Go packages".
Suggested fix
+// Package benchmark provides an NCCL all-reduce benchmark runner and output parser. package benchmarkpreflight-checks/nccl-loopback/main.go (1)
78-93: Health event message may contain unbounded benchmark output on failure.When
runner.Runfails, its error includes full stdout/stderr (seebenchmark.goLines 100-101). This error is embedded verbatim into the health event message at Line 85 viafmt.Sprintf("NCCL loopback test failed: %v", err). For large benchmark outputs or verbose NCCL debug logs, this could produce an oversized gRPC message.Consider truncating the error message before sending:
Proposed fix
+ errMsg := fmt.Sprintf("NCCL loopback test failed: %v", err) + if len(errMsg) > 4096 { + errMsg = errMsg[:4096] + "... (truncated)" + } + if sendErr := reporter.SendEvent(ctx, false, // isHealthy true, // isFatal - fmt.Sprintf("NCCL loopback test failed: %v", err), + errMsg, "NCCL_TEST_ERROR", ); sendErr != nil {distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)
132-134: Consider expanding the inline documentation fornetworkResourceNames.The comment mentions "RDMA, InfiniBand, etc." but doesn't explain that these are Kubernetes extended resource names advertised by device plugins (e.g., NVIDIA Network Operator). A brief note about when to modify this list (e.g., different NIC types or custom device plugin resource names) would help operators unfamiliar with the setup.
📝 Suggested documentation improvement
-# Network resource names to copy to init containers (RDMA, InfiniBand, etc.) +# Network resource names to copy to init containers (RDMA, InfiniBand, etc.) +# These are Kubernetes extended resources advertised by device plugins (e.g., NVIDIA Network Operator). +# Adjust entries to match the resource names available in your cluster. +# Example: "nvidia.com/mlnxnics" for Mellanox ConnectX NICs networkResourceNames: - "nvidia.com/mlnxnics"As per coding guidelines: "Include examples for non-obvious configurations in Helm chart documentation."
Added env var for this, updated PR with test description. |
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Line 116: Update the inline comment that currently reads "If using PCIe
GPU-interconnect, this threshold should be set to ~15 GB" to include the correct
rate unit so it reads "~15 GB/s"; edit the comment near the PCIe
GPU-interconnect threshold in values.yaml (the comment containing "If using PCIe
GPU-interconnect") to make the unit consistent with the other inline comment
("GB/s").
🧹 Nitpick comments (1)
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)
110-111: Consider documenting the minimum driver requirement for this image.The PR description notes the NCCL loopback image requires NVIDIA driver R570+. This is a non-obvious operational constraint that users could easily miss. As per coding guidelines, non-obvious configurations should include explanatory comments/examples.
Proposed addition
- name: preflight-nccl-loopback + # Requires NVIDIA driver R570+ on the host image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:latest
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Summary
This PR adds a NCCL loopback test in the preflight tests
What this test tests:
flowchart TB subgraph healthy["✅ HEALTHY NODE"] direction LR H0[GPU 0] --> H1[GPU 1] --> H2[GPU 2] --> H3[...] --> H7[GPU 7] --> H0 end healthy --> PASS["400 GB/s ✅"] PASS --> broken subgraph broken["🔴 BROKEN NODE"] direction LR B0[GPU 0] --> B1[GPU 1] -.->|"PCIe ❌"| B2[GPU 2] --> B3[...] --> B7[GPU 7] --> B0 end broken --> FAIL["< 100 GB/s 🔴"]For bus bandwidth:
For runtime image, I'm using pytorch 25.01 which supports R570+ driver versions, and compiling NCCL tests afterwards. Objective of doing so would be reduced time of subsequent preflight tests allReduce which would be using the same base image. The other option was to use nccl-tests from ECR gallery where they've EFA plugins built-ins, although this would not have been problem in case of loopback test but might not have benefited in the other CSPs.
Testing
Bandwidth across hardwares
With 256 Mb message size:
Healthy node
Unhealthy node:
Event structure:
Multiple init containers
then both initContainers are injected:
Skipping bandwidth check
With
SKIP_BANDWIDTH_CHECK: truein environment variable:Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Tests
Chores