feat: Add NCCL loopback tests in preflight test suite by XRFXLP · Pull Request #808 · NVIDIA/NVSentinel

XRFXLP · 2026-02-05T12:30:20Z

Summary

This PR adds an NCCL loopback test to the preflight test suite.

What this test checks:

flowchart TB
    subgraph healthy["✅ HEALTHY NODE"]
        direction LR
        H0[GPU 0] --> H1[GPU 1] --> H2[GPU 2] --> H3[...] --> H7[GPU 7] --> H0
    end

    healthy --> PASS["400 GB/s ✅"]
    PASS --> broken

    subgraph broken["🔴 BROKEN NODE"]
        direction LR
        B0[GPU 0] --> B1[GPU 1] -.->|"PCIe ❌"| B2[GPU 2] --> B3[...] --> B7[GPU 7] --> B0
    end

    broken --> FAIL["< 100 GB/s 🔴"]

For bus bandwidth, where N is the number of GPUs taking part in the all-reduce:

$$\text{Algorithm BW} = \frac{\text{Message Size (bytes)}}{\text{Time (seconds)}}$$

$$\text{Bus BW} = \text{Algorithm BW} \times \frac{2 \times (N - 1)}{N}$$
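
As a rough illustration of the two formulas, the Go sketch below computes both values. The function names are hypothetical and not taken from the PR, and it assumes decimal gigabytes (1 GB = 1e9 bytes), which is consistent with the sample benchmark row quoted later in the review (268435456 bytes in 2374.1 µs reported as 113.07 GB/s).

```go
package main

import "fmt"

// algoBandwidthGBps returns message size divided by elapsed time, in GB/s.
// Assumption: 1 GB = 1e9 bytes.
func algoBandwidthGBps(sizeBytes, seconds float64) float64 {
	return sizeBytes / seconds / 1e9
}

// busBandwidthGBps applies the all-reduce correction factor 2*(N-1)/N
// to an algorithm bandwidth measured across numGPUs ranks.
func busBandwidthGBps(algoBW float64, numGPUs int) float64 {
	n := float64(numGPUs)
	return algoBW * 2 * (n - 1) / n
}

func main() {
	// Values from the healthy-node run: 8 GPUs, algorithm bandwidth 111.89 GB/s.
	fmt.Printf("bus bandwidth: %.2f GB/s\n", busBandwidthGBps(111.89, 8))
}
```

With 8 GPUs the factor is 1.75, so 111.89 GB/s of algorithm bandwidth corresponds to roughly 195.8 GB/s of bus bandwidth, in line with the 195.82 GB/s in the healthy-node log.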

For the runtime image, I'm using PyTorch 25.01, which supports R570+ driver versions, and compiling nccl-tests on top of it. The goal is to reduce the startup time of the subsequent preflight all-reduce tests, which will use the same base image. The other option was to use the nccl-tests image from the ECR gallery, which has the EFA plugins built in. That would not have been a problem for the loopback test, but it might not have brought any benefit on the other CSPs.

Testing

Bandwidth across hardware

With a 256 MB message size:

CSP	Bus bandwidth (GB/s)	Hardware	GPUs
Azure	195	A100	8
OCI	347	H100	8
AWS	347	H100	8
AWS	596	GB300	4
AWS	597	GB200	4
GCP	341	H100	8

Healthy node

$ klog nginx-with-gpu-qmrlb -c preflight-nccl-loopback -n preflight-test-ns | jq
{
  "time": "2026-02-05T15:56:16.027441109Z",
  "level": "INFO",
  "msg": "Starting preflight NCCL loopback check",
  "version": "dev",
  "commit": "none",
  "date": "$(date"
}
{
  "time": "2026-02-05T15:56:16.129207959Z",
  "level": "INFO",
  "msg": "Detected GPUs",
  "count": 8
}
{
  "time": "2026-02-05T15:56:16.129263995Z",
  "level": "INFO",
  "msg": "Configuration loaded",
  "bw_threshold_gbps": 150,
  "test_size_mb": 256,
  "num_gpus": 8,
  "binary": "/opt/nccl-tests/build/all_reduce_perf",
  "node_name": "aks-gpu-xxx-vmss00000a"
}
{
  "time": "2026-02-05T15:56:16.129281377Z",
  "level": "INFO",
  "msg": "Running NCCL all-reduce benchmark",
  "binary": "/opt/nccl-tests/build/all_reduce_perf",
  "size_mb": 256,
  "num_gpus": 8,
  "timeout": 300000000000
}
{
  "time": "2026-02-05T15:56:25.15855496Z",
  "level": "INFO",
  "msg": "Benchmark completed",
  "bus_bandwidth_gbps": 195.82,
  "algo_bandwidth_gbps": 111.89,
  "num_gpus": 8,
  "test_size_bytes": 268435456
}
{
  "time": "2026-02-05T15:56:25.158607218Z",
  "level": "INFO",
  "msg": "NCCL loopback test PASSED",
  "measured_gbps": 195.82,
  "threshold_gbps": 150
}
{
  "time": "2026-02-05T15:56:25.158631153Z",
  "level": "INFO",
  "msg": "Sending health event",
  "is_healthy": true,
  "is_fatal": false,
  "message": "NCCL all-reduce bus bandwidth 195.82 GB/s meets threshold 150.00 GB/s",
  "error_code": "",
  "recommended_action": "NONE"
}
{
  "time": "2026-02-05T15:56:25.231322728Z",
  "level": "INFO",
  "msg": "Health event sent successfully"
}

Unhealthy node:

$ klog nginx-with-gpu-wflt5 -c preflight-nccl-loopback -n preflight-test-ns | jq
{
  "time": "2026-02-05T16:17:20.485671689Z",
  "level": "INFO",
  "msg": "Starting preflight NCCL loopback check",
  "version": "dev",
  "commit": "none",
  "date": "$(date"
}
{
  "time": "2026-02-05T16:17:20.615041266Z",
  "level": "INFO",
  "msg": "Detected GPUs",
  "count": 8
}
{
  "time": "2026-02-05T16:17:20.615097Z",
  "level": "INFO",
  "msg": "Configuration loaded",
  "bw_threshold_gbps": 150,
  "test_size_mb": 256,
  "num_gpus": 8,
  "binary": "/opt/nccl-tests/build/all_reduce_perf",
  "node_name": "aks-gpu-xxx-vmss00000j"
}
{
  "time": "2026-02-05T16:17:20.615110976Z",
  "level": "INFO",
  "msg": "Running NCCL all-reduce benchmark",
  "binary": "/opt/nccl-tests/build/all_reduce_perf",
  "size_mb": 256,
  "num_gpus": 8,
  "timeout": 300000000000
}
{
  "time": "2026-02-05T16:17:20.972182522Z",
  "level": "ERROR",
  "msg": "NCCL benchmark failed",
  "error": "exit status 2",
  "stdout": "# nThread 1 nGpus 8 minBytes 268435456 maxBytes 268435456 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0\n#\n# Using devices\nnginx-with-gpu-wflt5: Test CUDA failure common.cu:941 'system not yet initialized'\n .. nginx-with-gpu-wflt5 pid 15: Test failure common.cu:891\n",
  "stderr": ""
}
{
  "time": "2026-02-05T16:17:20.972232405Z",
  "level": "ERROR",
  "msg": "NCCL benchmark failed",
  "error": "all_reduce_perf failed: exit status 2\nstdout: # nThread 1 nGpus 8 minBytes 268435456 maxBytes 268435456 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0\n#\n# Using devices\nnginx-with-gpu-wflt5: Test CUDA failure common.cu:941 'system not yet initialized'\n .. nginx-with-gpu-wflt5 pid 15: Test failure common.cu:891\n\nstderr: "
}
{
  "time": "2026-02-05T16:17:20.972249076Z",
  "level": "INFO",
  "msg": "Sending health event",
  "is_healthy": false,
  "is_fatal": true,
  "message": "NCCL loopback test failed: all_reduce_perf failed: exit status 2\nstdout: # nThread 1 nGpus 8 minBytes 268435456 maxBytes 268435456 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0\n#\n# Using devices\nnginx-with-gpu-wflt5: Test CUDA failure common.cu:941 'system not yet initialized'\n .. nginx-with-gpu-wflt5 pid 15: Test failure common.cu:891\n\nstderr: ",
  "error_code": "NCCL_TEST_ERROR",
  "recommended_action": "CONTACT_SUPPORT"
}
{
  "time": "2026-02-05T16:17:20.974316344Z",
  "level": "INFO",
  "msg": "Health event sent successfully"
}

Event structure:

{
  "time": "2026-02-05T15:56:25.967126485Z",
  "level": "INFO",
  "msg": "Successfully got item",
  "module": "platform-connectors",
  "version": "v0.7.1",
  "healthEvents": {
    "version": 1,
    "events": [
      {
        "version": 1,
        "agent": "preflight-nccl-loopback",
        "componentClass": "Node",
        "checkName": "NCCLLoopbackTest",
        "isHealthy": true,
        "message": "NCCL all-reduce bus bandwidth 196.85 GB/s meets threshold 150.00 GB/s",
        "metadata": {
          "node.kubernetes.io/instance-type": "Standard_ND96amsr_A100_v4",
          "nvidia.com/cuda.driver-version.full": "570.148.08",
          "nvidia.com/cuda.driver-version.major": "570",
          "nvidia.com/cuda.driver-version.minor": "148",
          "nvidia.com/cuda.driver-version.revision": "08",
          "nvidia.com/cuda.runtime-version.full": "12.8",
          "nvidia.com/cuda.runtime-version.major": "12",
          "nvidia.com/cuda.runtime-version.minor": "8",
          "nvidia.com/gpu.product": "NVIDIA-A100-SXM4-80GB",
          "providerID": "az..0",
          "topology.kubernetes.io/region": "southcentralus",
          "topology.kubernetes.io/zone": "0"
        },
        "generatedTimestamp": {
          "seconds": 1770306985,
          "nanos": 937919942
        },
        "nodeName": "aks-gpu-xxx-vmss000000",
        "processingStrategy": 1
      }
    ]
  }
}
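
To relate the `generatedTimestamp` field to the log timestamps, here is a small Go sketch that decodes a trimmed copy of the event and converts the seconds/nanos pair to UTC. The struct and function names are hypothetical; the real event is a protobuf message with more fields than are modelled here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// timestamp mirrors the "generatedTimestamp" object in the event above.
type timestamp struct {
	Seconds int64 `json:"seconds"`
	Nanos   int64 `json:"nanos"`
}

// healthEvent holds only the fields this sketch reads (hypothetical type).
type healthEvent struct {
	CheckName          string    `json:"checkName"`
	IsHealthy          bool      `json:"isHealthy"`
	NodeName           string    `json:"nodeName"`
	GeneratedTimestamp timestamp `json:"generatedTimestamp"`
}

// generatedAt converts the seconds/nanos pair to a UTC time.Time.
func generatedAt(ts timestamp) time.Time {
	return time.Unix(ts.Seconds, ts.Nanos).UTC()
}

func main() {
	raw := `{"checkName":"NCCLLoopbackTest","isHealthy":true,` +
		`"nodeName":"aks-gpu-xxx-vmss000000",` +
		`"generatedTimestamp":{"seconds":1770306985,"nanos":937919942}}`

	var ev healthEvent
	if err := json.Unmarshal([]byte(raw), &ev); err != nil {
		panic(err)
	}
	fmt.Println(ev.CheckName, ev.IsHealthy, generatedAt(ev.GeneratedTimestamp).Format(time.RFC3339Nano))
}
```

The converted value falls at 15:56:25 UTC on 2026-02-05, a few milliseconds before the platform connector logged "Successfully got item".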

Multiple init containers

$ kg cm preflight-config -n nvsentinel -o yaml | yq .data
config.yaml: |
  gpuResourceNames:
    - nvidia.com/gpu
  networkResourceNames:
    - nvidia.com/mlnxnics
  dcgm:
    hostengineAddr: "nvidia-dcgm.gpu-operator.svc:5555"
    diagLevel: 2
    connectorSocket: "unix:///var/run/nvsentinel.sock"
    processingStrategy: "EXECUTE_REMEDIATION"
  initContainers:
    - image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:latest
      name: preflight-dcgm-diag
      volumeMounts:
      - mountPath: /var/run
        name: nvsentinel-socket
    - env:
      - name: BW_THRESHOLD_GBPS
        value: "150"
      - name: TEST_SIZE_MB
        value: "256"
      image: docker.io/xrfxlp/preflight-nccl-loopback:16460502
      name: preflight-nccl-loopback
      volumeMounts:
      - mountPath: /var/run
        name: nvsentinel-socket

With this configuration, both init containers are injected into the pod:

$ kg po -n preflight-test-ns nginx-with-gpu-hm54r -o yaml | yq .spec.initContainers
- env:
    - name: DCGM_DIAG_LEVEL
      value: "2"
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: DCGM_HOSTENGINE_ADDR
      value: nvidia-dcgm.gpu-operator.svc:5555
    - name: PLATFORM_CONNECTOR_SOCKET
      value: unix:///var/run/nvsentinel.sock
    - name: PROCESSING_STRATEGY
      value: EXECUTE_REMEDIATION
  image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:latest
  imagePullPolicy: Always
  name: preflight-dcgm-diag
  resources:
    limits:
      nvidia.com/gpu: "8"
    requests:
      cpu: 100m
      memory: 500Mi
      nvidia.com/gpu: "8"
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
    - mountPath: /var/run
      name: nvsentinel-socket
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-rtp4l
      readOnly: true
- env:
    - name: BW_THRESHOLD_GBPS
      value: "150"
    - name: TEST_SIZE_MB
      value: "256"
  image: docker.io/xrfxlp/preflight-nccl-loopback:16460502
  imagePullPolicy: IfNotPresent
  name: preflight-nccl-loopback
  resources:
    limits:
      nvidia.com/gpu: "8"
    requests:
      cpu: 100m
      memory: 500Mi
      nvidia.com/gpu: "8"
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
    - mountPath: /var/run
      name: nvsentinel-socket
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-rtp4l
      readOnly: true
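
The two env vars on the `preflight-nccl-loopback` container are the tunables for the check. The sketch below shows one way such values could be read, falling back to 150 GB/s and 256 MB when unset. It is an assumption-laden illustration: `loopbackConfig` and `configFromEnv` are hypothetical names, and the positive-value validation is not confirmed by the PR, whose real loader is `config.FromEnv(ctx)`.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// loopbackConfig holds the two tunables shown in the ConfigMap (hypothetical type).
type loopbackConfig struct {
	BWThresholdGBps float64
	TestSizeMB      int
}

// configFromEnv reads settings through lookup (os.LookupEnv in production)
// and keeps the defaults for variables that are not set.
func configFromEnv(lookup func(string) (string, bool)) (loopbackConfig, error) {
	cfg := loopbackConfig{BWThresholdGBps: 150, TestSizeMB: 256}

	if v, ok := lookup("BW_THRESHOLD_GBPS"); ok {
		f, err := strconv.ParseFloat(v, 64)
		if err != nil || f <= 0 { // assumption: threshold must be positive
			return cfg, fmt.Errorf("invalid BW_THRESHOLD_GBPS %q", v)
		}
		cfg.BWThresholdGBps = f
	}

	if v, ok := lookup("TEST_SIZE_MB"); ok {
		n, err := strconv.Atoi(v)
		if err != nil || n <= 0 { // assumption: size must be positive
			return cfg, fmt.Errorf("invalid TEST_SIZE_MB %q", v)
		}
		cfg.TestSizeMB = n
	}

	return cfg, nil
}

func main() {
	cfg, err := configFromEnv(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		os.Exit(1)
	}
	fmt.Printf("threshold=%.0f GB/s size=%d MB\n", cfg.BWThresholdGBps, cfg.TestSizeMB)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the parser easy to exercise in unit tests without mutating the process environment.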

Skipping bandwidth check

With the environment variable SKIP_BANDWIDTH_CHECK set to true:

{
  "time": "2026-02-06T07:14:20.030558502Z",
  "level": "INFO",
  "msg": "Benchmark completed",
  "module": "nccl-loopback",
  "version": "dev",
  "bus_bandwidth_gbps": 196.11,
  "algo_bandwidth_gbps": 112.06,
  "num_gpus": 8,
  "test_size_bytes": 268435456
}
{
  "time": "2026-02-06T07:14:20.030597825Z",
  "level": "INFO",
  "msg": "NCCL loopback test PASSED (bandwidth check skipped)",
  "module": "nccl-loopback",
  "version": "dev",
  "measured_gbps": 196.11
}
{
  "time": "2026-02-06T07:14:20.030619736Z",
  "level": "INFO",
  "msg": "Sending health event",
  "module": "nccl-loopback",
  "version": "dev",
  "is_healthy": true,
  "is_fatal": false,
  "message": "NCCL all-reduce completed successfully with bus bandwidth 196.11 GB/s (threshold check skipped)",
  "error_code": "",
  "recommended_action": "NONE"
}
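
The pass/fail decision behind these logs can be sketched as follows. This is only an illustration under stated assumptions: `evaluate` is a hypothetical helper, the `>=` comparison is inferred from the "meets threshold" wording, and the text of the failure message is invented because the PR does not show a degraded-bandwidth run.

```go
package main

import "fmt"

// evaluate returns whether the node counts as healthy and the message that
// would accompany the health event.
func evaluate(busBW, threshold float64, skipCheck bool) (bool, string) {
	if skipCheck {
		// The benchmark finished, so the node passes regardless of bandwidth.
		return true, fmt.Sprintf(
			"NCCL all-reduce completed successfully with bus bandwidth %.2f GB/s (threshold check skipped)",
			busBW)
	}
	if busBW >= threshold {
		return true, fmt.Sprintf(
			"NCCL all-reduce bus bandwidth %.2f GB/s meets threshold %.2f GB/s",
			busBW, threshold)
	}
	// Assumption: this wording is illustrative only.
	return false, fmt.Sprintf(
		"NCCL all-reduce bus bandwidth %.2f GB/s is below threshold %.2f GB/s",
		busBW, threshold)
}

func main() {
	for _, skip := range []bool{false, true} {
		healthy, msg := evaluate(196.11, 150, skip)
		fmt.Println(healthy, msg)
	}
}
```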

Testing

Tests pass locally
Manual testing completed
No breaking changes (or documented)

Checklist

Self-review completed
Documentation updated (if needed)
Ready for review

Summary by CodeRabbit

New Features
- Added an NCCL loopback preflight check with a default 150 GBps threshold and 256 MB test, plus an init container for deployment.
Tests
- CI matrices updated to include NCCL loopback validation; added unit tests for benchmark output parsing.
Chores
- Added build, packaging, image publishing, and runtime tooling for the NCCL loopback check.

coderabbitai · 2026-02-05T12:30:57Z

Walkthrough

Adds a new NCCL loopback preflight check: CI workflow matrix entries, a Helm initContainer entry, a multi-stage Docker image and Makefile, and a new Go module implementing env-driven config, an NCCL benchmark runner, health-event reporter, and unit tests.

Changes

Cohort / File(s)	Summary
CI/CD workflows `.github/workflows/lint-test.yml`, `.github/workflows/container-build-test.yml`, `.github/workflows/publish.yml`	Extended lint-test matrix to include `nccl-loopback`; added container-build-test step and publish matrix entry for `preflight-nccl-loopback` (docker-build / docker-publish targets).
Kubernetes chart `distros/kubernetes/nvsentinel/charts/preflight/values.yaml`	Added `preflight-nccl-loopback` initContainer (`ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:latest`) with env vars `BW_THRESHOLD_GBPS=150`, `TEST_SIZE_MB=256`, and a `nvsentinel-socket` volumeMount.
Container image & build `preflight-checks/nccl-loopback/Dockerfile`, `preflight-checks/nccl-loopback/Makefile`	New multi-stage Dockerfile (Go build stage + PyTorch runtime building `nccl-tests`) and Makefile with Go-module flags, Docker targets, and CI targets (lint/test/docker-build/docker-publish).
Go module & deps `preflight-checks/nccl-loopback/go.mod`	New Go module declaration with dependencies (gRPC, protobuf, k8s libs, etc.) and local `replace` directives linking `commons` and `data-models`.
Application entrypoint `preflight-checks/nccl-loopback/main.go`	New main program: JSON logging, env-based config loading, benchmark execution, health-event sending over connector socket, and distinct exit codes for config/test/send outcomes.
Benchmark package & tests `preflight-checks/nccl-loopback/pkg/benchmark/...`	Added Runner and Result types to execute `all_reduce_perf` with timeout, parse GPU count and bandwidth metrics, and comprehensive unit tests covering parsing edge cases.
Configuration `preflight-checks/nccl-loopback/pkg/config/config.go`	Added Config struct and `FromEnv(ctx)` to parse/validate env vars, detect GPU count via `nvidia-smi` (30s timeout), validate NCCL test binary path, and require connector socket/node name and processing strategy.
Health reporting `preflight-checks/nccl-loopback/pkg/health/reporter.go`	Added Reporter that builds HealthEvent payloads and sends them over a UNIX gRPC socket with exponential-backoff retries and retryable-error classification.

Sequence Diagram

sequenceDiagram
    participant M as Main
    participant C as Config
    participant GPU as nvidia-smi
    participant B as Benchmark Runner
    participant ER as all_reduce_perf
    participant H as Health Reporter
    participant GS as gRPC Socket

    M->>C: FromEnv(ctx)
    activate C
    C->>GPU: detect GPU count (30s timeout)
    GPU-->>C: GPU count
    C-->>M: Config
    deactivate C

    M->>B: NewRunner(binaryPath)
    B-->>M: Runner

    M->>B: Run(ctx, numGPUs, testSizeMB)
    activate B
    B->>ER: execute all_reduce_perf (with timeout)
    ER-->>B: stdout/stderr
    B->>B: parseOutput() -> Result
    B-->>M: Result
    deactivate B

    M->>M: compare bandwidth vs threshold

    alt degraded or error
        M->>H: SendEvent(isHealthy=false,isFatal=true)
        activate H
        H->>GS: HealthEventOccurredV1 (with retries)
        GS-->>H: response
        deactivate H
    else healthy
        M->>H: SendEvent(isHealthy=true,isFatal=false)
        activate H
        H->>GS: HealthEventOccurredV1
        GS-->>H: response
        deactivate H
    end

    M->>M: exit with appropriate code

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped into builds to count each GPU,
Ran tiny all-reduce and listened for the cue,
Parsed bandwidth lines, retried through the socket,
Sent a health event — then tucked in my pocket,
Preflight loopback ready; I nibble logs anew.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: Add NCCL loopback tests in preflight test suite' accurately and concisely summarizes the main change: a new NCCL loopback preflight test feature. It is clear, specific, and directly reflects the primary objective of the PR.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 109-119: The init container definition for preflight-nccl-loopback
is missing required env vars, causing config.FromEnv() to fail; update the
preflight-nccl-loopback container (name: preflight-nccl-loopback) to add
environment variables NODE_NAME and PLATFORM_CONNECTOR_SOCKET (sourcing
appropriate values or Kubernetes fieldRef/secret as needed) so the
config.FromEnv() call can read them at runtime; ensure the new env entries match
the names exactly (NODE_NAME, PLATFORM_CONNECTOR_SOCKET) and use the same
mounting/secret references used elsewhere in the chart.

In `@preflight-checks/nccl-loopback/Dockerfile`:
- Around line 29-31: ARG BUILD_DATE is using a shell substitution that won't be
executed at build time; change the Dockerfile's ARG BUILD_DATE to a static
placeholder (e.g. empty or "none") instead of "$(date ...)" and update your
CI/build invocation to pass the real timestamp via --build-arg BUILD_DATE=$(date
-u +%FT%TZ); reference the ARG lines (ARG VERSION, ARG GIT_COMMIT, ARG
BUILD_DATE) in the Dockerfile and ensure the pipeline sets BUILD_DATE when
calling docker build.
- Around line 78-85: The Dockerfile hardcodes NVCC_GENCODE including compute_100
which breaks builds on older CUDA/PyTorch; refactor the RUN that builds
nccl-tests to use a build ARG (e.g., ARG NVCC_GENCODE with a conservative
default that omits compute_100) and replace the literal NVCC_GENCODE value with
${NVCC_GENCODE} in the make invocation, and optionally add an ARG CUDA_VERSION
(or detect CUDA) to conditionally append compute_100 when CUDA >= 12.8; the
NCCL_TESTS_VERSION usage remains the same.

🧹 Nitpick comments (10)

preflight-checks/nccl-loopback/pkg/config/config.go (2)
15-15: Add package-level documentation.

The config package is missing a package-level godoc comment. As per coding guidelines, package-level godoc is required for all Go packages.
📝 Proposed fix
+// Package config provides configuration loading and validation for the NCCL loopback preflight check.
 package config
211-215: Avoid shadowing the ctx parameter.

The function shadows the incoming ctx parameter by reassigning it on line 214. While functionally correct here, this pattern can cause confusion in larger functions where readers might expect the original context to still be in scope.
♻️ Proposed fix
 func detectGPUCount(ctx context.Context) (int, error) {
 	slog.Debug("Detecting GPU count using nvidia-smi")
 
-	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
+	timeoutCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
 	defer cancel()
 
-	cmd := exec.CommandContext(ctx, "nvidia-smi", "--query-gpu=name", "--format=csv,noheader")
+	cmd := exec.CommandContext(timeoutCtx, "nvidia-smi", "--query-gpu=name", "--format=csv,noheader")
 
 	var stdout bytes.Buffer
 
 	cmd.Stdout = &stdout
 
 	if err := cmd.Run(); err != nil {
-		if ctx.Err() == context.DeadlineExceeded {
+		if timeoutCtx.Err() == context.DeadlineExceeded {
 			slog.Error("nvidia-smi timed out", "timeout", "30s")
 			return 0, fmt.Errorf("nvidia-smi timed out after 30s")
 		}
preflight-checks/nccl-loopback/pkg/health/reporter.go (3)
15-15: Add package-level documentation.

The health package is missing a package-level godoc comment. As per coding guidelines, package-level godoc is required for all Go packages.
📝 Proposed fix
+// Package health provides health event reporting for the NCCL loopback preflight check.
 package health
47-56: Add function documentation for exported constructor.

The exported NewReporter function should have a godoc comment describing its purpose and parameters. As per coding guidelines, function comments are required for all exported Go functions.
📝 Proposed fix
+// NewReporter creates a new health event reporter.
+// socketPath is the Unix socket path for the Platform Connector (with or without "unix://" prefix).
+// nodeName is the Kubernetes node name to include in health events.
+// strategy determines how downstream modules handle the event.
 func NewReporter(socketPath, nodeName string, strategy pb.ProcessingStrategy) *Reporter {
58-98: Add function documentation for exported method.

The exported SendEvent method should have a godoc comment. As per coding guidelines, function comments are required for all exported Go functions.
📝 Proposed fix
+// SendEvent sends a health event to the Platform Connector with retry logic.
+// isHealthy indicates the health status, isFatal marks critical failures,
+// message provides human-readable details, and errorCode is an optional error identifier.
 func (r *Reporter) SendEvent(ctx context.Context, isHealthy, isFatal bool, message string, errorCode string) error {
preflight-checks/nccl-loopback/pkg/benchmark/benchmark.go (1)
15-15: Add package-level documentation.

The benchmark package is missing a package-level godoc comment. As per coding guidelines, package-level godoc is required for all Go packages.
📝 Proposed fix
+// Package benchmark provides NCCL all-reduce benchmark execution and result parsing.
 package benchmark
preflight-checks/nccl-loopback/pkg/benchmark/benchmark_test.go (1)
103-136: Consider consolidating error cases into a table-driven test.

The error test cases (WrongSize, NoGPUs, MalformedData) share a similar structure. As per coding guidelines, table-driven tests are preferred when testing multiple scenarios.
♻️ Proposed table-driven approach
+func TestParseOutput_ErrorCases(t *testing.T) {
+	tests := []struct {
+		name       string
+		output     string
+		testSizeMB int
+		wantErr    string
+	}{
+		{
+			name:       "WrongSize",
+			output:     sampleNCCLOutput,
+			testSizeMB: 128, // Looking for 128MB, but output has 256MB
+			wantErr:    "could not find results",
+		},
+		{
+			name: "NoGPUs",
+			output: `# Some header
+# No GPU rank lines here
+   268435456      67108864     float     sum      -1   2374.1  113.07  197.87      0
+`,
+			testSizeMB: 256,
+			wantErr:    "could not determine number of GPUs",
+		},
+		{
+			name: "MalformedData",
+			output: `# Using devices
+#  Rank  0 Group  0 Pid     12 on nccl-test-1 device  0 [0001:00:00] NVIDIA A100
+   not_a_number      67108864     float     sum      -1   2374.1  113.07  197.87      0
+`,
+			testSizeMB: 256,
+			wantErr:    "could not find results",
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			_, err := parseOutput(tt.output, tt.testSizeMB)
+			if err == nil {
+				t.Errorf("expected error containing %q, got nil", tt.wantErr)
+				return
+			}
+			t.Logf("Got expected error: %v", err)
+		})
+	}
+}
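
For context on what these cases exercise, here is a minimal sketch of parsing one `all_reduce_perf` data row. It is not the PR's `parseOutput`: `parseResultLine` is a hypothetical helper, and the column order (size, count, type, redop, root, time, algbw, busbw) is assumed from the sample row in the test above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseResultLine extracts the out-of-place algorithm and bus bandwidth (GB/s)
// from a data row whose first column equals wantBytes.
func parseResultLine(line string, wantBytes int64) (algoBW, busBW float64, err error) {
	fields := strings.Fields(line)
	// Assumed columns: size count type redop root time algbw busbw ...
	if len(fields) < 8 {
		return 0, 0, fmt.Errorf("too few columns: %d", len(fields))
	}
	size, err := strconv.ParseInt(fields[0], 10, 64)
	if err != nil {
		return 0, 0, fmt.Errorf("size column: %w", err)
	}
	if size != wantBytes {
		return 0, 0, fmt.Errorf("size %d does not match requested %d", size, wantBytes)
	}
	if algoBW, err = strconv.ParseFloat(fields[6], 64); err != nil {
		return 0, 0, fmt.Errorf("algbw column: %w", err)
	}
	if busBW, err = strconv.ParseFloat(fields[7], 64); err != nil {
		return 0, 0, fmt.Errorf("busbw column: %w", err)
	}
	return algoBW, busBW, nil
}

func main() {
	row := "   268435456      67108864     float     sum      -1   2374.1  113.07  197.87      0"
	algo, bus, err := parseResultLine(row, 256*1024*1024)
	fmt.Println(algo, bus, err)
}
```
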
preflight-checks/nccl-loopback/main.go (2)
15-26: Missing package-level godoc comment.

As per coding guidelines, package-level godoc is required for all Go packages. Add a comment describing the purpose of this package before the package main declaration.
📝 Suggested package documentation
+// Package main implements the NCCL loopback preflight check binary.
+// It validates GPU interconnect bandwidth by running NCCL all-reduce benchmarks
+// and reports health status to the platform connector.
 package main
56-63: Consider adding a context timeout for the benchmark execution.

The context.Background() has no timeout, which means the benchmark could run indefinitely if the NCCL test binary hangs. While Kubernetes init container timeouts provide a safety net, adding an explicit timeout improves observability and fail-fast behavior.
🛡️ Optional: Add context timeout
 func run() int {
-	ctx := context.Background()
+	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
+	defer cancel()
You'll need to add "time" to the imports.
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)
110-111: Consider adding securityContext and resource requests.

Unlike the main preflight container (lines 31-38, 45-51), the init container lacks securityContext and resources definitions. While the NCCL test requires elevated privileges for GPU access, explicitly defining the security context improves clarity and allows for controlled privilege escalation.
📝 Suggested additions (similar to dcgm-diag pattern)
  - name: preflight-nccl-loopback
    image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:latest
    # securityContext:
    #   privileged: true  # Required for GPU device access
    # resources:
    #   requests:
    #     cpu: 100m
    #     memory: 256Mi
    #   limits:
    #     nvidia.com/gpu: 8  # Request all GPUs for loopback test

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@preflight-checks/nccl-loopback/main.go`:
- Line 15: Add a package-level godoc comment immediately above the `package
main` declaration that succinctly describes the purpose and responsibilities of
this binary (e.g., what the nccl-loopback preflight check does, intended usage,
and any important invariants); ensure the comment is a full sentence starting
with the package name `main` to satisfy Go doc conventions.
- Around line 53-54: The run function currently uses context.Background()
without signal handling; change it to a cancellable context that listens for
SIGINT and SIGTERM so runner.Run can be stopped gracefully. Import "os/signal"
and "syscall", create a context via signal.NotifyContext (or create a
context.WithCancel and a goroutine that calls cancel on signals), defer the
cancel, and pass this ctx into runner.Run (and any other downstream calls) so
the external benchmark process can be interrupted cleanly on OS signals.

🧹 Nitpick comments (2)

preflight-checks/nccl-loopback/main.go (2)
78-142: Consider extracting the repeated SendEvent + error-handling pattern.

The same SendEvent → check error → log → return exitSendEventError block appears three times (lines 81–89, 111–119, 134–142). A small helper would reduce duplication and ensure consistent handling if the pattern evolves.
Example helper
func sendEvent(ctx context.Context, reporter *health.Reporter, isHealthy, isFatal bool, message, code string) error {
	if err := reporter.SendEvent(ctx, isHealthy, isFatal, message, code); err != nil {
		slog.Error("Failed to send health event", "error", err)
		return err
	}
	return nil
}
Then each call site becomes:
if sendEvent(ctx, reporter, false, true, msg, "NCCL_BW_DEGRADED") != nil {
    return exitSendEventError
}
return exitTestFailed
134-138: Passing empty string "" as event code for the success path.

On the failure paths, meaningful codes like "NCCL_TEST_ERROR" and "NCCL_BW_DEGRADED" are provided. The success path passes "". Consider using an explicit success code (e.g., "NCCL_BW_OK") for consistent event taxonomy and easier downstream filtering/alerting.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

lalitadithya · 2026-02-06T06:34:34Z

Do we also have a way to run this check without a BW test? I.e., I just want to see if the NCCL loopback passes and I don't care about bandwidth?

XRFXLP · 2026-02-06T06:41:11Z

> Do we also have a way to run this check without a BW test? I.e., I just want to see if the NCCL loopback passes and I don't care about bandwidth?

A simple way to do this would be to keep the threshold very low. Are you thinking of a feature flag?

lalitadithya · 2026-02-06T06:48:50Z

> > Do we also have a way to run this check without a BW test? I.e., I just want to see if the NCCL loopback passes and I don't care about bandwidth?
>
> A simple way to do this would be to keep the threshold very low. Are you thinking of a feature flag?

I was thinking of a flag like --no-bandwidth-check or an equivalent env var.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 113-124: Fix the typo and clarify the SKIP_BANDWIDTH_CHECK docs:
correct "GPU-interconnnect" to "GPU-interconnect" in the comment near
BW_THRESHOLD_GBPS/TEST_SIZE_MB, update the SKIP_BANDWIDTH_CHECK commented block
to document the exact truthy values accepted by strconv.ParseBool (accepted true
values: "1", "t", "T", "true", "TRUE", "True"), and remove the trailing
whitespace at the end of the SKIP_BANDWIDTH_CHECK comment line.
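
To make the accepted values concrete, the sketch below runs a few candidate strings through `strconv.ParseBool`. The helper `skipBandwidthCheck` and its treatment of an empty value as "do not skip" are assumptions for illustration, not the PR's code.

```go
package main

import (
	"fmt"
	"strconv"
)

// skipBandwidthCheck interprets the raw SKIP_BANDWIDTH_CHECK value.
// Assumption: an unset or empty variable means the check is not skipped.
func skipBandwidthCheck(raw string) (bool, error) {
	if raw == "" {
		return false, nil
	}
	return strconv.ParseBool(raw)
}

func main() {
	for _, v := range []string{"1", "t", "T", "true", "TRUE", "True", "0", "false", "yes"} {
		skip, err := skipBandwidthCheck(v)
		fmt.Printf("%q -> skip=%v err=%v\n", v, skip, err)
	}
}
```

Values such as "yes" or "on" are rejected by `strconv.ParseBool`, which is why documenting the exact accepted strings in values.yaml is useful.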

🧹 Nitpick comments (5)

preflight-checks/nccl-loopback/pkg/config/config.go (2)
15-15: Missing package-level godoc comment.

As per coding guidelines, "Package-level godoc required for all Go packages".
Suggested fix
+// Package config provides environment-based configuration for the NCCL loopback preflight check.
 package config
236-257: nvidia-smi stderr is not captured — error diagnostics are lost.

When cmd.Run() fails, only err is available. GPU driver or device errors are often written to stderr. Capturing stderr here (as done in benchmark.go) would improve debuggability.
Proposed fix
 func detectGPUCount(ctx context.Context) (int, error) {
 	slog.Debug("Detecting GPU count using nvidia-smi")
 
 	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
 	defer cancel()
 
 	cmd := exec.CommandContext(ctx, "nvidia-smi", "--query-gpu=name", "--format=csv,noheader")
 
-	var stdout bytes.Buffer
+	var stdout, stderr bytes.Buffer
 
 	cmd.Stdout = &stdout
+	cmd.Stderr = &stderr
 
 	if err := cmd.Run(); err != nil {
 		if ctx.Err() == context.DeadlineExceeded {
 			slog.Error("nvidia-smi timed out", "timeout", "30s")
 			return 0, fmt.Errorf("nvidia-smi timed out after 30s")
 		}
 
-		slog.Error("nvidia-smi command failed", "error", err)
+		slog.Error("nvidia-smi command failed", "error", err, "stderr", stderr.String())
 
-		return 0, fmt.Errorf("nvidia-smi failed: %w", err)
+		return 0, fmt.Errorf("nvidia-smi failed: %w\nstderr: %s", err, stderr.String())
 	}
preflight-checks/nccl-loopback/pkg/benchmark/benchmark.go (1)
15-15: Missing package-level godoc comment.

As per coding guidelines, "Package-level godoc required for all Go packages".
Suggested fix
+// Package benchmark provides an NCCL all-reduce benchmark runner and output parser.
 package benchmark
preflight-checks/nccl-loopback/main.go (1)
78-93: Health event message may contain unbounded benchmark output on failure.

When runner.Run fails, its error includes full stdout/stderr (see benchmark.go Lines 100-101). This error is embedded verbatim into the health event message at Line 85 via fmt.Sprintf("NCCL loopback test failed: %v", err). For large benchmark outputs or verbose NCCL debug logs, this could produce an oversized gRPC message.

Consider truncating the error message before sending:
Proposed fix
+	errMsg := fmt.Sprintf("NCCL loopback test failed: %v", err)
+	if len(errMsg) > 4096 {
+		errMsg = errMsg[:4096] + "... (truncated)"
+	}
+
 	if sendErr := reporter.SendEvent(ctx,
 		false, // isHealthy
 		true,  // isFatal
-		fmt.Sprintf("NCCL loopback test failed: %v", err),
+		errMsg,
 		"NCCL_TEST_ERROR",
 	); sendErr != nil {
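One caveat with the byte-index slice in the proposed fix: if the benchmark output contains multi-byte UTF-8 characters, cutting at a fixed byte offset can split a character. A minimal sketch of a boundary-aware variant follows. `truncateMessage` and `maxMessageBytes` are hypothetical names, and the 4096-byte limit is simply carried over from the suggestion above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxMessageBytes mirrors the 4096-byte limit from the review suggestion.
const maxMessageBytes = 4096

// truncateMessage is a hypothetical helper: it shortens msg to at most
// limit bytes (plus a suffix) without cutting through a UTF-8 sequence.
func truncateMessage(msg string, limit int) string {
	if limit < 0 {
		limit = 0
	}
	if len(msg) <= limit {
		return msg
	}
	cut := limit
	// Back up until the byte at cut starts a rune, so msg[:cut] ends
	// on a character boundary.
	for cut > 0 && !utf8.RuneStart(msg[cut]) {
		cut--
	}
	return msg[:cut] + "... (truncated)"
}

func main() {
	errMsg := fmt.Sprintf("NCCL loopback test failed: %v", "exit status 1")
	fmt.Println(truncateMessage(errMsg, maxMessageBytes))
}
```

The result can be passed to `reporter.SendEvent` in place of the raw formatted error.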
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)
132-134: Consider expanding the inline documentation for networkResourceNames.

The comment mentions "RDMA, InfiniBand, etc." but doesn't explain that these are Kubernetes extended resource names advertised by device plugins (e.g., NVIDIA Network Operator). A brief note about when to modify this list (e.g., different NIC types or custom device plugin resource names) would help operators unfamiliar with the setup.
📝 Suggested documentation improvement
-# Network resource names to copy to init containers (RDMA, InfiniBand, etc.)
+# Network resource names to copy to init containers (RDMA, InfiniBand, etc.)
+# These are Kubernetes extended resources advertised by device plugins (e.g., NVIDIA Network Operator).
+# Adjust entries to match the resource names available in your cluster.
+# Example: "nvidia.com/mlnxnics" for Mellanox ConnectX NICs
 networkResourceNames:
   - "nvidia.com/mlnxnics"
As per coding guidelines: "Include examples for non-obvious configurations in Helm chart documentation."

XRFXLP · 2026-02-06T08:13:29Z

Do we also have a way to run this check without a BW test? I.e., I just want to see if the NCCL loopback passes and I don't care about bandwidth?

A simple way to do this would be to keep the threshold very low. Are you thinking of a feature flag?

I thought of some flag like --no-bandwidth-check or equivalent env var

Added env var for this, updated PR with test description.
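A rough sketch of how such an opt-out could be wired is below. The variable name `SKIP_BANDWIDTH_CHECK` and both helper functions are assumptions made for illustration; the name actually used in the PR is not stated in this thread.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// skipEnvVar is a hypothetical name, not confirmed by the PR.
const skipEnvVar = "SKIP_BANDWIDTH_CHECK"

// skipBandwidthCheck treats unset or unparsable values as "do not skip".
func skipBandwidthCheck(raw string) bool {
	skip, err := strconv.ParseBool(raw)
	return err == nil && skip
}

// passes reports whether the check succeeds: with skip set, any run
// that completed counts as healthy; otherwise the measured bus
// bandwidth must reach the threshold (both in GB/s).
func passes(busBW, threshold float64, skip bool) bool {
	return skip || busBW >= threshold
}

func main() {
	skip := skipBandwidthCheck(os.Getenv(skipEnvVar))
	fmt.Println("check passed:", passes(12.5, 100, skip))
}
```

Defaulting to "do not skip" on a malformed value keeps a typo in the environment from silently disabling the bandwidth gate.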

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Line 116: Update the inline comment that currently reads "If using PCIe
GPU-interconnect, this threshold should be set to ~15 GB" to include the correct
rate unit so it reads "~15 GB/s"; edit the comment near the PCIe
GPU-interconnect threshold in values.yaml (the comment containing "If using PCIe
GPU-interconnect") to make the unit consistent with the other inline comment
("GB/s").
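The unit matters because the threshold is compared against a rate, not a size. As a sketch based on the bus bandwidth formula in the PR description (algorithm bandwidth scaled by 2 × (N − 1) / N), the computation could look like this. `busBandwidthGBps` is a hypothetical function name, and the zero return for invalid input is an assumption of this example.

```go
package main

import "fmt"

// busBandwidthGBps is a hypothetical helper applying the formula from
// the PR description: algBW = bytes / seconds, then
// busBW = algBW * 2 * (N - 1) / N. The result is in GB/s (1e9 bytes/s).
func busBandwidthGBps(messageBytes, seconds float64, gpus int) float64 {
	if seconds <= 0 || gpus < 1 {
		return 0 // invalid input; this sketch reports zero bandwidth
	}
	algBW := messageBytes / seconds / 1e9
	n := float64(gpus)
	return algBW * 2 * (n - 1) / n
}

func main() {
	// Illustrative numbers only: 1 GB reduced in 1 s across 8 GPUs.
	bw := busBandwidthGBps(1e9, 1, 8)
	fmt.Printf("bus bandwidth: %.2f GB/s\n", bw)
}
```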

🧹 Nitpick comments (1)

distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)
110-111: Consider documenting the minimum driver requirement for this image.

The PR description notes the NCCL loopback image requires NVIDIA driver R570+. This is a non-obvious operational constraint that users could easily miss. As per coding guidelines, non-obvious configurations should include explanatory comments/examples.
Proposed addition
   - name: preflight-nccl-loopback
+    # Requires NVIDIA driver R570+ on the host
     image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:latest

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

XRFXLP added 5 commits February 5, 2026 18:05

commit

92a8a67

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

chore: minor changes

7acf5df

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

feat: add NCCL loopback as a preflight test

dd28bb5

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

chore: remove test files

acf103c

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

chore: use pytorch image as a base

59090d9

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

XRFXLP force-pushed the xrfxlp/764 branch from 5d4b917 to 59090d9 Compare February 5, 2026 12:35

coderabbitai bot reviewed Feb 5, 2026

distros/kubernetes/nvsentinel/charts/preflight/values.yaml (resolved)

preflight-checks/nccl-loopback/Dockerfile (outdated, resolved)

preflight-checks/nccl-loopback/Dockerfile (resolved)

XRFXLP added 3 commits February 6, 2026 09:52

chore: minor changes

c8b3af5

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

Merge branch 'main' into xrfxlp/764

e2764d8

chore: few changes in go-mod tidy

9c4e441

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

XRFXLP self-assigned this Feb 6, 2026

XRFXLP added the enhancement New feature or request label Feb 6, 2026

XRFXLP added this to the v0.9.0 milestone Feb 6, 2026

This was linked to issues Feb 6, 2026

[Feature]: Add preflight check for NCCL allreduce test #766

Closed

[Feature]: Add preflight check targetting NCCL loopback test #764

Closed

XRFXLP removed a link to an issue Feb 6, 2026

[Feature]: Add preflight check for NCCL allreduce test #766

Closed

1 task

XRFXLP changed the title ~~[WIP] NCCL Loopback test~~ feat: Add NCCL loopback tests in prefligth test suite Feb 6, 2026

coderabbitai bot reviewed Feb 6, 2026

preflight-checks/nccl-loopback/main.go (resolved)

preflight-checks/nccl-loopback/main.go (resolved)

chore: add nccl loopback in github workflows

8b425d4

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

XRFXLP changed the title ~~feat: Add NCCL loopback tests in prefligth test suite~~ feat: Add NCCL loopback tests in preflight test suite Feb 6, 2026

chore: fix dockerfile

1b571d6

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

lalitadithya reviewed Feb 6, 2026

preflight-checks/nccl-loopback/pkg/benchmark/benchmark.go (resolved)

preflight-checks/nccl-loopback/pkg/benchmark/benchmark_test.go (resolved)

preflight-checks/nccl-loopback/pkg/benchmark/benchmark.go (resolved)

chore: address review comments

812d0a8

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

coderabbitai bot reviewed Feb 6, 2026

distros/kubernetes/nvsentinel/charts/preflight/values.yaml (resolved)

chore: fix typo

1f2d8ff

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

coderabbitai bot reviewed Feb 6, 2026

distros/kubernetes/nvsentinel/charts/preflight/values.yaml (outdated, resolved)

XRFXLP added 2 commits February 6, 2026 13:47

chore: one more typo

d48fdef

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>

Merge branch 'main' into xrfxlp/764

4d44fae

lalitadithya approved these changes Feb 6, 2026

lalitadithya merged commit b7000ff into NVIDIA:main Feb 6, 2026
41 checks passed
