
feat: Add NCCL loopback tests in preflight test suite#808

Merged
lalitadithya merged 14 commits into NVIDIA:main from XRFXLP:xrfxlp/764
Feb 6, 2026

Conversation

@XRFXLP
Member

@XRFXLP XRFXLP commented Feb 5, 2026

Summary

This PR adds an NCCL loopback test to the preflight test suite.

What this test checks:

flowchart TB
    subgraph healthy["✅ HEALTHY NODE"]
        direction LR
        H0[GPU 0] --> H1[GPU 1] --> H2[GPU 2] --> H3[...] --> H7[GPU 7] --> H0
    end

    healthy --> PASS["400 GB/s ✅"]
    PASS --> broken

    subgraph broken["🔴 BROKEN NODE"]
        direction LR
        B0[GPU 0] --> B1[GPU 1] -.->|"PCIe ❌"| B2[GPU 2] --> B3[...] --> B7[GPU 7] --> B0
    end

    broken --> FAIL["< 100 GB/s 🔴"]

For bus bandwidth:

                    Message Size (bytes)
Algorithm BW  =  ─────────────────────────
                      Time (seconds)


                              2 × (N - 1)
Bus BW  =  Algorithm BW  ×  ─────────────
                                  N
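The same calculation can be expressed in Go. This is a standalone sketch of the two formulas above (the function name is mine, not the PR's):

```go
package main

import "fmt"

// busBandwidthGBps applies the nccl-tests convention: algorithm bandwidth
// is message size over time, and for an N-GPU all-reduce the bus bandwidth
// scales it by 2*(N-1)/N.
func busBandwidthGBps(messageBytes int64, seconds float64, numGPUs int) float64 {
	algoBW := float64(messageBytes) / seconds / 1e9 // GB/s
	return algoBW * 2 * float64(numGPUs-1) / float64(numGPUs)
}

func main() {
	// 256 MiB moved in 2374.1 us across 8 GPUs: ~113 GB/s algo, ~198 GB/s bus.
	fmt.Printf("%.2f\n", busBandwidthGBps(268435456, 2374.1e-6, 8))
}
```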

For the runtime image, I'm using PyTorch 25.01, which supports R570+ driver versions, and compiling nccl-tests on top of it. The objective is to reduce the startup time of subsequent preflight tests (e.g., allReduce) that use the same base image. The other option was to use the nccl-tests image from the ECR gallery, which has EFA plugins built in; that would not have been a problem for the loopback test, but it might not have benefited the other CSPs.

Testing

Bandwidth across hardware

With a 256 MB message size:

| CSP   | Bus Bandwidth (GB/s) | Hardware | GPUs |
|-------|----------------------|----------|------|
| Azure | 195                  | A100     | 8    |
| OCI   | 347                  | H100     | 8    |
| AWS   | 347                  | H100     | 8    |
| AWS   | 596                  | GB300    | 4    |
| AWS   | 597                  | GB200    | 4    |
| GCP   | 341                  | H100     | 8    |

Healthy node

$ klog nginx-with-gpu-qmrlb -c preflight-nccl-loopback -n preflight-test-ns | jq
{
  "time": "2026-02-05T15:56:16.027441109Z",
  "level": "INFO",
  "msg": "Starting preflight NCCL loopback check",
  "version": "dev",
  "commit": "none",
  "date": "$(date"
}
{
  "time": "2026-02-05T15:56:16.129207959Z",
  "level": "INFO",
  "msg": "Detected GPUs",
  "count": 8
}
{
  "time": "2026-02-05T15:56:16.129263995Z",
  "level": "INFO",
  "msg": "Configuration loaded",
  "bw_threshold_gbps": 150,
  "test_size_mb": 256,
  "num_gpus": 8,
  "binary": "/opt/nccl-tests/build/all_reduce_perf",
  "node_name": "aks-gpu-xxx-vmss00000a"
}
{
  "time": "2026-02-05T15:56:16.129281377Z",
  "level": "INFO",
  "msg": "Running NCCL all-reduce benchmark",
  "binary": "/opt/nccl-tests/build/all_reduce_perf",
  "size_mb": 256,
  "num_gpus": 8,
  "timeout": 300000000000
}
{
  "time": "2026-02-05T15:56:25.15855496Z",
  "level": "INFO",
  "msg": "Benchmark completed",
  "bus_bandwidth_gbps": 195.82,
  "algo_bandwidth_gbps": 111.89,
  "num_gpus": 8,
  "test_size_bytes": 268435456
}
{
  "time": "2026-02-05T15:56:25.158607218Z",
  "level": "INFO",
  "msg": "NCCL loopback test PASSED",
  "measured_gbps": 195.82,
  "threshold_gbps": 150
}
{
  "time": "2026-02-05T15:56:25.158631153Z",
  "level": "INFO",
  "msg": "Sending health event",
  "is_healthy": true,
  "is_fatal": false,
  "message": "NCCL all-reduce bus bandwidth 195.82 GB/s meets threshold 150.00 GB/s",
  "error_code": "",
  "recommended_action": "NONE"
}
{
  "time": "2026-02-05T15:56:25.231322728Z",
  "level": "INFO",
  "msg": "Health event sent successfully"
}
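The bus/algo bandwidth figures logged above come from parsing all_reduce_perf stdout. A minimal sketch of that parsing, assuming the standard nccl-tests column layout (the helper name is hypothetical; the PR's real parser lives in pkg/benchmark):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBusBW scans all_reduce_perf stdout for the result row matching
// sizeBytes and returns the busbw column (GB/s). Assumed column order:
// size count type redop root time algbw busbw #wrong
func parseBusBW(output string, sizeBytes int64) (float64, error) {
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 8 || strings.HasPrefix(fields[0], "#") {
			continue // skip headers, comments, and short lines
		}
		size, err := strconv.ParseInt(fields[0], 10, 64)
		if err != nil || size != sizeBytes {
			continue
		}
		return strconv.ParseFloat(fields[7], 64)
	}
	return 0, fmt.Errorf("no result row for size %d", sizeBytes)
}

func main() {
	sample := `# Using devices
   268435456      67108864     float     sum      -1   2374.1  113.07  197.87      0`
	bw, err := parseBusBW(sample, 268435456)
	fmt.Println(bw, err)
}
```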

Unhealthy node:

$ klog nginx-with-gpu-wflt5 -c preflight-nccl-loopback -n preflight-test-ns | jq
{
  "time": "2026-02-05T16:17:20.485671689Z",
  "level": "INFO",
  "msg": "Starting preflight NCCL loopback check",
  "version": "dev",
  "commit": "none",
  "date": "$(date"
}
{
  "time": "2026-02-05T16:17:20.615041266Z",
  "level": "INFO",
  "msg": "Detected GPUs",
  "count": 8
}
{
  "time": "2026-02-05T16:17:20.615097Z",
  "level": "INFO",
  "msg": "Configuration loaded",
  "bw_threshold_gbps": 150,
  "test_size_mb": 256,
  "num_gpus": 8,
  "binary": "/opt/nccl-tests/build/all_reduce_perf",
  "node_name": "aks-gpu-xxx-vmss00000j"
}
{
  "time": "2026-02-05T16:17:20.615110976Z",
  "level": "INFO",
  "msg": "Running NCCL all-reduce benchmark",
  "binary": "/opt/nccl-tests/build/all_reduce_perf",
  "size_mb": 256,
  "num_gpus": 8,
  "timeout": 300000000000
}
{
  "time":  #"2026-02-05T16:17:20.972182522Z",
  "level": "ERROR",
  "msg": "NCCL benchmark failed",
  "error": "exit status 2",
  "stdout": "# nThread 1 nGpus 8 minBytes 268435456 maxBytes 268435456 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0\n#\n# Using devices\nnginx-with-gpu-wflt5: Test CUDA failure common.cu:941 'system not yet initialized'\n .. nginx-with-gpu-wflt5 pid 15: Test failure common.cu:891\n",
  "stderr": ""
}
{
  "time": "2026-02-05T16:17:20.972232405Z",
  "level": "ERROR",
  "msg": "NCCL benchmark failed",
  "error": "all_reduce_perf failed: exit status 2\nstdout: # nThread 1 nGpus 8 minBytes 268435456 maxBytes 268435456 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0\n#\n# Using devices\nnginx-with-gpu-wflt5: Test CUDA failure common.cu:941 'system not yet initialized'\n .. nginx-with-gpu-wflt5 pid 15: Test failure common.cu:891\n\nstderr: "
}
{
  "time": "2026-02-05T16:17:20.972249076Z",
  "level": "INFO",
  "msg": "Sending health event",
  "is_healthy": false,
  "is_fatal": true,
  "message": "NCCL loopback test failed: all_reduce_perf failed: exit status 2\nstdout: # nThread 1 nGpus 8 minBytes 268435456 maxBytes 268435456 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0\n#\n# Using devices\nnginx-with-gpu-wflt5: Test CUDA failure common.cu:941 'system not yet initialized'\n .. nginx-with-gpu-wflt5 pid 15: Test failure common.cu:891\n\nstderr: ",
  "error_code": "NCCL_TEST_ERROR",
  "recommended_action": "CONTACT_SUPPORT"
}
{
  "time": "2026-02-05T16:17:20.974316344Z",
  "level": "INFO",
  "msg": "Health event sent successfully"
}

Event structure:

{
  "time": "2026-02-05T15:56:25.967126485Z",
  "level": "INFO",
  "msg": "Successfully got item",
  "module": "platform-connectors",
  "version": "v0.7.1",
  "healthEvents": {
    "version": 1,
    "events": [
      {
        "version": 1,
        "agent": "preflight-nccl-loopback",
        "componentClass": "Node",
        "checkName": "NCCLLoopbackTest",
        "isHealthy": true,
        "message": "NCCL all-reduce bus bandwidth 196.85 GB/s meets threshold 150.00 GB/s",
        "metadata": {
          "node.kubernetes.io/instance-type": "Standard_ND96amsr_A100_v4",
          "nvidia.com/cuda.driver-version.full": "570.148.08",
          "nvidia.com/cuda.driver-version.major": "570",
          "nvidia.com/cuda.driver-version.minor": "148",
          "nvidia.com/cuda.driver-version.revision": "08",
          "nvidia.com/cuda.runtime-version.full": "12.8",
          "nvidia.com/cuda.runtime-version.major": "12",
          "nvidia.com/cuda.runtime-version.minor": "8",
          "nvidia.com/gpu.product": "NVIDIA-A100-SXM4-80GB",
          "providerID": "az..0",
          "topology.kubernetes.io/region": "southcentralus",
          "topology.kubernetes.io/zone": "0"
        },
        "generatedTimestamp": {
          "seconds": 1770306985,
          "nanos": 937919942
        },
        "nodeName": "aks-gpu-xxx-vmss000000",
        "processingStrategy": 1
      }
    ]
  }
}

Multiple init containers

$ kg cm preflight-config -n nvsentinel -o yaml | yq .data
config.yaml: |
  gpuResourceNames:
    - nvidia.com/gpu
  networkResourceNames:
    - nvidia.com/mlnxnics
  dcgm:
    hostengineAddr: "nvidia-dcgm.gpu-operator.svc:5555"
    diagLevel: 2
    connectorSocket: "unix:///var/run/nvsentinel.sock"
    processingStrategy: "EXECUTE_REMEDIATION"
  initContainers:
    - image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:latest
      name: preflight-dcgm-diag
      volumeMounts:
      - mountPath: /var/run
        name: nvsentinel-socket
    - env:
      - name: BW_THRESHOLD_GBPS
        value: "150"
      - name: TEST_SIZE_MB
        value: "256"
      image: docker.io/xrfxlp/preflight-nccl-loopback:16460502
      name: preflight-nccl-loopback
      volumeMounts:
      - mountPath: /var/run
        name: nvsentinel-socket

then both initContainers are injected:

$ kg po -n preflight-test-ns      nginx-with-gpu-hm54r -o yaml | yq .spec.initContainers 
- env:
    - name: DCGM_DIAG_LEVEL
      value: "2"
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: DCGM_HOSTENGINE_ADDR
      value: nvidia-dcgm.gpu-operator.svc:5555
    - name: PLATFORM_CONNECTOR_SOCKET
      value: unix:///var/run/nvsentinel.sock
    - name: PROCESSING_STRATEGY
      value: EXECUTE_REMEDIATION
  image: ghcr.io/nvidia/nvsentinel/preflight-dcgm-diag:latest
  imagePullPolicy: Always
  name: preflight-dcgm-diag
  resources:
    limits:
      nvidia.com/gpu: "8"
    requests:
      cpu: 100m
      memory: 500Mi
      nvidia.com/gpu: "8"
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
    - mountPath: /var/run
      name: nvsentinel-socket
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-rtp4l
      readOnly: true
- env:
    - name: BW_THRESHOLD_GBPS
      value: "150"
    - name: TEST_SIZE_MB
      value: "256"
  image: docker.io/xrfxlp/preflight-nccl-loopback:16460502
  imagePullPolicy: IfNotPresent
  name: preflight-nccl-loopback
  resources:
    limits:
      nvidia.com/gpu: "8"
    requests:
      cpu: 100m
      memory: 500Mi
      nvidia.com/gpu: "8"
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
    - mountPath: /var/run
      name: nvsentinel-socket
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-rtp4l
      readOnly: true

Skipping bandwidth check

With SKIP_BANDWIDTH_CHECK=true set as an environment variable:

{
  "time": "2026-02-06T07:14:20.030558502Z",
  "level": "INFO",
  "msg": "Benchmark completed",
  "module": "nccl-loopback",
  "version": "dev",
  "bus_bandwidth_gbps": 196.11,
  "algo_bandwidth_gbps": 112.06,
  "num_gpus": 8,
  "test_size_bytes": 268435456
}
{
  "time": "2026-02-06T07:14:20.030597825Z",
  "level": "INFO",
  "msg": "NCCL loopback test PASSED (bandwidth check skipped)",
  "module": "nccl-loopback",
  "version": "dev",
  "measured_gbps": 196.11
}
{
  "time": "2026-02-06T07:14:20.030619736Z",
  "level": "INFO",
  "msg": "Sending health event",
  "module": "nccl-loopback",
  "version": "dev",
  "is_healthy": true,
  "is_fatal": false,
  "message": "NCCL all-reduce completed successfully with bus bandwidth 196.11 GB/s (threshold check skipped)",
  "error_code": "",
  "recommended_action": "NONE"
}

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Added an NCCL loopback preflight check with a default 150 GBps threshold and 256 MB test, plus an init container for deployment.
  • Tests

    • CI matrices updated to include NCCL loopback validation; added unit tests for benchmark output parsing.
  • Chores

    • Added build, packaging, image publishing, and runtime tooling for the NCCL loopback check.

@copy-pr-bot

copy-pr-bot bot commented Feb 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 5, 2026

Warning

Rate limit exceeded

@XRFXLP has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 16 minutes and 24 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📝 Walkthrough

Walkthrough

Adds a new NCCL loopback preflight check: CI workflow matrix entries, a Helm initContainer entry, a multi-stage Docker image and Makefile, and a new Go module implementing env-driven config, an NCCL benchmark runner, health-event reporter, and unit tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **CI/CD workflows**<br>`.github/workflows/lint-test.yml`, `.github/workflows/container-build-test.yml`, `.github/workflows/publish.yml` | Extended lint-test matrix to include nccl-loopback; added container-build-test step and publish matrix entry for preflight-nccl-loopback (docker-build / docker-publish targets). |
| **Kubernetes chart**<br>`distros/kubernetes/nvsentinel/charts/preflight/values.yaml` | Added preflight-nccl-loopback initContainer (ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:latest) with env vars BW_THRESHOLD_GBPS=150, TEST_SIZE_MB=256, and a nvsentinel-socket volumeMount. |
| **Container image & build**<br>`preflight-checks/nccl-loopback/Dockerfile`, `preflight-checks/nccl-loopback/Makefile` | New multi-stage Dockerfile (Go build stage + PyTorch runtime building nccl-tests) and Makefile with Go-module flags, Docker targets, and CI targets (lint/test/docker-build/docker-publish). |
| **Go module & deps**<br>`preflight-checks/nccl-loopback/go.mod` | New Go module declaration with dependencies (gRPC, protobuf, k8s libs, etc.) and local replace directives linking commons and data-models. |
| **Application entrypoint**<br>`preflight-checks/nccl-loopback/main.go` | New main program: JSON logging, env-based config loading, benchmark execution, health-event sending over connector socket, and distinct exit codes for config/test/send outcomes. |
| **Benchmark package & tests**<br>`preflight-checks/nccl-loopback/pkg/benchmark/...` | Added Runner and Result types to execute all_reduce_perf with timeout, parse GPU count and bandwidth metrics, and comprehensive unit tests covering parsing edge cases. |
| **Configuration**<br>`preflight-checks/nccl-loopback/pkg/config/config.go` | Added Config struct and FromEnv(ctx) to parse/validate env vars, detect GPU count via nvidia-smi (30s timeout), validate NCCL test binary path, and require connector socket/node name and processing strategy. |
| **Health reporting**<br>`preflight-checks/nccl-loopback/pkg/health/reporter.go` | Added Reporter that builds HealthEvent payloads and sends them over a UNIX gRPC socket with exponential-backoff retries and retryable-error classification. |

Sequence Diagram

sequenceDiagram
    participant M as Main
    participant C as Config
    participant GPU as nvidia-smi
    participant B as Benchmark Runner
    participant ER as all_reduce_perf
    participant H as Health Reporter
    participant GS as gRPC Socket

    M->>C: FromEnv(ctx)
    activate C
    C->>GPU: detect GPU count (30s timeout)
    GPU-->>C: GPU count
    C-->>M: Config
    deactivate C

    M->>B: NewRunner(binaryPath)
    B-->>M: Runner

    M->>B: Run(ctx, numGPUs, testSizeMB)
    activate B
    B->>ER: execute all_reduce_perf (with timeout)
    ER-->>B: stdout/stderr
    B->>B: parseOutput() -> Result
    B-->>M: Result
    deactivate B

    M->>M: compare bandwidth vs threshold

    alt degraded or error
        M->>H: SendEvent(isHealthy=false,isFatal=true)
        activate H
        H->>GS: HealthEventOccurredV1 (with retries)
        GS-->>H: response
        deactivate H
    else healthy
        M->>H: SendEvent(isHealthy=true,isFatal=false)
        activate H
        H->>GS: HealthEventOccurredV1
        GS-->>H: response
        deactivate H
    end

    M->>M: exit with appropriate code

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped into builds to count each GPU,
Ran tiny all-reduce and listened for the cue,
Parsed bandwidth lines, retried through the socket,
Sent a health event — then tucked in my pocket,
Preflight loopback ready; I nibble logs anew.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00%, below the required 80.00% threshold. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: Add NCCL loopback tests in preflight test suite' accurately and concisely summarizes the main change: a new NCCL loopback preflight test feature. It is clear, specific, and directly reflects the primary objective of the PR. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 109-119: The init container definition for preflight-nccl-loopback
is missing required env vars, causing config.FromEnv() to fail; update the
preflight-nccl-loopback container (name: preflight-nccl-loopback) to add
environment variables NODE_NAME and PLATFORM_CONNECTOR_SOCKET (sourcing
appropriate values or Kubernetes fieldRef/secret as needed) so the
config.FromEnv() call can read them at runtime; ensure the new env entries match
the names exactly (NODE_NAME, PLATFORM_CONNECTOR_SOCKET) and use the same
mounting/secret references used elsewhere in the chart.

In `@preflight-checks/nccl-loopback/Dockerfile`:
- Around line 29-31: ARG BUILD_DATE is using a shell substitution that won't be
executed at build time; change the Dockerfile's ARG BUILD_DATE to a static
placeholder (e.g. empty or "none") instead of "$(date ...)" and update your
CI/build invocation to pass the real timestamp via --build-arg BUILD_DATE=$(date
-u +%FT%TZ); reference the ARG lines (ARG VERSION, ARG GIT_COMMIT, ARG
BUILD_DATE) in the Dockerfile and ensure the pipeline sets BUILD_DATE when
calling docker build.
- Around line 78-85: The Dockerfile hardcodes NVCC_GENCODE including compute_100
which breaks builds on older CUDA/PyTorch; refactor the RUN that builds
nccl-tests to use a build ARG (e.g., ARG NVCC_GENCODE with a conservative
default that omits compute_100) and replace the literal NVCC_GENCODE value with
${NVCC_GENCODE} in the make invocation, and optionally add an ARG CUDA_VERSION
(or detect CUDA) to conditionally append compute_100 when CUDA >= 12.8; update
NCCL_TESTS_VERSION usage remains the same.
🧹 Nitpick comments (10)
preflight-checks/nccl-loopback/pkg/config/config.go (2)

15-15: Add package-level documentation.

The config package is missing a package-level godoc comment. As per coding guidelines, package-level godoc is required for all Go packages.

📝 Proposed fix
+// Package config provides configuration loading and validation for the NCCL loopback preflight check.
 package config

211-215: Avoid shadowing the ctx parameter.

The function shadows the incoming ctx parameter by reassigning it on line 214. While functionally correct here, this pattern can cause confusion in larger functions where readers might expect the original context to still be in scope.

♻️ Proposed fix
 func detectGPUCount(ctx context.Context) (int, error) {
 	slog.Debug("Detecting GPU count using nvidia-smi")
 
-	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
+	timeoutCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
 	defer cancel()
 
-	cmd := exec.CommandContext(ctx, "nvidia-smi", "--query-gpu=name", "--format=csv,noheader")
+	cmd := exec.CommandContext(timeoutCtx, "nvidia-smi", "--query-gpu=name", "--format=csv,noheader")
 
 	var stdout bytes.Buffer
 
 	cmd.Stdout = &stdout
 
 	if err := cmd.Run(); err != nil {
-		if ctx.Err() == context.DeadlineExceeded {
+		if timeoutCtx.Err() == context.DeadlineExceeded {
 			slog.Error("nvidia-smi timed out", "timeout", "30s")
 			return 0, fmt.Errorf("nvidia-smi timed out after 30s")
 		}
preflight-checks/nccl-loopback/pkg/health/reporter.go (3)

15-15: Add package-level documentation.

The health package is missing a package-level godoc comment. As per coding guidelines, package-level godoc is required for all Go packages.

📝 Proposed fix
+// Package health provides health event reporting for the NCCL loopback preflight check.
 package health

47-56: Add function documentation for exported constructor.

The exported NewReporter function should have a godoc comment describing its purpose and parameters. As per coding guidelines, function comments are required for all exported Go functions.

📝 Proposed fix
+// NewReporter creates a new health event reporter.
+// socketPath is the Unix socket path for the Platform Connector (with or without "unix://" prefix).
+// nodeName is the Kubernetes node name to include in health events.
+// strategy determines how downstream modules handle the event.
 func NewReporter(socketPath, nodeName string, strategy pb.ProcessingStrategy) *Reporter {

58-98: Add function documentation for exported method.

The exported SendEvent method should have a godoc comment. As per coding guidelines, function comments are required for all exported Go functions.

📝 Proposed fix
+// SendEvent sends a health event to the Platform Connector with retry logic.
+// isHealthy indicates the health status, isFatal marks critical failures,
+// message provides human-readable details, and errorCode is an optional error identifier.
 func (r *Reporter) SendEvent(ctx context.Context, isHealthy, isFatal bool, message string, errorCode string) error {
preflight-checks/nccl-loopback/pkg/benchmark/benchmark.go (1)

15-15: Add package-level documentation.

The benchmark package is missing a package-level godoc comment. As per coding guidelines, package-level godoc is required for all Go packages.

📝 Proposed fix
+// Package benchmark provides NCCL all-reduce benchmark execution and result parsing.
 package benchmark
preflight-checks/nccl-loopback/pkg/benchmark/benchmark_test.go (1)

103-136: Consider consolidating error cases into a table-driven test.

The error test cases (WrongSize, NoGPUs, MalformedData) share a similar structure. As per coding guidelines, table-driven tests are preferred when testing multiple scenarios.

♻️ Proposed table-driven approach
+func TestParseOutput_ErrorCases(t *testing.T) {
+	tests := []struct {
+		name       string
+		output     string
+		testSizeMB int
+		wantErr    string
+	}{
+		{
+			name:       "WrongSize",
+			output:     sampleNCCLOutput,
+			testSizeMB: 128, // Looking for 128MB, but output has 256MB
+			wantErr:    "could not find results",
+		},
+		{
+			name: "NoGPUs",
+			output: `# Some header
+# No GPU rank lines here
+   268435456      67108864     float     sum      -1   2374.1  113.07  197.87      0
+`,
+			testSizeMB: 256,
+			wantErr:    "could not determine number of GPUs",
+		},
+		{
+			name: "MalformedData",
+			output: `# Using devices
+#  Rank  0 Group  0 Pid     12 on nccl-test-1 device  0 [0001:00:00] NVIDIA A100
+   not_a_number      67108864     float     sum      -1   2374.1  113.07  197.87      0
+`,
+			testSizeMB: 256,
+			wantErr:    "could not find results",
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			_, err := parseOutput(tt.output, tt.testSizeMB)
+			if err == nil {
+				t.Errorf("expected error containing %q, got nil", tt.wantErr)
+				return
+			}
+			t.Logf("Got expected error: %v", err)
+		})
+	}
+}
preflight-checks/nccl-loopback/main.go (2)

15-26: Missing package-level godoc comment.

As per coding guidelines, package-level godoc is required for all Go packages. Add a comment describing the purpose of this package before the package main declaration.

📝 Suggested package documentation
+// Package main implements the NCCL loopback preflight check binary.
+// It validates GPU interconnect bandwidth by running NCCL all-reduce benchmarks
+// and reports health status to the platform connector.
 package main

56-63: Consider adding a context timeout for the benchmark execution.

The context.Background() has no timeout, which means the benchmark could run indefinitely if the NCCL test binary hangs. While Kubernetes init container timeouts provide a safety net, adding an explicit timeout improves observability and fail-fast behavior.

🛡️ Optional: Add context timeout
 func run() int {
-	ctx := context.Background()
+	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
+	defer cancel()

You'll need to add "time" to the imports.

distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)

110-111: Consider adding securityContext and resource requests.

Unlike the main preflight container (lines 31-38, 45-51), the init container lacks securityContext and resources definitions. While the NCCL test requires elevated privileges for GPU access, explicitly defining the security context improves clarity and allows for controlled privilege escalation.

📝 Suggested additions (similar to dcgm-diag pattern)
  - name: preflight-nccl-loopback
    image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:latest
    # securityContext:
    #   privileged: true  # Required for GPU device access
    # resources:
    #   requests:
    #     cpu: 100m
    #     memory: 256Mi
    #   limits:
    #     nvidia.com/gpu: 8  # Request all GPUs for loopback test

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
@XRFXLP XRFXLP self-assigned this Feb 6, 2026
@XRFXLP XRFXLP added the enhancement New feature or request label Feb 6, 2026
@XRFXLP XRFXLP added this to the v0.9.0 milestone Feb 6, 2026
@XRFXLP XRFXLP changed the title [WIP] NCCL Loopback test feat: Add NCCL loopback tests in prefligth test suite Feb 6, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@preflight-checks/nccl-loopback/main.go`:
- Line 15: Add a package-level godoc comment immediately above the `package
main` declaration that succinctly describes the purpose and responsibilities of
this binary (e.g., what the nccl-loopback preflight check does, intended usage,
and any important invariants); ensure the comment is a full sentence starting
with the package name `main` to satisfy Go doc conventions.
- Around line 53-54: The run function currently uses context.Background()
without signal handling; change it to a cancellable context that listens for
SIGINT and SIGTERM so runner.Run can be stopped gracefully. Import "os/signal"
and "syscall", create a context via signal.NotifyContext (or create a
context.WithCancel and a goroutine that calls cancel on signals), defer the
cancel, and pass this ctx into runner.Run (and any other downstream calls) so
the external benchmark process can be interrupted cleanly on OS signals.
🧹 Nitpick comments (2)
preflight-checks/nccl-loopback/main.go (2)

78-142: Consider extracting the repeated SendEvent + error-handling pattern.

The same SendEvent → check error → log → return exitSendEventError block appears three times (lines 81–89, 111–119, 134–142). A small helper would reduce duplication and ensure consistent handling if the pattern evolves.

Example helper
func sendEvent(ctx context.Context, reporter *health.Reporter, isHealthy, isFatal bool, message, code string) error {
	if err := reporter.SendEvent(ctx, isHealthy, isFatal, message, code); err != nil {
		slog.Error("Failed to send health event", "error", err)
		return err
	}
	return nil
}

Then each call site becomes:

if sendEvent(ctx, reporter, false, true, msg, "NCCL_BW_DEGRADED") != nil {
    return exitSendEventError
}
return exitTestFailed

134-138: Passing empty string "" as event code for the success path.

On the failure paths, meaningful codes like "NCCL_TEST_ERROR" and "NCCL_BW_DEGRADED" are provided. The success path passes "". Consider using an explicit success code (e.g., "NCCL_BW_OK") for consistent event taxonomy and easier downstream filtering/alerting.

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
@XRFXLP XRFXLP changed the title feat: Add NCCL loopback tests in prefligth test suite feat: Add NCCL loopback tests in preflight test suite Feb 6, 2026
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
@lalitadithya
Collaborator

Do we also have a way to run this check without a BW test? I.e., I just want to see if the NCCL loopback passes and I don't care about bandwidth?

@XRFXLP
Member Author

XRFXLP commented Feb 6, 2026

Do we also have a way to run this check without a BW test? I.e., I just want to see if the NCCL loopback passes and I don't care about bandwidth?

A simple way to do this would be keep threshold very low. Are you thinking of some feature flag?

@lalitadithya
Collaborator

Do we also have a way to run this check without a BW test? I.e., I just want to see if the NCCL loopback passes and I don't care about bandwidth?

A simple way to do this would be keep threshold very low. Are you thinking of some feature flag?

I thought of some flag like --no-bandwidth-check or equivalent env var

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 113-124: Fix the typo and clarify the SKIP_BANDWIDTH_CHECK docs:
correct "GPU-interconnnect" to "GPU-interconnect" in the comment near
BW_THRESHOLD_GBPS/TEST_SIZE_MB, update the SKIP_BANDWIDTH_CHECK commented block
to document the exact truthy values accepted by strconv.ParseBool (accepted true
values: "1", "t", "T", "true", "TRUE", "True"), and remove the trailing
whitespace at the end of the SKIP_BANDWIDTH_CHECK comment line.
🧹 Nitpick comments (5)
preflight-checks/nccl-loopback/pkg/config/config.go (2)

15-15: Missing package-level godoc comment.

As per coding guidelines, "Package-level godoc required for all Go packages".

Suggested fix
+// Package config provides environment-based configuration for the NCCL loopback preflight check.
 package config

236-257: nvidia-smi stderr is not captured — error diagnostics are lost.

When cmd.Run() fails, only err is available. GPU driver or device errors are often written to stderr. Capturing stderr here (as done in benchmark.go) would improve debuggability.

Proposed fix
 func detectGPUCount(ctx context.Context) (int, error) {
 	slog.Debug("Detecting GPU count using nvidia-smi")
 
 	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
 	defer cancel()
 
 	cmd := exec.CommandContext(ctx, "nvidia-smi", "--query-gpu=name", "--format=csv,noheader")
 
-	var stdout bytes.Buffer
+	var stdout, stderr bytes.Buffer
 
 	cmd.Stdout = &stdout
+	cmd.Stderr = &stderr
 
 	if err := cmd.Run(); err != nil {
 		if ctx.Err() == context.DeadlineExceeded {
 			slog.Error("nvidia-smi timed out", "timeout", "30s")
 			return 0, fmt.Errorf("nvidia-smi timed out after 30s")
 		}
 
-		slog.Error("nvidia-smi command failed", "error", err)
+		slog.Error("nvidia-smi command failed", "error", err, "stderr", stderr.String())
 
-		return 0, fmt.Errorf("nvidia-smi failed: %w", err)
+		return 0, fmt.Errorf("nvidia-smi failed: %w\nstderr: %s", err, stderr.String())
 	}
preflight-checks/nccl-loopback/pkg/benchmark/benchmark.go (1)

15-15: Missing package-level godoc comment.

As per coding guidelines, "Package-level godoc required for all Go packages".

Suggested fix
+// Package benchmark provides an NCCL all-reduce benchmark runner and output parser.
 package benchmark
preflight-checks/nccl-loopback/main.go (1)

78-93: Health event message may contain unbounded benchmark output on failure.

When runner.Run fails, its error includes full stdout/stderr (see benchmark.go Lines 100-101). This error is embedded verbatim into the health event message at Line 85 via fmt.Sprintf("NCCL loopback test failed: %v", err). For large benchmark outputs or verbose NCCL debug logs, this could produce an oversized gRPC message.

Consider truncating the error message before sending:

Proposed fix
+	errMsg := fmt.Sprintf("NCCL loopback test failed: %v", err)
+	if len(errMsg) > 4096 {
+		errMsg = errMsg[:4096] + "... (truncated)"
+	}
+
 	if sendErr := reporter.SendEvent(ctx,
 		false, // isHealthy
 		true,  // isFatal
-		fmt.Sprintf("NCCL loopback test failed: %v", err),
+		errMsg,
 		"NCCL_TEST_ERROR",
 	); sendErr != nil {
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)

132-134: Consider expanding the inline documentation for networkResourceNames.

The comment mentions "RDMA, InfiniBand, etc." but doesn't explain that these are Kubernetes extended resource names advertised by device plugins (e.g., NVIDIA Network Operator). A brief note about when to modify this list (e.g., different NIC types or custom device plugin resource names) would help operators unfamiliar with the setup.

📝 Suggested documentation improvement
-# Network resource names to copy to init containers (RDMA, InfiniBand, etc.)
+# Network resource names to copy to init containers (RDMA, InfiniBand, etc.)
+# These are Kubernetes extended resources advertised by device plugins (e.g., NVIDIA Network Operator).
+# Adjust entries to match the resource names available in your cluster.
+# Example: "nvidia.com/mlnxnics" for Mellanox ConnectX NICs
 networkResourceNames:
   - "nvidia.com/mlnxnics"

As per coding guidelines: "Include examples for non-obvious configurations in Helm chart documentation."

@XRFXLP
Member Author

XRFXLP commented Feb 6, 2026

Do we also have a way to run this check without a BW test? I.e., I just want to see if the NCCL loopback passes and I don't care about bandwidth?

A simple way to do this would be to keep the threshold very low. Are you thinking of some feature flag?

I was thinking of a flag like `--no-bandwidth-check`, or an equivalent env var

Added env var for this, updated PR with test description.
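For illustration, the skip could be wired into the chart roughly like this (the `env` key layout is an assumption for this sketch, not the PR's exact values.yaml schema):

```yaml
# Hypothetical fragment: enable loopback-only mode for the NCCL preflight check.
preflightChecks:
  - name: preflight-nccl-loopback
    image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:latest
    env:
      # Accepted truthy values are those of Go's strconv.ParseBool:
      # "1", "t", "T", "true", "TRUE", "True".
      SKIP_BANDWIDTH_CHECK: "true"
```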

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Line 116: Update the inline comment that currently reads "If using PCIe
GPU-interconnect, this threshold should be set to ~15 GB" to include the correct
rate unit so it reads "~15 GB/s"; edit the comment near the PCIe
GPU-interconnect threshold in values.yaml (the comment containing "If using PCIe
GPU-interconnect") to make the unit consistent with the other inline comment
("GB/s").
🧹 Nitpick comments (1)
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)

110-111: Consider documenting the minimum driver requirement for this image.

The PR description notes the NCCL loopback image requires NVIDIA driver R570+. This is a non-obvious operational constraint that users could easily miss. As per coding guidelines, non-obvious configurations should include explanatory comments/examples.

Proposed addition
   - name: preflight-nccl-loopback
+    # Requires NVIDIA driver R570+ on the host
     image: ghcr.io/nvidia/nvsentinel/preflight-nccl-loopback:latest

Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
@lalitadithya lalitadithya merged commit b7000ff into NVIDIA:main Feb 6, 2026
41 checks passed

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Add preflight check targeting NCCL loopback test

2 participants