feat: Add dcgm diagnostics as a preflight check #772
lalitadithya merged 33 commits into NVIDIA:main
Conversation
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
📝 Walkthrough
Replaces the preflight
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Pod as Pod
    participant Webhook as Webhook Injector
    participant Init as preflight-dcgm-diag Init Container
    participant NVML as NVML
    participant DCGM as DCGM Hostengine
    participant Reporter as HealthReporter
    participant Connector as Platform Connector (gRPC)
    Pod->>Webhook: creation webhook invoked
    Webhook->>Webhook: evaluate DCGM config & connector socket
    Webhook->>Pod: inject init container preflight-dcgm-diag + env vars
    Webhook->>Pod: add nvsentinel-socket hostPath volume (if configured)
    Init->>NVML: discover GPUs
    NVML-->>Init: GPU indices & UUIDs
    Init->>DCGM: connect and run diagnostics
    DCGM-->>Init: diagnostic results
    Init->>Init: parse results -> DiagResult list
    Init->>Reporter: build HealthEvents
    Reporter->>Connector: send HealthEvents via unix-socket gRPC
    Connector-->>Reporter: ack
    Init->>Pod: exit (0 success / 1 failure)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
preflight/go.mod (1)
3-3: ⚠️ Potential issue | 🟡 Minor
Update Go version to the latest stable release. Go 1.25.0 is outdated; the current stable release is Go 1.25.6. Update the `go` directive to `go 1.25.6` to stay current with security updates and bug fixes, aligning with the guidance to keep Go versions up-to-date.
🤖 Fix all issues with AI agents
In
`@distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/_helpers.tpl`:
- Around line 45-51: The helper gpu-health-monitor.dcgmEnabled uses a partial
nil check (.Values.global) but then accesses .Values.global.dcgm.enabled
directly which can panic if global.dcgm is missing; update the conditional to
mirror the dcgmEndpoint/dcgmPort pattern by checking the full chain (e.g., if
.Values.global and .Values.global.dcgm) and then use .Values.global.dcgm.enabled
| default .Values.dcgm.dcgmK8sServiceEnabled so the fallback
(.Values.dcgm.dcgmK8sServiceEnabled) is used safely when global.dcgm or its
enabled key is absent.
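Since Helm charts render through Go's `text/template`, the guarded chain this comment asks for can be exercised outside Helm. A minimal sketch, assuming illustrative value maps (the `renderEnabled` helper is not from the chart; note that, like sprig's `default`, the `else` fallback here also triggers when `enabled` is explicitly `false`):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderEnabled mirrors the suggested gpu-health-monitor.dcgmEnabled
// helper: dereference global.dcgm.enabled only when the whole chain
// exists (relying on short-circuit `and`, Go 1.18+), otherwise fall
// back to dcgm.dcgmK8sServiceEnabled.
func renderEnabled(values map[string]interface{}) string {
	const tpl = `{{- if and .Values.global .Values.global.dcgm .Values.global.dcgm.enabled -}}` +
		`true{{- else -}}{{ .Values.dcgm.dcgmK8sServiceEnabled }}{{- end -}}`
	t := template.Must(template.New("dcgmEnabled").Parse(tpl))

	var buf bytes.Buffer
	if err := t.Execute(&buf, map[string]interface{}{"Values": values}); err != nil {
		return "error: " + err.Error()
	}
	return buf.String()
}

func main() {
	// global.dcgm missing entirely: no panic, fallback wins.
	fmt.Println(renderEnabled(map[string]interface{}{
		"dcgm": map[string]interface{}{"dcgmK8sServiceEnabled": true},
	})) // true

	// Full chain present and enabled.
	fmt.Println(renderEnabled(map[string]interface{}{
		"global": map[string]interface{}{"dcgm": map[string]interface{}{"enabled": true}},
		"dcgm":   map[string]interface{}{"dcgmK8sServiceEnabled": false},
	})) // true
}
```

The unguarded form `.Values.global.dcgm.enabled` would fail at render time when `global.dcgm` is absent; the `and` chain makes each dereference conditional on the previous one.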
In `@distros/kubernetes/nvsentinel/charts/preflight/templates/_helpers.tpl`:
- Around line 102-111: The helper preflight.dcgmHostengineAddr currently returns
an empty string when .Values.global.dcgm.service is missing; change it to mirror
gpu-health-monitor.dcgmAddr by falling back to the local .Values.dcgm.service
values: if .Values.global.dcgm.service exists use
.Values.global.dcgm.service.endpoint and .port, otherwise use
.Values.dcgm.service.endpoint and .Values.dcgm.service.port; update the
conditional in the define "preflight.dcgmHostengineAddr" to check
.Values.global.dcgm.service and select the appropriate endpoint:port string
accordingly so it never returns an empty string when local dcgm.service is
provided.
In `@distros/kubernetes/nvsentinel/charts/preflight/values.yaml`:
- Around line 29-32: Replace the development image and undocumented fields: set
image.repository to the official NVIDIA registry
(ghcr.io/nvidia/nvsentinel/preflight) instead of xrfxlp/preflight, change
image.tag to use the chart’s release/version pattern (e.g., the same value used
by other charts such as the global image tag or Chart.AppVersion) instead of
10420202, and keep pullPolicy as IfNotPresent; also add short inline comments
for the image.repository, image.pullPolicy, and image.tag fields mirroring the
style and content used in the global configuration block above so each field is
documented.
In `@preflight-checks/dcgm-diag/Dockerfile`:
- Around line 44-49: The Dockerfile copies only go.mod for the dcgm-diag module
but then runs RUN go mod download in WORKDIR
/workspace/preflight-checks/dcgm-diag; update the COPY instruction that
references the dcgm-diag module so it copies both go.mod and go.sum (similar to
how data-models copies go.mod and go.sum) to ensure go mod download can verify
checksums and produce reproducible builds—locate the COPY line that currently
says COPY preflight-checks/dcgm-diag/go.mod ./preflight-checks/dcgm-diag/ and
modify it to include go.sum as well.
- Line 19: Update the Dockerfile build ARG for Go to a current patched 1.25
release: change the ARG named GOLANG_VERSION used in the Dockerfile from 1.25.0
to 1.25.6 (or a newer 1.25.x patch) so the image uses Go 1.25.6+; ensure any
related build/test scripts or CI that reference GOLANG_VERSION are updated to
the same value to keep versions consistent.
In `@preflight-checks/dcgm-diag/gpu.go`:
- Line 15: Add a package-level godoc comment describing the purpose of this
package immediately above the package declaration for package main; create a
concise sentence or two that explains what the dcgm-diag/gpu package does (e.g.,
GPU preflight diagnostics using DCGM) so the package has proper documentation
and meets the coding guidelines.
In `@scripts/buildko.sh`:
- Line 55: The script currently only includes ./preflight in the go work/use and
ko build targets; add the missing dcgm-diag module by including
./preflight-checks/dcgm-diag alongside ./preflight in the `go work use`
directive (replace or extend the existing ./preflight token) and add
./preflight-checks/dcgm-diag to the list of targets passed to `ko build` (the
same place where ./preflight is listed) so the dcgm-diag binary is included in
the container build.
🧹 Nitpick comments (8)
preflight/pkg/config/config.go (2)
39-43: Add godoc comment for exported `DCGMConfig` type. Per Go conventions, exported types should have documentation comments describing their purpose and fields.

📝 Proposed documentation

```diff
+// DCGMConfig holds DCGM (Data Center GPU Manager) diagnostic configuration
+// for preflight health checks.
 type DCGMConfig struct {
+	// HostengineAddr is the address of the DCGM hostengine service (e.g., "nvidia-dcgm.gpu-operator.svc:5555")
 	HostengineAddr string `yaml:"hostengineAddr"`
+	// DiagLevel specifies the DCGM diagnostic level (1-4), defaults to 1
 	DiagLevel int `yaml:"diagLevel"`
+	// Timeout specifies the diagnostic timeout duration (e.g., "5m"), defaults to "5m"
 	Timeout string `yaml:"timeout"`
 }
```
60-66: Consider validating `Timeout` as a parseable duration. The `Timeout` field is set as a string but will likely be parsed as a `time.Duration` downstream. Validating it early in `Load` would surface configuration errors at startup rather than at runtime.

🛡️ Proposed validation

```diff
 if fileConfig.DCGM.Timeout == "" {
 	fileConfig.DCGM.Timeout = "5m"
 }
+if _, err := time.ParseDuration(fileConfig.DCGM.Timeout); err != nil {
+	return nil, fmt.Errorf("invalid DCGM timeout duration %q: %w", fileConfig.DCGM.Timeout, err)
+}
 return &Config{FileConfig: fileConfig}, nil
```

You'll also need to add `"time"` to the imports.

preflight-checks/dcgm-diag/main.go (2)
32-43: Remove redundant `os.Exit(0)` call. Go programs exit with status 0 by default when `main()` returns normally. The explicit `os.Exit(0)` is redundant and prevents deferred functions from executing (though none exist here currently).

♻️ Suggested simplification

```diff
 func main() {
 	slog.Info("Starting preflight dcgm-diag check", "version", version, "commit", commit, "date", date)

 	if err := run(); err != nil {
 		slog.Error("DCGM diagnostic check failed", "error", err)
 		os.Exit(1)
 	}

 	slog.Info("DCGM diagnostic check passed")
-	os.Exit(0)
 }
```
94-99: Consider moving logger configuration out of `parseConfig`. Setting the global logger as a side effect within `parseConfig` couples configuration parsing with logging setup. For better separation of concerns, consider returning the verbose flag and configuring the logger in `run()` or `main()`.
51-74: Goroutine may continue running after context timeout. When the context times out, `runDCGMDiag` returns but the goroutine calling `dcgm.RunDiag` continues executing. This is likely acceptable since DCGM diagnostics cannot be cancelled mid-execution, but be aware that the diagnostic process will complete in the background.

preflight-checks/dcgm-diag/Makefile (1)
41-41: Platform inconsistency between local and publish builds. `PLATFORMS` is set to `linux/amd64,linux/arm64` (line 41), but `docker-build` (line 85) hardcodes `--platform linux/amd64`. This is likely intentional for faster local builds, but consider adding a comment to clarify this design choice.

📝 Add clarifying comment

```diff
 .PHONY: docker-build
 docker-build: setup-buildx ## Build Docker image locally
 	@echo "Building Docker image for preflight-dcgm-diag (local)..."
 	$(if $(filter true,$(DISABLE_REGISTRY_CACHE)),@echo "Registry cache disabled for this build")
 	cd $(REPO_ROOT) && docker buildx build \
+		# Local builds use single platform for speed; docker-publish uses PLATFORMS
 		--platform linux/amd64 \
```

Also applies to: 85-85
preflight-checks/dcgm-diag/health.go (1)
47-61: Consider deduplicating GPU entities. If multiple test failures occur on the same GPU, the `entities` slice will contain duplicate UUIDs. Depending on the reporting API's expectations, this may cause issues or redundant data.

♻️ Deduplicate entities using a map

```diff
 func reportHealthEvent(connectorSocket string, results []dcgm.DiagResult, isFatal bool, message string) error {
 	reporter := newReporter(connectorSocket)

-	entities := make([]reporting.Entity, 0, len(results))
+	seen := make(map[uint]bool)
+	entities := make([]reporting.Entity, 0, len(results))

 	for _, r := range results {
+		if seen[r.EntityID] {
+			continue
+		}
+		seen[r.EntityID] = true
+
 		uuid, err := GetGPUUUID(r.EntityID)
 		if err != nil {
 			slog.Error("Failed to get GPU UUID for health event", "gpuIndex", r.EntityID, "error", err)
 			return err
 		}

 		entities = append(entities, reporting.Entity{
 			Type:  "GPU_UUID",
 			Value: uuid,
 		})
 	}
```
26-27: Global map lacks synchronization and clearing logic. The `gpuIndexToUUID` map is written in `getAllocatedGPUs` and read in `GetGPUUUID` without synchronization. If these are called concurrently, there's a data race. Additionally, the map is never cleared before repopulating, so repeated calls to `getAllocatedGPUs` could leave stale entries.

If this code path is guaranteed to be single-threaded and called only once, this is acceptable. Otherwise, consider using a `sync.RWMutex` and clearing the map at the start of `getAllocatedGPUs`.

♻️ Suggested fix with mutex protection

```diff
+import "sync"
+
+var gpuIndexToUUIDMu sync.RWMutex
 var gpuIndexToUUID = make(map[uint]string)
```

Then in `getAllocatedGPUs` (around line 66):

```go
gpuIndexToUUIDMu.Lock()
gpuIndexToUUID = make(map[uint]string) // clear stale entries
defer gpuIndexToUUIDMu.Unlock()
```

And in `GetGPUUUID`:

```go
gpuIndexToUUIDMu.RLock()
defer gpuIndexToUUIDMu.RUnlock()
```
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
distros/kubernetes/nvsentinel/charts/preflight/values.yaml (1)
20-83: ⚠️ Potential issue | 🟠 Major
Add inline docs for newly added values (and examples where non-obvious). Several new blocks lack inline comments (e.g., `serviceAccount`, `securityContext`, `service`, `resources`, probes, `nodeSelector`/`tolerations`/`affinity`, and `global.dcgm.service.port`). The chart guidelines require every value to be documented and examples for non-obvious settings.

Example style (apply broadly)

```diff
 global:
   dcgm:
     service:
       # DCGM hostengine service endpoint (GPU Operator's DCGM service)
       endpoint: "nvidia-dcgm.gpu-operator.svc"
+      # DCGM hostengine service port
       port: 5555
 @@
 serviceAccount:
+  # Create a dedicated ServiceAccount for the preflight webhook
   create: true
```

preflight/pkg/webhook/injector.go (1)
17-24: ⚠️ Potential issue | 🟠 Major
Derive socket hostPath from `ConnectorSocket` instead of hardcoding. Hardcoding `/var/run/nvsentinel` won't include the default socket (`/var/run/nvsentinel.sock`) or custom paths, so the init container may not see the socket. Parse the configured socket URI and mount its directory.

Suggested fix

```diff
 import (
 	"fmt"
 	"log/slog"
+	"path/filepath"
+	"strings"

 	"github.com/nvidia/nvsentinel/preflight/pkg/config"
 	corev1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/api/resource"
 )
 @@
-	// Only inject socket volume if connector socket is configured
-	if i.cfg.DCGM.ConnectorSocket == "" {
+	// Only inject socket volume if connector socket is configured
+	socketPath := strings.TrimPrefix(i.cfg.DCGM.ConnectorSocket, "unix://")
+	if socketPath == "" {
 		return patches
 	}
 @@
-	hostPathType := corev1.HostPathDirectory
+	hostPathType := corev1.HostPathDirectoryOrCreate
 	socketVolume := corev1.Volume{
 		Name: nvsentinelSocketVolumeName,
 		VolumeSource: corev1.VolumeSource{
 			HostPath: &corev1.HostPathVolumeSource{
-				Path: "/var/run/nvsentinel",
+				Path: filepath.Dir(socketPath),
 				Type: &hostPathType,
 			},
 		},
 	}
```

Also applies to: 164-184
🤖 Fix all issues with AI agents
In `@preflight-checks/dcgm-diag/Dockerfile`:
- Around line 18-35: The Dockerfile downloads the Go tarball without verifying
integrity; add an ARG GOLANG_SHA256 and replace the current RUN that uses
wget+tar with a download-and-verify sequence that checks the SHA256 before
extraction: declare ARG GOLANG_SHA256, download
go${GOLANG_VERSION}.linux-amd64.tar.gz (using wget or curl) and then verify it
with sha256sum -c (e.g. echo "$GOLANG_SHA256
go${GOLANG_VERSION}.linux-amd64.tar.gz" | sha256sum -c -) and fail the build on
mismatch, only then tar -C /usr/local -xzf the verified archive and remove the
artifacts; update the RUN that currently contains wget/tar to perform these
steps and reference ARG GOLANG_VERSION and the new ARG GOLANG_SHA256.
In `@preflight-checks/dcgm-diag/pkg/gpu/gpu.go`:
- Around line 18-25: The indexToUUID map is written during discovery and read by
GetUUID without synchronization, causing races and stale data; protect all
accesses by adding a package-level sync.RWMutex (or mutex) and use a write lock
when populating/clearing indexToUUID in the discovery code that populates the
map and a read lock inside GetUUID (and any other readers) to ensure consistent
reads; also ensure the discovery path clears or replaces the map while holding
the write lock so stale entries are removed atomically.
In `@preflight-checks/dcgm-diag/pkg/health/health.go`:
- Around line 66-77: The loop building gpuUUIDs from results using gpu.GetUUID
may append the same UUID multiple times when multiple tests fail on one GPU;
update the logic in health.go (the block that iterates over results and
populates gpuUUIDs) to deduplicate UUIDs before returning/reporting by using a
temporary set/map keyed by UUID (or checking for existence before append) so
only unique GPU UUIDs are added to the gpuUUIDs slice; ensure the rest of the
code that consumes gpuUUIDs continues to receive a slice of unique UUID strings.
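The deduplication being asked for can be sketched with a set-style map; the `uniqueUUIDs` helper below is illustrative, not code from the PR:

```go
package main

import "fmt"

// uniqueUUIDs removes duplicate UUIDs while preserving first-seen
// order, so one GPU failing several tests is reported only once.
func uniqueUUIDs(uuids []string) []string {
	seen := make(map[string]bool, len(uuids))
	out := make([]string, 0, len(uuids))
	for _, u := range uuids {
		if seen[u] {
			continue
		}
		seen[u] = true
		out = append(out, u)
	}
	return out
}

func main() {
	fmt.Println(uniqueUUIDs([]string{"GPU-a", "GPU-b", "GPU-a"})) // [GPU-a GPU-b]
}
```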
- Around line 115-128: The HealthEvent currently sets IsHealthy using IsHealthy:
!isFatal which wrongly marks warnings as healthy; change the constructor to
accept an explicit isHealthy bool (or pass an existing isHealthy flag) and set
IsHealthy: isHealthy in the pb.HealthEvent literal (keeping IsFatal: isFatal),
and update all callers that construct this event (locations invoking the
function/section that sets event := &pb.HealthEvent{...}, using
agentName/isFatal) to supply the correct isHealthy value so warnings become
IsFatal=false, IsHealthy=false while healthy remains IsFatal=false,
IsHealthy=true.
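A minimal sketch of the explicit-flag constructor the prompt describes, using a stand-in struct rather than the generated `pb.HealthEvent`:

```go
package main

import "fmt"

// HealthEvent is a stand-in for the generated pb.HealthEvent, modeling
// only the two flags under discussion.
type HealthEvent struct {
	IsFatal   bool
	IsHealthy bool
}

// newHealthEvent takes both flags explicitly instead of deriving
// IsHealthy from !isFatal, so a warning can be unhealthy yet non-fatal.
func newHealthEvent(isFatal, isHealthy bool) *HealthEvent {
	return &HealthEvent{IsFatal: isFatal, IsHealthy: isHealthy}
}

func main() {
	warning := newHealthEvent(false, false) // warning: non-fatal, not healthy
	fatal := newHealthEvent(true, false)    // failure: fatal, not healthy
	fmt.Printf("warning: fatal=%v healthy=%v\n", warning.IsFatal, warning.IsHealthy)
	fmt.Printf("fatal:   fatal=%v healthy=%v\n", fatal.IsFatal, fatal.IsHealthy)
}
```

With `IsHealthy: !isFatal`, the warning case above would incorrectly come out as `healthy=true`.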
🧹 Nitpick comments (7)
preflight/pkg/config/config.go (1)
39-44: Add godoc for exported `DCGMConfig`. golint expects exported identifiers to have a leading comment; this new type is missing one.

Suggested fix

```diff
+// DCGMConfig holds DCGM diagnostics configuration loaded from YAML.
 type DCGMConfig struct {
 	HostengineAddr  string `yaml:"hostengineAddr"`
 	DiagLevel       int    `yaml:"diagLevel"`
 	Timeout         string `yaml:"timeout"`
 	ConnectorSocket string `yaml:"connectorSocket"`
 }
```

preflight-checks/dcgm-diag/main.go (3)
35-46: Redundant `os.Exit(0)` call. After a successful `run()` returns `nil`, the program will naturally exit with code 0. The explicit `os.Exit(0)` on line 45 is unnecessary and prevents any deferred functions in `main()` from executing (though there are none currently).

🧹 Proposed cleanup

```diff
 func main() {
 	slog.Info("Starting preflight dcgm-diag check", "version", version, "commit", commit, "date", date)

 	if err := run(); err != nil {
 		slog.Error("DCGM diagnostic check failed", "error", err)
 		os.Exit(1)
 	}

 	slog.Info("DCGM diagnostic check passed")
-	os.Exit(0)
 }
```
104-113: Consider using `strconv.Atoi` and logging parse failures. `fmt.Sscanf` is unconventional for simple integer parsing. The silent fallback on parse errors could mask configuration issues in production.

♻️ Proposed improvement

```diff
+import "strconv"
+
 func getEnvInt(key string, defaultValue int) int {
 	if value := os.Getenv(key); value != "" {
-		var result int
-		if _, err := fmt.Sscanf(value, "%d", &result); err == nil {
+		result, err := strconv.Atoi(value)
+		if err == nil {
 			return result
 		}
+		slog.Warn("Invalid integer value for environment variable, using default",
+			"key", key, "value", value, "default", defaultValue, "error", err)
 	}

 	return defaultValue
 }
```
115-123: Consider logging parse failures for duration environment variables. Similar to integer parsing, silently falling back on invalid duration strings could hide misconfiguration.

♻️ Proposed improvement

```diff
 func getEnvDuration(key string, defaultValue time.Duration) time.Duration {
 	if value := os.Getenv(key); value != "" {
 		if d, err := time.ParseDuration(value); err == nil {
 			return d
 		}
+		slog.Warn("Invalid duration value for environment variable, using default",
+			"key", key, "value", value, "default", defaultValue)
 	}

 	return defaultValue
 }
```

preflight-checks/dcgm-diag/pkg/diag/diag.go (2)
34-79: Goroutine may leak on context cancellation. When the context times out (line 70-71), the goroutine running `dcgm.RunDiag` continues executing and will eventually send to `resultCh` or `errCh`. Since these are buffered channels of size 1, this won't block, but the DCGM resources may not be properly cleaned up if the diagnostic is still running.

Consider documenting this behavior or investigating whether `dcgm.RunDiag` can be cancelled. The current implementation is acceptable for a preflight check that exits immediately after.

📝 Suggested documentation improvement

```diff
 // Run executes DCGM diagnostics using the go-dcgm bindings.
 //
 // Note: go-dcgm requires CGO and links against libdcgm.so at compile time.
 // The binary must be built with DCGM 4.2.3+ which introduced dcgmDiagResponse_version12.
+//
+// On context timeout, the diagnostic goroutine may continue running until completion.
+// The process is expected to exit shortly after, releasing all resources.
 func Run(ctx context.Context, level int, hostengineAddr string) (*dcgm.DiagResults, error) {
```
182-210: Minor code duplication between `formatFailures` and `formatWarnings`. These functions are nearly identical, differing only in the fallback suffix. Consider extracting a shared helper.

♻️ Optional refactor to reduce duplication

```diff
-func formatFailures(failures []dcgm.DiagResult) string {
+func formatResults(results []dcgm.DiagResult, fallbackSuffix string) string {
 	var parts []string

-	for _, f := range failures {
-		msg := fmt.Sprintf("%s (GPU %d): %s", f.TestName, f.EntityID, f.ErrorMessage)
-		if f.ErrorMessage == "" {
-			msg = fmt.Sprintf("%s (GPU %d): failed", f.TestName, f.EntityID)
+	for _, r := range results {
+		msg := fmt.Sprintf("%s (GPU %d): %s", r.TestName, r.EntityID, r.ErrorMessage)
+		if r.ErrorMessage == "" {
+			msg = fmt.Sprintf("%s (GPU %d): %s", r.TestName, r.EntityID, fallbackSuffix)
 		}

 		parts = append(parts, msg)
 	}

 	return strings.Join(parts, "; ")
 }
-
-func formatWarnings(warnings []dcgm.DiagResult) string {
-	var parts []string
-
-	for _, w := range warnings {
-		msg := fmt.Sprintf("%s (GPU %d): %s", w.TestName, w.EntityID, w.ErrorMessage)
-		if w.ErrorMessage == "" {
-			msg = fmt.Sprintf("%s (GPU %d): warning", w.TestName, w.EntityID)
-		}
-
-		parts = append(parts, msg)
-	}
-
-	return strings.Join(parts, "; ")
-}
```

Then update callers:

```go
msg := formatResults(failures, "failed")
msg := formatResults(warnings, "warning")
```

preflight-checks/dcgm-diag/pkg/health/health.go (1)
42-56: Consider returning an error from `ReportError` for consistency. `ReportError` silently logs and discards the error from `sendHealthEvent`, while `ReportEvent` returns the error. This inconsistency may confuse maintainers. Even if the caller ignores the error, returning it provides a consistent API.

♻️ Proposed consistency improvement

```diff
 // ReportError reports a fatal error without specific GPU entities.
-func ReportError(connectorSocket, message string) {
+func ReportError(connectorSocket, message string) error {
 	if connectorSocket == "" {
 		slog.Info("Skipping health event reporting (no connector socket configured)")
-		return
+		return nil
 	}

 	slog.Info("Reporting error health event", "socket", connectorSocket, "message", message)

 	if err := sendHealthEvent(connectorSocket, nil, true, message); err != nil {
 		slog.Warn("Failed to report health event", "error", err)
+		return err
 	}
+
+	return nil
 }
```

Then in `main.go`, the caller can choose to ignore the returned error:

```go
_ = health.ReportError(cfg.connectorSocket, err.Error())
```
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@preflight-checks/dcgm-diag/pkg/diag/diag.go`:
- Around line 30-78: The select using ctx.Done() races with deferred cleanup
because dcgm.RunDiag (called inside the goroutine) is not cancellable; update
Run to either remove the ctx timeout handling entirely or ensure the goroutine
finishes before returning: for example, replace the select with a synchronous
receive from resultCh/errCh (removing ctx.Done) or add a done/wait mechanism
(sync.WaitGroup or a separate done channel that the goroutine closes after
sending to resultCh/errCh) so that when ctx.Done() fires you still wait for the
goroutine to complete before executing deferred cleanup (refer to Run, initDCGM,
dcgm.RunDiag, resultCh, errCh, groupCleanup, and cleanup).
In `@preflight-checks/dcgm-diag/pkg/health/health.go`:
- Around line 73-90: The code compares errors directly with io.EOF (e.g., err ==
io.EOF) which fails for wrapped errors; update that comparison to use
errors.Is(err, io.EOF) so wrapped io.EOF is detected, and add the errors import
if missing. Locate the logic that handles the sendToConnector result inside the
wait.ExponentialBackoff retry block (and the subsequent status.FromError(err)
handling) and replace the direct equality check with errors.Is to correctly
unwrap wrapped errors.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In
`@distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/_helpers.tpl`:
- Around line 42-51: The helper template "gpu-health-monitor.dcgmEnabled"
currently uses default which treats false as empty and re-enables DCGM; update
the inner logic to check for the presence of the explicit key instead of using
default: when .Values.global and .Values.global.dcgm exist use hasKey on
.Values.global.dcgm for "enabled" and if the key exists return
.Values.global.dcgm.enabled, otherwise fall back to
.Values.dcgm.dcgmK8sServiceEnabled; ensure this change targets the define
"gpu-health-monitor.dcgmEnabled" and the keys .Values.global.dcgm.enabled and
.Values.dcgm.dcgmK8sServiceEnabled so explicit false is preserved.
In `@distros/kubernetes/nvsentinel/charts/preflight/templates/_helpers.tpl`:
- Around line 142-145: The helper preflight.connectorSocket currently accesses
.Values.global.socketPath directly which can panic; update the template to first
check whether .Values.global and .Values.global.socketPath are defined (e.g.,
use an if with and to test .Values.global and .Values.global.socketPath), and
only then use that value; otherwise use the default "/var/run/nvsentinel.sock"
and return the formatted "unix://%s" string—mirror the defensive pattern used in
preflight.dcgmEndpoint/preflight.dcgmPort so nested property access is guarded
before applying defaults.
In `@preflight-checks/dcgm-diag/pkg/health/health.go`:
- Around line 170-172: The current check in health.go only tests whether err
implements interface{ Temporary() bool } but never calls Temporary(), so
non-temporary errors that implement it will be retried; update the logic to
perform the type assertion to a variable (e.g., temp, ok := err.(interface{
Temporary() bool })) and then call temp.Temporary() and only return true (or
trigger a retry) when ok is true AND temp.Temporary() == true; ensure you
reference the existing err variable and replace the current if block that uses
the anonymous type assertion.
In `@preflight/pkg/webhook/injector.go`:
- Around line 159-200: The injectVolumes function currently hardcodes
"/var/run/nvsentinel"; update it to derive the host path from
i.cfg.DCGM.ConnectorSocket (strip "unix://" if present, use filepath.Dir to get
the directory) and set hostPathType to corev1.HostPathDirectoryOrCreate so the
directory is created if missing; adjust the socketVolume construction in
injectVolumes to use that derived path and type, keep the existing checks for
empty ConnectorSocket and existing volume, and add imports for "strings" and
"path/filepath".
🧹 Nitpick comments (1)
preflight/pkg/webhook/injector.go (1)
203-207: Avoid hard-coding the DCGM init-container name. If users override the init container name in config/values, env injection will silently skip. Consider making the target name configurable or validating it during config load.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@preflight-checks/dcgm-diag/dcgm_diag/__main__.py`:
- Around line 76-86: The code logs "DCGM diagnostic check passed"
unconditionally even when the warnings branch ran; update the control flow so
that when warnings is non-empty you emit a warning-level completion log (e.g.,
use log.warning with a message like "DCGM diagnostic completed with warnings")
and only log the "passed" info-level message in the else branch (no warnings).
Locate the warnings handling block (the list comprehension for uuids, the
message variable and reporter.send_event calls) and change the final log call
there from log.info to a warning-level log; keep the existing log.info("DCGM
diagnostic check passed") only in the no-warnings (else) branch so logs
accurately reflect health.
- Around line 51-56: The exception handler for diag.run currently reports the
diagnostic failure as non-fatal; update the handler so that when diag.run raises
an Exception you log the error and call reporter.send_event with
is_healthy=False and is_fatal=True (preserving the message from the exception),
then exit with sys.exit(1); this change affects the except block that catches
Exception as e, the log.error call, and the reporter.send_event invocation.
In `@preflight-checks/dcgm-diag/dcgm_diag/config.py`:
- Line 31: The code currently does diag_level = int(os.getenv("DCGM_DIAG_LEVEL",
"1")) which will raise an unhandled ValueError if DCGM_DIAG_LEVEL is
non-numeric; wrap the int conversion in a try/except that catches ValueError,
read the raw env value via os.getenv("DCGM_DIAG_LEVEL"), and either fallback to
a safe default (e.g. 1) or raise a new ValueError with a clear message including
the offending value (and reference to DCGM_DIAG_LEVEL) so callers see what
failed; update the diag_level assignment in config.py accordingly and use the
same variable name diag_level in the fix.
In `@preflight-checks/dcgm-diag/dcgm_diag/health.py`:
- Around line 61-62: The send_event path currently logs on failure but silently
continues; change send_event so that if self._send_with_retries(health_events)
returns False it raises a specific exception (e.g., HealthEventDeliveryError)
instead of only calling log.error, and define that exception class
(HealthEventDeliveryError) in the module; update callers of send_event (or
document callers of send_event/_send_with_retries) to catch or propagate this
exception so preflight can fail fast. Ensure the raised exception message
includes MAX_RETRIES and context about the health_events for debugging.
In `@preflight-checks/dcgm-diag/Dockerfile`:
- Around line 43-54: Remove the line that writes "Acquire::https::Verify-Peer
\"false\";" to /etc/apt/apt.conf.d/99disable-cert-check and ensure HTTPS
verification remains enabled; instead run apt-get update first and then install
ca-certificates (and other packages) so apt uses system CAs for TLS.
Specifically delete the RUN that creates 99disable-cert-check and reorder the
block around the apt-get update + apt-get install sequence in the same
Dockerfile RUN (referencing the echo to /etc/apt/apt.conf.d/99disable-cert-check
and the apt-get install invocation) so ca-certificates is installed/used
normally rather than disabling verification.
In `@preflight-checks/dcgm-diag/Makefile`:
- Around line 24-27: Add an explicit all target to the Makefile to satisfy
checkmake by creating an alias target named all that depends on the existing
lint-test (so default make behavior is preserved); update the Makefile near the
existing .PHONY and lint-test target (reference: .PHONY and lint-test) to
include an all target that simply delegates to lint-test.
In `@preflight-checks/dcgm-diag/pyproject.toml`:
- Around line 10-11: The pyproject dependency version for grpcio is too low
compared to the generated module's runtime check in health_event_pb2_grpc.py;
update the grpc-related pins in pyproject.toml (grpcio and grpcio-tools) to at
least 1.75.1 (or match the minimum checked by health_event_pb2_grpc.py) so the
import-time version check passes, and run a quick install and test import of
health_event_pb2_grpc to verify resolution.
🧹 Nitpick comments (5)
preflight-checks/dcgm-diag/dcgm_diag/gpu.py (2)
30-31: Consider raising `KeyError` or returning `Optional[str]` for invalid GPU index. `get_uuid` silently returns an empty string for an invalid index, which could mask bugs if callers don't check for empty strings. Consider either raising `KeyError` for invalid indices or changing the return type to `str | None` to make the failure case explicit.

Option A: Raise on invalid index

```diff
 def get_uuid(self, index: int) -> str:
-    return self._index_to_uuid.get(index, "")
+    if index not in self._index_to_uuid:
+        raise KeyError(f"GPU index {index} not found")
+    return self._index_to_uuid[index]
```

Option B: Return Optional

```diff
-def get_uuid(self, index: int) -> str:
-    return self._index_to_uuid.get(index, "")
+def get_uuid(self, index: int) -> str | None:
+    return self._index_to_uuid.get(index)
```
47-49: Error message mismatch between log and exception. The log says "Failed to discover GPUs" but the raised exception says "NVML initialization failed". This could be confusing during debugging since discovery failure could happen after successful init (e.g., during `nvmlDeviceGetCount` or `nvmlDeviceGetHandleByIndex`).

Align error messages

```diff
 except pynvml.NVMLError as e:
-    log.error(f"Failed to discover GPUs: {e}")
-    raise RuntimeError(f"NVML initialization failed: {e}")
+    log.error(f"GPU discovery failed: {e}")
+    raise RuntimeError(f"GPU discovery failed: {e}")
```

preflight-checks/dcgm-diag/dcgm_diag/health.py (1)
96-107: Consider adding a timeout to the gRPC call.

The `stub.HealthEventOccurredV1` call has no timeout, which could cause the preflight check to hang indefinitely if the platform connector is unresponsive. gRPC calls support a `timeout` parameter.

Add timeout to gRPC call

```diff
+GRPC_TIMEOUT = 30.0  # seconds
+
+# ... in _send_with_retries:
 with grpc.insecure_channel(f"unix://{self._socket_path}") as channel:
     stub = pb_grpc.PlatformConnectorStub(channel)
-    stub.HealthEventOccurredV1(health_events)
+    stub.HealthEventOccurredV1(health_events, timeout=GRPC_TIMEOUT)
 log.info("Health event sent successfully")
```

preflight-checks/dcgm-diag/dcgm_diag/diag.py (2)
86-86: Silent fallback to `DCGM_DIAG_LVL_SHORT` for invalid levels.

If an invalid level is passed (outside 1-4), the code silently defaults to `DCGM_DIAG_LVL_SHORT`. This could mask configuration errors. Since `Config.from_env` already validates the level, this is defensive, but a warning log would help debugging.

Add warning for fallback

```diff
-diag_level = self.DIAG_LEVELS.get(level, dcgm_structs.DCGM_DIAG_LVL_SHORT)
+diag_level = self.DIAG_LEVELS.get(level)
+if diag_level is None:
+    log.warning(f"Invalid diagnostic level {level}, defaulting to SHORT")
+    diag_level = dcgm_structs.DCGM_DIAG_LVL_SHORT
```
102-102: Add type hints for `response` and `value` parameters.

Per coding guidelines, type hints are required for all functions. The `response` parameter in `_parse_response` and the `value` parameter in `_decode_string` are missing type annotations.

Add type hints

```diff
-    def _parse_response(self, response, gpu_indices: list[int]) -> list[DiagResult]:
+    def _parse_response(self, response: dcgm_structs.c_dcgmDiagResponse_v12, gpu_indices: list[int]) -> list[DiagResult]:

     @staticmethod
-    def _decode_string(value) -> str:
+    def _decode_string(value: bytes | str | None) -> str:
```

Also applies to: 144-145
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@preflight-checks/dcgm-diag/dcgm_diag/diag.py`:
- Around line 43-46: Add an explicit return-type annotation "-> None" to the
constructor: update the __init__ definition in diag.py (the __init__ that sets
self._hostengine_addr, self._handle and self._gpu_discovery) to declare a None
return type (e.g., def __init__(self, hostengine_addr: str = "") -> None:) so it
satisfies the project’s type-hint requirement.
- Around line 146-152: The _decode_string method lacks a type hint for its
parameter; update the signature of _decode_string to accept Any (def
_decode_string(value: Any) -> str) and ensure typing.Any is imported (add "from
typing import Any" if missing) so the function handles bytes, str, or other
values with proper type annotations.
- Around line 113-132: error_lookup is keyed by err.testId but the lookup uses
the loop index test_idx; change the lookup to use the actual test id from the
result (entity_result.testId) so errors map correctly to results. Specifically,
in the block iterating results (where entity_result is read and gpu_idx set)
replace the key (test_idx, gpu_idx) with (entity_result.testId, gpu_idx) when
calling error_lookup.get so error_msg is retrieved using the true test
identifier.
- Around line 102-103: The _parse_response function is missing a type annotation
for the response parameter; update its signature to include an appropriate type
(e.g., response: dict[str, Any] or response: Mapping[str, Any]) and add the
necessary typing import (Any or Mapping) at the top of the module so
_parse_response(self, response: dict[str, Any], gpu_indices: list[int]) ->
list[DiagResult] (or with Mapping) is fully typed and satisfies the project's
type-hint rule.
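The `error_lookup` keying bug called out above can be reduced to a self-contained sketch; the dataclasses below are simplified stand-ins for the DCGM response structs, not the real types:

```python
from dataclasses import dataclass


@dataclass
class DiagError:
    """Stand-in for a DCGM diagnostic error entry."""
    testId: int
    gpuId: int
    msg: str


@dataclass
class EntityResult:
    """Stand-in for a per-GPU, per-test diagnostic result."""
    testId: int
    gpuId: int


def attach_errors(results: list[EntityResult], errors: list[DiagError]) -> list[str]:
    """Map each result to its error message, keyed by the result's own testId."""
    error_lookup = {(e.testId, e.gpuId): e.msg for e in errors}
    # Key by the result's testId, not the enumeration index: test ids are
    # not guaranteed to line up with loop positions in the response.
    return [error_lookup.get((r.testId, r.gpuId), "") for r in results]
```

If the lookup were keyed by the loop index instead, any result whose `testId` differs from its position would silently get an empty error message.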
🧹 Nitpick comments (3)
preflight-checks/dcgm-diag/Makefile (1)
61-69: Add `.PHONY` declarations for `docker-build` and `docker-publish`. These targets don't produce files with those names, so they should be declared as phony to ensure they always run when invoked.
Suggested fix
```diff
 .PHONY: setup
 setup:
 	@echo "Setting up Poetry environment for $(MODULE_NAME)..."
 	poetry config virtualenvs.in-project true
 	poetry install

+.PHONY: docker-build
 docker-build: setup-buildx
 	@echo "Building Docker image for $(MODULE_NAME)..."
```

And similarly for `docker-publish`:

```diff
+.PHONY: docker-publish
 docker-publish: setup-buildx
 	@echo "Publishing Docker image for $(MODULE_NAME)..."
```

preflight-checks/dcgm-diag/dcgm_diag/diag.py (2)
35-41: Annotate `DIAG_LEVELS` as `ClassVar` to clarify intent.

This avoids instance-level mutation ambiguity and aligns with typing best practices.

♻️ Suggested update

```diff
+from typing import ClassVar
 ...
 class DCGMDiagnostic:
-    DIAG_LEVELS = {
+    DIAG_LEVELS: ClassVar[dict[int, int]] = {
     }
```
69-75: Avoid catching broad `Exception` on shutdown.
Please narrow this to the specific DCGM/pydcgm exception type your version emits to avoid masking unexpected errors.
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@preflight-checks/dcgm-diag/dcgm_diag/health.py`:
- Around line 36-41: The __init__ method of HealthReporter is missing an
explicit return type; update the constructor signature for
HealthReporter.__init__ to include "-> None" (i.e., make it def __init__(...,
processing_strategy: pb.ProcessingStrategy) -> None:) so it matches the return
type annotations used by other methods in the class and satisfies the project's
typing consistency.
- Around line 26-28: The retry loop uses stub.HealthEventOccurredV1() without a
deadline so hung RPCs block retries; add a bounded timeout (e.g., RPC_TIMEOUT)
and pass it to the gRPC call (deadline/timeout param) and make it configurable
alongside MAX_RETRIES/INITIAL_DELAY/BACKOFF_FACTOR; update the call site
(stub.HealthEventOccurredV1(..., timeout=RPC_TIMEOUT) or the appropriate gRPC
deadline API used in this codebase) so the RPC raises on timeout and the
existing exception handling and backoff logic can proceed.
In `@preflight-checks/dcgm-diag/dcgm_diag/tests/conftest.py`:
- Around line 20-32: Add explicit type hints for the pytest fixtures: annotate
the monkeypatch parameter as pytest.MonkeyPatch and the fixture return types as
None for both clean_env and valid_env, and annotate the clean_env parameter on
valid_env as None; ensure pytest is imported for the MonkeyPatch type (e.g.,
import pytest) so functions read def clean_env(monkeypatch: pytest.MonkeyPatch)
-> None: and def valid_env(monkeypatch: pytest.MonkeyPatch, clean_env: None) ->
None:.
In `@preflight-checks/dcgm-diag/dcgm_diag/tests/test_config.py`:
- Around line 15-18: Add missing type hints to every test function/method in
test_config.py: annotate all parameters (e.g., fixtures) with their expected
types and give each test a return type of None; update method signatures inside
the test class (and top-level test functions) such as any test_* functions in
test_config.py to include parameter and return annotations, and import required
typing names (e.g., Callable, Any) if needed. Ensure fixtures referenced in
signatures use their proper types (or typing.Any) and that all test methods end
with -> None so the module conforms to the coding guidelines.
In `@preflight-checks/dcgm-diag/dcgm_diag/tests/test_health.py`:
- Around line 24-95: Tests and fixture lack type annotations; add appropriate
typing to the reporter fixture and each test function (including methods in
TestSendWithRetries and TestSendEvent) to satisfy project requirements: annotate
the reporter fixture to return HealthReporter, annotate test functions'
parameters (e.g., mock objects like mock_channel, mock_sleep, mock_send and the
reporter param) with types from unittest.mock (MagicMock/Mock) or grpc/types
where applicable and mark return types as None; update signatures for
reporter(), test_success_first_attempt(...), test_retries_on_failure(...),
test_fails_after_max_retries(...), test_exponential_backoff(...), and
TestSendEvent.test_raises_on_failure(...) accordingly, importing typing names
(e.g., Mock, MagicMock) if needed.
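The bounded-timeout retry the first comment asks for can be sketched without a gRPC dependency; `send` stands in for `stub.HealthEventOccurredV1`, and the constants mirror the names the comment mentions (values here are illustrative):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_RETRIES = 3
INITIAL_DELAY = 0.01   # seconds (shortened for illustration)
BACKOFF_FACTOR = 2.0
RPC_TIMEOUT = 30.0     # per-attempt deadline passed to the RPC


def send_with_retries(send: Callable[[float], T]) -> T:
    """Call send(timeout); on failure, sleep with exponential backoff and retry."""
    delay = INITIAL_DELAY
    last_exc: Exception | None = None
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            # Each attempt carries its own deadline, so a hung RPC cannot
            # block the retry loop indefinitely.
            return send(RPC_TIMEOUT)
        except Exception as exc:  # in real code, narrow to grpc.RpcError
            last_exc = exc
            if attempt < MAX_RETRIES:
                time.sleep(delay)
                delay *= BACKOFF_FACTOR
    raise RuntimeError(f"all {MAX_RETRIES} attempts failed") from last_exc
```

Passing the timeout through the callable keeps the backoff logic testable with a plain function in place of the gRPC stub.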
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
/ok to test 4d8056a
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@preflight-checks/dcgm-diag/dcgm_diag/errors.py`:
- Around line 49-52: Strip CSV fields before validating and using them: assign
stripped_name = row[0].strip() and stripped_action = row[1].strip(), then check
that stripped_name is non-empty and stripped_action is a valid key of
pb.RecommendedAction (e.g., stripped_action in pb.RecommendedAction.keys())
before adding mapping[stripped_name] =
pb.RecommendedAction.Value(stripped_action); this prevents creating empty keys
and preserves mappings where only surrounding whitespace was present.
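The suggested sanitization can be sketched as follows; a plain dict stands in for the `pb.RecommendedAction` protobuf enum, and all names here are assumptions:

```python
import csv
import io

# Stand-in for pb.RecommendedAction's name -> value mapping.
RECOMMENDED_ACTIONS = {"RESTART_POD": 1, "REBOOT_NODE": 2}


def load_name_to_action(csv_text: str) -> dict[str, int]:
    """Parse error-name/action rows, stripping whitespace before validating."""
    mapping: dict[str, int] = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue
        name = row[0].strip()
        action = row[1].strip()
        # Skip empty names and unknown actions instead of inserting bad keys,
        # but keep rows where only surrounding whitespace was present.
        if name and action in RECOMMENDED_ACTIONS:
            mapping[name] = RECOMMENDED_ACTIONS[action]
    return mapping
```

Stripping before validation means a row like `" DCGM_FR_ECC_ERROR , RESTART_POD "` still maps correctly, while a row with an empty name or unrecognized action is dropped.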
🧹 Nitpick comments (1)
preflight-checks/dcgm-diag/dcgm_diag/errors.py (1)
30-42: Cache the loader functions to avoid repeated file I/O and reflection. Both loader functions are called transitively by public APIs; caching keeps repeated calls fast and avoids rereading the CSV.
♻️ Proposed refactor
```diff
+@lru_cache(maxsize=1)
 def _load_code_to_name() -> dict[int, str]:
     """Load DCGM error code → name mapping from dcgm_errors module."""
     try:
         import dcgm_errors
@@
         return {}

+@lru_cache(maxsize=1)
 def _load_name_to_action() -> dict[str, int]:
     """Load DCGM error name → action mapping from CSV file."""
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/lint-test.yml:
- Around line 153-154: The CI step currently invokes "make -C
preflight-checks/${{ matrix.component }} lint" which only runs the Black
formatter check; update this to run the full "lint-test" target (i.e., replace
the make target with "lint-test") so linting plus tests and coverage run in CI,
or alternatively add a separate job that runs "make -C preflight-checks/${{
matrix.component }} lint-test" (or a dedicated test job) to ensure the dcgm-diag
tests and coverage artifacts are executed; modify the job that currently has the
name "Run lint" (or create a new job) accordingly so it calls the "lint-test"
target instead of "lint".
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
/ok to test 132ba37
/ok to test 07e6ae8
/ok to test 416ea9c
Signed-off-by: Ajay Mishra <ajmishra@nvidia.com>
/ok to test 8eba746
/ok to test c7e40b6
/ok to test 5463918
/ok to test 4c712c2
Pull Request is not mergeable
@lalitadithya @XRFXLP I would appreciate not merging code against this branch without consulting with @pteranodan and myself first.
Summary
Adds DCGM diagnostic tests as part of the preflight test suite. The overall flow is:
Package structure:
Testing
Tested with DCGM 4.3.1-1-ubuntu22.04 and 4.2.3-1-ubuntu22.04
Created pod:
Verified that init container was injected:
Healthy node
Platform connector logs (per GPU, per check event)
Unhealthy Node
From platform connector logs:
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Bug Fixes / Improvements
Chores
Tests