NVSentinel's log collection feature automatically gathers diagnostic logs from GPU nodes when faults are detected. These logs help with troubleshooting and root cause analysis of GPU hardware and software issues.
When a node is drained due to a fault, NVSentinel can optionally create a log collection job that gathers comprehensive diagnostics and stores them in an in-cluster file server for easy access.
When GPU nodes fail, you need diagnostic information to understand what went wrong:
- Root cause analysis: Determine if the issue is hardware, driver, or configuration related
- Support requests: Provide comprehensive logs to NVIDIA support or your infrastructure team
- Trend analysis: Build historical data on failure patterns
- Faster resolution: All relevant logs collected automatically instead of manual gathering
Without log collection, you'd need to manually SSH to failed nodes, locate the right log files, and extract them before the node is rebooted or replaced - a time-consuming and error-prone process.
When log collection is enabled, NVSentinel automatically collects diagnostics when a fault with a supported remediation action is detected:
1. The Fault Remediation module receives a health event with a supported action
2. It creates a Kubernetes Job on the target node to collect diagnostics
3. The Job runs with privileged access to gather logs (in parallel with node drain/remediation)
4. Logs are uploaded to the in-cluster file server
5. The Job completes and is automatically cleaned up after 1 hour
Note: Log collection is skipped for unsupported remediation actions since no automated remediation is performed and the node remains accessible for manual log collection.
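Conceptually, the Job created in the workflow above resembles the following sketch. This is illustrative only: the resource names, image, and node name shown here are assumptions, not the exact manifest NVSentinel generates.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: log-collector-    # illustrative name
  namespace: nvsentinel
spec:
  ttlSecondsAfterFinished: 3600   # auto-cleanup after 1 hour
  template:
    spec:
      nodeName: gpu-node-1        # pinned to the faulted node
      restartPolicy: Never
      containers:
        - name: collector
          image: log-collector:example   # illustrative image
          securityContext:
            privileged: true      # required to reach host-level diagnostics
          volumeMounts:
            - name: host-root
              mountPath: /host
      volumes:
        - name: host-root
          hostPath:
            path: /
```

The `ttlSecondsAfterFinished` field is what drives the automatic one-hour cleanup of completed jobs.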
The file server stores logs organized by node name and timestamp, accessible via web browser through port-forwarding.
The nvidia-bug-report is a comprehensive NVIDIA driver and GPU diagnostic report covering:
- GPU configuration and status
- Driver version and details
- GPU error logs
- PCIe information
- DCGM diagnostics
Warning: Must-gather is disabled by default because it collects logs from ALL nodes in the cluster, which can be very time-consuming for large clusters (e.g., GB200 clusters with 100+ nodes).
When enabled, must-gather collects Kubernetes resources and logs for GPU operator components:
- GPU operator pod logs
- DCGM exporter logs
- Device plugin logs
- GPU feature discovery logs
- Operator configuration
To enable must-gather, set enableGpuOperatorMustGather: true and increase the timeout accordingly (see Timeout Configuration below).
When enabled, SOS collection gathers system logs and configuration from GCP or AWS instances.
Configure log collection through Helm values:
```yaml
fault-remediation:
  logCollector:
    enabled: false  # Enable automatic log collection
    uploadURL: "http://nvsentinel-incluster-file-server.nvsentinel.svc.cluster.local/upload"
    gpuOperatorNamespaces: "gpu-operator"
    timeout: "10m"
    # GPU Operator must-gather (disabled by default - see warning below)
    enableGpuOperatorMustGather: false
    # Cloud-specific SOS collection
    enableGcpSosCollection: false
    enableAwsSosCollection: false
```

The default timeout of 10m is sufficient when collecting only the nvidia-bug-report (must-gather disabled).
Important: If you enable enableGpuOperatorMustGather, you MUST increase the timeout! Must-gather collects logs from all nodes in the cluster, taking approximately 2-3 minutes per node.
Recommended timeout formula: (number of nodes) × 2-3 minutes

| Cluster Size | Recommended Timeout |
|--------------|---------------------|
| 10 nodes     | 30m                 |
| 50 nodes     | 2h                  |
| 100 nodes    | 4h                  |
| 200+ nodes   | 8h+                 |
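As a quick sanity check, the formula above can be scripted. The 2-3 minutes per node figure is this guide's estimate, and the node count below is just an example:

```shell
# Estimate the must-gather collection window for a given cluster size,
# using the rule of thumb of 2-3 minutes per node.
nodes=100
low=$((nodes * 2))
high=$((nodes * 3))
echo "Expect collection to take ${low}-${high} minutes for ${nodes} nodes"
# → Expect collection to take 200-300 minutes for 100 nodes
```

Round the upper bound up to a comfortable margin when setting `timeout` (hence 4h for a 100-node cluster).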
Example configuration for a 100-node cluster with must-gather enabled:
```yaml
fault-remediation:
  logCollector:
    enabled: true
    enableGpuOperatorMustGather: true
    timeout: "4h"  # Increased for 100-node cluster
```

The in-cluster file server stores collected logs:
```yaml
incluster-file-server:
  enabled: true  # Deploy the file server
  persistence:
    enabled: true
    storageClassName: ""  # Uses default storage class
    size: 50Gi
  logCleanup:
    enabled: true
    retentionDays: 7  # Keep logs for 7 days
    sleepInterval: 86400  # Run cleanup every 24 hours
```

- Enable/Disable: Turn log collection on or off per deployment
- Storage Size: Configure persistent volume size based on expected log volume
- Log Retention: Automatically clean up old logs after configurable retention period
- Timeout: Set maximum time for log collection job (increase when enabling must-gather)
- Enable Must-Gather: Enable GPU Operator must-gather collection (disabled by default due to performance impact on large clusters)
- Cloud-Specific SOS: Enable additional cloud provider diagnostics
Logs are gathered automatically when a supported remediation action is triggered - no manual intervention required.
Collection job runs with necessary privileges to access all diagnostic sources on the node.
Logs are organized by node name and timestamp in a browsable directory structure:
```text
/node-name/timestamp/
├── nvidia-bug-report.log.gz
└── gpu-operator-must-gather.tar.gz
```
Simple port-forward to browse and download logs via web browser:
```shell
kubectl port-forward -n nvsentinel svc/nvsentinel-incluster-file-server 8080:80
```

Then access http://localhost:8080 in your browser.
Built-in log rotation removes old logs based on retention policy to manage disk space.
Collection jobs are automatically cleaned up after completion with configurable TTL (default: 1 hour).
Logs are stored in a persistent volume attached to an NGINX-based file server running in the cluster. The file server is accessible only within the cluster or via port-forward.
Automatic cleanup service runs periodically (default: daily) to remove logs older than the configured retention period (default: 7 days). This prevents disk space issues from accumulating logs.
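The cleanup described above amounts to a periodic, age-based delete. A minimal sketch of that behavior in shell, assuming the file server's document root layout shown below; the actual sidecar implementation may differ:

```shell
# Remove <node>/<timestamp> log directories older than the retention period.
# LOG_ROOT defaults to the file server's document root; override for testing.
LOG_ROOT="${LOG_ROOT:-/usr/share/nginx/html}"
RETENTION_DAYS="${RETENTION_DAYS:-7}"

if [ -d "$LOG_ROOT" ]; then
  # Timestamp directories sit two levels below the root (node/timestamp).
  find "$LOG_ROOT" -mindepth 2 -maxdepth 2 -type d \
    -mtime +"$RETENTION_DAYS" -exec rm -rf {} +
fi
# The cleanup service repeats this every sleepInterval seconds (default 86400).
```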
```text
/usr/share/nginx/html/
└── <node-name>/
    └── <timestamp>/
        ├── nvidia-bug-report-<node-name>-<timestamp>.log.gz
        └── gpu-operator-must-gather-<node-name>-<timestamp>.tar.gz
```