
Commit 65abddc

Merge branch 'main' into fix-fault-remediation-bug
2 parents 14d7204 + 26a91df commit 65abddc

28 files changed: +1234 -103 lines changed

GOVERNANCE.md

Lines changed: 224 additions & 0 deletions
@@ -0,0 +1,224 @@
+# NVSentinel Governance
+
+This document describes the governance model for the NVSentinel project. It defines the roles, responsibilities, and decision-making processes that help maintain the long-term health and sustainability of the project.
+
+## Overview
+
+NVSentinel follows a hierarchical governance model similar to other Kubernetes ecosystem projects. This structure ensures that contributions are properly reviewed, architectural decisions are well-considered, and the project maintains high quality standards while remaining accessible to new contributors.
+
+## Roles and Responsibilities
+
+The NVSentinel project has the following roles, listed in order of increasing scope of responsibility:
+
+### Contributors
+
+**Who they are**: Anyone who contributes to the project in any form (code, documentation, issues, discussions, etc.).
+
+**Responsibilities**:
+- Follow the [Code of Conduct](CODE_OF_CONDUCT.md)
+- Sign the [Developer Certificate of Origin](CONTRIBUTING.md#developer-certificate-of-origin) (DCO) for all contributions
+- Follow project coding standards and guidelines
+- Respond to feedback on their contributions
+
+**How to become one**: Simply contribute! Open a pull request, file an issue, or participate in discussions.
+
+### Reviewers
+
+**Who they are**: Contributors who have demonstrated consistent, high-quality contributions and have been granted review privileges for specific areas of the codebase.
+
+**Responsibilities**:
+- Review pull requests in their area of expertise
+- Provide constructive feedback on code quality, design, and testing
+- Ensure contributions follow project standards
+- Help triage and categorize issues
+- Mentor new contributors
+
+**Privileges**:
+- Can review and comment on pull requests
+- Can request changes on pull requests
+- Can be assigned to review pull requests
+
+**How to become one**:
+1. Demonstrate consistent, high-quality contributions over time
+2. Show expertise in specific areas of the codebase
+3. Be nominated by an existing Reviewer or Approver
+4. Be approved by a majority of Approvers in the relevant area
+
+### Approvers
+
+**Who they are**: Reviewers who have demonstrated deep expertise and excellent judgment in their areas. They have the authority to approve pull requests for merging.
+
+**Responsibilities**:
+- All responsibilities of Reviewers
+- Approve pull requests that meet project standards
+- Make decisions on technical design within their area
+- Participate in architectural discussions
+- Help maintain code quality and project health
+- Mentor Reviewers and Contributors
+
+**Privileges**:
+- All privileges of Reviewers
+- Can approve pull requests (with appropriate reviews)
+- Can merge approved pull requests
+- Can participate in release planning and decisions
+
+**How to become one**:
+1. Be an active Reviewer for at least 3 months
+2. Demonstrate excellent technical judgment and code review quality
+3. Show ability to make sound architectural decisions
+4. Deliver a feature end to end
+5. Be nominated by an existing Approver or Maintainer
+6. Be approved by a majority of Maintainers
+
+### Maintainers
+
+**Who they are**: Approvers who have demonstrated exceptional commitment to the project and have broad responsibility for its overall health and direction.
+
+**Responsibilities**:
+- All responsibilities of Approvers
+- Make architectural and design decisions
+- Participate in release planning and management
+- Resolve conflicts and disputes
+- Maintain project documentation and governance
+- Represent the project in the community
+- Onboard new Approvers and Reviewers
+
+**Privileges**:
+- All privileges of Approvers
+- Can make architectural decisions
+- Can approve new Reviewers and Approvers
+- Can participate in project-wide decisions
+- Can manage project settings and repositories
+
+**How to become one**:
+1. Be an active Approver for at least 6 months
+2. Demonstrate exceptional commitment and leadership
+3. Show ability to guide project direction
+4. Be nominated by an existing Maintainer
+5. Be approved by a supermajority (2/3) of existing Maintainers
+
+### Technical Leads / Project Chairs
+
+**Who they are**: Maintainers who provide overall technical leadership and strategic direction for the project.
+
+**Responsibilities**:
+- All responsibilities of Maintainers
+- Set long-term technical vision and roadmap
+- Make final decisions on major architectural changes
+- Resolve disputes that cannot be resolved by Maintainers
+- Represent the project to external stakeholders
+- Coordinate with other projects and organizations
+
+**Privileges**:
+- All privileges of Maintainers
+- Final authority on technical decisions
+- Can make emergency decisions when needed
+
+**How to become one**:
+- Appointed by the project sponsor (NVIDIA) in consultation with existing Technical Leads
+- Requires exceptional technical leadership and commitment
+
+## Decision-Making Process
+
+### Code Changes
+
+1. **Pull Request Process**:
+   - All pull requests require at least one review from a Reviewer
+   - All pull requests require approval from at least one Approver
+   - Maintainers can approve their own changes, but should seek review from another Maintainer for significant changes
+   - All CI checks must pass before merging
+
+2. **Review Requirements**:
+   - Small changes (documentation, typo fixes): 1 Reviewer approval
+   - Standard changes: 1 Reviewer + 1 Approver approval
+   - Significant changes (new features, major refactoring): 2 Reviewers + 1 Approver approval
+   - Architectural changes: 2 Approvers + 1 Maintainer approval
+
+### Design and Architecture Decisions
+
+1. **Proposal Process**:
+   - Create a GitHub Discussion or issue with the `design` label
+   - Tag relevant Maintainers and Approvers
+   - Allow at least 1 week for community feedback
+   - Document the decision and rationale
+
+2. **Decision Authority**:
+   - Minor design decisions: Any Approver
+   - Significant design decisions: Majority of Maintainers
+   - Major architectural changes: Supermajority (2/3) of Maintainers or Technical Leads
+
+### Conflict Resolution
+
+1. **Technical Disagreements**:
+   - Discuss in pull request comments or GitHub Discussions
+   - If unresolved, escalate to Maintainers
+   - Maintainers will facilitate discussion and make a decision
+   - Final appeal to Technical Leads if needed
+
+2. **Code of Conduct Issues**:
+   - Report to GitHub_Conduct@nvidia.com (as per [Code of Conduct](CODE_OF_CONDUCT.md))
+   - Maintainers will investigate and take appropriate action
+
+## Areas of Ownership
+
+The project is organized into several areas, each with their own Reviewers and Approvers:
+
+- **Core Infrastructure**: Platform connectors, store client, data models
+- **Health Monitors**: GPU, syslog, CSP, and Kubernetes object monitors
+- **Fault Management**: Fault quarantine, node drainer, fault remediation
+- **Supporting Services**: Janitor, labeler, metadata collector, log collector
+- **Distribution**: Helm charts, Kubernetes manifests, deployment tooling
+- **Documentation**: All project documentation
+
+Maintainers have cross-cutting responsibilities across all areas.
+
+## Becoming a Reviewer or Approver
+
+### Path to Reviewer
+
+1. **Contribute consistently**: Make at least 5-10 meaningful contributions
+2. **Demonstrate expertise**: Show deep understanding of specific areas
+3. **Review contributions**: Provide helpful reviews on others' pull requests
+4. **Get nominated**: Be nominated by an existing Reviewer or Approver
+5. **Get approved**: Receive approval from a majority of Approvers in the relevant area
+
+### Path to Approver
+
+1. **Be an active Reviewer**: Review consistently for at least 3 months
+2. **Show judgment**: Demonstrate ability to make sound technical decisions
+3. **Mentor others**: Help onboard new contributors
+4. **Get nominated**: Be nominated by an existing Approver or Maintainer
+5. **Get approved**: Receive approval from a majority of Maintainers
+
+## Maintaining Status
+
+All roles require ongoing participation:
+
+- **Reviewers**: Should review at least 1 pull request per month
+- **Approvers**: Should review and approve at least 2 pull requests per month
+- **Maintainers**: Should participate in project discussions and decisions regularly
+
+If a person is inactive for 6+ months, their status may be reviewed. Exceptions can be made for known circumstances (e.g., sabbatical, parental leave).
+
+## Release Management
+
+- **Release Planning**: Maintainers coordinate release planning and scheduling
+- **Release Approval**: Releases require approval from at least 2 Maintainers
+- **Release Process**: See [RELEASE.md](RELEASE.md) for detailed release procedures
+
+## Modifications to Governance
+
+Changes to this governance document require:
+- A pull request with the proposed changes
+- Discussion period of at least 1 week
+- Approval from a supermajority (2/3) of Maintainers
+
+## Contact
+
+For questions about governance or to nominate someone for a role:
+- Open a GitHub Discussion with the `governance` label
+- Contact the project Maintainers directly
+
+---
+
+*This governance model is inspired by the Kubernetes project and other CNCF projects, adapted for the needs of NVSentinel.*
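
Reviewer note: the "Review Requirements" list and the 2/3 supermajority rule in GOVERNANCE.md can be read as simple checks. The sketch below is only an illustration of that reading, not tooling from this commit. The names `REVIEW_REQUIREMENTS`, `is_mergeable`, and `supermajority_threshold` are hypothetical, approvals are assumed to be counted strictly per role, and "supermajority (2/3)" is assumed to mean "at least two thirds". The rule that Maintainers can approve their own changes is not modeled.

```python
# Required approvals per change type, transcribed from the "Review Requirements"
# list in GOVERNANCE.md. The dictionary keys are illustrative labels only.
REVIEW_REQUIREMENTS = {
    "small": {"reviewer": 1},
    "standard": {"reviewer": 1, "approver": 1},
    "significant": {"reviewer": 2, "approver": 1},
    "architectural": {"approver": 2, "maintainer": 1},
}


def is_mergeable(change_type, approvals, ci_passed):
    """Return True when CI passed and every required role count is met.

    Assumption: approvals are counted strictly per role, so a Maintainer
    approval is not also counted as an Approver approval.
    """
    if not ci_passed:
        return False
    required = REVIEW_REQUIREMENTS[change_type]
    return all(approvals.get(role, 0) >= count for role, count in required.items())


def supermajority_threshold(maintainers):
    """Smallest vote count that is at least 2/3 of the maintainers (assumed reading)."""
    return (2 * maintainers + 2) // 3  # integer ceiling of 2n/3


print(is_mergeable("standard", {"reviewer": 1, "approver": 1}, ci_passed=True))
print(supermajority_threshold(5))
```

With only two Maintainers, as listed in OWNERS_ALIASES in this commit, this reading of the rule requires both of them to approve.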

OWNERS

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+# See the OWNERS docs: https://k8s.dev/docs/guide/owners
+#
+# This file defines the approvers and reviewers for the NVSentinel project.
+# OWNERS files are used by automation tools like Prow, Tide, and GitHub Actions
+# to automatically assign reviewers and require approvals for pull requests.
+#
+# For more information, see GOVERNANCE.md
+#
+# Aliases can be defined in OWNERS_ALIASES and referenced here.
+
+approvers:
+  # Maintainers - have approval authority across the project
+  # Add GitHub usernames of project Maintainers here
+  # You can also use aliases from OWNERS_ALIASES (e.g., sig-maintainers)
+  # Example:
+  # - maintainer1
+  # - maintainer2
+  - sig-maintainers # Using alias from OWNERS_ALIASES
+
+reviewers:
+  # Reviewers - can review pull requests
+  # Add GitHub usernames of project Reviewers here
+  # You can also use aliases from OWNERS_ALIASES
+  # Example:
+  # - reviewer1
+  # - reviewer2
+  - sig-reviewers # Using alias from OWNERS_ALIASES
+
+labels:
+  # Labels to automatically apply to PRs in this directory
+  - area/nvsentinel
+
+# Optional: Former approvers/reviewers who are no longer active
+# emeritus_approvers:
+#   - former-approver1
+# emeritus_reviewers:
+#   - former-reviewer1
+
+# Optional: Configuration options
+# options:
+#   no_parent_owners: false # Set to true to exclude parent OWNERS files

OWNERS_ALIASES

Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
# See the OWNERS docs: https://k8s.dev/docs/guide/owners
2+
#
3+
# This file defines aliases for groups of people.
4+
# Aliases can be used in OWNERS files to reference multiple people at once.
5+
#
6+
# Format:
7+
# alias-name:
8+
# - github-username1
9+
# - github-username2
10+
#
11+
# Example usage in OWNERS file:
12+
# approvers:
13+
# - sig-core-infra # This alias references all members below
14+
15+
# Project-wide aliases
16+
# Add aliases for different areas of the project as defined in GOVERNANCE.md
17+
18+
# Core Infrastructure team (Platform connectors, store client, data models)
19+
# sig-core-infra:
20+
# - approver1
21+
# - approver2
22+
23+
# Health Monitors team (GPU, syslog, CSP, Kubernetes object monitors)
24+
# sig-health-monitors:
25+
# - approver1
26+
# - approver2
27+
28+
# Fault Management team (Fault quarantine, node drainer, fault remediation)
29+
# sig-fault-management:
30+
# - approver1
31+
# - approver2
32+
33+
# Supporting Services team (Janitor, labeler, metadata collector, log collector)
34+
# sig-supporting-services:
35+
# - approver1
36+
# - approver2
37+
38+
# Distribution team (Helm charts, Kubernetes manifests, deployment tooling)
39+
# sig-distribution:
40+
# - approver1
41+
# - approver2
42+
43+
# Documentation team
44+
# sig-documentation:
45+
# - approver1
46+
# - approver2
47+
48+
# All Maintainers (cross-cutting responsibility)
49+
sig-maintainers:
50+
- lalitadithya
51+
- dims
52+
# - maintainer3
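
Reviewer note: OWNERS in this commit references both `sig-maintainers` and `sig-reviewers`, while the only uncommented alias in OWNERS_ALIASES is `sig-maintainers`. The sketch below shows one way such a gap could be caught. It is an illustration only: the data is copied by hand from the two hunks, the `resolve` helper is hypothetical, and the rule that any `sig-` prefixed entry must be an alias is an assumption. How the actual automation treats an undefined alias is not shown in this diff.

```python
# Entries mirrored by hand from the OWNERS and OWNERS_ALIASES hunks in this commit.
OWNERS = {
    "approvers": ["sig-maintainers"],
    "reviewers": ["sig-reviewers"],
}
ALIASES = {
    "sig-maintainers": ["lalitadithya", "dims"],
}


def resolve(entries, aliases):
    """Expand alias names into usernames and collect sig-* names with no definition."""
    users, undefined = [], []
    for entry in entries:
        if entry in aliases:
            users.extend(aliases[entry])
        elif entry.startswith("sig-"):  # assumption: sig-* always denotes an alias
            undefined.append(entry)
        else:
            users.append(entry)  # plain GitHub username
    return users, undefined


for role, entries in OWNERS.items():
    users, undefined = resolve(entries, ALIASES)
    print(role, users, undefined)
```

Under these assumptions the `reviewers` list resolves to no users and reports `sig-reviewers` as undefined, which may be worth confirming before relying on automatic reviewer assignment.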

distros/kubernetes/nvsentinel/charts/event-exporter/templates/deployment.yaml

Lines changed: 1 addition & 0 deletions
@@ -86,6 +86,7 @@ spec:
             - "--config=/etc/config/config.toml"
             - "--metrics-port={{ .Values.global.metricsPort }}"
             - "--oidc-secret-path=/var/secrets/oidc-client-secret"
+            - "--workers={{ .Values.exporter.workers }}"
           ports:
             - name: metrics
               containerPort: {{ .Values.global.metricsPort }}

distros/kubernetes/nvsentinel/charts/event-exporter/values.yaml

Lines changed: 6 additions & 0 deletions
@@ -72,6 +72,12 @@ exporter:
       database: "nvsentinel"
       collection: "resumetokens"
 
+  # Number of concurrent publish workers.
+  # Each worker publishes events to the sink in parallel.
+  # Higher values increase throughput for high-event-rate clusters.
+  # At 300ms per publish, 10 workers ≈ 33 events/sec throughput.
+  workers: 10
+
   # Failure handling and retry configuration
   failureHandling:
     maxRetries: 17 # ~30 minutes of retries
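
Reviewer note: the throughput figure in the `workers` comment follows from dividing the worker count by the per-publish latency. The sketch below reproduces that arithmetic and shows the general shape of a fixed-size worker pool. It is not the exporter's implementation: `publish` is a hypothetical stand-in that sleeps instead of calling a sink, and the estimate assumes every worker stays busy and latency is constant.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def estimated_throughput(workers, publish_seconds):
    """Upper-bound events/sec if every worker is always busy (idealized)."""
    return workers / publish_seconds


def publish(event):
    """Hypothetical stand-in for one sink publish call; the sleep mimics latency."""
    time.sleep(0.01)
    return f"published:{event}"


def publish_all(events, workers):
    """Publish events concurrently with a fixed number of workers, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(publish, events))


# 10 workers at 300 ms per publish, as in the values.yaml comment.
print(f"{estimated_throughput(10, 0.3):.1f} events/sec")
print(publish_all(["e1", "e2", "e3"], workers=2))
```

The same formula suggests that doubling `workers` doubles the ceiling only while the sink itself keeps up, so the value is best tuned against observed publish latency.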

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/configmap.yaml

Lines changed: 1 addition & 0 deletions
@@ -48,5 +48,6 @@ data:
     DCGM_HEALTH_WATCH_NVSWITCH_NONFATAL=NonFatal
     DCGM_HEALTH_WATCH_PCIE=Fatal
     DCGM_HEALTH_WATCH_PMU=Fatal
+    DCGM_HEALTH_WATCH_ALL=Fatal
 
 {{ (.Files.Glob "files/dcgmerrorsmapping.csv").AsConfig | indent 2 }}
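
Reviewer note: the ConfigMap entries above are `KEY=Severity` pairs. The sketch below shows one plausible way to read such a block into a lookup table. It is an assumption about the format only: `parse_severity_map` is a hypothetical helper, and this diff does not show how the GPU health monitor actually parses the file or how it applies the new `DCGM_HEALTH_WATCH_ALL` entry.

```python
def parse_severity_map(text):
    """Parse KEY=Severity lines into a dict, skipping blanks and # comments."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        mapping[key.strip()] = value.strip()
    return mapping


# Sample copied from the visible lines of the hunk above.
SAMPLE = """
DCGM_HEALTH_WATCH_NVSWITCH_NONFATAL=NonFatal
DCGM_HEALTH_WATCH_PCIE=Fatal
DCGM_HEALTH_WATCH_PMU=Fatal
DCGM_HEALTH_WATCH_ALL=Fatal
"""

severities = parse_severity_map(SAMPLE)
print(severities["DCGM_HEALTH_WATCH_ALL"])
```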

docs/METRICS.md

Lines changed: 14 additions & 6 deletions
@@ -188,12 +188,20 @@ These metrics track the internal ring buffer workqueue performance:
 
 These metrics track GPU health events detected via DCGM (Data Center GPU Manager):
 
-| Metric Name | Type | Labels | Description |
-|------------|------|--------|-------------|
-| `dcgm_health_events_publish_time_to_grpc_channel` | Histogram | `operation_name` | Amount of time spent in publishing DCGM health events on the gRPC channel |
-| `health_events_insertion_to_uds_succeed` | Counter | - | Total number of successful insertions of health events to UDS |
-| `health_events_insertion_to_uds_error` | Counter | - | Total number of failed insertions of health events to UDS |
-| `dcgm_health_active_events` | Gauge | `event_type`, `gpu_id`, `severity` | Total number of active health events at any given time by severity. Severity values: `fatal`, `non_fatal` |
+| Metric Name | Type | Labels | Description |
+|---------------------------------------------------|-----------|------------------------------------|-----------------------------------------------------------------------------------------------------------|
+| `dcgm_health_events_publish_time_to_grpc_channel` | Histogram | `operation_name` | Amount of time spent in publishing DCGM health events on the gRPC channel |
+| `health_events_insertion_to_uds_succeed` | Counter | - | Total number of successful insertions of health events to UDS |
+| `health_events_insertion_to_uds_error` | Counter | - | Total number of failed insertions of health events to UDS |
+| `dcgm_health_active_events` | Gauge | `event_type`, `gpu_id`, `severity` | Total number of active health events at any given time by severity. Severity values: `fatal`, `non_fatal` |
+| `dcgm_api_latency` | Histogram | `operation_name` | Amount of time spent calling DCGM APIs |
+| `dcgm_reconcile_time` | Histogram | - | Amount of time spent running a single DCGM reconcile loop |
+| `number_of_health_watches` | Gauge | - | Number of DCGM health watches available |
+| `number_of_fields` | Gauge | - | Number of available DCGM fields to monitor |
+| `callback_failures` | Counter | `class_name`, `func_name` | Number of times a callback function has thrown an exception |
+| `callback_success` | Counter | `class_name`, `func_name` | Number of times a callback function has successfully completed |
+| `dcgm_api_failures` | Counter | `error_name` | Number of DCGM API errors |
+| `dcgm_health_check_unknown_system_skipped` | Counter | - | Number of DCGM health check incidents skipped due to unrecognized system value |
 
 ---
