Open
Labels
A-kv-observability
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
P-3: Issues/test failures with no fix SLA
T-kv: KV Team
Description
We've seen on multiple occasions (most recently in https://github.com/cockroachlabs/support/issues/3300) that an issue with the liveness range leaseholder can cause a cluster-wide outage. In such cases, we would like to better understand what caused the lease to expire (e.g. CPU overload, disk issues).
Capturing an execution trace right before the lease is about to expire would make these situations easier to debug in the future. In the common case, the lease is extended after 3s, so if we captured an execution trace once roughly 5s have passed without an extension, the capture should trigger only in the few cases where the lease is actually about to expire.
Jira issue: CRDB-50863