kvserver: observability for expired liveness lease #147025

@miraradeva

We've seen on multiple occasions (most recently in https://github.com/cockroachlabs/support/issues/3300) that an issue with the liveness range leaseholder can cause a cluster-wide outage. In such cases, we would like to better understand what prevented the lease from being extended before it expired (e.g. CPU overload, disk issues).

Capturing an execution trace right before the lease expires would make these situations easier to debug in the future. In the common case, the lease is extended after 3s, so if we only capture an execution trace once 5s or so have passed without an extension, the capture should fire only in the few cases where the lease is actually about to expire.

Jira issue: CRDB-50863

    Labels

    A-kv-observability
    C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
    O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
    P-3: Issues/test failures with no fix SLA
    T-kv: KV Team
