server,tests: add server tso batch metric and test lock fix #10224

okJiang · 2026-02-05T11:34:04Z

What problem does this PR solve?

pd-tso-bench can generate batch requests, but the existing Grafana panel depends on the client-side metric pd_client_request_handle_tso_batch_size. In stress scenarios this client metric may be missing or not representative, which makes it hard to judge real server-side batching behavior from PD monitoring.

In addition, CI hit an intermittent deadlock warning in tests: tests.TestServer.Run() held a write lock while calling s.server.Run(), while retry cleanup paths called State() under a read lock, causing lock contention and flaky failures.

Issue Number: ref #10228

What is changed and how does it work?

This PR changes observability to use PD server metrics directly and fixes the test lock contention:

Add a new server histogram metric pd_server_handle_tso_batch_size in server/metrics.go.
Record request.GetCount() for normal TSO handling and service-independent forwarding path in server/grpc_service.go.
Update Grafana panel TSO request batch size in metrics/grafana/pd.json to query the new server metric (_bucket/_sum/_count) instead of client metric.
Fix test lock contention in tests/cluster.go by introducing a Starting state and avoiding holding TestServer write lock across s.server.Run(). Cleanup logic now treats Starting as active and can stop/destroy safely.

server,tests: add tso batch metric and fix test server lock

Check List

Tests

Unit test
Manual test (add detailed scripts or steps below)

Manual test steps:

Code changes

Has the configuration change: No
Has HTTP API interfaces changed: No
Has persistent data change: No

Side effects

Possible performance regression: No
Increased code complexity: Low (only metric observation + explicit test state transition)
Breaking backward compatibility: No

Related changes

PR to update pingcap/docs/pingcap/docs-cn: N/A
PR to update pingcap/tiup: N/A
Need to cherry-pick to the release branch: TBD

Release note

Add a PD server-side TSO batch size metric and use it in Grafana; also fix flaky test lock contention in TestServer startup/cleanup paths.

Summary by CodeRabbit

Observability
- Added a TSO batch-size histogram and instrumentation to record batch sizes; Grafana dashboards now source and label these server-side metrics for clearer monitoring.
Tests
- Improved test server lifecycle with a new transitional "Starting" state and adjusted run/stop/destroy/retry flows to handle transitional states reliably.

ti-chi-bot · 2026-02-05T11:34:08Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot · 2026-02-05T11:34:10Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign overvenus for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-02-05T11:34:14Z

📝 Walkthrough

Walkthrough

Adds a server-side TSO batch-size histogram and records observations on local TSO handling; updates Grafana queries to use server-side metrics; introduces a new TestServer "Starting" state and updates Run/Stop/Destroy and retry cleanup logic accordingly.

Changes

Cohort / File(s)	Summary
Server metrics & instrumentation `server/metrics.go`, `server/grpc_service.go`	Add `tsoBatchSize` Prometheus histogram (server namespace) and call `tsoBatchSize.Observe(count)` in the local TSO handling path.
Grafana dashboard `metrics/grafana/pd.json`	Replace client-side TSO histogram/aggregations with server-side equivalents (`pd_server_handle_tso_batch_size_*`) and update legend labels; minor EOF newline change.
Test server lifecycle `tests/cluster.go`	Add `Starting` state constant; update `Run()`, `Stop()`, `Destroy()` and retry/cleanup paths to treat `Starting` similarly to `Running` for stop/destroy eligibility and to handle transitional locking/state updates.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Client as Client
participant Grpc as GrpcServer
participant Hist as tsoBatchSize (Histogram)
participant Prom as Prometheus
participant Graf as Grafana
rect rgba(200,230,255,0.5)
Client->>Grpc: Send TSO request (count)
Grpc->>Hist: Observe(count)
Hist-->>Prom: Exposed metric (scrape)
Prom-->>Graf: Query server histogram (histogram_quantile / sum/count)
Graf-->>Client: Render dashboard panel
end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related issues

server: expose TSO batch size metric and fix test lock contention #10228: Implements server-side TSO batch-size histogram, records observations, updates Grafana queries, and adjusts TestServer startup/cleanup state handling.

Poem

🐰 I hopped through code with metrics bright,

Observed each TSO batch in flight,
A Starting hop before we run,
Dashboards gleam when scrapes are done,
Little rabbit claps—well tracked tonight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the two main changes: adding a server TSO batch metric and fixing a test lock issue.
Description check	✅ Passed	The description is comprehensive, covering the problem statement, detailed changes, testing approach, and release notes with all required sections properly filled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

No actionable comments were generated in the recent review. 🎉

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-02-05T11:43:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.80%. Comparing base (2139230) to head (f016eae).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10224      +/-   ##
==========================================
+ Coverage   78.63%   78.80%   +0.16%     
==========================================
  Files         520      522       +2     
  Lines       70089    70371     +282     
==========================================
+ Hits        55112    55453     +341     
+ Misses      10989    10923      -66     
- Partials     3988     3995       +7

Flag	Coverage Δ
unittests	`78.80% <100.00%> (+0.16%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: okjiang <819421878@qq.com>

This reverts commit 91fcabc. Signed-off-by: okjiang <819421878@qq.com>

Signed-off-by: okjiang <819421878@qq.com>

server/grpc_service.go

Signed-off-by: okjiang <819421878@qq.com>

okJiang · 2026-02-11T03:48:55Z

/retest

ti-chi-bot bot added the do-not-merge/needs-linked-issue label Feb 5, 2026

ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the dco. labels Feb 5, 2026

ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 5, 2026

okJiang force-pushed the tso-bench-cannot-batch branch from 17d03d8 to 62cda4f Compare February 5, 2026 14:59

ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 5, 2026

okJiang force-pushed the tso-bench-cannot-batch branch from 62cda4f to cd65fef Compare February 5, 2026 15:04

tools: print tso batch size stats

91fcabc

Signed-off-by: okjiang <819421878@qq.com>

okJiang force-pushed the tso-bench-cannot-batch branch from cd65fef to 91fcabc Compare February 6, 2026 02:43

ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 6, 2026

okJiang changed the title ~~[codex] tools: print tso batch size stats~~ [codex] server,tests: add server tso batch metric and test lock fix Feb 6, 2026

ti-chi-bot bot added do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Feb 6, 2026

okJiang mentioned this pull request Feb 6, 2026

server: expose TSO batch size metric and fix test lock contention #10228

Open

ti-chi-bot bot removed the do-not-merge/needs-linked-issue label Feb 6, 2026

okJiang added 2 commits February 9, 2026 17:45

Revert "tools: print tso batch size stats"

5eddccd

This reverts commit 91fcabc. Signed-off-by: okjiang <819421878@qq.com>

server,tests: add tso batch metric and fix test server lock

2013ce2

Signed-off-by: okjiang <819421878@qq.com>

okJiang force-pushed the tso-bench-cannot-batch branch from a22494d to 2013ce2 Compare February 9, 2026 09:46

ti-chi-bot bot removed the do-not-merge/needs-triage-completed label Feb 9, 2026

okJiang changed the title ~~[codex] server,tests: add server tso batch metric and test lock fix~~ server,tests: add server tso batch metric and test lock fix Feb 10, 2026

okJiang marked this pull request as ready for review February 10, 2026 08:39

ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2026

okJiang commented Feb 10, 2026

View reviewed changes

server/grpc_service.go Outdated Show resolved Hide resolved

server: avoid double-count tso batch metric in forwarding path

f016eae

Signed-off-by: okjiang <819421878@qq.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server,tests: add server tso batch metric and test lock fix #10224

server,tests: add server tso batch metric and test lock fix #10224

okJiang commented Feb 5, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

ti-chi-bot bot commented Feb 5, 2026

Uh oh!

ti-chi-bot bot commented Feb 5, 2026

Uh oh!

coderabbitai bot commented Feb 5, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Poem

Uh oh!

codecov bot commented Feb 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

okJiang commented Feb 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

server,tests: add server tso batch metric and test lock fix #10224

Are you sure you want to change the base?

server,tests: add server tso batch metric and test lock fix #10224

Conversation

okJiang commented Feb 5, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how does it work?

Check List

Release note

Summary by CodeRabbit

Uh oh!

ti-chi-bot bot commented Feb 5, 2026

Uh oh!

ti-chi-bot bot commented Feb 5, 2026

Uh oh!

coderabbitai bot commented Feb 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Poem

Uh oh!

codecov bot commented Feb 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

okJiang commented Feb 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

okJiang commented Feb 5, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 5, 2026 •

edited

Loading

codecov bot commented Feb 5, 2026 •

edited

Loading