Skip to content

Conversation

@okJiang
Copy link
Member

@okJiang okJiang commented Feb 5, 2026

What problem does this PR solve?

pd-tso-bench can generate batch requests, but the existing Grafana panel depends on the client-side metric pd_client_request_handle_tso_batch_size. In stress scenarios this client metric may be missing or not representative, which makes it hard to judge real server-side batching behavior from PD monitoring.

In addition, CI hit an intermittent deadlock warning in tests: tests.TestServer.Run() held a write lock while calling s.server.Run(), while retry cleanup paths called State() under a read lock, causing lock contention and flaky failures.

Issue Number: ref #10228

What is changed and how does it work?

This PR changes observability to use PD server metrics directly and fixes the test lock contention:

  1. Add a new server histogram metric pd_server_handle_tso_batch_size in server/metrics.go.
  2. Record request.GetCount() for normal TSO handling and service-independent forwarding path in server/grpc_service.go.
  3. Update Grafana panel TSO request batch size in metrics/grafana/pd.json to query the new server metric (_bucket/_sum/_count) instead of client metric.
  4. Fix test lock contention in tests/cluster.go by introducing a Starting state and avoiding holding TestServer write lock across s.server.Run(). Cleanup logic now treats Starting as active and can stop/destroy safely.
server,tests: add tso batch metric and fix test server lock

Check List

Tests

  • Unit test
  • Manual test (add detailed scripts or steps below)

Manual test steps:

image

Code changes

  • Has the configuration change: No
  • Has HTTP API interfaces changed: No
  • Has persistent data change: No

Side effects

  • Possible performance regression: No
  • Increased code complexity: Low (only metric observation + explicit test state transition)
  • Breaking backward compatibility: No

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn: N/A
  • PR to update pingcap/tiup: N/A
  • Need to cherry-pick to the release branch: TBD

Release note

Add a PD server-side TSO batch size metric and use it in Grafana; also fix flaky test lock contention in TestServer startup/cleanup paths.

Summary by CodeRabbit

  • Observability
    • Added a TSO batch-size histogram and instrumentation to record batch sizes; Grafana dashboards now source and label these server-side metrics for clearer monitoring.
  • Tests
    • Improved test server lifecycle with a new transitional "Starting" state and adjusted run/stop/destroy/retry flows to handle transitional states reliably.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 5, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the dco. labels Feb 5, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign overvenus for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link

coderabbitai bot commented Feb 5, 2026

📝 Walkthrough

Walkthrough

Adds a server-side TSO batch-size histogram and records observations on local TSO handling; updates Grafana queries to use server-side metrics; introduces a new TestServer "Starting" state and updates Run/Stop/Destroy and retry cleanup logic accordingly.

Changes

Cohort / File(s) Summary
Server metrics & instrumentation
server/metrics.go, server/grpc_service.go
Add tsoBatchSize Prometheus histogram (server namespace) and call tsoBatchSize.Observe(count) in the local TSO handling path.
Grafana dashboard
metrics/grafana/pd.json
Replace client-side TSO histogram/aggregations with server-side equivalents (pd_server_handle_tso_batch_size_*) and update legend labels; minor EOF newline change.
Test server lifecycle
tests/cluster.go
Add Starting state constant; update Run(), Stop(), Destroy() and retry/cleanup paths to treat Starting similarly to Running for stop/destroy eligibility and to handle transitional locking/state updates.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Client as Client
participant Grpc as GrpcServer
participant Hist as tsoBatchSize (Histogram)
participant Prom as Prometheus
participant Graf as Grafana
rect rgba(200,230,255,0.5)
Client->>Grpc: Send TSO request (count)
Grpc->>Hist: Observe(count)
Hist-->>Prom: Exposed metric (scrape)
Prom-->>Graf: Query server histogram (histogram_quantile / sum/count)
Graf-->>Client: Render dashboard panel
end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related issues

Poem

🐰 I hopped through code with metrics bright,

Observed each TSO batch in flight,
A Starting hop before we run,
Dashboards gleam when scrapes are done,
Little rabbit claps—well tracked tonight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the two main changes: adding a server TSO batch metric and fixing a test lock issue.
Description check ✅ Passed The description is comprehensive, covering the problem statement, detailed changes, testing approach, and release notes with all required sections properly filled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

No actionable comments were generated in the recent review. 🎉


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 5, 2026
@codecov
Copy link

codecov bot commented Feb 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.80%. Comparing base (2139230) to head (f016eae).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10224      +/-   ##
==========================================
+ Coverage   78.63%   78.80%   +0.16%     
==========================================
  Files         520      522       +2     
  Lines       70089    70371     +282     
==========================================
+ Hits        55112    55453     +341     
+ Misses      10989    10923      -66     
- Partials     3988     3995       +7     
Flag Coverage Δ
unittests 78.80% <100.00%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@okJiang okJiang force-pushed the tso-bench-cannot-batch branch from 17d03d8 to 62cda4f Compare February 5, 2026 14:59
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 5, 2026
@okJiang okJiang force-pushed the tso-bench-cannot-batch branch from 62cda4f to cd65fef Compare February 5, 2026 15:04
Signed-off-by: okjiang <819421878@qq.com>
@okJiang okJiang force-pushed the tso-bench-cannot-batch branch from cd65fef to 91fcabc Compare February 6, 2026 02:43
@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 6, 2026
@okJiang okJiang changed the title [codex] tools: print tso batch size stats [codex] server,tests: add server tso batch metric and test lock fix Feb 6, 2026
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Feb 6, 2026
This reverts commit 91fcabc.

Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
@okJiang okJiang force-pushed the tso-bench-cannot-batch branch from a22494d to 2013ce2 Compare February 9, 2026 09:46
@okJiang okJiang changed the title [codex] server,tests: add server tso batch metric and test lock fix server,tests: add server tso batch metric and test lock fix Feb 10, 2026
@okJiang okJiang marked this pull request as ready for review February 10, 2026 08:39
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2026
Signed-off-by: okjiang <819421878@qq.com>
@okJiang
Copy link
Member Author

okJiang commented Feb 11, 2026

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant