Skip to content

Conversation

@bufferflies
Copy link
Contributor

@bufferflies bufferflies commented Nov 12, 2025

What problem does this PR solve?

Issue Number: Close #9212

What is changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

Release note

None.

Summary by CodeRabbit

  • New Features

    • Metastorage-backed router service discovery with per-node management and service registry JSON parsing.
    • Option builders to allow router-based handling for region and store requests.
    • gRPC health-check registration for servers.
  • Improvements

    • Unified, runtime-aware routing between router and PD services with improved connection lifecycle and balancing.
    • Cleaner service discovery balancer and node health checks.
  • Bug Fixes / Tests

    • Nil-safety for region keys and expanded integration tests covering router scenarios and failpoints.

✏️ Tip: You can customize this high-level summary in your review settings.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Nov 12, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 12, 2025
@bufferflies bufferflies force-pushed the router/client branch 3 times, most recently from a248f17 to 030b4e3 Compare November 13, 2025 07:39
@bufferflies bufferflies force-pushed the router/client branch 3 times, most recently from deb2c24 to 6365ae9 Compare December 4, 2025 06:26
@bufferflies bufferflies marked this pull request as ready for review December 4, 2025 11:35
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 4, 2025
@bufferflies
Copy link
Contributor Author

/retest

@bufferflies bufferflies force-pushed the router/client branch 2 times, most recently from edcaf60 to d4c7e95 Compare December 11, 2025 09:16
@liupan1111
Copy link
Contributor

Is there any design doc for this PR?

@codecov
Copy link

codecov bot commented Dec 12, 2025

Codecov Report

❌ Patch coverage is 80.04535% with 88 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.74%. Comparing base (757acdf) to head (6539e88).
⚠️ Report is 19 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9925      +/-   ##
==========================================
+ Coverage   78.55%   78.74%   +0.19%     
==========================================
  Files         520      522       +2     
  Lines       69658    70090     +432     
==========================================
+ Hits        54721    55195     +474     
+ Misses      10982    10926      -56     
- Partials     3955     3969      +14     
Flag Coverage Δ
unittests 78.74% <80.04%> (+0.19%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines 182 to 184
routerSvcDiscovery sd.ServiceDiscovery
// routerConCtxMgr is used to store the context of the router service stream connection.
routerConCtxMgr *cctx.Manager[routerpb.Router_QueryRegionClient]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering that they are already in the router package, we can remove the router prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It‘s different, it just marks that this connection manager servers for the router microservice not router name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed as mcsDiscovery and mcsConCtxMgr

Copilot AI review requested due to automatic review settings January 19, 2026 11:02
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/client.go (1)

1045-1068: Missing retry logic for router service failures.

Unlike GetRegion, GetPrevRegion, GetRegionByID, and BatchScanRegions, the GetStore method lacks the NeedRetry fallback to PD when the router service fails. This inconsistency could cause reliability issues when the router service is unavailable.

🐛 Proposed fix to add retry logic
 	if isRouterServiceClient {
 		resp, err = routerpb.NewRouterClient(serviceClient.GetClientConn()).GetStore(cctx, req)
 	} else {
 		resp, err = pdpb.NewPDClient(serviceClient.GetClientConn()).GetStore(cctx, req)
 	}

+	if serviceClient.NeedRetry(resp.GetHeader().GetError(), err) {
+		protoClient, cctx := c.getClientAndContext(ctx)
+		if protoClient == nil {
+			return nil, errs.ErrClientGetProtoClient
+		}
+		resp, err = protoClient.GetStore(cctx, req)
+	}
+
 	if err = c.respForErr(metrics.CmdFailedDurationGetStore, start, err, resp.GetHeader()); err != nil {
♻️ Duplicate comments (2)
client/client.go (1)

1102-1121: Missing retry logic for router service failures.

Same inconsistency as GetStore - GetAllStores lacks the NeedRetry fallback pattern used by other region/store methods.

🐛 Proposed fix to add retry logic
 	if isRouterServiceClient {
 		resp, err = routerpb.NewRouterClient(serviceClient.GetClientConn()).GetAllStores(cctx, req)
 	} else {
 		resp, err = pdpb.NewPDClient(serviceClient.GetClientConn()).GetAllStores(cctx, req)
 	}

+	if serviceClient.NeedRetry(resp.GetHeader().GetError(), err) {
+		protoClient, cctx := c.getClientAndContext(ctx)
+		if protoClient == nil {
+			return nil, errs.ErrClientGetProtoClient
+		}
+		resp, err = protoClient.GetAllStores(cctx, req)
+	}
+
 	if err = c.respForErr(metrics.CmdFailedDurationGetAllStores, start, err, resp.GetHeader()); err != nil {
client/servicediscovery/router_service_discovery.go (1)

177-194: Log message is misleading - service discovery continues running.

The log message at line 182 says "stopping router service discovery" but the service actually continues running in a degraded mode. Based on learnings, returning nil is intentional to exit the retry loop, but the log should accurately reflect this behavior.

📝 Suggested fix
 		if errs.ErrClientGetMetaStorageClient.Equal(err) || errs.ErrClientGetProtoClient.Equal(err) {
-			log.Warn("[router service] meta storage client error, stopping router service discovery", errs.ZapError(err))
+			log.Warn("[router service] meta storage client error, skipping member update until client recovers", errs.ZapError(err))
 			return nil
 		}
🧹 Nitpick comments (1)
client/inner_client.go (1)

298-334: Consider simplifying the variadic storeOp parameter.

The method uses variadic storeOp ...*opt.GetStoreOp but only accesses storeOp[0]. While current callers pass valid options, if a caller passes a nil pointer as the first element, line 309 would panic. Consider accepting a single *opt.GetStoreOp for clarity and safety.

♻️ Suggested signature change
-func (c *innerClient) getServiceClient(ctx context.Context, regionOp *opt.GetRegionOp, storeOp ...*opt.GetStoreOp) (sd.ServiceClient, context.Context, bool) {
+func (c *innerClient) getServiceClient(ctx context.Context, regionOp *opt.GetRegionOp, storeOp *opt.GetStoreOp) (sd.ServiceClient, context.Context, bool) {
 	var (
 		serviceClient  sd.ServiceClient
 		mustLeader     bool
 		isRouterClient bool
 	)
 	allowRouterServiceHandle := regionOp.AllowRouterServiceHandle
-	if len(storeOp) > 0 {
-		allowRouterServiceHandle = allowRouterServiceHandle || storeOp[0].AllowRouterServiceHandle
+	if storeOp != nil {
+		allowRouterServiceHandle = allowRouterServiceHandle || storeOp.AllowRouterServiceHandle
 	}

Then update callers in client/client.go accordingly.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds support for client access to the router service, enabling PD clients to route region and store operations through the new router microservice. The implementation includes service discovery for router nodes, gRPC health checks, and intelligent client-side routing between PD and router services based on runtime context.

Changes:

  • Introduced router service discovery with metastorage integration for dynamic service endpoint discovery
  • Added gRPC health check support to microservices for improved service monitoring
  • Implemented client-side routing logic to intelligently select between router service and PD service based on operation type and availability

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 21 comments.

Show a summary per file
File Description
tests/integrations/mcs/router/server_test.go Added integration tests for router service with store and region API validation
client/servicediscovery/router_service_discovery.go New service discovery implementation for router service with health checks and connection management
client/inner_client.go Added service client selection logic to route requests between PD and router services
client/client.go Updated region and store API methods to support router service routing
client/clients/router/client.go Enhanced router client with multi-service support (PD and router service streams)
client/clients/router/request.go Added nil key handling for edge cases
client/opt/option.go Added options to control router service usage for region and store requests
pkg/mcs/utils/util.go Added gRPC health check server registration
client/clients/metastorage/client.go Added service discovery helper function for metastorage-based service registration
server/grpc_service.go Added failpoint for testing error scenarios
client/servicediscovery/service_discovery.go Updated log field names for clarity
client/servicediscovery/tso_service_discovery.go Updated log field names for clarity
client/pkg/connectionctx/manager.go Added Size method for connection count tracking
tests/integrations/mcs/discovery/register_test.go Added router service to discovery test suite

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: bufferflies <1045931706@qq.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@client/client.go`:
- Around line 1054-1063: When isRouterServiceClient is true and the router
GetStore call (routerpb.NewRouterClient(...).GetStore) fails, fall back to PD:
after receiving a non-nil err from the router call, call
c.inner.getServiceClient again to obtain a PD service client, then invoke
pdpb.NewPDClient(pdServiceClient.GetClientConn()).GetStore(cctx, req) and return
that response if successful; if the PD call also fails, return the original or
combined error. Update the logic around getServiceClient/isRouterServiceClient,
routerpb.NewRouterClient(...).GetStore, and pdpb.NewPDClient(...).GetStore to
implement this retry/fallback path.
♻️ Duplicate comments (8)
client/servicediscovery/router_service_discovery.go (3)

92-100: Clarify why APIKind is ignored.
GetServiceClientByKind drops the parameter; a short comment would prevent confusion for callers.


274-292: Verify sync.Map.Clear() availability.
sync.Map.Clear() isn’t available in all Go versions. Please confirm it exists in Go 1.25 for this repo; otherwise use Range/Delete to clear.

🧹 Alternative if Clear is unavailable
-	r.nodes.Clear()
+	r.nodes.Range(func(key, _ any) bool {
+		r.nodes.Delete(key)
+		return true
+	})

177-184: Log wording implies shutdown but discovery continues.
This path intentionally returns nil to skip retries; the log should reflect “skipping update” rather than “stopping discovery.”

💡 Suggested tweak
-			log.Warn("[router service] meta storage client error, stopping router service discovery", errs.ZapError(err))
+			log.Warn("[router service] meta storage client error, skipping member update", errs.ZapError(err))

Based on learnings, this return is intentional to exit the retry loop.

client/inner_client.go (2)

298-310: Guard against nil storeOp[0] in getServiceClient.
The variadic parameter only uses storeOp[0] and can panic if it’s nil. Either switch to a single pointer or guard the first element.

🛡️ Safe guard
-	if len(storeOp) > 0 {
-		allowRouterServiceHandle = allowRouterServiceHandle || storeOp[0].AllowRouterServiceHandle
-	}
+	if len(storeOp) > 0 && storeOp[0] != nil {
+		allowRouterServiceHandle = allowRouterServiceHandle || storeOp[0].AllowRouterServiceHandle
+	}

327-333: Fallback to leader should force mustLeader=true.
When router handling fails and you fall back to the leader, mustLeader stays false, so the fallback may not enforce leader routing.

🐛 Suggested fix
 	// Fallback to the leader client.
+	mustLeader = true
 	isRouterClient = false
 	serviceClient = c.serviceDiscovery.GetServiceClient()
client/client.go (3)

173-175: Set msDiscovery to nil after Close for consistency.
Other discovery fields are nulled after closing; doing the same here reduces risk of use-after-close.

🧹 Suggested tweak
 	if k.msDiscovery != nil {
 		k.msDiscovery.Close()
+		k.msDiscovery = nil
 	}

881-898: Don’t silently override AllowRouterServiceHandle.
ScanRegions forces the option to false, which can surprise callers. Consider returning an explicit error or documenting that ScanRegions never uses the router service.


1103-1118: Add PD fallback when router GetAllStores fails.
This mirrors the retry pattern used in region APIs and improves reliability.

🐛 Suggested fallback
 	if isRouterServiceClient {
 		resp, err = routerpb.NewRouterClient(serviceClient.GetClientConn()).GetAllStores(cctx, req)
 	} else {
 		resp, err = pdpb.NewPDClient(serviceClient.GetClientConn()).GetAllStores(cctx, req)
 	}
+
+	if serviceClient.NeedRetry(resp.GetHeader().GetError(), err) {
+		protoClient, cctx := c.getClientAndContext(ctx)
+		if protoClient == nil {
+			return nil, errs.ErrClientGetProtoClient
+		}
+		resp, err = protoClient.GetAllStores(cctx, req)
+	}
🧹 Nitpick comments (1)
client/servicediscovery/router_service_discovery.go (1)

46-76: Avoid storing contexts on the struct.
The struct keeps parentCtx/ctx as fields; project guidance says contexts shouldn’t be stored in structs. Consider storing only a cancel/base configuration and pass contexts into methods instead, or document why this pattern is required here.

As per coding guidelines, avoid storing contexts in structs.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Jan 19, 2026

@ystaticy: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Signed-off-by: bufferflies <1045931706@qq.com>
@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jan 20, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Jan 20, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-01-13 14:31:36.827246277 +0000 UTC m=+367940.889111185: ☑️ agreed by okJiang.
  • 2026-01-20 07:15:54.153017644 +0000 UTC m=+478181.766974530: ☑️ agreed by rleungx.

@ti-chi-bot ti-chi-bot bot added the approved label Jan 20, 2026
@bufferflies
Copy link
Contributor Author

/test pull-unit-test-next-gen-2

5 similar comments
@bufferflies
Copy link
Contributor Author

/test pull-unit-test-next-gen-2

@bufferflies
Copy link
Contributor Author

/test pull-unit-test-next-gen-2

@bufferflies
Copy link
Contributor Author

/test pull-unit-test-next-gen-2

@bufferflies
Copy link
Contributor Author

/test pull-unit-test-next-gen-2

@bufferflies
Copy link
Contributor Author

/test pull-unit-test-next-gen-2

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Jan 20, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: okJiang, rleungx, ystaticy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Jan 20, 2026
@ti-chi-bot ti-chi-bot bot merged commit 1b40b8d into tikv:master Jan 20, 2026
55 of 57 checks passed
bufferflies added a commit to bufferflies/pd that referenced this pull request Jan 21, 2026
close tikv#9212

Signed-off-by: bufferflies <1045931706@qq.com>
bufferflies added a commit to bufferflies/pd that referenced this pull request Jan 22, 2026
close tikv#9212

Signed-off-by: bufferflies <1045931706@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consider introducing the router service to alleviate pressure of PD

5 participants