-
Notifications
You must be signed in to change notification settings - Fork 753
Client: support pd client access to the router service #9925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Skipping CI for Draft Pull Request. |
a248f17 to
030b4e3
Compare
deb2c24 to
6365ae9
Compare
cee954e to
c18f37a
Compare
|
/retest |
edcaf60 to
d4c7e95
Compare
|
Is there any design doc for this PR? |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #9925 +/- ##
==========================================
+ Coverage 78.55% 78.74% +0.19%
==========================================
Files 520 522 +2
Lines 69658 70090 +432
==========================================
+ Hits 54721 55195 +474
+ Misses 10982 10926 -56
- Partials 3955 3969 +14
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
5d7b298 to
817d207
Compare
817d207 to
e70ec13
Compare
client/clients/router/client.go
Outdated
| routerSvcDiscovery sd.ServiceDiscovery | ||
| // routerConCtxMgr is used to store the context of the router service stream connection. | ||
| routerConCtxMgr *cctx.Manager[routerpb.Router_QueryRegionClient] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering that they are already in the router package, we can remove the router prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It‘s different, it just marks that this connection manager servers for the router microservice not router name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed as mcsDiscovery and mcsConCtxMgr
b4e650b to
3358e5a
Compare
2995531 to
114a041
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
client/client.go (1)
1045-1068: Missing retry logic for router service failures.Unlike
GetRegion,GetPrevRegion,GetRegionByID, andBatchScanRegions, theGetStoremethod lacks theNeedRetryfallback to PD when the router service fails. This inconsistency could cause reliability issues when the router service is unavailable.🐛 Proposed fix to add retry logic
if isRouterServiceClient { resp, err = routerpb.NewRouterClient(serviceClient.GetClientConn()).GetStore(cctx, req) } else { resp, err = pdpb.NewPDClient(serviceClient.GetClientConn()).GetStore(cctx, req) } + if serviceClient.NeedRetry(resp.GetHeader().GetError(), err) { + protoClient, cctx := c.getClientAndContext(ctx) + if protoClient == nil { + return nil, errs.ErrClientGetProtoClient + } + resp, err = protoClient.GetStore(cctx, req) + } + if err = c.respForErr(metrics.CmdFailedDurationGetStore, start, err, resp.GetHeader()); err != nil {
♻️ Duplicate comments (2)
client/client.go (1)
1102-1121: Missing retry logic for router service failures.Same inconsistency as
GetStore-GetAllStoreslacks theNeedRetryfallback pattern used by other region/store methods.🐛 Proposed fix to add retry logic
if isRouterServiceClient { resp, err = routerpb.NewRouterClient(serviceClient.GetClientConn()).GetAllStores(cctx, req) } else { resp, err = pdpb.NewPDClient(serviceClient.GetClientConn()).GetAllStores(cctx, req) } + if serviceClient.NeedRetry(resp.GetHeader().GetError(), err) { + protoClient, cctx := c.getClientAndContext(ctx) + if protoClient == nil { + return nil, errs.ErrClientGetProtoClient + } + resp, err = protoClient.GetAllStores(cctx, req) + } + if err = c.respForErr(metrics.CmdFailedDurationGetAllStores, start, err, resp.GetHeader()); err != nil {client/servicediscovery/router_service_discovery.go (1)
177-194: Log message is misleading - service discovery continues running.The log message at line 182 says "stopping router service discovery" but the service actually continues running in a degraded mode. Based on learnings, returning
nilis intentional to exit the retry loop, but the log should accurately reflect this behavior.📝 Suggested fix
if errs.ErrClientGetMetaStorageClient.Equal(err) || errs.ErrClientGetProtoClient.Equal(err) { - log.Warn("[router service] meta storage client error, stopping router service discovery", errs.ZapError(err)) + log.Warn("[router service] meta storage client error, skipping member update until client recovers", errs.ZapError(err)) return nil }
🧹 Nitpick comments (1)
client/inner_client.go (1)
298-334: Consider simplifying the variadicstoreOpparameter.The method uses variadic
storeOp ...*opt.GetStoreOpbut only accessesstoreOp[0]. While current callers pass valid options, if a caller passes anilpointer as the first element, line 309 would panic. Consider accepting a single*opt.GetStoreOpfor clarity and safety.♻️ Suggested signature change
-func (c *innerClient) getServiceClient(ctx context.Context, regionOp *opt.GetRegionOp, storeOp ...*opt.GetStoreOp) (sd.ServiceClient, context.Context, bool) { +func (c *innerClient) getServiceClient(ctx context.Context, regionOp *opt.GetRegionOp, storeOp *opt.GetStoreOp) (sd.ServiceClient, context.Context, bool) { var ( serviceClient sd.ServiceClient mustLeader bool isRouterClient bool ) allowRouterServiceHandle := regionOp.AllowRouterServiceHandle - if len(storeOp) > 0 { - allowRouterServiceHandle = allowRouterServiceHandle || storeOp[0].AllowRouterServiceHandle + if storeOp != nil { + allowRouterServiceHandle = allowRouterServiceHandle || storeOp.AllowRouterServiceHandle }Then update callers in
client/client.goaccordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds support for client access to the router service, enabling PD clients to route region and store operations through the new router microservice. The implementation includes service discovery for router nodes, gRPC health checks, and intelligent client-side routing between PD and router services based on runtime context.
Changes:
- Introduced router service discovery with metastorage integration for dynamic service endpoint discovery
- Added gRPC health check support to microservices for improved service monitoring
- Implemented client-side routing logic to intelligently select between router service and PD service based on operation type and availability
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 21 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/integrations/mcs/router/server_test.go | Added integration tests for router service with store and region API validation |
| client/servicediscovery/router_service_discovery.go | New service discovery implementation for router service with health checks and connection management |
| client/inner_client.go | Added service client selection logic to route requests between PD and router services |
| client/client.go | Updated region and store API methods to support router service routing |
| client/clients/router/client.go | Enhanced router client with multi-service support (PD and router service streams) |
| client/clients/router/request.go | Added nil key handling for edge cases |
| client/opt/option.go | Added options to control router service usage for region and store requests |
| pkg/mcs/utils/util.go | Added gRPC health check server registration |
| client/clients/metastorage/client.go | Added service discovery helper function for metastorage-based service registration |
| server/grpc_service.go | Added failpoint for testing error scenarios |
| client/servicediscovery/service_discovery.go | Updated log field names for clarity |
| client/servicediscovery/tso_service_discovery.go | Updated log field names for clarity |
| client/pkg/connectionctx/manager.go | Added Size method for connection count tracking |
| tests/integrations/mcs/discovery/register_test.go | Added router service to discovery test suite |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: bufferflies <1045931706@qq.com>
114a041 to
648cb85
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@client/client.go`:
- Around line 1054-1063: When isRouterServiceClient is true and the router
GetStore call (routerpb.NewRouterClient(...).GetStore) fails, fall back to PD:
after receiving a non-nil err from the router call, call
c.inner.getServiceClient again to obtain a PD service client, then invoke
pdpb.NewPDClient(pdServiceClient.GetClientConn()).GetStore(cctx, req) and return
that response if successful; if the PD call also fails, return the original or
combined error. Update the logic around getServiceClient/isRouterServiceClient,
routerpb.NewRouterClient(...).GetStore, and pdpb.NewPDClient(...).GetStore to
implement this retry/fallback path.
♻️ Duplicate comments (8)
client/servicediscovery/router_service_discovery.go (3)
92-100: Clarify whyAPIKindis ignored.
GetServiceClientByKinddrops the parameter; a short comment would prevent confusion for callers.
274-292: Verifysync.Map.Clear()availability.
sync.Map.Clear()isn’t available in all Go versions. Please confirm it exists in Go 1.25 for this repo; otherwise use Range/Delete to clear.🧹 Alternative if Clear is unavailable
- r.nodes.Clear() + r.nodes.Range(func(key, _ any) bool { + r.nodes.Delete(key) + return true + })
177-184: Log wording implies shutdown but discovery continues.
This path intentionally returnsnilto skip retries; the log should reflect “skipping update” rather than “stopping discovery.”💡 Suggested tweak
- log.Warn("[router service] meta storage client error, stopping router service discovery", errs.ZapError(err)) + log.Warn("[router service] meta storage client error, skipping member update", errs.ZapError(err))Based on learnings, this return is intentional to exit the retry loop.
client/inner_client.go (2)
298-310: Guard against nilstoreOp[0]ingetServiceClient.
The variadic parameter only usesstoreOp[0]and can panic if it’s nil. Either switch to a single pointer or guard the first element.🛡️ Safe guard
- if len(storeOp) > 0 { - allowRouterServiceHandle = allowRouterServiceHandle || storeOp[0].AllowRouterServiceHandle - } + if len(storeOp) > 0 && storeOp[0] != nil { + allowRouterServiceHandle = allowRouterServiceHandle || storeOp[0].AllowRouterServiceHandle + }
327-333: Fallback to leader should forcemustLeader=true.
When router handling fails and you fall back to the leader,mustLeaderstaysfalse, so the fallback may not enforce leader routing.🐛 Suggested fix
// Fallback to the leader client. + mustLeader = true isRouterClient = false serviceClient = c.serviceDiscovery.GetServiceClient()client/client.go (3)
173-175: SetmsDiscoveryto nil after Close for consistency.
Other discovery fields are nulled after closing; doing the same here reduces risk of use-after-close.🧹 Suggested tweak
if k.msDiscovery != nil { k.msDiscovery.Close() + k.msDiscovery = nil }
881-898: Don’t silently overrideAllowRouterServiceHandle.
ScanRegionsforces the option tofalse, which can surprise callers. Consider returning an explicit error or documenting that ScanRegions never uses the router service.
1103-1118: Add PD fallback when routerGetAllStoresfails.
This mirrors the retry pattern used in region APIs and improves reliability.🐛 Suggested fallback
if isRouterServiceClient { resp, err = routerpb.NewRouterClient(serviceClient.GetClientConn()).GetAllStores(cctx, req) } else { resp, err = pdpb.NewPDClient(serviceClient.GetClientConn()).GetAllStores(cctx, req) } + + if serviceClient.NeedRetry(resp.GetHeader().GetError(), err) { + protoClient, cctx := c.getClientAndContext(ctx) + if protoClient == nil { + return nil, errs.ErrClientGetProtoClient + } + resp, err = protoClient.GetAllStores(cctx, req) + }
🧹 Nitpick comments (1)
client/servicediscovery/router_service_discovery.go (1)
46-76: Avoid storing contexts on the struct.
The struct keepsparentCtx/ctxas fields; project guidance says contexts shouldn’t be stored in structs. Consider storing only a cancel/base configuration and pass contexts into methods instead, or document why this pattern is required here.As per coding guidelines, avoid storing contexts in structs.
|
@ystaticy: adding LGTM is restricted to approvers and reviewers in OWNERS files. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Signed-off-by: bufferflies <1045931706@qq.com>
|
/test pull-unit-test-next-gen-2 |
5 similar comments
|
/test pull-unit-test-next-gen-2 |
|
/test pull-unit-test-next-gen-2 |
|
/test pull-unit-test-next-gen-2 |
|
/test pull-unit-test-next-gen-2 |
|
/test pull-unit-test-next-gen-2 |
545a3f3 to
6539e88
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: okJiang, rleungx, ystaticy The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
close tikv#9212 Signed-off-by: bufferflies <1045931706@qq.com>
close tikv#9212 Signed-off-by: bufferflies <1045931706@qq.com>
What problem does this PR solve?
Issue Number: Close #9212
What is changed and how does it work?
Check List
Tests
Code changes
Side effects
Related changes
pingcap/docs/pingcap/docs-cn:pingcap/tiup:Release note
Summary by CodeRabbit
New Features
Improvements
Bug Fixes / Tests
✏️ Tip: You can customize this high-level summary in your review settings.