Skip to content

Conversation

@lhy1024
Copy link
Contributor

@lhy1024 lhy1024 commented Dec 10, 2025

What problem does this PR solve?

Issue Number: Ref #9764

What is changed and how does it work?

Check List

Tests

  • Unit test

Release note

None.

Summary by CodeRabbit

  • New Features

    • Added affinity groups API endpoints and a background watcher to keep affinity state synced.
    • Introduced middleware to forward affinity-related requests to the scheduling service.
  • Improvements

    • Better handling for label rules that arrive before their groups (buffering and reconciliation).
    • Centralized and expanded redirect matching logic and rule representation.
  • Tests

    • Added extensive tests for watcher/manager lifecycle, concurrency, and redirect matching.

@ti-chi-bot ti-chi-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. labels Dec 10, 2025
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Dec 10, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added dco-signoff: yes Indicates the PR's author has signed the dco. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 10, 2025
@codecov
Copy link

codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 76.69492% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.72%. Comparing base (a1f2017) to head (ef4e5a5).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10042      +/-   ##
==========================================
+ Coverage   78.62%   78.72%   +0.10%     
==========================================
  Files         520      522       +2     
  Lines       70089    70270     +181     
==========================================
+ Hits        55105    55318     +213     
+ Misses      10998    10946      -52     
- Partials     3986     4006      +20     
Flag Coverage Δ
unittests 78.72% <76.69%> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lhy1024 lhy1024 force-pushed the affinity-pr6 branch 2 times, most recently from 079ad98 to 4dc8ea5 Compare December 11, 2025 10:07
@lhy1024 lhy1024 marked this pull request as ready for review December 11, 2025 10:07
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 11, 2025
@lhy1024
Copy link
Contributor Author

lhy1024 commented Dec 11, 2025

/retest

1 similar comment
@lhy1024
Copy link
Contributor Author

lhy1024 commented Dec 12, 2025

/retest

@lhy1024 lhy1024 force-pushed the affinity-pr6 branch 4 times, most recently from 0720b5d to 64d7903 Compare December 13, 2025 14:49
lhy1024 added a commit to lhy1024/pd that referenced this pull request Dec 14, 2025
Signed-off-by: lhy1024 <admin@liudos.us>
lhy1024 added a commit to lhy1024/pd that referenced this pull request Dec 14, 2025
Signed-off-by: lhy1024 <admin@liudos.us>
lhy1024 added a commit to lhy1024/pd that referenced this pull request Dec 14, 2025
Signed-off-by: lhy1024 <admin@liudos.us>
lhy1024 added a commit to lhy1024/pd that referenced this pull request Dec 14, 2025
Signed-off-by: lhy1024 <admin@liudos.us>
lhy1024 added a commit to lhy1024/pd that referenced this pull request Dec 15, 2025
Signed-off-by: lhy1024 <admin@liudos.us>
}

// AffinityMicroserviceRedirector only forwards affinity GET requests to the scheduling service.
func AffinityMicroserviceRedirector() gin.HandlerFunc {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not in server/api/server.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a middleware for v2

@lhy1024
Copy link
Contributor Author

lhy1024 commented Dec 15, 2025

/retest

labelRule = m.labelRuleBuffer[group.ID]
}
// Once group created, the buffer must be deleted.
delete(m.labelRuleBuffer, group.ID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to delete after updateGroupLabelRuleLockedWithCount?

return
}
// Attach only non-empty label rules to avoid marking the group as having ranges when it does not.
if len(gkr.KeyRanges) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If len(keyRanges) is a default limit, can we put it inside updateGroupLabelRuleLockedWithCount

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the check inside would break the other call.

// Store LabelRule information in the buffer when it is synchronized before the group.
m.labelRuleBuffer[groupID] = labelRule
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about returning direct when !exist && len(gkr.KeyRanges) ==0?

_, err = client.Put(ctx, labelKey, string(labelValue))
re.NoError(err)

time.Sleep(100 * time.Millisecond)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to use Eventually

@lhy1024
Copy link
Contributor Author

lhy1024 commented Jan 8, 2026

@coderabbitai full review

@lhy1024
Copy link
Contributor Author

lhy1024 commented Jan 29, 2026

@bufferflies @okJiang Friendly ping

var toDelete []string
err := l.storage.LoadRegionRules(func(k, v string) {
r, err := NewLabelRuleFromJSON([]byte(v))
r, err := NewLabelRuleFromJSONWithoutCheck([]byte(v))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to check anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, NewLabelRuleFromJSONWithoutCheck is origin NewLabelRuleFromJSON

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move it

)

// MicroserviceRedirector creates a middleware to rewrite and forward requests to a microservice primary.
func MicroserviceRedirector(rules ...serverapi.RedirectRule) gin.HandlerFunc {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have many middleware in different packages, is it possible to reorganize them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can do it in another pr?

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pkg/mcs/scheduling/server/affinity/watcher.go`:
- Around line 45-56: Update the Watcher struct to use a pointer waitgroup and
remove the stored context: change wg from type sync.WaitGroup to *sync.WaitGroup
and remove the ctx field; initialize it as &sync.WaitGroup{} in NewWatcher. Add
a ctx context.Context parameter to initializeGroupWatcher and
initializeAffinityLabelWatcher and, inside those functions, pass the ctx
argument (not w.ctx) and w.wg (not &w.wg) when calling etcdutil.NewLoopWatcher.
Ensure all method signatures and call sites that used w.ctx are updated to
accept and forward the context parameter.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pkg/mcs/scheduling/server/affinity/watcher_test.go`:
- Around line 199-279: The test TestConcurrentLabelRuleUpdates uses a value
WaitGroup which can be copied; change the declaration of wg to a pointer and
initialize it (e.g. wg := &sync.WaitGroup{}) so all goroutines use the same
*sync.WaitGroup instance; keep all existing wg.Add, wg.Done and wg.Wait calls
but ensure wg is created before the first Add to avoid nil dereference.
🧹 Nitpick comments (1)
pkg/mcs/scheduling/server/apis/v1/api.go (1)

1617-1634: Use errcode/errorResp helpers for affinity API errors.
These handlers return raw error strings via gin aborts; align them with the standard error response helpers for consistent API contracts.
As per coding guidelines: HTTP handlers in Go must use errcode and errorResp; avoid http.Error.

Copy link
Contributor

@bufferflies bufferflies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest lgtm

Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@server/apiv2/middlewares/microservice_redirector.go`:
- Around line 32-63: The MicroserviceRedirector middleware is not preserving
client addressing headers before proxying; update MicroserviceRedirector to set
X-Forwarded-For and X-Forwarded-Port on c.Request.Header just before calling
apiutil.NewCustomReverseProxies(...).ServeHTTP: obtain the client IP from
c.Request.RemoteAddr (or RemoteAddr split by ':'), append it to any existing
X-Forwarded-For value (comma-separated) and set X-Forwarded-Port from the
request's port (use r.URL.Port() if present, otherwise extract port from
RemoteAddr or infer 80/443), then proceed to create the reverse proxy; reference
the MicroserviceRedirector function, c.Request, apiutil.NewCustomReverseProxies,
and headers X-Forwarded-For / X-Forwarded-Port.
🧹 Nitpick comments (6)
pkg/mcs/scheduling/server/affinity/watcher_test.go (1)

234-242: Silently ignored putLabelRule error in concurrent goroutine.

If putLabelRule fails, the test won't detect it and may pass or hang on the Eventually call with a confusing timeout. Consider collecting errors or using require within the goroutine (with t captured safely).

♻️ Proposed fix to capture errors
 	for i := range groupCount {
 		wg.Add(1)
 		go func(idx int) {
 			defer wg.Done()
 			labelRule := makeTestLabelRule(groups[idx].ID,
 				"7480000000000000ff"+string(rune('0'+idx))+"000000000000000f8",
 				"7480000000000000ff"+string(rune('0'+idx))+"100000000000000f8",
 			)
-			_ = putLabelRule(ctx, client, labelRule)
+			if err := putLabelRule(ctx, client, labelRule); err != nil {
+				t.Logf("putLabelRule failed for group %d: %v", idx, err)
+			}
 		}(i)
 	}
pkg/mcs/scheduling/server/affinity/watcher.go (1)

84-116: Group watcher putFn also returns errors on unmarshal failure — consider the same treatment.

Same concern as with the label watcher: if an invalid group JSON value exists in etcd, putFn returns the unmarshal error which may block the watcher. This is less likely since the group path is more specific, but the same defensive approach (log + return nil) could be applied for consistency.

pkg/utils/apiutil/serverapi/middleware.go (2)

143-145: r.URL.Path is mutated (trailing-slash stripped) even when no rule matches.

Line 145 unconditionally strips trailing slashes from the request URL before rule matching. If no rule matches, the caller's request is still mutated. This was pre-existing behavior in the private method, but now that MatchMicroserviceRedirect is a public API consumed by multiple callers (including the new gin middleware), this side effect is more visible and could surprise callers who expect an unmatched request to be unmodified.

Consider guarding the mutation so it only applies when a match is found, or documenting this side effect prominently in the function's doc comment.


164-182: Path rewriting with RawPath: prefix trimming uses the non-raw MatchPath.

On Line 176, when r.URL.RawPath is non-empty, path is set to the raw (percent-encoded) value but the TrimPrefix still uses rule.MatchPath (the non-encoded form). If the match path ever contains characters that appear differently in the raw form (e.g., spaces encoded as %20), the prefix won't be trimmed correctly.

In practice, route prefixes like /pd/api/v1/operators are plain ASCII, so this is unlikely to bite today — but it's worth keeping in mind if more exotic paths are introduced later.

pkg/utils/apiutil/serverapi/middleware_test.go (1)

36-147: Good edge-case coverage, but the happy-path (successful redirect with path rewriting) is missing.

All five tests exercise conditions that prevent or short-circuit a redirect. Consider adding at least one test that:

  1. Verifies a successful match returns (true, expectedAddr).
  2. Asserts that r.URL.Path is correctly rewritten (e.g., /pd/api/v1/operators/123/scheduling/api/v1/operators/123).
  3. Covers multiple rules where the first doesn't match and the second does.

This would guard the core redirect + path-rewriting logic against regressions.

server/apiv2/middlewares/microservice_redirector.go (1)

49-52: Consider using a structured error response instead of a plain string.

c.AbortWithStatusJSON here serializes the .Error() string directly as the JSON body, producing a bare JSON string (e.g., "redirect failed"). If the rest of the v2 API uses a structured error envelope (e.g., {"code": ..., "message": ...}), this would be inconsistent. The same applies to line 57.

@lhy1024
Copy link
Contributor Author

lhy1024 commented Feb 9, 2026

/retest

Signed-off-by: lhy1024 <admin@liudos.us>
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Feb 9, 2026
@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Feb 9, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bufferflies, okJiang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [bufferflies,okJiang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 9, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-02-09 06:50:19.064302673 +0000 UTC m=+167634.758442533: ☑️ agreed by okJiang.
  • 2026-02-09 07:10:30.634713115 +0000 UTC m=+168846.328852955: ☑️ agreed by bufferflies.

@lhy1024
Copy link
Contributor Author

lhy1024 commented Feb 9, 2026

/retest

1 similar comment
@lhy1024
Copy link
Contributor Author

lhy1024 commented Feb 9, 2026

/retest

@ti-chi-bot ti-chi-bot bot merged commit 31fc48f into tikv:master Feb 9, 2026
32 checks passed
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this pull request Feb 9, 2026
ref tikv#9764

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #10232.
But this PR has conflicts, please resolve them!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants