-
Notifications
You must be signed in to change notification settings - Fork 753
tso:update tso timestamp when met logic exhaust #10137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
2a11510 to
ecda224
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
📝 WalkthroughWalkthroughTSO physical-update logic gained option-based guards and an overflow indicator propagated from updateTimestamp through allocator retries; a metric for skipped-save events was added; unit tests were introduced to validate overflow, guarded physical updates, save gating, and concurrent timestamp allocation. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant TSO as timestampOracle
participant Storage as persistentStore
participant Metrics
Client->>TSO: getTS(count)
TSO->>TSO: compute next TS (physical + logical)
alt no overflow & persist not needed
TSO-->>Client: return TS
else need to advance physical
TSO->>TSO: updateTimestamp(updatePurpose)
alt save allowed
TSO->>Storage: persist next physical
Storage-->>TSO: persist OK
TSO->>TSO: setTSOPhysical(next, withInitialCheck(), withLogicalOverflowCheck())
else save not allowed
TSO->>Metrics: notAllowedSaveTimestampEvent.Inc()
TSO->>TSO: setTSOPhysical(next, withInitialCheck(), withLogicalOverflowCheck())
end
TSO-->>Client: return TS (or error + overflow flag)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: bufferflies <1045931706@qq.com>
ecda224 to
4b29385
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @pkg/tso/allocator.go:
- Line 170: The code reads a.timestampOracle.tsoMux.physical without holding
tsoMux, causing a data race; change the call to obtain the physical timestamp
via a safe accessor (e.g., call the existing timestampOracle.getTSO() or another
helper that reads tsoMux under lock) and pass that value into
a.timestampOracle.updateTimestamp(..., true) instead of directly accessing
tsoMux.physical; update the call site where updateTimestamp is invoked to use
the safe getter (referring to symbols a.timestampOracle.updateTimestamp and
timestampOracle.getTSO).
In @pkg/tso/tso_test.go:
- Line 124: The assertion comparing totalTso.Load()/int32(maxLogical)+1 to
changes.Load() is flaky because only the goroutine with i == 0 increments
changes and scheduling may miss physical time transitions; update the test to
either relax the exact equality (e.g., assert changes.Load() >=
expectedLowerBound or within a small tolerance) or replace with an
eventually-style check that waits for changes to reach an acceptable value, and
add a brief comment near totalTso, maxLogical, and changes explaining why the
relationship is approximate and which variance is acceptable.
- Around line 100-120: The test reads timestampOracle.tsoMux.physical directly
from multiple goroutines causing a data race; switch to using the safe accessor
timestampOracle.getTSO() (or another exported method that acquires the tsoMux
lock) wherever the code reads tsoMux.physical (both the initial assignment and
the inside-if compare) so reads are protected, i.e., replace direct references
to timestampOracle.tsoMux.physical with timestampOracle.getTSO() results before
comparing and updating the local physical variable while leaving getTS calls
unchanged.
In @pkg/tso/tso.go:
- Line 211: The code reads t.tsoMux.physical without holding the mutex, causing
a potential data race; change the call to use the safe accessor getTSO() (or
otherwise acquire the tsoMux lock) and pass its physical value into
setTSOPhysical (e.g., replace t.tsoMux.physical with t.getTSO().physical or
fetch ts := t.getTSO(); then call t.setTSOPhysical(ts.physical, next, true)) so
the current physical time is read under the mutex.
🧹 Nitpick comments (1)
pkg/tso/tso_test.go (1)
30-30: Typo in mock name: "MokElection" should be "MockElection".Suggested fix
-type MokElection struct{} +type MockElection struct{}Also update references on lines 60 and 89.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
pkg/tso/allocator.gopkg/tso/metrics.gopkg/tso/tso.gopkg/tso/tso_test.go
🧰 Additional context used
🧬 Code graph analysis (2)
pkg/tso/metrics.go (1)
pkg/schedule/filter/counter.go (1)
Counter(113-120)
pkg/tso/tso.go (2)
pkg/errs/errno.go (1)
ErrUpdateTimestamp(104-104)pkg/utils/logutil/log.go (1)
CondUint32(269-274)
🪛 golangci-lint (2.5.0)
pkg/tso/tso_test.go
[error] 36-36: : # github.com/tikv/pd/pkg/dashboard/uiserver
pkg/dashboard/uiserver/embedded_assets_rewriter.go:36:26: undefined: assets
pkg/dashboard/uiserver/embedded_assets_rewriter.go:37:13: undefined: vfsgen۰FS
pkg/dashboard/uiserver/embedded_assets_rewriter.go:39:15: undefined: vfsgen۰CompressedFileInfo
pkg/dashboard/uiserver/embedded_assets_rewriter.go:47:9: undefined: assets
(typecheck)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: chunks (8, TSO Integration Test)
- GitHub Check: chunks (7, Client Integration Test)
- GitHub Check: chunks (6, Tools Test)
- GitHub Check: chunks (9, Microservice Integration(!TSO))
- GitHub Check: chunks (3, Unit Test(3))
- GitHub Check: chunks (5, Tests(2))
- GitHub Check: chunks (1, Unit Test(1))
- GitHub Check: chunks (10, Microservice Integration(TSO))
- GitHub Check: chunks (2, Unit Test(2))
- GitHub Check: chunks (4, Tests(1))
- GitHub Check: statics
🔇 Additional comments (7)
pkg/tso/metrics.go (1)
118-118: LGTM!The new
notAllowedSaveTimestampEventmetric follows the established patterns in this file and provides useful observability for the new save-throttling behavior.Also applies to: 155-155
pkg/tso/tso.go (6)
52-52: LGTM on the new throttling interval constant.The
waitUpdateTSOIntervalof 50ms provides reasonable backoff between retry attempts when logical overflow occurs.
94-111: Thepreparameter design for non-regression is reasonable.The new signature allows callers to pass their observed physical time, and the check at lines 102-105 (inside the lock) safely determines if an update has already occurred. This enables the optimistic concurrency pattern where multiple callers can attempt updates without blocking each other unnecessarily.
123-135: LGTM ongenerateTSOreturning current physical time.Returning the current physical time allows callers to capture a consistent view of the physical time at generation, which is then used as
prein subsequentupdateTimestampcalls. This is key to the new retry logic.
352-356: Correct handling of storage save restriction.When
allowSaveStorageis false and a save is needed, the method correctly returns an error and increments the new metric. This enforces the design goal of restricting storage writes to the ticker-based updates only.
405-408: Verify the error type comparison logic.The condition
!errs.ErrUpdateTimestamp.Equal(err)suppresses the warning log when the error is specificallyErrUpdateTimestamp(e.g., "not allowed to save storage"). This seems intentional to avoid noisy logs during normal operation whengetTStriggers a retry due to storage save restrictions.However, this also suppresses warnings for other
ErrUpdateTimestamperrors (e.g., "timestamp not initialized"). Consider whether you want to differentiate these cases or if the current behavior is acceptable.
299-371: OverallupdateTimestampchanges align with PR objectives.The refactored method correctly:
- Accepts
prefor optimistic concurrency control- Respects
allowSaveStorageto throttle storage writes- Uses the new metric for observability
- Passes
pretosetTSOPhysicalfor non-regression checksThe logic flow is sound for the stated goal of advancing TSO physical time during
GetTSrequests while restricting storage persistence to ticker-based updates.
pkg/tso/tso.go
Outdated
| if t.tsoMux.physical.After(pre) { | ||
| return | ||
| } | ||
| // make sure the ts won't fall back | ||
| if typeutil.SubTSOPhysicalByWallClock(next, t.tsoMux.physical) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between the two comparisons?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first comparison ensures that the t.tsoMux.physical doesn't change when getting Tso .
The second comparison ensures that the logical physical can't rollback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused why they use different functions for comparison. Is it worth unifying the code style?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tso/tso.go (1)
95-111: Critical: Race condition allows lost updates when physical time is updated concurrently.The race condition claim is valid. The pattern of capturing
previagetTSO()without holding the lock, then later callingsetTSOPhysical(pre, next, ...)creates a window whereprebecomes stale. If another goroutine updatestsoMux.physicalbetween thegetTSO()call (line 304) andsetTSOPhysical()'s lock acquisition (line 96), theprevalue becomes outdated.The early return on line 103-104 (
if t.tsoMux.physical.After(pre) { return }) then skips applyingnexteven whennext > t.tsoMux.physical(as verified by the check on line 107). This causes valid timestamp advancement to be lost.Concrete example with actual call flow:
- Line 304:
prevPhysical, prevLogical := t.getTSO()returns T1- Between line 304-358: Another goroutine calls
setTSOPhysical(T1, T2, ...), updates physical to T2- Line 358: Current goroutine calls
setTSOPhysical(T1, T3, ...)where T3 > T2- Line 103: Sees
physical(T2).After(pre(T1))is true, returns early without updating to T3- Result: Physical remains at T2 instead of advancing to T3
This pattern appears in:
updateTimestamp()flow (lines 304-358)syncTimestamp()at line 211:current, _ := t.getTSO()followed byt.setTSOPhysical(current, next, true)UpdateTSO()in allocator.go lines 170-171The lost update could cause TSO physical time to lag behind computed values, violating expected monotonic progression guarantees. Implement Option 1 from the review (atomic compare-and-swap returning success/failure) or Option 2 (compute next inside the critical section) to eliminate this race.
🤖 Fix all issues with AI agents
In @pkg/tso/tso_test.go:
- Line 30: Rename the mock type from MokElection to MockElection and update
every usage in the test file accordingly (replace MokElection occurrences at the
type declaration and all references, e.g., where the mock is instantiated or
type-asserted around the election variable on the referenced lines). Ensure
constructor names, variable declarations, type assertions, and any interface
implementations use MockElection so the test compiles.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
pkg/tso/allocator.gopkg/tso/tso.gopkg/tso/tso_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/tso/tso_test.go (1)
pkg/election/leadership.go (1)
Leadership(60-79)
🪛 golangci-lint (2.5.0)
pkg/tso/tso_test.go
[error] 36-36: : # github.com/tikv/pd/pkg/dashboard/uiserver
pkg/dashboard/uiserver/embedded_assets_rewriter.go:36:26: undefined: assets
pkg/dashboard/uiserver/embedded_assets_rewriter.go:37:13: undefined: vfsgen۰FS
pkg/dashboard/uiserver/embedded_assets_rewriter.go:39:15: undefined: vfsgen۰CompressedFileInfo
pkg/dashboard/uiserver/embedded_assets_rewriter.go:47:9: undefined: assets
(typecheck)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: chunks (10, Microservice Integration(TSO))
- GitHub Check: chunks (7, Client Integration Test)
- GitHub Check: chunks (1, Unit Test(1))
- GitHub Check: chunks (3, Unit Test(3))
- GitHub Check: chunks (9, Microservice Integration(!TSO))
- GitHub Check: chunks (8, TSO Integration Test)
- GitHub Check: chunks (4, Tests(1))
- GitHub Check: chunks (5, Tests(2))
- GitHub Check: chunks (2, Unit Test(2))
- GitHub Check: chunks (6, Tools Test)
- GitHub Check: statics
- GitHub Check: tso-function-test
- GitHub Check: chunks
🔇 Additional comments (9)
pkg/tso/allocator.go (1)
170-171: LGTM: Ticker correctly retrieves current TSO and enables storage saves.The UpdateTSO method now retrieves the current physical time via getTSO() and passes it to updateTimestamp along with allowSaveStorage=true. This aligns with the PR's design where only the ticker (allocatorUpdater) is permitted to persist timestamps to storage, while GetTS requests advance the physical time in memory only.
pkg/tso/tso_test.go (3)
46-73: Test logic correctly validates storage-save prerequisites.TestGenerateTSO verifies that getTS fails when the timestamp window requires extending but storage saves are not allowed (line 64), then succeeds after simulating a storage save by advancing lastSavedTime (line 69). This effectively tests the new allowSaveStorage control flow.
75-126: Verify assertion correctness under concurrent access.The assertion on line 125 assumes a linear relationship between total TSO count and physical time changes:
totalTso.Load()/int32(maxLogical)+1 == changes.Load(). However, with 100 concurrent goroutines and potential retries due to logical overflow, this assertion may be flaky:
- Concurrent goroutines may observe the same physical time value multiple times before it changes
- Retries and timing variations could cause the actual number of physical changes to differ from the calculated expectation
- The test only tracks changes in goroutine 0 (line 112), which may not observe all physical time transitions if another goroutine updates it first
Consider whether this assertion should use an approximate comparison (e.g.,
changes.Load()is within a reasonable range oftotalTso.Load()/maxLogical) or whether the test setup needs refinement to ensure deterministic behavior.If the assertion is intentionally exact, please verify through repeated test runs that it doesn't produce intermittent failures.
36-36: Note: Static analysis error is unrelated to this PR.The typecheck error shown in static analysis hints references
pkg/dashboard/uiserver/embedded_assets_rewriter.go, which is not part of the files under review. This appears to be a pre-existing build issue or missing code generation step, not caused by the TSO changes in this PR.pkg/tso/tso.go (5)
52-52: LGTM: Throttle constant for retry delays.The waitUpdateTSOInterval constant provides a reasonable 50ms delay when retrying after logical overflow, preventing tight retry loops while keeping latency acceptable.
123-135: LGTM: generateTSO now returns current physical time.Returning the current physical time allows callers to detect whether the timestamp was updated between their initial check and the generateTSO call. This supports the non-regression logic and enables the overflow handling in getTS.
211-212: Note: syncTimestamp subject to the same race condition as other callers.The syncTimestamp method retrieves the current TSO via getTSO() and passes it to setTSOPhysical(). This call site exhibits the same race condition discussed in the review of setTSOPhysical (lines 95-111), where another goroutine could update the physical time between getTSO() and setTSOPhysical(), potentially causing the update to be incorrectly skipped.
300-373: Verify storage persistence strategy aligns with crash recovery requirements.The updateTimestamp method now accepts an allowSaveStorage parameter:
- When true (ticker path): saves to etcd and updates lastSavedTime
- When false (GetTS overflow path): only updates in-memory state
Potential concern: If the MET goroutine (ticker) is delayed due to scheduler pressure (the original problem this PR addresses), and many GetTS requests trigger in-memory physical time updates (line 371) without saving to storage, the gap between the in-memory timestamp and lastSavedTime could grow significantly. If the process crashes before the next ticker-triggered save:
- More timestamp values would be lost (larger regression window)
- After restart, syncTimestamp would reload the older lastSavedTime from etcd
- Timestamp monotonicity is preserved but with a larger backward jump than before
Please confirm this trade-off is acceptable for the system's crash recovery semantics. Consider:
- What is the maximum acceptable timestamp regression window after a crash?
- Should there be a threshold where GetTS requests are allowed/required to save to storage even with allowSaveStorage=false?
- Does the monitoring (notAllowedSaveTimestampEvent metric) provide sufficient observability for this behavior?
Based on learnings, crash recovery semantics are critical for distributed transaction correctness.
377-420: Client impact: Logical overflow now introduces 50ms delay.When logical time overflows (line 400-410), the new behavior:
- Calls updateTimestamp(physical, false) to advance physical time without saving to storage
- Sleeps for waitUpdateTSOInterval (50ms) regardless of success
- Retries the entire getTS operation
Client-visible change: Clients requesting TSOs when logical overflow occurs will experience at least a 50ms latency increase on that request, even if updateTimestamp succeeds in advancing the physical time.
Consider the following:
Performance impact: Is 50ms acceptable for production workloads? High-throughput applications generating TSOs near maxLogical boundaries could see increased p99 latencies.
Necessity of sleep: If updateTimestamp successfully advances the physical time (even when returning an error about storage save), does the client need to sleep before retrying? Could the retry proceed immediately after a successful in-memory update?
Alternative: Would it be better to spin with a shorter delay or check if physical time was actually advanced before sleeping?
This behavior change should be documented in the release notes as a potential performance regression for workloads with high TSO allocation rates.
Based on learnings, TSO allocation latency directly impacts transaction throughput.
pkg/tso/tso.go
Outdated
| } | ||
| // Get a new TSO result with the given count | ||
| resp.Physical, resp.Logical = t.generateTSO(ctx, int64(count)) | ||
| resp.Physical, resp.Logical, physical = t.generateTSO(ctx, int64(count)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generateTSO returns physical, logical, current, and here they become resp.Physical, resp.Logical, physical. Please standardize the namings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good, add comment in the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe
| resp.Physical, resp.Logical, physical = t.generateTSO(ctx, int64(count)) | |
| resp.Physical, resp.Logical, current = t.generateTSO(ctx, int64(count)) |
Signed-off-by: bufferflies <1045931706@qq.com>
96efdf0 to
a262d63
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/tso/tso.go (1)
52-420: Suggest monitoring new metrics to validate behavioral changes.This PR introduces significant complexity to enable GetTS-driven physical time advancement alongside ticker-driven updates. To validate the changes in production and detect potential issues early:
Monitor these metrics:
notAllowedSaveTimestampEvent(line 355): Tracks GetTS attempts to advance physical when time window is exhausted. High values indicate GetTS frequently hitting window limits, suggesting either:
saveIntervalis too small relative to TSO allocation rate- Ticker is being delayed (the problem this PR addresses)
logicalOverflowEvent(line 405): Tracks how often logical exhaustion forces physical advancement. Increases indicate higher TSO allocation rates or longer intervals between physical updates.
exceededMaxRetryEvent(line 418): Critical - indicates GetTS failed after 10 retries (~500ms of failures). Should be rare; frequent occurrences suggest severe scheduler/system issues.Recommended alert thresholds:
- Alert if
exceededMaxRetryEvent> 0 (critical - TSO allocation failures)- Alert if
notAllowedSaveTimestampEventrate is consistently high (may indicate ticker scheduling issues)Configuration tuning: The interaction between
tso-save-interval,tso-update-physical-interval, and TSO request rate determines whether GetTS can operate within existing time windows. IfnotAllowedSaveTimestampEventis high, consider increasingtso-save-intervalto create larger time windows.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
pkg/tso/allocator.gopkg/tso/tso.gopkg/tso/tso_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/tso/tso_test.go (1)
pkg/election/leadership.go (1)
Leadership(60-79)
🪛 golangci-lint (2.5.0)
pkg/tso/tso_test.go
[error] 36-36: : # github.com/tikv/pd/pkg/dashboard/uiserver
pkg/dashboard/uiserver/embedded_assets_rewriter.go:36:26: undefined: assets
pkg/dashboard/uiserver/embedded_assets_rewriter.go:37:13: undefined: vfsgen۰FS
pkg/dashboard/uiserver/embedded_assets_rewriter.go:39:15: undefined: vfsgen۰CompressedFileInfo
pkg/dashboard/uiserver/embedded_assets_rewriter.go:47:9: undefined: assets
(typecheck)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: chunks (8, TSO Integration Test)
- GitHub Check: statics
- GitHub Check: chunks (10, Microservice Integration(TSO))
- GitHub Check: chunks (6, Tools Test)
- GitHub Check: chunks (3, Unit Test(3))
- GitHub Check: chunks (4, Tests(1))
- GitHub Check: chunks (5, Tests(2))
- GitHub Check: chunks (1, Unit Test(1))
- GitHub Check: chunks (7, Client Integration Test)
- GitHub Check: chunks (9, Microservice Integration(!TSO))
- GitHub Check: chunks (2, Unit Test(2))
- GitHub Check: tso-function-test
- GitHub Check: chunks
🔇 Additional comments (11)
pkg/tso/allocator.go (1)
170-171: LGTM! Ticker-driven path correctly propagates current physical time.The updated logic fetches the current physical time before calling
updateTimestamp(current, true), which aligns with the PR objective to control timestamp persistence. TheallowSaveStorage=trueflag ensures only the ticker path (this function) persists to etcd, while GetTS-driven updates skip storage saves.The retry loop correctly handles transient etcd failures, and the pre-time guard in
setTSOPhysical(lines 103-105 in tso.go) prevents race conditions where another goroutine updates the physical time betweengetTSO()andupdateTimestamp().pkg/tso/tso_test.go (3)
30-44: LGTM! Mock implementation is sufficient for unit tests.The
MockElectionprovides minimal stubs to satisfy the election interface requirements in tests. The key behavior isIsServing()returningtrue, which allows timestamp generation logic to proceed without leadership-related blocking.
46-73: LGTM! Test validates storage-save gating mechanism.This test correctly verifies the PR's core objective:
getTScannot advance the physical time when it requires persisting to storage (allowSaveStorage=falsein the GetTS path). The test demonstrates:
- Line 64:
getTSfails when logical overflow requires physical advancement but storage save is not allowed (lastSavedTime not set)- Line 69: Simulating a prior storage save by advancing lastSavedTime
- Line 70:
getTSsucceeds and advances physical time when sufficient time window existsThe pattern validates that only the ticker (with
allowSaveStorage=true) persists timestamps, while GetTS-driven updates work within the saved time window.
75-126: Concurrent stress test correctly validates logical overflow and physical time advancement.This test exercises concurrent TSO generation under load, verifying that physical time advances when logical exhaustion is reached. The test is well-designed:
lastSavedTimeis set tocurrent.Add(5s), creating a 5-second buffer that prevents storage save requirements during the 4-second test run. Whenlogical >= maxLogical,updateTimestampsucceeds without needing storage, advances physical time by 1ms, and resets logical to 0 viasetTSOPhysical, allowing the next maxLogical allocations before the next overflow. The assertiontotalTso.Load()/int32(maxLogical)+1 == changes.Load()correctly validates this relationship.pkg/tso/tso.go (7)
52-52: LGTM! Throttling constant prevents tight retry loops.The
waitUpdateTSOIntervalconstant introduces controlled backoff whengetTSmust retry after attempting to advance physical time on logical overflow. The 50ms value aligns withupdateTSORetryIntervalused in allocator.go, providing consistent retry timing across the TSO subsystem.
94-111: LGTM! Pre-time guard prevents concurrent update races.The updated signature adds a
preparameter representing the physical time observed before the update, and introduces a critical guard at lines 103-105:if t.tsoMux.physical.After(pre) { return }This prevents stale updates when multiple goroutines (ticker and GetTS paths) attempt to advance physical time concurrently. If physical time has already been advanced past
preby another goroutine, this update is skipped, ensuring:
- Monotonicity: Physical time never regresses
- Idempotency: Redundant updates are safely ignored
- Correctness: The most recent advancement wins
This design is essential for the PR objective of allowing GetTS to advance physical time alongside the ticker without coordination overhead.
122-135: LGTM! Extended return value enables pre-time propagation.The signature change adds a third return value
current time.Time(line 123), representing the physical time at the moment of TSO generation. This enables the critical design pattern wheregetTScan:
- Generate a TSO and capture its physical time (line 396 in getTS)
- If logical overflow occurs, pass that physical time as
pretoupdateTimestamp(line 406)- The
setTSOPhysicalguard then prevents stale updates if another goroutine advanced time firstCapturing
currentunder lock (line 133) ensures consistency with the returned physical/logical values, maintaining TSO correctness guarantees.
211-212: Consistent adoption of pre-time pattern in initialization.The
syncTimestampfunction now fetches the current physical time and passes it aspretosetTSOPhysical, maintaining consistency with the updated signature. During normal initialization, physical time should beZeroTime, and theforce=trueflag ensures the update proceeds (line 99 allows forced updates regardless of pre-time).This defensive pattern makes the code more robust if
syncTimestampis called when physical time is already initialized.
284-373: LGTM! Storage-save gating mechanism correctly implements PR objectives.The updated signature adds two parameters that implement the core PR design:
pre time.Time(line 285): Previous physical time passed tosetTSOPhysicalguard, preventing stale updatesallowSaveStorage bool(line 286): Gates etcd persistence to ensure only the ticker path savesKey logic (lines 354-357): When the time window is exhausted and storage save is required:
- If
allowSaveStorage=false(GetTS path), return error and increment metric- If
allowSaveStorage=true(ticker path), proceed with etcd save (lines 358-368)This design ensures:
- GetTS path can advance physical time within existing time windows (no etcd writes)
- Ticker path creates new time windows by persisting to etcd
- Resilience: Even if ticker is delayed, GetTS can allocate TSOs within current window until logical exhausts
The error return when save is needed but not allowed (line 356) triggers retry in
getTS(line 409), allowing the ticker to eventually create a new time window. This gracefully handles scheduler delays without stalling TSO allocation.
377-420: LGTM! GetTS now advances physical time on logical overflow with proper throttling.The updated
getTSfunction implements the PR's core objective: advancing physical time on GetTS requests when logical exhaustion occurs. Key changes:Line 380: Tracks
physicaltime returned bygenerateTSOto use asprein updatesLines 406-409: On logical overflow:
- Attempts
updateTimestamp(physical, false)to advance physical within existing time window- Logs warnings only for unexpected errors (not
ErrUpdateTimestamp)- Sleeps 50ms to throttle retries and allow ticker to persist new time windows
- Retries from loop start
Retry behavior:
- If
updateTimestampsucceeds (time window sufficient), next iteration generates TSO with advanced physical time- If
updateTimestampfails (save needed but not allowed), sleep gives ticker opportunity to persist, then retry- After
maxRetryCount(10) attempts, returns error to prevent indefinite blockingThis design balances resilience (GetTS can advance physical without waiting for ticker) with stability (bounded retries, throttled loops, no etcd write amplification).
94-420: The existingTestCurrentGetTSOtest already provides comprehensive concurrent stress testing with 10 goroutines running continuousgetTS()calls for 5 seconds, validating timestamp monotonicity. Additionally, the project's CI pipeline (Makefile line 263) already runs all tests includingpkg/tsowith the-raceflag. The suggestion to manually loop the race detector is redundant and unnecessary.Likely an incorrect or invalid review comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
pkg/tso/tso_test.go (3)
30-44: Consider adding a comment to document the interface being mocked.The
MockElectiontype implements the election interface but lacks a comment explaining this. A brief comment would improve code readability.📝 Suggested documentation
+// MockElection is a minimal mock implementation of the election.Member interface for testing. type MockElection struct{}
104-120: Initialprevalue might not be compared if physical changes before the first loop iteration.Line 104 captures the initial physical timestamp, but if physical time changes before the first
getTScall in the loop, that initial comparison is missed. This is a minor edge case but could lead to an off-by-one in thechangescount.♻️ Suggested adjustment
Move the initial
getTSOcall inside the loop to ensure the first comparison is valid:for i := range concurrency { go func(i int) { - pre, _ := timestampOracle.getTSO() - defer wg.Done() + var pre time.Time + if i == 0 { + pre, _ = timestampOracle.getTSO() + } for {Alternatively, if the intent is to track all changes including the initial state, consider comparing against the initial
currentvalue set on line 79.
128-128: Add assertion to verify that physical time actually progressed.The current assertion only verifies an upper bound on changes (
changes <= totalTso/maxLogical + 1), but doesn't confirm that the physical timestamp actually advanced during the test. Consider adding a lower bound check to ensure the test validates forward progress.♻️ Suggested additional assertion
wg.Wait() + // Verify that physical time actually progressed during concurrent access + re.Greater(changes.Load(), int32(0), "physical timestamp should have advanced at least once") re.LessOrEqual(changes.Load(), totalTso.Load()/int32(maxLogical)+1)This ensures the test catches scenarios where physical time incorrectly remains static despite logical exhaustion.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pkg/tso/tso_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/tso/tso_test.go (1)
pkg/election/leadership.go (1)
Leadership(60-79)
🪛 golangci-lint (2.5.0)
pkg/tso/tso_test.go
[error] 36-36: : # github.com/tikv/pd/pkg/dashboard/uiserver
pkg/dashboard/uiserver/embedded_assets_rewriter.go:36:26: undefined: assets
pkg/dashboard/uiserver/embedded_assets_rewriter.go:37:13: undefined: vfsgen۰FS
pkg/dashboard/uiserver/embedded_assets_rewriter.go:39:15: undefined: vfsgen۰CompressedFileInfo
pkg/dashboard/uiserver/embedded_assets_rewriter.go:47:9: undefined: assets
(typecheck)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: chunks (4, Tests(1))
- GitHub Check: chunks (5, Tests(2))
- GitHub Check: chunks (6, Tools Test)
- GitHub Check: chunks (2, Unit Test(2))
- GitHub Check: chunks (9, Microservice Integration(!TSO))
- GitHub Check: chunks (7, Client Integration Test)
- GitHub Check: chunks (1, Unit Test(1))
- GitHub Check: chunks (3, Unit Test(3))
- GitHub Check: chunks (10, Microservice Integration(TSO))
- GitHub Check: chunks (8, TSO Integration Test)
- GitHub Check: statics
- GitHub Check: tso-function-test
- GitHub Check: chunks
🔇 Additional comments (1)
pkg/tso/tso_test.go (1)
69-69:lastSavedTimeis correctly used as a save threshold, not an actual save timestamp.Line 69 sets
lastSavedTimetocurrent.Add(5 * time.Second)because it represents the time window within which a save must occur, not when a save actually happened. The implementation stores a future time—nextPhysical.Add(saveInterval)—when persisting a timestamp to etcd (see lines 202, 274, 367 intso.go). This threshold gates save operations: a new save is triggered only when physical time approaches it (line 267:if SubRealTimeByWallClock(lastSavedTime, nextPhysical) <= updateTimestampGuard). The test pattern is intentional—setting the threshold ahead ensures the TS generation logic is tested without triggering saves during the scenario.
56405de to
3c65d8a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @pkg/tso/tso_test.go:
- Line 104: The call to timestampOracle.getTSO() ignores its error return;
change the code that currently does "pre, _ := timestampOracle.getTSO()" to
capture the error (e.g., "pre, err := timestampOracle.getTSO()") and fail the
test if err != nil (use t.Fatalf or require.NoError(t, err) depending on test
helpers used) so the test stops and reports the real failure instead of
proceeding with an invalid timestamp.
🧹 Nitpick comments (3)
pkg/tso/tso_test.go (3)
46-73: Clarify test setup: future lastSavedTime and initialization.
lastSavedTimeis not explicitly initialized before line 64, relying on the zero value ofatomic.Value. Verify this matches the intended behavior when testing the "no save allowed" path.Line 69 sets
lastSavedTimeto a future time (current.Add(5 * time.Second)). This is counterintuitive sincelastSavedTimetypically represents a past event. While this may intentionally prevent triggering another save during the test, it would be clearer to either:
- Add a comment explaining why a future time is used
- Use a past time that's within the save interval window
💡 Suggested clarification
// simulate the save to storage operation is done. + // Set to future time to ensure no additional saves are triggered during this test timestampOracle.lastSavedTime.Store(current.Add(5 * time.Second))
95-95: Add comment explaining the future lastSavedTime.Setting
lastSavedTimeto a future time (current.Add(runDuration)) prevents storage saves during the test run, but this intent isn't documented. A comment would improve readability.📝 Suggested clarification
+ // Set lastSavedTime far in the future to prevent storage saves during the concurrent test timestampOracle.lastSavedTime.Store(current.Add(runDuration))
127-128: Document the assertion logic.The assertion
changes <= totalTso/maxLogical + 1validates that physical timestamp advances stay within expected bounds, but the reasoning isn't explained. Adding a comment would help future maintainers understand what behavior is being verified.📝 Suggested clarification
wg.Wait() + // Verify physical timestamp doesn't advance too frequently: + // With maxLogical TSOs per physical time, we expect ~totalTso/maxLogical advances (+1 for rounding) re.LessOrEqual(changes.Load(), totalTso.Load()/int32(maxLogical)+1)
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pkg/tso/tso_test.go
🧰 Additional context used
🪛 golangci-lint (2.5.0)
pkg/tso/tso_test.go
[error] 36-36: : # github.com/tikv/pd/pkg/dashboard/uiserver
pkg/dashboard/uiserver/embedded_assets_rewriter.go:36:26: undefined: assets
pkg/dashboard/uiserver/embedded_assets_rewriter.go:37:13: undefined: vfsgen۰FS
pkg/dashboard/uiserver/embedded_assets_rewriter.go:39:15: undefined: vfsgen۰CompressedFileInfo
pkg/dashboard/uiserver/embedded_assets_rewriter.go:47:9: undefined: assets
(typecheck)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: chunks (8, TSO Integration Test)
- GitHub Check: chunks (3, Unit Test(3))
- GitHub Check: chunks (10, Microservice Integration(TSO))
- GitHub Check: chunks (9, Microservice Integration(!TSO))
- GitHub Check: chunks (1, Unit Test(1))
- GitHub Check: chunks (2, Unit Test(2))
- GitHub Check: chunks (7, Client Integration Test)
- GitHub Check: chunks (5, Tests(2))
- GitHub Check: chunks (4, Tests(1))
- GitHub Check: chunks (6, Tools Test)
- GitHub Check: statics
- GitHub Check: chunks
- GitHub Check: tso-function-test
🔇 Additional comments (2)
pkg/tso/tso_test.go (2)
36-36: Static analysis error is a false positive.The golangci-lint error references
pkg/dashboard/uiserver/embedded_assets_rewriter.go, which is unrelated to this test file. This appears to be a build or tooling issue in a different package.
30-44: No concerns with nil return values in this test.The
mockElection.Client()andmockElection.GetLeadership()returningnilis intentional and safe. The tests setlastSavedTimeto a future time, which prevents any code paths that would invokesaveTimestamp()or use these methods from being reached. The tested code paths inTestGenerateTSOandTestCurrentGetTSOonly exercise TSO generation logic (generateTSO()) and do not require functionalClient()orGetLeadership()implementations.
3c65d8a to
9ad7d5f
Compare
Signed-off-by: bufferflies <1045931706@qq.com>
9ad7d5f to
5151b13
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
pkg/tso/tso_test.go (2)
77-77: Consider renaming test to reflect its purpose.The test name
TestCurrentGetTSOis misleading since it tests concurrent access behavior. A name likeTestConcurrentGetTSOwould better communicate the test's intent.📝 Suggested rename
-func TestCurrentGetTSO(t *testing.T) { +func TestConcurrentGetTSO(t *testing.T) {
109-126: Consider collecting errors instead of asserting inside goroutines.Using
re.NoError(err)inside goroutines can cause unclear test failures if the assertion fails, as the panic originates from a non-main goroutine. Consider collecting errors and validating them afterwg.Wait().♻️ Alternative approach with error collection
+ var errOnce sync.Once + var testErr error for i := range concurrency { go func(i int) { pre, _ := timestampOracle.getTSO() defer wg.Done() for { select { case <-runCtx.Done(): return default: _, err := timestampOracle.getTS(runCtx, 1) totalTso.Add(1) - re.NoError(err) + if err != nil { + errOnce.Do(func() { testErr = err }) + return + } if i == 0 { physical, _ := timestampOracle.getTSO() if pre != physical { changes.Add(1) pre = physical } } } } }(i) } wg.Wait() + re.NoError(testErr)
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pkg/tso/tso.gopkg/tso/tso_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/tso/tso.go (4)
tools/pd-ctl/helper/tidb/types/time.go (1)
Time(69-71)pkg/errs/errno.go (1)
ErrUpdateTimestamp(104-104)pkg/utils/logutil/log.go (1)
CondUint32(269-274)tools/pd-ctl/helper/tidb/terror/terror.go (1)
Error(62-62)
🪛 golangci-lint (2.5.0)
pkg/tso/tso_test.go
[error] 36-36: : # github.com/tikv/pd/pkg/dashboard/uiserver
pkg/dashboard/uiserver/embedded_assets_rewriter.go:36:26: undefined: assets
pkg/dashboard/uiserver/embedded_assets_rewriter.go:37:13: undefined: vfsgen۰FS
pkg/dashboard/uiserver/embedded_assets_rewriter.go:39:15: undefined: vfsgen۰CompressedFileInfo
pkg/dashboard/uiserver/embedded_assets_rewriter.go:47:9: undefined: assets
(typecheck)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: tso-function-test
- GitHub Check: chunks (9, Microservice Integration(!TSO))
- GitHub Check: chunks (7, Client Integration Test)
- GitHub Check: statics
- GitHub Check: chunks (8, TSO Integration Test)
- GitHub Check: chunks
- GitHub Check: chunks (2, Unit Test(2))
- GitHub Check: chunks (10, Microservice Integration(TSO))
- GitHub Check: chunks (1, Unit Test(1))
- GitHub Check: chunks (6, Tools Test)
- GitHub Check: chunks (4, Tests(1))
- GitHub Check: chunks (5, Tests(2))
- GitHub Check: chunks (3, Unit Test(3))
🔇 Additional comments (11)
pkg/tso/tso_test.go (3)
30-44: LGTM!The mock election implementation correctly satisfies the interface requirements for testing purposes, with
IsServing()returningtrueto simulate an active leader.
46-75: LGTM!The test correctly validates the expected behavior:
- When logical is near max and storage save would be needed but isn't allowed,
getTSreturns an error.- After simulating storage save completion,
getTSsucceeds and physical time advances.The setup with
logical: maxLogical - 1ensures the firstgetTS(ctx, 2)triggers the overflow path.
130-130: LGTM!The assertion correctly validates that physical time changes are bounded by the expected maximum: at most
totalTSO / maxLogical + 1physical time increments should occur, since each physical time increment providesmaxLogicallogical slots.pkg/tso/tso.go (8)
93-110: LGTM! Good concurrency guard.The new
preparameter effectively prevents stale updates from overwriting concurrent modifications:
- Line 102-104 checks if current physical is already past the caller's captured
pretime, indicating another goroutine advanced it- The existing check at line 106 ensures monotonicity
This is a correct solution for the concurrent advancement scenario described in the PR objectives.
121-134: LGTM!Returning
current(the physical time at generation) allows callers to pass it toupdateTimestamp, enabling the concurrency guard insetTSOPhysicalto work correctly.
209-212: LGTM!Correctly captures current TSO before calling
setTSOPhysicalwithforce=true, ensuring the sync operation can initialize even from zero time.
283-299: LGTM! Clear parameter documentation.The updated function signature and comments clearly document:
pre: the physical time captured before update (used for concurrency guard)allowSaveStorage: restricts storage persistence to ticker-driven updates onlyThis aligns with the PR objective of allowing in-memory advancement while centralizing persistence to the ticker.
351-356: LGTM!The guard correctly prevents storage writes from the GetTS path while providing observability via the
notAllowedSaveTimestampEventmetric. The error message clearly explains why the update failed.
369-371: LGTM!Using
pre(the captured physical time from the caller) instead ofprevPhysical(re-fetched current value) is correct. This ensures the concurrency guard insetTSOPhysicalcompares against the caller's snapshot, properly detecting if another goroutine advanced the time between when the caller captured it and now.
379-395: LGTM!The captured
physicalfromgenerateTSOis correctly passed toupdateTimestampon overflow, enabling the concurrency guard to detect if another goroutine already advanced the physical time.
405-408: Verify the error handling logic on overflow path.When
updateTimestampreturns an error (lines 405-407), the code logs a warning and continues to retry. However, the condition!errs.ErrUpdateTimestamp.Equal(err)means the warning is logged for all errors exceptErrUpdateTimestamp. This seems intentional—ErrUpdateTimestampfrom the "not allowed" path is expected and shouldn't be logged as a warning—but please verify this is the intended behavior.
|
/retest |
Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
d6a20a6 to
80b5c16
Compare
|
I don't quite get the point of |
It avoids all goroutines that can save ts in etcd. Only allocatorUpdater can save tso in etcd |
Signed-off-by: bufferflies <1045931706@qq.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@pkg/tso/tso_test.go`:
- Around line 53-85: The test TestSetTSO drops errors returned by getTSO();
update all calls to getTSO() in this test to assert the error is nil (e.g.,
using re.NoError or require.NoError) before using the returned physical value.
Specifically, in TestSetTSO after each call to ts.getTSO() (used with
setTSOPhysical and withInitialCheck/withLogicalOverflowCheck), capture both
return values and add an assertion on the error to fail the test if getTSO()
returns an error; keep using the physical value afterward for equality checks.
- Line 166: The goroutine calling timestampOracle.getTSO() currently ignores the
returned error; change the call to capture both values (e.g., physical, err :=
timestampOracle.getTSO()) and handle err: if err != nil report the failure from
the test (for example call t.Errorf("getTSO failed: %v", err) or send the error
on a test error channel back to the main goroutine) instead of discarding it;
ensure the fix is applied to the invocation of timestampOracle.getTSO() inside
the goroutine so failures are not masked.
- Around line 107-115: The two calls to timestampOracle.getTSO() ignore the
returned error; update the test to capture and assert no error from both calls
(e.g., err := and re.NoError(err)) before using the returned physical value so
failures are caught; specifically modify the first call that sets physical, _ :=
timestampOracle.getTSO() and the later physical, _ = timestampOracle.getTSO() to
check the error result (use re.NoError or require.NoError) and then use the
returned physical variable for the comparisons with current and re.NotEqual.
If the Go runtime does not wake up |
In the current user case, allocatorUpdater doesn't wake up for more than the lease duration (5s). The root cause is that the physical doesn't advance. |
892babc to
d8bcada
Compare
Signed-off-by: bufferflies <1045931706@qq.com>
d8bcada to
5a2794a
Compare
|
/retest |
|
/ping @JmPotato |
pkg/tso/tso.go
Outdated
| // NOTICE: this function should be called after the TSO in memory has been initialized | ||
| // and should not be called when the TSO in memory has been reset anymore. | ||
| func (t *timestampOracle) updateTimestamp() error { | ||
| func (t *timestampOracle) updateTimestamp(updatePurpose updatePurpose) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better add a comment to explain what the returned bool means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/tso/tso.go
Outdated
| func (t *timestampOracle) updateTimestamp(updatePurpose updatePurpose) (bool, error) { | ||
| if !t.isInitialized() { | ||
| return errs.ErrUpdateTimestamp.FastGenByArgs("timestamp in memory has not been initialized") | ||
| return true, errs.ErrUpdateTimestamp.FastGenByArgs("timestamp in memory has not been initialized") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the return value indicates whether it has overflowed, this should not be true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch ,done
pkg/tso/tso.go
Outdated
| // The time window needs to be updated and saved to etcd. | ||
| if typeutil.SubRealTimeByWallClock(t.getLastSavedTime(), next) <= updateTimestampGuard { | ||
| // Only IntervalUpdate is allowed to save timestamp into etcd. | ||
| // it would be dangerous to save timestamp into etcd when handling overflowUpdate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be best to elaborate a bit in the comments on what “dangerous” actually means.
pkg/tso/tso.go
Outdated
| zap.Error(err)) | ||
| t.metrics.errSaveUpdateTSEvent.Inc() | ||
| return err | ||
| return true, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/tso/tso.go
Outdated
| return nil | ||
| var overflowed bool | ||
| // If it's an IntervalUpdate, we don't need to check logical overflow, just update physical time directly. | ||
| // otherwise, we need to check logical overflow to avoid unnecessary physical time update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with the previous comment, please elaborate on why, in the latter case, we need to check for overflow to avoid extra updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: bufferflies <1045931706@qq.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: liyishuai The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pkg/tso/tso.go`:
- Around line 441-446: The log call inside the error branch of the
updateTimestamp handling should be elevated from log.Info to log.Warn: in the
block that checks the result of t.updateTimestamp(overflowUpdate), replace
log.Info(...) with log.Warn(...), keeping the same structured fields (use
logutil.CondUint32("keyspace-group-id", t.keyspaceGroupID, t.keyspaceGroupID >
0) and zap.Error(err)) and preserve the existing sleep behavior using
t.updatePhysicalInterval; this aligns the error path for
updateTimestamp/overflowUpdate with other warnings in this file.
| if overflowed, err := t.updateTimestamp(overflowUpdate); err != nil { | ||
| log.Info("update timestamp failed", logutil.CondUint32("keyspace-group-id", t.keyspaceGroupID, t.keyspaceGroupID > 0), zap.Error(err)) | ||
| time.Sleep(t.updatePhysicalInterval) | ||
| } else if overflowed { | ||
| time.Sleep(t.updatePhysicalInterval) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.Info should be log.Warn for a failed timestamp update.
This logs an error from updateTimestamp at Info level, but it represents an operational failure that warrants operator attention. Other failure paths in this file (e.g., line 390) consistently use log.Warn.
Proposed fix
if overflowed, err := t.updateTimestamp(overflowUpdate); err != nil {
- log.Info("update timestamp failed", logutil.CondUint32("keyspace-group-id", t.keyspaceGroupID, t.keyspaceGroupID > 0), zap.Error(err))
+ log.Warn("update timestamp failed", logutil.CondUint32("keyspace-group-id", t.keyspaceGroupID, t.keyspaceGroupID > 0), zap.Error(err))
time.Sleep(t.updatePhysicalInterval)🤖 Prompt for AI Agents
In `@pkg/tso/tso.go` around lines 441 - 446, The log call inside the error branch
of the updateTimestamp handling should be elevated from log.Info to log.Warn: in
the block that checks the result of t.updateTimestamp(overflowUpdate), replace
log.Info(...) with log.Warn(...), keeping the same structured fields (use
logutil.CondUint32("keyspace-group-id", t.keyspaceGroupID, t.keyspaceGroupID >
0) and zap.Error(err)) and preserve the existing sleep behavior using
t.updatePhysicalInterval; this aligns the error path for
updateTimestamp/overflowUpdate with other warnings in this file.
JmPotato
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😂 I’m “sorry” for my strict review—but it’s necessary. After all, this is extremely critical code logic like TSO, and it’s essential to keep the code clear in every sense.
pkg/tso/tso.go
Outdated
| // NOTICE: this function should be called after the TSO in memory has been initialized | ||
| // and should not be called when the TSO in memory has been reset anymore. | ||
| func (t *timestampOracle) updateTimestamp() error { | ||
| func (t *timestampOracle) updateTimestamp(updatePurpose updatePurpose) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter has the same name as the type updatePurpose. This compiles in Go but is confusing and shadows the type within the function body. Better consider renaming the parameter to purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, good suggestion
pkg/tso/tso.go
Outdated
| func withLogicalOverflowCheck() setTSOOption { | ||
| return func(t *timestampOracle) bool { | ||
| return !overflowedLogical(t.tsoMux.logical) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name withLogicalOverflowCheck suggests "check if there's an overflow", but it actually means "skip if there is NOT an overflow". A reader needs to mentally invert the logic. Consider renaming to something like skipIfNotInitialized / skipIfNotOverflowed, or inverting the convention so true means "proceed" instead of "skip".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/tso/tso.go
Outdated
| t.metrics.logicalOverflowEvent.Inc() | ||
| time.Sleep(t.updatePhysicalInterval) | ||
| if overflowed, err := t.updateTimestamp(overflowUpdate); err != nil { | ||
| log.Info("update timestamp failed", logutil.CondUint32("keyspace-group-id", t.keyspaceGroupID, t.keyspaceGroupID > 0), zap.Error(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log should be more like a warnning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/tso/tso_test.go
Outdated
| default: | ||
| } | ||
| } | ||
| if i == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually can use a struct like map[int64]struct{} to count all physical timestamps generated by all goroutines to check if it doesn't exceed the expected number of timestamp advances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I have collected all goroutines when the logical response time is equal to 1, which indicates the physical has been allocated.
Signed-off-by: bufferflies <1045931706@qq.com>
pkg/tso/tso.go
Outdated
| func (t *timestampOracle) setTSOPhysical(next time.Time, force bool) { | ||
| // setTSOOption defines the option type for setTSOPhysical function. | ||
| // if it returns true, it will skip update physical time. | ||
| type setTSOOption func(t *timestampOracle) bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setTSOOption sounds like a 'configuration option', but this type is actually a function that determines 'whether to skip the update', more like a guard/predicate.
| type setTSOOption func(t *timestampOracle) bool | |
| type setTSOPhysicalGuard func(t *timestampOracle) bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or treat Initialized and Overflowed as a prerequisite that must be met before setting TSO.
| type setTSOOption func(t *timestampOracle) bool | |
| type setTSOPrecond func(t *timestampOracle) bool |
for _, precond := range preconds {
if !precond(t) {
return overflowedLogical(t.tsoMux.logical)
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have renamed this option to condition, and renamed mustInitialized and mustOverflowed to make the code more concise.
|
@bufferflies: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
Issue Number: Close #10138
What is changed and how does it work?
Check List
Tests
Master(tso-update-physical-interval = "2s"):

This PR(tso-update-physical-interval = "2s")

Code changes
Side effects
Related changes
pingcap/docs/pingcap/docs-cn:Release note
Summary by CodeRabbit
New Features
Bug Fixes
Tests