@vldmit
Contributor

@vldmit vldmit commented Feb 13, 2025

Close #1574
Problem Statement

The etcd client runs Sync only after AutoSyncInterval (30 seconds) has elapsed, which leaves the tikv client vulnerable to a failure of the endpoints provided in addrs before the first sync happens. Specific failure scenario:

  1. tidb is initialized with --path=endpoint
  2. tidb successfully establishes a connection to the endpoint
  3. Less than 30 seconds later, the endpoint fails (e.g. the k8s control plane is upgrading the pod)
  4. The etcd client is no longer connected
  5. The safe checkpoint expires and CheckVisibility in KVStore starts to error out

Fix

We explicitly synchronize the client endpoints with the endpoints from etcd membership during the client initialization phase. The etcd client continues to do its periodic sync; we just force the first sync to happen in the init phase.
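The idea can be sketched in a few lines. This is a minimal illustration, not the code from this PR: `fakeClient`, `syncOnInit`, and the 5-second timeout are hypothetical names and values chosen for the example. The only thing taken from the etcd v3 client is the shape of its `Sync(ctx)` and `Endpoints()` methods, which the small interface below mirrors so the sketch runs without a live cluster.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// endpointSyncer is the slice of client behaviour this sketch relies on.
// The etcd v3 client exposes Sync and Endpoints methods of this shape.
type endpointSyncer interface {
	Sync(ctx context.Context) error
	Endpoints() []string
}

// fakeClient is a hypothetical stand-in for an etcd client so that the
// example is runnable without a cluster.
type fakeClient struct {
	endpoints []string // endpoints the client currently dials
	members   []string // endpoints advertised by cluster membership
}

// Sync replaces the dialled endpoints with the advertised member list.
func (c *fakeClient) Sync(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	if len(c.members) == 0 {
		return errors.New("no members available")
	}
	c.endpoints = append([]string(nil), c.members...)
	return nil
}

func (c *fakeClient) Endpoints() []string { return c.endpoints }

// initSyncTimeout is an assumed bound for the one-off sync at startup.
const initSyncTimeout = 5 * time.Second

// syncOnInit forces the first endpoint sync during initialization instead
// of waiting for the periodic auto-sync to fire.
func syncOnInit(c endpointSyncer) error {
	ctx, cancel := context.WithTimeout(context.Background(), initSyncTimeout)
	defer cancel()
	return c.Sync(ctx)
}

func main() {
	c := &fakeClient{
		endpoints: []string{"seed:2379"},
		members:   []string{"pd-0:2379", "pd-1:2379", "pd-2:2379"},
	}
	if err := syncOnInit(c); err != nil {
		fmt.Println("initial sync failed:", err)
		return
	}
	// The client now knows every member, so losing the seed endpoint
	// no longer disconnects it.
	fmt.Println(c.Endpoints())
}
```

With the full member list known from the start, a failure of the single seed endpoint within the first 30 seconds leaves the client other endpoints to fall back on.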

Signed-off-by: Vlad Dmitriev <vldmit@gmail.com>
@ti-chi-bot ti-chi-bot bot added the dco-signoff: yes Indicates the PR's author has signed the dco. label Feb 13, 2025
@ti-chi-bot

ti-chi-bot bot commented Feb 13, 2025

Welcome @vldmit!

It looks like this is your first PR to tikv/client-go 🎉.

I'm the bot that helps you request reviewers, add labels, and more. See available commands.

We want to make sure your contribution gets all the attention it needs!



Thank you, and welcome to tikv/client-go. 😃

@ti-chi-bot ti-chi-bot bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 13, 2025
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Feb 14, 2025
@zyguan
Contributor

zyguan commented Feb 17, 2025

@cfzjywxk PTAL

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Feb 17, 2025
@ti-chi-bot

ti-chi-bot bot commented Feb 17, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, zyguan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot

ti-chi-bot bot commented Feb 17, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-02-14 01:27:25.878162368 +0000 UTC m=+579088.274384427: ☑️ agreed by zyguan.
  • 2025-02-17 03:08:27.519627564 +0000 UTC m=+844349.915849626: ☑️ agreed by cfzjywxk.

@cfzjywxk
Contributor

/retest

@ti-chi-bot ti-chi-bot bot merged commit 279dcd5 into tikv:master Feb 17, 2025
11 checks passed
@cfzjywxk
Contributor

@vldmit Thanks for the contribution; #1574 can now be closed.
The client-go dependency in tidb will be updated later so that the fix takes effect.

rleungx added a commit to rleungx/client-go that referenced this pull request Mar 6, 2025
@rleungx
Member

rleungx commented Mar 6, 2025

I'm trying to upgrade client-go to the latest version in TiDB (pingcap/tidb#59757), but some CI jobs fail. The log keeps printing "missing address"; see https://do.pingcap.net/jenkins/blue/organizations/jenkins/pingcap%2Ftidb%2Fpull_lightning_integration_test/detail/pull_lightning_integration_test/7104/pipeline
After I reverted this commit in #1605, the job succeeded; see https://do.pingcap.net/jenkins/blue/organizations/jenkins/pingcap%2Ftidb%2Fpull_lightning_integration_test/detail/pull_lightning_integration_test/7120/pipeline

I haven't investigated why it failed yet. /cc @zyguan

@zyguan
Contributor

zyguan commented Mar 7, 2025

@rleungx It seems local.NewBackend uses config.PDAddr (instead of pdAddrs) here, which is empty because the config is generated by genConfig. This PR just exposes the issue.

@rleungx
Member

rleungx commented Mar 10, 2025

For some mock clients, the address is empty, so the sync reports an error, which fails the test.

@you06
Contributor

you06 commented Mar 12, 2025

What about wrapping Client.Sync and only running it when the address list is non-empty?
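A minimal sketch of such a guard follows. It is an illustration only, not code from client-go: `syncIfAddrs`, `hasUsableAddr`, `countingSyncer`, and the 5-second timeout are hypothetical, and the `syncer` interface merely mirrors the shape of the etcd v3 client's `Sync(ctx)` method. Real code would more likely thread the caller's context through rather than create one internally.

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"time"
)

// syncer is the only behaviour the guard needs; the etcd v3 client's
// Sync(ctx) method has this shape.
type syncer interface {
	Sync(ctx context.Context) error
}

// countingSyncer is a hypothetical test double that records Sync calls.
type countingSyncer struct{ calls int }

func (s *countingSyncer) Sync(ctx context.Context) error {
	s.calls++
	return ctx.Err()
}

// hasUsableAddr reports whether addrs contains at least one non-blank entry.
func hasUsableAddr(addrs []string) bool {
	for _, a := range addrs {
		if strings.TrimSpace(a) != "" {
			return true
		}
	}
	return false
}

// syncIfAddrs runs Sync only when there is at least one address to sync
// against. It reports whether Sync was attempted.
func syncIfAddrs(s syncer, addrs []string) (bool, error) {
	if !hasUsableAddr(addrs) {
		return false, nil // e.g. a mock client created with no addresses
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return true, s.Sync(ctx)
}

func main() {
	mock := &countingSyncer{}

	ran, err := syncIfAddrs(mock, nil)
	fmt.Println(ran, err, mock.calls)

	ran, err = syncIfAddrs(mock, []string{"127.0.0.1:2379"})
	fmt.Println(ran, err, mock.calls)
}
```

The trade-off is that a guard like this keeps mock-based tests passing but also hides a misconfigured caller that passes an empty address list by mistake.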

@zyguan
Contributor

zyguan commented Mar 12, 2025

I've confirmed the issue with @D3Hunter: it should use pdAddrs here. I also pushed some commits to address the issue yesterday, but they were reverted by a recent force-push.

@rleungx
Member

rleungx commented Mar 12, 2025

Oh, my bad. I mistakenly force-pushed a commit and overwrote yours. I have changed it back.

serprex added a commit to PeerDB-io/tikv-client-go that referenced this pull request Jun 30, 2025
* *: bump pd client (tikv#1575)

 

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* OWNERS: Auto Sync OWNERS files from community membership (tikv#1576)

 

Signed-off-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

* sync etcd endpoints immediately after initializing the client (tikv#1573)

 

Signed-off-by: Vlad Dmitriev <vldmit@gmail.com>

* p-dml: resolve locks concurrently (tikv#1584)

close tikv#1577

Signed-off-by: you06 <you1474600@gmail.com>

* memdb: prevent iterator invalidation (tikv#1563)

ref pingcap/tidb#59153

Signed-off-by: ekexium <eke@fastmail.com>

* locate: refactor RegionRequestSender.SendReqCtx (tikv#1565)

 

Signed-off-by: zyguan <zhongyangguan@gmail.com>

* locate: fix the default settings of circuit breaker (tikv#1593)

ref tikv/pd#8678

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* util: define and implement core interfaces for async api (tikv#1591)

ref tikv#1586

Signed-off-by: zyguan <zhongyangguan@gmail.com>

* Add a retry when getting ts from PD for validating read ts (tikv#1600)

 

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>

* pdclient: Add caller info to pd client (tikv#1516)

ref tikv/pd#8593

Signed-off-by: okJiang <819421878@qq.com>

* locate: fix TestTiKVClientReadTimeout (tikv#1601)

 

Signed-off-by: zyguan <zhongyangguan@gmail.com>

* *: update pd client (tikv#1605)

 

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* Validate ts only for stale read (tikv#1607)

ref pingcap/tidb#59402

Signed-off-by: ekexium <eke@fastmail.com>

* execdetails: export scheduler write details (tikv#1606)

 

Signed-off-by: Neil Shen <overvenus@gmail.com>

* Update pd client (tikv#1615)

Signed-off-by: disksing <i@disksing.com>

* ci: allow use label to skip integration tests (tikv#1616)

 

Signed-off-by: disksing <i@disksing.com>

* remove useless metric tidb_tikvclient_cop_duration_seconds_bucket (tikv#1602)

 

Signed-off-by: XuHuaiyu <391585975@qq.com>

* client: implement SendRequestAsync for RPCClient (tikv#1604)

ref tikv#1586

Signed-off-by: zyguan <zhongyangguan@gmail.com>

* execdetails: export grpc process and wait time to time details (tikv#1614)

 

Signed-off-by: Neil Shen <overvenus@gmail.com>

Co-authored-by: Bisheng Huang <hbisheng@gmail.com>

* Refine pessimistic lock related metrics and stats (tikv#1620)

 

Signed-off-by: yibin87 <huyibin@pingcap.com>

* metrics: adjust bucket count to reduce metrics data (tikv#1609)

 

Signed-off-by: Lynn <zimu_xia@126.com>

* update tidb for integration tests (tikv#1621)

 

Signed-off-by: disksing <i@disksing.com>

* support redact key in logs (tikv#1612)

ref pingcap/tidb#59279

Signed-off-by: tangenta <tangenta@126.com>

Co-authored-by: you06 <you1474600@gmail.com>

* update integration_test/go.mod (tikv#1624)

 

Signed-off-by: tangenta <tangenta@126.com>

* memdb: introduce snapshot interface (tikv#1623)

 

Signed-off-by: you06 <you1474600@gmail.com>

Co-authored-by: ekexium <eke@fastmail.com>

* pd: enable OutputMustContainAllKeyRange (tikv#1632)

 

Signed-off-by: lhy1024 <admin@liudos.us>

* Fix backoff lose info when forked (tikv#1627)

ref pingcap/tidb#60271

Signed-off-by: yibin87 <huyibin@pingcap.com>

* tikv: disable health-feedback in next-gen (tikv#1635)

 

Signed-off-by: zyguan <zhongyangguan@gmail.com>

* enable ts validation for normal read (tikv#1619)

 

Signed-off-by: ekexium <eke@fastmail.com>

* Add txn write conflict metrics (tikv#1551)

close tikv#1550

Signed-off-by: sujuntao <juntao.su@foxmail.com>

Co-authored-by: sujuntao <juntao.su@foxmail.com>

* apicodec: fix a typo when encoding request for CmdMvccGetByKey (tikv#1638)

 

Signed-off-by: tiancaiamao <tiancaiamao@gmail.com>

Co-authored-by: cfzjywxk <lsswxrxr@163.com>

* *: update kvproto version (tikv#1636)

ref tikv#1631

Signed-off-by: Chao Wang <cclcwangchao@hotmail.com>

* txn: provide more information in commit RPC / log mvcc debug info when commit failed for `TxnLockNotFound` (tikv#1640)

ref tikv#1631

Signed-off-by: Chao Wang <cclcwangchao@hotmail.com>

* txn: handle undetermined error in client go (tikv#1642)

close tikv#1641

Signed-off-by: Chao Wang <cclcwangchao@hotmail.com>

* txn: fix the implemention of undetermined error (tikv#1644)

close tikv#1641

Signed-off-by: Chao Wang <cclcwangchao@hotmail.com>

* locate: implement SendReqAsync for RegionRequestSender (tikv#1618)

ref tikv#1586

Signed-off-by: zyguan <zhongyangguan@gmail.com>

* update pd client for resource group and keyspace (tikv#1645)

 

Signed-off-by: lhy1024 <admin@liudos.us>

* tests: bump tidb to fix integration tests (tikv#1650)

 

Signed-off-by: zyguan <zhongyangguan@gmail.com>

* Fix some metrics that miss const labels (tikv#1652)

 

Signed-off-by: yibin87 <huyibin@pingcap.com>

* Replace etcd safe point with txn safe point for read safety check (tikv#1634)

 

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>

* Fix stale read metrics (tikv#1649)

close tikv#1648

Signed-off-by: you06 <you1474600@gmail.com>

* *: support async batch get (tikv#1646)

ref tikv#1586

Signed-off-by: zyguan <zhongyangguan@gmail.com>

* Update kvproto dependancy and set keyspace name for rpc context (tikv#1667)

close tikv#1668

Signed-off-by: yibin87 <huyibin@pingcap.com>

* ci: add next-gen integration tests (tikv#1661)

 

Signed-off-by: ekexium <eke@fastmail.com>

* snapshot: set `ReplicaRead` to false when `ReplicaReadType` fallbacks to `ReplicaReadLeader` (tikv#1663)

ref pingcap/tidb#61745

Signed-off-by: you06 <you1474600@gmail.com>

* resource_control: support collecting cross AZ traffic in ru consumption (tikv#1669)

 

Signed-off-by: glorv <glorvs@163.com>

* txnkv: prevent some actions from being interrupted by kill (tikv#1665)

fix pingcap/tidb#61454

Signed-off-by: zyguan <zhongyangguan@gmail.com>

* region_cache: add ForceRefreshAllStores function (tikv#1686)

 

Signed-off-by: guo-shaoge <shaoge1994@163.com>

* upgrade gRPC to allow consumption by peerdb

* Bump gprc version to 1.73.0

Remove references to `grpc.NewSharedBufferPool()`, it was removed from
the grpc package and causes build to fail - it's always on.
(https://pkg.go.dev/google.golang.org/grpc/experimental#WithBufferPool)

Signed-off-by: Tiago Scolari <git@tscolari.me>

* go mod tidy

---------

Signed-off-by: Ryan Leung <rleungx@gmail.com>
Signed-off-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Signed-off-by: Vlad Dmitriev <vldmit@gmail.com>
Signed-off-by: you06 <you1474600@gmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: okJiang <819421878@qq.com>
Signed-off-by: Neil Shen <overvenus@gmail.com>
Signed-off-by: disksing <i@disksing.com>
Signed-off-by: XuHuaiyu <391585975@qq.com>
Signed-off-by: yibin87 <huyibin@pingcap.com>
Signed-off-by: Lynn <zimu_xia@126.com>
Signed-off-by: tangenta <tangenta@126.com>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: Chao Wang <cclcwangchao@hotmail.com>
Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: Tiago Scolari <git@tscolari.me>
Co-authored-by: Ryan Leung <rleungx@gmail.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Co-authored-by: Vlad Dmitriev <vldmit@gmail.com>
Co-authored-by: you06 <you1474600@gmail.com>
Co-authored-by: ekexium <eke@fastmail.com>
Co-authored-by: zyguan <zhongyangguan@gmail.com>
Co-authored-by: MyonKeminta <9948422+MyonKeminta@users.noreply.github.com>
Co-authored-by: okJiang <819421878@qq.com>
Co-authored-by: Neil Shen <overvenus@gmail.com>
Co-authored-by: disksing <i@disksing.com>
Co-authored-by: HuaiyuXu <xuhuaiyu@pingcap.com>
Co-authored-by: Bisheng Huang <hbisheng@gmail.com>
Co-authored-by: yibin <huyibin@pingcap.com>
Co-authored-by: Lynn <zimu_xia@126.com>
Co-authored-by: tangenta <tangenta@126.com>
Co-authored-by: lhy1024 <lhylhy1024@gmail.com>
Co-authored-by: JT <43174723+ImSjt@users.noreply.github.com>
Co-authored-by: sujuntao <juntao.su@foxmail.com>
Co-authored-by: tiancaiamao <tiancaiamao@gmail.com>
Co-authored-by: cfzjywxk <lsswxrxr@163.com>
Co-authored-by: 王超 <cclcwangchao@hotmail.com>
Co-authored-by: glorv <glorvs@163.com>
Co-authored-by: guo-shaoge <shaoge1994@163.com>
Co-authored-by: Kevin Biju <kevin@peerdb.io>
Co-authored-by: Tiago Scolari <tiago@diagrid.io>
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Jul 9, 2025
@ti-chi-bot
Member

In response to a cherrypick label: cannot checkout release-8.5: error checking out release-8.5: exit status 1. output: error: pathspec 'release-8.5' did not match any file(s) known to git

@ti-chi-bot ti-chi-bot bot removed the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Sep 15, 2025

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Delayed etcd endpoint sync causing KVStore errors

6 participants