@rleungx
Member

@rleungx rleungx commented Jan 19, 2026

What problem does this PR solve?

Issue Number: Close #8511.

What is changed and how does it work?

[2026/01/19 10:39:07.054 +08:00] [WARN] [cluster.go:665] ["etcd start failed, will retry"] [error="[PD:etcd:ErrStartEtcd]listen tcp 127.0.0.1:43255: bind: address already in use: listen tcp 127.0.0.1:43255: bind: address already in use"]
[2026/01/19 10:39:07.055 +08:00] [WARN] [cluster.go:665] ["etcd start failed, will retry"] [error="[PD:etcd:ErrStartEtcd]listen tcp 127.0.0.1:43255: bind: address already in use: listen tcp 127.0.0.1:43255: bind: address already in use"]
    client_test.go:896: 
        	Error Trace:	/home/prow/go/src/github.com/tikv/pd/tests/integrations/client/client_test.go:896
        	            				/home/prow/go/src/github.com/tikv/pd/tests/integrations/client/router_client_test.go:73
        	            				/home/prow/go/pkg/mod/github.com/stretchr/testify@v1.9.0/suite/suite.go:157
        	            				/home/prow/go/src/github.com/tikv/pd/tests/integrations/client/router_client_test.go:46
        	Error:      	Received unexpected error:
        	            	[PD:etcd:ErrStartEtcd]listen tcp 127.0.0.1:43255: bind: address already in use: listen tcp 127.0.0.1:43255: bind: address already in use

We already retry when the address is in use, but the retry reuses the same port, so it is pointless while that port remains occupied.

Check List

Tests

  • Unit test

Release note

None.

Summary by CodeRabbit

  • New Features

    • Test cluster startup now retries on port/address conflicts: ports are reallocated and servers recreated to recover while preserving original services and options.
    • Added configurable retry limits and backoff to improve automated recovery when startup conflicts occur.
  • Chores

    • Test infrastructure now retains startup context and initial configuration to support reliable restart flows and clearer errors when retries are exhausted.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. labels Jan 19, 2026
@coderabbitai

coderabbitai bot commented Jan 19, 2026

Walkthrough

Adds a retry-capable initial test-server startup flow that detects port-binding conflicts, stops/destroys affected test servers, reallocates client/peer URLs via tempurl, recreates servers with original services/options, and retries startup with backoff using a new internal retry helper; public API unchanged.

Changes

Cohort / File(s) | Summary
Test cluster startup & state (tests/cluster.go) | Adds internal retry startup flow (runInitialServersWithRetry), defaultMaxRetryTimes, and stores ctx, services, and opts in TestCluster so servers can be recreated. RunInitialServers delegates to the retry helper; handles ErrStartEtcd and port-conflict recovery (stop/destroy, regenerate configs, recreate servers, backoff/retry).
Imports (tests/cluster.go) | Adds github.com/tikv/pd/pkg/utils/tempurl for reallocating client/peer URLs during port-conflict retries.

Sequence Diagram(s)

sequenceDiagram
    participant Cluster as TestCluster
    participant Alloc as tempurl.Alloc
    participant Server as TestServer
    participant Etcd as Etcd

    Cluster->>Server: Start()
    Server->>Etcd: startEtcd() (bind ports)
    alt started
        Etcd-->>Server: started
        Server-->>Cluster: running
    else ErrStartEtcd (port conflict)
        Etcd-->>Server: ErrStartEtcd
        Server-->>Cluster: error
        Cluster->>Server: Stop() / Destroy()
        Cluster->>Alloc: Alloc() new ClientURL / PeerURL
        Alloc-->>Cluster: new URLs
        Cluster->>Server: Recreate(with original services/opts, new URLs)
        Cluster->>Server: Start() after backoff (retry)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

size/M

Suggested reviewers

  • lhy1024
  • HunDunDM

Poem

🐇
I sniff the sockets, find a knot undone,
I hop and fetch fresh ports beneath the sun.
Stop, remake, then try once more—
Backoff, bounce, the servers soar.
Tests resume; I nibble on a run.

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The PR title 'tests: retry when address already in use' directly describes the main change: adding port-conflict-aware retry logic for test cluster startup.
Description check | ✅ Passed | The PR description includes Issue Number linking to #8511, explains the problem (retrying with same port is useless), and indicates unit tests are included. However, the commit-message block is empty.
Linked Issues check | ✅ Passed | The PR implementation directly addresses issue #8511 by adding port-conflict handling: when 'address already in use' errors occur, the code stops/destroys existing servers, regenerates configurations with new temp URLs, and retries with different ports.
Out of Scope Changes check | ✅ Passed | All changes in cluster.go are focused on handling address-in-use errors with port regeneration and retry logic, directly supporting the linked issue objective. No unrelated out-of-scope changes detected.

@ti-chi-bot ti-chi-bot bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 19, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/cluster.go`:
- Around line 689-699: The retry path recreates servers with incorrect
parameters: RunInitialServersWithRetry currently calls NewTestServer with nil
services and rebuilds server config using only WithGCTuner(false), losing
original services and ConfigOption overrides (e.g., those required by
NewTestClusterWithKeyspaceGroup). Fix by adding fields to TestCluster to persist
the original services ([]string) and opts ([]ConfigOption) supplied at creation,
populate them when the cluster is created, and then update
RunInitialServersWithRetry to pass those persisted services and merge/append
WithGCTuner(false) onto the persisted opts when calling Generate and
NewTestServer so NewTestServer(...) receives the original services and full set
of config options instead of nil/partial options.
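
A minimal sketch of the idea in this comment, persisting the creation-time services and options so a retry can rebuild servers with the same inputs. The types here (`config`, `configOption`, `cluster`) are simplified stand-ins invented for illustration and do not match the real PD test types.

```go
package main

import "fmt"

// config and configOption are simplified stand-ins for the PD test config
// types; the field names are assumptions made for this sketch.
type config struct {
	gcTuner   bool
	keyspaces bool
}

type configOption func(*config)

// cluster remembers what it was created with so that a retry can rebuild
// servers from the same inputs.
type cluster struct {
	services []string
	opts     []configOption
}

func newCluster(services []string, opts ...configOption) *cluster {
	return &cluster{services: services, opts: opts}
}

// rebuildConfig applies the persisted options followed by retry-specific extras.
func (c *cluster) rebuildConfig(extra ...configOption) *config {
	// Copy first so appending never writes into the persisted slice's backing array.
	all := append(append([]configOption{}, c.opts...), extra...)
	conf := &config{gcTuner: true}
	for _, opt := range all {
		opt(conf)
	}
	return conf
}

func main() {
	withKeyspaces := func(c *config) { c.keyspaces = true }
	withGCTunerOff := func(c *config) { c.gcTuner = false }

	c := newCluster([]string{"tso"}, withKeyspaces)
	conf := c.rebuildConfig(withGCTunerOff)
	fmt.Printf("services=%v keyspaces=%v gcTuner=%v\n", c.services, conf.keyspaces, conf.gcTuner)
}
```

Appending the retry-specific option last lets it override an earlier option that touches the same field, which seems to be the intent of merging `WithGCTuner(false)` onto the persisted options.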
🧹 Nitpick comments (1)
tests/cluster.go (1)

674-679: Consider logging errors from Stop/Destroy during cleanup.

While ignoring cleanup errors is acceptable for test code to ensure cleanup proceeds, logging them could help debug flaky tests.

Optional: Log cleanup errors
 			// Stop and destroy all servers
 			for _, s := range servers {
 				if s.State() == Running {
-					_ = s.Stop()
+					if err := s.Stop(); err != nil {
+						log.Warn("failed to stop server during retry cleanup", zap.Error(err))
+					}
+				}
+				if err := s.Destroy(); err != nil {
+					log.Warn("failed to destroy server during retry cleanup", zap.Error(err))
 				}
-				_ = s.Destroy()
 			}

Signed-off-by: Ryan Leung <rleungx@gmail.com>
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 19, 2026
@rleungx
Member Author

rleungx commented Jan 19, 2026

/retest

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/cluster.go`:
- Around line 655-658: The function RunInitialServersWithRetry currently treats
maxRetries improperly (and even uses a faulty loop "for i := range maxRetries");
add an upfront validation that maxRetries is at least 1 (or return a clear error
if <=0) and change the loop to a proper integer loop (e.g., for i := 0; i <
maxRetries; i++) so the body executes the intended number of attempts; update
TestCluster.RunInitialServersWithRetry to enforce the minimum and use the
corrected loop control referencing maxRetries.

Signed-off-by: Ryan Leung <rleungx@gmail.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/cluster.go`:
- Around line 721-730: The retry branch currently only checks for "ErrStartEtcd"
on lastErr, which allows "ErrCancelStartEtcd" to short-circuit retries; update
the condition to treat both "ErrStartEtcd" and "ErrCancelStartEtcd" as retryable
(e.g., check lastErr.Error() contains either string) so this path matches the
behavior in RunServersWithRetry; keep the existing cleanup (iterating servers,
stopping Running ones) and sleep/continue logic unchanged.
♻️ Duplicate comments (1)
tests/cluster.go (1)

697-705: Preserve services/opts when clusters are restarted to keep retries correct.

This retry path relies on c.services and c.opts. restartTestCluster doesn’t populate them, so a port-conflict retry after a restart can recreate servers without keyspace services or custom options. Consider copying these fields from the original cluster (or deriving from isKeyspaceGroupEnabled) when constructing the restarted cluster.

newTestCluster := &TestCluster{
    config:   cluster.config,
    servers:  make(map[string]*TestServer, len(cluster.servers)),
    services: cluster.services,
    opts:     cluster.opts,
    // tsPool...
}

tests/cluster.go Outdated
Comment on lines 721 to 730
// For non-port-conflict errors, use regular retry
if strings.Contains(lastErr.Error(), "ErrStartEtcd") {
	log.Warn("etcd start failed, will retry", zap.Error(lastErr))
	for _, s := range servers {
		if s.State() == Running {
			_ = s.Stop()
		}
	}
	time.Sleep(100 * time.Millisecond)
	continue

⚠️ Potential issue | 🟡 Minor

Retry should include ErrCancelStartEtcd for parity with RunServersWithRetry.

Right now this path retries only on ErrStartEtcd, so ErrCancelStartEtcd will short-circuit retries and can reintroduce flakes in partial-start scenarios.

🐛 Proposed fix
-		if strings.Contains(lastErr.Error(), "ErrStartEtcd") {
+		if strings.Contains(lastErr.Error(), "ErrCancelStartEtcd") ||
+			strings.Contains(lastErr.Error(), "ErrStartEtcd") {

@rleungx
Member Author

rleungx commented Jan 19, 2026

/retest

@codecov

codecov bot commented Jan 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.60%. Comparing base (6dbd9cd) to head (726a03c).
⚠️ Report is 10 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10162      +/-   ##
==========================================
+ Coverage   78.56%   78.60%   +0.03%     
==========================================
  Files         520      520              
  Lines       69676    69720      +44     
==========================================
+ Hits        54739    54800      +61     
+ Misses      10990    10973      -17     
  Partials     3947     3947              
Flag Coverage Δ
unittests 78.60% <ø> (+0.03%) ⬆️

@ti-chi-bot
Contributor

ti-chi-bot bot commented Jan 19, 2026

@liyishuai: adding LGTM is restricted to approvers and reviewers in OWNERS files.

tests/cluster.go Outdated
servers := make([]*TestServer, 0, len(c.config.InitialServers))
for _, conf := range c.config.InitialServers {
servers = append(servers, c.GetServer(conf.Name))
return c.RunInitialServersWithRetry(3)
Member

How about defining a constant to make it easier to locate and adjust the retry times later?

Member Author

done

tests/cluster.go Outdated
}

// For non-port-conflict errors, use regular retry
if strings.Contains(lastErr.Error(), "ErrStartEtcd") {
Member

Using a switch statement for different lastErr values seems clearer.

Signed-off-by: Ryan Leung <rleungx@gmail.com>
@rleungx rleungx requested a review from JmPotato January 19, 2026 07:49
Contributor

@lhy1024 lhy1024 left a comment

The rest LGTM.

tests/cluster.go Outdated
}

// Use the original services passed during cluster creation
s, err := NewTestServer(context.Background(), serverConf, c.services)
Contributor

How about using c.ctx?

Member Author

done

Signed-off-by: Ryan Leung <rleungx@gmail.com>
Member

@JmPotato JmPotato left a comment

The rest LGTM.

}

// RunServersWithRetry starts to run multiple TestServer with retry logic.
func RunServersWithRetry(servers []*TestServer, maxRetries int) error {
Member

This function is no longer used, so we should remove it.

Member Author

There is another test still relying on it.

tests/cluster.go Outdated
}

// RunInitialServersWithRetry starts to run servers with port conflict handling.
func (c *TestCluster) RunInitialServersWithRetry(maxRetries int) error {
Member

Suggested change
-func (c *TestCluster) RunInitialServersWithRetry(maxRetries int) error {
+func (c *TestCluster) runInitialServersWithRetry(maxRetries int) error {

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jan 19, 2026
Signed-off-by: Ryan Leung <rleungx@gmail.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/cluster.go`:
- Around line 701-710: runInitialServersWithRetry uses c.ctx, c.services, and
c.opts when recreating servers on port conflicts but restartTestCluster (used by
RestartTestPDCluster) doesn't populate those fields, causing nil context or lost
services/options on retries; update restartTestCluster to copy or set the
cluster fields (ctx, services, opts) from the original cluster (or accept them
as parameters) so that NewTestServer calls in runInitialServersWithRetry always
see valid c.ctx, c.services, and c.opts, ensuring retries preserve the original
configuration.
♻️ Duplicate comments (1)
tests/cluster.go (1)

723-731: Include ErrCancelStartEtcd in retryable errors (parity with RunServersWithRetry).

This path currently retries only on ErrStartEtcd, so ErrCancelStartEtcd can short-circuit retries in partial-start scenarios.

💡 Proposed tweak
-    case strings.Contains(errMsg, "ErrStartEtcd"):
+    case strings.Contains(errMsg, "ErrCancelStartEtcd") ||
+        strings.Contains(errMsg, "ErrStartEtcd"):

Signed-off-by: Ryan Leung <rleungx@gmail.com>
@rleungx
Member Author

rleungx commented Jan 21, 2026

/retest

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jan 21, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Jan 21, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JmPotato, lhy1024, liyishuai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot
Contributor

ti-chi-bot bot commented Jan 21, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-01-19 09:39:10.956277379 +0000 UTC m=+400378.570234236: ☑️ agreed by lhy1024.
  • 2026-01-21 05:16:20.739439224 +0000 UTC m=+557408.353396150: ☑️ agreed by JmPotato.

@rleungx
Member Author

rleungx commented Jan 21, 2026

/retest

1 similar comment
@rleungx
Member Author

rleungx commented Jan 21, 2026

/retest

@ti-chi-bot ti-chi-bot bot merged commit aa68e91 into tikv:master Jan 21, 2026
32 checks passed
@rleungx rleungx deleted the retry-init-server branch January 21, 2026 08:07
bufferflies pushed a commit to bufferflies/pd that referenced this pull request Jan 22, 2026
close tikv#8511

Signed-off-by: Ryan Leung <rleungx@gmail.com>

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test: tests fail because bind address already in use

4 participants