tests: retry when address already in use #10162
Conversation
📝 Walkthrough

Adds a retry-capable initial test-server startup flow that detects port-binding conflicts, stops/destroys the affected test servers, and reallocates client/peer URLs via tempurl.Alloc before retrying.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Cluster as TestCluster
    participant Alloc as tempurl.Alloc
    participant Server as TestServer
    participant Etcd as Etcd
    Cluster->>Server: Start()
    Server->>Etcd: startEtcd() (bind ports)
    alt started
        Etcd-->>Server: started
        Server-->>Cluster: running
    else ErrStartEtcd (port conflict)
        Etcd-->>Server: ErrStartEtcd
        Server-->>Cluster: error
        Cluster->>Server: Stop() / Destroy()
        Cluster->>Alloc: Alloc() new ClientURL / PeerURL
        Alloc-->>Cluster: new URLs
        Cluster->>Server: Recreate(with original services/opts, new URLs)
        Cluster->>Server: Start() after backoff (retry)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/cluster.go`:
- Around line 689-699: The retry path recreates servers with incorrect
parameters: RunInitialServersWithRetry currently calls NewTestServer with nil
services and rebuilds server config using only WithGCTuner(false), losing
original services and ConfigOption overrides (e.g., those required by
NewTestClusterWithKeyspaceGroup). Fix by adding fields to TestCluster to persist
the original services ([]string) and opts ([]ConfigOption) supplied at creation,
populate them when the cluster is created, and then update
RunInitialServersWithRetry to pass those persisted services and merge/append
WithGCTuner(false) onto the persisted opts when calling Generate and
NewTestServer so NewTestServer(...) receives the original services and full set
of config options instead of nil/partial options.
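A minimal sketch of the suggested shape, assuming a hypothetical `recreateServer` helper and that `serverConfig.Generate` accepts the option list (both inferred from the prompt's mention of `Generate` and `NewTestServer`; actual field and type names may differ):

```go
// Sketch only: persist the creation-time arguments on the cluster so the
// retry path can rebuild servers faithfully. Field names are assumptions.
type TestCluster struct {
	ctx      context.Context
	config   *clusterConfig
	servers  map[string]*TestServer
	services []string       // services supplied at creation
	opts     []ConfigOption // ConfigOption overrides supplied at creation
}

// recreateServer rebuilds one server from the persisted arguments, merging
// the retry-specific override instead of replacing the original options.
func (c *TestCluster) recreateServer(conf *serverConfig) (*TestServer, error) {
	opts := append(append([]ConfigOption{}, c.opts...), WithGCTuner(false))
	serverConf, err := conf.Generate(opts...)
	if err != nil {
		return nil, err
	}
	return NewTestServer(c.ctx, serverConf, c.services)
}
```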
🧹 Nitpick comments (1)
tests/cluster.go (1)
674-679: Consider logging errors from Stop/Destroy during cleanup. While ignoring cleanup errors is acceptable for test code to ensure cleanup proceeds, logging them could help debug flaky tests.
Optional: Log cleanup errors
```diff
 // Stop and destroy all servers
 for _, s := range servers {
 	if s.State() == Running {
-		_ = s.Stop()
+		if err := s.Stop(); err != nil {
+			log.Warn("failed to stop server during retry cleanup", zap.Error(err))
+		}
 	}
-	_ = s.Destroy()
+	if err := s.Destroy(); err != nil {
+		log.Warn("failed to destroy server during retry cleanup", zap.Error(err))
+	}
 }
```
Signed-off-by: Ryan Leung <rleungx@gmail.com>
Force-pushed 81f4dc8 to c2d2abc
/retest
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/cluster.go`:
- Around line 655-658: The function RunInitialServersWithRetry currently treats
maxRetries improperly (and even uses a faulty loop "for i := range maxRetries");
add an upfront validation that maxRetries is at least 1 (or return a clear error
if <=0) and change the loop to a proper integer loop (e.g., for i := 0; i <
maxRetries; i++) so the body executes the intended number of attempts; update
TestCluster.RunInitialServersWithRetry to enforce the minimum and use the
corrected loop control referencing maxRetries.
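A sketch of the guard and loop shape the prompt asks for; `startInitialServers` is a hypothetical single-attempt starter standing in for the loop body:

```go
// RunInitialServersWithRetry validates maxRetries up front and uses a plain
// integer loop so the body runs exactly the intended number of attempts.
func (c *TestCluster) RunInitialServersWithRetry(maxRetries int) error {
	if maxRetries < 1 {
		return errors.New("maxRetries must be at least 1")
	}
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		if lastErr = c.startInitialServers(); lastErr == nil {
			return nil
		}
		// classify lastErr, clean up, back off, then try the next attempt
	}
	return lastErr
}
```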
Signed-off-by: Ryan Leung <rleungx@gmail.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/cluster.go`:
- Around line 721-730: The retry branch currently only checks for "ErrStartEtcd"
on lastErr, which allows "ErrCancelStartEtcd" to short-circuit retries; update
the condition to treat both "ErrStartEtcd" and "ErrCancelStartEtcd" as retryable
(e.g., check lastErr.Error() contains either string) so this path matches the
behavior in RunServersWithRetry; keep the existing cleanup (iterating servers,
stopping Running ones) and sleep/continue logic unchanged.
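One way to express the suggested condition is a small helper; the name is hypothetical, while the error strings come from the prompt above:

```go
// isRetryableEtcdErr reports whether an etcd startup failure should trigger
// another attempt; both variants indicate a transient start problem.
func isRetryableEtcdErr(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "ErrCancelStartEtcd") ||
		strings.Contains(msg, "ErrStartEtcd")
}
```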
♻️ Duplicate comments (1)
tests/cluster.go (1)
697-705: Preserve services/opts when clusters are restarted to keep retries correct. This retry path relies on `c.services` and `c.opts`. `restartTestCluster` doesn't populate them, so a port-conflict retry after a restart can recreate servers without keyspace services or custom options. Consider copying these fields from the original cluster (or deriving from `isKeyspaceGroupEnabled`) when constructing the restarted cluster.

```go
newTestCluster := &TestCluster{
	config:   cluster.config,
	servers:  make(map[string]*TestServer, len(cluster.servers)),
	services: cluster.services,
	opts:     cluster.opts,
	// tsPool...
}
```
tests/cluster.go
Outdated
```go
// For non-port-conflict errors, use regular retry
if strings.Contains(lastErr.Error(), "ErrStartEtcd") {
	log.Warn("etcd start failed, will retry", zap.Error(lastErr))
	for _, s := range servers {
		if s.State() == Running {
			_ = s.Stop()
		}
	}
	time.Sleep(100 * time.Millisecond)
	continue
```
Retry should include ErrCancelStartEtcd for parity with RunServersWithRetry.
Right now this path retries only on ErrStartEtcd, so ErrCancelStartEtcd will short-circuit retries and can reintroduce flakes in partial-start scenarios.
🐛 Proposed fix
```diff
-if strings.Contains(lastErr.Error(), "ErrStartEtcd") {
+if strings.Contains(lastErr.Error(), "ErrCancelStartEtcd") ||
+	strings.Contains(lastErr.Error(), "ErrStartEtcd") {
```

🤖 Prompt for AI Agents
In `@tests/cluster.go` around lines 721 - 730, The retry branch currently only
checks for "ErrStartEtcd" on lastErr, which allows "ErrCancelStartEtcd" to
short-circuit retries; update the condition to treat both "ErrStartEtcd" and
"ErrCancelStartEtcd" as retryable (e.g., check lastErr.Error() contains either
string) so this path matches the behavior in RunServersWithRetry; keep the
existing cleanup (iterating servers, stopping Running ones) and sleep/continue
logic unchanged.
/retest
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #10162      +/-   ##
==========================================
+ Coverage   78.56%   78.60%   +0.03%
==========================================
  Files         520      520
  Lines       69676    69720      +44
==========================================
+ Hits        54739    54800      +61
+ Misses      10990    10973      -17
  Partials     3947     3947
```
Flags with carried forward coverage won't be shown.
@liyishuai: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
tests/cluster.go
Outdated
```go
servers := make([]*TestServer, 0, len(c.config.InitialServers))
for _, conf := range c.config.InitialServers {
	servers = append(servers, c.GetServer(conf.Name))

return c.RunInitialServersWithRetry(3)
```
How about defining a constant to make it easier to locate and adjust the retry times later?
done
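Presumably something along these lines, with a hypothetical constant name:

```go
// defaultRetryTimes bounds how often initial servers are restarted on port
// conflicts; a named constant keeps the knob easy to locate and adjust.
const defaultRetryTimes = 3

// RunInitialServers keeps its signature and delegates to the retry variant.
func (c *TestCluster) RunInitialServers() error {
	return c.RunInitialServersWithRetry(defaultRetryTimes)
}
```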
tests/cluster.go
Outdated
```go
}

// For non-port-conflict errors, use regular retry
if strings.Contains(lastErr.Error(), "ErrStartEtcd") {
```
Using a switch statement for different lastErr values seems clearer.
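A sketch of that shape; the "address already in use" string is an assumption, while `errMsg` mirrors the variable visible in the later excerpt:

```go
errMsg := lastErr.Error()
switch {
case strings.Contains(errMsg, "address already in use"):
	// port conflict: reallocate client/peer URLs before recreating servers
case strings.Contains(errMsg, "ErrCancelStartEtcd") ||
	strings.Contains(errMsg, "ErrStartEtcd"):
	// transient etcd start failure: stop running servers and retry as-is
default:
	// non-retryable: surface the error to the caller
}
```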
Signed-off-by: Ryan Leung <rleungx@gmail.com>
lhy1024 left a comment
the rest LGTM
tests/cluster.go
Outdated
```go
}

// Use the original services passed during cluster creation
s, err := NewTestServer(context.Background(), serverConf, c.services)
```
how about using c.ctx?
done
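The change presumably amounts to swapping the context in the call above:

```go
// Tie recreated servers to the cluster's lifecycle rather than a fresh context.
s, err := NewTestServer(c.ctx, serverConf, c.services)
```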
Signed-off-by: Ryan Leung <rleungx@gmail.com>
JmPotato left a comment
The rest LGTM.
```go
}

// RunServersWithRetry starts to run multiple TestServer with retry logic.
func RunServersWithRetry(servers []*TestServer, maxRetries int) error {
```
This function is no longer used; we should remove it then.
There is another test still relying on it.
tests/cluster.go
Outdated
```go
}

// RunInitialServersWithRetry starts to run servers with port conflict handling.
func (c *TestCluster) RunInitialServersWithRetry(maxRetries int) error {
```
```diff
-func (c *TestCluster) RunInitialServersWithRetry(maxRetries int) error {
+func (c *TestCluster) runInitialServersWithRetry(maxRetries int) error {
```
Signed-off-by: Ryan Leung <rleungx@gmail.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/cluster.go`:
- Around line 701-710: runInitialServersWithRetry uses c.ctx, c.services, and
c.opts when recreating servers on port conflicts but restartTestCluster (used by
RestartTestPDCluster) doesn't populate those fields, causing nil context or lost
services/options on retries; update restartTestCluster to copy or set the
cluster fields (ctx, services, opts) from the original cluster (or accept them
as parameters) so that NewTestServer calls in runInitialServersWithRetry always
see valid c.ctx, c.services, and c.opts, ensuring retries preserve the original
configuration.
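A sketch of that fix, reusing the struct literal from the earlier duplicate comment; the `ctx` field name is an assumption:

```go
// Carry retry-relevant state into the restarted cluster so
// runInitialServersWithRetry sees valid ctx/services/opts.
newTestCluster := &TestCluster{
	ctx:      ctx,
	config:   cluster.config,
	servers:  make(map[string]*TestServer, len(cluster.servers)),
	services: cluster.services,
	opts:     cluster.opts,
	// tsPool and remaining fields as before...
}
```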
♻️ Duplicate comments (1)
tests/cluster.go (1)
723-731: Include `ErrCancelStartEtcd` in retryable errors (parity with `RunServersWithRetry`). This path currently retries only on `ErrStartEtcd`, so `ErrCancelStartEtcd` can short-circuit retries in partial-start scenarios.

💡 Proposed tweak

```diff
-	case strings.Contains(errMsg, "ErrStartEtcd"):
+	case strings.Contains(errMsg, "ErrCancelStartEtcd") ||
+		strings.Contains(errMsg, "ErrStartEtcd"):
```
/retest
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JmPotato, lhy1024, liyishuai

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing `/approve` in a comment.
/retest
1 similar comment
/retest
close tikv#8511

Signed-off-by: Ryan Leung <rleungx@gmail.com>
What problem does this PR solve?
Issue Number: Close #8511.
What is changed and how does it work?
We do have a retry when the address is already in use, but it retries with the same port, which makes it useless.
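Roughly, the fix reallocates fresh URLs before each retry instead of rebinding the failed ones; field names here are assumptions, and `tempurl.Alloc` comes from the walkthrough above:

```go
// On a bind conflict, pick new client/peer URLs rather than retrying the
// exact same ports.
conf.ClientUrls = tempurl.Alloc()
conf.PeerUrls = tempurl.Alloc()
```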
Check List
Tests
Release note
Summary by CodeRabbit
New Features
Chores