Fix blue-green migration might be stuck due to an existing reconnection #406
Merged
BewareMyPower
merged 3 commits into
apache:main
from
BewareMyPower:bewaremypower/fix-pip-188-tests
Feb 29, 2024
Fixes apache#405

### Motivation

After triggering a blue-green migration, the socket is disconnected and a reconnection to the blue cluster is scheduled. However, the blue cluster might never respond to the Producer or Subscribe commands. Taking the producer as an example, this means `connectionOpened` will not complete and `reconnectionPending_` will not become false.

Then, after a `CommandProducerClose` command is received from the blue cluster, a new reconnection to the green cluster is scheduled, but it is skipped because `reconnectionPending_` is still true: the previous `connectionOpened` future does not complete until the 30s timeout is reached.

```
2024-02-26 06:09:30.251 INFO [139737465607744] HandlerBase:101 | [persistent://public/unload-test/topic-1708927732, sub, 0] Ignoring reconnection attempt since there's already a pending reconnection
2024-02-26 06:10:00.035 WARN [139737859880512] ProducerImpl:291 | [persistent://public/unload-test/topic-1708927732, cluster-a-0-0] Failed to reconnect producer: TimeOut
```

### Modifications

When receiving the `TOPIC_MIGRATED` command, cancel the pending `Producer` and `Subscribe` commands so that `connectionOpened` fails with a retryable error. On the next reconnection attempt, the client connects to the green cluster.

Fix the `ExtensibleLoadManagerTest` with a stricter timeout check. After this change, it passes in about 3 seconds locally, whereas in CI it previously took about 70 seconds even when it passed.

Also fix a possible crash on macOS when closing the client; see apache#405 (comment).
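The stuck state described above can be illustrated with a minimal, self-contained sketch. This is a toy model and not the actual pulsar-client-cpp code: `ToyHandler`, `scheduleReconnection`, `completeConnectionOpened`, `onTopicMigrated`, `simulateMigration`, and the `Result` enum are all hypothetical names, and the use of a `Disconnected` result as the retryable error is an assumption. Only the `reconnectionPending_` flag mirrors a name from the description.

```cpp
#include <cassert>
#include <string>

// Hypothetical result codes; Disconnected stands in for "a retryable error".
enum class Result { Ok, Disconnected, Timeout };

// Toy stand-in for a producer/consumer handler (not the real HandlerBase).
class ToyHandler {
   public:
    // Returns false when the attempt is skipped because one is already pending,
    // which corresponds to the "Ignoring reconnection attempt" log line.
    bool scheduleReconnection(const std::string& cluster) {
        if (reconnectionPending_) {
            return false;
        }
        reconnectionPending_ = true;
        targetCluster_ = cluster;
        return true;
    }

    // Models the connectionOpened future completing (successfully or not).
    void completeConnectionOpened(Result result) {
        assert(reconnectionPending_);
        lastResult_ = result;
        reconnectionPending_ = false;
    }

    // Models the fix: on TOPIC_MIGRATED, fail the pending request right away
    // instead of waiting for the request timeout.
    void onTopicMigrated() {
        if (reconnectionPending_) {
            completeConnectionOpened(Result::Disconnected);
        }
    }

    const std::string& targetCluster() const { return targetCluster_; }
    Result lastResult() const { return lastResult_; }

   private:
    bool reconnectionPending_ = false;
    std::string targetCluster_;
    Result lastResult_ = Result::Ok;
};

// Replays the sequence from the description and reports which cluster the
// handler is trying to reach afterwards.
std::string simulateMigration(bool cancelPendingOnMigration) {
    ToyHandler handler;
    handler.scheduleReconnection("blue");  // blue never answers the Producer command
    if (cancelPendingOnMigration) {
        handler.onTopicMigrated();
    }
    handler.scheduleReconnection("green");  // triggered by the close command
    return handler.targetCluster();
}
```

Without the cancellation, the second attempt is skipped and the handler stays pointed at the blue cluster until the pending request times out. With it, the pending flag is cleared first and the attempt to reach the green cluster goes through.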
The test is still flaky when calling Logs in CI:
Logs locally:
The flakiness in the current test is related to the broker side (apache/pulsar#22136), so I have marked this PR as ready for review.
heesung-sn
reviewed
Feb 27, 2024
@heesung-sn I moved the field into |
heesung-sn
approved these changes
Feb 28, 2024