Skip to content

Fix blue-green migration might be stuck due to an existing reconnection #406

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

BewareMyPower
Copy link
Contributor

Fixes #405

Motivation

After triggering a blue-green migration, the socket will be disconnected and then schedule a reconnection to the blue cluster. However, the blue cluster could never respond with a response for Producer or Subscribe commands. Take producer as example, it means connectionOpened will not complete and reconnectionPending_ will not become false.

Then, after receiving a CommandProducerClose command from the blue cluster, a new reconnection will be scheduled to the green cluster but it will be skipped because reconnectionPending_ is true, which means the previous connectionOpened future is not completed until the 30s timeout is reached.

2024-02-26 06:09:30.251 INFO  [139737465607744] HandlerBase:101 | [persistent://public/unload-test/topic-1708927732, sub, 0] Ignoring reconnection attempt since there's already a pending reconnection
2024-02-26 06:10:00.035 WARN  [139737859880512] ProducerImpl:291 | [persistent://public/unload-test/topic-1708927732, cluster-a-0-0] Failed to reconnect producer: TimeOut

Modifications

When receiving the TOPIC_MIGRATED command, cancel the pending Producer and Subscribe commands so that connectionOpened will fail with a retryable error. In the next time of reconnection, the green cluster will be connected.

Fix the ExtensibleLoadManagerTest with a more strict timeout check. After this change, it will pass in about 3 seconds locally, while in CI even if it passed, it takes about 70 seconds before.

Besides, fix the possible crash on macOS when closing the client, see #405 (comment)

Fixes apache#405

### Motivation

After triggering a blue-green migration, the socket will be disconnected
and then schedule a reconnection to the blue cluster. However, the blue
cluster could never respond with a response for Producer or Subscribe
commands. Take producer as example, it means `connectionOpened` will not
complete and `reconnectionPending_` will not become false.

Then, after receiving a `CommandProducerClose` command from the blue
cluster, a new reconnection will be scheduled to the green cluster but
it will be skipped because `reconnectionPending_` is true, which means
the previous `connectionOpened` future is not completed until the 30s
timeout is reached.

```
2024-02-26 06:09:30.251 INFO  [139737465607744] HandlerBase:101 | [persistent://public/unload-test/topic-1708927732, sub, 0] Ignoring reconnection attempt since there's already a pending reconnection
2024-02-26 06:10:00.035 WARN  [139737859880512] ProducerImpl:291 | [persistent://public/unload-test/topic-1708927732, cluster-a-0-0] Failed to reconnect producer: TimeOut
```

### Modifications

When receiving the `TOPIC_MIGRATED` command, cancel the pending
`Producer` and `Subscribe` commands so that `connectionOpened` will fail
with a retryable error. In the next time of reconnection, the green
cluster will be connected.

Fix the `ExtensibleLoadManagerTest` with a more strict timeout check.
After this change, it will pass in about 3 seconds locally, while in CI
even if it passed, it takes about 70 seconds before.

Besides, fix the possible crash on macOS when closing the client, see
apache#405 (comment)
@BewareMyPower BewareMyPower added the bug Something isn't working label Feb 27, 2024
@BewareMyPower BewareMyPower added this to the 3.5.0 milestone Feb 27, 2024
@BewareMyPower BewareMyPower self-assigned this Feb 27, 2024
@BewareMyPower BewareMyPower marked this pull request as draft February 27, 2024 13:22
@BewareMyPower
Copy link
Contributor Author

BewareMyPower commented Feb 27, 2024

The test is still flaky when calling receive during unload.

Logs in CI:

2024-02-27 10:25:03.553 INFO  [139979829278272] ExtensibleLoadManagerTest:136 | receive i: 2 10 ms
2024-02-27 10:25:03.554 INFO  [139979829278272] ExtensibleLoadManagerTest:143 | acked i:2 0 ms
2024-02-27 10:25:03.559 INFO  [139980456872960] ExtensibleLoadManagerTest:183 | before lookup responseData:{"brokerUrl":"pulsar://broker-2:6650","httpUrl":"http://broker-2:8080","nativeUrl":"pulsar://broker-2:6650","brokerUrlSsl":""},unload url:http://localhost:8080/admin/v2/namespaces/public/unload-test/0x00000000_0xffffffff/unload?destinationBroker=broker-1:8080,lookupCountBeforeUnload:2
2024-02-27 10:25:08.555 INFO  [139979829278272] ExtensibleLoadManagerTest:136 | receive i: 2 5000 ms

Logs locally:

2024-02-27 21:25:29.831 INFO  [0x170017000] ExtensibleLoadManagerTest:115 | produce i: 7 7 ms
2024-02-27 21:25:29.833 INFO  [0x1e9aa2100] ExtensibleLoadManagerTest:184 | before lookup responseData:{"brokerUrl":"pulsar://broker-2:6650","httpUrl":"http://broker-2:8080","nativeUrl":"pulsar://broker-2:6650","brokerUrlSsl":""},unload url:http://localhost:8080/admin/v2/namespaces/public/unload-test/0x00000000_0xffffffff/unload?destinationBroker=broker-1:8080,lookupCountBeforeUnload:2
2024-02-27 21:25:29.833 INFO  [0x1700a3000] ExtensibleLoadManagerTest:136 | receive i: 6 10 ms
2024-02-27 21:25:29.833 INFO  [0x1700a3000] ExtensibleLoadManagerTest:143 | acked i:6 0 ms
2024-02-27 21:25:29.833 INFO  [0x1700a3000] ExtensibleLoadManagerTest:136 | receive i: 7 0 ms
2024-02-27 21:25:29.833 INFO  [0x1700a3000] ExtensibleLoadManagerTest:143 | acked i:7 0 ms
2024-02-27 21:25:34.833 INFO  [0x1700a3000] ExtensibleLoadManagerTest:136 | receive i: 7 5000 ms

@BewareMyPower BewareMyPower marked this pull request as ready for review February 27, 2024 14:12
@BewareMyPower BewareMyPower marked this pull request as draft February 27, 2024 14:13
@BewareMyPower BewareMyPower marked this pull request as ready for review February 27, 2024 14:23
@BewareMyPower
Copy link
Contributor Author

The flakiness in the current test is related to the broker side (apache/pulsar#22136) so I mark this PR as ready to review.

@BewareMyPower
Copy link
Contributor Author

@heesung-sn I moved the field into HandlerBase as firstRequestIdAfterConnect_ and initialized it with -1. PTAL again.

@BewareMyPower BewareMyPower merged commit 9f96eb9 into apache:main Feb 29, 2024
@BewareMyPower BewareMyPower deleted the bewaremypower/fix-pip-188-tests branch February 29, 2024 02:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Flaky ExtensibleLoadManagerTest, which could also cause segfault in some platforms
2 participants