-
Notifications
You must be signed in to change notification settings - Fork 425
Ensure successful message propagation in case of disconnection mid-handshake #2725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This PR makes the following interpretation of the issue and follows the solution accordingly:
If in case, my interpretation of the problem has been erroneous, do let me and I shall be glad to correct it! :) |
|
Also, this PR has been set to draft because the tests are incomplete and only partially test the added code. |
TheBlueMatt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the status here. You have it marked draft, do you want feedback? Whay kind of feedback/review is this ready for?
Hi, @TheBlueMatt So, I am facing trouble preparing a test for when the peer is reconnected on time, and hence the open channel message is sent to it because it conflicts with how the rest of the test codebase is set up. So I wanted to get a general Approach ACK, before I go about hacking in the test to make them work. |
|
Updated from pr2725.01 -> pr2725.02 (diff) Changes:
Note:
|
TheBlueMatt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay here, busy with thanksgiving travel and other stuff.
|
Updated from pr2725.03 -> pr2725.04 (diff) Changes:
Thank you very much, @TheBlueMatt, for this new idea to solve this problem. |
We still need to resend |
You are right! That was an oversight from my side. Thank you for pointing it out.
You are right. However, the goal of the PR is to ensure proper execution of the |
I don't think we necessarily care about that. What we really want is to end up with a funded channel (within a reasonable timeout) if a user requests one while being able to handle the counterparty disconnecting mid-handshake.
Typically nodes forget all about channels before sending/receiving |
|
Thanks, @wpaulino, for the details about the message transmissions! Seems like it's worth considering extending the PR from fixing the original issue to not failing channel creation mid-handshake due to channel disconnection. I am tinkering with an approach, and I shall update the PR very soon! |
|
Update: Okay, so I have figured out an approach, but this depends on the behavior changes introduced in #2760. We can track the list of msg_events we have sent during the handshake process, which can be used in case a peer disconnects midway. Once the funding is signed, we graduate the channel from However, currently, in the main, we graduate the channel as soon as we have created the funding. let (chan: Channel<SP>, msg_opt) = match peer_state.channel_by_id.remove(temporary_channel_id) {
...
},Since #2760 is already getting approval and will soon be merged, I shall build this new approach over the changes introduced there. |
|
I don't think we need to explicitly track which message we're ready to send to our counterparty - if we are disconnected from a peer, then reconnect prior to funding, we have to restart from the |
|
Updated from pr2725.04 -> pr2725.05 (diff) Updates:
Logic: -> Follow the standard handshake routine. -> If we disconnect mid-handshake from our peer (that is, OutboundV1Channel is not resolved to a funded channel), we don't immediately close the OutboundV1Channel. -> Instead, we track how long it has been since we disconnected from peers. -> If we connect back within time, we rebroadcast SendOpenChannel corresponding to OutboundV1Channel to the peer. -> If we do not connect back within N (=2) timer ticks, we force close and remove the channel. Note:
|
TheBlueMatt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically LGTM. One comment. Note that the test fixes in the last commit will need to get squashed into the commit that broke tests. We require (but don't actually check in CI) that each individual commit builds and passes tests.
|
Updated from pr2725.06 -> pr2725.07 (diff) Update:
|
Codecov ReportAttention:
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #2725 +/- ##
==========================================
- Coverage 89.14% 89.14% -0.01%
==========================================
Files 116 116
Lines 93205 93186 -19
Branches 93205 93186 -19
==========================================
- Hits 83089 83066 -23
- Misses 7583 7587 +4
Partials 2533 2533 ☔ View full report in Codecov by Sentry. |
|
Updated from pr2725.10 -> pr2725.11 (diff)
|
|
@shaavan, the updates you've made in response to the comments seem to be well-detailed and focused on improving the PR's clarity and functionality. It's good to see that you've expanded the tests to cover the new behavior thoroughly. Regarding the mention of (\( ⁰⊖⁰)/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 4
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (3 hunks)
- lightning/src/ln/functional_tests.rs (4 hunks)
Additional comments: 6
lightning/src/ln/functional_tests.rs (3)
- 3727-3737: The logic for handling disconnection before funding transaction broadcast is clear. However, the test seems to be disabled. Confirm if this is intentional and if so, provide a reason or a TODO comment for future enablement.
- 10513-10561: The new test
test_channel_close_when_not_timely_acceptedis well-structured and seems to cover the scenario it's designed for. However, ensure that the test is enabled and verify that it passes in the test suite.- 10563-10607: The test
test_rebroadcast_open_channel_when_reconnect_mid_handshakeappears to correctly simulate the scenario of a peer disconnecting and reconnecting mid-handshake. Verify that the test is enabled and that it passes in the test suite.lightning/src/ln/channelmanager.rs (3)
- 895-897: The logic here checks if any channel is in the
FundedorUnfundedOutboundV1phase. Ensure that this logic aligns with the intended behavior of theis_livefunction, especially considering the newUnfundedOutboundV1state.- 8876-8877: The
UnfundedOutboundV1channel phase is set to always return true, which implies that these channels are considered live even if the peer is disconnected. Confirm that this behavior is consistent with the overall system logic and that it won't lead to any unexpected side effects.- 9028-9032: The addition of logic to push a
SendOpenChannelmessage forUnfundedOutboundV1channels is consistent with the PR's objective to allow rebroadcasting if the peer reconnects. Ensure that theget_open_channelfunction generates the correct message and that this behavior is tested.
jkczyz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good other than some comment re-phrasing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (3 hunks)
- lightning/src/ln/functional_tests.rs (4 hunks)
Files skipped from review as they are similar to previous changes (2)
- lightning/src/ln/channelmanager.rs
- lightning/src/ln/functional_tests.rs
|
@TheBlueMatt |
|
Updated from pr2725.12 -> pr2725.13 (diff)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (3 hunks)
- lightning/src/ln/functional_tests.rs (6 hunks)
Files skipped from review as they are similar to previous changes (1)
- lightning/src/ln/channelmanager.rs
Additional comments: 6
lightning/src/ln/functional_tests.rs (6)
- 3727-3739: The logic for handling peer disconnection before funding is broadcasted seems to correctly simulate the disconnection and checks for the expected closure reasons. However, it's important to ensure that the
UNFUNDED_CHANNEL_AGE_LIMIT_TICKSconstant is appropriately defined and used across the test to simulate the timeout accurately.Ensure
UNFUNDED_CHANNEL_AGE_LIMIT_TICKSis defined with a value that accurately represents the intended timeout duration for the test scenario.
- 10515-10559: The test
test_channel_close_when_not_timely_acceptedsimulates a scenario where peers disconnect mid-handshake, and the channel is not timely accepted. The test setup and the disconnection simulation are correctly implemented. However, the assertion that checks the channel's state after disconnection (line 10534) and the assertion for the channel's closure (line 10550) are critical to validate the intended behavior. It's essential to ensure that these assertions accurately reflect the expected state changes in the system under test.Verify that the assertions accurately reflect the expected outcomes and that the test covers all relevant scenarios for the feature being tested.
- 10561-10604: The test
test_rebroadcast_open_channel_when_reconnect_mid_handshakecorrectly simulates a peer disconnection and reconnection mid-handshake. The test ensures that theSendOpenChannelmessage is rebroadcast upon reconnection (lines 10598-10603). This behavior aligns with the PR's objective to improve the robustness of the channel handshake process. However, it's crucial to verify that the rebroadcast logic is implemented as intended in the actual system code and not just within the test environment.Confirm that the rebroadcast logic for the
SendOpenChannelmessage upon peer reconnection is correctly implemented in the system code and not solely within the test.
- 10762-10764: The introduction of the test
test_close_in_funding_batchaims to ensure that if one channel in a batch closes, the entire batch is closed. This test is crucial for validating the robustness of batch processing in channel funding. It's important to ensure that the test setup correctly simulates the batch funding scenario and that the logic for triggering a channel close within the batch is accurately implemented.Ensure the test accurately simulates batch funding scenarios and correctly implements the logic for closing a channel within a batch.
- 10788-10820: The logic within
test_close_in_funding_batchfor force-closing a channel and verifying the closure of all channels in the batch (lines 10794-10820) is critical for ensuring the intended behavior of batch processing. The assertions and checks (lines 10797-10803, 10805-10809, and 10811-10818) are essential for validating the state of the system after a force-close operation. It's important to verify that these checks accurately reflect the expected outcomes and that the test covers all relevant scenarios for batch processing in channel funding.Verify that the assertions and checks within the test accurately reflect the expected outcomes for batch processing in channel funding and that all relevant scenarios are covered.
- 10820-10820: The final assertion in
test_close_in_funding_batchthat checks for the immediate closure of all channels in the batch upon a single channel's force-close (line 10820) is a key part of validating the intended behavior. However, it's crucial to ensure that this behavior aligns with the system's design and that the test accurately reflects the real-world scenario it intends to simulate.Confirm that the immediate closure of all channels in a batch upon a single channel's force-close aligns with the system's design and that the test accurately simulates this scenario.
- Do not remove channel immediately when peer_disconnect, instead removed it after some time if peer doesn't reconnect soon (handled in previous commit). - Do not mark per ok_to_remove if we have some OutboundV1Channels too. - Rebroadcast SendOpenChannel for outboundV1Channel when peer reconnects. - Update the relevant tests to account for the behavior change. - Repurpose the test_disconnect_in_funding_batch to test that all channels in the batch close when one them closes.
- The first test make sure that the OutboundV1Channel is not immediately removed when peers disconnect, but is removed after N timer ticks. - The second test makes sure that the SendOpenChannel is rebroadcasted for the OutboundV1Channel if peer reconnects within time.
|
Updated from pr2725.13 -> pr2725.14 (diff)
Updated from pr2725.14 -> pr2725.15 Range-diff
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (3 hunks)
- lightning/src/ln/functional_tests.rs (6 hunks)
Files skipped from review as they are similar to previous changes (1)
- lightning/src/ln/channelmanager.rs
Additional comments: 7
lightning/src/ln/functional_tests.rs (7)
- 3698-3698: The test description for
test_peer_disconnected_before_funding_broadcastedis clear and sets the context well for what the test aims to achieve. However, ensure that the test implementation fully covers the scenario described, including both the disconnection and the failure to reconnect within the specified time.- 3727-3739: The logic to simulate peer disconnection before funding is broadcasted and to check the channel closure with the appropriate
ClosureReasonis implemented correctly. However, consider adding a comment explaining whyUNFUNDED_CHANNEL_AGE_LIMIT_TICKSis used to simulate the passage of time and its significance in the context of this test.- 10517-10557: The test
test_channel_close_when_not_timely_acceptedcorrectly simulates a scenario where a peer disconnects mid-handshake and checks the state of channels and peer state after a specified time has passed. This test effectively covers the new behavior introduced in the PR. Ensure that the constants used, likeUNFUNDED_CHANNEL_AGE_LIMIT_TICKS, are well-documented and their values are justified within the context of this test.- 10560-10598: The test
test_rebroadcast_open_channel_when_reconnect_mid_handshakeaccurately simulates the scenario of peer disconnection and reconnection during the handshake process. It checks that theSendOpenChannelmessage is rebroadcast upon reconnection, aligning with the PR's objectives. This test is well-structured and covers the critical functionality introduced. Ensure that the test includes assertions for the state of both nodes after reconnection to fully validate the rebroadcast logic.- 10756-10756: The introduction of
test_close_in_funding_batchaims to test the behavior when one of the channels in a batch closes. This is a good addition to ensure that batch processing of channel closures behaves as expected. However, the test description could be expanded to detail the expected behavior of the batch closure process for clarity.- 10782-10813: In
test_close_in_funding_batch, the logic to force-close a channel and check the resulting state, including monitor updates and message events, is implemented correctly. This test effectively validates the behavior when a channel in a funding batch is closed. Ensure that the test also verifies the state of other channels in the batch to confirm that they are affected as expected by the batch closure process.- 10814-10814: The assertion that all channels in the batch should close immediately after one channel is force-closed is a critical part of
test_close_in_funding_batch. This ensures that the batch processing logic is working as intended. Consider adding more detailed assertions to verify the closure reasons for each channel in the batch to ensure they align with the expected outcomes.
| _ => panic!("Unexpected message."), | ||
| } | ||
|
|
||
| // We broadcast the commitment transaction as part of the force-close. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, this is kinda dumb, maybe we should fix that, but its not super critical and certainly unrelated to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super interested in understanding the issue here! And probably might give it a try if it's not a super biggie!
Resolves #2096