Skip to content

Complete pending write operations before canceling io. #628

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

cwgreene
Copy link

@cwgreene cwgreene commented Oct 27, 2022

PR Summary

Resolve Early Cancellation of pipelines.
PowerShell/Win32-OpenSSH#2012

PR Context

As described in PowerShell/Win32-OpenSSH#2012 when a process such as git uses ssh as a channel, WinSSH will prematurely terminate before flushing all data to the pipe. This results in the error Early EOF.

To fix this issue, when closing the pipe, we first complete write operations to it before cancelling IO on it, rather than afterwards (tbh the current logic makes no sense). That said, I believe a simpler way of addressing the issue is simply to remove the IO cancel completely, though this may make the process slightly less responsive to Ctrl-C.

@cwgreene
Copy link
Author

cwgreene commented Oct 27, 2022

@microsoft-github-policy-service agree company="The Trade Desk"

Comment on lines 272 to 275
// This call is likely racy, there's no guarantee
// that a thread has begun IO operations when it's called.
// Why stop io operations when we're going to close it anyhow?
CancelIoEx(WINHANDLE(pio), NULL);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, this comment is mostly here because I'm unsure about what CancelIoEx is supposed to be doing here. If it is needed, I think its not doing what its supposed to be doing because of the possible race condition. If it is not needed, then I believe that this PR can be simplified to just removing it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will take the change as it is. You can delete this comment.

@TBBle
Copy link

TBBle commented Oct 29, 2022

microsoft-github-policy-service agree [company="The Trade Desk"]

Missing an @?

@cwgreene
Copy link
Author

cwgreene commented Oct 30, 2022

microsoft-github-policy-service agree [company="The Trade Desk"]

Missing an @?

Thanks @TBBle misinterpreted your comment there the first time :) Fixed, thank you.

@cwgreene
Copy link
Author

Current error is:

"The RPC server is unavailable"

I do not believe that this is something which I have the power to correct, let me know if I am in error and I will dig into this more. :)

@tgauth
Copy link
Collaborator

tgauth commented Oct 31, 2022

Current error is:

"The RPC server is unavailable"

I do not believe that this is something which I have the power to correct, let me know if I am in error and I will dig into this more. :)

Might be an intermittent AppVeyor thing. Could you push another (empty) commit, or close then re-open the PR to re-trigger AppVeyor?

@cwgreene cwgreene closed this Nov 3, 2022
@cwgreene cwgreene reopened this Nov 3, 2022
@vthiebaut10 vthiebaut10 self-requested a review November 17, 2022 18:28
@vvuk
Copy link

vvuk commented Nov 29, 2022

Note -- I can confirm that this patch seems to fix the early-eof issue that's been plaguing git + windows-openssh. At least, it's fixed with all 3 git pulls that I was able to consistently reproduce the issue with with an older version of ssh. It would be great to get this merged ASAP and a new version of openssh published, windows git users would be very grateful :)

@cwgreene
Copy link
Author

I am happy to make any changes, or update comments. The main reason for the comments was that I was seeking clarification. Let me know if there are any updates that would be desired.

@cwgreene
Copy link
Author

Hallo! Pinging @tgauth. Any advice on next steps?

@vthiebaut10
Copy link
Collaborator

Hi @cwgreene - I'm sorry for the delay. I'll take some time reviewing your PR next Community Day on Monday.

@vthiebaut10
Copy link
Collaborator

@cwgreene - Just an update: I've been reviewing this PR and I understand how it fixes the race condition that is happening when cloning a large repo with git. But because this change has the potential to cause a regression in other scenarios, I need to spend some more time investigating it before merging it. This will probably take a while, but hopefully we can finalize this review before the next release.

@cwgreene
Copy link
Author

@vthiebaut10 understood. Is there anything that I can do on my end to facilitate that?

@cwgreene
Copy link
Author

What are the scenarios which we should consider that are at risk of regression?

@smuehlst
Copy link

smuehlst commented Feb 3, 2023

I built my own copy of ssh.exe from @cwgreene's branch, and I also can confirm that it fixes the fetch-pack: unexpected disconnect while reading sideband packet error for me where git pull with the standard Windows ssh.exe fails.

@vvuk
Copy link

vvuk commented Mar 1, 2023

There was a beta release last week -- why wasn't this included? I understand that maintaining a project like this is complex, but this issue has been outstanding for years (since 2019!), and @cwgreene finally provided a resolution. This should have been the first fix to go into a new beta release. What's the hold up here?

@cwgreene
Copy link
Author

@vthiebaut10 Hello! Happy Pi Day! Is there anything I can do to assist in moving this along?

@gruselglatz
Copy link

#667

@vthiebaut10
Copy link
Collaborator

I'm sorry for the late response. Our team is investigating this issue, and it seems to be more complicated than initially thought. We will update this issue as soon as we are more certain of the correct approach for resolving this issue.

@cwgreene
Copy link
Author

What have you found out so far?

@PaulHigin
Copy link

@cwgreene We are concerned about this change because performing a wait before the CancelIOEx() function could cause a hang or other regressions. Our understanding of this function as an implementation of POSIX fd close is that it should abruptly terminate any queued or pending operations (see linux man page for close). This is what the function currently does, which we think is correct, so we are now thinking that there may be a different problem with our async write and select implementation. We are investigating but have no new information yet.

@cwgreene
Copy link
Author

cwgreene commented Mar 24, 2023

Hi @PaulHigin! Thanks for the information.

It's been several months since I looked at this, so I'll review, however, I believe that in the case of a pipe (which is how git is communicating with ssh), as long as the pipe isn't full, the kernel will always complete the the 'write' operation to the pipe before returning from write even in O_NONBLOCK mode (which is set on pipes, and I seem to recall that this code path only gets involved when O_NONBLOCK is set).

On the close man page

https://man7.org/linux/man-pages/man2/close.2.html

I don't see anything says that close will immediately halt any pending write operations. It does mention that operations on a regular file are not guaranteed to flush; however, a pipe is not a regular file.

Do we need special casing logic here to handle pipes?

@vvuk
Copy link

vvuk commented Mar 24, 2023

FWIW, my understanding is the same as @cwgreene's. write has no async beheaviour itself, whenever it returns all operations are complete (or would block). So the only issue is if one thread calls close() on a fd that another thread is in the middle of a write() operation to. If this is actually happening, it's inherently a race -- it's going to be valid to implement that any in-progress write would complete before a close (because that's what might happen much of the time during the race anyway).

@cwgreene
Copy link
Author

Note also, if we look a little further up the call stack to this function we find this:

	if (pio->type == SOCK_FD)
		r = socketio_close(pio);
	else
		r = fileio_close(pio);		 // This is path we take here

So I don't believe changing this code path should affect network operations, as sockets are handled differently. This should reduce the cases we need to think about in determining the blast radius.

@PaulHigin
Copy link

@cwgreene I spent some more time looking into this and I see your point about this only affecting pipe fds. Also, I notice that higher layer close functions do perform a pending operation drain, so I am thinking it is the right thing to do it here also with this fix. I just wonder if we should also drain any pending read operations. I think the 'CancelOperationEx' needs to be called to ensure the file close completes.

Copy link
Collaborator

@vthiebaut10 vthiebaut10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cwgreene Thank you for your patience. We will be merging your change and it will be available in the next release. I'm leaving a few comments that I'd like you to consider before we merge.
Thanks!

WaitForSingleObject(pio->write_overlapped.hEvent, INFINITE);

/* drain queued APCs */
SleepEx(0, TRUE);
Copy link
Collaborator

@vthiebaut10 vthiebaut10 Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why you added this call to SleepEx to this if statement block?

@@ -260,6 +260,18 @@ int
syncio_close(struct w32_io* pio)
{
debug4("syncio_close - pio:%p", pio);

/* Flush descriptor.*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar justification as:

/*
* we report to POSIX app that an async write has completed as soon its
* copied to internal buffer. The app may subsequently try to close the
* fd thinking everything is written. IF the Windows handle is closed
* now, the pipe/file io write operation may terminate prematurely.
* To compensate for the discrepency
* wait here until async write has completed.
* If you see any process waiting here indefinitely - its because no one
* is draining from other end of the pipe/file. This is an unfortunate
* consequence that should otherwise have very little impact on practical
* scenarios.
*/

Suggested change
/* Flush descriptor.*/
/*
* Wait for io write operation that is called by worker thread to terminate
* to avoid the write operation being terminated prematurely by CancelIoEx.
* If you see any process waiting here indefinitely - its because no one
* is draining from other end of the pipe. This is an unfortunate
* consequence that should otherwise have very little impact on practical
* scenarios.
*/

Feel free to modify this comment with any details that you think is relevant.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is similar to alternate path code for the file close operation, so I think it makes sense have it here as well, since we are using the same wait pattern.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was basically to perform the same logic before and after. My understanding/recollection is the sleep is actually needed to ensure that the queue actually gets drained.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering if the queue needs to be drained immediately after the write operation terminates, or if we can just have one SleepEx call once both the read and write operation have terminated. I don't think it hurts to have it there, so I am good to merge as is. But I'm just wondering if it's needed.

Comment on lines 272 to 275
// This call is likely racy, there's no guarantee
// that a thread has begun IO operations when it's called.
// Why stop io operations when we're going to close it anyhow?
CancelIoEx(WINHANDLE(pio), NULL);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will take the change as it is. You can delete this comment.

@PaulHigin
Copy link

@vthiebaut10 I agree with this change overall, so if @cwgreene is unavailable to make these extra minor comment change suggestions, I think we can just make them ourselves and perform the merge.

@cwgreene
Copy link
Author

cwgreene commented Apr 5, 2023

Hi! I can remove the comment (thank you for responding to it :) ) in a few hours if that works for you, otherwise, happy with the above recommnedations. :)

@cwgreene
Copy link
Author

cwgreene commented Apr 5, 2023

Actually, one moment I can remove the comment now.

@cwgreene
Copy link
Author

cwgreene commented Apr 5, 2023

Updated comment to use vthiebaut10 wording.

@PaulHigin PaulHigin merged commit 33a141f into PowerShell:latestw_all Apr 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants