-
Notifications
You must be signed in to change notification settings - Fork 333
Complete pending write operations before canceling io. #628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Complete pending write operations before canceling io. #628
Conversation
@microsoft-github-policy-service agree company="The Trade Desk" |
contrib/win32/win32compat/termio.c
Outdated
// This call is likely racy, there's no guarantee | ||
// that a thread has begun IO operations when it's called. | ||
// Why stop io operations when we're going to close it anyhow? | ||
CancelIoEx(WINHANDLE(pio), NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, this comment is mostly here because I'm unsure about what CancelIoEx is supposed to be doing here. If it is needed, I think its not doing what its supposed to be doing because of the possible race condition. If it is not needed, then I believe that this PR can be simplified to just removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will take the change as it is. You can delete this comment.
Missing an |
Thanks @TBBle misinterpreted your comment there the first time :) Fixed, thank you. |
Current error is:
I do not believe that this is something which I have the power to correct, let me know if I am in error and I will dig into this more. :) |
Might be an intermittent AppVeyor thing. Could you push another (empty) commit, or close then re-open the PR to re-trigger AppVeyor? |
Note -- I can confirm that this patch seems to fix the early-eof issue that's been plaguing git + windows-openssh. At least, it's fixed with all 3 git pulls that I was able to consistently reproduce the issue with with an older version of ssh. It would be great to get this merged ASAP and a new version of openssh published, windows git users would be very grateful :) |
I am happy to make any changes, or update comments. The main reason for the comments was that I was seeking clarification. Let me know if there are any updates that would be desired. |
Hallo! Pinging @tgauth. Any advice on next steps? |
Hi @cwgreene - I'm sorry for the delay. I'll take some time reviewing your PR next Community Day on Monday. |
@cwgreene - Just an update: I've been reviewing this PR and I understand how it fixes the race condition that is happening when cloning a large repo with git. But because this change has the potential to cause a regression in other scenarios, I need to spend some more time investigating it before merging it. This will probably take a while, but hopefully we can finalize this review before the next release. |
@vthiebaut10 understood. Is there anything that I can do on my end to facilitate that? |
What are the scenarios which we should consider that are at risk of regression? |
I built my own copy of ssh.exe from @cwgreene's branch, and I also can confirm that it fixes the |
There was a beta release last week -- why wasn't this included? I understand that maintaining a project like this is complex, but this issue has been outstanding for years (since 2019!), and @cwgreene finally provided a resolution. This should have been the first fix to go into a new beta release. What's the hold up here? |
@vthiebaut10 Hello! Happy Pi Day! Is there anything I can do to assist in moving this along? |
I'm sorry for the late response. Our team is investigating this issue, and it seems to be more complicated than initially thought. We will update this issue as soon as we are more certain of the correct approach for resolving this issue. |
What have you found out so far? |
@cwgreene We are concerned about this change because performing a wait before the CancelIOEx() function could cause a hang or other regressions. Our understanding of this function as an implementation of POSIX fd close is that it should abruptly terminate any queued or pending operations (see linux man page for close). This is what the function currently does, which we think is correct, so we are now thinking that there may be a different problem with our async write and select implementation. We are investigating but have no new information yet. |
Hi @PaulHigin! Thanks for the information. It's been several months since I looked at this, so I'll review, however, I believe that in the case of a pipe (which is how git is communicating with ssh), as long as the pipe isn't full, the kernel will always complete the the 'write' operation to the pipe before returning from On the close man page https://man7.org/linux/man-pages/man2/close.2.html I don't see anything says that close will immediately halt any pending write operations. It does mention that operations on a regular file are not guaranteed to flush; however, a pipe is not a regular file. Do we need special casing logic here to handle pipes? |
FWIW, my understanding is the same as @cwgreene's. |
Note also, if we look a little further up the call stack to this function we find this:
So I don't believe changing this code path should affect network operations, as sockets are handled differently. This should reduce the cases we need to think about in determining the blast radius. |
@cwgreene I spent some more time looking into this and I see your point about this only affecting pipe fds. Also, I notice that higher layer close functions do perform a pending operation drain, so I am thinking it is the right thing to do it here also with this fix. I just wonder if we should also drain any pending read operations. I think the 'CancelOperationEx' needs to be called to ensure the file close completes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cwgreene Thank you for your patience. We will be merging your change and it will be available in the next release. I'm leaving a few comments that I'd like you to consider before we merge.
Thanks!
WaitForSingleObject(pio->write_overlapped.hEvent, INFINITE); | ||
|
||
/* drain queued APCs */ | ||
SleepEx(0, TRUE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why you added this call to SleepEx to this if statement block?
contrib/win32/win32compat/termio.c
Outdated
@@ -260,6 +260,18 @@ int | |||
syncio_close(struct w32_io* pio) | |||
{ | |||
debug4("syncio_close - pio:%p", pio); | |||
|
|||
/* Flush descriptor.*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar justification as:
openssh-portable/contrib/win32/win32compat/fileio.c
Lines 1089 to 1100 in cdee736
/* | |
* we report to POSIX app that an async write has completed as soon its | |
* copied to internal buffer. The app may subsequently try to close the | |
* fd thinking everything is written. IF the Windows handle is closed | |
* now, the pipe/file io write operation may terminate prematurely. | |
* To compensate for the discrepency | |
* wait here until async write has completed. | |
* If you see any process waiting here indefinitely - its because no one | |
* is draining from other end of the pipe/file. This is an unfortunate | |
* consequence that should otherwise have very little impact on practical | |
* scenarios. | |
*/ |
/* Flush descriptor.*/ | |
/* | |
* Wait for io write operation that is called by worker thread to terminate | |
* to avoid the write operation being terminated prematurely by CancelIoEx. | |
* If you see any process waiting here indefinitely - its because no one | |
* is draining from other end of the pipe. This is an unfortunate | |
* consequence that should otherwise have very little impact on practical | |
* scenarios. | |
*/ |
Feel free to modify this comment with any details that you think is relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is similar to alternate path code for the file close operation, so I think it makes sense have it here as well, since we are using the same wait pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this was basically to perform the same logic before and after. My understanding/recollection is the sleep is actually needed to ensure that the queue actually gets drained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just wondering if the queue needs to be drained immediately after the write operation terminates, or if we can just have one SleepEx call once both the read and write operation have terminated. I don't think it hurts to have it there, so I am good to merge as is. But I'm just wondering if it's needed.
contrib/win32/win32compat/termio.c
Outdated
// This call is likely racy, there's no guarantee | ||
// that a thread has begun IO operations when it's called. | ||
// Why stop io operations when we're going to close it anyhow? | ||
CancelIoEx(WINHANDLE(pio), NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will take the change as it is. You can delete this comment.
@vthiebaut10 I agree with this change overall, so if @cwgreene is unavailable to make these extra minor comment change suggestions, I think we can just make them ourselves and perform the merge. |
Hi! I can remove the comment (thank you for responding to it :) ) in a few hours if that works for you, otherwise, happy with the above recommnedations. :) |
Actually, one moment I can remove the comment now. |
Updated comment to use vthiebaut10 wording. |
PR Summary
Resolve Early Cancellation of pipelines.
PowerShell/Win32-OpenSSH#2012
PR Context
As described in PowerShell/Win32-OpenSSH#2012 when a process such as git uses ssh as a channel, WinSSH will prematurely terminate before flushing all data to the pipe. This results in the error
Early EOF
.To fix this issue, when closing the pipe, we first complete write operations to it before cancelling IO on it, rather than afterwards (tbh the current logic makes no sense). That said, I believe a simpler way of addressing the issue is simply to remove the IO cancel completely, though this may make the process slightly less responsive to Ctrl-C.