-
Notifications
You must be signed in to change notification settings - Fork 10.3k
Network errors when refreshing an app repeatedly #11560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Can you confirm the runtime version you were on? (ye olde |
This seems to be on the latest sdk sdk\3.0.100-preview7-012605\ Host (useful for support): Version: 3.0.0-preview7-27825-01 Commit: afd0301944 |
F5 aborts requests that are in progress so some networking errors are expected. What makes you think there is a bug here? Do you have server logs to go with this? |
Should there be an |
Server error:
|
I don't think it's the abort. Once I get a repro, I can press F5 once and it repros (with no second F5 to interrupt/abort requests) |
|
Wireshark PCAP and Chrome Network Trace: https://microsoft-my.sharepoint.com/:f:/g/personal/anurse_microsoft_com1/EhcT4nYxsfFDqriQY3xDiiUBUws9dh385-BZw_-D41M4Yg?e=qYUlHw (MSFT-internal link) |
Let's try bisecting out the HTTPS changes and the Bedrock changes and see if this repros there. @halter73 can you do that and see if we can either fix it or revert the right thing and try again in preview 8. cc @davidfowl |
What tool do you use to create the gif @javiercn? Is there something built in to Windows yet? |
I ran this with Kestrel in debug and saw the following assertion failure?/NullReferenceException:
We also saw AuthenticationExceptions:
|
This was the assert printing the nullrefex: https://github.com/aspnet/AspNetCore/blob/921dd947b9c54571cc9571ca490ae06bcb526b92/src/Servers/Kestrel/Core/src/Internal/Http2/Http2OutputProducer.cs |
So @jkotalik and I think we might have figured out at least part of what's going on here. With "real" pipes, it's OK to call PipeWriter.Complete() immediately after PipeWriter.FlushAsync() returns a ValueTask, even if the ValueTask returned by FlushAsync() hasn't completed by that point. This comes into play when an HTTP/2 connection is "aborted" which happens after a timeout or any time the client sends a TCP FIN. In this case, Http2Connection.InputOrOutputCompleted() calls Http2FrameWriter.Abort() which acquires the _writeLock and completes the connection-level PipeWriter which was recently made to be a StreamPipeWriter. Http2FrameWriter doesn't not ensure that any previous calls to FlushAsync() have completed, it just ensures no PipeWriter invocations are made in the future for that invocation. StreamPipeWriter, however, assumes that any calls to FlushAsync() have fully completed. Ultimately we need to fix StreamPipeWriter to allow calls to Complete() prior to FlushAsync() completing in order to be as compatible as possible with the "real" PipeWriter. Here's another seemingly-related error that I saw which isn't as easily attributable to an async operation continuing after the PipeWriterCompletion. This time it's an ODE from DiagnosticPoolBlock indicating a use-after-free error (which you would never see in a production app as the name implies). What's weird in this case is that StreamPipeWriter.GetSpan() did not throw an InvalidOperationException for being already completed. Despite this, by the time the ODE was caught, the StreamPipeWriter was completed. On the other hand, StreamPipeWriter.GetSpan() must have completed before the StreamPipeWriter was completed because Http2FrameWriter.WriteDataAsync() was upstack during the StreamPipeWriter.GetSpan() call, and WriteDataAsync must have acquired the _writeLock and checked Http2FrameWriter._completed was false.
I only saw the ODE once so far, and it wasn't possible to correlate the Kestrel debug logs that included the connection id with the StaticFileMiddleware logs which don't because the console logger doesn't include scopes by default. |
The short-term fix is probably to bring back the AdaptedPipeline or something similar. |
This also doesn't explain why clients would send a GOAWAY frame indicating an HTTP/2 protocol violation. I haven't seen that in my testing. That would mean that there had to be an issue prior to the client sending a FIN which I haven't seen yet. Maybe fixing the other issues will either fix or unmask it. |
My current thinking with the GOAWAY errors and the DiagnosticMemoryPool ODEs, though I haven't proven it yet, is that blocks are being returned twice. First by StreamPipeWriter.Complete(), and second by the continuation StreamPipeWriter.FlushAsync(). This then allows the same block to subsequently be leased out twice concurrently to two different pipes leading to corruption. |
Can’t we just not complete the connection level pipe like what the http1connection does? That seems like a sane mitigation. This issue of completing the pipe actually came up when we were discussing the dispose/complete behavior of the StreamPipeWriter but we left it as is unstop further notice. |
That would almost work except for that StreamPipeWriter doesn't return unsent blocks to the pool when Stream.Write/FlushAsync throws even in the case where it's expected. Normally those unreturned blocks are picked up by Kestrel later calling StreamPipeWriter.Complete(). Your comment has me wondering whether Http1Connection now properly returns blocks for canceled HTTPS writes. I suspect not. I don't think we should leave it up to the block finalizers to handle something not-at-all-unusual like this. Furthermore, I don't think we should be using or exposing StreamPipeWriter at all yet if it can fail in a way where it returns the same block twice. That could lead to all sorts of data corruption and information disclosure issues. |
Complete is called at the right time in the pipeline, it just shouldn't be called by Kestrel. Instead it should just await the flush and yield.
Why wouldn't it? I'm not following can you clarify.
Exposing it is how we find real issues (that's what the previews and our tests are for). If blocks can be returned twice then the issue should just but fixed, no need to overreact and not use the type because we found an issue. |
Not exactly. ConnectionContext.DisposeAsync() is called at the right time, and that induces both of our transports to complete the application pipes, but those are the raw pipes that feed into SslStream, not StreamPipeWriter.
I agree we should expose it, but we should fix this issue first. Hopefully we can get it fixed for preview7. StreamPipeWriter is the thing we're exposing to the user, and it provides a super subtle easy way to double-return memory pool blocks. Use-after-free bugs are super hard to diagnose, and I don't want to have to do it more than once because of this issue. |
How would this issue happen if callers are properly awaiting flushes before calling complete?
Don't forget it's already exposed on both the HttpContext and in the lesser used Kestrel middleware pipeline. We need to fix the issue, not panic. Reverting the change in the HttpContext layer would be expensive and the wrong fix at this point. PS: Nobody wants use after free bugs, especially in managed code. |
I'm still describing the hypothetical world you suggested where we change Http2Connection not to call complete like Http1Connection. The problem is that Http1Connection just leaves it to ConnectionDispatcher.Execute() to call ConnectionContext.DisposeAsync() and hopes that cleans everything up. In the case where you don't adapt the ConnectionContext with StreamPipeWriter, it does clean up all the default pipes. The problem is that Socket/LibuvConnection.DisposeAsync() (i.e. ConnectionContext.DisposeAsync) complete the non-adapted "Transport" pipes, so the StreamPipeWriter never gets completed if we do what you suggest in your hypothetical.
See above. |
Except that;'s not how it works, when middleware adapts the transport it is responsbile for completing the pipes it creates. This is how the HttpsConnectionMiddleware works. The chain looks like this: dispatcher -> httpsmiddleware -> httpmiddleware httpmiddleware yields from it's delegate then the httpsmiddleware which cleans up the SslStream and disposes the SslDuplexPipe (which completes the StreamPipeReader and StreamPipeWriter). It looks like this (SslStream.DisposeAsync -> DuplexPipeStreamAdapter.DisposeAsync -> Input.Complete, Output.Complete) . The httpsmiddleware then restores the original transport pipe on the connection and yields control to the dispatcher which calls ConnectionContext.DisposeAsync which properly cleans up the transport pipes. |
Thanks for the detailed call-chain. That actually helps a lot. My brain is still a bit stuck in the old ConnectionAdapter ways of last week 😆, but this definitely the way things should be now that middleware directly adapts pipes instead of streams. I now completely agree that we don't need to be calling PipeWriter.Complete() in Http2FrameWriter anymore. And I think Http1Connection is fine too. There can still be issues with app code that doesn't properly await writes/flushes, but that shouldn't be a huge concern since we can consider that user error. If we really want to make it safe, we could wait the TimingPipeFlusher task ourselves before letting the HTTP middleware complete. Either way, we should fix StreamPipeWriter. |
I'm not sure the double-return is a real issue anymore either. Since BufferSegment.ResetMemory() clears its _memoryOwner field, a second call to ResetMemory() should only be a problem if the BufferSegment itself was leased again and its _memoryOwner field was set to a non-null value before the extraneous call to ResetMemory(). But if the StreamPipeWriter is already completed, there shouldn't be any subsequent attempts to lease a BufferSegment. |
My guess is that there's overlapping operations going on that ends up causing corruption in the StreamPipeWriter (it has no locks). Here's a trace from when things go bad:
This is the same connection and you can see we end up interleaving GetSpan/Advance and FlushAsync on the same connection. |
So, what's the plan for preview 7? Is there a tactical quick-fix we can do or is this going to require more in-depth work? |
Yeah, the quick fix is to bring back the adapted pipeline here: 25d5688#diff-1892f694d558e36b8367aad39ce40046L92 to start using pipes again. I need to sync with these two today to check on the status of this as it seems @davidfowl PR didn't work. |
I have a branch here already https://github.com/aspnet/AspNetCore/tree/davidfowl/back-to-pipes. |
I agree that the StreamPipeWriter should track whether there's a currently-ongoing flush and reject any other API calls until the flush completes. While this will probably break existing PipeWriter consumers when we switch from DefaultPipeWriter to StreamPipeWriter, at least the error should be easily observed and fixed. In the case of Http2FrameWriter, it looks like we're going to need to use a task queue or something similar for all PipeWriter API calls, not just the FlushAsync calls. |
The TimingPipeFlusher used by Http2FrameWriter and a few other types in order to support unawaited calls to Stream.WriteAsync() on top of a PipeWriter is completely busted if it wraps anything other than the DefaultPipeWriter. While the TimingPipeFlusher prevents overlapping calls to FlushAsync, it does not prevent any other PipeWriter API calls (e.g. GetSpan() and Advance()) from being called in parallel. That's because after the TimingPipeFlusher sees the last flush wasn't awaited, it releases Http2FrameWriter's writeLock. When the previous flush completes, the TimingPipeFlusher will then acquire its own lock to start the next flush. This prevents any other flushes from happening simultaneously, but it doesn't prevent the Http2FrameWriter from re-acquiring its own lock and calling GetSpan/Advance/etc during a call to flush. Of course, would could just put all PipeWriter API calls and their associated arguments into a queue, so that API calls that were initiated from a bunch of different parallel HTTP/2 streams, always happen in a valid order. If we want to do that though, likely the easiest and most efficient way to queue all the PipeWriter API calls is to just have Http2FrameWriter put a DefaultPipeWriter in front of the ConnectionContext's PipeWriter. We could probably optimize the case where the ConnectionContext's PipeWriter is a DefaultPipeWriter. |
I’m going to look into making StreamPipeWriter work with the overlapping calls. It shouldn’t be too hard actually (we have a bunch of this logic in pipe today but it might be easier since we don’t have to mess with as much state). We just need to make sure the linked list pointers are updated appropriately. |
This helps if the ConnectionContext's PipeWriter is a StreamPipeWriter, but not if it's any other custom PipeWriter implementation. We should be careful to only use user-replaceable PipeWriter's in the normal way. That means awaiting FlushAsync calls before making any other PipeWriter call other than CancelPendingFlush(). |
I created a blazorserverside app using the latest sdk from core-sdk
dotnet new blazorserverside
dotnet run
Open chrome
Press F5 repeatedly at some point a request will fail and the UI will change.
The text was updated successfully, but these errors were encountered: