Crash: precondition failure in RequestBag+StateMachine.swift #576

Closed
rausnitz opened this issue Apr 7, 2022 · 15 comments · Fixed by #577

rausnitz commented Apr 7, 2022

We're on version 1.9.0 and we saw hundreds of these crashes in our server logs today. It looks like it started happening in our logs more than a week ago but suddenly escalated today.

We tried rolling back to 1.8.2 but that did not resolve the issue.

0x5561fde6eed2, Backtrace.(printBacktrace in _B82A8C0ED7C904841114FDF244F9E58E)(signal: Swift.Int32) -> () at /app/.build/checkouts/swift-backtrace/Sources/Backtrace/Backtrace.swift:66
0x7f108b62497f
0x5561fdd4ab60, Swift runtime failure: precondition failure at /app/.build/checkouts/async-http-client/Sources/AsyncHTTPClient/RequestBag+StateMachine.swift:0
0x5561fdd4ab60, AsyncHTTPClient.RequestBag.StateMachine.resumeRequestBodyStream() -> AsyncHTTPClient.RequestBag<A>.StateMachine.ResumeProducingAction at /app/.build/checkouts/async-http-client/Sources/AsyncHTTPClient/RequestBag+StateMachine.swift:130
0x5561fdd52166, AsyncHTTPClient.RequestBag.(resumeRequestBodyStream0 in _828EC98A744D86D088B343098CA6B2A6)() -> () at /app/.build/checkouts/async-http-client/Sources/AsyncHTTPClient/RequestBag.swift:102
0x5561fe5668bb, reabstraction thunk helper from @escaping @callee_guaranteed () -> () to @escaping @callee_guaranteed () -> (@out ()) at /app/<compiler-generated>:0
0x5561fe5668bb, partial apply forwarder for reabstraction thunk helper from @escaping @callee_guaranteed () -> () to @escaping @callee_guaranteed () -> (@out ()) at /app/<compiler-generated>:0
0x5561fe560110, reabstraction thunk helper from @escaping @callee_guaranteed () -> (@out ()) to @escaping @callee_guaranteed () -> () at /app/<compiler-generated>:0
0x5561fe560110, closure #4 () -> () in NIOPosix.SelectableEventLoop.run() throws -> () at /app/.build/checkouts/swift-nio/Sources/NIOPosix/SelectableEventLoop.swift:520
0x5561fe560110, reabstraction thunk helper from @callee_guaranteed () -> (@error @owned Swift.Error) to @escaping @callee_guaranteed () -> (@out (), @error @owned Swift.Error) at /app/<compiler-generated>:0
0x5561fe560110, generic specialization <()> of NIOPosix.withAutoReleasePool<A>(() throws -> A) throws -> A at /app/.build/checkouts/swift-nio/Sources/NIOPosix/SelectableEventLoop.swift:28
0x5561fe560110, NIOPosix.SelectableEventLoop.run() throws -> () at /app/.build/checkouts/swift-nio/Sources/NIOPosix/SelectableEventLoop.swift:519
0x5561fe54b6c4, static NIOPosix.MultiThreadedEventLoopGroup.(runTheLoop in _C2B1528F4FBA68A3DBFA89DBAEBE9D4D)(thread: NIOPosix.NIOThread, parentGroup: Swift.Optional<NIOPosix.MultiThreadedEventLoopGroup>, canEventLoopBeShutdownIndividually: Swift.Bool, selectorFactory: () throws -> NIOPosix.Selector<NIOPosix.NIORegistration>, initializer: (NIOPosix.NIOThread) -> (), _: (NIOPosix.SelectableEventLoop) -> ()) -> () at /app/.build/checkouts/swift-nio/Sources/NIOPosix/MultiThreadedEventLoopGroup.swift:89
0x5561fe54b6c4, closure #1 (NIOPosix.NIOThread) -> () in static NIOPosix.MultiThreadedEventLoopGroup.(setupThreadAndEventLoop in _C2B1528F4FBA68A3DBFA89DBAEBE9D4D)(name: Swift.String, parentGroup: NIOPosix.MultiThreadedEventLoopGroup, selectorFactory: () throws -> NIOPosix.Selector<NIOPosix.NIORegistration>, initializer: (NIOPosix.NIOThread) -> ()) -> NIOPosix.SelectableEventLoop at /app/.build/checkouts/swift-nio/Sources/NIOPosix/MultiThreadedEventLoopGroup.swift:110
0x5561fe54e20e, partial apply forwarder for closure #1 (NIOPosix.NIOThread) -> () in static NIOPosix.MultiThreadedEventLoopGroup.(setupThreadAndEventLoop in _C2B1528F4FBA68A3DBFA89DBAEBE9D4D)(name: Swift.String, parentGroup: NIOPosix.MultiThreadedEventLoopGroup, selectorFactory: () throws -> NIOPosix.Selector<NIOPosix.NIORegistration>, initializer: (NIOPosix.NIOThread) -> ()) -> NIOPosix.SelectableEventLoop at /app/<compiler-generated>:0
0x5561fe54e509, reabstraction thunk helper from @escaping @callee_guaranteed (@guaranteed NIOPosix.NIOThread) -> () to @escaping @callee_guaranteed (@in_guaranteed NIOPosix.NIOThread) -> (@out ()) at /app/<compiler-generated>:0
0x5561fe54e22d, partial apply forwarder for reabstraction thunk helper from @escaping @callee_guaranteed (@guaranteed NIOPosix.NIOThread) -> () to @escaping @callee_guaranteed (@in_guaranteed NIOPosix.NIOThread) -> (@out ()) at /app/<compiler-generated>:0
0x5561fe57abb5, closure #1 (Swift.Optional<Swift.UnsafeMutableRawPointer>) -> Swift.Optional<Swift.UnsafeMutableRawPointer> in static NIOPosix.ThreadOpsPosix.run(handle: inout Swift.Optional<Swift.UInt>, args: NIOPosix.Box<(body: (NIOPosix.NIOThread) -> (), name: Swift.Optional<Swift.String>)>, detachThread: Swift.Bool) -> () at /app/.build/checkouts/swift-nio/Sources/NIOPosix/ThreadPosix.swift:105
0x7f108b6196da
0x7f1089b2661e
0xffffffffffffffff

Any ideas on how we can avoid this or debug it further?

Lukasa (Collaborator) commented Apr 8, 2022

Hmm, this appears to be happening due to overzealous policing of the .finished state in the request bag state machine. Specifically, if we resume the request body stream but have already failed the request, we'll crash. The crash is reasonable in some cases, but it's probably overzealous here.

As a temporary workaround you can try disabling HTTP/2 support to see if it's only occurring in the HTTP/2 path, but if it's occurring in the HTTP/1 path then there may be other issues. Do you have logging enabled? It would be useful to see the logs to see the specific trigger for these issues, as I think this can only happen if we have hit a different error in the request pipeline already.
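
For reference, forcing HTTP/1.1 for the whole client should look roughly like this (a sketch against the 1.9.x configuration API; adapt it to however your HTTPClient is constructed):

```swift
import AsyncHTTPClient

// Sketch only: force the client to negotiate HTTP/1.1, assuming the
// Configuration.httpVersion setting that shipped alongside HTTP/2 support.
var configuration = HTTPClient.Configuration()
configuration.httpVersion = .http1Only

let httpClient = HTTPClient(
    eventLoopGroupProvider: .createNew,
    configuration: configuration
)

// ... make requests as usual ...

try httpClient.syncShutdown()
```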

Lukasa (Collaborator) commented Apr 8, 2022

This appears to be a bit of a window condition. Out of curiosity, do you ever call .cancel on the HTTPClient.Task object?
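
(For context, I mean the cancel() on the task handle returned by execute, roughly as in the sketch below; the request URL and the use of ResponseAccumulator are just placeholders.)

```swift
import AsyncHTTPClient

// Sketch of the call in question; the URL and delegate stand in for
// whatever the application actually uses.
let request = try HTTPClient.Request(url: "https://example.com/upload")
let task = httpClient.execute(
    request: request,
    delegate: ResponseAccumulator(request: request)
)

// Cancelling the in-flight task is the code path being asked about here.
task.cancel()
```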

nkarnik commented Apr 8, 2022

@Lukasa Zach can correct me but I don't think we ever call .cancel

rausnitz (Author) commented Apr 8, 2022

> @Lukasa Zach can correct me but I don't think we ever call .cancel

Confirmed, cancel() does not appear anywhere in our codebase.

Will reply to your earlier comment ASAP.

Lukasa (Collaborator) commented Apr 8, 2022

Calling .cancel is the only way through the code I can confidently say can trigger this path, but that's sufficiently good to allow me to write a test that hits the crash, and I'll use it as a justification for defensive programming as well.

rausnitz (Author) commented Apr 8, 2022

> Hmm, this appears to be happening due to overzealous policing of the .finished state in the request bag state machine. Specifically, if we resume the request body stream but have already failed the request, we'll crash. The crash is reasonable in some cases, but it's probably overzealous here.
>
> As a temporary workaround you can try disabling HTTP/2 support to see if it's only occurring in the HTTP/2 path, but if it's occurring in the HTTP/1 path then there may be other issues. Do you have logging enabled? It would be useful to see the logs to see the specific trigger for these issues, as I think this can only happen if we have hit a different error in the request pipeline already.

I set the log level to DEBUG, but I'm not sure which logs would be useful to share. Our backend serves many simultaneous requests, and this crash happens many times per minute at its peak. I haven't been able to isolate the logs related to the crash, and I'd need to scrub them of sensitive information before sharing here, so I'll have to narrow down which logs to share to make that practical.

Lukasa (Collaborator) commented Apr 8, 2022

I think we'd need to go to TRACE in order to see the appropriate logs. Unfortunately filtering is going to be hard: because the crash happens after an event loop hop it's not easy to work out exactly which request has hit the crash, so we can't filter by request ID.
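
Passing a trace-level logger per request should look roughly like this (a sketch using swift-log; the label and URL are placeholders):

```swift
import AsyncHTTPClient
import Logging

// Sketch: a trace-level swift-log Logger passed per request so the client
// emits its most verbose diagnostics. The label here is arbitrary.
var logger = Logger(label: "ahc-trace")
logger.logLevel = .trace

let httpClient = HTTPClient(eventLoopGroupProvider: .createNew)
let request = try HTTPClient.Request(url: "https://example.com/")
let response = try httpClient.execute(request: request, logger: logger).wait()
print(response.status)
try httpClient.syncShutdown()
```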

Out of curiosity, has it been possible to try disabling HTTP/2?

Lukasa (Collaborator) commented Apr 8, 2022

You can also try using the branch from #577 to see how that affects your issue.

rausnitz (Author) commented Apr 8, 2022

Sounds good, will change to TRACE. We have an isolated backend environment that I can point my frontend environment to, so I'll try to reproduce the crash there shortly. If that works, the logs will be more useful because my requests will be the only ones in the logs.

I haven't tried disabling HTTP/2 yet. The DEBUG logs show requests to certain external services using HTTP/2 while others use HTTP/1.1, and I haven't been able to test the effect of disabling HTTP/2 outside of prod yet.

DEBUG  [ DEBUG ] connection closed [ahc-connection-id: 18, ahc-http-version: http/2, ahc-pool-key: https://{{removed}}:443 TLS-hash: -5432264542835967314] (AsyncHTTPClient/ConnectionPool/HTTPConnectionPool.swift:517)

DEBUG  [ DEBUG ] connection closed [ahc-connection-id: 27, ahc-http-version: http/1.1, ahc-pool-key: https://{{removed}}:443 TLS-hash: -5432264542835967314] (AsyncHTTPClient/ConnectionPool/HTTPConnectionPool.swift:474)

rausnitz (Author) commented Apr 8, 2022

@Lukasa I think we're in good shape now. I deployed to production using #577 about 30 minutes ago and have not seen a crash since. No logs since that deploy contain `Received signal 4. Backtrace`, compared to hundreds of logs mentioning it in the prior hour. I'll continue to monitor.

Do you have any guesses about where this behavior might have come from? As far as we can tell, it was not correlated with any release of our application, and we haven't made changes to our use of the client or general configuration recently. Could unexpected behavior by an external service (that our application's HTTP client is sending requests to) cause this to occur?

Thanks so much for your help with this!

Lukasa (Collaborator) commented Apr 8, 2022

I suspect an external service, yes.

The specific theory I have is that this is a timing window. RequestBag is local to the Task event loop, so all information it receives about the underlying Channel state requires an event loop hop. That means that it can race with information from the Task event loop, or from any other thread in the program. In this instance, we might ask to resume the body after we've decided we don't need to upload it for some other reason.

The thing that gives me pause is that I don't think there are many things that can fail a Task in this way. Aside from cancel, the only other theory I have is that this might happen if there's an error in a streaming RequestBody. But even that seems unlikely. So without logs it's hard to be totally sure what the cause is.

However, the precondition itself was a bit aggressive, so the fact that this can fail in even one way makes it fairly likely that it can fail in others.
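
To make the shape of that window concrete, here is a deliberately simplified sketch (none of these names come from the actual RequestBag code; they only illustrate the pattern):

```swift
import NIOCore

// Simplified sketch of the timing window described above.
final class BodyStreamCoordinator {
    enum State { case running, finished }

    private var state: State = .running
    private let taskEventLoop: EventLoop

    init(on eventLoop: EventLoop) {
        self.taskEventLoop = eventLoop
    }

    // Called from the channel's thread: every event must hop onto the
    // task event loop before it may touch `state`.
    func channelWantsBodyResume() {
        self.taskEventLoop.execute {
            switch self.state {
            case .running:
                // Start producing the request body.
                break
            case .finished:
                // The request was failed or cancelled while this hop was in
                // flight. A hard precondition here turns a benign race into
                // a crash; tolerating it simply drops the late resume.
                break
            }
        }
    }

    // Called on the task event loop, e.g. by cancel() or an earlier error.
    func fail() {
        self.state = .finished
    }
}
```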

Lukasa (Collaborator) commented Apr 8, 2022

Oh, and to clarify, if the other service's timing has changed that can affect the size of this window.

rausnitz (Author) commented Apr 8, 2022

Thanks for explaining that.

Two hours since the deploy and we continue to be crash-free.

Sorry to say I wasn't able to reproduce the crash in a separate environment, so this will probably remain a mystery, but if we learn anything more I'll post it here.

fabianfett pushed a commit that referenced this issue Apr 11, 2022
Motivation

The RequestBag intermediates between two different threads. This means
it can get requests that were reasonable when they were made but have
since been superseded by newer information. These generally have to be
tolerated.

Unfortunately, if we receive a request to resume the request body stream
_after_ the need for that stream has been invalidated, we can hit a
crash. That's unnecessary, and we should tolerate it better.

Modifications

Tolerated receiving requests to resume body streaming in the finished
state.

Result

Fewer crashes
Fixes #576
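
In code terms, the change described above has roughly this shape (a simplified sketch, not the actual diff from #577; the enum and case names are made up and the real state machine is more involved):

```swift
// Simplified sketch of the shape of the fix.
enum ResumeAction { case startWriting, none }
enum State { case queued, executing, finished }

func resumeRequestBodyStream(state: inout State) -> ResumeAction {
    switch state {
    case .queued, .executing:
        return .startWriting
    case .finished:
        // Previously this arm hit a precondition failure. The request has
        // already been failed or cancelled, so a late resume from the
        // channel is now simply ignored.
        return .none
    }
}
```
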
rausnitz (Author) commented

I wanted to provide a bit more info in case it's of interest to you. During the period when these crashes were occurring, we had a prolonged increase in CPU utilization, occasionally spiking to 100%. (That CPU increase seems to be directly related to the onboarding of a new customer that required our servers to process a higher volume of data than they typically do.)

Maybe there's a relationship between maxing out our CPU and the unexpected behavior reported in this issue. Not sure what the mechanism would be, just a hunch.

Lukasa (Collaborator) commented Apr 12, 2022

The high CPU usage likely correlates with increased queue depth and contention on the event loop threads, which widens any timing window that involves scheduling on those threads.
