InMemoryWebSessionStore indirectly causing infinite loop inside tomcat-native OpenSSL under load #26407
I just found spring-projects/spring-security#9200, which also discusses threading concerns when using …
The use of … In other words, we have to get off Tomcat's thread and there is no way to get back on it that I can see, so one way or another this should probably be looked at in Tomcat, assuming the loop is in Tomcat code. At best we could switch back to a thread marked as "non-blocking" via …
Yeah, it's a tricky situation since the JVM doesn't have a non-blocking way to retrieve random data. I feel like the infinite loop is a pretty serious problem, since Spring Security + WebFlux + Tomcat + OpenSSL is a fairly common (I think?) combination, and it easily occurs "out-of-the-box". I agree that we should probably get the tomcat folks to take a look. I just found https://bz.apache.org/bugzilla/show_bug.cgi?id=64771, which appears to be the same problem. Unfortunately it is currently closed due to inactivity. I'll comment there.
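For context on why this work changes threads at all, here is a minimal sketch of the pattern being discussed (an illustration, not the actual InMemoryWebSessionStore source): the session id generation is backed by SecureRandom, which can block, so it is shifted onto Reactor's boundedElastic scheduler, and everything downstream of it then continues on that worker.

```java
import java.util.UUID;

import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class SessionIdSketch {

    // UUID.randomUUID() reads from SecureRandom, which may block waiting for entropy,
    // so the call is moved off the Servlet container thread onto boundedElastic.
    static Mono<String> createSessionId() {
        return Mono.fromCallable(() -> UUID.randomUUID().toString())
                .subscribeOn(Schedulers.boundedElastic());
    }

    public static void main(String[] args) {
        // Everything downstream now continues on the boundedElastic worker unless a
        // publishOn(...) switches threads again, which is how request I/O (and the
        // OpenSSL calls underneath it) can end up on such a thread under load.
        createSessionId()
                .doOnNext(id -> System.out.println(Thread.currentThread().getName() + " -> " + id))
                .block();
    }
}
```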
I'm unable to recreate this issue with the provided sample application. I'm running the app locally rather than via Docker, using Tomcat Native 1.1.26 rather than 1.1.25 (which should not impact this) and my local development build of OpenSSL 3.0.x. Load is running at > 6k req/s.
A code review strongly suggests that the root cause is the non-container thread attempting to read from the socket after the container has completed the request and recycled the various objects. This appears to be consistent with the error message in my previous comment.
Probably an outside chance, but the root cause might be a bug in Tomcat's async error handling that has been fixed since 9.0.41. Since I can't repeat the failure with this test case, it might be worth trying to rebuild the test case with a 9.0.x snapshot from https://repository.apache.org/content/groups/snapshots/
Thanks for investigating @markt-asf! I'll take a look at the snapshot, and I'll experiment more to try to narrow down how best to reproduce. As you can imagine, I was experimenting with lots of different variables, so there might be some cases where it reproduces more easily. I'll get back with my results.
I was able to reproduce this more easily with JSON payloads, whereas the original app used text/plain payloads. I don't think it is anything specific to JSON, but perhaps the serialization/deserialization adds subtle timing differences or something. Anyway, here's an updated project that handles JSON. My load tests used the following request (note that any basic auth creds will work):
I haven't tried the latest tomcat snapshots yet. Will try that next.
Thanks but no success so far here. How long does the test need to run before you see the issue?
Usually reproduces within a few mins.
I was able to reproduce the infinite loop with the latest tomcat 9.0-SNAPSHOT (reported as 9.0.42-dev on startup); see the attached tomcat-ssl-snaphot.zip. In addition, in each run, the app would eventually stop responding to requests during the load test. On some runs, the exception below was reported. However, this exception is not reported during every run, even though the app stopped responding on every run. I did not encounter the exception, or the app stopping responding, with 9.0.41.
Ideas on where to go from here? Can you try using the docker image I provided to ensure we're testing the exact same app? Any additional debugging I can do or data I can provide on my side?
I'm unable to reproduce this after 20 mins. That stack trace is consistent with not following section 2.3.3.4 of the Servlet specification. Given the simplicity of the application, this looks like a library issue at the moment rather than a container issue.
Thanks for the details @markt-asf. This is a WebFlux application where the Servlet calls … As you said, it is a basic scenario. Based on your analysis from Tomcat and mine at the framework level, I suspect a race between the request completing (perhaps for some exceptional reason) and an attempt to read the body as part of handling.

@philsttr it could help to know what happens to the request (i.e. why is it complete when we are trying to read the body), perhaps through …

From the stacktrace in … In other words, say the application is about to read from the request body and wants to call …
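One possible way to capture that information is a diagnostic Servlet filter that registers an extra AsyncListener once async processing has started. This is a hypothetical sketch; AsyncDiagnosticsFilter is not part of the sample app:

```java
import java.io.IOException;

import javax.servlet.AsyncEvent;
import javax.servlet.AsyncListener;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Hypothetical diagnostic filter: after the WebFlux adapter servlet has started async
// processing, register an extra AsyncListener so the logs show which terminal event
// (complete / timeout / error) ended the request, and why.
public class AsyncDiagnosticsFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        chain.doFilter(req, res);
        if (req.isAsyncStarted()) {
            String uri = ((HttpServletRequest) req).getRequestURI();
            req.getAsyncContext().addListener(new AsyncListener() {
                @Override
                public void onComplete(AsyncEvent event) {
                    System.out.println(uri + " onComplete");
                }
                @Override
                public void onTimeout(AsyncEvent event) {
                    System.out.println(uri + " onTimeout");
                }
                @Override
                public void onError(AsyncEvent event) {
                    System.out.println(uri + " onError: " + event.getThrowable());
                }
                @Override
                public void onStartAsync(AsyncEvent event) {
                    // re-register here if async processing is restarted
                }
            });
        }
    }
}
```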
@philsttr note a recent change, #26434, in this area of the code. It probably won't solve it but it is worth checking. Are you able to reproduce with more logging, e.g. …

@markt-asf I'm still interested in your feedback. In Servlet 3.1 non-blocking I/O, how can an application or framework guarantee that there are no residual calls to …
If …
@rstoyanchev I'll retry the test with more logging and the latest snapshots next week. FWIW, I do recall the test client receiving some errors. I'll gather/provide more info on them after the next test. Sorry for the delay... pulled off on other things. You know how it goes.
@markt-asf the sequence I'm thinking about, and where I suspect a race, is where the request ends just before the call to …
That should not be possible. The container only calls …
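For reference, the contract being described is the standard Servlet 3.1 non-blocking read loop; a simplified sketch (not WebFlux's actual bridge code) looks like this:

```java
import java.io.IOException;

import javax.servlet.AsyncContext;
import javax.servlet.ReadListener;
import javax.servlet.ServletInputStream;

// Simplified sketch of the Servlet 3.1 read contract: keep reading while isReady() is
// true; once isReady() returns false, stop and return. The container calls
// onDataAvailable() again when more data arrives, and onAllDataRead() at the end.
class SketchReadListener implements ReadListener {

    private final ServletInputStream input;
    private final AsyncContext asyncContext;
    private final byte[] buffer = new byte[4096];

    SketchReadListener(ServletInputStream input, AsyncContext asyncContext) {
        this.input = input;
        this.asyncContext = asyncContext;
    }

    @Override
    public void onDataAvailable() throws IOException {
        while (input.isReady() && !input.isFinished()) {
            int n = input.read(buffer);
            if (n > 0) {
                // hand the bytes to the application here
            }
        }
        // Returning with isReady() == false is what arms the next onDataAvailable() callback.
    }

    @Override
    public void onAllDataRead() {
        asyncContext.complete();
    }

    @Override
    public void onError(Throwable t) {
        asyncContext.complete();
    }
}
```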
Okay, so in other words, if the application does not attempt any I/O for a period of time, the container will not independently detect or notify of a lost connection? One more question: what if an I/O error occurs while attempting to read or write from a non-container thread? In that case the exception will not propagate up to the container.
Certainly after … In the case of a non-container thread, the container is aware of the error because it will have thrown it (from the InputStream or similar). The assumption is that the non-container thread exits ASAP after the exception (without using the request, response, etc.) and the …
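A rough sketch of that expectation, with hypothetical names, and under the assumption that the non-container thread simply stops and completes the async cycle after the failure:

```java
import java.io.IOException;

import javax.servlet.AsyncContext;
import javax.servlet.ServletOutputStream;

// Hedged sketch (hypothetical names): a write attempted from a non-container thread.
// If the stream throws, the thread backs out immediately without touching the request
// or response again, and hands the async cycle back to the container.
class NonContainerWriteTask implements Runnable {

    private final ServletOutputStream output;
    private final AsyncContext asyncContext;
    private final byte[] payload;

    NonContainerWriteTask(ServletOutputStream output, AsyncContext asyncContext, byte[] payload) {
        this.output = output;
        this.asyncContext = asyncContext;
        this.payload = payload;
    }

    @Override
    public void run() {
        try {
            if (output.isReady()) {
                output.write(payload);
            }
        }
        catch (IOException ex) {
            // The container already saw this failure (it is the one that threw); the thread
            // stops here and completes, so the container can run its AsyncListener callbacks
            // on one of its own threads.
            asyncContext.complete();
        }
    }
}
```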
@rstoyanchev I didn't get a chance this week to re-run the tests. My apologies. I haven't forgotten. Hopefully next week will be better.
No problem. Same here; I was planning to have another look, so I'm not waiting on you.
@philsttr one suggestion, you can enable detailed logging in the Servlet - Reactive Streams bridge:
Successful Request Sample Output
Each log message has a unique id, so as long as you can identify one, you can grep for all of them and track what happened with the request. That way, if you capture the scenario, we can see what happens at that level.

@markt-asf, from the stack trace in the original comment, … What I can say for the scenario in general, and you'll see this in the log output (for a successful request) that I pasted, is that we start on a Tomcat thread where we call …
In short, in this case we're likely to start reading on an application thread, if the first …
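To make the thread hop concrete, here is a small standalone Reactor example (a toy analogy, not the actual Servlet bridge): because of subscribeOn, the demand signals that drive each simulated read arrive on a boundedElastic worker, so that is where the read attempt runs.

```java
import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;

public class DemandThreadDemo {

    public static void main(String[] args) {
        Flux.<String>generate(sink -> {
                    // In the real bridge this is roughly where isReady()/read() against the
                    // ServletInputStream would happen; here we just record the calling thread.
                    System.out.println("read attempted on " + Thread.currentThread().getName());
                    sink.next("chunk");
                })
                .take(3)
                // Subscription, and therefore the demand that drives the reads, happens on a
                // boundedElastic worker, mirroring the session-creation hop in this issue.
                .subscribeOn(Schedulers.boundedElastic())
                .blockLast();
    }
}
```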
We fixed a Tomcat bug yesterday that might be contributing to (or possibly causing) this. The …
This commit ensures handling is cancelled in case of onError/Timeout callback from the Servlet container. Separately we detect the same in ServletServerHttpRequest and ServletServerHttpResponse, which signal onError to the read publisher and cancel writing, but if the onError/Timeout arrives after reading is done and before writing has started (e.g. longer handling), then neither will reach handling. See gh-26434, gh-26407
That's awesome news! Yes, I should be able to get back to this soon. Again, sorry for the delay.
I was finally able to get some time today to work on this. As a baseline, I first confirmed that I could reproduce what I originally reported using the original spring-boot/spring/tomcat versions. Then, I upgraded the sample app that I included earlier to:
After upgrading, I was not able to reproduce the infinite loop. However, during the load tests, some new exceptions appeared on the server log, which seem related...
and
Full stacks in tomcat-ssl-snapshot-log.txt. My next step was to enable the additional logging requested by @rstoyanchev.
The ISE appears to be from the defensive code added as part of https://bz.apache.org/bugzilla/show_bug.cgi?id=64771, so it looks like the issue is still present, but at least it doesn't cause a loop.

@markt-asf what happens if there is a connection issue very early? Let's say a …

For the IAE, in the trace log I search for the last log message on the same thread …
The output shows a request that has read its input and written its output without any issues. I don't actually see the completion message, for which logging for … I'm not 100% sure the exception is for the exact same request. I'm basically assuming that when the IAE happened it was a continuation of request processing on the same thread, and the log messages are only a millisecond apart, but nevertheless there is a chance, I suppose, that they are unrelated. It would be a very useful feature if Tomcat exposed a request id of some sort that it also uses for logging and exceptions, and that an application can also use, so that log messages can be correlated more reliably, especially in such async scenarios.
I introduced an async log appender, and was able to reproduce the … I also enabled logging of …
Great. This is what I extract from … for the failing request. I've inserted some comments:
In the beginning we are on "unboundedElastic-68", which was probably used to create the WebFlux session id without blocking the Tomcat thread. We've just received a demand signal to start reading, so we check …

This is surprising, because my assumption so far was that somehow there was an attempt to access a recycled request. However, the output here shows the request completing normally, and we even get callbacks from Tomcat along the way. The only catch, again, might be that the ISE is not related to the previous log messages on the same thread, but this seems unlikely because it would have been preceded by other log messages (before the …
Thanks for the additional analysis. I can see a code path that could cause difficulty. It would be helpful if someone who understands the details of WebFlux can confirm whether this is consistent with what is being seen here. On a container thread the following happens: …
If a separate thread calls … I think this is a Tomcat bug, but I want to double-check the language in the Servlet specification on concurrency / thread-safety around async processing, in case there is anything there that would prohibit an application from using the API this way.
Those steps should happen before WebFlux handling. The Servlet does this in its service method:
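As a rough, hypothetical sketch of that ordering (reusing the SketchReadListener from the earlier sketch, and not the actual ServletHttpHandlerAdapter source):

```java
import java.io.IOException;

import javax.servlet.AsyncContext;
import javax.servlet.ServletInputStream;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet illustrating the ordering described above: start async, register
// the listeners, and only then let application (non-container) threads get involved.
@WebServlet(urlPatterns = "/sketch", asyncSupported = true)
public class BridgeOrderingServlet extends HttpServlet {

    @Override
    protected void service(HttpServletRequest request, HttpServletResponse response) throws IOException {
        AsyncContext asyncContext = request.startAsync();
        asyncContext.setTimeout(-1); // the reactive pipeline manages its own timeouts

        // Read (and, in the real adapter, write and async) listeners go in first, before
        // any demand can be signalled from another thread.
        ServletInputStream input = request.getInputStream();
        input.setReadListener(new SketchReadListener(input, asyncContext));

        // Only after this point would handling be handed off to application threads
        // (e.g. a Reactor scheduler).
    }
}
```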
In other words, WebFlux handling starts only after the read and write listeners and the async listener have been registered.
Thanks. I've dug a little further and there are other ways for concurrent calls to isReady() to cause problems. I'm still investigating potential fixes.
I am working on handling concurrent calls to isReady(), but I have a follow-up question. In the stack trace reported by the OP, when the "unboundedElastic-68" thread calls isReady(), is there any code path where that same thread could then try reading from the request? My concern is that if isReady() on that thread returns true at the same time as onDataAvailable() is fired, there could then be two threads trying to read from the request.
The request body is consumed through a …
@rstoyanchev @philsttr …
I haven't checked this one yet, unfortunately. Hope to get to it soon. I noticed that the changes to Tomcat have been released in 9.0.45, so I'll test with that version. I ran into #26834 in a different app.
Good news! With tomcat 9.0.45, spring-boot 2.4.5, spring 5.3.6 I could not reproduce the original infinite loop, or the IllegalArgumentException/IllegalStateException, or any other exception during my load tests. I also confirmed that I was still able to reproduce the original problem with the older versions, to make sure nothing had changed in my test setup. So, that gives me confidence that this problem has been resolved in the latest versions. Thank you so much @rstoyanchev and @markt-asf! It's really impressive when engineers from three organizations can come together and fix what I consider to be a pretty complex problem like this. You two have been so responsive and helpful. I wish I could say the same for my responsiveness, but I hope you can forgive me.
Thanks for confirming the fix. No need to apologise - we all have different priorities to work with.
I echo that; the time taken to report, investigate, and test is much appreciated. Thanks!
I'm marking this as superseded by #26434 since the investigation for both went hand in hand, with the fixes for both released in 5.3.5.
@markt-asf …
That may have been part of it, but there were multiple fixes involved in both Tomcat and WebFlux. I don't have an exact list.
I am having the same issue with high CPU; thread dump attached. The stack is from Spring 4.3.0 (not using WebFlux), running on Tomcat 8.5.39. It looks like an upgrade will help based on your commit, so I am asking to confirm. Thanks for the reply @markt-asf

"https-openssl-nio-443-exec-10" # 126 daemon prio=5 os_prio=0 cpu=67095669.07ms elapsed=846595.29s tid=0x00007f46a991d000 nid=0x226c runnable [0x00007f463419d000]
This is a fun one....

When load testing a new app using...

...CPU utilization will max out, and stay maxed out even after the load test completes.

When investigating, I found that threads in the global boundedElastic Scheduler are consuming the entire CPU, as seen in top (broken out by threads)...

Taking a stack dump of the process reveals the two threads are inside tomcat's OpenSSLEngine... (note that nid=0xd4 correlates to PID 212 above)

In a debug session, I discovered that an infinite loop is executing in tomcat's OpenSSLEngine.unwrap, where: …

I would have expected the OpenSSL I/O code to execute on one of the https-openssl-nio-* threads, not the boundedElastic Scheduler. Therefore, I started investigating why this code was executing on the boundedElastic Scheduler.

After more debugging I narrowed it down to InMemoryWebSessionStore.createWebSession(). This is the only location in this particular app that uses the boundedElastic Scheduler.

The WebSession is being created because Spring Security's WebSessionServerRequestCache is being used, which persists requests in the WebSession.

If I disable the request cache (which removes the usage of WebSession, which removes the call to InMemoryWebSessionStore.createWebSession(), which removes the usage of boundedElastic), then all I/O is performed on the https-openssl-nio-* threads, and the infinite loop does not occur.

I haven't fully investigated why the infinite loop occurs, but I assume there is a thread-safety bug somewhere in tomcat's OpenSSLEngine. (Either that, or it was never intended to be used from multiple threads.) Having said that, I don't think that the I/O should be occurring on the boundedElastic thread, so I did not investigate further.

In other words, in my opinion, using InMemoryWebSessionStore should not cause the OpenSSL I/O to occur on a boundedElastic thread.

I have attached a simple application that can be used to reproduce the problem. After extracting, use docker-compose up to build and start a container with the spring boot app with the above configuration. Sending a lot of load (>= 2000 calls per second) to the /echo endpoint will reproduce the infinite loop. However, you can see OpenSSL I/O occurring on the boundedElastic threads with any amount of load.
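For reference, a hypothetical stand-in for the kind of endpoint the attached sample exposes (the real code is in the zip; this is only illustrative): a trivial /echo handler which, with the WebSession-backed request cache active, also ends up touching InMemoryWebSessionStore.createWebSession() and thus the boundedElastic scheduler.

```java
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

import reactor.core.publisher.Mono;

// Hypothetical stand-in for the sample's /echo endpoint: it simply echoes the request body.
@RestController
public class EchoController {

    @PostMapping("/echo")
    public Mono<String> echo(@RequestBody String body) {
        return Mono.just(body);
    }
}
```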