Description
I can't say yet whether this is a Cantaloupe-specific issue or a local configuration issue, but @jcoyne encouraged me to share what we've found here.
Our production Cantaloupe servers were sometimes entering a state in which they would respond to health checks and template routes (e.g., /iiif/2) but would hang indefinitely when IIIF requests were made.
Discovery
We finally caught this happening live and the (apparent) timeline looks like:
- Kakadu crashes, taking Cantaloupe with it
- Seeing the failure, systemd restarts Cantaloupe
- Our load balancer hasn't had time to recognize the failure so our reverse proxy is feeding Cantaloupe a flood of requests the moment it comes back up
- Within seconds, Cantaloupe has entered the degraded state. Looking at the output of jstack, we see all the qtp threads are in this state or similar:
"qtp1177067563-103" #103 [1610475] prio=5 os_prio=0 cpu=246.17ms elapsed=3329.79s tid=0x00007c44e0009970 nid=1610475 in Object.wait() [0x00007c462d0fc000]
java.lang.Thread.State: RUNNABLE
at org.apache.jena.rdf.model.ModelFactory.createDefaultModel(ModelFactory.java:91)
- waiting on the Class initialization monitor for org.apache.jena.rdf.model.impl.ModelCom
at edu.illinois.library.cantaloupe.image.Metadata.loadXMP(Metadata.java:232)
at edu.illinois.library.cantaloupe.image.Metadata.getXMPModel(Metadata.java:202)
at edu.illinois.library.cantaloupe.image.Metadata.readOrientationFromXMP(Metadata.java:160)
Full log: deadlock-jstack.log
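For readers unfamiliar with the "Class initialization monitor" lines in the trace above, here is a minimal, generic JVM sketch of how threads can end up waiting forever on class initialization. It is not Cantaloupe or Jena code, and it does not establish that this is what happens inside Jena; the class names (ClassInitDeadlockSketch, A, B) are hypothetical. Two classes reference each other from their static initializers, and two threads start initializing them at the same time, so each thread waits for the class the other thread is still initializing.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only: a class-initialization cycle hit from two threads.
class ClassInitDeadlockSketch {
    // Makes the race deterministic: neither initializer proceeds until both have started.
    static final CountDownLatch BOTH_STARTED = new CountDownLatch(2);

    static void rendezvous() {
        BOTH_STARTED.countDown();
        try {
            BOTH_STARTED.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static class A {
        static final int VALUE;
        static {
            rendezvous();
            VALUE = B.VALUE + 1; // triggers initialization of B
        }
    }

    static class B {
        static final int VALUE;
        static {
            rendezvous();
            VALUE = A.VALUE + 1; // triggers initialization of A
        }
    }

    /** Returns true if both initializing threads are still stuck after the timeout. */
    static boolean deadlocks(long timeoutMillis) throws InterruptedException {
        Thread t1 = new Thread(() -> { int unused = A.VALUE; }, "init-A");
        Thread t2 = new Thread(() -> { int unused = B.VALUE; }, "init-B");
        // Daemon threads so the JVM can still exit even though they never finish.
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        t1.join(timeoutMillis);
        t2.join(timeoutMillis);
        return t1.isAlive() && t2.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("stuck in class initialization: " + deadlocks(500));
    }
}
```

Any further thread that touches A or B after this point would block as well, which is one way a whole request thread pool could pile up behind a single class that never finishes initializing.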
The many "waiting on the Class initialization monitor for org.apache.jena.rdf.model.impl.ModelCom" entries look like a deadlock to me. We recreated this situation in our stage environment by:
- Using ab (Apache Benchmark) to send a constant stream of info.json requests at a rate consistent with a heavy burst of traffic
- Restarting Cantaloupe
When we did this, Cantaloupe reliably entered the degraded state, with similar jstack output.
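If ab is not available, the load side of the reproduction can be roughly approximated in Java. This is a sketch under assumptions, not the exact test we ran: the class name InfoJsonBurst, the target URL passed on the command line, the request count, and the concurrency are all hypothetical placeholders, and it fires a fixed-size burst rather than a constant stream. The request timeout is there so that a hung server shows up as failed requests instead of a hung client.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical load generator: sends a burst of GET requests at one URL.
class InfoJsonBurst {
    /** Sends `total` GETs using `concurrency` worker threads; returns how many returned HTTP 200. */
    static int fire(URI target, int total, int concurrency) throws InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder(target)
                .timeout(Duration.ofSeconds(10)) // give up on requests the server never answers
                .GET()
                .build();
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < total; i++) {
                results.add(pool.submit(() ->
                        client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode()));
            }
            int ok = 0;
            for (Future<Integer> result : results) {
                try {
                    if (result.get() == 200) {
                        ok++;
                    }
                } catch (ExecutionException e) {
                    // Timeout or connection failure: counted as "not OK".
                }
            }
            return ok;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // args[0] is the info.json URL of an image on your own server (placeholder).
        URI target = URI.create(args[0]);
        int total = 1000;      // assumed burst size
        int concurrency = 50;  // assumed number of parallel clients
        int ok = fire(target, total, concurrency);
        System.out.println(ok + " of " + total + " requests returned 200");
    }
}
```

Starting this against a stage server and then restarting Cantaloupe while it runs mirrors the two steps above; a count well below the total, combined with qtp threads stuck in jstack, would suggest the same degraded state.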
Remediation
Increasing http.min_threads from 8 to 16 has so far been sufficient to make Cantaloupe resilient to this in our particular situation.