Potential deadlock situation when Cantaloupe is restarted under heavy load #873

@taylor-steve

I can't say yet whether this is a Cantaloupe-specific issue or a local configuration issue, but @jcoyne encouraged me to share what we've found here.

Description

Our production Cantaloupe servers were sometimes entering a state in which they would respond to health checks and template routes (e.g., /iiif/2) but would hang indefinitely when IIIF requests were made.

Discovery

We finally caught this happening live, and the (apparent) timeline looks like this:

  • Kakadu crashes, taking Cantaloupe with it
  • Seeing the failure, systemd restarts Cantaloupe
  • Our load balancer hasn't had time to recognize the failure so our reverse proxy is feeding Cantaloupe a flood of requests the moment it comes back up
  • Within seconds, Cantaloupe has entered the degraded state. Looking at the output of jstack we see all the qtp threads are in this state or similar:
"qtp1177067563-103" #103 [1610475] prio=5 os_prio=0 cpu=246.17ms elapsed=3329.79s tid=0x00007c44e0009970 nid=1610475 in Object.wait()  [0x00007c462d0fc000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.jena.rdf.model.ModelFactory.createDefaultModel(ModelFactory.java:91)
	- waiting on the Class initialization monitor for org.apache.jena.rdf.model.impl.ModelCom
	at edu.illinois.library.cantaloupe.image.Metadata.loadXMP(Metadata.java:232)
	at edu.illinois.library.cantaloupe.image.Metadata.getXMPModel(Metadata.java:202)
	at edu.illinois.library.cantaloupe.image.Metadata.readOrientationFromXMP(Metadata.java:160)

Full log: deadlock-jstack.log
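
For anyone wanting to check their own dumps for the same symptom, here is a minimal sketch that counts the threads in a saved jstack dump that are parked on a given class-initialization monitor. The class name ClassInitWaitScan and its methods are hypothetical helpers written for this illustration, not part of Cantaloupe or the JDK. It assumes the standard jstack layout, where each thread header line starts with the quoted thread name.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: scans a jstack dump for class-initialization waits.
public class ClassInitWaitScan {
    static final String MARKER = "- waiting on the Class initialization monitor for ";

    /** Returns the names of threads whose stack shows a class-init wait on className. */
    static List<String> threadsWaitingOn(String dump, String className) {
        List<String> waiting = new ArrayList<>();
        String current = null; // name of the thread whose frames we are reading
        for (String line : dump.split("\\R")) {
            if (line.startsWith("\"")) {
                // Thread header, e.g. "qtp1177067563-103" #103 ...
                int end = line.indexOf('"', 1);
                current = end > 0 ? line.substring(1, end) : null;
            } else if (current != null && line.trim().equals(MARKER + className)) {
                waiting.add(current);
                current = null; // count each thread at most once
            }
        }
        return waiting;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ClassInitWaitScan deadlock-jstack.log
        String dump = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        List<String> stuck = threadsWaitingOn(dump, "org.apache.jena.rdf.model.impl.ModelCom");
        long qtp = stuck.stream().filter(n -> n.startsWith("qtp")).count();
        System.out.println(stuck.size() + " thread(s) waiting on ModelCom init, " + qtp + " of them qtp threads");
    }
}
```

If the number of waiting qtp threads equals the size of the Jetty pool, no worker is left to serve IIIF requests, which would be consistent with the hang described above.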

With every qtp thread showing a "waiting on the Class initialization monitor for org.apache.jena.rdf.model.impl.ModelCom" entry, this looks like a deadlock to me. We recreated this situation in our stage environment by:

  • Using ab (ApacheBench) to send a constant stream of info.json requests at a rate consistent with a heavy burst of traffic
  • Restarting Cantaloupe

Doing this, Cantaloupe would enter the degraded state reliably, with similar jstack output.

Remediation

Increasing http.min_threads from 8 to 16 has so far been sufficient to make Cantaloupe resilient to this in our particular situation.
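
A possible complementary mitigation, untested and offered only as a sketch: if the hang really is rooted in many threads racing through first-time class initialization (an assumption; the root cause is not confirmed here), forcing the class from the thread dump to initialize on a single thread before the server accepts requests might sidestep the race. EagerClassInit is a hypothetical helper, and where it would be called during Cantaloupe startup is also an assumption. Class.forName with initialize=true is the standard JDK call.

```java
// Hypothetical helper: initializes classes eagerly on the calling thread.
public class EagerClassInit {

    /**
     * Loads and initializes each named class on the calling thread.
     * Returns how many were initialized successfully; failures are logged and skipped.
     */
    static int initialize(ClassLoader loader, String... classNames) {
        int ok = 0;
        for (String name : classNames) {
            try {
                // 'true' runs the class's static initializers now, on this thread.
                Class.forName(name, true, loader);
                ok++;
            } catch (ClassNotFoundException | LinkageError e) {
                System.err.println("Could not pre-initialize " + name + ": " + e);
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        // Class name taken from the thread dump above; requires Jena on the classpath.
        int n = initialize(EagerClassInit.class.getClassLoader(),
                "org.apache.jena.rdf.model.impl.ModelCom");
        System.out.println(n + " class(es) initialized before serving requests");
    }
}
```

This would not replace the http.min_threads change, and it only helps if single-threaded initialization of that class completes cleanly.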
