Spawn websocket connections in the core pool #1522
Conversation
```rust
Panic(String),
EventStreamError,
```
I wonder if these aren't too much of an implementation detail; then again... why not! They might help us debug things in the future.
Force-pushed from 6c51603 to 4df8c61.
```rust
let result = graph::spawn_blocking_allow_panic(async move {
    execute_selection_set(&ctx, &selection_set, &subscription_type, &None)
})
.await
```
This should help, since the number of blocking threads we need is no longer dependent on the number of subscriptions, but we can still exhaust the blocking pool because of the thundering-herd behavior of subscriptions. If the individual queries for each subscription take a while to run (say 1s, which is not that hard to cause), the first few subscriptions through here will exhaust the connection pool. Subsequent subscriptions will then wait for those queries to finish, eventually filling up the blocking pool. We should gate spawning the blocking thread on a semaphore that is sized so that we do not exhaust the connection pool (say, one that allows 75% of the connection pool through).
What we should really do is change the store to move the actual work to the blocking pool itself, by giving it some internal function Store.with_connection(f: Fn(Connection) -> Result) which first acquires a semaphore sized according to the max number of connections in the pool, and then executes f on the blocking pool. We'd then change all Store methods that right now just get a connection to use with_connection and do their work inside that.
But that's too much work for this fix; that's why I suggested a semaphore here, with a guess at how big it should be.
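The with_connection idea could be sketched roughly like this. Note this is a std-only illustration under stated assumptions: the Store shape and the hand-rolled Semaphore are stand-ins, and the real implementation would use an async semaphore and run f on tokio's blocking pool with an actual Connection.

```rust
use std::sync::{Condvar, Mutex};

/// Minimal counting semaphore; async code would use tokio's `Semaphore` instead.
pub struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    pub fn new(n: usize) -> Self {
        Semaphore { permits: Mutex::new(n), cv: Condvar::new() }
    }

    pub fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }

    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Toy stand-in for the store; only the gating pattern matters here.
pub struct Store {
    // Sized to the connection pool, so callers can never exhaust it.
    gate: Semaphore,
}

impl Store {
    pub fn new(pool_size: usize) -> Self {
        Store { gate: Semaphore::new(pool_size) }
    }

    /// Acquire a permit bounded by the pool size, then run `f`; in the real
    /// store, `f` would receive a `Connection` and run on the blocking pool.
    pub fn with_connection<T>(&self, f: impl FnOnce() -> T) -> T {
        self.gate.acquire();
        let result = f();
        self.gate.release();
        result
    }
}

fn main() {
    let store = Store::new(2);
    let answer = store.with_connection(|| 40 + 2);
    println!("{}", answer); // prints 42
}
```

The point of the design is that every Store method that currently grabs a connection directly would instead go through with_connection, so one semaphore guards all connection use in a single place.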
It does make sense to throttle subscription queries so they don't delay normal queries. Ideally we'd have query nodes dedicated only to subscriptions. I'll put an async semaphore here as you suggest.
lutter left a comment:
This looks great! Thanks for adding the semaphore.
```rust
static ref SUBSCRIPTION_QUERY_SEMAPHORE: Semaphore = {
    // This is duplicating the logic in main.rs to get the connection pool size, which is
    // unfortunate. But because this module has no shared state otherwise, it's not simple to
    // refactor so that the semaphore isn't a global.
```
And the right place for this semaphore would be internal to the Store anyway, so we guard any attempt to get a connection with it, but this is totally fine for now.
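The sizing suggested in review could look something like this. The helper name and the exact rounding are assumptions for illustration, not graph-node's actual code:

```rust
/// Allow roughly 75% of the connection pool through for subscription
/// queries, leaving headroom for normal queries. Hypothetical helper.
fn subscription_permits(connection_pool_size: usize) -> usize {
    // Never round down to zero, or subscription queries would stall forever.
    std::cmp::max(1, connection_pool_size * 3 / 4)
}

fn main() {
    println!("{}", subscription_permits(20)); // prints 15
    println!("{}", subscription_permits(1)); // prints 1
}
```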
Added a comment about the 'right' way to do this to #905
Thanks for registering this in the issue, merging the PR.
Otherwise they would indefinitely take up a thread in the blocking pool, causing that pool to run out of threads under high load and freezing up the node. Now only the execution of the selection set, which is the actual blocking part, is spawned as blocking. Lesson learned: don't put long-running tasks in the blocking pool.
Also bumped the unresponsiveness timeout from 10s to 100s, since we don't expect it to trigger anymore and we might want to avoid crashing the node in case of temporary unresponsiveness for some other reason.
Tested locally that the issue no longer reproduces.
Most of the diff is from dropping the lifetime from Resolver; it needs to be 'static now so it can be put in a tokio task.
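The 'static requirement can be seen with std's thread::spawn, which imposes the same bound as tokio::spawn. This Resolver is a toy stand-in for illustration, not the trait from the diff:

```rust
use std::thread;

// Toy stand-in: it owns its data, so it is 'static and can be moved into a
// spawned task. A Resolver<'a> borrowing from the caller's stack would be
// rejected by the same bound that tokio::spawn imposes on its future.
struct Resolver {
    label: String,
}

impl Resolver {
    fn resolve(&self) -> String {
        format!("resolved by {}", self.label)
    }
}

fn main() {
    let resolver = Resolver { label: String::from("store") };
    // `move` transfers ownership into the task; nothing is borrowed, so the
    // closure satisfies the 'static bound.
    let handle = thread::spawn(move || resolver.resolve());
    let out = handle.join().unwrap();
    println!("{}", out); // prints "resolved by store"
}
```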