Spawn websocket connections in the core pool #1522


Merged: 2 commits into master from leo/spawn-ws-in-non-blocking-pool on Mar 4, 2020

Conversation

@leoyvens (Collaborator) commented Mar 3, 2020:

Otherwise they would indefinitely take up a thread in the blocking pool, causing that pool to run out of threads under high load and freeze the node. Now only the execution of the selection set, which is the part that actually blocks, is spawned as blocking. Lesson learned: don't put long-running tasks in the blocking pool.
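
As a rough illustration of the pattern (a minimal sketch with hypothetical StoreEvent/QueryResult types and an mpsc channel, not the actual graph-node code): the long-lived subscription loop stays on the async core pool, and only the per-event query execution hops onto the blocking pool.

use tokio::sync::mpsc;

// Hypothetical stand-ins for the real event and result types.
struct StoreEvent;
struct QueryResult;

fn execute_selection_set(_event: StoreEvent) -> QueryResult {
    // Imagine synchronous, potentially slow query work here.
    QueryResult
}

async fn run_subscription(mut events: mpsc::Receiver<StoreEvent>) {
    while let Some(event) = events.recv().await {
        // Only this call occupies a blocking-pool thread, and only for the
        // duration of one query, not for the lifetime of the connection.
        let _result = tokio::task::spawn_blocking(move || execute_selection_set(event))
            .await
            .expect("query task panicked");
        // ...deliver the result over the websocket...
    }
}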

Also bumped the unresponsive timeout from 10s to 100s, since we don't expect it to trigger anymore and we'd rather not crash the node over temporary unresponsiveness from some other cause.

Tested locally that the issue no longer reproduces.

Most of the diff is from dropping the lifetime from Resolver; it needs to be 'static now so it can be put in a tokio task.
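
As a tiny illustration of that constraint (hypothetical Resolver and Store types, not the real trait), tokio::spawn only accepts tasks whose captures are 'static, so a resolver that borrows from its caller cannot be moved into one:

use std::sync::Arc;

struct Store; // stand-in for shared, owned state

struct Resolver {
    store: Arc<Store>, // owned/shared state instead of a borrowed &'a Store
}

fn spawn_query(resolver: Resolver) {
    tokio::spawn(async move {
        // Compiles only because `resolver` owns everything it captures;
        // a Resolver<'a> holding a &'a Store would be rejected here.
        let _store = Arc::clone(&resolver.store);
    });
}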

@leoyvens requested review from Jannis and lutter on March 3, 2020 20:05
Comment on lines +60 to +61
Panic(String),
EventStreamError,
Contributor:

I wonder if these aren't too much of an implementation detail; then again... why not! They might help us debug things in the future.

@leoyvens force-pushed the leo/spawn-ws-in-non-blocking-pool branch from 6c51603 to 4df8c61 on March 3, 2020 20:55
let result = graph::spawn_blocking_allow_panic(async move {
    execute_selection_set(&ctx, &selection_set, &subscription_type, &None)
})
.await
Collaborator:

This should help, since the number of blocking threads we need no longer depends on the number of subscriptions, but we can still exhaust the blocking pool because of the thundering-herd behavior of subscriptions. If the individual queries for each subscription take a while to run (say 1s, which is not hard to cause), the first few subscriptions through here will exhaust the connection pool, which will cause the following subscriptions to wait for those queries to finish, eventually filling up the blocking pool. We should gate spawning the blocking thread on a semaphore sized so that we do not exhaust the connection pool (say, one that lets 75% of the connection pool through).
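
A rough sketch of the gating being suggested here, assuming tokio's Semaphore; the function name is illustrative and the ~75% sizing is the guess mentioned above:

use std::sync::Arc;
use tokio::sync::Semaphore;

async fn gated_blocking_query(semaphore: Arc<Semaphore>) {
    // Wait asynchronously (without occupying a blocking thread) until a
    // permit is free; sizing the semaphore at ~75% of the connection pool
    // keeps subscription queries from exhausting the pool.
    let permit = semaphore.acquire_owned().await.expect("semaphore closed");
    tokio::task::spawn_blocking(move || {
        let _permit = permit; // hold the permit for the duration of the query
        // ...run the selection set / database query here...
    })
    .await
    .expect("query task panicked");
}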

@lutter (Collaborator), Mar 3, 2020:

What we should really do is change the store to move the actual work to the blocking pool itself, by having some internal function Store.with_connection(f: Fn(Connection) -> Result) which first acquires a semaphore sized according to the maximum number of connections in the pool and then executes f on the blocking pool. We'd then change all Store methods that right now just get a connection to use with_connection and do their work inside it.

But that's too much work for this fix; that's why I suggested a semaphore here with a guess at how big it should be.
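
For reference, a sketch of what such a Store.with_connection could look like (hypothetical types and error handling; not something this PR implements):

use std::sync::Arc;
use tokio::sync::Semaphore;

struct Connection; // stand-in for a pooled database connection

struct Store {
    // Sized to the maximum number of connections in the pool.
    connection_semaphore: Arc<Semaphore>,
}

impl Store {
    fn get_conn(&self) -> Connection {
        Connection // in reality: check a connection out of the pool
    }

    async fn with_connection<T, F>(&self, f: F) -> Result<T, String>
    where
        F: FnOnce(Connection) -> Result<T, String> + Send + 'static,
        T: Send + 'static,
    {
        // Queue here until a permit is free, so callers wait asynchronously
        // instead of piling up on blocking-pool threads.
        let _permit = self
            .connection_semaphore
            .clone()
            .acquire_owned()
            .await
            .map_err(|e| e.to_string())?;
        let conn = self.get_conn();
        // The permit stays held until the blocking work has finished.
        tokio::task::spawn_blocking(move || f(conn))
            .await
            .map_err(|e| e.to_string())?
    }
}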

@leoyvens (Author):

It does make sense to throttle subscription queries so they don't delay normal queries. Ideally we'd have query nodes dedicated only to subscriptions. I'll put an async semaphore here as you suggest.

@leoyvens (Author):

Done.

@lutter (Collaborator) left a comment:

This looks great! Thanks for adding the semaphore.

static ref SUBSCRIPTION_QUERY_SEMAPHORE: Semaphore = {
    // This is duplicating the logic in main.rs to get the connection pool size, which is
    // unfortunate. But because this module otherwise has no shared state, it's not simple to
    // refactor so that the semaphore isn't a global.
Collaborator:

And the right place for this semaphore would be internal to the Store anyway, so we guard any attempt to get a connection with it, but this is totally fine for now.

Collaborator:

Added a comment about the 'right' way to do this to #905

@leoyvens (Author):

Thanks for registering this in the issue, merging the PR.

@leoyvens merged commit 456ce87 into master on Mar 4, 2020
@leoyvens deleted the leo/spawn-ws-in-non-blocking-pool branch on March 4, 2020 19:22