Skip to content

Fix hanging when starting over 512 subgraphs #2354

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 7, 2021

Conversation

leoyvens
Copy link
Collaborator

@leoyvens leoyvens commented Apr 7, 2021

This issue was introduced when upgrading to tokio 1.0, the #[tokio::main] syntax changed and the number of blocking threads was inadvertently lowered from 2000 to the default 512. I was able to reproduce the issue locally. This fixes the root cause which is the incorrect use of the blocking thread pool. At least in the critical locations, we still use it incorrectly in others, #905 tracks getting this right in all places.

There are at least two ways we where using the blocking thread pool incorrectly:

  • The code path to start a subgraph had a few nested calls to spawn_blocking. This leads to a "double dip" deadlock, the nested calls are waiting on the callers to release a blocking threads, but the callers are of course waiting on the nested call to complete.
  • We were using spawn_blocking for the task that calls run_subgraph. Tasks that run indeterminately should not be put on the blocking thread pool, because that could exhaust the available threads. Ideally we'd make it so that run_subgraph doesn't do blocking calls, but the quick fix is to use an OS thread. I added graph::spawn_thread to make this convenient.

@leoyvens leoyvens requested a review from lutter April 7, 2021 16:02
@@ -355,7 +352,21 @@ where
&network,
&required_capabilities, e))?.clone();

store.start_subgraph_deployment(&logger, &manifest.id)?;
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ok. In this commit I had to move start_subgraph_deployment to the same thread that does run_subgraph since, with copying, start_subgraph_deployment can take hours and would otherwise block the instance manager. (There's nothing to do for this PR, just an FYI)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thing it is in a blocking task then! Ideally it would use something like with_conn but it didn't seem so straightforward to refactor it, so that's left for another day.

@leoyvens leoyvens merged commit 67d742d into master Apr 7, 2021
@leoyvens leoyvens deleted the leo/fix-use-of-spawn-blocking branch April 7, 2021 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants