Distribute the network future polling time more evenly #6903
Conversation
If this does fix the issue of outdated metrics (with the metrics only being updated at the end of the loop), I am fine with merging this.
To me it seems hard to find the right values for the break constants; if this does not have the expected behaviour, I am not in favor of merging.
I'd like to point out that the current code can get stuck in these `loop`s. In other words, there is always going to be some threshold of network traffic over which the worker will be stuck. Reducing the CPU consumption of the worker will raise that threshold, but can't make it disappear unless we do a change similar to what this PR does.
I agree with @romanb that enforcing time bounds is a more intuitive approach to the given problem. I only wonder how expensive a call to check the current time is. Either way, have you had a chance to test this out on your local node, @tomaka, or should we add the burnin label?
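For illustration only, the time-bound variant being discussed here might look roughly like the sketch below. The names (`process_with_time_budget`, `poll_one_event`) are hypothetical and not taken from the PR; the cost of the per-iteration clock check is exactly the concern raised above.

```rust
use std::time::{Duration, Instant};

// Hypothetical helper: keep processing queued events until either there is
// nothing left to do or a wall-clock budget is exceeded.
fn process_with_time_budget(mut poll_one_event: impl FnMut() -> bool) {
    let deadline = Instant::now() + Duration::from_millis(50);

    // `poll_one_event()` returns `true` while more work is queued.
    while poll_one_event() {
        // Checking the clock once per iteration is the cost wondered about
        // above; it could instead be checked only every N iterations.
        if Instant::now() >= deadline {
            break;
        }
    }
}

fn main() {
    // Toy usage: pretend there are 10,000 queued events to drain.
    let mut remaining = 10_000u32;
    process_with_time_budget(|| {
        remaining = remaining.saturating_sub(1);
        remaining > 0
    });
}
```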
I did test this on a local node. It's what is running at the moment: http://34.71.135.129:9090/graph
One can see the difference here. Since the PR got deployed, it happened once.
I do think this change is quite important, so I'm going to merge this.
bot merge
Trying merge.
At the moment, it's been diagnosed that these two `loop`s, especially the second, sometimes take a long time to be processed, up to more than a minute. During this processing, the rest of the polling isn't reached. In particular, the Prometheus metrics aren't updated, which in turn causes alarms to ring.

This change looks a bit like a hack, but I do believe that it is a fundamentally correct change (if you exclude the fact that the number of iterations is completely arbitrary), as we are kind of emulating something similar to `select`.

I've actually been wanting to make this change for a while now, but have always tried to avoid doing so because reducing the load of the network worker was the primary way to fix this. However, it is indeed possible, even when everything is working normally, for this `loop` to take an infinite amount of time. As such, this change is, as mentioned, I think fundamentally correct.
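As a hedged illustration of the pattern described above (not the actual Substrate code; `Worker`, `MAX_ITERATIONS` and the event stream are made up for the example), a bounded inner loop inside a `Future::poll` implementation could look roughly like this:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::{Stream, StreamExt};

/// Illustrative worker: drains an inner event stream, but yields back to the
/// executor after a bounded number of iterations so that the code after the
/// loop (e.g. the Prometheus metrics update) is still reached regularly.
struct Worker<S> {
    events: S,
}

impl<S: Stream<Item = u32> + Unpin> Future for Worker<S> {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Arbitrary bound, mirroring the "completely arbitrary" number of
        // iterations mentioned in the description.
        const MAX_ITERATIONS: usize = 100;

        let mut budget = MAX_ITERATIONS;
        loop {
            if budget == 0 {
                // Budget exhausted: ask to be polled again right away so the
                // remaining events aren't lost, then fall through so the code
                // below still runs.
                cx.waker().wake_by_ref();
                break;
            }
            budget -= 1;

            match self.events.poll_next_unpin(cx) {
                Poll::Ready(Some(_event)) => { /* handle the event */ }
                Poll::Ready(None) => return Poll::Ready(()),
                // Nothing left for now; the stream has registered the waker.
                Poll::Pending => break,
            }
        }

        // ... update Prometheus metrics and poll the other components here ...

        Poll::Pending
    }
}
```

The `wake_by_ref` call is what makes the bound safe: when the budget runs out, the task immediately reschedules itself, so no events are dropped; the only effect is that the rest of the poll function gets a chance to run in between.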