Skip to content
This repository was archived by the owner on Nov 15, 2023. It is now read-only.

Distribute the network future polling time more evenly #6903

Merged
2 commits merged into from
Aug 19, 2020

Conversation

tomaka
Copy link
Contributor

@tomaka tomaka commented Aug 17, 2020

At the moment, it's been diagnosed that these two loops, especially the second, sometimes take a long time to be processed, up to more than a minute. During this processing, the rest of the polling isn't reached. In particular, the Prometheus metrics aren't updated, which in turn causes alarms to ring.

This change looks a bit like a hack, but I do believe that it is a fundamentally correct change (if you exclude the fact that the number of iterations is completely arbitrary), as we are kind of emulating something similar to select.

I've actually been willing to make this change for a while now, but have always tried to avoid doing so because reducing the load of the network worker was the primary way to fix this. However, it is indeed possible, even when everything is working normally, for this loop to take an infinite amount of time. As such, this change is, as mentioned, I think, fundamentally correct.

@tomaka tomaka added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Aug 17, 2020
@tomaka tomaka requested a review from mxinden August 17, 2020 10:51
Copy link
Contributor

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case this does fix the issue of outdated metrics (with the metric updating at the end of the loop) I am fine merging this.

To me it seems hard to find the right values for the break constants, in case this does not have the expected behaviour, I am not in favor of merging.

@tomaka
Copy link
Contributor Author

tomaka commented Aug 17, 2020

I'd like to point out that the current code will exhibit being stuck in these loops if the network traffic is high enough and causes messages to be sent to the worker at a higher rate than they are being processed.

In other words, there is always going to be some threshold of network traffic over which the worker will be stuck. Reducing the CPU consumption of the worker will raise that threshold, but can't make this threshold disappear unless we do a change similar to what this PR does.

@tomaka tomaka added C3-medium PR touches the given topic and has a medium impact on builders. and removed C1-low PR touches the given topic and has a low impact on builders. labels Aug 17, 2020
@mxinden
Copy link
Contributor

mxinden commented Aug 19, 2020

That's why actually enforcing time bounds would seem to be a more direct approach to the problem.

I agree with @romanb that enforcing time bounds is a more intuitive approach to the given problem. I only wonder how expensive a call to Instant::now on each iteration is.

Either way, have you had a chance to test this out on your local node @tomaka, or should we add the burnin label?

@tomaka
Copy link
Contributor Author

tomaka commented Aug 19, 2020

I did test this on a local node. It's what is running at the moment: http://34.71.135.129:9090/graph

@tomaka
Copy link
Contributor Author

tomaka commented Aug 19, 2020

One can see the difference here
This graph represents the number of times, since the last restart, when polling network-worker took more than one second.

Since the PR got deployed, it happened once.

@tomaka
Copy link
Contributor Author

tomaka commented Aug 19, 2020

I do think this change is quite important, so I'm going to merge this.
I'll try refactor this polling function and open another PR.

@tomaka
Copy link
Contributor Author

tomaka commented Aug 19, 2020

bot merge

@ghost
Copy link

ghost commented Aug 19, 2020

Trying merge.

@ghost ghost merged commit 265dd74 into paritytech:master Aug 19, 2020
@tomaka tomaka deleted the net-hack-perf branch August 19, 2020 11:45
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C3-medium PR touches the given topic and has a medium impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants