Distribute the network future polling time more evenly #6903
Conversation
If this does fix the issue of outdated metrics (with the metrics only being updated at the end of the loop), I am fine with merging this.
To me it seems hard to find the right values for the break constants; if this does not have the expected behaviour, I am not in favor of merging.
I'd like to point out that the current code can get stuck in these `loop`s. In other words, there is always going to be some threshold of network traffic over which the worker will be stuck. Reducing the CPU consumption of the worker will raise that threshold, but can't make it disappear unless we do a change similar to what this PR does.
I agree with @romanb that enforcing time bounds is a more intuitive approach to the given problem. I only wonder how expensive a call to check the current time is. Either way, have you had a chance to test this out on your local node, @tomaka, or should we add the burnin label?
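For illustration only, the time-bound variant being discussed here might look roughly like the sketch below. The names (`process_with_time_budget`, `poll_one_event`) are hypothetical and not taken from the PR; the cost of the per-iteration clock check is exactly the concern raised above.

```rust
use std::time::{Duration, Instant};

// Hypothetical helper: keep processing queued events until either there is
// nothing left to do or a wall-clock budget is exceeded.
fn process_with_time_budget(mut poll_one_event: impl FnMut() -> bool) {
    let deadline = Instant::now() + Duration::from_millis(50);

    // `poll_one_event()` returns `true` while more work is queued.
    while poll_one_event() {
        // Checking the clock once per iteration is the cost wondered about
        // above; it could instead be checked only every N iterations.
        if Instant::now() >= deadline {
            break;
        }
    }
}

fn main() {
    // Toy usage: pretend there are 10,000 queued events to drain.
    let mut remaining = 10_000u32;
    process_with_time_budget(|| {
        remaining = remaining.saturating_sub(1);
        remaining > 0
    });
}
```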
I did test this on a local node. It's what is running at the moment: http://34.71.135.129:9090/graph
One can see the difference here. Since the PR got deployed, it happened once.
I do think this change is quite important, so I'm going to merge this.
bot merge
Trying merge.
At the moment, it's been diagnosed that these two `loop`s, especially the second, sometimes take a long time to be processed, up to more than a minute. During this processing, the rest of the polling isn't reached. In particular, the Prometheus metrics aren't updated, which in turn causes alarms to ring.

This change looks a bit like a hack, but I do believe that it is a fundamentally correct change (if you exclude the fact that the number of iterations is completely arbitrary), as we are kind of emulating something similar to `select`.

I've actually been wanting to make this change for a while now, but have always tried to avoid doing so because reducing the load of the network worker was the primary way to fix this. However, it is indeed possible, even when everything is working normally, for this `loop` to take an infinite amount of time. As such, this change is, as mentioned, I think fundamentally correct.
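As a hedged illustration of the pattern described above (not the actual Substrate code; `Worker`, `MAX_ITERATIONS` and the event stream are made up for the example), a bounded inner loop inside a `Future::poll` implementation could look roughly like this:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::{Stream, StreamExt};

/// Illustrative worker: drains an inner event stream, but yields back to the
/// executor after a bounded number of iterations so that the code after the
/// loop (e.g. the Prometheus metrics update) is still reached regularly.
struct Worker<S> {
    events: S,
}

impl<S: Stream<Item = u32> + Unpin> Future for Worker<S> {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Arbitrary bound, mirroring the "completely arbitrary" number of
        // iterations mentioned in the description.
        const MAX_ITERATIONS: usize = 100;

        let mut budget = MAX_ITERATIONS;
        loop {
            if budget == 0 {
                // Budget exhausted: ask to be polled again right away so the
                // remaining events aren't lost, then fall through so the code
                // below still runs.
                cx.waker().wake_by_ref();
                break;
            }
            budget -= 1;

            match self.events.poll_next_unpin(cx) {
                Poll::Ready(Some(_event)) => { /* handle the event */ }
                Poll::Ready(None) => return Poll::Ready(()),
                // Nothing left for now; the stream has registered the waker.
                Poll::Pending => break,
            }
        }

        // ... update Prometheus metrics and poll the other components here ...

        Poll::Pending
    }
}
```

The `wake_by_ref` call is what makes the bound safe: when the budget runs out, the task immediately reschedules itself, so no events are dropped; the only effect is that the rest of the poll function gets a chance to run in between.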