Use a dedicated thread for timers rather than a Timer #21490


Merged
merged 4 commits into master from davidfowl/reliable-timers on May 8, 2020

Conversation

davidfowl
Member

  • This makes it possible to still time out various operations when there's thread pool starvation occurring.
  • One downside is that the "heartbeatslow" warning that can usually communicate starvation to users is now gone.

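Roughly, the idea looks like this (a minimal sketch with illustrative names such as DedicatedTimer and onTick, not the actual Heartbeat code in this PR): a plain Thread sleeps and invokes the callback on a fixed interval, so ticks never wait on thread-pool scheduling.

```csharp
// Minimal sketch of the idea, not the PR's Heartbeat implementation.
// Names (DedicatedTimer, onTick) are illustrative.
using System;
using System.Threading;

public sealed class DedicatedTimer : IDisposable
{
    private readonly TimeSpan _interval;
    private readonly Action _onTick;
    private readonly Thread _thread;
    private volatile bool _stopped;

    public DedicatedTimer(TimeSpan interval, Action onTick)
    {
        _interval = interval;
        _onTick = onTick;
        _thread = new Thread(Loop) { Name = "Timer", IsBackground = true };
        _thread.Start();
    }

    private void Loop()
    {
        while (!_stopped)
        {
            _onTick();                // fires even when the thread pool is starved
            Thread.Sleep(_interval);  // scheduling does not depend on the thread pool
        }
    }

    public void Dispose() => _stopped = true;
}
```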
@Tratcher
Member

Tratcher commented May 5, 2020

Interesting theory. How can we tell if it's effective?

Tests need updating.

@halter73
Member

halter73 commented May 5, 2020

  • This makes it possible to still time out various operations when there's thread pool starvation occurring.
  • One downside is that the "heartbeatslow" warning that can usually communicate starvation to users is now gone.

There's still a risk of the thread being preempted and not getting rescheduled for long periods when there's threadpool starvation. I think if a heartbeat takes longer than the _interval (1s), we should still log the "heartbeatslow" warning.

Today there are two possible causes for the warning:

  1. A lot of threadpool threads become available at once, and there's a backlog of multiple timer callbacks due to starvation that get dequeued simultaneously. This we can no longer catch without queuing work items.
  2. There are so many active threadpool threads that the timer callback gets preempted and takes over 1 second of real time before it gets rescheduled and completes. This we can still catch with a dedicated thread.

I was hoping to get some customer reports of the heartbeat slow log now that we started logging the heartbeat duration in 5.0 (#15273) to see which of these causes is more common.

public void Dispose()
{
    _timer?.Dispose();
    _stopped = true;
Member

Why not implement DisposeAsync and wait? Maybe throw after a timeout.

Member Author

@davidfowl May 5, 2020

Because it’s not important. Any hung thread will be visible in the debugger/dump.
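For reference, a hedged sketch of what the suggested DisposeAsync-with-timeout could look like, assuming the _stopped and _timerThread fields shown elsewhere in this diff; the PR deliberately sticks with the plain Dispose above.

```csharp
// Hypothetical alternative, not what the PR does: signal stop, then wait for
// the timer thread to exit, throwing if it doesn't within a timeout.
// (Uses System, System.Threading, and System.Threading.Tasks.)
public async ValueTask DisposeAsync()
{
    _stopped = true;

    var joined = await Task.Run(() => _timerThread.Join(TimeSpan.FromSeconds(5)));
    if (!joined)
    {
        throw new TimeoutException("Timer thread did not stop within 5 seconds.");
    }
}
```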

_timerThread = new Thread(state => ((Heartbeat)state).TimerLoop())
{
    Name = "Kestrel Timer",
    IsBackground = true
Member

In LibuvThread, we set IsBackground to false in debug so we can find issues where the thread isn't properly stopping.

Member Author

Sure, but the libuv thread does more than this; I wouldn’t compare the two. We could do the same in debug builds if you’re paranoid about the timer thread not shutting down.
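For reference, a hedged sketch of the LibuvThread-style debug toggle being discussed, reusing the thread-construction code from this diff (illustrative only, not part of this PR):

```csharp
_timerThread = new Thread(state => ((Heartbeat)state).TimerLoop())
{
    Name = "Kestrel Timer",
#if DEBUG
    // Foreground in debug builds so a thread that never stops keeps the
    // process alive and is easy to spot.
    IsBackground = false
#else
    IsBackground = true
#endif
};
```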

Member

How/when does this get initiated? Is it lazy and triggered by a first request? i.e. does it need ExecutionContext flow suppression?

Member

Don't want an HttpContext stuck hanging around on a Thread 😉

Member Author

How/when does this get initiated? Is it lazy and triggered by a first request? i.e. does it need ExecutionContext flow suppression?

Startup. We could suppress the execution context, but that isn't new behavior: the Timer before didn't suppress it either 😄, nothing new here.
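For anyone curious, a hedged sketch of what suppressing the flow would look like if it were added; the _timerThread name comes from this diff, the Start argument is assumed, and the PR does not do this (matching the old Timer behavior):

```csharp
// Hypothetical: suppress ExecutionContext flow so AsyncLocals captured at
// startup don't get carried onto the long-lived timer thread.
if (!ExecutionContext.IsFlowSuppressed())
{
    using (ExecutionContext.SuppressFlow())
    {
        _timerThread.Start(this); // 'this' would be the Heartbeat instance
    }
}
else
{
    _timerThread.Start(this);
}
```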

@halter73
Member

halter73 commented May 5, 2020

You're going to want to look at

heartbeatHandler.Verify(h => h.OnHeartbeat(systemClock.UtcNow), Times.Once());

and the other tests in that class.

@davidfowl
Member Author

There are so many active threadpool threads that the timer callback gets preempted and takes over 1 second of real time before it gets rescheduled and completes. This we can still catch with a dedicated thread.

Sure. What’s great about this is that this thread will continue to run during the ramp-up to a high number of threadpool threads. Today the thread pool needs to dequeue the timer callback while threads are slowly being injected into the pool (competing with other queued work).
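To make the ramp-up point concrete, here's a hedged, self-contained console sketch (not code from this PR; exact timings depend on the machine and the pool's thread-injection rate). It starves the pool and compares when a pool-based Timer tick and a dedicated-thread tick actually fire:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class StarvationDemo
{
    static void Main()
    {
        ThreadPool.SetMinThreads(1, 1);
        var sw = Stopwatch.StartNew();

        // Timer callbacks are queued to the thread pool, so under starvation
        // they wait behind the blocked work items queued below.
        using var poolTimer = new Timer(
            _ => Console.WriteLine($"pool Timer fired at {sw.ElapsedMilliseconds} ms"),
            null, dueTime: 1000, period: Timeout.Infinite);

        // A dedicated thread ticks on schedule regardless of pool pressure.
        var dedicated = new Thread(() =>
        {
            Thread.Sleep(1000);
            Console.WriteLine($"dedicated thread fired at {sw.ElapsedMilliseconds} ms");
        }) { IsBackground = true };
        dedicated.Start();

        // Flood the pool with blocking items to simulate starvation.
        for (int i = 0; i < 64; i++)
        {
            ThreadPool.QueueUserWorkItem(_ => Thread.Sleep(5000));
        }

        Thread.Sleep(10_000); // give both a chance to fire before exiting
    }
}
```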

@halter73
Member

halter73 commented May 6, 2020

Sure. What’s great about this is that this thread will continue to run during the ramp-up to a high number of threadpool threads. Today the thread pool needs to dequeue the timer callback while threads are slowly being injected into the pool (competing with other queued work).

The thread could still get preempted. A developer could block too long in a custom IConnectionHeartbeatFeature callback. We could introduce a deadlock. There are a bunch of reasons we should still time OnHeartbeat() and log "heartbeatslow" if it takes over a second.

I agree that "heartbeatslow" warnings are more likely caused by a bunch of timer callbacks getting dequeued in rapid succession as the threadpool works through a backlog of work items, but what's the harm in timing OnHeartbeat()? Are we sure an already-dequeued timer callback never gets preempted by the OS for over a second? What's wrong with logging something if and when that does happen?

Again, getting #15273 in the hands of customers would be telling.

@davidfowl
Member Author

The thread could still get preempted. A developer could block too long in a custom IConnectionHeartbeatFeature callback. We could introduce a deadlock. There are a bunch of reasons we should still time OnHeartbeat() and log "heartbeatslow" if it takes over a second.

Yes all 4 of those developers writing connection middleware 😬.

@halter73
Member

halter73 commented May 6, 2020

You're right about the connection middleware. I'm sure there are fewer that use the feature. The warning wouldn't catch a deadlock either, because we don't have another thread to observe it on.

We should still log if OnHeartbeat() takes over _interval though. If it never happens, no harm.

- Print heartbeat slow if duration > interval
- Fix tests
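The first commit above amounts to something like the following hedged sketch; the field names (_systemClock, _handler, _logger, _interval) are assumptions for illustration, not the actual diff:

```csharp
// Illustrative only: time one heartbeat tick and warn if it exceeds the interval.
private void OnHeartbeat()
{
    var now = _systemClock.UtcNow;
    _handler.OnHeartbeat(now);

    var duration = _systemClock.UtcNow - now;
    if (duration > _interval)
    {
        _logger.LogWarning(
            "Heartbeat took longer than {Interval}: {Duration}.", _interval, duration);
    }
}
```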
@JamesNK
Member

JamesNK commented May 7, 2020

Is there overhead in reserving a dedicated thread? e.g. is it reserving a bunch of memory for its stack that it will never use?

@davidfowl
Member Author

Is there overhead in reserving a dedicated thread? e.g. is it reserving a bunch of memory for its stack that it will never use?

There's definitely overhead, but not enough to matter. This thread runs {number of connections} callbacks every second, so it's busy.

@JamesNK
Member

JamesNK commented May 7, 2020

Yeah, whether the overhead matters or not is more my question. I recall threads grabbing a megabyte of memory for their stacks. I was wondering whether that was still a thing, and if it is, whether this is worth it.

@benaadams
Member

I recall threads grabbing a megabyte of memory for their stacks.

I think it's just reserved pretend memory these days?
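The ~1 MB default is reserved virtual address space, committed page by page only as the stack actually grows. If even the reservation mattered, the Thread constructor could cap it, as in this hedged sketch (not something this PR does):

```csharp
// Hypothetical: explicitly cap the reserved stack if the default 1 MB
// reservation were ever a concern (the PR just uses the default).
var timerThread = new Thread(() => { /* timer loop */ }, maxStackSize: 256 * 1024)
{
    Name = "Kestrel Timer",
    IsBackground = true
};
```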

@davidfowl
Member Author

@dougbu what gives?

@dougbu
Contributor

dougbu commented May 7, 2020

@davidfowl what gives with what?

@davidfowl
Member Author

There’s an OOM in msbuild blocking my merge

@dougbu
Contributor

dougbu commented May 7, 2020

Ah, that's a known problem w/ msbuild: dotnet/msbuild#3577. @wtgodbe is trying to get a fix through Arcade in dotnet/arcade#5421 and, once that's agreed on, will apply the change to eng/common files in #21478 to avoid the (very long) delays before Arcade can publish an updated version.

And, no, we don't know why the problem is suddenly much worse. @rainersigwald any comment on that side of the issue?

@halter73
Member

halter73 commented May 7, 2020

I think dotnet/msbuild#3577 is tracking the issue (even though it's really old; @wtgodbe is talking about opening a new issue), and dotnet/arcade#5421 and/or #21478 should mitigate it.

There have been a lot of these failures reported in the ASP.NET Build Teams channel.

@davidfowl
Member Author

OK well somebody with admin merge my stuff 😄

@dougbu
Contributor

dougbu commented May 7, 2020

Just retry the failing leg. The OOM errors are flaky, not failing all the time.

@davidfowl davidfowl merged commit 6aa13dd into master May 8, 2020
@davidfowl davidfowl deleted the davidfowl/reliable-timers branch May 8, 2020 08:07