-
Notifications
You must be signed in to change notification settings - Fork 5k
Add new telemetry EventCounter/EventSource tracing to System.Net.* #37428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Tagging subscribers to this area: @dotnet/ncl |
Are you proposing leaving the existing logging as-is, or are you proposing migrating the existing functionality to the new namespaces/types and be more rigorously designed? It'd be nice if we had just one way of doing things in the codebase, as well as if we're going to be rigorous about treating the logging as documented behavior that we're able to do that consistently across old and the new. And as we clean up the existing logging, I think it'd be worthwhile examining if all of it is really necessary, e.g. we've got a bunch of entered/exit events at the method level of a bunch of methods, but it appears to be very inconsistent, and I don't know that any of it is actually valuable.
Is this only applicable when MaxConnectionsPerServer is set, or for HTTP/2 when we hit the server's stream limit? Otherwise I'm not sure what this means.
Presumably that includes due to explicit cancellation, timeouts, etc., right?
What about current in addition to max?
I'm curious what one does with this information...?
Is this meant to be just those that failed or all that completed? Copy/paste error?
Any idea what kind of overhead this might add? We'll need to be sure that it's basically zero when the counters are disabled. Also, what's the design for how someone will enable these and the fine-grainedness at which they're able to, e.g. if turning these on does slow down a high-throughput server measurably, we'll want to be careful about how easy that is to do accidentally.
Anything about request latencies? |
I had email discussions with @noahfalk and his team about this question. He generally recommends against deleting events, changing namespace names because it can break applications that dependson it. I really wanted to create a new namespace that used .NET namespace names (i.e. System.Net.Http) instead of Windows ETW style names (Microsoft-System-Net-Http) as this matches what is done for new EventCounter's today (including ASP.NET) and is more discoverable. And I thought a new namespace focused on measurable telemetry would keep things streamlined for performance reasons since fewer call-sites would be activated for emitting events when you want them for measuring system health/performance as compared to deeper diagnostics/debugging with the existing events we have.
It is true that some of the previously ported "Microsoft-System-Net-" events we have are a mess and are inconsistent. I think we can address that. But I would prefer to do that with a separate discussion/issue and not bottleneck on getting this telemetry focused events/counters into dotnet/runtime. |
We had similar "counters" (Performance Counters) in .NET Framework and we didn't see perf issues reported by customers that turned on those counters. But we can toggle counters like these with further keyword filtering. @anurse guidance on adding ASP.NET counters talks about these topics and recommends a pattern for adding them using keywords that need to be explicitly opt-in. I added a link to that guidance in the top-post of this issue. |
In ASP.NET, we have total-requests (i.e. requests-started), current-requests (i.e. requests-completed - requests-started), and failed-requests (i.e. requests-aborted). I prefer the way ASP.NET tracks current-requests instead of requests-completed even though both convey the same information. I also think failed-requests is better than requests-aborted since the latter implies that it's only counting requests that were aborted by the client when it's counting other cases like when Send(Async) throws due to a 500 (I assume). I do think "request-started" is better than "total-requests" because there's less ambiguity, but I don't think we can change around the names of those ASP.NET counters at this point. I'm not sure how much we value clarity vs consistency. I find "per" in "http11-connections-per-pool-max" and "http20-streams-per-connection-max" a bit confusing. In my mind, "per" implies division. Is this the high-water mark of the average connections/pool? Or is this the high-water mark for any single pool? If it's the latter, I'd recommend "http11-connections-single-pool-max" and "http20-streams-single-connection-max". What does everyone think about process-wide counters for current HTTP/1.1 connections, HTTP/2 connections, HTTP/2 streams and System.Net.Sockets connections? I agree counters for min/max/mean request duration are interesting. That's something we haven't added to ASP.NET yet but are considering. |
I think But for the client-side like HttpClient, typically, aborted requests, i.e. those that throw exceptions are more useful than "failed" requests. HttpClient doesn't consider returning a valid status code (even 500) as an "failure/error". And the .NET Framework Performance Counters for networking have had HttpWebRequest track "aborted" as well (as a rate counter).
Thx for the feedback on the naming. Yes, I was attempting to catch "high-water mark" of any pool. So, your suggestion for names like |
@stephentoub Regarding your comments from #37428 (comment)
My original intent was that it mirrored .NET Framework "HttpWebRequest Average Queue Time" which was basically how long the request sat around waiting to be dispatched to a thread and a valid connection and begin sending out request header bytes. We get asked all the time about why does my request time out? What is it doing? And it would be good to be able to point to the reason as "it's just sitting around doing nothing". So, in terms of whether it is going to be an HTTP/1.1 or HTTP/2.0 request, it doesn't matter. @davidfowl did add a
Yes.
Oops. I meant 'failed'. Thx. I will update top post.
I'm not sure how we would do that in a counter as it changes all the time and there are many different pools. But we can consider this later.
That is part of the #827 work that is tracking that.
Actually, this might end up being a rate counter, i.e. "Number of TLS handshakes started per second", etc. I will follow up in actual future PR. |
That's all goodness. But there's a key difference here. In .NET Framework, there was a default max requests per server of 2 or 10 or something along those lines. In .NET Core, it defaults to unlimited, so someone would have to explicitly set MaxConnectionsPerServer for this counter to be meaningful, right?
Just as we do for all of the rest of the PollingCounters we've defined, e.g. runtime/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/RuntimeEventSource.cs Lines 62 to 85 in 797dc08
Timer.ActiveCounter fluctuates all the time, ThreadPool.ThreadCount fluctuates all the time, etc. |
That assumes that the request will being right away (i. e. create a new connection if needed and immediately send out request header bytes). But in a loaded system we've seen things sit around waiting for threadpool threads just to begin doing any work. And I think it might be useful to measure that. Perhaps we won't end up with an actual counter though. Implementing #827 will result in having events (not a counter, per se) with timestamps for each step of the request lifecycle. Perhaps aggregating those events can give us the same information about unexpected latencies. |
How are you planning to track that / differentiate it from other causes of Socket.ConnectAsync not completing quickly? |
This is the first of several PRs to add telemetry focused events and counters. A few things to note: * Added a new dedicated namespace for telemetry. Current feedback from partners is that our existing events are "confusing and a mess" due to their focus on low-level diagnostics and verbose debugging. See discussion in dotnet#37428 regarding reasons for a new namespace. * Please comment on the naming patterns for the events, counters, and display names, etc. I looked at ASP.NET events/counters and @anurse older guidance in https://gist.github.com/anurse/af1859663ac91c6cf69c820cebe92303. But there are inconsistencies between all of that. So, I'd like to get agreement on naming patterns for events/counters (i.e. noun-verb vs. verb-noun, present tense vs. past tense, etc.). * No automated tests were added at this time. The current discussions with others (reverse proxy team) is that automated tests are brittle in CI. But manual tests will be run. Contributes to dotnet#37428
Thanks for the very nice writeup @davidsh ! I've been handling some other obligations last week and still not quite free of it, but I am hoping within a few days to have more time to dig into this. |
I may need to rename the kestrel event source. Everything is inconsistent now 😁. Can we agree dots > slashes? |
I think you mean "dashes" instead of "slashes"? @noahfalk recommended against renaming existing EventSource/EventCounter namespaces. But perhaps that is general guidance. |
Do we have an issue tracking cleaning that up? It really is a mess. |
|
* Add telemetry to System.Net.Http This is the first of several PRs to add telemetry focused events and counters. A few things to note: * Added a new dedicated namespace for telemetry. Current feedback from partners is that our existing events are "confusing and a mess" due to their focus on low-level diagnostics and verbose debugging. See discussion in #37428 regarding reasons for a new namespace. * Please comment on the naming patterns for the events, counters, and display names, etc. I looked at ASP.NET events/counters and @anurse older guidance in https://gist.github.com/anurse/af1859663ac91c6cf69c820cebe92303. But there are inconsistencies between all of that. So, I'd like to get agreement on naming patterns for events/counters (i.e. noun-verb vs. verb-noun, present tense vs. past tense, etc.). * No automated tests were added at this time. The current discussions with others (reverse proxy team) is that automated tests are brittle in CI. But manual tests will be run. Contributes to #37428 * Address PR feedback * PR feedback * Added RequestAbort as separate event method with dimensions (host, port) * Address PR feedback 2 * RequestStart now includes host, port, scheme, path and version * RequestStart and RequestStop now bracket the lifetime of the request including reading the response body stream. If the response body is not complete read then disposing the stream will trigger RequestStop as well. * Address feedback 3 * Use exception filter to improve performance * Avoid negative 'current-requests' counter value * Address feedback 4
Socket's connect start, stop and cancelled events are logged as activities via EventSource. Contributes to #37428
Counters in SocketsTelemetry measuring `bytes-received` (Bytes Received) The cumulative total number of bytes received by all Socket objects since the process started. `bytes-sent` (Bytes Sent) The cumulative number of bytes sent by all Socket objects since the process started. `outgoing-connections-established` (Outgoing Connections Established) The cumulative total number of outgoing connections established by Socket objects for stream sockets since the process started. `incoming-connections-established` (Incoming Connections Established) The cumulative total number of incoming connections established by Socket objects for stream sockets since the process started. `datagrams-received` (Datagrams Received) The cumulative total number of datagram packets received by all Socket objects since the process started. `datagrams-sent` (Datagrams Sent) The cumulative total number of datagram packets sent by all Socket objects since the process started. Contributes to #37428
Counters in SocketsTelemetry measuring `bytes-received` (Bytes Received) The cumulative total number of bytes received by all Socket objects since the process started. `bytes-sent` (Bytes Sent) The cumulative number of bytes sent by all Socket objects since the process started. `outgoing-connections-established` (Outgoing Connections Established) The cumulative total number of outgoing connections established by Socket objects for stream sockets since the process started. `incoming-connections-established` (Incoming Connections Established) The cumulative total number of incoming connections established by Socket objects for stream sockets since the process started. `datagrams-received` (Datagrams Received) The cumulative total number of datagram packets received by all Socket objects since the process started. `datagrams-sent` (Datagrams Sent) The cumulative total number of datagram packets sent by all Socket objects since the process started. Contributes to dotnet#37428
Current list of implemented events and counters in
|
@MihaZupan is there anything left, or should we close this issue as everything we planned is now implemented? |
Uh oh!
There was an error while loading. Please reload this page.
Proposed Telemetry
To support projects such as YARP, we plan to add new counters (EventCounter, PollingCounter, etc) and EventSource events. The main goal of this new telemetry is to support automated tools that can measure the performance/health of the overall system.
Note: This top-post will be edited and keep up to date with changes.
Goals
NonGoals (for .NET 5)
Prior Art
Overall Design
New EventSource classes and namespaces
We currently have an existing 'NetEventSource' class which is used for all current System.Net.* event tracing. The current design uses the same class name factored into partial classes per System.Net.* assembly.
The primary purpose of these existing events is for diagnostics and debugging. They were added in-masse when .NET Core was first ported from .NET Framework. All the existing System.Net Logging was copied and used as the original basis for .NET Core events. While the tracing in useful for debugging, it is very verbose and not ideally suited for consumption with automated tools.
These existing EventSource namespaces have a prefix of the form "Microsoft-System-Net-". It was a naming pattern that was based on historical patterns in Windows OS for ETW events. Newer EventSource and EventCounter classes in .NET Core have been added that now follow a .NET namespace naming pattern. ASP.NET Core uses this naming pattern. After discussions with the .NET Diagnostics team, it was suggested that we start new namespaces for telemetry focused events/counters.
The plan for these telemetry focused events and counters is to create new EventSource namespaces matching the namespace name of the .NET types (Sockets, NameResolution, Http, etc.). We should also treat all these new events and counters like APIs. We should be rigorous in documenting them since customer software will be used to consume them. We should also prioritize end-to-end scenarios as we add new events/counters.
Proposed Counters
Counters are listed below under each new EventSource namespace. They are described in the following form
EventCounter Name
(Display Name) Detailed informationThe actual .NET class used for implementation might be
EventCounter
orPollingCounter
etc. depending on implementation requirements.In addition to the counters listed below, there will be some corresponding EventSource events that will add dimensional information such as 'host name', 'ip address', 'port' etc. The name of the EventSource event will be similar to the counter name. Actual details will be provided in subsequent PRs.
System.Net.Http
requests-queue-duration
(HTTP Requests Queue Duration)The time-on-queue (min/max/mean) for all HTTP request objects that left the queue in the last interval since the process started.
HTTP request objects are measured as HttpRequestMessage objects being processed by SocketsHttpHandler.Send(Async).
requests-started
(HTTP Requests Started)The cumulative number of HTTP request objects started since the process started.
requests-completed
(HTTP Requests Completed)The cumulative number of HTTP request objects completed since the process started. Completed means that an HTTP response was received regardless of status code value.
requests-aborted
(HTTP Requests Aborted)The cumulative number of HTTP request objects aborted since the process started. Aborted means that an Exception occurred during the SocketsHttpHandler.Send(Async) call.
requests-started-per-second
(HTTP Requests Started Rate)The number of HTTP request objects started per second since the process started.
requests-completed-per-second
(HTTP Requests Completed Rate)The number of HTTP request objects completed per second since the process started.
requests-aborted-per-second
(HTTP Requests Aborted Rate)The number of HTTP request objects aborted per second since the process started.
http11-connections-single-pool-max
(HTTP 1.1 connections per pool)The maximum number (high-water mark) of TCP connections across any HTTP 1.1 connection pool in SocketsHttpHandler.
http20-streams-single-connection-max
(HTTP 2.0 streams per connection)The maximum number (high-water mark) of HTTP/2 streams across any HTTP 2.0 connection in SocketsHttpHandler.
Update 7/1/2020
Current connection counters as suggested below
http11-connections-current-total
(HTTP 1.1 current total connections)The total number of open TCP connections across all HTTP 1.1 connection pools in SocketsHttpHandler.
http20-connections-current-total
(HTTP 2.0 current total connections)The total number of open TCP connections across all HTTP/2 connection pools in SocketsHttpHandler.
Connection lifecycle events
Http11ConnectionOpened
New HTTP 1.1. connection established.
Http11ConnectionAborted
HTTP 1.1 connection abruptly terminated.
Http11ConnectionClosed
Existing HTTP 1.1 connection closed.
Http2ConnectionOpened
New HTTP/2 connection established.
Http2ConnectionAborted
HTTP/2 connection abruptly terminated.
Http2ConnectionClosed
Existing HTTP/2 connection closed.
Request timings not covered by other events as requested in #827
ResponseBegin
Start receiving response. Response end matches
RequestStop
, so no extra event required.System.Net.Security
tls-handshakes-per-second
(TLS Handshake Rate)The number of TLS handshakes completed per second since the process started.
total-tls-handshakes
(Total TLS Handshakes)The total number of TLS handshakes attempted since the process started. This includes all handshakes completely successfully or failed.
current-tls-handshakes
(Current TLS Handshakes)The current number of TLS handshakes started but not yet completed.
failed-tls-handshakes
(Failed TLS Handshakes)The number of TLS handshakes failed since the process started.
tls10-connections-open
(TLS 1.0 Connections Open)The current number of active TLS 1.0 connections opened by SslStream objects.
tls11-connections-open
(TLS 1.1 Connections Open)The current number of active TLS 1.1 connections opened by SslStream objects.
tls12-connections-open
(TLS 1.2 Connections Open)The current number of active TLS 1.2 connections opened by SslStream objects.
tls13-connections-open
(TLS 1.3 Connections Open)The current number of active TLS 1.3 connections opened by SslStream objects.
tls10-handshake-duration
(TLS 1.0 Handshake Duration)The duration for completion (min/max/mean) for all TLS 1.0 handshakes since the process started.
tls11-handshake-duration
(TLS 1.1 Handshake Duration)The duration for completion (min/max/mean) for all TLS 1.1 handshakes since the process started.
tls12-handshake-duration
(TLS 1.2 Handshake Duration)The duration for completion (min/max/mean) for all TLS 1.2 handshakes since the process started.
tls13-handshake-duration
(TLS 1.3 Handshake Duration)The duration for completion (min/max/mean) for all TLS 1.3 handshakes since the process started.
System.Net.NameResolution
dns-lookups-requested
(DNS Lookups Requested)The cumulative number of DNS lookups requested since the process started.
dns-lookups-duration
(DNS Lookups Duration)The duration for completion (min/max/mean) for all DNS name lookup queries since the process started.
System.Net.Sockets
bytes-received
(Bytes Received)The cumulative total number of bytes received by all Socket objects since the process started.
bytes-sent
(Bytes Sent)The cumulative number of bytes sent by all Socket objects since the process started.
connections-established
(Connections Established)The cumulative total number of Socket objects for stream sockets that were ever connected since the process started.
datagrams-received
(Datagrams Received)The cumulative total number of datagram packets received by all Socket objects since the process started.
datagrams-sent
(Datagrams Sent)The cumulative total number of datagram packets sent by all Socket objects since the process started.
TODO
Detailed timings for HTTP request
The text was updated successfully, but these errors were encountered: