Skip to content

Add new telemetry EventCounter/EventSource tracing to System.Net.* #37428

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
davidsh opened this issue Jun 4, 2020 · 17 comments
Closed

Add new telemetry EventCounter/EventSource tracing to System.Net.* #37428

davidsh opened this issue Jun 4, 2020 · 17 comments
Assignees
Labels
area-System.Net enhancement Product code improvement that does NOT require public API changes/additions
Milestone

Comments

@davidsh
Copy link
Contributor

davidsh commented Jun 4, 2020

Proposed Telemetry

To support projects such as YARP, we plan to add new counters (EventCounter, PollingCounter, etc) and EventSource events. The main goal of this new telemetry is to support automated tools that can measure the performance/health of the overall system.

Note: This top-post will be edited and keep up to date with changes.

Goals

  • For 'dotnet-counters' scenarios add some basic counters that don't have any dimensions. The EventCounter, PollingCounter, etc. classes provide for in-memory aggregation and are generally lightweight.
  • For consumption via automated tools where developers want to do their own custom aggregation, emit EventSource events that include both numeric data and some fixed dimension name/value pairs. Initially, the fixed dimensions will be things like 'host name', 'ip address', 'port', etc.
  • Include some EventSource event names that have the 'Start' and 'Stop' suffixes which will automatically create in-process activity-id tracing for subsequent events. This allows for easy correlation for many events to higher-level activities. I.e. aggregate all SslStream operations to particular HttpClient requests. See Vance's blog for background.

NonGoals (for .NET 5)

  • Distributed tracing (cross-process) using OpenTelemetry. Note that the automatic generation of activity-ids (GUIDs) with EventSource 'Start', 'Stop' events can be used in the future to map into OpenTelemetry APIs.
  • Adding dimensions directly to EventCounter's. The current APIs for EventCounter don't work well for dimensions.

Prior Art

Overall Design

New EventSource classes and namespaces

We currently have an existing 'NetEventSource' class which is used for all current System.Net.* event tracing. The current design uses the same class name factored into partial classes per System.Net.* assembly.

The primary purpose of these existing events is for diagnostics and debugging. They were added in-masse when .NET Core was first ported from .NET Framework. All the existing System.Net Logging was copied and used as the original basis for .NET Core events. While the tracing in useful for debugging, it is very verbose and not ideally suited for consumption with automated tools.

These existing EventSource namespaces have a prefix of the form "Microsoft-System-Net-". It was a naming pattern that was based on historical patterns in Windows OS for ETW events. Newer EventSource and EventCounter classes in .NET Core have been added that now follow a .NET namespace naming pattern. ASP.NET Core uses this naming pattern. After discussions with the .NET Diagnostics team, it was suggested that we start new namespaces for telemetry focused events/counters.

The plan for these telemetry focused events and counters is to create new EventSource namespaces matching the namespace name of the .NET types (Sockets, NameResolution, Http, etc.). We should also treat all these new events and counters like APIs. We should be rigorous in documenting them since customer software will be used to consume them. We should also prioritize end-to-end scenarios as we add new events/counters.

Proposed Counters

Counters are listed below under each new EventSource namespace. They are described in the following form
EventCounter Name (Display Name) Detailed information

The actual .NET class used for implementation might be EventCounter or PollingCounter etc. depending on implementation requirements.

In addition to the counters listed below, there will be some corresponding EventSource events that will add dimensional information such as 'host name', 'ip address', 'port' etc. The name of the EventSource event will be similar to the counter name. Actual details will be provided in subsequent PRs.

System.Net.Http

requests-queue-duration (HTTP Requests Queue Duration)
The time-on-queue (min/max/mean) for all HTTP request objects that left the queue in the last interval since the process started.
HTTP request objects are measured as HttpRequestMessage objects being processed by SocketsHttpHandler.Send(Async).

requests-started (HTTP Requests Started)
The cumulative number of HTTP request objects started since the process started.

requests-completed (HTTP Requests Completed)
The cumulative number of HTTP request objects completed since the process started. Completed means that an HTTP response was received regardless of status code value.

requests-aborted (HTTP Requests Aborted)
The cumulative number of HTTP request objects aborted since the process started. Aborted means that an Exception occurred during the SocketsHttpHandler.Send(Async) call.

requests-started-per-second (HTTP Requests Started Rate)
The number of HTTP request objects started per second since the process started.

requests-completed-per-second (HTTP Requests Completed Rate)
The number of HTTP request objects completed per second since the process started.

requests-aborted-per-second (HTTP Requests Aborted Rate)
The number of HTTP request objects aborted per second since the process started.

http11-connections-single-pool-max (HTTP 1.1 connections per pool)
The maximum number (high-water mark) of TCP connections across any HTTP 1.1 connection pool in SocketsHttpHandler.

http20-streams-single-connection-max (HTTP 2.0 streams per connection)
The maximum number (high-water mark) of HTTP/2 streams across any HTTP 2.0 connection in SocketsHttpHandler.

Update 7/1/2020

Current connection counters as suggested below

http11-connections-current-total (HTTP 1.1 current total connections)
The total number of open TCP connections across all HTTP 1.1 connection pools in SocketsHttpHandler.

http20-connections-current-total (HTTP 2.0 current total connections)
The total number of open TCP connections across all HTTP/2 connection pools in SocketsHttpHandler.

Connection lifecycle events

Http11ConnectionOpened
New HTTP 1.1. connection established.

Http11ConnectionAborted
HTTP 1.1 connection abruptly terminated.

Http11ConnectionClosed
Existing HTTP 1.1 connection closed.

Http2ConnectionOpened
New HTTP/2 connection established.

Http2ConnectionAborted
HTTP/2 connection abruptly terminated.

Http2ConnectionClosed
Existing HTTP/2 connection closed.

Request timings not covered by other events as requested in #827
ResponseBegin
Start receiving response. Response end matches RequestStop, so no extra event required.

System.Net.Security

tls-handshakes-per-second (TLS Handshake Rate)
The number of TLS handshakes completed per second since the process started.

total-tls-handshakes (Total TLS Handshakes)
The total number of TLS handshakes attempted since the process started. This includes all handshakes completely successfully or failed.

current-tls-handshakes (Current TLS Handshakes)
The current number of TLS handshakes started but not yet completed.

failed-tls-handshakes (Failed TLS Handshakes)
The number of TLS handshakes failed since the process started.

tls10-connections-open (TLS 1.0 Connections Open)
The current number of active TLS 1.0 connections opened by SslStream objects.

tls11-connections-open (TLS 1.1 Connections Open)
The current number of active TLS 1.1 connections opened by SslStream objects.

tls12-connections-open (TLS 1.2 Connections Open)
The current number of active TLS 1.2 connections opened by SslStream objects.

tls13-connections-open (TLS 1.3 Connections Open)
The current number of active TLS 1.3 connections opened by SslStream objects.

tls10-handshake-duration (TLS 1.0 Handshake Duration)
The duration for completion (min/max/mean) for all TLS 1.0 handshakes since the process started.

tls11-handshake-duration (TLS 1.1 Handshake Duration)
The duration for completion (min/max/mean) for all TLS 1.1 handshakes since the process started.

tls12-handshake-duration (TLS 1.2 Handshake Duration)
The duration for completion (min/max/mean) for all TLS 1.2 handshakes since the process started.

tls13-handshake-duration (TLS 1.3 Handshake Duration)
The duration for completion (min/max/mean) for all TLS 1.3 handshakes since the process started.

System.Net.NameResolution

dns-lookups-requested (DNS Lookups Requested)
The cumulative number of DNS lookups requested since the process started.

dns-lookups-duration (DNS Lookups Duration)
The duration for completion (min/max/mean) for all DNS name lookup queries since the process started.

System.Net.Sockets

bytes-received (Bytes Received)
The cumulative total number of bytes received by all Socket objects since the process started.

bytes-sent (Bytes Sent)
The cumulative number of bytes sent by all Socket objects  since the process started.

connections-established (Connections Established)
The cumulative total number of Socket objects for stream sockets that were ever connected since the process started.

datagrams-received (Datagrams Received)
The cumulative total number of datagram packets received by all Socket objects since the process started.

datagrams-sent (Datagrams Sent)
The cumulative total number of datagram packets sent by all Socket objects since the process started.

TODO

Detailed timings for HTTP request

@davidsh davidsh added this to the 5.0 milestone Jun 4, 2020
@davidsh davidsh self-assigned this Jun 4, 2020
@ghost
Copy link

ghost commented Jun 4, 2020

Tagging subscribers to this area: @dotnet/ncl
Notify danmosemsft if you want to be subscribed.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Jun 4, 2020
@davidsh davidsh added enhancement Product code improvement that does NOT require public API changes/additions and removed untriaged New issue has not been triaged by the area owner labels Jun 4, 2020
@stephentoub
Copy link
Member

it was suggested that we start new namespaces for telemetry focused events/counters.

Are you proposing leaving the existing logging as-is, or are you proposing migrating the existing functionality to the new namespaces/types and be more rigorously designed?

It'd be nice if we had just one way of doing things in the codebase, as well as if we're going to be rigorous about treating the logging as documented behavior that we're able to do that consistently across old and the new. And as we clean up the existing logging, I think it'd be worthwhile examining if all of it is really necessary, e.g. we've got a bunch of entered/exit events at the method level of a bunch of methods, but it appears to be very inconsistent, and I don't know that any of it is actually valuable.

requests-queue-duration

Is this only applicable when MaxConnectionsPerServer is set, or for HTTP/2 when we hit the server's stream limit? Otherwise I'm not sure what this means.

Aborted means that an Exception occurred during the SocketsHttpHandler.Send(Async) call.

Presumably that includes due to explicit cancellation, timeouts, etc., right?

The maximum number (high-water mark) of TCP connections across any HTTP 1.1 connection pool in SocketsHttpHandler.

What about current in addition to max?

The current number of TLS handshakes started but not yet completed.

I'm curious what one does with this information...?

failed-tls-handshakes (Failed TLS Handshakes) The number of TLS handshakes completed since the process started.

Is this meant to be just those that failed or all that completed? Copy/paste error?

bytes-received (Bytes Received)
bytes-sent (Bytes Sent)

Any idea what kind of overhead this might add? We'll need to be sure that it's basically zero when the counters are disabled. Also, what's the design for how someone will enable these and the fine-grainedness at which they're able to, e.g. if turning these on does slow down a high-throughput server measurably, we'll want to be careful about how easy that is to do accidentally.

System.Net.Http

Anything about request latencies?

@davidsh
Copy link
Contributor Author

davidsh commented Jun 4, 2020

Are you proposing leaving the existing logging as-is, or are you proposing migrating the existing functionality to the new namespaces/types and be more rigorously designed?

I had email discussions with @noahfalk and his team about this question. He generally recommends against deleting events, changing namespace names because it can break applications that dependson it.

I really wanted to create a new namespace that used .NET namespace names (i.e. System.Net.Http) instead of Windows ETW style names (Microsoft-System-Net-Http) as this matches what is done for new EventCounter's today (including ASP.NET) and is more discoverable. And I thought a new namespace focused on measurable telemetry would keep things streamlined for performance reasons since fewer call-sites would be activated for emitting events when you want them for measuring system health/performance as compared to deeper diagnostics/debugging with the existing events we have.

And as we clean up the existing logging, I think it'd be worthwhile examining if all of it is really necessary, e.g. we've got a bunch of entered/exit events at the method level of a bunch of methods, but it appears to be very inconsistent, and I don't know that any of it is actually valuable.

It is true that some of the previously ported "Microsoft-System-Net-" events we have are a mess and are inconsistent. I think we can address that. But I would prefer to do that with a separate discussion/issue and not bottleneck on getting this telemetry focused events/counters into dotnet/runtime.

@davidsh
Copy link
Contributor Author

davidsh commented Jun 4, 2020

bytes-received (Bytes Received)
bytes-sent (Bytes Sent)
Any idea what kind of overhead this might add? We'll need to be sure that it's basically zero when the counters are disabled. Also, what's the design for how someone will enable these and the fine-grainedness at which they're able to, e.g. if turning these on does slow down a high-throughput server measurably, we'll want to be careful about how easy that is to do accidentally.

We had similar "counters" (Performance Counters) in .NET Framework and we didn't see perf issues reported by customers that turned on those counters. But we can toggle counters like these with further keyword filtering. @anurse guidance on adding ASP.NET counters talks about these topics and recommends a pattern for adding them using keywords that need to be explicitly opt-in. I added a link to that guidance in the top-post of this issue.

@halter73
Copy link
Member

halter73 commented Jun 4, 2020

In ASP.NET, we have total-requests (i.e. requests-started), current-requests (i.e. requests-completed - requests-started), and failed-requests (i.e. requests-aborted).

I prefer the way ASP.NET tracks current-requests instead of requests-completed even though both convey the same information. I also think failed-requests is better than requests-aborted since the latter implies that it's only counting requests that were aborted by the client when it's counting other cases like when Send(Async) throws due to a 500 (I assume).

I do think "request-started" is better than "total-requests" because there's less ambiguity, but I don't think we can change around the names of those ASP.NET counters at this point. I'm not sure how much we value clarity vs consistency.

I find "per" in "http11-connections-per-pool-max" and "http20-streams-per-connection-max" a bit confusing. In my mind, "per" implies division. Is this the high-water mark of the average connections/pool? Or is this the high-water mark for any single pool? If it's the latter, I'd recommend "http11-connections-single-pool-max" and "http20-streams-single-connection-max".

What does everyone think about process-wide counters for current HTTP/1.1 connections, HTTP/2 connections, HTTP/2 streams and System.Net.Sockets connections?

I agree counters for min/max/mean request duration are interesting. That's something we haven't added to ASP.NET yet but are considering.

@davidsh
Copy link
Contributor Author

davidsh commented Jun 8, 2020

I also think failed-requests is better than requests-aborted since the latter implies that it's only counting requests that were aborted by the client when it's counting other cases like when Send(Async) throws due to a 500 (I assume).

I think failed-requests is the right thing for a server like ASP.NET. Tracking 500 status codes, for example, being returned from a server shows proper health metrics.

But for the client-side like HttpClient, typically, aborted requests, i.e. those that throw exceptions are more useful than "failed" requests. HttpClient doesn't consider returning a valid status code (even 500) as an "failure/error". And the .NET Framework Performance Counters for networking have had HttpWebRequest track "aborted" as well (as a rate counter).

I find "per" in "http11-connections-per-pool-max" and "http20-streams-per-connection-max" a bit confusing. In my mind, "per" implies division. Is this the high-water mark of the average connections/pool? Or is this the high-water mark for any single pool? If it's the latter, I'd recommend "http11-connections-single-pool-max" and "http20-streams-single-connection-max".

Thx for the feedback on the naming. Yes, I was attempting to catch "high-water mark" of any pool. So, your suggestion for names like http11-connections-single-pool-max is good. I'll update the top post.

@davidsh
Copy link
Contributor Author

davidsh commented Jun 8, 2020

@stephentoub Regarding your comments from #37428 (comment)

requests-queue-duration
Is this only applicable when MaxConnectionsPerServer is set, or for HTTP/2 when we hit the server's stream limit? Otherwise I'm not sure what this means.

My original intent was that it mirrored .NET Framework "HttpWebRequest Average Queue Time" which was basically how long the request sat around waiting to be dispatched to a thread and a valid connection and begin sending out request header bytes. We get asked all the time about why does my request time out? What is it doing? And it would be good to be able to point to the reason as "it's just sitting around doing nothing".

So, in terms of whether it is going to be an HTTP/1.1 or HTTP/2.0 request, it doesn't matter.

@davidfowl did add a request-queue-length counter to ASP.NET in #21649 but that seems to be counting connections in a pool as described in the PR "The number of HTTP/2 and HTTP/3 connections queued in the thread pool."

Aborted means that an Exception occurred during the SocketsHttpHandler.Send(Async) call.
Presumably that includes due to explicit cancellation, timeouts, etc., right?

Yes.

failed-tls-handshakes (Failed TLS Handshakes) The number of TLS handshakes completed since the process started.
Is this meant to be just those that failed or all that completed? Copy/paste error?

Oops. I meant 'failed'. Thx. I will update top post.

The maximum number (high-water mark) of TCP connections across any HTTP 1.1 connection pool in SocketsHttpHandler.
What about current in addition to max?

I'm not sure how we would do that in a counter as it changes all the time and there are many different pools. But we can consider this later.

System.Net.Http
Anything about request latencies?

That is part of the #827 work that is tracking that.

The current number of TLS handshakes started but not yet completed.
I'm curious what one does with this information...?

Actually, this might end up being a rate counter, i.e. "Number of TLS handshakes started per second", etc. I will follow up in actual future PR.

@stephentoub
Copy link
Member

My original intent was that it mirrored .NET Framework "HttpWebRequest Average Queue Time" which was basically how long the request sat around waiting to be dispatched to a thread and a valid connection and begin sending out request header bytes. We get asked all the time about why does my request time out? What is it doing? And it would be good to be able to point to the reason as "it's just sitting around doing nothing".

That's all goodness. But there's a key difference here. In .NET Framework, there was a default max requests per server of 2 or 10 or something along those lines. In .NET Core, it defaults to unlimited, so someone would have to explicitly set MaxConnectionsPerServer for this counter to be meaningful, right?

I'm not sure how we would do that in a counter as it changes all the time and there are many different pools.

Just as we do for all of the rest of the PollingCounters we've defined, e.g.

_cpuTimeCounter ??= new PollingCounter("cpu-usage", this, () => RuntimeEventSourceHelper.GetCpuUsage()) { DisplayName = "CPU Usage", DisplayUnits = "%" };
_workingSetCounter ??= new PollingCounter("working-set", this, () => (double)(Environment.WorkingSet / 1000000)) { DisplayName = "Working Set", DisplayUnits = "MB" };
_gcHeapSizeCounter ??= new PollingCounter("gc-heap-size", this, () => (double)(GC.GetTotalMemory(false) / 1000000)) { DisplayName = "GC Heap Size", DisplayUnits = "MB" };
_gen0GCCounter ??= new IncrementingPollingCounter("gen-0-gc-count", this, () => GC.CollectionCount(0)) { DisplayName = "Gen 0 GC Count", DisplayRateTimeScale = new TimeSpan(0, 1, 0) };
_gen1GCCounter ??= new IncrementingPollingCounter("gen-1-gc-count", this, () => GC.CollectionCount(1)) { DisplayName = "Gen 1 GC Count", DisplayRateTimeScale = new TimeSpan(0, 1, 0) };
_gen2GCCounter ??= new IncrementingPollingCounter("gen-2-gc-count", this, () => GC.CollectionCount(2)) { DisplayName = "Gen 2 GC Count", DisplayRateTimeScale = new TimeSpan(0, 1, 0) };
_threadPoolThreadCounter ??= new PollingCounter("threadpool-thread-count", this, () => ThreadPool.ThreadCount) { DisplayName = "ThreadPool Thread Count" };
_monitorContentionCounter ??= new IncrementingPollingCounter("monitor-lock-contention-count", this, () => Monitor.LockContentionCount) { DisplayName = "Monitor Lock Contention Count", DisplayRateTimeScale = new TimeSpan(0, 0, 1) };
_threadPoolQueueCounter ??= new PollingCounter("threadpool-queue-length", this, () => ThreadPool.PendingWorkItemCount) { DisplayName = "ThreadPool Queue Length" };
_completedItemsCounter ??= new IncrementingPollingCounter("threadpool-completed-items-count", this, () => ThreadPool.CompletedWorkItemCount) { DisplayName = "ThreadPool Completed Work Item Count", DisplayRateTimeScale = new TimeSpan(0, 0, 1) };
_allocRateCounter ??= new IncrementingPollingCounter("alloc-rate", this, () => GC.GetTotalAllocatedBytes()) { DisplayName = "Allocation Rate", DisplayUnits = "B", DisplayRateTimeScale = new TimeSpan(0, 0, 1) };
_timerCounter ??= new PollingCounter("active-timer-count", this, () => Timer.ActiveCount) { DisplayName = "Number of Active Timers" };
#if !MONO
_exceptionCounter ??= new IncrementingPollingCounter("exception-count", this, () => Exception.GetExceptionCount()) { DisplayName = "Exception Count", DisplayRateTimeScale = new TimeSpan(0, 0, 1) };
_gcTimeCounter ??= new PollingCounter("time-in-gc", this, () => GC.GetLastGCPercentTimeInGC()) { DisplayName = "% Time in GC since last GC", DisplayUnits = "%" };
_gen0SizeCounter ??= new PollingCounter("gen-0-size", this, () => GC.GetGenerationSize(0)) { DisplayName = "Gen 0 Size", DisplayUnits = "B" };
_gen1SizeCounter ??= new PollingCounter("gen-1-size", this, () => GC.GetGenerationSize(1)) { DisplayName = "Gen 1 Size", DisplayUnits = "B" };
_gen2SizeCounter ??= new PollingCounter("gen-2-size", this, () => GC.GetGenerationSize(2)) { DisplayName = "Gen 2 Size", DisplayUnits = "B" };
_lohSizeCounter ??= new PollingCounter("loh-size", this, () => GC.GetGenerationSize(3)) { DisplayName = "LOH Size", DisplayUnits = "B" };
_pohSizeCounter ??= new PollingCounter("poh-size", this, () => GC.GetGenerationSize(4)) { DisplayName = "POH (Pinned Object Heap) Size", DisplayUnits = "B" };
_assemblyCounter ??= new PollingCounter("assembly-count", this, () => System.Reflection.Assembly.GetAssemblyCount()) { DisplayName = "Number of Assemblies Loaded" };
_ilBytesJittedCounter ??= new PollingCounter("il-bytes-jitted", this, () => System.Runtime.CompilerServices.RuntimeHelpers.GetILBytesJitted()) { DisplayName = "IL Bytes Jitted", DisplayUnits = "B" };
_methodsJittedCounter ??= new PollingCounter("methods-jitted-count", this, () => System.Runtime.CompilerServices.RuntimeHelpers.GetMethodsJittedCount()) { DisplayName = "Number of Methods Jitted" };
#endif

Timer.ActiveCounter fluctuates all the time, ThreadPool.ThreadCount fluctuates all the time, etc.

@davidsh
Copy link
Contributor Author

davidsh commented Jun 8, 2020

In .NET Core, it defaults to unlimited, so someone would have to explicitly set MaxConnectionsPerServer for this counter to be meaningful, right?

That assumes that the request will being right away (i. e. create a new connection if needed and immediately send out request header bytes). But in a loaded system we've seen things sit around waiting for threadpool threads just to begin doing any work. And I think it might be useful to measure that.

Perhaps we won't end up with an actual counter though. Implementing #827 will result in having events (not a counter, per se) with timestamps for each step of the request lifecycle. Perhaps aggregating those events can give us the same information about unexpected latencies.

@stephentoub
Copy link
Member

And I think it might be useful to measure that.

How are you planning to track that / differentiate it from other causes of Socket.ConnectAsync not completing quickly?

davidsh added a commit to davidsh/runtime that referenced this issue Jun 8, 2020
This is the first of several PRs to add telemetry focused events and counters.

A few things to note:

* Added a new dedicated namespace for telemetry. Current feedback from partners is that our
existing events are "confusing and a mess" due to their focus on low-level diagnostics and verbose
debugging. See discussion in dotnet#37428 regarding reasons for a new namespace.

* Please comment on the naming patterns for the events, counters, and
display names, etc. I looked at ASP.NET events/counters and @anurse older guidance in
https://gist.github.com/anurse/af1859663ac91c6cf69c820cebe92303. But there are inconsistencies
between all of that. So, I'd like to get agreement on naming patterns for events/counters
(i.e. noun-verb vs. verb-noun, present tense vs. past tense, etc.).

* No automated tests were added at this time. The current discussions with others (reverse proxy
team) is that automated tests are brittle in CI. But manual tests will be run.

Contributes to dotnet#37428
@noahfalk
Copy link
Member

noahfalk commented Jun 9, 2020

Thanks for the very nice writeup @davidsh ! I've been handling some other obligations last week and still not quite free of it, but I am hoping within a few days to have more time to dig into this.

@davidfowl
Copy link
Member

I may need to rename the kestrel event source. Everything is inconsistent now 😁. Can we agree dots > slashes?

@davidsh
Copy link
Contributor Author

davidsh commented Jun 9, 2020

Can we agree dots > slashes?

I think you mean "dashes" instead of "slashes"?

@noahfalk recommended against renaming existing EventSource/EventCounter namespaces. But perhaps that is general guidance.

@stephentoub
Copy link
Member

It is true that some of the previously ported "Microsoft-System-Net-" events we have are a mess and are inconsistent. I think we can address that. But I would prefer to do that with a separate discussion/issue and not bottleneck on getting this telemetry focused events/counters into dotnet/runtime.

Do we have an issue tracking cleaning that up? It really is a mess.

@davidsh
Copy link
Contributor Author

davidsh commented Jun 9, 2020

Do we have an issue tracking cleaning that up? It really is a mess.

#37655

davidsh added a commit that referenced this issue Jun 24, 2020
* Add telemetry to System.Net.Http

This is the first of several PRs to add telemetry focused events and counters.

A few things to note:

* Added a new dedicated namespace for telemetry. Current feedback from partners is that our
existing events are "confusing and a mess" due to their focus on low-level diagnostics and verbose
debugging. See discussion in #37428 regarding reasons for a new namespace.

* Please comment on the naming patterns for the events, counters, and
display names, etc. I looked at ASP.NET events/counters and @anurse older guidance in
https://gist.github.com/anurse/af1859663ac91c6cf69c820cebe92303. But there are inconsistencies
between all of that. So, I'd like to get agreement on naming patterns for events/counters
(i.e. noun-verb vs. verb-noun, present tense vs. past tense, etc.).

* No automated tests were added at this time. The current discussions with others (reverse proxy
team) is that automated tests are brittle in CI. But manual tests will be run.

Contributes to #37428

* Address PR feedback

* PR feedback
* Added RequestAbort as separate event method with dimensions (host, port)

* Address PR feedback 2

* RequestStart now includes host, port, scheme, path and version
* RequestStart and RequestStop now bracket the lifetime of the request including reading the
response body stream. If the response body is not complete read then disposing the stream
will trigger RequestStop as well.

* Address feedback 3

* Use exception filter to improve performance

* Avoid negative 'current-requests' counter value

* Address feedback 4
@davidsh davidsh assigned alnikola and unassigned davidsh Jun 26, 2020
@alnikola alnikola removed their assignment Jun 29, 2020
alnikola added a commit that referenced this issue Jul 14, 2020
Socket's connect start, stop and cancelled events are logged as activities via EventSource.

Contributes to #37428
alnikola added a commit that referenced this issue Jul 28, 2020
Counters in SocketsTelemetry measuring

`bytes-received` (Bytes Received)
The cumulative total number of bytes received by all Socket objects since the process started.

`bytes-sent` (Bytes Sent)
The cumulative number of bytes sent by all Socket objects  since the process started.

`outgoing-connections-established` (Outgoing Connections Established)
The cumulative total number of outgoing connections established by Socket objects for stream sockets since the process started.

`incoming-connections-established` (Incoming Connections Established)
The cumulative total number of incoming connections established by Socket objects for stream sockets since the process started.

`datagrams-received` (Datagrams Received)
The cumulative total number of datagram packets received by all Socket objects since the process started.

`datagrams-sent` (Datagrams Sent)
The cumulative total number of datagram packets sent by all Socket objects since the process started.

Contributes to #37428
@alnikola alnikola removed their assignment Jul 29, 2020
Jacksondr5 pushed a commit to Jacksondr5/runtime that referenced this issue Aug 10, 2020
Counters in SocketsTelemetry measuring

`bytes-received` (Bytes Received)
The cumulative total number of bytes received by all Socket objects since the process started.

`bytes-sent` (Bytes Sent)
The cumulative number of bytes sent by all Socket objects  since the process started.

`outgoing-connections-established` (Outgoing Connections Established)
The cumulative total number of outgoing connections established by Socket objects for stream sockets since the process started.

`incoming-connections-established` (Incoming Connections Established)
The cumulative total number of incoming connections established by Socket objects for stream sockets since the process started.

`datagrams-received` (Datagrams Received)
The cumulative total number of datagram packets received by all Socket objects since the process started.

`datagrams-sent` (Datagrams Sent)
The cumulative total number of datagram packets sent by all Socket objects since the process started.

Contributes to dotnet#37428
@MihaZupan
Copy link
Member

MihaZupan commented Aug 17, 2020

Current list of implemented events and counters in System.Net.*:
Unless otherwise specified:

  • Event level is Informational
  • Counter type is PollingCounter

System.Net.NameResolution

Events:

  • ResolutionStart(string hostNameOrAddress)
  • ResolutionStop()
  • ResolutionFailed()

Counters:

  • dns-lookups-requested - DNS Lookups Requested
  • dns-lookups-duration - Average DNS Lookup Duration - EventCounter

System.Net.Sockets

Events:

  • ConnectStart(string address)
  • ConnectStop()
  • ConnectFailed(SocketError error, string exceptionMessage) - Error
  • AcceptStart(string address)
  • AcceptStop()
  • AcceptFailed(SocketError error, string exceptionMessage) - Error

Counters:

  • outgoing-connections-established - Outgoing Connections Established
  • incoming-connections-established - Incoming Connections Established
  • bytes-received - Bytes Received
  • bytes-sent - Bytes Sent
  • datagrams-received - Datagrams Received
  • datagrams-sent - Datagrams Sent

System.Net.Security

Events:

  • HandshakeStart(bool isServer, string targetHost)
  • HandshakeStop(string protocol)
  • HandshakeFailed(bool isServer, double elapsedMilliseconds, string exceptionMessage) - Error

Counters:

  • tls-handshake-rate - TLS handshakes completed - IncrementingPollingCounter
  • total-tls-handshakes - Total TLS handshakes completed
  • current-tls-handshakes - Current TLS handshakes
  • failed-tls-handshakes - Total TLS handshakes failed
  • all-tls-sessions-open - All TLS Sessions Active
  • tls10-sessions-open - TLS 1.0 Sessions Active
  • tls11-sessions-open - TLS 1.1 Sessions Active
  • tls12-sessions-open - TLS 1.2 Sessions Active
  • tls13-sessions-open - TLS 1.3 Sessions Active
  • all-tls-handshake-duration - TLS Handshake Duration - EventCounter
  • tls10-handshake-duration - TLS 1.0 Handshake Duration - EventCounter
  • tls11-handshake-duration - TLS 1.1 Handshake Duration - EventCounter
  • tls12-handshake-duration - TLS 1.2 Handshake Duration - EventCounter
  • tls13-handshake-duration - TLS 1.3 Handshake Duration - EventCounter

System.Net.Http

Events:

  • RequestStart(string scheme, string host, int port, string pathAndQuery, byte versionMajor, byte versionMinor, HttpVersionPolicy versionPolicy)
  • RequestStop()
  • RequestFailed() - Error
  • ConnectionEstablished(byte versionMajor, byte versionMinor)
  • ConnectionClosed(byte versionMajor, byte versionMinor)
  • RequestLeftQueue(double timeOnQueueMilliseconds, byte versionMajor, byte versionMinor)
  • RequestHeadersStart()
  • RequestHeadersStop()
  • RequestContentStart()
  • RequestContentStop(long contentLength)
  • ResponseHeadersStart()
  • ResponseHeadersStop()
  • ResponseContentStart()
  • ResponseContentStart()

Counters:

  • requests-started - Requests Started
  • requests-started-rate - Requests Started Rate - IncrementingPollingCounter
  • requests-failed - Requests Aborted
  • requests-failed-rate - Requests Aborted Rate - IncrementingPollingCounter
  • current-requests - Current Requests
  • http11-connections-current-total - Current Http 1.1 Connections
  • http20-connections-current-total - Current Http 2.0 Connections
  • http11-requests-queue-duration - HTTP 1.1 Requests Queue Duration - EventCounter
  • http20-requests-queue-duration - HTTP 2.0 Requests Queue Duration - EventCounter

@karelz
Copy link
Member

karelz commented Aug 17, 2020

@MihaZupan is there anything left, or should we close this issue as everything we planned is now implemented?

@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net enhancement Product code improvement that does NOT require public API changes/additions
Projects
None yet
Development

No branches or pull requests

9 participants