Performance: ASP.NET Core Tuning #160

Closed · NickCraver opened this issue Mar 19, 2017 · 9 comments

Comments

@NickCraver (Member)

One of the goals of MiniProfiler has always been to add as little overhead as possible while capturing timings. Luckily, the ASP.NET team has a Benchmarks repo (aspnet/benchmarks). I've created a minimal fork that adds MiniProfiler (I need to discuss with that team whether this is something that's even welcome, and if so how we'd want to set it up) - that fork is here: NickCraver/benchmarks.

Here's a benchmark comparison of aspnet/benchmarks with and without the MiniProfiler middleware enabled, to get a general idea of the overhead and establish a baseline.
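For context, "with MiniProfiler" below means wiring the middleware into the benchmark app's pipeline - a minimal sketch assuming the MiniProfiler.AspNetCore AddMiniProfiler()/UseMiniProfiler() registration (the fork's exact setup may differ):

```csharp
// Minimal sketch (assumed wiring, not the fork's exact code): register and
// enable the MiniProfiler middleware around a plaintext-style endpoint.
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Registers MiniProfiler services (default in-memory storage assumed here).
        services.AddMiniProfiler();
    }

    public void Configure(IApplicationBuilder app)
    {
        // Starts a profiler per request and stops it when the response completes.
        app.UseMiniProfiler();

        // The /plaintext-style endpoint being measured.
        app.Run(async context =>
        {
            context.Response.ContentType = "text/plain";
            await context.Response.WriteAsync("Hello, World!");
        });
    }
}
```

Each run below is wrk with 32 threads and 256 connections for 10 seconds, using the pipeline.lua script with a pipeline depth of 16 (the trailing `-- 16`).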

Without MiniProfiler (151,417.5 RPS average over 4 runs)

-bash-4.2$ ./wrk -c 256 -t 32 -d 10 -s ./scripts/pipeline.lua http://10.8.2.111:5000/plaintext -- 16
Running 10s test @ http://10.8.2.111:5000/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    20.57ms   51.24ms 524.91ms   90.94%
    Req/Sec     4.83k   689.94    10.74k    82.19%
  1546313 requests in 10.10s, 1.70GB read
Requests/sec: 153109.54
Transfer/sec:    172.16MB
-bash-4.2$
-bash-4.2$ ./wrk -c 256 -t 32 -d 10 -s ./scripts/pipeline.lua http://10.8.2.111:5000/plaintext -- 16
Running 10s test @ http://10.8.2.111:5000/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    47.06ms  123.84ms   1.79s    92.94%
    Req/Sec     4.83k     1.05k   12.77k    79.97%
  1523088 requests in 10.10s, 1.67GB read
  Socket errors: connect 0, read 0, write 0, timeout 5
Requests/sec: 150808.75
Transfer/sec:    169.57MB
-bash-4.2$ ./wrk -c 256 -t 32 -d 10 -s ./scripts/pipeline.lua http://10.8.2.111:5000/plaintext -- 16
Running 10s test @ http://10.8.2.111:5000/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    23.20ms   49.25ms 409.99ms   89.02%
    Req/Sec     4.68k     0.86k   20.28k    86.68%
  1498335 requests in 10.10s, 1.65GB read
Requests/sec: 148345.87
Transfer/sec:    166.81MB
-bash-4.2$ ./wrk -c 256 -t 32 -d 10 -s ./scripts/pipeline.lua http://10.8.2.111:5000/plaintext -- 16
Running 10s test @ http://10.8.2.111:5000/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.80ms   49.43ms 414.15ms   89.17%
    Req/Sec     4.85k     0.87k   13.07k    84.68%
  1549396 requests in 10.10s, 1.70GB read
Requests/sec: 153405.85
Transfer/sec:    172.50MB

With MiniProfiler enabled (115,415.35 RPS average over 4 runs)

-bash-4.2$ ./wrk -c 256 -t 32 -d 10 -s ./scripts/pipeline.lua http://10.8.2.111:5000/plaintext -- 16
Running 10s test @ http://10.8.2.111:5000/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    49.97ms   87.79ms 435.86ms   83.68%
    Req/Sec     3.55k     1.22k    9.67k    77.44%
  1064787 requests in 10.10s, 1.32GB read
Requests/sec: 105422.74
Transfer/sec:    133.33MB
-bash-4.2$ ./wrk -c 256 -t 32 -d 10 -s ./scripts/pipeline.lua http://10.8.2.111:5000/plaintext -- 16
Running 10s test @ http://10.8.2.111:5000/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    54.88ms   92.87ms 493.74ms   82.86%
    Req/Sec     3.83k     1.17k   12.74k    81.83%
  1199836 requests in 10.10s, 1.48GB read
Requests/sec: 118794.48
Transfer/sec:    150.25MB
-bash-4.2$ ./wrk -c 256 -t 32 -d 10 -s ./scripts/pipeline.lua http://10.8.2.111:5000/plaintext -- 16
Running 10s test @ http://10.8.2.111:5000/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    69.93ms  108.49ms 926.36ms   81.69%
    Req/Sec     3.83k     1.32k   13.59k    77.07%
  1192225 requests in 10.10s, 1.47GB read
Requests/sec: 118075.61
Transfer/sec:    149.33MB
-bash-4.2$ ./wrk -c 256 -t 32 -d 10 -s ./scripts/pipeline.lua http://10.8.2.111:5000/plaintext -- 16
Running 10s test @ http://10.8.2.111:5000/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    52.13ms   94.91ms 908.83ms   83.97%
    Req/Sec     3.89k     1.29k   17.44k    83.52%
  1205621 requests in 10.10s, 1.49GB read
Requests/sec: 119368.59
Transfer/sec:    150.97MB

So it takes us from about 151,417.5 RPS down to 115,415.35 RPS, or a 23.8% overhead/cost on the /plaintext test. Let's see if we can get that down a bit. This issue is for tracking and suggestions.
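(For reference: each headline RPS figure is the mean of the four Requests/sec values above, and the overhead works out to (151,417.5 − 115,415.35) / 151,417.5 ≈ 23.8%.)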

@NickCraver (Member, Author)

@DamianEdwards (Contributor)

That's a fairly low baseline. What environment was the test performed in?

@NickCraver (Member, Author)

@DamianEdwards the runner is on our VM tier in CO (currently getting a revamp to Hyper-V), so we're in flux at the moment, but results are consistent. The Kestrel server shouldn't be a weak machine; that tier currently has dual E5-2690v3 processors @ 3.07GHz and 64GB of RAM, with 20Gb of bandwidth and ~0.17ms of latency between the two machines.

The bottleneck appears to be a pegged CPU on the 2012 R2 physical machine Kestrel is running on. I need to see what's capping throughput so hard there, but the comparison should still be a reasonably valid apples-to-apples measure of rough impact until I can spare time to dig into it. Any suggestions or common culprits you're seeing that might save some digging time?

@DamianEdwards (Contributor)

Ethernet specs?

@NickCraver (Member, Author) commented Mar 19, 2017

@DamianEdwards The VM is on an aggregate trunk: F630s with internal 2x 10Gb (and idle), into dual IOAs each with 4x 10Gb uplinks, so each FX2s chassis has 80Gbps of aggregate, of which the VM can access 20Gb. The web server has an X540 NDC, so 2x 10Gb to the network. All systems are active/active LACP.

I can go into switch details if you want, but it seems we're limited by CPU here (99-100% pegged for the duration). I've just maxed out the Tx/Rx queue sizes, which somehow regressed in a recent change (I'll have our guys track down that cause), and re-ran, with no impact.

@DamianEdwards (Contributor)

We've seen many cases where seemingly CPU-saturated loads are actually bottlenecked somewhere else and able to yield many times the RPS. With machines of that spec you should be seeing vastly higher baseline numbers for the plaintext test.

@NickCraver (Member, Author)

Agreed - and in earlier previews I was getting far higher numbers; this was on the exact same hardware back in November: aspnet/KestrelHttpServer#386 (comment)

@NickCraver (Member, Author)

There's a severe network issue we're tracking down (and by we, I mean @GABeech by himself while I'm out this week)... we'll re-test once we isolate what's happening with the central network and see how much of a factor it is.

@NickCraver (Member, Author)

Lots of perf tuning went into the last pass; this is as good as it gets for now. I'm going to revisit after the view wrapping is removed (if we can do that, with the diagnostics events having concrete types... we hope).

Each MiniProfiler (if created) is about 400 bytes with just a root timing and all associated information - not too bad. The ConcurrentDictionary<> properties have been replaced with a lazy-init locker on the IDbProfiler side, as they were by far the biggest allocators for tracking start/return/end on the ADO.NET side.
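For illustration, the lazy-init locker pattern looks roughly like this - a sketch with assumed names (SqlTimingTracker, TrackStart, TrackEnd), not MiniProfiler's actual IDbProfiler internals:

```csharp
// Sketch of the lazy-init locker pattern described above (assumed shape, not
// MiniProfiler's exact code): the per-command tracking dictionary is only
// allocated when an ADO.NET command is actually profiled, instead of eagerly
// allocating ConcurrentDictionary<> instances for every MiniProfiler.
using System.Collections.Generic;
using System.Data.Common;

public class SqlTimingTracker
{
    private readonly object _syncRoot = new object();
    private Dictionary<DbCommand, string> _inFlight; // stays null until the first DB call

    public void TrackStart(DbCommand command, string timingId)
    {
        lock (_syncRoot)
        {
            // Allocate on demand - requests that never touch the database pay nothing.
            if (_inFlight == null)
            {
                _inFlight = new Dictionary<DbCommand, string>();
            }
            _inFlight[command] = timingId;
        }
    }

    public string TrackEnd(DbCommand command)
    {
        lock (_syncRoot)
        {
            string timingId;
            if (_inFlight != null && _inFlight.TryGetValue(command, out timingId))
            {
                _inFlight.Remove(command);
                return timingId;
            }
            return null;
        }
    }
}
```

The trade-off is a coarse lock instead of lock-free access, which is fine here since contention on a single request's profiler is low and the common case (no DB calls) allocates nothing.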
