
Initial commit benchmark2 (.not core) #2



Open: wants to merge 7 commits into base: master


Anderman
Contributor

A simple benchmark. It allows faster testing and makes it easy to generate an Excel sheet.
Randomized/cached testing is still manual.

@nietras
Contributor

nietras commented Aug 28, 2016

@Anderman wow, there are some cool tricks in here. I will probably wait before merging until I have some time to do some refactoring and understand your changes. Could you post some new results for the latest changes you have made? Perhaps in the original CLR thread?

Will you be doing more work on this?

@Anderman
Contributor Author

@nietras.

Will you be doing more work on this?
Not this week. I'm thinking of a test with two threads on different cores, one writing and one reading.
I think there may be some problems in Kestrel if the wrong cache strategy is chosen.

@nietras
Contributor

nietras commented Aug 29, 2016

@Anderman I gave this a quick look. The tricks for calling assembly code inline from C# are very impressive. I do have one concern, and that is the use of RDTSC. From "Acquiring high-resolution time stamps", which I highly recommend reading:

We strongly discourage using the RDTSC or RDTSCP processor instruction to directly query the TSC because you won't get reliable results on some versions of Windows, across live migrations of virtual machines, and on hardware systems without invariant or tightly synchronized TSCs.

Setting processor affinity, for example, avoids some of the issues with RDTSC, but not all of them.

@Anderman
Contributor Author

@nietras That could be a problem, but I don't see why it would depend on the version of Windows. There is another way to get the same counters: http://stackoverflow.com/questions/26618991/measure-cpu-cycles-of-a-function-call/26619409#26619409

@nietras
Contributor

nietras commented Aug 29, 2016

This is the same as RDTSC (see https://blogs.msdn.microsoft.com/oldnewthing/20160429-00/?p=93385), with the same issues.


@Anderman
Contributor Author

OK, I know that. That's why some people say to use clock cycles instead of nanoseconds. I found a document from a professor who has done a lot of CPU performance testing for many years. That's also why warmup tests are needed.
This professor also says:
Find the minimum duration, because you want to measure your code and not the influence of the system.
That's also why each test must be relatively short.

I do 50 iterations to minimize the overhead, and each test runs for 100 ms. So there will be a lot of measurements, and some of them will not be influenced by the system.
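The "take the minimum" rule can be written out as a short derivation. This is a sketch under an assumption not stated in the thread: interference from the rest of the system only ever adds time, never removes it. The symbols are notation introduced here: $N$ iterations per measurement, $c$ cycles per call, $o$ fixed overhead (loop plus the two counter reads), and $\varepsilon_i$ the interference in measurement $i$.

$$
m_i = N\,c + o + \varepsilon_i, \qquad \varepsilon_i \ge 0
$$

$$
\min_i m_i = N\,c + o + \min_i \varepsilon_i \;\ge\; N\,c + o,
\qquad
\hat{c} = \frac{\min_i m_i - o}{N}
$$

Under this model $\hat{c}$ overestimates $c$ by at most $\min_i \varepsilon_i / N$. A mean would instead include the average interference. Keeping each measurement short makes it more likely that at least one $\varepsilon_i$ is close to zero.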

@nietras
Contributor

nietras commented Aug 30, 2016

I found a document from a professor who has done a lot of CPU performance testing for many years.

Are you referring to Agner Fog (http://www.agner.org/optimize/)?

Find the minimum duration, because you want to measure your code and not the influence of the system.
That's also why each test must be relatively short.

Yes, but his use cases are different. As far as I can tell, he is profiling short native code, not code running on a managed runtime. Additionally, he can call RDTSC with very little overhead, which is not possible in .NET, where the overhead (latency) is "high". This is also the reason why HPET isn't the default in Windows: its latency is pretty high compared to the "clock" frequency of the timer. Latencies are covered in the link I sent.

I do 50 iterations to minimize the overhead, and each test runs for 100 ms.

Does that mean that each measurement spans ~100 ms? Or, to ask differently: why did you want to switch from the Stopwatch to RDTSC? If you are measuring over 100 ms there is no reason to, so I am guessing that is not the case.

I am not saying that RDTSC can't be used for performance measurements. It just has some drawbacks, and measuring performance in .NET is different from measuring native code. .NET has a JIT, a GC, etc., which all influence how code runs.

You have probably already seen what BenchmarkDotNet does when measuring, but you can see it here https://perfdotnet.github.io/BenchmarkDotNet/HowItWorks.htm, or at least a high-level overview. I do agree, though, that they have taken it to extremes for simple code measurements, with a default period of 200 ms per measurement. This is why I am working on something a little quicker... too little time, though.

In any case, I think your approach with specific benchmarks for memory copying makes a lot of sense.

@Anderman
Contributor Author

Anderman commented Aug 30, 2016

Are you referring to Agner Fog (http://www.agner.org/optimize/)?
Yes. I tried to find the document that says something about testing, but I could not find it anymore. );

Yes, but his use cases are different. As far as I can tell, he is profiling short native code, not code running on a managed runtime. Additionally, he can call RDTSC with very little overhead, which is not possible in .NET, where the overhead (latency) is "high". This is also the reason why HPET isn't the default in Windows: its latency is pretty high compared to the "clock" frequency of the timer. Latencies are covered in the link I sent.

I don't see what the managed runtime has to do with our memory tests. QCalls and P/Invoke have some overhead, but it is only a few extra instructions.
I use the following trick to see which assembly code is called. I add a Debugger.Launch() statement between two calls:

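using System.Diagnostics;  // for Debugger
result = Foo();            // first call, so Foo is already JIT-compiled
Debugger.Launch();         // prompts to attach a debugger at this point
result = Foo();            // step from here into the generated code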

Run a Release build with F5. When the exception dialog is shown, open it with VS and step out four times (Shift+F11).
Now you are in the assembly code you want to debug. With this you can also step into native calls.

HPET: hmm, interesting. It looks like there are better ways to measure.

Does that mean that each measurement spans ~100 ms? Or, to ask differently: why did you want to switch from the Stopwatch to RDTSC? If you are measuring over 100 ms there is no reason to, so I am guessing that is not the case.

The Stopwatch tick counter is 100-200 times slower than the RDTSC counter. I didn't get repeatable results when I used the Stopwatch. With a slower counter the test has to run longer, but then other code running on my computer has more chance to influence the results.
Example:
anderman.memove(count=0) takes 8 clock cycles
Iterations: 50

Total clock cycles for the code = 8 * 50 = 400
Total overhead (loop + 2 calls to RDTSC) = 200 clock cycles

So the cycle counter will measure 600 clock cycles. With the Stopwatch I get only 2 or 3 ticks.

So in 100 ms the test will run about 100 ms / 600 cycles, which is about 100 ms / 300 ns = 333,333 times.
The test reports the fastest result of all 333,333 runs, because that run was not disturbed by other code running on my system.
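The example's arithmetic is written out below. This is a sketch. The 2 GHz clock rate is an assumption inferred from the "600 cycles is about 300 ns" step and is not stated in the thread. The symbols $N$ (iterations), $c$ (cycles per call) and $o$ (overhead cycles) are notation introduced here.

$$
\begin{aligned}
N\,c &= 50 \times 8 = 400 \ \text{cycles} \\
m &= N\,c + o = 400 + 200 = 600 \ \text{cycles} \\
t &= \frac{600}{2 \times 10^{9}\ \text{Hz}} = 300\ \text{ns} \\
\frac{100\ \text{ms}}{300\ \text{ns}} &= \frac{10^{-1}\ \text{s}}{3 \times 10^{-7}\ \text{s}} \approx 333{,}333 \ \text{runs} \\
\hat{c} &= \frac{m - o}{N} = \frac{600 - 200}{50} = 8 \ \text{cycles}
\end{aligned}
$$

The 200-cycle overhead makes up a third of each 600-cycle measurement. That is why the per-call figure depends on subtracting $o$ correctly.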

I hope this makes it clearer why I chose this solution.

@Anderman
Contributor Author

Anderman commented Aug 30, 2016

Maybe interesting

@nietras
Contributor

nietras commented Aug 31, 2016

Yes, one of the reports mentions an optimized copy function. It could be used for inspiration, although it uses aligned instructions etc. that are not available in .NET.

@nietras
Contributor

nietras commented Sep 10, 2016

@Anderman there is a great (and long) blog post about timers and e.g. all the issues with TSC here: http://aakinshin.net/en/blog/dotnet/stopwatch/

Including latency and resolution issues.
