testing: autodetect appropriate benchtime #10930
I have another idea. We can add two new interfaces to the testing package: one to enumerate the benchmarks in a test binary, the other to run a given benchmark a specified number of times. Then we can have an external tool driving the benchmarks, and I really want benchcmp or some other tool (e.g. Russ' benchstat) to do the analysis. The benefit is that we decouple the statistics engine from the testing package. If we want to do better, we can even make a benchmark server.
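To make the shape of that proposal concrete, here is a rough sketch of what the two hooks might look like. The package name, interface names, and signatures are all invented for illustration; nothing like this exists in the testing package today.

```go
// Package benchdriver is a hypothetical name used only for this sketch.
package benchdriver

import "time"

// Lister enumerates the benchmarks compiled into a test binary.
type Lister interface {
	Benchmarks() []string
}

// Runner runs a named benchmark for a fixed number of iterations and
// reports the elapsed wall-clock time, leaving all statistics to an
// external tool such as benchcmp or benchstat.
type Runner interface {
	RunN(name string, iterations int) (time.Duration, error)
}
```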
The prototype implementation I referred to was done by hacking in a stupid line-oriented benchmark server and writing a simple external driver program. So we're thinking along similar lines. The advantage of a server is that the overhead of executing the binary each time might prove significant when we're talking about 5ns-per-op benchmarks. I'm agnostic about whether the statistics engine should be internal or not. I just want it to be good, and I know that (1) that requires testing support and (2) that I personally lack the domain expertise to make it awesome. :)
Oh, and I have a simple driver script that invokes benchmarks in a loop and uses benchstat to print rolling results. I'd love to invest in making it sophisticated and generally useful (right now it is very tuned to my personal setup and habits), but again I lack the statistics expertise to design it correctly. I agree that accepting two test binaries is the right API.
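For reference, a minimal sketch of such a rolling driver, assuming a pre-built test binary ./pkg.test, a benchmark named BenchmarkFoo, and benchstat on the PATH; all three names are placeholders, and this is not the script described above.

```go
// A toy driver: re-run one benchmark repeatedly, append the output to a
// file, and print rolling benchstat results after every run.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	out, err := os.OpenFile("bench.txt", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer out.Close()

	for i := 0; i < 10; i++ {
		// Run only the benchmark, no tests, and append its output.
		run := exec.Command("./pkg.test", "-test.run=NONE", "-test.bench=BenchmarkFoo")
		run.Stdout = out
		run.Stderr = os.Stderr
		if err := run.Run(); err != nil {
			panic(err)
		}

		// Print rolling statistics over everything collected so far.
		stat := exec.Command("benchstat", "bench.txt")
		stat.Stdout = os.Stdout
		stat.Stderr = os.Stderr
		if err := stat.Run(); err != nil {
			fmt.Fprintln(os.Stderr, "benchstat:", err)
		}
	}
}
```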
I have a feeling that we can write a standalone package to achieve the benchmark server effect without any testing package support. All you have to do is blank import my package in one of the test files and set a -test.server flag. Of course, some unsafe hackery will be required, but not much. (It needs access to the main.benchmarks and main.tests slices and some private methods of testing.(*B).)
Are you interested?
I'm interested in doing this, but not using unsafe.
The unsafe hacks will be pretty limited. We need to access main.benchmarks, main.tests, and testing.(*B).runN (with //go:linkname or asm stubs). All the other parts of the testing.B type can be accessed with reflect.
Some simple codegen could provide the flag and the list of benchmarks. The basic implementation of benchmarking is pretty simple and could be copy/pasted; CL 10020 makes it simpler yet. I'd still rather design proper support for core building blocks into the stdlib, but I guess this would be better than nothing. Any interest in working on this with me?
@minux well, a combination of unsafe, reflection, and testing.M seemed like the best approach in the end. Please see package benchserve (godoc) for an initial implementation. Feedback most welcome. If it looks good to you, then we can turn our attention to making a good test driver, probably as a first pass a combination of bench and rod.
The only issue with using TestMain is that we need to handle the case where the test is already using TestMain to do test setup.
My approach is very hacky: it injects a new test into the list of tests during package init() and then sets -test.run to run that test (which is actually the jsonrpc bench server).
I realized that I probably used too much unsafe hackery in the code, though...
I've updated benchserve to support the case in which there is already a TestMain. I also added an (aspirational, unwritten) client API to make writing drivers easier.
What about adding a -minbenchiterations flag that takes precedence over benchtime, so faster benchmarks can complete quickly and longer ones can still run enough times for a meaningful number?
@josharian Say I have two benchmarks A and B: A takes 10s to get to 1k iterations and B takes 1s to get to 1k iterations. I'd like it to run just like that.
I think when we have different sizes of benchmarks, it would be much easier to have basic control from inside the benchmark itself. As for me, it is much more convenient to adjust specific variables than to apply command-line options to all benchmarks together. For example:
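A hypothetical illustration of that kind of per-benchmark control; the SetBenchtime and SetMinIterations methods do not exist in the testing package and are invented here purely to show the shape of the idea:

```go
package foo_test

import (
	"testing"
	"time"
)

// doSmallThing and doLargeThing stand in for the code under test.
func doSmallThing() { time.Sleep(10 * time.Nanosecond) }
func doLargeThing() { time.Sleep(10 * time.Millisecond) }

func BenchmarkSmall(b *testing.B) {
	// b.SetBenchtime(10 * time.Millisecond) // hypothetical: stable almost immediately
	for i := 0; i < b.N; i++ {
		doSmallThing()
	}
}

func BenchmarkLarge(b *testing.B) {
	// b.SetMinIterations(1000) // hypothetical: always run at least 1k iterations
	for i := 0; i < b.N; i++ {
		doLargeThing()
	}
}
```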
For discussion:
There is tension in how long to run benchmarks for. You want to run long, in order to make any overhead irrelevant and to reduce run-to-run variance. You want to run short, so that it takes less time; if you have a fixed amount of computing time, it'd be better to run multiple short tests, so you can do better analysis than taking the mean, perhaps by using benchstat.
Right now we use a fixed duration, which is ok, but we could do better. For example, many of the microbenchmarks in strconv appear stable at 10ms, which is 100x faster than the default of 1s.
Rough idea, input welcomed:
The time to run a benchmark is V+C*b.N, where b.N is the number of iterations and V and C are random variables (V for overhead, C for per-iteration execution time). We can take measurements using different b.N (starting small and growing) and estimate V and C. Based on that, we can calculate what b.N value is required to reduce the contribution of V to the total to some fixed limit, say 1%.
This should allow stable, fast benchmarks to execute very quickly. Slower benchmarks would get slower (you have to execute with b.N=1 and 2 at a bare minimum), but that's better than accidentally misleading the user into thinking that they have a meaningful performance number, which is what can currently happen.
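A minimal sketch of that estimation step, assuming we already have (b.N, total duration) measurement pairs; the helper names and the sample numbers below are made up for illustration and are not from the quick-and-dirty version mentioned below:

```go
package main

import "fmt"

// estimateVC fits t = V + C*n by ordinary least squares over
// (iterations, duration) measurement pairs.
func estimateVC(ns, ts []float64) (v, c float64) {
	var sumN, sumT, sumNN, sumNT float64
	k := float64(len(ns))
	for i := range ns {
		sumN += ns[i]
		sumT += ts[i]
		sumNN += ns[i] * ns[i]
		sumNT += ns[i] * ts[i]
	}
	c = (k*sumNT - sumN*sumT) / (k*sumNN - sumN*sumN)
	v = (sumT - c*sumN) / k
	return v, c
}

// requiredN returns the smallest b.N for which the fixed overhead V is at
// most frac (e.g. 0.01) of the total time V + C*N, i.e. N >= V*(1-frac)/(frac*C).
func requiredN(v, c, frac float64) float64 {
	return v * (1 - frac) / (frac * c)
}

func main() {
	// Hypothetical measurements: total durations in seconds at growing b.N.
	ns := []float64{1, 2, 5, 10, 20, 50}
	ts := []float64{0.0011, 0.0012, 0.0015, 0.0020, 0.0030, 0.0060}
	v, c := estimateVC(ns, ts)
	fmt.Printf("V = %.4gs, C = %.4gs/op, need b.N >= %.0f for V <= 1%% of total\n",
		v, c, requiredN(v, c, 0.01))
}
```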
We would probably want to change benchtime to be a cap on running time and increase the default value substantially. If stable numbers are not achievable within the provided benchtime, we would warn the user, who could increase the benchtime or change the benchmark.
I put together a quick-and-dirty version of this using linear regression to estimate V and C. It almost immediately caught a badly behaved benchmark (fixed in CL 10053), when it estimated that the benchmark would take hours to run in order to be reliable. I haven't run it outside the encoding/json package; I imagine that there are other benchmarks that need fixing. Again, input welcomed. I'm not a statistician; I don't even play one on TV.