proposal: cmd/go: run benchmarks for one iteration when testing #41824

zeebo · 2020-10-06T21:01:01Z

Currently, benchmarks are only run when passing the -bench flag to go test. Usually, CI checks don't run benchmarks because of the extra time required and noise reducing the usefulness of the results. This leads to situations where benchmark code can rot, causing it to become a liability instead of an asset. Instead, go test without the -bench flag could also run all of the benchmarks for a single iteration (b.N = 1) and just report PASS or FAIL like a normal test. Then benchmark code would be exercised at least as often as test code, preventing rot.

Some details to work out:

What about -run? I would suggest it additionally filters the benchmarks to run when -bench is not passed.

The text was updated successfully, but these errors were encountered:

davecheney · 2020-10-06T21:27:16Z

go test -run=. -bench=. -benchtime=1x

zeebo · 2020-10-06T21:31:56Z

That's a good way of saying the default behavior this proposes.

Are you trying to make an argument one way or the other with that statement? If so, what argument?

bcmills · 2020-10-07T03:33:00Z

To give a recent example: cmd/api.TestBenchmark was broken from CL 164623 to CL 224619, and nobody had noticed because the benchmark wasn't covered in the Go project's own CI checks until CL 224038.

mvdan · 2020-10-07T12:13:33Z

I generally agree with this; I've been bitten by benchmarks going stale as well. I think benchmarks should be run as part of go test for the same reason that most of vet is now also run when testing; because otherwise, users will forget to check regularly. go test -run=. -bench=. -benchtime=1x exists, but it's not easy to type nor remember.

The only trouble I can imagine with this change is that it would cause pain if there are any low-quality benchmarks which would now suddenly run as part of go test. For example:

Benchmarks which do a lot of work upfront, before looping over N. Imagine one that takes a full second before the loop starts, so even with N=1, the added cost would be high.
Benchmarks where each iteration is expensive. Imagine a benchmark where each iteration takes a full second, so N=1 is still heavy.

You could say "you should never do either of those things", but sometimes what you need to benchmark is just very costly, especially when benchmarking a feature end-to-end. And I've seen a number of such heavy benchmarks in the past, where even with -benchtime=5s you'd get results with N=10.

I still think it's good to incentivize developers to write cheaper benchmarks, so I'm still in favor of this change. But for those cases where heavy benchmarks are useful, we should have a recommendation for keeping them around without slowing down go test. Perhaps skip them with testing.Short()? Put them in separate packages, so that go test ./... runs them in parallel? Use build tags like // +build slowbench?

The concern above reminds me of #36586 too, because runnable examples and benchmarks are similar in how they can't run in parallel with each other, or with tests.

mvdan · 2020-10-07T12:18:48Z

What about -run? I would suggest it additionally filters the benchmarks to run when -bench is not passed.

I would further suggest that go test -short should ignore benchmarks entirely, too.

bcmills · 2020-10-07T12:48:38Z

But for those cases where heavy benchmarks are useful, we should have a recommendation for keeping them around without slowing down go test.

Perhaps skip them with testing.Short()?

Yes.

Put them in separate packages, so that go test ./... runs them in parallel?

That seems worse than skipping with testing.Short(): go test ./... would run them in parallel, but go test -bench=. foo wouldn't give any hint that some benchmarks of foo were skipped. (And go test -bench=. all wouldn't find those benchmarks when the package is imported from some other module.)

Use build tags like // +build slowbench?

That seems excessive, given the option of using testing.Short()..?

seebs · 2020-10-07T14:57:04Z

I endorse a default of -bench=. -benchtime=1x for go test, it would make my life better.

rsc · 2020-10-07T16:59:52Z

There are a bunch of benchmarks that do a lot of setup work anticipating that they will be run for the full 2s or more.
I don't believe this change is well-advised at all. If your benchmark is testing something, make a test.

rsc · 2020-10-07T17:01:13Z

Usually, CI checks don't run benchmarks because of the extra time required and noise reducing the usefulness of the results. This leads to situations where benchmark code can rot, causing it to become a liability instead of an asset. Instead, go test without the -bench flag could also run all of the benchmarks for a single iteration (b.N = 1) and just report PASS or FAIL like a normal test.

CI systems today can use -bench=1x. Do any CI systems do that?

josharian · 2020-10-07T17:08:38Z

One way to make this opt-in would be a method on testing.T. Maybe something like:

// RunBenchmark runs the benchmark b once with b.N == 1.
// Any TB method called on b, such as Errorf or Skip, are treated as though they were called on b.
func (t *T) RunBenchmark(bench func(b *B))

Basically, sub-tests but crossing the test/benchmark line. This would also provide an opportunity for the calling testing.T to do other work, like calling Parallel.

(You could also imagine doing this for examples, although then the examples would be executed twice during a normal run.)

cespare · 2020-10-07T18:58:12Z

@mvdan I disagree with your implication that any benchmarks which do expensive setup work are "low-quality".

One system I work on, for instance, is a database which includes benchmarks that initialize themselves with a large synthetic table; the benchmark loops run queries against the table. It takes a lot of CPU and RAM to do the initial setup and I wouldn't want to run these every time I do the usual go test ./.... In CI we use -benchtime 1x to ensure that the code is exercised.

From my perspective, things work well as-is and I would prefer we not change the existing behavior.

zeebo · 2020-10-07T23:21:35Z

I don't believe this change is well-advised at all. If your benchmark is testing something, make a test.

Of course tests should be tests, but this proposal is to make it so that benchmarks don't go stale and become broken over time because no one runs them. Phrased differently, it's about testing the benchmark, not having the benchmark test the code.

CI systems today can use -bench=1x. Do any CI systems do that?

Picking a sample of systems I'm aware of, plus whatever was at the top of the github marketplace for CI

Travis does not by default: https://docs.travis-ci.com/user/languages/go/#default-build-script
Codefresh's example does not: https://codefresh.io/docs/docs/learn-by-example/golang/golang-hello-world/#run-unit-tests-as-part-of-the-pipeline
CircleCI's example for Go uses gotestsum which does not: https://circleci.com/docs/2.0/language-go/#sample-configuration and https://github.com/gotestyourself/gotestsum#custom-go-test-command
SemaphoreCI's example does not: https://docs.semaphoreci.com/examples/golang-continuous-integration/#sample-configuration

I would be surprised if running the benchmarks as part of go test was included in any documentation for any CI system. Furthermore, many examples do include more advanced features like generating code coverage reports, so it can't be explained by keeping examples simple.

As far as users overriding the default configuration to run benchmarks, I could not find a single .travis.yaml that executes benchmarks as part of go test using github search:

In my personal experience, 3 out of 3 different workplaces did not run benchmarks as part of the tests. Indeed, as @bcmills said, the Go project itself had a broken benchmark for a period of over a year.

I believe this is strong evidence that running benchmarks as part of testing is rare.

kortschak · 2020-10-07T23:45:06Z

I think this should be opt in, as it currently is with -benchtime=1x. I just ran the Gonum test suite with -benchtime=1x and it goes from completing in 2 minutes to taking more than half an hour. It may well be that we are an outlier (we have a lot of benchmarks and due to the nature of the things we do, we need to test a variety of problem sizes with algorithms that have — intrinsically — fairly unpleasant time complexity).

rsc · 2020-10-12T18:41:04Z

@zeebo

I believe this is strong evidence that running benchmarks as part of testing is rare.

Indeed. The fact that CI systems can today opt in to running 1 iteration of benchmarks (using -benchtime=1x) but do not seems to me strong evidence that the Go command should not start doing it either.

Benchmarks and tests are separate by design, for good reasons (benchmarks can be slow, even 1x, and that hurts go test latency, which is very important to keep low).

It seems to me that we should keep them separate. As has been pointed out, CI systems and users have the ability to opt in even today, using -benchtime=1x. You could even set GOFLAGS=-benchtime=1x to force it into all your go test commands.

seebs · 2020-10-12T18:52:22Z

Many CI systems predate the existence of "1x" as a benchtime option. For a long time, benchmarks could only be run in a loop until they took at least some amount of time, defaulting to a second, which in practice often required quite a few iterations per benchmark. Now that it exists, it's much more likely that people would start adopting it, but I think the lack of people doing it may reflect in part that it wasn't available when they specced out their setup.

rsc · 2020-10-12T19:10:57Z

The CI systems are not the primary reason not to do this. The primary reason is the overall slowdown.
The fact that the CI systems already can do this but do not is just icing atop the cake.

zeebo · 2020-10-12T20:09:21Z

None of the CI systems run go vet either, yet that was included to be a part of go test even though it too would slow down execution. Is it a matter of the degree of slow down? If so, would it be useful to collect data about what the expected slow down is?

zeebo · 2020-10-12T20:11:25Z

Benchmarks and tests are separate by design, for good reasons

Examples and tests are also separate for good reasons, yet examples are run during go test not as a form of testing the library, but to ensure that the documented examples do not become stale or broken. Why should that not also apply to benchmarks?

kortschak · 2020-10-12T20:46:05Z

If so, would it be useful to collect data about what the expected slow down is?

That's essentially open-ended; there is one data point above. Given the wide range of possible impacts on test times, the question then becomes what is the work impact on those project where the time impact is too high. For us, if there were no additional knobs in the form of command flags or env vars, the work would be onerous to prevent a blow-out of testing time by preventing running of benchmarks, and given the Go project's dislike of knobs, that additional work would be guaranteed.

davecheney · 2020-10-12T20:48:26Z

@zeebo you mentioned

This leads to situations where benchmark code can rot, causing it to become a liability instead of an asset.

What does this rotting look like? Is it mis compilation, or the benchmarks returning incorrect results?

zikaeroh · 2020-10-12T20:56:37Z

The searches in #41824 (comment) aren't complete, as they are searching through ".travis.yaml", not ".travis.yml", which is the massively more common filename. There are many examples of go test being run directly. Some run benchmarks, often with short benchtimes.

E.g., this is an old repo that only tests up to 1.6, but they were using a very short benchtime (100ms) to get something close to 1x before it was even supported: https://github.com/encryptio/go-meetup/blob/755fb270c209065a5242e60431973ff2ce2f4307/.travis.yml#L8

I'd be wary of running benchmarks by default, even at 1x; on my main project my benchmarks often spin up a temporary database. I run with 1x in my CI, but I've gone to great lengths to ensure that this can happen quickly (using a special image that caches the initialization, plus pooling and DB templating). I'm not certain everyone has gone through that effort.

kortschak · 2020-10-12T21:18:49Z

@davecheney It can't be mis-compilation because irrespective of the flags, the benchmarks are still built.

In looking into the time impact on Gonum's test suite, I did find one set of cases where a change in the API had resulted in a rotted benchmark resulting in a panic (so I am torn by this, I don't want to see rotted benchmarks, but a greater than one order of magnitude increase in test suite running time is too much).

davecheney · 2020-10-12T22:12:54Z

@kortschak that's what I figured, thanks for confirming.

josharian · 2020-10-12T23:02:36Z

It seems like there's consensus here (1) that there's a problem with benchmarks getting stale and (2) that running with -benchtime=1x by default isn't the solution.

I offered one suggestion above for making it easy to explicitly opt in on a benchmark by benchmark basis, and a related approach in #41878.

Anyone else care to throw out other possible solutions? This seems like as good a location as any to brainstorm.

zeebo · 2020-10-12T23:52:28Z

There are many examples of go test being run directly. Some run benchmarks, often with short benchtimes.

Thank you @zikaeroh for spotting the errors in the analysis I made.

I offered one suggestion above for making it easy to explicitly opt in on a benchmark by benchmark basis, and a related approach in #41878.

Anything that is opt-in may as well be during the invocation of the go tool, I think. In other words, I don't see the benefit of encoding the opt-in in the source code versus the arguments passed to go test.

randall77 · 2020-10-13T00:19:06Z

What about just using names? Currently Test* are tests, and Benchmark* are benchmarks. How about anything named BenchmarkTest* is both a benchmark and a test?
A BenchmarkTest would have to take a *testing.B like benchmarks do. But it would also be run with b.N=1 when testing. Conceivably we could add a method like TestMode() bool to *testing.B so if BenchmarkTests cared they could distinguish test mode (e.g. Skip, or avoid setting up a database, etc.).

The only downside is that anything currently named BenchmarkTest* would now run when testing, when it didn't before.

Happy to entertain better names.

josharian · 2020-10-13T00:19:22Z

I don't see the benefit of encoding the opt-in in the source code versus the arguments passed to go test.

You can specify a subset of benchmarks to include/exclude. You can enable b.Parallel. You can set it once in one place and not have to manage external configuration (CI) or training/documentation (teammates).

josharian · 2020-10-13T00:25:37Z

@randall77 naming: BenchTest would have the dual advantages of being shorter and not ambiguous with existing benchmarks.

Another option along those lines would be to keep existing Benchmark* name but allow functions to accept a *testing.BT instead of a *testing.B. The param type would be the opt in. Then you could hang a Parallel and a TestMode/IsBenchmark off that new type.

A *testing.BT would also need a B() *B method for use with sub-benchmarks.

josharian · 2020-10-13T00:28:26Z

And if we wanted, with *testing.BT, instead of exposing b.N directly, we could expose b.Next(). But that might be a bridge too far, since it'd make opt-in more than a one character change. :)

randall77 · 2020-10-13T00:47:37Z

There's already the interface testing.TB, so introducing testing.BT struct to distinguish by argument type seems ripe for confusion.
But I like the idea if we can think of a better name.

rsc · 2020-10-14T15:12:10Z

If you want to run a benchmark during a test:

func BenchmarkB(b *testing.B) { ... }
func TestBenchmarkB(t *testing.T) { testing.Benchmark(BenchmarkB) }

I don't see why new API is needed here.

josharian · 2020-10-14T16:59:21Z

@rsc that will do the full benchmark run, increasing b.N incrementally, which is much slower than only b.N=1.

rsc · 2020-10-16T17:45:19Z

@josharian, indeed, I was clearly tired. :-)

The point remains, though: if you want to opt-in to testing certain benchmark code during tests, factor it out and call it from a Test function.

josharian · 2020-10-16T18:15:09Z

:) Here's a worked example of that approach.

Here's the simplest possible benchmark:

func BenchmarkFoo(b *testing.B) {
  for i := 0; i < b.N; i++ {
     foo()
  }
}

Here's what it looks like refactored to be a test and a benchmark:

func benchmarkFoo(tb testing.TB) {
  bN  := 1
  if b, ok := tb.(*testing.B); ok {
    bN = b.N
  }
  for i := 0; i < bN; i++ {
     foo()
  }
}

func BenchmarkFoo(b *testing.B) {
  benchmarkFoo(b)
}

func TestFoo(t *testing.T) {
  benchmarkFoo(t)
}

Maybe I'm missing a simpler way, but that's a lot of boilerplate, particularly repeated over many benchmarks.

seebs · 2020-10-16T18:35:02Z

This, plus my rejected proposal for testing middleware, are sort running into similar things -- there's no good way to programatically express "for each test" or "for each benchmark", or request that specific things be run, or influence
settings. For instance, "if no benchmark runs were requested, run each benchmark once with N=1" seems like a thing that would be lovely to be able to express concisely in code.

There's a lot of things like this where it's certainly possible to do the thing by hand-refactoring, but in a large test suite, it's hundreds of instances of boilerplate, and that means hundreds of things that could get out of sync.

Maybe this is or should be a code generation problem.

That said, adding RunTests, RunBenchmarks, and a way to interact with flags to testing.M would also solve a lot of this:

func
TestMain(m *testing.M) {
    var mFlags flag.FlagSet = m.Flags()
    bench := mFlags.Lookup("bench")
    if bench.String() == "" {
        bench.Set(".")
        mFlags.Lookup("benchtime").Set("1x")
    }
    m.RunTests()
    m.RunBenchmarks()
}

(Of course, I'd prefer wrapper funcs that take (t *testing.T, name string, fn func(*testing.T)) that you could pass to RunTests and RunBenchmarks, so you could intercept each individual test-or-benchmark, but that's sort of orthogonal.)

josharian · 2020-10-16T19:43:46Z

There's already the interface testing.TB

We could use testing.TB, like in my refactored code above. Code could then type switch to *testing.T or *testing.B for test-specific code (t.Parallel) or benchmark-specific code (b.ReportAllocs). Calculating b.N would be wordy, though, as above.

bitfield · 2020-10-17T17:25:29Z

If you do have benchmarks, and you need to run them, you'll run them. If you don't need to run them, you don't need to run them.

rsc · 2020-10-20T03:23:27Z

@josharian, isn't it:

func BenchmarkFoo(b *testing.B) {
  for i := 0; i < b.N; i++ {
     foo()
  }
}

func TestFoo(t *testing.T) {
  foo()
}

?

josharian · 2020-10-20T03:44:51Z

@rsc in my trivial example, yes. And that's a common case, and everything is obviously fine now in those cases.

But there's often a fair amount more work (or at least LOC) in the benchmark.

My example was trivial to attempt to highlight the structure of the refactoring under the assumption that completely copy/pasting the benchmark is not the right solution.

Note that if you copy/paste the benchmark into a test, and the test fails, you also need to remember to make the corresponding benchmark fix.

You could also take similar approaches, like factoring out setup and loop code. But that refactor will be awkward because of the testing.T/testing.B mismatch, and refactoring the loop into a function call may alter performance characteristics.

rsc · 2020-11-04T18:53:36Z

I guess there are two important questions here.

If we were starting from scratch, would it be the right thing to do to run every benchmark for one iteration to check to see if it has any failures?
If the answer to (1) is "yes", then it is the right thing to do to change the default at this late stage?

As I've noted above, I don't see the argument for (1) being "yes", and I'm even more skeptical that (2) is "yes".

We also don't run examples with no output, for good reason. I don't think we should start running those either.
Not every benchmark is written expecting to be able to run during tests. They're just different.
And it is easy to refactor code as needed to test what a benchmark needs to test, in a separate test.

nightlyone · 2020-11-05T01:27:06Z

I believe documenting the existing solution as example for a go test invocation seems useful. Maybe repeat that in the testing package documentation and in the blog article when introducing benchmarks.

rsc · 2020-11-11T18:15:06Z

Given how potentially disruptive this change would be and that people who do want to run benchmarks as tests can already use

go test -bench=. -benchtime=1x

it seems like we should probably leave things as is otherwise. Thoughts?

rsc · 2020-11-18T18:09:48Z

Based on the discussion, this seems like a likely decline.

mvdan · 2020-11-19T10:15:15Z

people who do want to run benchmarks as tests can already use

I'm not sure I understand this argumeent, because it applied to go vet when it was included as part of go test, too.

I assume the only difference here is potentially making go test much slower by default. I can buy that argument, but not the "this is already available if you want it" one.

zeebo · 2020-11-20T16:44:54Z

Given how potentially disruptive this change would be and that people who do want to run benchmarks as tests can already use

(emphasis mine)

The conversation keeps going in the direction of "running benchmarks as tests" which is not at all what is being proposed. It's testing the benchmark, not using the benchmark to test. In other words, this proposal would still make sense to me even if the *testing.B type had no way to indicate failures.

That said, it does seem like it would have a high probability of being too disruptive. I do think it's worthwhile to document that the -benchtime=1x flag can be used to quickly check benchmarks, given all the evidence that benchmarks do rot presented in this thread and elsewhere.

rsc · 2020-12-02T18:05:38Z

No change in consensus, so declined.

gopherbot added this to the Proposal milestone Oct 6, 2020

gopherbot added the Proposal label Oct 6, 2020

josharian mentioned this issue Oct 11, 2020

proposal: testing: provide TestMain with stats so that it can print a custom summary. #41878

Open

mvdan mentioned this issue Oct 21, 2020

doc: introduce a page for CI best practices #42119

Open

rsc mentioned this issue Nov 4, 2020

proposal: review meeting minutes #33502

Open

rsc added the Proposal-FinalCommentPeriod label Nov 19, 2020

rsc closed this as completed Dec 2, 2020

rsc removed the Proposal-FinalCommentPeriod label Dec 2, 2020

bcmills mentioned this issue Feb 22, 2021

cmd/go: provide a way to test that benchmarks work without invalidating cache results #44499

Open

golang locked and limited conversation to collaborators Dec 2, 2021

gopherbot added the FrozenDueToAge label Dec 2, 2021

rsc moved this to Declined in Proposals Aug 10, 2022

rsc added this to Proposals Aug 10, 2022

rsc removed this from Proposals Oct 19, 2022

proposal: cmd/go: run benchmarks for one iteration when testing #41824

proposal: cmd/go: run benchmarks for one iteration when testing #41824

Comments

zeebo commented Oct 6, 2020 • edited Loading

davecheney commented Oct 6, 2020

zeebo commented Oct 6, 2020

bcmills commented Oct 7, 2020

mvdan commented Oct 7, 2020

mvdan commented Oct 7, 2020

bcmills commented Oct 7, 2020

seebs commented Oct 7, 2020

rsc commented Oct 7, 2020

rsc commented Oct 7, 2020

josharian commented Oct 7, 2020

cespare commented Oct 7, 2020 • edited Loading

zeebo commented Oct 7, 2020 • edited Loading

kortschak commented Oct 7, 2020

rsc commented Oct 12, 2020

seebs commented Oct 12, 2020

rsc commented Oct 12, 2020

zeebo commented Oct 12, 2020

zeebo commented Oct 12, 2020 • edited Loading

kortschak commented Oct 12, 2020

davecheney commented Oct 12, 2020

zikaeroh commented Oct 12, 2020 • edited Loading

kortschak commented Oct 12, 2020 • edited Loading

davecheney commented Oct 12, 2020

josharian commented Oct 12, 2020

zeebo commented Oct 12, 2020

randall77 commented Oct 13, 2020

josharian commented Oct 13, 2020

josharian commented Oct 13, 2020 • edited Loading

josharian commented Oct 13, 2020

randall77 commented Oct 13, 2020

rsc commented Oct 14, 2020

josharian commented Oct 14, 2020

rsc commented Oct 16, 2020

josharian commented Oct 16, 2020 • edited Loading

seebs commented Oct 16, 2020

josharian commented Oct 16, 2020

bitfield commented Oct 17, 2020

rsc commented Oct 20, 2020

josharian commented Oct 20, 2020

rsc commented Nov 4, 2020

nightlyone commented Nov 5, 2020

rsc commented Nov 11, 2020

rsc commented Nov 18, 2020

mvdan commented Nov 19, 2020

zeebo commented Nov 20, 2020 • edited Loading

rsc commented Dec 2, 2020

zeebo commented Oct 6, 2020 •

edited

Loading

cespare commented Oct 7, 2020 •

edited

Loading

zeebo commented Oct 7, 2020 •

edited

Loading

zeebo commented Oct 12, 2020 •

edited

Loading

zikaeroh commented Oct 12, 2020 •

edited

Loading

kortschak commented Oct 12, 2020 •

edited

Loading

josharian commented Oct 13, 2020 •

edited

Loading

josharian commented Oct 16, 2020 •

edited

Loading

zeebo commented Nov 20, 2020 •

edited

Loading