Auto-post results in PRs that significantly affect perf #711
Comments
1% seems like a pretty low bar. Maybe 3% so we're sure it's not noise?
I think we should start at 1% and go up (or add exceptions) from there, since false positives are pretty low impact here IMO.
And maybe a higher threshold for doc builds?
Higher threshold sounds fine. 90% of doc time normally comes from running rustc.
I guess the exception is enormous single crates ... How awful would it be to add winapi or stm32ral to the benchmarks?
We do not currently have the budget to add any more benchmarks; indeed, there is active discussion about culling our benchmarks (or optimizing what we collect, etc.; see https://rust-lang.zulipchat.com/#narrow/stream/247081-t-compiler.2Fperformance/topic/script-servo-2.20takes.20way.20too.20long).
I was just about to write up an issue suggesting that we post the perf results for all merge commits (or all merge commits that receive a perf run; I'm not sure those are different). Perhaps we could post either a comment indicating the final perf results, or one indicating the final perf results plus a note that there was a potential regression. The former would be useful mostly for posterity and investigation: it would be nice to be able to go to a PR and easily see what its final merged perf results were, especially if the heuristic we use to flag regressions doesn't catch all forms of them.
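
A rough sketch of what posting such a comment could look like, using GitHub's real REST endpoint `POST /repos/{owner}/{repo}/issues/{number}/comments`. The function name, the target repository, and the use of the `reqwest` (with the "blocking" and "json" features) and `serde_json` crates are illustrative assumptions, not rustc-perf's actual implementation:

```rust
use std::env;

/// Hypothetical helper: post `body` as a comment on PR `pr`.
/// The endpoint is GitHub's real issue-comments REST API;
/// everything else here is an assumption for illustration.
fn post_perf_comment(pr: u64, body: &str) -> Result<(), Box<dyn std::error::Error>> {
    let token = env::var("GITHUB_TOKEN")?;
    let url = format!(
        "https://api.github.com/repos/rust-lang/rust/issues/{}/comments",
        pr
    );
    reqwest::blocking::Client::new()
        .post(&url)
        .header("Authorization", format!("token {}", token))
        .header("User-Agent", "rustc-perf-bot") // GitHub rejects requests without a UA
        .json(&serde_json::json!({ "body": body }))
        .send()?
        .error_for_status()?;
    Ok(())
}
```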
When a PR lands and is measured, if it improves or regresses performance by more than a certain amount, it would be good for the results to be automatically added in a comment on the PR. That way PR authors will get feedback faster.
Choosing the threshold of significance is the hard part. When doing manual perf triage I usually record changes where at least one run moved by more than 1%. But I occasionally ignore such changes, because some of our benchmarks are noisy. So maybe we should require that 3 or more runs change by 1% or more.