[minibench] Drop outliers from benchmark result #8919
Conversation
Currently the result has large variance due to outliers, so use only the middle 80% of samples (trimmed mean with trim fraction 0.2).
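The "trimmean 0.2" statistic can be sketched as follows: sort the latency samples, drop 10% from each tail (20% total), and average what remains. This is a minimal illustrative sketch, not the minibench implementation; the sample latencies are made up.

```python
def trimmed_mean(samples, trim=0.2):
    """Mean of the middle (1 - trim) fraction of samples.

    `trim` is the total fraction discarded, split evenly between
    both tails (Excel TRIMMEAN convention), so trim=0.2 keeps the
    middle 80% of the measurements.
    """
    xs = sorted(samples)
    k = int(len(xs) * trim / 2)  # samples dropped per tail
    if k:
        xs = xs[k:-k]
    return sum(xs) / len(xs)

# Hypothetical inference latencies in ms, with outliers at both tails.
latencies = [12.1, 11.9, 12.0, 12.2, 30.5, 11.8, 12.0, 5.0, 12.1, 11.9]
print(trimmed_mean(latencies, 0.2))  # 5.0 and 30.5 are excluded -> 12.0
```

The plain mean of this sample is pulled to about 14.05 ms by the two outliers, while the trimmed mean stays at 12.0 ms, which is why run-to-run variance shrinks.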
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8919
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit ce0902f with merge base 2ee3ffa. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@kirklandsign has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
The result is not bad: the difference between runs is now reduced to about 2%. Load time, unfortunately, cannot be addressed yet, because we only take one load-time measurement per run. @huydhn I now use the "avg_inference_latency" field, but it is actually a trimmed mean. Please let me know if you are unhappy with reusing the existing field. Honestly I think it's ok 😜 ic4
Again, a very promising result after I tried one more time. ic4
LGTM!
I'm ok with this too, although I don't know enough statistics to decide whether 0.2 is a reasonable value, i.e. why not 0.1. For load time, we could consider running minibench via adb multiple times, but that feels like overkill; maybe it's ok to have a load time with high variance and just use a higher alert threshold for that metric on the dashboard.
Unfortunately I tried 0.1, but the result was not as good; the trimmed distribution is still quite left-skewed.
Summary
Currently the result has large variance due to outliers, so use only the middle 80% of samples (trimmed mean with trim fraction 0.2).
Test plan
CI