
[minibench] Drop outliers from benchmark result #8919


Merged: 4 commits merged into main from kirk-use-trimmean on Mar 4, 2025

Conversation

kirklandsign
Contributor

Summary

Currently the result has large variance due to outliers, so use only the middle 80% of samples (a trimmed mean with trim proportion 0.2).

Test plan

CI

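For readers unfamiliar with the statistic: a trimmed mean with trim proportion 0.2 sorts the samples, drops the lowest 10% and highest 10%, and averages the middle 80%. A minimal sketch of that computation (illustrative Java only; the class, the `trimMean` helper, and the sample numbers are made up here and are not minibench's actual code):

```java
import java.util.Arrays;

public class TrimmedMeanSketch {
    // Mean of the middle (1 - trim) fraction of samples.
    // For trim = 0.2, the lowest 10% and highest 10% are dropped,
    // so only the middle 80% contribute to the average.
    static double trimMean(double[] samples, double trim) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        int drop = (int) (n * trim / 2.0); // samples dropped from each tail
        double sum = 0.0;
        for (int i = drop; i < n - drop; i++) {
            sum += sorted[i];
        }
        return sum / (n - 2 * drop);
    }

    public static void main(String[] args) {
        // One slow outlier (e.g. a scheduling or thermal hiccup) skews the plain mean.
        double[] latenciesMs = {24.7, 24.9, 25.1, 24.8, 25.0, 24.6, 25.2, 24.9, 25.0, 60.3};
        System.out.printf("plain mean:   %.2f ms%n",
                Arrays.stream(latenciesMs).average().getAsDouble()); // ~28.45
        System.out.printf("trimmean 0.2: %.2f ms%n",
                trimMean(latenciesMs, 0.2)); // ~24.95, outlier discarded
    }
}
```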
@kirklandsign requested a review from tarun292 as a code owner March 4, 2025 04:55

pytorch-bot bot commented Mar 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8919

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ce0902f with merge base 2ee3ffa:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Mar 4, 2025
@facebook-github-bot
Contributor

@kirklandsign has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kirklandsign temporarily deployed to upload-benchmark-results March 4, 2025 05:32 with GitHub Actions
@kirklandsign
Contributor Author

@guangy10 @huydhn

The result is not bad: the difference between runs is now reduced to about 2% (e.g. 24.74 → 25.03 on the S24 is a ~1.2% change).

For load time, unfortunately, this cannot be addressed yet, because we only take one load-time measurement per run.

@huydhn I now use the "avg_inference_latency" field, but it is actually the trimmed mean. Please let me know if you are unhappy with reusing the existing field. Honestly, I think it's ok 😜

| Model | Backend | Device | Avg inference latency | Load time |
|---|---|---|---|---|
| ic4 | xnnpack_q8 | Samsung Galaxy S24 (Android 14) | 24.74 → 25.03 | 51.46 → 64.21 |
| ic4 | xnnpack_q8 | Samsung Galaxy S24 Ultra (Android 14) | 23.4 → 23.72 | 51.05 → 56.44 |
| ic4 | xnnpack_q8 | Samsung Galaxy S24+ (Android 14) | 26.17 → 26.48 | 64.59 → 56.06 |

@kirklandsign requested review from huydhn and guangy10 March 4, 2025 05:43
@facebook-github-bot
Contributor

@kirklandsign has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@kirklandsign temporarily deployed to upload-benchmark-results March 4, 2025 06:59 with GitHub Actions
@kirklandsign
Contributor Author

Again, a very promising result after trying one more time.

| Model | Backend | Device | Avg inference latency | Load time |
|---|---|---|---|---|
| ic4 | xnnpack_q8 | Samsung Galaxy S24 (Android 14) | 23.53 → 24.74 | 76.1 → 51.46 |
| ic4 | xnnpack_q8 | Samsung Galaxy S24 Ultra (Android 14) | 23.69 → 23.4 | 55.37 → 51.05 |
| ic4 | xnnpack_q8 | Samsung Galaxy S24+ (Android 14) | 27.44 → 26.17 | 55.61 → 64.59 |

Contributor

@huydhn left a comment


LGTM!

@huydhn
Contributor

huydhn commented Mar 4, 2025

> @huydhn I now use the "avg_inference_latency" field, but it is actually the trimmed mean. Please let me know if you are unhappy with reusing the existing field. Honestly, I think it's ok 😜

I'm OK with this too, although I don't know enough statistics to decide whether 0.2 is a reasonable value to use, i.e. why not 0.1.

For the load time, I think we could consider running minibench via adb multiple times, but that feels like overkill; maybe it's OK to have a load time with high variance and just use a higher alert threshold for that metric on the dashboard.

@kirklandsign
Contributor Author

> I'm OK with this too, although I don't know enough statistics to decide whether 0.2 is a reasonable value to use, i.e. why not 0.1.

Unfortunately, I tried 0.1 but it was not as good; the distribution is quite left-skewed.
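To make the 0.1 vs 0.2 trade-off concrete, here is an illustrative extension of the `trimMean` sketch above (synthetic numbers, not measurements from this PR): with a heavy tail of slow samples, trimming 10% can leave most of the tail in the kept window, while trimming 20% discards it.

```java
// Illustrative only: 50 synthetic latencies, 44 at 25.0 ms plus a heavy
// tail of six slow outliers (~12% of samples).
double[] skewed = new double[50];
Arrays.fill(skewed, 0, 44, 25.0);
double[] tail = {40.0, 45.0, 50.0, 55.0, 60.0, 70.0};
System.arraycopy(tail, 0, skewed, 44, tail.length);

// trim 0.1 drops 2 samples per tail, keeping four of the six outliers;
// trim 0.2 drops 5 per tail, keeping only one.
System.out.printf("trim 0.1: %.2f%n", trimMean(skewed, 0.1)); // ~26.96, pulled up by the tail
System.out.printf("trim 0.2: %.2f%n", trimMean(skewed, 0.2)); // ~25.38, close to the 25.0 mode
```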

@kirklandsign merged commit 09ad20a into main Mar 4, 2025
61 checks passed
@kirklandsign deleted the kirk-use-trimmean branch March 4, 2025 17:42
zonglinpeng pushed a commit that referenced this pull request Mar 6, 2025
Labels: CLA Signed, topic: not user facing