Mcc add perf tests improve performance #3699
modules/mcc/src/utils.hpp
const int num_elements = (int)src.total()*channel;
const double *psrc = (double*)src.data;
double *pdst = (double*)dst.data;
const int batch = 128;
This "batch" optimization improves performance on Windows.
What are common values of num_elements? We can make batch dependent on the number of threads:
const int batch = num_elements / max(1, getNumThreads());
or
const int batch = num_elements / (getNumThreads() > 1 ? getNumThreads() * 4 : 1);
Instead of 4, you may choose another constant to get batch=128 in your configuration.
With your second sample I got the same performance (47 ms) using a constant of 1024:
const int batch = std::max(1, getNumThreads() > 1 ? num_elements / (1024*getNumThreads()) : num_elements);
// if getNumThreads() == 1 -> batch = num_elements
With your first sample,
const int batch = num_elements / max(1, getNumThreads());
performance regresses (from 47 ms to 57 ms).
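To make the three batch formulas from this thread easier to compare, here is a minimal standalone sketch. The helper names are hypothetical, and `threads` is a plain parameter standing in for the value of getNumThreads(); the element and thread counts in the usage note are illustrative assumptions, not measurements from this PR.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helpers for illustration only. `threads` stands in for
// the value returned by getNumThreads().

// First suggestion: one batch per thread.
// Note: yields 0 when num_elements < threads, so a caller would need a clamp.
inline int batchPerThread(int num_elements, int threads) {
    return num_elements / std::max(1, threads);
}

// Second suggestion: k batches per thread (k = 4 in the review comment);
// with a single thread the whole array is one batch.
inline int batchPerThreadScaled(int num_elements, int threads, int k) {
    assert(k > 0);
    return num_elements / (threads > 1 ? threads * k : 1);
}

// Variant from the reply (k = 1024 there): same idea, clamped to at least 1.
inline int batchClamped(int num_elements, int threads, int k) {
    assert(k > 0);
    return std::max(1, threads > 1 ? num_elements / (k * threads)
                                   : num_elements);
}
```

For example, assuming 4194304 elements and 32 threads, the first formula gives a batch of 131072, the second (k = 4) gives 32768, and the clamped variant with k = 1024 gives 128. Whether those assumed values match the perf-test inputs is not stated in the thread.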
I would suggest using batch 128, but your second sample would also work.
Batch is the minimum number of consecutive array elements that a thread processes at one time.
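One way to picture this definition is a loop over batch indices, where each index maps to a contiguous run of elements. The sketch below is an assumption-laden illustration, not the code from this PR: the names batchCount and elementWiseBatched are hypothetical, std::vector replaces Mat, and the outer loop runs serially where the real code would hand ranges of batch indices to parallel_for_.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Number of batches needed to cover num_elements (ceiling division).
inline int batchCount(int num_elements, int batch) {
    assert(batch > 0);
    return (num_elements + batch - 1) / batch;
}

// Applies `op` to every element of src, one batch at a time.
// The outer loop is a serial stand-in for a parallel loop over batch
// indices: each batch index b owns the element range [start, end).
template <typename Op>
void elementWiseBatched(const std::vector<double>& src,
                        std::vector<double>& dst, int batch, Op op) {
    const int num_elements = static_cast<int>(src.size());
    dst.resize(src.size());
    const int n = batchCount(num_elements, batch);
    for (int b = 0; b < n; ++b) {
        const int start = b * batch;
        // The last batch may be shorter than `batch`.
        const int end = std::min(start + batch, num_elements);
        for (int i = start; i < end; ++i) {
            dst[i] = op(src[i]);
        }
    }
}
```

With 300 elements and a batch of 128, this covers the array in three batches of 128, 128 and 44 elements, so no thread would be handed fewer than one batch of consecutive elements except at the tail.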
Added perf tests to the mcc module.
These optimizations have also been added:
- parallel_for_ to performThreshold()
- toL/fromL and added dst to avoid copying data
- parallel_for_ to elementWise() (the "batch" optimization improves performance of the Windows version; Linux is unchanged).
Configuration:
Ryzen 5950X, 2x16 GB 3000 MHz DDR4
OS: Windows 10, Ubuntu 20.04.5 LTS
Performance results in milliseconds:
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.