simd::unfilter_paethN: Load 4 (or 8) bytes at a time (faster than 3 or 6).
This CL loads RGB data using 4-byte-wide loads (and RRGGBB data using
8-byte-wide loads), because:
* This is faster, as measured by the microbenchmarks below.
* It doesn't change the behavior: both before and after this change, the
4th SIMD lane is ignored when processing RGB data (after this change the
4th SIMD lane contains data from the next pixel; before this change it
contained a 0 value).
* This is safe as long as at least 4 bytes of input data remain (we have
to fall back to a 3-byte-wide load for the last pixel).
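The points above can be illustrated with a minimal scalar sketch for the RGB (3 bytes per pixel) case. This is not the code from this CL: plain `[u8; 4]` arrays stand in for 4-lane SIMD vectors, and the names `load4`, `load3`, `paeth_step`, and `unfilter_paeth3_sketch` are hypothetical helpers chosen for illustration.

```rust
/// Paeth predictor as defined by the PNG specification.
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    let (ia, ib, ic) = (i16::from(a), i16::from(b), i16::from(c));
    let p = ia + ib - ic;
    let (pa, pb, pc) = ((p - ia).abs(), (p - ib).abs(), (p - ic).abs());
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

/// Hypothetical stand-in for a 4-lane SIMD load; requires `src.len() >= 4`.
fn load4(src: &[u8]) -> [u8; 4] {
    [src[0], src[1], src[2], src[3]]
}

/// Hypothetical 3-byte load that zero-fills the unused 4th lane.
fn load3(src: &[u8]) -> [u8; 4] {
    [src[0], src[1], src[2], 0]
}

/// One pixel step over all 4 lanes. Lanes are independent, so whatever
/// ends up in lane 3 never influences lanes 0, 1, and 2.
fn paeth_step(a: &mut [u8; 4], c: &mut [u8; 4], b: [u8; 4], x: [u8; 4]) -> [u8; 4] {
    let mut out = [0u8; 4];
    for lane in 0..4 {
        out[lane] = x[lane].wrapping_add(paeth_predictor(a[lane], b[lane], c[lane]));
    }
    *a = out; // becomes the "left" pixel of the next step
    *c = b; // becomes the "upper-left" pixel of the next step
    out
}

/// Illustrative Paeth unfilter for 3 bytes per pixel (an assumption-laden
/// sketch, not the crate's implementation).
pub fn unfilter_paeth3_sketch(prev_row: &[u8], curr_row: &mut [u8]) {
    assert_eq!(prev_row.len(), curr_row.len());
    assert_eq!(prev_row.len() % 3, 0);
    let len = curr_row.len();
    let (mut a, mut c) = ([0u8; 4], [0u8; 4]);
    let mut i = 0;
    // Wide path: a 4-byte load is in bounds while at least 4 bytes remain.
    // Lane 3 then holds the first byte of the next pixel and is ignored.
    while i + 4 <= len {
        let b = load4(&prev_row[i..]);
        let x = load4(&curr_row[i..]);
        let out = paeth_step(&mut a, &mut c, b, x);
        // Store only 3 bytes so the next pixel's filtered data is untouched.
        curr_row[i..i + 3].copy_from_slice(&out[..3]);
        i += 3;
    }
    // Fallback: only 3 bytes remain for the last pixel, so load 3 bytes.
    if i + 3 <= len {
        let b = load3(&prev_row[i..]);
        let x = load3(&curr_row[i..]);
        let out = paeth_step(&mut a, &mut c, b, x);
        curr_row[i..i + 3].copy_from_slice(&out[..3]);
    }
}
```

In this sketch only the loads are widened. Stores stay 3 bytes wide, because a 4-byte store would overwrite the first, still filtered, byte of the next pixel in the current row.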
Results of running microbenchmarks on the author's machine:
```
$ bench --bench=unfilter --features=unstable,benchmarks -- --baseline=simd1 Paeth/bpp=[36]
...
unfilter/filter=Paeth/bpp=3
time: [18.755 µs 18.761 µs 18.767 µs]
thrpt: [624.44 MiB/s 624.65 MiB/s 624.83 MiB/s]
change:
time: [-16.148% -15.964% -15.751%] (p = 0.00 < 0.05)
thrpt: [+18.696% +18.997% +19.258%]
Performance has improved.
...
unfilter/filter=Paeth/bpp=6
time: [18.991 µs 19.000 µs 19.009 µs]
thrpt: [1.2041 GiB/s 1.2047 GiB/s 1.2052 GiB/s]
change:
time: [-15.161% -15.074% -14.987%] (p = 0.00 < 0.05)
thrpt: [+17.629% +17.750% +17.871%]
Performance has improved.
```