
[zstd][cli] Add performance counters support to bench mode #4354


Open
wants to merge 1 commit into
base: dev

Conversation

Adenilson

**NOT FOR LANDING**

This patch adds an extra parameter (-y) to benchmark mode that collects processor performance counters, making it possible to report performance stats per operation (i.e. compression vs. decompression).

We can collect the following performance counters using the Linux perf API: CPU cycles, instructions, branch misses, cache hits and cache misses.
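As a rough sketch of how these five counters could map onto the perf API: the event constants below come from `<linux/perf_event.h>`, while the names `kCounterConfig`, `open_counter` and `cache_hits` are hypothetical and not taken from the patch. The generic perf events expose cache references rather than cache hits, so this sketch assumes hits are derived as references minus misses.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical indices for the counters listed in the description. */
enum {
    CTR_CYCLES,
    CTR_INSTRUCTIONS,
    CTR_BRANCH_MISSES,
    CTR_CACHE_REFS,
    CTR_CACHE_MISSES,
    CTR_COUNT
};

/* Generic hardware events defined by the kernel header. */
static const uint64_t kCounterConfig[CTR_COUNT] = {
    [CTR_CYCLES]        = PERF_COUNT_HW_CPU_CYCLES,
    [CTR_INSTRUCTIONS]  = PERF_COUNT_HW_INSTRUCTIONS,
    [CTR_BRANCH_MISSES] = PERF_COUNT_HW_BRANCH_MISSES,
    [CTR_CACHE_REFS]    = PERF_COUNT_HW_CACHE_REFERENCES,
    [CTR_CACHE_MISSES]  = PERF_COUNT_HW_CACHE_MISSES,
};

/* Open one counter for the calling thread, on any CPU, initially disabled.
 * Returns a file descriptor, or -1 with errno set (for example when the
 * kernel or sandbox does not allow access to hardware counters). */
static int open_counter(int which)
{
    struct perf_event_attr attr;

    assert(which >= 0 && which < CTR_COUNT);
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = kCounterConfig[which];
    attr.disabled = 1;        /* start only when explicitly enabled */
    attr.exclude_kernel = 1;  /* count user-space work only */
    attr.exclude_hv = 1;

    /* glibc has no wrapper for perf_event_open, so use syscall(2). */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Assumption: hits = references - misses, clamped at zero. */
static uint64_t cache_hits(uint64_t refs, uint64_t misses)
{
    return refs > misses ? refs - misses : 0;
}
```

Opening each counter on its own file descriptor keeps the sketch simple; whether the patch groups the events or opens them separately is not stated here.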

One advantage of the Linux perf API is that it should work on any processor that runs Linux, so it should work fine on x86-64 (Intel and AMD), Arm (arm32/aarch64) and RISC-V.

The counters make it possible to generate new stats such as cycles/byte, a measure that helps compare different CPU microarchitectures and has the benefit of being independent of clock speed.
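To make the cycles/byte idea concrete, here is a minimal sketch of the arithmetic. The function names `cycles_per_byte` and `mbps_at_frequency` are hypothetical and the numbers in the usage note are illustrative, not measurements from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles spent per input byte: the cycle count for one operation
 * divided by the number of bytes it processed. */
static double cycles_per_byte(uint64_t cycles, uint64_t bytes)
{
    assert(bytes > 0);
    return (double)cycles / (double)bytes;
}

/* Throughput in MB/s (1 MB = 1e6 bytes) implied by a cycles/byte
 * figure at a given clock frequency in Hz. */
static double mbps_at_frequency(double cpb, double freq_hz)
{
    return freq_hz / cpb / 1e6;
}
```

For instance, 3000 cycles over 1000 bytes gives 3.0 cycles/byte. At 3 GHz that corresponds to 1000 MB/s and at 2 GHz to a lower MB/s figure, while the cycles/byte value stays the same, which is why it is convenient for comparing microarchitectures.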

Plus, cycles spent on I/O operations (e.g. reading files from disk), which a regular 'perf stat' would include, will not be counted, since we only capture counters during the main benchmark loop.
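A minimal sketch of scoping a counter to the measured region only, assuming a single cycles counter: the counter is opened disabled, enabled right before the workload and disabled right after, so setup work such as file reads is not counted. `measure_cycles`, `workload_fn`, `buffer_ctx` and `checksum_buffer` are hypothetical names for illustration, and the workload is a stand-in rather than the zstd benchmark loop.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

typedef void (*workload_fn)(void *ctx);

/* Count user-space CPU cycles spent inside run(ctx) only.
 * Returns 0 on success, -1 if the counter cannot be opened or read. */
static int measure_cycles(workload_fn run, void *ctx, uint64_t *cycles)
{
    struct perf_event_attr attr;
    ssize_t got;
    int fd;

    assert(run != NULL && cycles != NULL);
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;        /* nothing is counted until ENABLE */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);   /* measured region starts */
    run(ctx);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);  /* measured region ends */

    got = read(fd, cycles, sizeof(*cycles));
    close(fd);
    return got == (ssize_t)sizeof(*cycles) ? 0 : -1;
}

/* Stand-in workload: sums the bytes of an in-memory buffer. */
struct buffer_ctx {
    const unsigned char *data;
    size_t size;
    unsigned sum;
};

static void checksum_buffer(void *ctx)
{
    struct buffer_ctx *b = ctx;
    size_t i;

    b->sum = 0;
    for (i = 0; i < b->size; i++)
        b->sum += b->data[i];
}
```

A caller would fill the buffer first (any I/O happens here, uncounted) and then call `measure_cycles(checksum_buffer, &ctx, &cycles)`. On systems where hardware counters are unavailable or restricted, the call returns -1 and the workload is not run.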

This patch is still in its early stages; the idea is to gather feedback and address its current shortcomings, working towards a contribution that can land in zstd.

@Adenilson
Author

Running with the help flag should print this:
adenilson@aquario:~/compression/my-fork-zstd$ ./programs/zstd --help
*** Zstandard CLI (64-bit) v1.5.8, by Yann Collet ***

Compress or decompress the INPUT file(s); reads from STDIN if INPUT is - or not provided.

Usage: zstd [OPTIONS...] [INPUT... | -] [-o OUTPUT]
...
Benchmark options:
-b# Perform benchmarking with compression level #. [Default: 3]
-e# Test all compression levels up to #; starting level is -b#. [Default: 1]
-i# Set the minimum evaluation to time # seconds. [Default: 3]
-y# Collect CPU counters.

@Adenilson
Author

Two examples when the flag is enabled:

a) Synthetic:
adenilson@aquario:~/compression/my-fork-zstd$ ./programs/zstd -b1y

Perf cycles: 326893971910 -> 3239077 (x3.087), 487.4 MB/s, 2636.7 MB/s

1#

b) With file input:
adenilson@aquario:~/compression/my-fork-zstd$ ./programs/zstd -b1y ~/corpus/linux-5.6-rc3.tar

Perf cycles: 427627890230 -> 190860020 (x5.017), 851.1 MB/s, 2906.7 MB/s

1#

@Adenilson
Author

The basic idea is to give benchmark mode a way to report CPU stats more precisely per operation (e.g. compression vs decompression), to remove cycles spent on I/O from the equation, and to allow calculating some extra stats (e.g. cycles/byte).

@Adenilson
Author

Adenilson commented Mar 29, 2025

If this is a feature that could be helpful to zstd, I can further develop the patch to get into a "land-able" state.

This is just an early draft with the basic idea, a PoC (Proof of Concept).

@Adenilson
Author

Adenilson commented Mar 29, 2025

I considered using the RDPMC instruction, but its behavior is different between x86-64 implementations (i.e. Intel vs AMD), plus it would be x86-64 only.

On the other hand, it may be possible to collect some extra counters not available using the Linux perf API.

@Cyan4973 thoughts?

@Cyan4973
Contributor

I believe this is a good topic.
Benchmark mode is indeed useful to measure performance differences,
and adding counters to this stage is contributing to this objective.
I would just note that current -b already removes I/O operations, so it's purely a buffer-to-buffer operation.
There are also many kinds of counters that could be collected, so I guess the implementation still has a lot of choices to make.
Given it's an advanced feature, not enabled by default, I'm fine with non-portable counters that only exist on some platforms but not others.
