Profiling tools and initial optimizations #245

zdykstra · 2022-01-10T03:16:41Z

If other languages can do performance profiling, why can't bash ? Our debug log format already had the bulk of the information we need to generate per-function timing information.

run.sh -F or run.sh -E can now introduce a build-time variable, trace_term, which points to a serial device in the VM. On the host side, the serial port is attached to a FIFO which dumps data into a log file in the test environment directory. On the ZBM side, a debug trap points to the serial device. Every command executed in the interpreter causes the trap to fire, which dumps the current function stack and source (file) stack along with the current unix TS with nanosecond precision to the serial port.

Once the VM exits, a small perl tool re-assembles the trace/performance data to the format expected by Flamegraph. flamegraph.pl should be in $PATH on your local install. It'll generate ${FLAMECHART}/flamechart.svg and upload it if webify is on your system.

To protect against profiling code landing in a production release, a no-op /lib/profiling-lib.sh is installed by default that simply executes return 0 when sourced. The variable zfsbootmenu_enable_tracing must be enabled in a Dracut configuration file for the full-weight profiling library to be used.

Based on the flamechart generated, I've made the following changes:

Omit network-legacy in the test environment, it spends a bunch of time messing with the KCL, etc
Omit rootfs-block, dm, dmraid in the test environment - nothing we do needs what they provide and they're slow
Add a dirty fast-fail to get_zbm_arg. When one lookup key is sent, use bashre to check if it's present in the KCL and return immediately if it's not.
Remove zlog - preferring to directly echo to /dev/kmsg in anything but zdebug . zinfo has gone from ~24ms to 4ms.
Move the getarg loglevel check to earlier in the KCL parsing, so we don't potentially miss zinfo logs (controlling terminal zinfo now works again!)
~~Gate all zdebug calls in functions in the quick boot path with zdebug &&, avoiding building a string~~
~~Replace a get_zbm_bool primary secondary call with two separate calls, to take advantage of fast fails~~
~~Don't go into the countdown prompt if timeout=0 (zbm.skip).~~

Before: https://somebits.link/u/p/d62bdcd0da/stream
After: Flamegraph Flamechart

This is generally tested, but it could use a serious review of how the timing diff is generated and if it's valid. The downside to the profiler is that the trap, even as minimal as it is, adds a computation overhead. The timing values are consistent between runs, but they shouldn't be taken as absolute truth - they're inflated due to the time the trap takes to execute.

There are a number of Dracut optimizations that can probably still be done.

ahesford

Nice work, this will be nice to have in the test setup.

90zfsbootmenu/lib/zfsbootmenu-kcl.sh

90zfsbootmenu/libexec/zfsbootmenu-init

90zfsbootmenu/lib/zfsbootmenu-core.sh

90zfsbootmenu/module-setup.sh

90zfsbootmenu/hook/zfsbootmenu-parse-commandline.sh

90zfsbootmenu/lib/kmsg-log-lib.sh

ahesford

Even without the profiling stuff, this refactors a lot of things quite nicely.

Create /lib/profiling-lib.sh, which can be sourced by any Bash script. Once sourced, a debug trap is installed that dumps the stack (but not the command being executed) to a serial port, which collects the data on the host side into a log file. After the VM exits, the performance data is converted into a format expected by flamegraph.pl (https://github.com/brendangregg/FlameGraph) and then both a flamegraph and a flamechart are created. If profiling isn't enabled, a no-op /lib/profiling-lib.sh is created. This file must always exist, and must always be sourceable. Optimizations/performance changes include: - setting the loglevel as from the KCL as early as possible, to preserve all log messages > zwarn - validating that the loglevel is an integer, otherwise defaulting to 4 - validating that zbm.timeout/timeout is an integer, otherwise defaulting to 10 - adding re-import guards to libraries - removing the zlog() function and doing the logging work in the zerror/zwarn/... function instead. This means that everything but zdebug can be as lean as possible. - gating a couple serial zdebug calls behind a single zdebug check - added a fast-fail to timed_prompt to return 0 if delay isn't greater than 0. Any default values for delay/prompt are assigned after this test. - add a short-circuit to get_zbm_arg that uses bashre to check if the kcl option string is present at all in the KCL. If it's not, the next option passed to get_zbm_arg is evaluated. - Avoid pre-computing a return value until it's actually needed - cut i18n out of testing builds unless a graphical interface is requested - cut crypt-ssh and network-legacy out of testing builds unless SSH access is requested - cut out 'nvdimm fs-lib rootfs-block dm dmraid lunmask' from testing builds. These can likely be cut from release builds as well.

zdykstra requested a review from ahesford January 10, 2022 03:16

zdykstra force-pushed the but-why-dot-jpg branch from 2350047 to 66e486c Compare January 10, 2022 15:32

ahesford reviewed Jan 10, 2022

View reviewed changes

90zfsbootmenu/lib/zfsbootmenu-kcl.sh Outdated Show resolved Hide resolved

90zfsbootmenu/libexec/zfsbootmenu-init Outdated Show resolved Hide resolved

zdykstra force-pushed the but-why-dot-jpg branch 24 times, most recently from 6d76f0c to 7cd4af7 Compare January 11, 2022 23:26

ahesford requested changes Jan 12, 2022

View reviewed changes

zdykstra force-pushed the but-why-dot-jpg branch 2 times, most recently from 848558e to 45b8f3c Compare January 13, 2022 01:44

ahesford approved these changes Jan 13, 2022

View reviewed changes

zdykstra force-pushed the but-why-dot-jpg branch from 45b8f3c to 60771c8 Compare January 13, 2022 03:10

zdykstra closed this in 89b6273 Jan 13, 2022

zdykstra deleted the but-why-dot-jpg branch January 14, 2022 18:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Profiling tools and initial optimizations #245

Profiling tools and initial optimizations #245

Uh oh!

zdykstra commented Jan 10, 2022 •

edited

Loading

Uh oh!

ahesford left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ahesford left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Profiling tools and initial optimizations #245

Profiling tools and initial optimizations #245

Uh oh!

Conversation

zdykstra commented Jan 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahesford left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ahesford left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

zdykstra commented Jan 10, 2022 •

edited

Loading