Skip to content

Conversation

@zdykstra
Copy link
Member

@zdykstra zdykstra commented Jan 10, 2022

If other languages can do performance profiling, why can't bash ? Our debug log format already had the bulk of the information we need to generate per-function timing information.

run.sh -F or run.sh -E can now introduce a build-time variable, trace_term, which points to a serial device in the VM. On the host side, the serial port is attached to a FIFO which dumps data into a log file in the test environment directory. On the ZBM side, a debug trap points to the serial device. Every command executed in the interpreter causes the trap to fire, which dumps the current function stack and source (file) stack along with the current unix TS with nanosecond precision to the serial port.

Once the VM exits, a small perl tool re-assembles the trace/performance data to the format expected by Flamegraph. flamegraph.pl should be in $PATH on your local install. It'll generate ${FLAMECHART}/flamechart.svg and upload it if webify is on your system.

To protect against profiling code landing in a production release, a no-op /lib/profiling-lib.sh is installed by default that simply executes return 0 when sourced. The variable zfsbootmenu_enable_tracing must be enabled in a Dracut configuration file for the full-weight profiling library to be used.

Based on the flamechart generated, I've made the following changes:

  • Omit network-legacy in the test environment, it spends a bunch of time messing with the KCL, etc
  • Omit rootfs-block, dm, dmraid in the test environment - nothing we do needs what they provide and they're slow
  • Add a dirty fast-fail to get_zbm_arg. When one lookup key is sent, use bashre to check if it's present in the KCL and return immediately if it's not.
  • Remove zlog - preferring to directly echo to /dev/kmsg in anything but zdebug . zinfo has gone from ~24ms to 4ms.
  • Move the getarg loglevel check to earlier in the KCL parsing, so we don't potentially miss zinfo logs (controlling terminal zinfo now works again!)
  • Gate all zdebug calls in functions in the quick boot path with zdebug &&, avoiding building a string
  • Replace a get_zbm_bool primary secondary call with two separate calls, to take advantage of fast fails
  • Don't go into the countdown prompt if timeout=0 (zbm.skip).

Before: https://somebits.link/u/p/d62bdcd0da/stream
After: Flamegraph Flamechart

This is generally tested, but it could use a serious review of how the timing diff is generated and if it's valid. The downside to the profiler is that the trap, even as minimal as it is, adds a computation overhead. The timing values are consistent between runs, but they shouldn't be taken as absolute truth - they're inflated due to the time the trap takes to execute.

There are a number of Dracut optimizations that can probably still be done.

Copy link
Member

@ahesford ahesford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, this will be nice to have in the test setup.

@zdykstra zdykstra force-pushed the but-why-dot-jpg branch 24 times, most recently from 6d76f0c to 7cd4af7 Compare January 11, 2022 23:26
@zdykstra zdykstra force-pushed the but-why-dot-jpg branch 2 times, most recently from 848558e to 45b8f3c Compare January 13, 2022 01:44
Copy link
Member

@ahesford ahesford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even without the profiling stuff, this refactors a lot of things quite nicely.

Create /lib/profiling-lib.sh, which can be sourced by any Bash script.
Once sourced, a debug trap is installed that dumps the stack (but not
the command being executed) to a serial port, which collects the data on
the host side into a log file. After the VM exits, the performance data
is converted into a format expected by flamegraph.pl
(https://github.com/brendangregg/FlameGraph) and then both a flamegraph
and a flamechart are created.

If profiling isn't enabled, a no-op /lib/profiling-lib.sh is created.
This file must always exist, and must always be sourceable.

Optimizations/performance changes include:

- setting the loglevel as from the KCL as early as possible, to preserve
  all log messages > zwarn
- validating that the loglevel is an integer, otherwise defaulting to 4
- validating that zbm.timeout/timeout is an integer, otherwise
  defaulting to 10
- adding re-import guards to libraries
- removing the zlog() function and doing the logging work in the
  zerror/zwarn/... function instead. This means that everything but
  zdebug can be as lean as possible.
- gating a couple serial zdebug calls behind a single zdebug check
- added a fast-fail to timed_prompt to return 0 if delay isn't greater
  than 0. Any default values for delay/prompt are assigned after this
  test.
- add a short-circuit to get_zbm_arg that uses bashre to check if the
  kcl option string is present at all in the KCL. If it's not, the next
  option passed to get_zbm_arg is evaluated.
- Avoid pre-computing a return value until it's actually needed
- cut i18n out of testing builds unless a graphical interface is
  requested
- cut crypt-ssh and network-legacy out of testing builds unless SSH
  access is requested
- cut out 'nvdimm fs-lib rootfs-block dm dmraid lunmask' from testing
  builds. These can likely be cut from release builds as well.
@zdykstra zdykstra closed this in 89b6273 Jan 13, 2022
@zdykstra zdykstra deleted the but-why-dot-jpg branch January 14, 2022 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants