Add documentation for new features #185

Merged
merged 6 commits into from
Apr 10, 2025
100 changes: 100 additions & 0 deletions docs/source/analysis.rst
@@ -104,6 +104,100 @@ platforms. Plugging these numbers into the equation for code divergence gives
not to (more on that later).


Running ``cbi-tree``
####################

Running ``codebasin`` provides an overview of divergence and coverage, which
can be useful when we want to familiarize ourselves with a new code base,
compare the impact of different code structures upon certain metrics, or track
specialization metrics over time. However, it doesn't provide any *actionable*
insight into how to improve a code base.

To understand how much specialization exists in each source file, we can
substitute ``cbi-tree`` for ``codebasin``::

$ cbi-tree analysis.toml

This command performs the same analysis as ``codebasin``, but produces a tree
annotated with information about which files contain specialization:

.. code-block:: text
:emphasize-lines: 8,9,11,16

Legend:
A: cpu
B: gpu

Columns:
[Platforms | SLOC | Coverage (%) | Avg. Coverage (%)]

[AB | 33 | 93.94 | 72.73] o /home/username/code-base-investigator/docs/sample-code-base/src/
[AB | 13 | 100.00 | 92.31] ├── main.cpp
[A- | 7 | 85.71 | 42.86] ├─o cpu/
[A- | 7 | 85.71 | 42.86] │ └── foo.cpp
[AB | 6 | 100.00 | 100.00] ├─o third-party/
[AB | 1 | 100.00 | 100.00] │ ├── library.h
[AB | 5 | 100.00 | 100.00] │ └── library.cpp
[-B | 7 | 85.71 | 42.86] └─o gpu/
[-B | 7 | 85.71 | 42.86] └── foo.cpp

.. tip::

Running ``cbi-tree`` in a modern terminal environment produces colored
output to improve usability for large code bases.

Each node in the tree represents a source file or directory in the code
base and is annotated with four pieces of information:

1. **Platforms**

The set of platforms that use the file or directory.

2. **SLOC**

The number of source lines of code (SLOC) in the file or directory.

3. **Coverage (%)**

The amount of code in the file or directory that is used by at least one
platform, as a percentage of SLOC.

4. **Avg. Coverage (%)**

The amount of code in the file or directory that is used by each platform,
on average, as a percentage of SLOC.

The root of the tree represents the entire code base, and so the values in
the annotations match the ``codebasin`` results: two platforms (``A`` and
``B``) use the directory, there are 33 lines in total, 93.94% of those lines
(i.e., 31 lines) are used by at least one platform, and each platform uses
72.73% of those lines (i.e., 24 lines) on average. By walking the tree, we can
break these numbers down across the individual files and directories in the
code base.

Starting with ``main.cpp``, we can see that it is used by both platforms
(``A`` and ``B``), and that 100% of the 13 lines in the file are used by at
least one platform. However, the average coverage is only 92.31%, reflecting
that each platform uses only 12 of those lines.

Turning our attention to ``cpu/foo.cpp`` and ``gpu/foo.cpp``, we can see
that they are each specialized for one platform (``A`` and ``B``,
respectively). The coverage for both files is only 85.71% (i.e., 6 of the 7
lines), which tells us that both files contain some unused code (i.e., 1 line).
The average coverage of 42.86% highlights the extent of the specialization.

.. tip::

Looking at average coverage is the best way to identify highly specialized
regions of code. As the number of platforms targeted by a code base
increases, the average coverage for files used by only a small number of
platforms will approach zero.
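
To see why, consider a file that is used by only one platform. The sketch
below is an illustration rather than part of CBI: ``average_coverage`` is a
hypothetical helper, and it assumes that every other platform uses none of the
file's lines.

```python
def average_coverage(used_lines, sloc, num_platforms):
    """Average coverage (%) of a file whose lines are used by exactly one platform.

    Assumption: the other ``num_platforms - 1`` platforms use zero lines.
    """
    per_platform = [used_lines] + [0] * (num_platforms - 1)
    return 100.0 * sum(per_platform) / (num_platforms * sloc)


# A 7-line file with 6 lines used by a single platform, as in cpu/foo.cpp.
for n in (2, 4, 8, 16):
    print(f"{n:2d} platforms: {average_coverage(6, 7, n):6.2f}%")
```

With two platforms this reproduces the 42.86% reported for ``cpu/foo.cpp``;
each doubling of the platform count halves the result.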

The remaining files all have a coverage of 100.00% and an average coverage
of 100.00%. This is our ideal case: all of the code in the file is used by
at least one platform, and all of the platforms use all of the code.
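
The arithmetic behind the annotations discussed above can be sketched in a few
lines of Python. This is not CBI's implementation: ``specialization_metrics``
is a hypothetical helper, and the specific line numbers assigned to each
platform are invented so that the counts match the tree (12 of 13 lines per
platform for ``main.cpp``, 6 of 7 lines for ``cpu/foo.cpp``).

```python
def specialization_metrics(sloc, used_by_platform):
    """Return (coverage %, average coverage %) for one file or directory.

    ``used_by_platform`` maps a platform name to the set of line numbers
    that the platform uses. It is assumed to contain at least one platform.
    """
    # Coverage counts lines used by at least one platform.
    used_by_any = set().union(*used_by_platform.values())
    coverage = 100.0 * len(used_by_any) / sloc

    # Average coverage is the mean, over platforms, of the fraction used.
    total_used = sum(len(lines) for lines in used_by_platform.values())
    avg_coverage = 100.0 * total_used / (len(used_by_platform) * sloc)
    return round(coverage, 2), round(avg_coverage, 2)


# main.cpp: 13 lines, each platform uses 12, together they use all 13.
main_cpp = specialization_metrics(
    13, {"cpu": set(range(1, 13)), "gpu": set(range(2, 14))}
)

# cpu/foo.cpp: 7 lines, 6 used by cpu, none used by gpu.
cpu_foo = specialization_metrics(7, {"cpu": set(range(1, 7)), "gpu": set()})

print(main_cpp)  # (100.0, 92.31)
print(cpu_foo)   # (85.71, 42.86)
```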


Filtering Platforms
###################

@@ -125,3 +219,9 @@ platform as follows:
.. code:: sh

$ codebasin -p cpu analysis.toml

or

.. code:: sh

$ cbi-tree -p cpu analysis.toml
73 changes: 72 additions & 1 deletion docs/source/cmd.rst
@@ -25,13 +25,15 @@ Command Line Interface
``-q, --quiet``
Decrease verbosity level.

``--debug``
Enable debug mode.

``-R <report>``
Generate a report of the specified type.

- ``summary``: code divergence information
- ``clustering``: distance matrix and dendrogram
- ``duplicates``: detected duplicate files
- ``files``: information about individual files

``-x <pattern>, --exclude <pattern>``
Exclude files matching this pattern from the code base.
@@ -41,3 +43,72 @@ Command Line Interface
Include the specified platform in the analysis.
May be specified multiple times.
If not specified, all platforms will be included.

Tree Tool
---------

The tree tool generates a visualization of the code base where each file and
directory is annotated with information about platform usage and coverage.

.. code-block:: text

cbi-tree [-h] [--version] [-x <pattern>] [-p <platform>] [--prune] [-L <level>] <analysis-file>

**positional arguments:**

``analysis-file``
TOML file describing the analysis to be performed, including the code base and platform descriptions.

**options:**

``-h, --help``
Display help message and exit.

``--version``
Display version information and exit.

``-x <pattern>, --exclude <pattern>``
Exclude files matching this pattern from the code base.
May be specified multiple times.

``-p <platform>, --platform <platform>``
Include the specified platform in the analysis.
May be specified multiple times.
If not specified, all platforms will be included.

``--prune``
Prune unused files from the tree.

``-L <level>, --levels <level>``
Print only the specified number of levels.
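
When ``cbi-tree`` is driven from a script, the options above can be assembled
programmatically. The sketch below uses only the flags documented in this
section; the helper ``cbi_tree_command`` is a hypothetical name, not part of
CBI, and the resulting list is suitable for passing to ``subprocess.run``.

```python
import shlex


def cbi_tree_command(analysis_file, platforms=(), excludes=(), prune=False, levels=None):
    """Build an argument list for cbi-tree from the documented options."""
    argv = ["cbi-tree"]
    for pattern in excludes:
        argv += ["-x", pattern]  # may be specified multiple times
    for platform in platforms:
        argv += ["-p", platform]  # may be specified multiple times
    if prune:
        argv.append("--prune")
    if levels is not None:
        argv += ["-L", str(levels)]
    argv.append(analysis_file)
    return argv


command = cbi_tree_command("analysis.toml", platforms=["cpu"], prune=True, levels=2)
print(shlex.join(command))  # cbi-tree -p cpu --prune -L 2 analysis.toml
```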

Coverage Tool
-------------

The coverage tool reads a JSON compilation database and generates a JSON
coverage file that is suitable to be read by other tools.

.. code-block:: text

cbi-cov compute [-h] [-S <path>] [-x <pattern>] [-o <output path>] <input path>

**positional arguments:**

``input path``
Path to compilation database JSON file.

**options:**

``-h, --help``
Display help message and exit.

``-S <path>, --source-dir <path>``
Path to source directory.

``-x <pattern>, --exclude <pattern>``
Exclude files matching this pattern from the code base.
May be specified multiple times.

``-o <output path>, --output <output path>``
Path to coverage JSON file.
If not specified, defaults to 'coverage.json'.
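
As an illustration of the expected input, the sketch below builds a small
compilation database in memory. It assumes the conventional JSON compilation
database layout (a list of entries with ``directory``, ``command`` and
``file`` keys); the helper name, the compiler, the flags and the source paths
are all placeholders rather than values taken from CBI.

```python
import json
import shlex


def compilation_database(directory, sources, compiler="g++", flags=()):
    """Return compilation database entries for the given source files."""
    return [
        {
            "directory": directory,
            "command": shlex.join([compiler, *flags, "-c", source]),
            "file": source,
        }
        for source in sources
    ]


entries = compilation_database(
    "/path/to/sample-code-base/src", ["main.cpp", "cpu/foo.cpp"], flags=["-O2"]
)
text = json.dumps(entries, indent=2)
print(text)
```

Saving ``text`` as ``compile_commands.json`` would then allow a command such
as ``cbi-cov compute -S /path/to/sample-code-base/src -o coverage.json
compile_commands.json`` to be run against it.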
61 changes: 54 additions & 7 deletions docs/source/emulating-compiler-behavior.rst
@@ -7,8 +7,8 @@ that are not reflected on the command line (such as their default include
paths, or compiler version macros).

If we believe (or already know!) that these behaviors will impact the
-divergence calculation for a code base, we can use a configuration file to
-instruct CBI to append additional options when emulating certain compilers.
+CBI's analysis of a code base, we can use a configuration file to append
+additional options when emulating certain compilers.

.. attention::

@@ -76,12 +76,12 @@ but there is not enough information to decide what the value of
:code:`__GNUC__` should be.


-Defining Behaviors
-------------------
+Defining Implicit Options
+-------------------------

-``codebasin`` searches for a file called ``.cbi/config``, and uses the
-information found in that file to determine implicit compiler behavior. Each
-compiler definition is a TOML `table`_, of the form shown below:
+CBI searches for a file called ``.cbi/config``, and uses the information found
+in that file to determine implicit compiler options. Each compiler definition
+is a TOML `table`_, of the form shown below:

.. _`table`: https://toml.io/en/v1.0.0#table

@@ -124,3 +124,50 @@ becomes:
Coverage (%): 100.00
Avg. Coverage (%): 70.37
Total SLOC: 27


Parsing Compiler Options
------------------------

In more complex cases, emulating a compiler's implicit behavior requires CBI to
parse the command-line arguments passed to the compiler. Such emulation
requires CBI to understand which options are important and how they impact
compilation.

CBI ships with a number of built-in compiler definitions (see `here`_), and the
same syntax can be used to define custom compiler behaviors within the
``.cbi/config`` file.

.. _`here`: https://github.com/intel/code-base-investigator/tree/main/codebasin/compilers

For example, the TOML file below defines behavior for the ``gcc`` and ``g++`` compilers:

.. code-block:: toml

[compiler.gcc]
# This example does not define any implicit options.

# g++ inherits all options of gcc.
[compiler."g++"]
alias_of = "gcc"

# The -fopenmp flag enables a dedicated OpenMP compiler "mode".
[[compiler.gcc.parser]]
flags = ["-fopenmp"]
action = "append_const"
dest = "modes"
const = "openmp"

# In OpenMP mode, the _OPENMP macro is defined.
[[compiler.gcc.modes]]
name = "openmp"
defines = ["_OPENMP"]

This functionality is intended for expert users. In most cases, we expect that
defining implicit options or relying on CBI's built-in compiler emulation
support will be sufficient.

.. attention::

If you encounter a common case where a custom compiler definition is
required, please `open an issue`_.
41 changes: 32 additions & 9 deletions docs/source/features.rst
@@ -13,18 +13,18 @@ Although limited, this functionality is sufficient to support analysis of many
HPC codes, and CBI has been tested on C, C++, CUDA and some Fortran code bases.


-Computing Code Divergence
-#########################
+Computing Specialization Metrics
+################################

-CBI computes code divergence by building a *specialization tree*, like the one
-shown below:
+CBI computes code divergence and platform coverage by building a
+*specialization tree*, like the one shown below:

.. image:: specialization-tree.png
:alt: An example of a specialization tree.

CBI can then walk and evaluate this tree for different platform definitions, to
-produce a divergence report providing a breakdown of how many lines of code
-are shared between different platform sets.
+produce a report providing a breakdown of how many lines of code are shared
+between different platform sets.

.. code:: text

@@ -46,9 +46,7 @@ are shared between different platform sets.
Avg. Coverage (%): 42.44
Total SLOC: 41

-Future releases of CBI will provide additional ways to visualize the results of
-this analysis, in order to highlight exactly *which* lines of code correspond
-to different platform sets.
+For more information about these metrics, see :doc:`here <specialization>`.


Hierarchical Clustering
@@ -76,3 +74,28 @@ hierarchical clustering by platform similarity.

.. image:: example-dendrogram.png
:alt: A dendrogram representing the distance between platforms.


Visualizing Platform Coverage
#############################

To assist developers in identifying exactly which parts of their code are
specialized and for which platforms, CBI can produce an annotated tree showing
the amount of specialization within each file.

.. code:: text

Legend:
A: cpu
B: gpu

Columns:
[Platforms | SLOC | Coverage (%) | Avg. Coverage (%)]

[AB | 1.0k | 2.59 | 1.83] o /path/to/sample-code-base/src/
[-- | 1.0k | 0.00 | 0.00] |-- unused.cpp
[AB | 13 | 100.00 | 92.31] |-- main.cpp
[A- | 7 | 100.00 | 50.00] |-o cpu/
[A- | 7 | 100.00 | 50.00] | \-- foo.cpp
[-B | 7 | 100.00 | 50.00] \-o gpu/
[-B | 7 | 100.00 | 50.00] \-- foo.cpp