Add option to define platforms on the command line #49

Pennycook · 2024-02-06T17:05:18Z

Similar to #42 , the goal of this PR is to make it simpler for users to interact with Code Base Investigator by providing a command-line option that minimizes YAML interaction.

Instead of this:

codebase:
    files: []
    platforms: [ "CPU", "GPU" ]

CPU:
    commands: "cpu_compile_commands.json"

GPU:
    commands: "gpu_compile_commands.json"

codebasin -c config.yaml

It is now possible to write this:

codebasin -p cpu.json -p gpu.json

Or, if a configuration file is still being used, it is possible to include only a subset of the files in the analysis:

codebasin -c config.yaml -p CPU

Related issues

N/A

Proposed changes

Adds --platform (-p) option to specify platform names and/or compilation database paths.
Interprets -p name as a request to use a pre-defined platform, effectively acting as a filter.
Interprets -p path.json as a request to use an implicitly defined platform, simplifying configuration.
Adjusts warnings and errors to help users reason about -p behavior, and to disable config file warnings when absent.

Pennycook · 2024-02-14T15:53:28Z

@laserkelvin - What do you think of the new API proposed here? I'm a little worried that it might be hard to understand, because it's not necessarily obvious why -p or --platform expects multiple arguments. I'm also worried that it might be hard to extend, because we won't be able change the number of arguments without breaking existing command lines.

I had a few ideas for alternatives, but none of them leap out as being significantly better...

Context/Order-aware Options

codebasin --platform CPU --commands cpu_compile_commands.json --platform GPU --commands gpu_compile_commands.json
codebasin -p CPU -C cpu_compile_commands.json -p GPU -C gpu_compile_commands.json

Separate Lists of Commands and (Optional?) Platform Names

codebasin --commands cpu_compile_commands.json,gpu_compile_commands.json --platform-names CPU,GPU
codebasin -C cpu_compile_commands.json,gpu_compile_commands.json -p CPU,GPU

codebasin --commands cpu_compile_commands.json gpu_compile_commands.json --platform-names CPU GPU
codebasin -C cpu_compile_commands.json gpu_compile_commands.json -p CPU GPU

Mappings from Platform Names to Commands

codebasin --platform CPU:cpu_compile_commands.json --platform GPU:gpu_compile_commands.json
codebasin -p CPU:cpu_compile_commands.json -p GPU:gpu_compile_commands.json

codebasin --platform CPU=cpu_compile_commands.json --platform GPU=gpu_compile_commands.json
codebasin -p CPU=cpu_compile_commands.json -p GPU=gpu_compile_commands.json

Can you think of any better way to do this?

laserkelvin · 2024-02-14T16:21:35Z

Honestly this kind of thing is near the boundary of "should we configure something this complicated via the CLI" so i'm not 100% sure, in that this level of composability is probably better suited for a config/input file.

If I had to choose, the mapping is probably closest to what I like, since it's explicit in what it wants albeit more verbose for the user. This post demos providing key/value pairs by creating a custom argparse.Action, and I would think the mode of interaction would then look like this:

codebasin --platforms CPU=cpu_compile_commands.json GPU=gpu_compile_commands.json

So the user will just specify all of the platform/config mappings in one go, which is more strongly hinted by using platforms (plural) as the flag. You can add this as an example to the help string for it as well.

Pennycook · 2024-02-14T16:39:58Z

Honestly this kind of thing is near the boundary of "should we configure something this complicated via the CLI" so i'm not 100% sure, in that this level of composability is probably better suited for a config/input file.

I agree. My preference here is actually the multi-command version, because I think it's quite explicit:

cbi define CPU cpu_compile_commands.json
cbi define GPU gpu_compile_commands.json
cbi analyze

Maybe I've been too hasty in suggesting we throw away our analysis configuration files completely? Maybe the issue is that they're just too complicated, because of 1) all the old glob stuff; and 2) YAML? Something like the below would be quite a bit nicer:

[codebase]
source-dir = "."

[platform.cpu]
commands = "cpu_compile_commands.json"
build-dir = "./build/cpu/"

[platform.gpu]
commands = "gpu_compile_commands.json"
build-dir = "./build/gpu/"

...but I can't see an easy way to replace our existing YAML file with a TOML file without breaking lots of stuff. And ideally, I'd still like a way to specify all this stuff without creating a file!

If I had to choose, the mapping is probably closest to what I like, since it's explicit in what it wants albeit more verbose for the user. This post demos providing key/value pairs by creating a custom argparse.Action, and I would think the mode of interaction would then look like this:
codebasin --platforms CPU=cpu_compile_commands.json GPU=gpu_compile_commands.json
So the user will just specify all of the platform/config mappings in one go, which is more strongly hinted by using platforms (plural) as the flag. You can add this as an example to the help string for it as well.

I'll spend some time prototyping this (and keep thinking about this question of flags vs config files).

Pennycook · 2024-02-15T15:34:09Z

Apologies in advance for the long post, @laserkelvin...

After our discussion yesterday, I did some playing around with argparse and did some research. It turns out there's a very old bug in argparse that affects its handling of options with an unknown number of values. The GNU documentation says that POSIX convention is to only accept one value per argument (presumably because somebody realized the alternative would be hard to parse in certain corner cases). So, while we could make something like --platforms A=B C=D work, it might make it harder/impossible for us to support certain other options in future in a user-friendly way.

So, I took a step back and tried to think about why I wanted a -p flag in the first place, and what it is that I don't like about the configuration file workflow. I think I've come up with something that will satisfy us both.

Use-Case 1

User has a configuration file with many platform definitions in it (e.g., SYCL, CUDA, HIP, OpenMP, Kokkos).
User wants to run the analysis for some subset of those platforms.

With the current interface, the only way to do this is to edit the configuration file to change the platform list, or maintain multiple configuration files that differ only in their platform list. I found this really annoying when working on the P3HPC paper.

To address this, I think what we want is a way to filter the platforms that will be used in the analysis. We can do that by accepting only a name, since the rest of the platform configuration is described in the config file already:

# Run the analysis using all platforms from the config file
codebasin -c config.yaml

# Run the analysis for platforms "cpu" and "gpu", which must exist in the config file
codebasin -c config.yaml -p cpu -p gpu

Use-Case 2

User wants to invoke codebasin as part of a pipeline that produces compile_commands.json files.
User does not want to create an analysis file.

We can build on the version of -p that accepts only one argument. If it's a string, then we treat it as a name to be found in the configuration file. If it's a path to a .json file, then we treat it as an inline platform definition:

# Run the analysis for two implicitly defined platforms
codebasin -p /path/to/cpu_commands.json -p /path/to/gpu_commands.json

# Run the analysis for a mix of pre-defined and implicitly defined platforms
codebasin -c config.yaml -p cpu -p /path/to/gpu_commands.json

We'd need to come up with some rules/conventions for how we name platforms, but that doesn't seem too bad. Even if we just used the filename (or part of it) the user would still have control over how the platform appeared in CBI output.

Use-Case 3

Similar to Use-Case 1, user has many different platform definitions.
User wants to build up an analysis from modular platform definitions, instead of maintaining one giant file.

This time, we teach -p to accept a standalone platform definition in a yet-to-be-defined TOML format:

# Run the analysis for two explicitly defined platforms from different files
codebasin -p /path/to/platforms/cpu.toml -p /path/to/platforms/gpu.toml

# Do all the things
codebasin -c config.yaml -p cpu -p /path/to/gpu_commands.json -p /path/to/platforms/fpga.toml

In 99% of cases we should be able to distinguish between the different cases from context (e.g., whether things look like paths, the file extension). If a path is ambiguous, we can check to see if it validates against the compilation database or platform schema (and it will only ever match one). If we still can't tell, we can throw an error.

The documentation just needs to say that we accept -p <PLATFORM>, explain the valid PLATFORM values, and say that if no -p flags are supplied then the analysis uses all platforms it knows about.

What do you think? My proposal is that we implement the functionality for use-cases 1) and 2) initially, and leave the functionality associated with use-case 3) for later. We can address it as part of a move from YAML to TOML for the configuration files.

The set of platforms in an analysis can now be specified using: --platform <platform> When combined with a configuration file, this new option can be used to limit analysis to a subset of platforms, as below: -c config.yaml --platform CPU When the platform specification is a .json file, it is interpreted as a definition of a new platform: --platform /path/to/cpu.json --platform /path/to/gpu.json Both options are intended to improve user productivity, by minimizing the amount of time spent editing configuration files when making minor adjustments to an existing analysis. Signed-off-by: John Pennycook <[email protected]>

Pennycook · 2024-02-21T19:46:31Z

@laserkelvin - I think I've implemented what we discussed. Please take a look and let me know what you think.

Note that although it's not really necessary right now, the ".json" requirement is groundwork for a future PR. If we want -p to eventually accept .toml files as well, it's a good idea to require explicit file extensions from the get-go.

laserkelvin

Just had two minor things

codebasin.py

argparse will always interpret values passed to -p as strings. Signed-off-by: John Pennycook <[email protected]>

After internal testing and review, we reached the conclusion that using -p to both define and filter platforms was too confusing and difficult to teach. Additionally, there are concerns that use of flags to define platforms may not scale, since real-life use-cases may require more than a list of commands. This commit is a partial revert of intel#49: the ability to filter platforms remains, but the ability to define platforms directly via JSON is removed. Signed-off-by: John Pennycook <[email protected]>

Pennycook added the enhancement New feature or request label Feb 6, 2024

Pennycook added this to the 1.2.0 milestone Feb 6, 2024

Pennycook requested a review from laserkelvin February 6, 2024 17:05

Pennycook changed the title ~~Platform option~~ Add option to define platforms on the command line Feb 6, 2024

Pennycook marked this pull request as draft February 14, 2024 16:42

Pennycook force-pushed the platform-option branch from 5390075 to ffb87d9 Compare February 21, 2024 19:38

Pennycook marked this pull request as ready for review February 21, 2024 19:44

laserkelvin reviewed Feb 23, 2024

View reviewed changes

codebasin.py Outdated Show resolved Hide resolved

codebasin.py Show resolved Hide resolved

Remove unnecessary type check

fee9460

argparse will always interpret values passed to -p as strings. Signed-off-by: John Pennycook <[email protected]>

laserkelvin approved these changes Feb 23, 2024

View reviewed changes

Pennycook merged commit 0ebef0f into intel:main Feb 23, 2024

Pennycook deleted the platform-option branch February 23, 2024 19:46

Pennycook mentioned this pull request Mar 6, 2024

Simplify platform option #76

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option to define platforms on the command line #49

Add option to define platforms on the command line #49

Pennycook commented Feb 6, 2024 •

edited

Loading

Pennycook commented Feb 14, 2024

laserkelvin commented Feb 14, 2024

Pennycook commented Feb 14, 2024 •

edited

Loading

Pennycook commented Feb 15, 2024

Pennycook commented Feb 21, 2024

laserkelvin left a comment

Add option to define platforms on the command line #49

Add option to define platforms on the command line #49

Conversation

Pennycook commented Feb 6, 2024 • edited Loading

Related issues

Proposed changes

Pennycook commented Feb 14, 2024

Context/Order-aware Options

Separate Lists of Commands and (Optional?) Platform Names

Mappings from Platform Names to Commands

laserkelvin commented Feb 14, 2024

Pennycook commented Feb 14, 2024 • edited Loading

Pennycook commented Feb 15, 2024

Use-Case 1

Use-Case 2

Use-Case 3

Pennycook commented Feb 21, 2024

laserkelvin left a comment

Choose a reason for hiding this comment

Pennycook commented Feb 6, 2024 •

edited

Loading

Pennycook commented Feb 14, 2024 •

edited

Loading