torchroofline

Analytical roofline analysis for PyTorch models: theoretical compute and memory limits, no GPU required.

Given a PyTorch program (in KernelBench format) and a target GPU, torchroofline estimates theoretical roofline bounds for compute and memory, including an estimate of how much operator fusion reduces HBM traffic.

NOTE: This is a quick prototype. Please use with caution. I am still iterating on the design.

Setup

Development (with KernelBench integration)

git clone --recurse-submodules https://github.com/simonguozirui/simple-torchroofline.git
cd simple-torchroofline
uv sync

If you already cloned without submodules:

git submodule update --init --recursive
uv sync

Install as package

uv add torchroofline
# or
pip install torchroofline

For visualization:

uv add "torchroofline[viz]"

Quick Start

import torch
import torch.nn as nn
import torchroofline

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
)

x = torch.randn(1, 512)
result = torchroofline.analyze(model, (x,), gpu="h100-sxm")
result.print()

Output:

================================================================
  torchroofline — model
  GPU: H100   Precision: fp32
================================================================
  FLOPs          :       0.30 GFLOPs
  Params         :       0.16 M

  HBM traffic (unfused) :     0.002 GB  <- eager upper bound
  HBM traffic (fused)   :     0.001 GB  <- inductor lower bound
  Fusion reduction      :      50.0 %

  Arithmetic intensity  :     300.0 FLOPs/Byte
  Ridge point (fp32)    :      20.0 FLOPs/Byte

  Compute SoL    :     0.0045 ms
  Memory  SoL    :     0.0003 ms
  *** Bottleneck : COMPUTE BOUND ***
================================================================

Method

Compute Roofline (FLOPs)

FLOPs are counted using ptflops, which traces the model and counts multiply-accumulate operations (MACs) for each layer. We report FLOPs = MACs × 2.
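
A minimal sketch of this step using ptflops directly (flop_counter.py wraps this; its exact options may differ):

import torch.nn as nn
from ptflops import get_model_complexity_info

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

# ptflops reports MACs; the input shape excludes the batch dimension.
macs, params = get_model_complexity_info(
    model, (512,), as_strings=False, print_per_layer_stat=False
)
print(f"{2 * macs:,} FLOPs, {params:,} params")  # FLOPs = MACs x 2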

Memory Roofline (HBM Traffic)

Memory traffic is estimated by tracing the model with torch.fx and analyzing the operation graph:

  1. Unfused estimate: Every operation reads its inputs from HBM and writes its output back to HBM (eager-execution upper bound)
  2. Fused estimate: Simulates operator fusion by grouping consecutive pointwise operations, so that only fusion-boundary tensors hit HBM (Inductor-style lower bound; a sketch follows the classification list below)

Operations are classified as:

  • Zero-copy: view, reshape, permute, etc. (no memory traffic)
  • Pointwise: relu, add, mul, etc. (fusible)
  • Reduction: sum, mean, softmax, etc.
  • Compute: matmul, conv, attention, etc.
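
A minimal sketch of the fused-traffic estimate (greedy grouping over the fx graph; the classifier here is abridged, and zero-copy view ops, module weights, and dtype handling are omitted for brevity):

import torch
import torch.nn as nn
import torch.fx as fx
from torch.fx.passes.shape_prop import ShapeProp

POINTWISE_MODULES = (nn.ReLU, nn.GELU, nn.Sigmoid)    # fusible (abridged)
POINTWISE_FUNCS = (torch.relu, torch.add, torch.mul)  # fusible (abridged)

def fused_traffic_bytes(model, example_input, bytes_per_elem=4):
    gm = fx.symbolic_trace(model)
    ShapeProp(gm).propagate(example_input)  # annotate nodes with output shapes

    def is_pointwise(node):
        if node.op == "call_module":
            return isinstance(gm.get_submodule(node.target), POINTWISE_MODULES)
        return node.op == "call_function" and node.target in POINTWISE_FUNCS

    def nbytes(node):
        meta = node.meta.get("tensor_meta")
        return meta.shape.numel() * bytes_per_elem if meta is not None else 0

    # Greedily chain consecutive pointwise nodes into fusion groups;
    # everything else forms a singleton group.
    groups, chain = [], []
    for node in gm.graph.nodes:
        if node.op in ("placeholder", "output"):
            continue
        if is_pointwise(node):
            chain.append(node)
        else:
            if chain:
                groups.append(chain)
                chain = []
            groups.append([node])
    if chain:
        groups.append(chain)

    total = 0
    for group in groups:
        members = set(group)
        for node in group:
            # Reads: only inputs produced outside the group come from HBM.
            total += sum(nbytes(i) for i in node.all_input_nodes if i not in members)
            # Writes: only outputs consumed outside the group go back to HBM.
            if any(u not in members for u in node.users):
                total += nbytes(node)
    return total

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
print(fused_traffic_bytes(model, torch.randn(1, 512)))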

Precision Handling

The precision argument ("fp32", "fp16", "bf16") affects analysis in two ways:

  1. Memory traffic: Model and inputs are cast to the target dtype. fp16/bf16 use 2 bytes per element vs 4 bytes for fp32, halving memory traffic estimates.

  2. Peak FLOPS: Uses precision-specific peak (e.g., H100: 67 TFLOPS fp32 vs 989 TFLOPS fp16 tensor core).

FLOP count itself is precision-independent (same operations regardless of dtype).
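
For instance, continuing the Quick Start example (assuming precision is a keyword argument of analyze, as the description above suggests):

# Same model and input as the Quick Start; only the assumed precision kwarg changes.
for prec in ("fp32", "fp16", "bf16"):
    torchroofline.analyze(model, (x,), gpu="h100-sxm", precision=prec).print()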

Theoretical Max / Speed of Light (SoL)

Compute SoL = FLOPs / Peak FLOPS
Memory SoL  = Fused Bytes / Peak Bandwidth
Bottleneck  = max(Compute SoL, Memory SoL)
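
Plugging the sample output above into these formulas (using H100 SXM peaks of roughly 67 TFLOPS fp32 and 3.35 TB/s HBM bandwidth):

flops       = 0.30e9    # 0.30 GFLOPs (sample output)
fused_bytes = 0.001e9   # fused HBM traffic (sample output)
peak_flops  = 67e12     # H100 fp32 peak
peak_bw     = 3.35e12   # H100 SXM HBM bandwidth, bytes/s

compute_sol = flops / peak_flops * 1e3     # ~0.0045 ms
memory_sol  = fused_bytes / peak_bw * 1e3  # ~0.0003 ms
intensity   = flops / fused_bytes          # 300 FLOPs/Byte
ridge       = peak_flops / peak_bw         # 20 FLOPs/Byte; intensity > ridge => compute bound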

Empirical Profiling

For actual runtime measurements, use KernelBench profiling utilities. Pass measured times to analyze(actual_ms=...) or compare_kernelbench(ref_ms=..., gen_ms=...) to compute efficiency metrics.
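
A sketch of what this could look like (the measured times are placeholders; anything beyond the keyword names given above is an assumption following the Quick Start):

# Fold a measured runtime into the analytical report (0.012 ms is a placeholder).
result = torchroofline.analyze(model, (x,), gpu="h100-sxm", actual_ms=0.012)

# Compare measured reference vs. generated kernel times (KernelBench workflow);
# compare_kernelbench may take further arguments identifying the problem.
from torchroofline import compare_kernelbench
comparison = compare_kernelbench(ref_ms=0.012, gen_ms=0.008)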

Limitations

  • No cache modeling: Assumes all data comes from HBM (pessimistic for small tensors)
  • Tensor core assumption: fp16/bf16 peak FLOPS assume full tensor core utilization, but only matmul/conv ops use tensor cores. Pointwise ops run on CUDA cores at lower throughput, making fp16/bf16 compute SoL optimistic for pointwise-heavy workloads.
  • Approximate fusion: Greedy pointwise-chain fusion; results may differ from actual Inductor behavior
  • Static shapes only: Dynamic shapes not supported

See METHOD.md for detailed methodology.

Available GPUs

from torchroofline import list_gpus
print(list_gpus())

NVIDIA: H100, H200, A100, L40S, L40, L4, T4, A10G, RTX 4090/4080/3090/3080, V100

AMD: MI300X, MI325X, MI350X, MI355X

GPU specs are loaded from KernelBench when available (via git submodule), with fallback defaults.

Running Tests

uv run pytest tests/ -v

Project Structure

torchroofline/
├── pyproject.toml
├── external/
│   └── kernelbench/          # git submodule
├── src/torchroofline/
│   ├── __init__.py           # analyze(), exports
│   ├── hardware.py           # GPU specs (from KernelBench)
│   ├── flop_counter.py       # ptflops wrapper
│   ├── mem_traffic.py        # FX trace + fusion simulation
│   ├── roofline.py           # RooflineResult dataclass
│   ├── kernelbench.py        # compare_kernelbench()
│   └── viz.py                # roofline plotting
└── tests/

Contributing

GPU specs are sourced from external/kernelbench/src/kernelbench/prompts/hardware/gpu_specs.py. To add or update GPUs, submit a PR to KernelBench or add fallback specs in src/torchroofline/hardware.py.
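
A hypothetical fallback entry (the schema and field names are illustrative, not the actual hardware.py layout; the numbers are the H100 peaks cited above):

# Illustrative only: the real schema in src/torchroofline/hardware.py may differ.
FALLBACK_GPUS = {
    "h100-sxm": {
        "peak_tflops": {"fp32": 67.0, "fp16": 989.0, "bf16": 989.0},  # fp16/bf16 = tensor core
        "mem_bandwidth_gb_s": 3350.0,
    },
}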
