# torchroofline

Analytical roofline analysis for PyTorch models — compute + memory theoretical limits, no GPU required.

Given a PyTorch program (in KernelBench format) and a target GPU, torchroofline estimates theoretical roofline bounds for compute and memory, including a rough model of operator fusion.

> **Note:** This is a quick prototype; please use with caution. I am still iterating on the design.
## Installation

From source:

```sh
git clone --recurse-submodules https://github.com/your-org/torchroofline.git
cd torchroofline
uv sync
```

If you already cloned without submodules:

```sh
git submodule update --init --recursive
uv sync
```

Or install as a dependency:

```sh
uv add torchroofline
# or
pip install torchroofline
```

For visualization support:

```sh
uv add "torchroofline[viz]"
```

## Quick start

```python
import torch
import torch.nn as nn

import torchroofline

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
)
x = torch.randn(1, 512)

result = torchroofline.analyze(model, (x,), gpu="h100-sxm")
result.print()
```

Output:
```
================================================================
torchroofline — model
GPU: H100    Precision: fp32
================================================================
FLOPs                 : 0.30 GFLOPs
Params                : 0.16 M
HBM traffic (unfused) : 0.002 GB   <- eager upper bound
HBM traffic (fused)   : 0.001 GB   <- inductor lower bound
Fusion reduction      : 50.0 %
Arithmetic intensity  : 300.0 FLOPs/Byte
Ridge point (fp32)    : 20.0 FLOPs/Byte
Compute SoL           : 0.0045 ms
Memory SoL            : 0.0003 ms
*** Bottleneck        : COMPUTE BOUND ***
================================================================
```
## FLOP counting

FLOPs are counted using ptflops, which traces the model and counts multiply-accumulate operations (MACs) for each layer. We report FLOPs = MACs × 2.
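The MACs × 2 convention is easy to reproduce by hand for dense layers. The sketch below is illustrative only (`linear_macs` is not part of the package or of ptflops):

```python
def linear_macs(in_features: int, out_features: int, batch: int = 1) -> int:
    """One multiply-accumulate per (input feature, output feature) pair."""
    return batch * in_features * out_features

# Two stacked Linear layers, batch size 1:
macs = linear_macs(1024, 512) + linear_macs(512, 256)
flops = 2 * macs  # FLOPs = MACs × 2 (one multiply + one add per MAC)
print(flops)  # 1310720
```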
## Memory traffic

Memory traffic is estimated by tracing the model with `torch.fx` and analyzing the operation graph:

- **Unfused estimate**: every operation reads its inputs from and writes its output to HBM (an upper bound matching eager execution).
- **Fused estimate**: simulates operator fusion by grouping consecutive pointwise operations, so only the tensors at group boundaries touch HBM (an Inductor-style lower bound).

Operations are classified as:

- **Zero-copy**: `view`, `reshape`, `permute`, etc. (no memory traffic)
- **Pointwise**: `relu`, `add`, `mul`, etc. (fusible)
- **Reduction**: `sum`, `mean`, `softmax`, etc.
- **Compute**: `matmul`, `conv`, attention, etc.
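To make the fused-vs-unfused accounting concrete, here is a minimal sketch on a linear chain of ops. The real analysis works on the full FX graph; `chain_hbm_bytes` and the `POINTWISE` set below are illustrative simplifications, not the package's internal API:

```python
POINTWISE = {"relu", "add", "mul", "sigmoid", "gelu"}

def chain_hbm_bytes(ops, nbytes, fuse=True):
    """HBM traffic for a linear chain of ops.

    ops    : op names in execution order
    nbytes : byte sizes of the tensors flowing between ops
             (len(ops) + 1 entries: chain input, intermediates..., output)
    """
    traffic = nbytes[0] + nbytes[-1]  # chain input read + final output written
    for i in range(len(ops) - 1):
        # An intermediate stays in registers only when producer and consumer
        # are both pointwise and therefore fuse into one kernel.
        fused = fuse and ops[i] in POINTWISE and ops[i + 1] in POINTWISE
        if not fused:
            traffic += 2 * nbytes[i + 1]  # materialized: written, then re-read
    return traffic

sizes = [4096] * 4  # e.g. four tensors of 1024 fp32 elements each
print(chain_hbm_bytes(["matmul", "relu", "add"], sizes, fuse=False))  # 24576
print(chain_hbm_bytes(["matmul", "relu", "add"], sizes, fuse=True))   # 16384
```

Here the `relu -> add` boundary fuses away while the `matmul -> relu` boundary is still materialized, giving a 1/3 traffic reduction for this chain.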
## Precision

The `precision` argument (`"fp32"`, `"fp16"`, `"bf16"`) affects the analysis in two ways:

- **Memory traffic**: the model and inputs are cast to the target dtype. fp16/bf16 use 2 bytes per element vs. 4 bytes for fp32, halving memory-traffic estimates.
- **Peak FLOPS**: uses the precision-specific peak (e.g., H100: 67 TFLOPS fp32 vs. 989 TFLOPS fp16 tensor core).

The FLOP count itself is precision-independent (the same operations run regardless of dtype).
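As a sketch of the traffic scaling (per-element sizes are standard; the peak numbers are the ones quoted above, and the helper name is illustrative):

```python
BYTES_PER_ELEM = {"fp32": 4, "fp16": 2, "bf16": 2}
PEAK_TFLOPS_H100 = {"fp32": 67, "fp16": 989, "bf16": 989}  # from the text above

def scale_traffic(fp32_bytes: int, precision: str) -> float:
    """Memory-traffic estimate after casting an fp32 model to `precision`."""
    return fp32_bytes * BYTES_PER_ELEM[precision] / BYTES_PER_ELEM["fp32"]

print(scale_traffic(2_000_000, "fp16"))  # 1000000.0 — half the fp32 traffic
```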
## Roofline model

SoL ("speed of light") is the fastest theoretically achievable time:

```
Compute SoL = FLOPs / Peak FLOPS
Memory SoL  = Fused bytes / Peak bandwidth
Bottleneck  = max(Compute SoL, Memory SoL)
```
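The example output above can be reproduced by hand. A minimal sketch, assuming an H100 SXM with 67 TFLOPS fp32 peak (quoted above) and roughly 3.35 TB/s HBM bandwidth (an assumption, not stated in this README):

```python
def roofline(flops: float, hbm_bytes: float, peak_flops: float, peak_bw: float):
    """Return (compute_sol_s, memory_sol_s, bottleneck) — illustrative helper."""
    compute_sol = flops / peak_flops   # seconds if perfectly compute-limited
    memory_sol = hbm_bytes / peak_bw   # seconds if perfectly bandwidth-limited
    bottleneck = "compute" if compute_sol >= memory_sol else "memory"
    return compute_sol, memory_sol, bottleneck

# Numbers from the example output: 0.30 GFLOPs, 0.001 GB fused traffic.
c, m, b = roofline(flops=0.30e9, hbm_bytes=0.001e9,
                   peak_flops=67e12, peak_bw=3.35e12)
# c ≈ 4.5e-6 s (0.0045 ms), m ≈ 3.0e-7 s (0.0003 ms) -> compute bound
```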
For actual runtime measurements, use KernelBench's profiling utilities. Pass measured times via `analyze(actual_ms=...)` or `compare_kernelbench(ref_ms=..., gen_ms=...)` to compute efficiency metrics.
## Limitations

- **No cache modeling**: assumes all data comes from HBM (pessimistic for small tensors).
- **Tensor core assumption**: fp16/bf16 peak FLOPS assume full tensor-core utilization, but only matmul/conv ops use tensor cores. Pointwise ops run on CUDA cores at lower throughput, making the fp16/bf16 compute SoL optimistic for pointwise-heavy workloads.
- **Approximate fusion**: greedy pointwise-chain fusion may differ from actual Inductor behavior.
- **Static shapes only**: dynamic shapes are not supported.
See METHOD.md for detailed methodology.
## Supported GPUs

```python
from torchroofline import list_gpus
print(list_gpus())
```

- NVIDIA: H100, H200, A100, L40S, L40, L4, T4, A10G, RTX 4090/4080/3090/3080, V100
- AMD: MI300X, MI325X, MI350X, MI355X

GPU specs are loaded from KernelBench when available (via the git submodule), with fallback defaults.
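The "KernelBench first, fallback second" lookup can be sketched roughly as follows (function and dict names are hypothetical, not the actual `hardware.py` API, and the spec values are illustrative):

```python
# Fallback specs used when the KernelBench submodule is unavailable.
FALLBACK_SPECS = {
    # name: (peak fp32 TFLOPS, HBM bandwidth TB/s) — illustrative values
    "h100-sxm": (67.0, 3.35),
}

def get_gpu_spec(name: str, kernelbench_specs=None):
    """Prefer specs loaded from the KernelBench submodule; fall back otherwise."""
    if kernelbench_specs and name in kernelbench_specs:
        return kernelbench_specs[name]
    try:
        return FALLBACK_SPECS[name]
    except KeyError:
        raise ValueError(f"unknown GPU {name!r}") from None
```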
## Development

Run the tests:

```sh
uv run pytest tests/ -v
```

Project layout:

```
torchroofline/
├── pyproject.toml
├── external/
│   └── kernelbench/          # git submodule
├── src/torchroofline/
│   ├── __init__.py           # analyze(), exports
│   ├── hardware.py           # GPU specs (from KernelBench)
│   ├── flop_counter.py       # ptflops wrapper
│   ├── mem_traffic.py        # FX trace + fusion simulation
│   ├── roofline.py           # RooflineResult dataclass
│   ├── kernelbench.py        # compare_kernelbench()
│   └── viz.py                # roofline plotting
└── tests/
```
GPU specs are sourced from `external/kernelbench/src/kernelbench/prompts/hardware/gpu_specs.py`. To add or update GPUs, submit a PR to KernelBench or add fallback specs in `src/torchroofline/hardware.py`.