torchroofline

Analytical roofline analysis for PyTorch models: theoretical compute and memory limits, no GPU required.

Given a PyTorch program (in KernelBench format) and a target GPU, torchroofline estimates theoretical roofline bounds for compute and memory, including an estimate of how much operator fusion reduces HBM traffic.

NOTE: This is a quick prototype. Please use with caution. I am still iterating on the design.

Setup

Development (with KernelBench integration)

git clone --recurse-submodules https://github.com/simonguozirui/simple-torchroofline.git
cd simple-torchroofline
uv sync

If you already cloned without submodules:

git submodule update --init --recursive
uv sync

Install as package

uv add torchroofline
# or
pip install torchroofline

For visualization:

uv add "torchroofline[viz]"

Quick Start

import torch
import torch.nn as nn
import torchroofline

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
)

x = torch.randn(1, 512)
result = torchroofline.analyze(model, (x,), gpu="h100-sxm")
result.print()

Output:

================================================================
  torchroofline — model
  GPU: H100   Precision: fp32
================================================================
  FLOPs          :       0.30 GFLOPs
  Params         :       0.16 M

  HBM traffic (unfused) :     0.002 GB  <- eager upper bound
  HBM traffic (fused)   :     0.001 GB  <- inductor lower bound
  Fusion reduction      :      50.0 %

  Arithmetic intensity  :     300.0 FLOPs/Byte
  Ridge point (fp32)    :      20.0 FLOPs/Byte

  Compute SoL    :     0.0045 ms
  Memory  SoL    :     0.0003 ms
  *** Bottleneck : COMPUTE BOUND ***
================================================================

Method

Compute Roofline (FLOPs)

FLOPs are counted using ptflops, which traces the model and counts multiply-accumulate operations (MACs) for each layer. We report FLOPs = MACs × 2.
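
A minimal sketch of this step using ptflops directly (flop_counter.py wraps this; its exact options may differ):

import torch.nn as nn
from ptflops import get_model_complexity_info

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

# ptflops reports MACs; the input shape excludes the batch dimension.
macs, params = get_model_complexity_info(
    model, (512,), as_strings=False, print_per_layer_stat=False
)
print(f"{2 * macs:,} FLOPs, {params:,} params")  # FLOPs = MACs x 2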

Memory Roofline (HBM Traffic)

Memory traffic is estimated by tracing the model with torch.fx and analyzing the operation graph:

  1. Unfused estimate: Every operation reads its inputs from HBM and writes its output back to HBM (eager-execution upper bound)
  2. Fused estimate: Simulates operator fusion by grouping consecutive pointwise operations, so that only fusion-boundary tensors hit HBM (Inductor-style lower bound; a sketch follows the classification list below)

Operations are classified as:

  • Zero-copy: view, reshape, permute, etc. (no memory traffic)
  • Pointwise: relu, add, mul, etc. (fusible)
  • Reduction: sum, mean, softmax, etc.
  • Compute: matmul, conv, attention, etc.
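
A minimal sketch of the fused-traffic estimate (greedy grouping over the fx graph; the classifier here is abridged, and zero-copy view ops, module weights, and dtype handling are omitted for brevity):

import torch
import torch.nn as nn
import torch.fx as fx
from torch.fx.passes.shape_prop import ShapeProp

POINTWISE_MODULES = (nn.ReLU, nn.GELU, nn.Sigmoid)    # fusible (abridged)
POINTWISE_FUNCS = (torch.relu, torch.add, torch.mul)  # fusible (abridged)

def fused_traffic_bytes(model, example_input, bytes_per_elem=4):
    gm = fx.symbolic_trace(model)
    ShapeProp(gm).propagate(example_input)  # annotate nodes with output shapes

    def is_pointwise(node):
        if node.op == "call_module":
            return isinstance(gm.get_submodule(node.target), POINTWISE_MODULES)
        return node.op == "call_function" and node.target in POINTWISE_FUNCS

    def nbytes(node):
        meta = node.meta.get("tensor_meta")
        return meta.shape.numel() * bytes_per_elem if meta is not None else 0

    # Greedily chain consecutive pointwise nodes into fusion groups;
    # everything else forms a singleton group.
    groups, chain = [], []
    for node in gm.graph.nodes:
        if node.op in ("placeholder", "output"):
            continue
        if is_pointwise(node):
            chain.append(node)
        else:
            if chain:
                groups.append(chain)
                chain = []
            groups.append([node])
    if chain:
        groups.append(chain)

    total = 0
    for group in groups:
        members = set(group)
        for node in group:
            # Reads: only inputs produced outside the group come from HBM.
            total += sum(nbytes(i) for i in node.all_input_nodes if i not in members)
            # Writes: only outputs consumed outside the group go back to HBM.
            if any(u not in members for u in node.users):
                total += nbytes(node)
    return total

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
print(fused_traffic_bytes(model, torch.randn(1, 512)))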

Precision Handling

The precision argument ("fp32", "fp16", "bf16") affects analysis in two ways:

  1. Memory traffic: Model and inputs are cast to the target dtype. fp16/bf16 use 2 bytes per element vs 4 bytes for fp32, halving memory traffic estimates.

  2. Peak FLOPS: Uses precision-specific peak (e.g., H100: 67 TFLOPS fp32 vs 989 TFLOPS fp16 tensor core).

FLOP count itself is precision-independent (same operations regardless of dtype).
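
For instance, continuing the Quick Start example (assuming precision is a keyword argument of analyze, as the description above suggests):

# Same model and input as the Quick Start; only the assumed precision kwarg changes.
for prec in ("fp32", "fp16", "bf16"):
    torchroofline.analyze(model, (x,), gpu="h100-sxm", precision=prec).print()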

Theoretical Max / Speed of Light (SoL)

Compute SoL = FLOPs / Peak FLOPS
Memory SoL  = Fused Bytes / Peak Bandwidth
Bottleneck  = max(Compute SoL, Memory SoL)
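
Plugging the sample output above into these formulas (using H100 SXM peaks of roughly 67 TFLOPS fp32 and 3.35 TB/s HBM bandwidth):

flops       = 0.30e9    # 0.30 GFLOPs (sample output)
fused_bytes = 0.001e9   # fused HBM traffic (sample output)
peak_flops  = 67e12     # H100 fp32 peak
peak_bw     = 3.35e12   # H100 SXM HBM bandwidth, bytes/s

compute_sol = flops / peak_flops * 1e3     # ~0.0045 ms
memory_sol  = fused_bytes / peak_bw * 1e3  # ~0.0003 ms
intensity   = flops / fused_bytes          # 300 FLOPs/Byte
ridge       = peak_flops / peak_bw         # 20 FLOPs/Byte; intensity > ridge => compute bound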

Empirical Profiling

For actual runtime measurements, use KernelBench profiling utilities. Pass measured times to analyze(actual_ms=...) or compare_kernelbench(ref_ms=..., gen_ms=...) to compute efficiency metrics.
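
A sketch of what this could look like (the measured times are placeholders; anything beyond the keyword names given above is an assumption following the Quick Start):

# Fold a measured runtime into the analytical report (0.012 ms is a placeholder).
result = torchroofline.analyze(model, (x,), gpu="h100-sxm", actual_ms=0.012)

# Compare measured reference vs. generated kernel times (KernelBench workflow);
# compare_kernelbench may take further arguments identifying the problem.
from torchroofline import compare_kernelbench
comparison = compare_kernelbench(ref_ms=0.012, gen_ms=0.008)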

Limitations

  • No cache modeling: Assumes all data comes from HBM (pessimistic for small tensors)
  • Tensor core assumption: fp16/bf16 peak FLOPS assume full tensor core utilization, but only matmul/conv ops use tensor cores. Pointwise ops run on CUDA cores at lower throughput, making fp16/bf16 compute SoL optimistic for pointwise-heavy workloads.
  • Approximate fusion: Greedy pointwise-chain fusion; results may differ from actual Inductor behavior
  • Static shapes only: Dynamic shapes not supported

See METHOD.md for detailed methodology.

Available GPUs

from torchroofline import list_gpus
print(list_gpus())

NVIDIA: H100, H200, A100, L40S, L40, L4, T4, A10G, RTX 4090/4080/3090/3080, V100

AMD: MI300X, MI325X, MI350X, MI355X

GPU specs are loaded from KernelBench when available (via git submodule), with fallback defaults.

Running Tests

uv run pytest tests/ -v

Project Structure

torchroofline/
├── pyproject.toml
├── external/
│   └── kernelbench/          # git submodule
├── src/torchroofline/
│   ├── __init__.py           # analyze(), exports
│   ├── hardware.py           # GPU specs (from KernelBench)
│   ├── flop_counter.py       # ptflops wrapper
│   ├── mem_traffic.py        # FX trace + fusion simulation
│   ├── roofline.py           # RooflineResult dataclass
│   ├── kernelbench.py        # compare_kernelbench()
│   └── viz.py                # roofline plotting
└── tests/

Contributing

GPU specs are sourced from external/kernelbench/src/kernelbench/prompts/hardware/gpu_specs.py. To add or update GPUs, submit a PR to KernelBench or add fallback specs in src/torchroofline/hardware.py.
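
A hypothetical fallback entry (the schema and field names are illustrative, not the actual hardware.py layout; the numbers are the H100 peaks cited above):

# Illustrative only: the real schema in src/torchroofline/hardware.py may differ.
FALLBACK_GPUS = {
    "h100-sxm": {
        "peak_tflops": {"fp32": 67.0, "fp16": 989.0, "bf16": 989.0},  # fp16/bf16 = tensor core
        "mem_bandwidth_gb_s": 3350.0,
    },
}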
