Torch-MPS-Bench is a lightweight benchmarking suite for measuring deep learning model performance on Apple Silicon GPUs via Metal Performance Shaders (MPS). It compares CPU vs MPS execution and FP32 vs FP16 precision across multiple batch sizes, and produces CSV logs, plots, and Markdown reports.
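Everything here targets PyTorch's MPS backend, so a quick sanity check that your PyTorch build can actually see it is worthwhile before benchmarking:

```python
import torch

# True if this PyTorch build was compiled with MPS support
print("MPS built:    ", torch.backends.mps.is_built())
# True if the current macOS/hardware can actually run on the MPS device
print("MPS available:", torch.backends.mps.is_available())
```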
- 🔹 Benchmark popular models (ResNet, DistilBERT, etc.) on CPU vs MPS
- 🔹 Supports FP32 and FP16 precision
- 🔹 Logs results to CSV with latency (P50/P90/P99) and throughput (see the timing sketch below)
- 🔹 Generates plots for latency/throughput
- 🔹 Auto-generates a Markdown report with best configs + CPU→MPS speedups
- 🔹 Extensible — add your own models easily
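Under the hood, numbers like these come from timing repeated forward passes. The sketch below shows the general technique on MPS (warmup, explicit synchronization, percentile reduction); names such as `measure` are illustrative rather than `bench.py`'s actual API.

```python
import time
import numpy as np
import torch
import torchvision

def measure(model, x, device, iters=50, warmup=10):
    """Illustrative timing loop: returns latency percentiles and throughput.

    A sketch of the general technique, not bench.py's actual code.
    """
    model = model.to(device).eval()
    x = x.to(device)
    latencies_ms = []
    with torch.no_grad():
        for i in range(warmup + iters):
            start = time.perf_counter()
            model(x)
            if device == "mps":
                # MPS ops are queued asynchronously; sync before stopping the clock
                torch.mps.synchronize()
            elapsed_ms = (time.perf_counter() - start) * 1000
            if i >= warmup:  # drop warmup iterations
                latencies_ms.append(elapsed_ms)
    p50, p90, p99 = np.percentile(latencies_ms, [50, 90, 99])
    return {
        "p50_ms": p50,
        "p90_ms": p90,
        "p99_ms": p99,
        # one plausible definition: samples per second at median latency
        "throughput_sps": x.shape[0] / (p50 / 1000.0),
    }

# Example: ResNet50, batch 4, on MPS if available
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(measure(torchvision.models.resnet50(weights=None),
              torch.randn(4, 3, 224, 224), device))
```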
```
.
├── bench.py            # Run benchmarks (single model/config)
├── plot_results.py     # Generate latency/throughput plots
├── gen_report.py       # Create Markdown summary report
├── requirements.txt    # Python deps (pandas, torch, transformers, tabulate, matplotlib)
├── results/
│   ├── bench.csv       # Collected benchmark results
│   ├── plots/          # Auto-generated plots
│   └── summary.md      # Auto-generated Markdown report
└── README.md
```
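`bench.py` appends each run as one row of `results/bench.csv`, which `plot_results.py` and `gen_report.py` then consume. The exact column set isn't documented in this README; judging from the tables in the example report further down, a row plausibly looks like the sketch below (column names are assumptions, not a documented schema).

```python
import os
import pandas as pd

def append_result(csv_path: str, row: dict) -> None:
    """Append one benchmark run to the shared CSV, writing the header only once."""
    os.makedirs(os.path.dirname(csv_path), exist_ok=True)
    pd.DataFrame([row]).to_csv(
        csv_path, mode="a", header=not os.path.exists(csv_path), index=False
    )

# Values taken from the example report below; column names are assumed.
append_result("results/bench.csv", {
    "model": "resnet50", "device": "mps", "precision": "fp16", "batch": 4,
    "p50_ms": 6.3, "p90_ms": 6.5, "p99_ms": 6.8, "throughput_sps": 126.4,
})
```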
```
# Create env (Python 3.10+ recommended)
python -m venv .venv
source .venv/bin/activate

# Install deps
pip install -r requirements.txt
```

Run CPU vs MPS for ResNet50:

```
python bench.py --model resnet50 --device cpu --precision fp32 --batch 4 --out_csv results/bench.csv
python bench.py --model resnet50 --device mps --precision fp16 --batch 4 --out_csv results/bench.csv
```

Run DistilBERT (seq length 128):

```
python bench.py --model distilbert --device cpu --precision fp32 --batch 2 --seq_len 128 --out_csv results/bench.csv
python bench.py --model distilbert --device mps --precision fp16 --batch 2 --seq_len 128 --out_csv results/bench.csv
```

Generate plots:

```
python plot_results.py --csv results/bench.csv --out results/plots
```

This produces:
- `results/plots/resnet50_latency_p50.png`
- `results/plots/resnet50_throughput.png`
- etc.
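`plot_results.py`'s internals aren't reproduced in this README; the sketch below shows roughly how such plots could be built from the CSV with pandas and matplotlib. The file naming mirrors the outputs above; everything else is illustrative.

```python
import os
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render to files; no display needed
import matplotlib.pyplot as plt

df = pd.read_csv("results/bench.csv")
os.makedirs("results/plots", exist_ok=True)

for model, runs in df.groupby("model"):
    fig, ax = plt.subplots()
    # One line per device/precision combination: P50 latency vs. batch size
    for (device, precision), grp in runs.groupby(["device", "precision"]):
        grp = grp.sort_values("batch")
        ax.plot(grp["batch"], grp["p50_ms"], marker="o", label=f"{device}-{precision}")
    ax.set_xlabel("batch size")
    ax.set_ylabel("P50 latency (ms)")
    ax.set_title(model)
    ax.legend()
    fig.savefig(f"results/plots/{model}_latency_p50.png", bbox_inches="tight")
    plt.close(fig)
```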
Generate the Markdown report:

```
python gen_report.py --csv results/bench.csv --out results/summary.md --plots_dir results/plots
```

This creates `results/summary.md` with:
- 🔹 Environment info (PyTorch, Python, OS)
- 🔹 CPU→MPS speedup table (computation sketched below)
- 🔹 Per-model best configs (latency/throughput)
- 🔹 Compact results table
- 🔹 Auto-embedded plots
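For reference, the CPU→MPS speedup is just the ratio of CPU to MPS median latency for matching model/batch rows (25.20 ms / 6.30 ms ≈ 4.0× for resnet50 at batch 4 in the example below). A possible pandas version, assuming the column names used throughout this README rather than `gen_report.py`'s actual code:

```python
import pandas as pd

df = pd.read_csv("results/bench.csv")

# Best (lowest) P50 latency per model/batch on each device, side by side
best = df.groupby(["model", "batch", "device"])["p50_ms"].min().unstack("device")
best["speedup_x"] = best["cpu"] / best["mps"]
print(best.reset_index().to_markdown(index=False))  # tabulate is already in requirements.txt
```

An example of the generated `results/summary.md`: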
# 🧪 Torch-MPS-Bench — Summary
_Generated: 2025-08-18 19:00:00_
- **PyTorch**: 2.5.0
- **Python**: 3.10.14
- **System**: Apple M2 Pro
---
## CPU → MPS Latency Speedup (P50)
| model | batch | cpu_p50_ms | mps_p50_ms | speedup_x | pair |
|-----------|-------|------------|------------|-----------|-----------------------|
| resnet50 | 4 | 25.20 | 6.30 | 4.00 | cpu-fp32 vs mps-fp16 |
| distilbert| 2 | 112.00 | 40.00 | 2.80 | cpu-fp32 vs mps-fp16 |
---
## Model: resnet50
**Best Latency (P50)**
- Device: `mps` Precision: `fp16` Batch: `4` P50: **6.30 ms**
**Best Throughput**
- Device: `mps` Precision: `fp16` Batch: `8` Throughput: **120.5 samples/s**
**All Runs**
| device | precision | batch | p50_ms | p90_ms | p99_ms | throughput_sps |
|--------|-----------|-------|--------|--------|--------|----------------|
| cpu | fp32 | 4 | 25.2 | 26.1 | 27.9 | 39.7 |
| mps | fp16 | 4 | 6.3 | 6.5 | 6.8 | 126.4 |

- Add more models → extend `bench.py` with HuggingFace or TorchVision APIs (see the registry sketch below)
- Add more devices (CUDA, ROCm) → plug into the same CSV schema
- Add CI → run sanity benchmarks on CPU & upload report
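The README doesn't show how `bench.py` wires up models internally; a common pattern is a name→constructor registry, sketched below with a hypothetical `MODELS` dict (the existing entries and the new MobileNetV3 one are illustrative, not the repo's actual code).

```python
import torch
import torchvision
from transformers import AutoModel

# Hypothetical registry: model name -> (model constructor, dummy-input factory).
# bench.py's real extension point may differ; this just shows the idea.
MODELS = {
    "resnet50": (
        lambda: torchvision.models.resnet50(weights=None),
        lambda batch, seq_len=None: (torch.randn(batch, 3, 224, 224),),
    ),
    "distilbert": (
        lambda: AutoModel.from_pretrained("distilbert-base-uncased"),
        lambda batch, seq_len=128: (
            torch.randint(0, 30522, (batch, seq_len)),      # input_ids
            torch.ones(batch, seq_len, dtype=torch.long),   # attention_mask
        ),
    ),
    # New model: same CLI, no changes to the timing or CSV code needed
    "mobilenet_v3": (
        lambda: torchvision.models.mobilenet_v3_large(weights=None),
        lambda batch, seq_len=None: (torch.randn(batch, 3, 224, 224),),
    ),
}

def build(name, batch, seq_len=128):
    ctor, make_inputs = MODELS[name]
    return ctor(), make_inputs(batch, seq_len)
```

With a registry like this, `python bench.py --model mobilenet_v3 --device mps --precision fp16 --batch 4 ...` would work without touching the measurement or CSV code.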
MIT — feel free to fork, extend, and contribute 🚀