Releases: ai-dynamo/aiperf
AIPerf v0.7.0 Release
AIPerf — Release 0.7.0
Summary
AIPerf 0.7.0 adds a FastAPI service with metrics, progress, results, and workers APIs, a service CLI for running individual services, and a lightweight health server for Kubernetes probes. Dataset and trace workflows expand with Bailian traces, Mooncake messages, public dataset loader plugins, Hugging Face + AIMO loading, random_pool batching and single-line JSONL, and zstd for the memory-map backing store. Multimodal work gains synthetic audio on synthetic video and a NIM image retrieval endpoint. Benchmarking gains multi-turn conversation context, convergence / early stopping, multi-run confidence reporting, and accuracy stubs. Security and correctness improve via centralized credential redaction, treating all 2xx responses as success (including 202), and fixes for SSE, tokenizers (including Kimi), metrics processing, and macOS child/bootstrap behavior. Documentation moves toward Fern with publishing automation; the CLI is refactored into per-command modules with a performance pass on hot paths.
Key highlights
- Service & ops: FastAPI app with plugin routers (#717); Prometheus-style metrics (#718); progress / results / workers APIs (#721); `service` subcommand (#668); K8s health server (#629).
- Data & traces: Bailian loader + BaseTraceDatasetLoader (#723); Mooncake `messages` (#728); public dataset loader plugins (#758); Hugging Face + AIMO (#768); `random_pool` batch size + single-line JSONL (#745); zstd mmap store (#698).
- Endpoints & media: NIM image retrieval (#725); synthetic audio for multimodal video (#714).
- Load & analysis: Conversation context for multi-turn (#739); convergence / early stopping (#761); multi-run confidence stats (#623); accuracy stubs (#684).
- Messaging: ZMQ dual-bind IPC+TCP (#660); wildcard subscriptions (#700).
- Security & HTTP: API key / credential redaction (#766); 2xx success including 202 (#777); trust_env on metrics collector sessions (#755).
- Performance & CLI: Dataclasses + SSE buffer tuning (#746); CLI split into per-command files (#748).
- Docs: Fern docs, tests, publish Action, org/URL/basepath fixes (#676, #754, #759, #757, #760, #762).
Features and enhancements
API service and CLI
- FastAPI service with plugin-based router system — #717 (@ajcasagrande)
- Metrics router (Prometheus exposition) — #718 (@ajcasagrande)
- Progress, results, workers routers with tests — #721 (@ajcasagrande)
- `service` subcommand for individual services — #668 (@ajcasagrande)
- Lightweight health server for Kubernetes probes — #629 (@ajcasagrande)
Datasets, traces, and storage
- Plugin infrastructure for public dataset loaders — #758 (@lkomali)
- Hugging Face dataset loader infrastructure + AIMO — #768 (@lkomali)
- Bailian trace loader; BaseTraceDatasetLoader — #723 (@ajcasagrande)
- `messages` in Mooncake trace format — #728 (@lvogel04)
- `random_pool`: batch size + single-line JSONL — #745 (@a-rich)
- zstd compression for dataset memory-map backing store — #698 (@ajcasagrande)
Endpoints and multimodal
- NIM image retrieval endpoint — #725 (@ajcasagrande)
- Synthetic audio track embedding for multimodal video — #714 (@ajcasagrande)
Benchmarking and statistics
- Conversation context mode for multi-turn history — #739 (@ajcasagrande)
- Convergence early stopping (adaptive criteria and utilities) — #761 (@dferguson992)
- Multi-run confidence reporting with statistical aggregation — #623 (@Lokiiiiii)
- Accuracy stubs — #684 (@debermudez)
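Multi-run confidence reporting (#623) aggregates a metric across repeated benchmark runs. As an illustrative sketch of the general technique (not AIPerf's actual implementation), a normal-approximation confidence interval over per-run means can be computed like this; the throughput numbers are hypothetical:

```python
import math
from statistics import NormalDist, mean, stdev

def confidence_interval(run_means: list[float], confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for the mean of per-run means.

    Illustrative sketch only; a production tool would typically use a
    Student's t critical value for small run counts.
    """
    n = len(run_means)
    m = mean(run_means)
    s = stdev(run_means)  # sample standard deviation across runs
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # ~1.96 for 95%
    half_width = z * s / math.sqrt(n)  # critical value times standard error
    return (m - half_width, m + half_width)

# Example: token-throughput means from three hypothetical runs
lo, hi = confidence_interval([1210.0, 1185.0, 1202.0])
print(f"95% CI: [{lo:.1f}, {hi:.1f}]")
```

Reporting an interval rather than a single mean makes run-to-run noise visible when comparing configurations.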
ZMQ and infrastructure
- Dual-bind IPC + TCP proxy — #660 (@ajcasagrande)
- Wildcard subscription support — #700 (@ajcasagrande)
Security, correctness, and HTTP behavior
- Centralized API key and credential redaction — #766 (@ajcasagrande)
- Treat all 2xx responses as success (including 202 Accepted) — #777 (@ajcasagrande)
- Stop leaking internal `Text.name` into chat completion payloads — #770 (@ajcasagrande)
Performance and refactoring
- Dataclasses on hot-path models + SSE buffer optimizations — #746 (@ajcasagrande)
- WorkerTracker extraction + FinalResultsMixin — #701 (@ajcasagrande)
- Pre-compute display units and filtering in `summarize()` — #662 (@ajcasagrande)
- Move CLI commands into per-command files; co-locate domain CLIs — #748 (@ajcasagrande)
Bug fixes and robustness
| Change | PR |
|---|---|
| SSE: literal newlines in data fields | #705 |
| Pickle: always pass module to `create_enum` | #669 |
| Prefix synthesis with `text_input` trace | #670 |
| Prefix mult with fractional multiplier | #691 |
| Plugins: remove alphabetical ordering constraint | #678 |
| macOS: None stdio in child process bootstrap | #706 |
| macOS: semaphore leak warning | #686 |
| Do not unregister semaphores with no name | #690 |
| Dataloader: stdio at OS level; daemon flag (replaces billiard hack) | #709 |
| Metrics: `sum` field; preserve INTERNAL metrics | #703 |
| Metrics HTTP: pass `trust_env` to collector sessions | #755 |
| Health check off event loop | #740 |
| Tokenizers: non-standard kwargs (e.g. Kimi) | #727 |
| Kimi K2.5: `trust_remote_code` + revision in `parallel_decode` | #744 |
| Preserve special tokens during decode (prompt block separators) | #737 |
| SIGTERM handler | #752 |
| Dashboard: expose copy-all-logs keybinding `c` | #730 |
| Tests: TimeTraveler recursion when looptime disabled | #683 |
| Tests: mock `parent_process` instead of subprocess in health tests | #682 |
| Tests: fixture scopes session → package | #747 |
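The SSE fix (#705) concerns events whose payloads span multiple `data:` lines. Per the Server-Sent Events specification, consecutive `data:` lines within one event are joined with a newline, so payloads may legitimately contain literal newlines. A minimal illustrative parser of that rule (not AIPerf's implementation):

```python
def parse_sse_events(stream: str) -> list[str]:
    """Parse an SSE stream into per-event data payloads.

    Consecutive `data:` lines within one event are joined with "\n";
    a blank line dispatches the accumulated event.
    """
    events, data_lines = [], []
    for line in stream.splitlines():
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):  # spec: strip one leading space only
                value = value[1:]
            data_lines.append(value)
        elif line == "" and data_lines:  # blank line dispatches the event
            events.append("\n".join(data_lines))
            data_lines = []
    if data_lines:  # flush a trailing event with no final blank line
        events.append("\n".join(data_lines))
    return events

raw = "data: line one\ndata: line two\n\ndata: second event\n\n"
print(parse_sse_events(raw))  # → ['line one\nline two', 'second event']
```

A parser that splits events on every newline instead of on blank lines would silently corrupt such multi-line payloads.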
Documentation
AIPerf v0.6.0-post.1 Release
AIPerf - Release 0.6.0-post.1
Summary
This is a minor documentation correction to the main README.
For a full rundown of the features in this release, please refer to Release v0.6.0.
What's Changed
- fix: Remove duplicate example. by @FrankD412 in #753
Full Changelog: v0.6.0...v0.6.0-post.1
AIPerf v0.6.0 Release
AIPerf - Release 0.6.0
Summary
AIPerf 0.6.0 delivers a substantial update focused on plugin architecture, multimodal and video support, robustness, and developer experience. This release introduces a Complete Plugin Registry System with YAML definitions and schema-driven tooling, expands endpoint support with NIM Embeddings (text + image), vLLM multimodal chat embeddings, and SGLang video generation, and adds an LLM TCO calculator plus improved GPU telemetry and metrics. Critical fixes address dataloader multiprocessing, security, prefix synthesis, and cross-platform behavior (including macOS). Documentation is expanded with metrics guides, plugin tutorials, and LLM benchmarking updates.
Key Highlights
Plugin System Overhaul
AIPerf 0.6.0 introduces a Complete Plugin Registry System (#616) that replaces the previous factory-based approach. Plugins can now be defined via YAML (#613), with schema files, enums, and an auto-generator tool (#614), plus ExtensibleStrEnum for dynamic enum generation (#615). Plugin lookup normalizes dashes and underscores (#635), and the alphabetical ordering constraint was removed for flexibility (#679).
Multimodal & Video Support
- NIM Embeddings: New NIM Embeddings endpoint with multimodal (text + image) support (#560).
- vLLM Multimodal Chat Embeddings: Multimodal chat embeddings endpoint for vLLM (#637).
- SGLang Video Generation: SGLang `/v1/video` generation with polling support (#658).
Multi-URL, Telemetry & Tooling
- Multiple `--url` Endpoints: Support for multiple `--url` endpoints (#606), with a fix so multi-url advances the URL correctly beyond the first turn (#618).
- Local GPU Telemetry: Support for local GPU telemetry via pynvml (#647); warmup is filtered out from GPU telemetry and counter deltas (#596).
- LLM TCO Calculator: New LLM TCO (Total Cost of Ownership) calculator tool (#559).
- Pre-Flight Tokenizer: Pre-flight tokenizer auto-detection and error display (#640).
- OSL Mismatch Detection: OSL mismatch detection and warning in metrics (#644).
Stability & Security
- Dataloader multiprocessing reworked: parallel processing disabled by default (#583), then billiard-based multiprocessing (#584), and finally stdio redirect at OS level with daemon-flag approach replacing billiard (#711).
- Security: Multiple security fixes (#590), SonarQube medium/high/critical cleanups (#605), and regex backtracking fix (#643).
- macOS: Handling of None stdio streams in child process bootstrap (#708).
Features & Enhancements
Plugin System
- Complete Plugin Registry System: Replaces factory-based plugin loading with a registry (#616).
- YAML Plugin Definitions: Plugins can be defined via YAML (#613).
- Plugin Schema & Tooling: Schema files, enums, and auto-generator tool for plugins (#614).
- ExtensibleStrEnum: Dynamic enum generation for plugin and config use (#615).
- Normalize Names: Dashes and underscores normalized in enum and plugin lookup (#635).
- Plugin Ordering: Removed alphabetical ordering constraint for plugins (#679).
Endpoints & Backends
- NIM Embeddings: NIM Embeddings endpoint with multimodal (text + image) support (#560).
- vLLM Multimodal Chat Embeddings: Multimodal chat embeddings endpoint for vLLM (#637).
- SGLang Video: SGLang `/v1/video` generation with polling support (#658).
- Multiple URLs: Support for multiple `--url` endpoints (#606).
Telemetry & Metrics
- Local GPU Telemetry: Local GPU telemetry via pynvml (#647).
- Warmup Filtering: Filter warmup from GPU telemetry and counter deltas (#596).
- OSL Mismatch: OSL mismatch detection and warning in metrics (#644).
Tooling & UX
- LLM TCO Calculator: New LLM TCO calculator tool (#559).
- Pre-Flight Tokenizer: Pre-flight tokenizer auto-detection and error display (#640).
- TTY Auto-Detection: Auto-detect TTY for UI type and log formatting (#650).
- Markdown-Accuracy-Auditor: Claude Code agent for markdown accuracy auditing (#641).
Prefix & Synthesis
- Num Prefix Groups: Num prefix groups in prefix analyzer (#627).
Network & Configuration
- IP Version & aiohttp: Ability to specify the IP version and `trust_env` for aiohttp (#625).
- Metrics Collector: Use `create_tcp_connector` for metrics collector HTTP sessions (#652).
Refactoring & Infrastructure
- Shared Generator Infrastructure: Extract shared generator infrastructure in tools (#621).
- Remove mkinit: Remove mkinit and all generated init files (#646).
- Version Bump: 0.6.0 version bump and canonical `aiperf.__version__` (#665).
Bug Fixes
Dataloader & Multiprocessing
- Disable Parallel Processing: Disable parallel processing for dataloader to avoid instability (#583).
- Billiard for Multiprocessing: Use billiard for multiprocessing in dataloader (#584).
- Stdio Redirect & Daemon Flag: Redirect stdio at OS level and replace billiard with daemon-flag approach (#711).
Scheduling & Benchmarks
- Fixed Schedule Completion: Fix errors with benchmark completion in fixed schedule (#585).
Security
- Various Security Issues: Fix various security issues (#590).
- SonarQube: Clean up SonarQube medium, high, and critical security findings (#605).
- Regex Backtracking: Fix regex backtracking issue (#643).
Dashboard & Binding
- Dashboard Host Binding: Expose the host to adjust the dashboard service binding address (#572).
Logging & Debug
- Debug Logging: Fix issues with debug logging (#594).
Prefix & Synthesis
- prefix-root-mult: Fix prefix-root-mult feature to behave correctly (#593).
- max_isl / max_osl: Fix max_isl and max_osl for synthesis (#592).
- Prefix Synthesizer Parity: Prefix synthesizer partial feature parity with Dynamo (#636).
- Prefix Synthesis text_input: Fix prefix synthesis error with `text_input` trace (#673).
- Prefix Mult Fractional: Fix prefix mult with a fractional multiplier (#692).
Metrics & Messaging
- ISL Token Count & ZMQ: Correct ISL token count and fix ZMQ message size (#597).
Tests & CI
- Curl Not Found: Fix test failure due to curl not found (#600).
Multi-URL
- URL Advancement: Multi-url only advances URL on first turn — fixed (#618).
Dependencies & Build
AIPerf v0.5.0 Release
AIPerf 0.5.0 Release Notes
Overview
AIPerf 0.5.0 delivers a major timing system overhaul with multi-phase benchmarking, user-centric rate modes, and sticky session routing. This release also introduces prefix synthesis capabilities for realistic workload generation, video file support for vision endpoints, significant performance improvements, and endpoint documentation.
Highlights
- Complete Timing System Overhaul — Multi-phase benchmarking with warmup, user-centric rate modes, and sticky routing
- Prefix Generator Integration — Analyze traces and synthesize workloads with realistic prefix sharing patterns
- Mooncake Trace Data Loader Performance — Load massive datasets without waiting
- Video File Support — Load video files and URLs for vision endpoint benchmarking
- Memory-mapped datasets — Remove DatasetManager network bottleneck at high QPS
- Server Token Count Support — Measure the actual throughput of your server
- SSL Certificate Verification Bypass — Benchmark in development and staging environments without SSL verification
- Improved Error Messages — Debug faster with actionable errors
- Connection Reuse Strategies — Match your production connection behavior
- Endpoint Documentation — 5+ New Endpoint Documents
New Features
Timing System Overhaul (#565)
A comprehensive redesign of the timing and scheduling infrastructure that enables production-accurate benchmarking:
| Feature | Why It Matters |
|---|---|
| Multi-phase benchmarking | Eliminate cold-start pollution. Warmup phases absorb JIT compilation, KV cache allocation, CUDA kernel compilation, and connection establishment—so your measurements reflect true steady-state performance. |
| User-centric rate mode | Precise KV cache TTL testing. Control exact per-user turn gaps to test whether your cache retains entries for specific durations. Simulates steady-state from the start with virtual history—no cold-start transient. |
| Arrival patterns | Simulate real-world traffic. Poisson and gamma distributions model natural request clustering and bursts, unlike unrealistic constant-interval benchmarks. Compatible with vLLM's --burstiness parameter. |
| Prefill concurrency | Fine-grained TTFT control. Limit simultaneous prefill operations to prevent memory exhaustion and directly control time-to-first-token behavior—especially valuable for disaggregated serving architectures. |
| Dynamic ramping | Prevent server overwhelm. Gradually increase load to avoid connection storms, memory spikes, and misleading metrics from sudden full-load starts. Detect capacity limits early. |
| HTTP trace metrics | Nanosecond-precision debugging. k6/HAR-compatible metrics for DNS lookup, TCP connect, TLS handshake, TTFB, and transfer phases. Correlate with logs using wall-clock timestamps. |
| Sticky user-session routing | Fair multi-turn load balancing. Route all turns of a conversation to the same worker, ensuring consistent KV cache behavior and eliminating cross-worker cache misses. |
| Memory-mapped datasets | Zero-copy dataset access eliminates the DatasetManager network bottleneck at high QPS. Workers read directly from shared files in O(1) time. |
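The arrival-pattern row above describes Poisson and gamma request arrivals. As a hedged sketch of the general technique (not AIPerf's code), Poisson arrivals are produced by drawing exponentially distributed inter-arrival gaps, and gamma-distributed gaps generalize this with a shape parameter that controls burstiness:

```python
import random

def arrival_times(rate: float, count: int, shape: float = 1.0, seed: int = 0) -> list[float]:
    """Generate request arrival timestamps (seconds) at a mean rate of
    `rate` requests/sec. shape == 1.0 gives exponential gaps (a Poisson
    process); shape < 1.0 gives burstier traffic, shape > 1.0 smoother.
    """
    rng = random.Random(seed)
    # Gamma(shape, scale) has mean shape * scale; pick scale so the
    # mean inter-arrival gap is 1 / rate regardless of shape.
    scale = 1.0 / (rate * shape)
    t, times = 0.0, []
    for _ in range(count):
        t += rng.gammavariate(shape, scale)
        times.append(t)
    return times

times = arrival_times(rate=10.0, count=1000)
gaps = [b - a for a, b in zip(times, times[1:])]
print(sum(gaps) / len(gaps))  # mean gap ≈ 0.1 s at 10 req/s
```

Constant-interval load hides queueing effects that the natural clustering of a Poisson or bursty gamma process exposes.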
New CLI Options:
- Warmup: `--warmup-request-count`, `--warmup-duration`, `--warmup-concurrency`
- Arrival patterns: `--arrival-pattern`, `--arrival-smoothness`
- Prefill control: `--prefill-concurrency`
- Ramping: `--concurrency-ramp-duration`, `--request-rate-ramp-duration`
- User-centric: `--user-centric-rate`, `--num-users`
Documentation:
- Timing Modes Reference — Comprehensive timing configuration guide
- Warmup Tutorial — Configure warmup phases
- Arrival Patterns Tutorial — Request arrival distributions
- User-Centric Timing Tutorial — Multi-user simulation
- Ramping Tutorial — Dynamic load ramping
- Prefill Concurrency Tutorial — Concurrent prefill tuning
- HTTP Trace Metrics Tutorial — Request-level timing analysis
- Request Cancellation Tutorial — Cancellation simulation
Prefix Generator Integration (#534)
Benchmark with production-realistic KV cache behavior. Generic synthetic prompts don't capture how real workloads share prefixes—leading to misleading cache hit rate measurements. The prefix synthesis feature lets you:
- Analyze production traces — Extract prefix-sharing patterns, cache hit rates, and sequence length distributions from real request logs
- Synthesize realistic workloads — Generate benchmarking datasets that preserve the statistical properties of your production traffic
- Scale while preserving patterns — Create larger or smaller datasets while maintaining the same prefix reuse characteristics
- Mooncake trace support — Direct compatibility with Mooncake-format trace files used in KV cache research
New CLI Command:
```shell
aiperf analyze-trace --input-file trace.jsonl --output-file analysis.json
```

Documentation:
- Prefix Synthesis Tutorial — Generate realistic workloads
- Synthesis API Reference — API documentation
Mooncake Trace Data Loader Performance (#570)
Load massive datasets without waiting. Production trace files can contain millions of requests. The new parallel data loader dramatically reduces startup time:
- Parallel decoding — Multi-process trace file loading fully utilizes available CPU cores
- Batch processing — Optimized Mooncake trace parsing with efficient batch operations
- Faster iteration — Spend more time benchmarking and less time waiting for data to load
Documentation:
- Trace Replay Guide — Working with trace datasets
Video and Vision Enhancements (#567)
Benchmark Vision Language Models (VLMs) with real media. Test multimodal models with actual video and image content instead of just synthetic data:
- Local video files — Load MP4, WebM, and other video formats directly from your filesystem
- Remote video URLs — Reference videos hosted on CDNs or object storage in your dataset configurations
- Automatic frame extraction — AIPerf handles video decoding and frame sampling for vision endpoint testing
- Synthetic image generation — Generate images with configurable dimensions for controlled benchmarking
Documentation:
- Vision Tutorial — Vision and multimodal endpoints
- Synthetic Video Tutorial — Video data generation
Connection Reuse Strategies (#531)
Match your production connection behavior. Different deployment scenarios require different connection patterns—now you can benchmark with the exact strategy you use in production:
| Strategy | Use Case |
|---|---|
| `pooled` | Standard deployments with connection pooling (default) |
| `never` | Serverless or per-request connection scenarios |
| `sticky-user-sessions` | Load balancers with session affinity—connection persists across all turns of a multi-turn conversation, enabling accurate KV cache benchmarking behind sticky load balancers |
```shell
aiperf profile --connection-reuse-strategy sticky-user-sessions --url http://localhost:8000 -m Qwen/Qwen3-0.6B
```

Documentation:
- Multi-Turn Tutorial — Multi-turn conversation benchmarking
- CLI Options Reference — Full CLI documentation
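Session affinity of the kind described above is commonly implemented by hashing a stable session key to pick a backend, so every turn of a conversation lands on the same worker. A minimal illustrative sketch of that general pattern (the worker names and hash choice are assumptions, not AIPerf's internals):

```python
import hashlib

def route_session(session_id: str, workers: list[str]) -> str:
    """Deterministically map a session/user id to one worker so that all
    turns of a conversation hit the same backend (KV cache reuse)."""
    # A stable hash (unlike Python's per-process salted hash()) keeps
    # routing consistent across processes and runs.
    digest = hashlib.sha256(session_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]

workers = ["worker-0", "worker-1", "worker-2"]
# Every turn of the same conversation routes to the same worker:
assert route_session("user-42", workers) == route_session("user-42", workers)
```

Without this stickiness, turns of one conversation scatter across workers and the measured cache hit rate understates what an affinity-aware deployment would see.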
Server Token Count Support (#496)
Measure what the server actually sees. When your inference server uses a different tokenizer or tokenization strategy than AIPerf's local tokenizer, metrics like tokens-per-second become misleading. Server token count support ensures your benchmarks measure the actual throughput:
- Accurate metrics — Use the server's own token counts for precise throughput calculations
- Reduced overhead — Skip local tokenization entirely, lowering client-side CPU usage
- Reasoning token support — Track reasoning tokens separately when servers report them
```shell
aiperf profile --use-server-token-count --url http://localhost:8000 -m Qwen/Qwen3-0.6B
```

Documentation:
- CLI Options Reference — See the `--use-server-token-count` option
SSL Certificate Verification Bypass (#536)
Benchmark in development and staging environments. Self-signed certificates and internal CAs shouldn't block your performance testing. This option lets you benchmark endpoints in pre-production environments:
```shell
export AIPERF_HTTP_SSL_VERIFY=false
```

Warning: Disabling SSL verification is insecure and should only be used for testing in trusted environments.
Documentation:
- Environment Variables Reference — See `AIPERF_HTTP_SSL_VERIFY`
Improved Error Messages (#563)
Debug faster with actionable errors. When profiling fails due to tokenizer issues, AIPerf now helps you fix the problem:
- Clear descriptions — Understand exactly what went wrong
- Troubleshooting steps — Get specific actions to resolve the issue
- Documentation links — Jump directly to relevant guides
Bug Fixes
| Issue | Fix | PR |
|---|---|---|
| TemplateEndpoint streaming token counting | Corrected output token counting for streaming responses | #542 |
| Plot dashboard crashes | Fixed crashes and incorrect legend labels in dashboard | #526 |
| Multi-run plot crash | Re... |
AIPerf v0.4.0 Release
AIPerf 0.4.0 Release Notes
Major Features
AIPerf Plot Feature (#511)
New aiperf plot command for visualizing benchmark results. See Visualization & Plotting Tutorial.
- Interactive dashboard via `--dashboard` flag with dynamic metric switching, run filtering, and plot customization
- Static PNG export with NVIDIA brand styling (default behavior)
- Automatic mode detection distinguishes single-run time-series from multi-run comparisons
- GPU telemetry integration displaying power, utilization, memory, and throughput correlation
- Timeslice analysis for performance evolution across time windows
- Experiment classification for baseline/treatment color assignment in A/B testing
- Theme support with `--theme dark` option
- YAML configuration via `~/.aiperf/plot_config.yaml`
Server-Side Prometheus Metrics Collection (#488)
Collect server-side metrics from LLM inference server Prometheus endpoints (vLLM, SGLang, TRT-LLM, Dynamo). See Server Metrics Documentation.
- Automatic endpoint discovery from inference server base URL + `/metrics`
- Collection at configurable intervals (default 333 ms) with reachability testing before profiling
- Multiple export formats: JSON (aggregated stats), CSV (tabular), JSONL (time-series), Parquet (with delta calculations)
- Supports Prometheus types: counter, gauge, and histogram with percentile estimation from buckets
- Timesliced statistics when used with `--slice-duration` for windowed analysis
- CLI usage: `--server-metrics URL [URL...]` for additional endpoints
- Disable with: `--no-server-metrics`
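Percentile estimation from Prometheus histogram buckets, mentioned above, is conventionally done by linearly interpolating within the first bucket whose cumulative count crosses the target rank (the same approach PromQL's `histogram_quantile` uses). A hedged sketch of that standard technique, with example latency buckets that are assumptions:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative Prometheus histogram buckets.

    `buckets` is a sorted list of (upper_bound_le, cumulative_count),
    ending with (float("inf"), total_count). Linear interpolation is
    applied inside the bucket containing the target rank.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cum_count in buckets:
        if cum_count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # open-ended bucket: fall back to its lower edge
            span = cum_count - lower_count
            frac = (rank - lower_count) / span if span else 1.0
            return lower_bound + (upper_bound - lower_bound) * frac
        lower_bound, lower_count = upper_bound, cum_count
    return lower_bound

# e.g. request-latency buckets: 40 obs <= 0.1 s, 90 <= 0.5 s, 100 total
buckets = [(0.1, 40.0), (0.5, 90.0), (float("inf"), 100.0)]
print(histogram_quantile(0.5, buckets))  # interpolated median, ≈ 0.18
```

Because values inside a bucket are assumed uniformly distributed, the estimate's accuracy depends on how finely the server's bucket boundaries are chosen.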
Shared System Prompt and User Context Prompt (#506)
New CLI options for prompt composition:
- `--shared-system-prompt-length`: Single system prompt shared across all conversations
- `--user-context-prompt-length`: Per-conversation user context prompts for KV cache testing
OpenAI Image Generation Endpoint Support (#468)
Native support for benchmarking OpenAI-compatible image generation endpoints (e.g., SGLang). New image_generation endpoint type with response parsing for both base64 and URL-based image outputs. See SGLang Image Generation Tutorial.
Vision Endpoint and Image Metrics (#450)
New metrics for vision and multimodal workloads:
- `num_images`: Total images processed across conversation turns
- `image_throughput`: Images processed per unit time
- `image_latency`: Latency per individual image
Rankings Enhancements
See Rankings Tutorial.
- Synthetic Data for Rankings (#440): Generate synthetic ranking workloads without external datasets
- Rankings Prompt and Query Token Options (#498): Configure prompt and query token lengths for ranking benchmarks
GPU Telemetry Improvements
- JSONL Export (#441): Export GPU telemetry data to JSONL format for external analysis
- Custom Metrics via CSV (#424): Define custom GPU metric configurations using CSV files
Video Generation Enhancements
- WebM and VP9 Support (#434): Generate synthetic video in WebM format with VP9 codec for video benchmarking workloads
Auto-Detect Custom Dataset Type (#399)
Automatic inference of --custom-dataset-type from --input-file content. See Benchmark Datasets.
- Examines file structure to classify datasets without explicit user specification
- Auto-selects an appropriate `--dataset-sampling-strategy` based on loader capabilities
- Supports single-turn, multi-turn, Mooncake trace, and random pool formats
- Manual override still available when needed
New Metrics
- Total Token Throughput (#500): New aggregate throughput metric across all tokens
- Time to First Output (TTFO) added to dashboard (#446)
- Renamed `prefill_throughput` to `prefill_throughput_per_user` for clarity
Usability Improvements
- Legacy Max Tokens Option (#481): CLI flag `--use-legacy-max-tokens` for compatibility with older API versions
- API Error Parsing (#482): Parse and display helpful error messages when `max_completions_tokens` is not supported by the server
Bug Fixes
- Set default dataset sampling strategy (#519, #521)
- Fixed stack trace when error message is not JSON string (#505)
- Removed reasoning content from multi-turn conversations (#499)
- Fixed concurrency validation when request_count is not set (#480)
- Fixed invalid parsed response records conversion to error records (#477)
- Fixed error when concurrency exceeds request count (#475)
- Fixed timeout when dataset configuration takes too long (#471)
- Fixed duplicate requests in fixed schedule for multi-turn conversations (#444)
- Fixed ZMQ context termination deadlock issue (#469)
Documentation
- Auto-generated CLI Options Reference from cyclopts app (#476)
- Auto-generated Environment Variables documentation (#487)
Infrastructure & Maintenance
- Upgraded Python base container to 3.13.11
- Added Python 3.13 support to GitHub Actions
- Updated dependencies: matplotlib 3.10.0+, aiohttp 3.13.3+, pydantic 2.10+, cyclopts v4
- Updated container ffmpeg to 8.0.1
- Added Contributor License Agreement
- Optimized base pydantic models with `exclude_none` (#426)
New Contributors
- Lei Gao (@leigao97) - First external community contributor! (#444)
- Anant Sharma (@nv-anants)
Known Issues
- Server metrics timeout on unreachable endpoints: When a server metrics endpoint is not reachable, the benchmark may time out instead of gracefully continuing. Workaround: use `--no-server-metrics` to disable server metrics collection if the Prometheus endpoint is unavailable.
- Shared/user context prompts not included in ISL: Tokens from `--shared-system-prompt-length` and `--user-context-prompt-length` are not included in input sequence length (ISL) metric calculations.
- MP4 video generation incompatible with NVIDIA NIM: Generated MP4 videos use the fragmented `empty_moov` format, which is incompatible with NVIDIA NIM video endpoints. Workaround: use `--video-format webm` instead.
AIPerf v0.3.0 Release
AIPerf - Release 0.3.0
Summary
AIPerf 0.3.0 focuses on advanced metrics and analytics, endpoint ecosystem expansion, and developer experience improvements. In this release, timeslice metrics enable fine-grained temporal analysis of LLM performance, multi-turn conversation support reflects real-world chat patterns, and GPU telemetry provides comprehensive observability. The endpoint ecosystem expands with Hugging Face, Cohere, and Solido integrations, while infrastructure improvements enhance reproducibility, cross-platform support, and extensibility. AIPerf seamlessly supports benchmarking across all major LLM serving platforms including OpenAI-compatible endpoints, custom HTTP APIs via Jinja templates, and specialized endpoints for embeddings, rankings, and RAG systems.
Advanced Metrics & Analytics
AIPerf 0.3.0 introduces timeslice metrics for temporal performance analysis, allowing users to slice benchmark results by time duration for identifying performance degradation and anomalies. The new time-to-first-output (non-reasoning) metric provides accurate measurement of user-perceived latency by excluding reasoning tokens. Enhanced server token count parsing enables direct comparison with client-side measurements, while raw request/response export facilitates debugging and analysis of LLM interactions.
Endpoint Ecosystem Expansion
This release expands AIPerf's compatibility with major LLM serving platforms through native support for Hugging Face TEI (Text Embeddings Inference), Hugging Face TGI (Text Generation Inference), Cohere Rankings API, and Solido RAG endpoints. The new custom payload template system with Jinja support enables benchmarking of arbitrary HTTP APIs, while the decoupled endpoint/transport architecture accelerates plugin development for new platforms.
Reproducibility & Developer Experience
AIPerf 0.3.0 strengthens reproducibility with an order-independent RandomGenerator system that ensures consistent results across runs regardless of execution order. Infrastructure modernization includes moving to a src/ directory layout, comprehensive e2e integration tests with a built-in mock server, and cross-platform support for Python 3.10-3.13 on Ubuntu, macOS, and Windows. Dataset flexibility improves with new sampler implementations and separation of dataset entries from conversation count configuration.
Major Features & Improvements
Timeslice Metrics
- Timeslice Duration Option: Added the `--slice-duration` option for time-sliced metric analysis (#300), enabling performance monitoring over configurable time windows for detecting degradation patterns and anomalies.
- Timeslice Export Formats: Implemented JSON and CSV output formats for timeslice metrics (#411), providing flexible data export for visualization and analysis tools.
- Timeslice Calculation Pipeline: Added timeslice metric result calculation and handover to ExportManager (#378), integrating temporal analysis into the core metrics pipeline.
- Timeslice Documentation: Comprehensive tutorial documentation for timeslice metrics feature (#420), including usage examples and interpretation guidance.
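Time-sliced analysis of the sort described above amounts to bucketing per-request records by `floor((t - t0) / slice_duration)` and aggregating each bucket. An illustrative sketch of that idea (field names and aggregation choice are assumptions, not AIPerf's pipeline):

```python
from collections import defaultdict

def timeslice(records: list[tuple[float, float]], slice_s: float) -> dict[int, float]:
    """Bucket (timestamp_s, latency_s) records into fixed windows and
    return the mean latency per slice index (slice 0 starts at the
    earliest timestamp)."""
    if not records:
        return {}
    t0 = min(t for t, _ in records)
    by_slice: dict[int, list[float]] = defaultdict(list)
    for t, latency in records:
        by_slice[int((t - t0) // slice_s)].append(latency)
    return {idx: sum(v) / len(v) for idx, v in sorted(by_slice.items())}

records = [(0.0, 0.2), (0.4, 0.3), (1.1, 0.8), (1.9, 0.6)]
print(timeslice(records, slice_s=1.0))  # slice 0 mean 0.25, slice 1 mean ≈ 0.7
```

A latency jump confined to later slice indices points at degradation over the run (e.g. KV cache pressure) rather than uniformly slow requests.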
Multi-Turn Conversations
- Multi-Turn Support: Full implementation of multi-turn conversation benchmarking (#360), enabling realistic evaluation of chatbot and assistant workloads with conversation context and state management.
- Inter-Turn Delays: Configurable delays between conversation turns (#452, #455), simulating realistic user think time and typing patterns for accurate throughput modeling.
Custom Endpoint Integration
- Jinja Template Payloads: Fully custom Jinja template support for endpoint payloads (#406) with autoescape security (#461), enabling benchmarking of arbitrary HTTP APIs and custom LLM serving frameworks.
- Endpoint/Transport Decoupling: Refactored architecture to decouple endpoints and transports (#389), accelerating development of new endpoint plugins and improving code maintainability.
- URL Flexibility: Support for a `/v1` suffix in URLs (#349), simplifying endpoint configuration for OpenAI-compatible servers.
Endpoint Integrations
- Hugging Face TEI: Added support for Hugging Face Text Embeddings Inference endpoints with rankings API (#398, #419), enabling benchmarking of embedding and ranking workloads.
- Hugging Face TGI: Native support for Hugging Face Text Generation Inference generate endpoints (#412, #419), expanding compatibility with popular open-source serving.
- Cohere Rankings API: Integration with Cohere Rankings API (#398, #419) for benchmarking reranking and retrieval-augmented generation pipelines.
- Solido RAG Endpoints: Support for Solido RAG endpoints (#396), enabling evaluation of retrieval-augmented generation systems.
GPU Telemetry & Observability
- Real-Time Dashboard: GPU telemetry real-time dashboard display (#370) with live metrics visualization for monitoring GPU utilization, memory, power, and temperature during benchmarks.
- DCGM Simulator: Realistic DCGM metrics simulator (#361) for testing telemetry pipelines without physical GPUs, improving development workflows.
- Endpoint Reachability: Improved GPU telemetry endpoint reachability logging (#397) with better error messages when DCGM endpoints are unavailable.
- Default Endpoints: Added `http://localhost:9400/metrics` to the default telemetry endpoints (#369) for easier local development.
Video Generation
- WebM/VP9 Support: Added WebM container and VP9 codec support to video generator (#460), enabling efficient video compression for multimodal benchmarking.
- Video Tutorial: Comprehensive video generation tutorial documentation (#409), covering configuration and usage patterns.
Metrics & Accuracy
- Time to First Output (Non-Reasoning): New metric excluding reasoning tokens (#359) with migration guide (#365), providing accurate measurement of user-perceived latency for reasoning models.
- Server Token Counts: Parse and report server-provided usage data (#405), enabling validation of client-side token counting and detecting discrepancies.
- Error Record Conversion: Convert invalid parsed responses to error records (#416), ensuring proper tracking of malformed responses in metrics calculations.
- SSE Error Parsing: Enhanced SSE parsing to detect and handle error events from Dynamo and other servers (#385), improving error attribution.
- Nested Input Parsing: Fixed parsing of nested lists/tuples for extra inputs (#318), enabling complex structured inputs in benchmarks.
Reproducibility
- Order-Independent RNG: Hardened reproducibility with order-independent RandomGenerator system (#415), ensuring consistent results across runs regardless of async execution order and message arrival timing.
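The core idea behind an order-independent generator can be sketched as deriving each consumer's RNG stream from a stable key rather than from call order, so async scheduling cannot perturb results. This is a minimal illustration (function and key names are hypothetical, not AIPerf's actual API):

```python
import hashlib
import random


def rng_for(base_seed: int, key: str) -> random.Random:
    """Derive an independent RNG from a base seed and a stable key.

    Because the stream depends only on (base_seed, key), each consumer
    gets identical values no matter which async task asks first.
    """
    digest = hashlib.sha256(f"{base_seed}:{key}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


# The same key always yields the same stream, regardless of request order.
a = rng_for(42, "conversation-7").random()
b = rng_for(42, "conversation-7").random()
assert a == b
```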
Dataset & Configuration
- Dataset Samplers: New dataset sampler implementations (#395) for flexible sampling strategies including random, sequential, and weighted selection.
- Dataset Entries Option: Separated the `--dataset-entries` CLI option from `--num-conversations` (#421) with updated documentation (#430), clarifying configuration semantics and enabling independent control.
- Environment Settings: Moved constants to Pydantic environment settings (#390), improving configurability and enabling environment-based overrides.
Developer Experience
- Project Structure: Moved aiperf into a `src/` directory (#387) following Python community conventions, improving packaging and import semantics.
- Mock Server Auto-Install: `make install` auto-installs the mock server (#382), streamlining local development setup.
- E2E Integration Tests: Comprehensive e2e integration tests with a mock server covering all endpoints (#377), improving test coverage and catching integration regressions.
- Cross-Platform Support:
  - Python Version Support: Unit tests on Python 3.10, 3.11, and 3.12 across Ubuntu and macOS (#356), ensuring broad compatibility.
  - Docker Compliance: Dockerfile OSRB compliance (#337) and Python 3.13 support (#454, #478).
- Verbose Logging: The `-v` and `-vv` flags auto-enable simple UI mode (#401), with override capability for customization.
User Experience
- Fullscreen Logs: Show logs fullscreen until first progress messages (#402), improving visibility of startup diagnostics and errors.
- Dashboard Screenshot: Added dashboard screenshot to README (#371), helping users understand telemetry capabilities.
- Request-Rate Documentation: Comprehensive documentation on request-rate with max concurrency (#380), clarifying load generation behavior.
Performance & Stability
- Goodput Calculation: Fixed goodput release calculation issues (#373), ensuring accurate reporting of successful request throughput.
- SSE Chunk Parsing: Fixed SSE parsing when multiple messages arrive in a single buffered chunk (#368), preventing message loss and corruption.
- Task Cancellation: Wait for flush tasks to finish before cancelling (#404), preventing data loss during shutdown.
- Log Queue Cleanup: Added timeout for log queue cleanup (#393), preventing deadlocks during service shutdown.
- ZMQ Context Termination: Fixed ZMQ context termination and TimeoutError issues (#474), improving clean shutdown behavior.
- GPU Telemetry Timing: Fixed Telemetry Manager shutdown race condition (#367), preventing profile start failures.
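The multi-message SSE fix (#368) addresses the case where a single network read contains several events. The essential buffering logic can be sketched as follows (a simplified model, not the actual AIPerf parser):

```python
def split_sse_events(buffer: str) -> tuple[list[str], str]:
    """Split a raw SSE buffer into complete events and a leftover tail.

    SSE events are separated by a blank line; anything after the last
    separator may be a partial event and must be kept for the next read.
    """
    parts = buffer.split("\n\n")
    # The final element is incomplete until another "\n\n" arrives.
    return [p for p in parts[:-1] if p.strip()], parts[-1]


events, tail = split_sse_events("data: one\n\ndata: two\n\ndata: thr")
assert events == ["data: one", "data: two"]
assert tail == "data: thr"
```

Keeping the tail in the buffer, rather than assuming one read equals one event, is what prevents message loss when events straddle chunk boundaries.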
Documentation
- Timeslice Tutorial: Tutoria...
AIPerf v0.2.0
AIPerf Release Notes
Summary
AIPerf v0.2.0 introduces time-based benchmarking with configurable grace periods and request cancellation capabilities. The release adds advanced metrics including goodput measurement for SLO compliance, GPU telemetry, and inter-chunk latency tracking.
New Features
Time-Based Benchmarking
- Time-based benchmarking support - Run benchmarks for a specified duration with configurable grace periods for more realistic testing scenarios
- Benchmark grace period - Added grace period functionality to allow for proper warmup and cooldown phases during benchmarking
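Conceptually, time-based benchmarking stops issuing new requests at the duration boundary and then waits up to a grace period for in-flight requests to drain. A simplified sketch (parameter and callback names are illustrative, not AIPerf's API):

```python
import time


def run_timed(duration_s: float, grace_s: float, send_fn, in_flight) -> None:
    """Issue requests for `duration_s`, then drain for up to `grace_s`.

    `send_fn` fires one request; `in_flight` reports whether any
    requests are still outstanding.
    """
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        send_fn()  # no new requests are issued past the deadline
    drain_deadline = time.monotonic() + grace_s
    while in_flight() and time.monotonic() < drain_deadline:
        time.sleep(0.01)  # grace period: let outstanding requests finish
```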
Request Management & Control
- Request cancellation - Added ability to cancel requests during benchmarking to test timeout behavior and service resilience
- Fixed-schedule for Mooncake traces - Enhanced trace replay with fixed-schedule detection and support for non-fixed-schedule trace formats
- Request rate with concurrency limits - Added ability to limit HTTP connections and control request concurrency for more realistic load testing
Advanced Metrics & Monitoring
- Goodput metric - Added goodput metric to measure throughput of requests meeting user-defined SLOs, with comprehensive tutorial support
- GPU Telemetry - Integrated GPU monitoring and telemetry collection for comprehensive performance analysis
- Inter-chunk-latency metric - Added inter-chunk latency tracking using raw value lists for detailed streaming performance analysis
- Total ISL/OSL metrics - Added total input/output sequence length metrics with improved CSV/JSON export support
- Per-record metrics export - Enhanced profile export with per-record metrics in
profile_export.jsonl - Mixed ISL/OSL distributions - Support for mixed input/output sequence length distributions in benchmarking
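Goodput counts only the requests that meet every user-defined SLO. A hedged sketch of the computation (record and metric field names here are illustrative, not AIPerf's export schema):

```python
def goodput(records: list[dict], slos: dict[str, float], duration_s: float) -> float:
    """Requests per second that satisfied all SLO thresholds.

    Each SLO maps a metric name to a maximum allowed value, e.g.
    {"ttft_ms": 200.0, "e2e_latency_ms": 2000.0}.
    """
    good = sum(
        1
        for r in records
        if all(r.get(metric, float("inf")) <= limit for metric, limit in slos.items())
    )
    return good / duration_s


records = [
    {"ttft_ms": 120.0, "e2e_latency_ms": 900.0},
    {"ttft_ms": 450.0, "e2e_latency_ms": 900.0},  # violates the TTFT SLO
]
rate = goodput(records, {"ttft_ms": 200.0}, duration_s=10.0)
assert rate == 0.1  # 1 good request / 10 s
```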
Video & Multimedia Support
- Synthetic video support - Added support for video benchmarking and synthetic video generation
Enhanced Data Management
- Inputs.json file for dataset traceability - Added dataset traceability through inputs.json file generation
- Request traceability headers - Added X-Request-Id and X-Correlation-Id headers for improved request tracking
Bug Fixes
Core Functionality
- ZMQ graceful termination - Fixed graceful termination of ZMQ context to prevent hanging processes
- Worker count limits - Capped default maximum workers to 32 to prevent resource exhaustion
- Race conditions in credit issuing - Fixed race conditions in credit issuing strategy for more stable performance
- Startup error handling - Improved error handling during startup with clear error messages and proper process exit
Request Processing
- Empty choices array handling - Fixed IndexError when OpenAI choices array is empty
- Request metadata validation - Fixed bug with request metadata validation for failed requests
Export & Data Handling
- CSV export logic - Fixed CSV export parsing to ensure correct data formatting
- JSONL file writing - Resolved issues with writing to JSONL files
- GenAI-Perf JSON format compatibility - Fixed JSON summary export to match GenAI-Perf format for better compatibility
Platform-Specific Fixes
- macOS Textual UI Dashboard - Fixed compatibility issues with Textual UI Dashboard on macOS systems
- Image test random seed - Set proper random seeds for image tests to fix sporadic test failures
Telemetry & Performance
- Telemetry Manager shutdown timing - Fixed issue where Telemetry Manager shuts down before profile configuration finishes
- Goodput release issues - Cherry-picked fix for goodput-related release problems
- CPU usage warnings - Added warnings when worker CPU usage exceeds 85% to help identify performance bottlenecks
Documentation & Tutorials
New Documentation
- Goodput tutorial - Complete tutorial on using the goodput metric for SLO validation
- Advanced features tutorials - Tutorials covering advanced benchmarking features
- Trace replay tutorial with real data - Updated trace replay tutorial with real Mooncake data examples
- Feature comparison with GenAI-Perf - Added detailed feature comparison matrix between AIPerf and GenAI-Perf
Infrastructure & Development
Build & Dependencies
- Flexible dependencies - Made package dependencies more flexible for better compatibility
- PyProject.toml cleanup - Cleaned up and organized pyproject.toml configuration
- License field compliance - Updated pyproject.toml license field for wheeltamer compliance
- Dependency updates - Removed pandas dependency (now using numpy only) and updated numpy to 1.26.4
Refactoring & Performance
Core Components
- Credit processor refactoring - Refactored credit processing system for better performance and maintainability
- Console output enhancements - Added median values to console output for better statistical insight
Performance Optimizations
- Performance test marking - Properly marked SSE tests as performance tests for better test organization
Known Issues
- InvalidStateError - Logs show an InvalidStateError during benchmarking. This is handled gracefully and will not impact benchmark results.
Initial release of AIPerf v0.1.1
Release Highlights:
The initial release of AIPerf, the successor to GenAI-Perf, delivers extensive benchmarking capabilities.
AIPerf is written entirely in Python, offering an easy installation with a modular design for user extensibility.
Major Features of AIPerf
Comprehensive Benchmarking
- Detailed Performance Metrics: Measures throughput, latency, and comprehensive token-level metrics for generative AI models
- Flexible Data Sources: Supports both synthetic and dataset-driven input modes
Scalable Load Generation
- Parallel Processing: Multiprocess support for local scaling
- Configurable Load Patterns: High concurrency and request-rate modes with configurable patterns
Trace Replay
- Production Workload Simulation: Reproduce real-world or synthetic workload traces for validation and stress testing
- Industry Standard Formats: Supports the Mooncake trace format and custom JSONL datasets when using the `--fixed-schedule` option
Flexible Model and Endpoint Support
- Universal Compatibility: Works with OpenAI-compatible APIs, including vLLM, Dynamo, and other compatible services
- OpenAI APIs: Chat completions, completions, and embeddings supported
Advanced Input and Output Configuration
- Granular Token Control: Fine-grained control over input/output token counts and streaming
- Extended Request Support: Pass extra inputs and custom payloads to endpoints
Rich Reporting and Export
- Multiple Export Options: Console, CSV, and JSON output formats for results
- Artifact Management: Artifact directory support for saving logs and metrics
Automation and Integration
- CLI-First Design: CLI-first workflow for scripting and automation
- Deployment Flexibility: Compatible with containerized and cloud environments
Security and Customization
- Security and Authentication: Support for custom headers, authentication, and advanced API options
- Deterministic Testing: Random seed and reproducibility controls
Console UI Options
- Real-Time Monitoring: Real-time metrics dashboard with live progress tracking and worker status monitoring
- Multiple UI Modes: Simple UI mode for streamlined monitoring and headless mode for automated environments
Key Improvements Over GenAI-Perf
AIPerf introduces several enhancements over GenAI-Perf:
Performance & Scaling
- Distributed Architecture: Scalable service-oriented design built for horizontal scalability
- Python Multiprocessing: Native multiprocessing implementation with automatic worker provisioning and lifecycle management, enabling true parallel load generation from a single node
- Request-Rate with Max Concurrency: Combine request-rate control with concurrency limits to throttle requests or provide controlled ramp-up to prevent burst traffic
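Combining a target request rate with a concurrency cap can be modeled as a rate-paced dispatcher gated by an asyncio semaphore: dispatch happens at the requested rate, but once the cap is reached, new requests queue behind it. A simplified sketch (not AIPerf's actual scheduler):

```python
import asyncio


async def paced_load(rate_per_s: float, max_concurrency: int, total: int, do_request):
    """Fire `total` requests at `rate_per_s`, never exceeding `max_concurrency`."""
    sem = asyncio.Semaphore(max_concurrency)
    interval = 1.0 / rate_per_s

    async def one():
        async with sem:  # in-flight requests are capped here
            await do_request()

    tasks = []
    for _ in range(total):
        tasks.append(asyncio.create_task(one()))
        await asyncio.sleep(interval)  # paces the dispatch rate
    await asyncio.gather(*tasks)
```

Because pacing and the concurrency cap are independent, this naturally throttles bursts: the rate controls how fast work is offered, while the semaphore bounds how much runs at once.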
User Experience
- Live Dashboard: Interactive terminal-based UI with real-time metrics visualization, progress tracking, and worker status monitoring
- Multiple UI Modes: Dashboard mode for interactive use, simple mode for streamlined monitoring, and headless mode for automation
Observability & Control
- API Error Analytics: Comprehensive tracking and categorization of request failures with detailed error summaries grouped by failure reason
- Early Termination Support: Cancel benchmarks mid-run while preserving all completed results and metrics
Extensibility & Integration
- Pure Python Architecture: Eliminates complex mixed-language dependencies for simpler installation, deployment, and customization
- ShareGPT Integration: Automatic download, caching, and conversation processing of public datasets
Installation
`pip install aiperf`
Migration from GenAI-Perf
AIPerf is designed to be a drop-in replacement for GenAI-Perf for currently supported features. To migrate your existing GenAI-Perf commands, please refer to the Migrating from GenAI-Perf documentation.
Getting Started
Please refer to the Tutorials documentation for information on how to use AIPerf.
Additional Information
Known Issues
- Output sequence length constraints (`--output-tokens-mean`) cannot be guaranteed unless you pass `ignore_eos` and/or `min_tokens` via `--extra-inputs` to an inference server that supports them.
- A few options in the CLI help text inconsistently use underscores instead of hyphens.
- Very high concurrency settings (typically >15,000 concurrency) may lead to port exhaustion on some systems, causing connection failures during benchmarking. If encountered, consider adjusting system limits or reducing concurrency.
- Startup errors caused by invalid configuration settings can cause AIPerf to hang indefinitely. If AIPerf appears to freeze during initialization, terminate the process and check configuration settings.
- The Mooncake trace format currently requires the `--fixed-schedule` option to be set.
- The dashboard UI may emit corrupted ANSI sequences on macOS or certain terminal environments, making the terminal unusable. Run the `reset` command to restore normal terminal functionality, or switch to `--ui simple` for a lightweight progress-bar interface.