Model name(s) to be benchmarked. Can be a comma-separated list or a single model name.
When multiple models are specified, this determines how a model is assigned to each prompt. round_robin: the nth prompt is assigned to model n mod len(models). random: assignment is uniformly random.
Choices: [round_robin, random]
Default: round_robin
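The two assignment strategies can be illustrated with a minimal sketch (this is not AIPerf's implementation; the function name is hypothetical):

```python
# Illustrative sketch of round_robin vs. random model assignment.
import random

def assign_models(prompts, models, strategy="round_robin", seed=None):
    """Return the model assigned to each prompt, in order."""
    rng = random.Random(seed)
    if strategy == "round_robin":
        # The nth prompt gets model n mod len(models).
        return [models[n % len(models)] for n in range(len(prompts))]
    if strategy == "random":
        # Uniformly random choice per prompt, independent of position.
        return [rng.choice(models) for _ in prompts]
    raise ValueError(f"unknown strategy: {strategy}")

prompts = ["p0", "p1", "p2", "p3", "p4"]
models = ["model-a", "model-b"]
print(assign_models(prompts, models))
# ['model-a', 'model-b', 'model-a', 'model-b', 'model-a']
```

With round_robin and two models, prompts alternate deterministically between them; with random, the long-run split is roughly even but individual assignments vary.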
Set a custom endpoint that differs from the OpenAI defaults.
The endpoint type to send requests to on the server.
Choices: [chat, completions, cohere_rankings, embeddings, hf_tei_rankings, huggingface_generate, image_generation, nim_rankings, solido_rag, template]
Default: chat
Enables the use of the streaming API.
URL of the endpoint to target for benchmarking.
Default: localhost:8000
The timeout in floating-point seconds for each request to the endpoint.
Default: 600.0
The API key to use for the endpoint. If provided, it will be sent with every request as a header: Authorization: Bearer <api_key>.
The transport to use for the endpoint. If not provided, it will be auto-detected from the URL. This can also be used to force an alternative transport or implementation.
Choices: [http]
Use the legacy 'max_tokens' field instead of 'max_completion_tokens' in request payloads. The OpenAI API now prefers 'max_completion_tokens', but some older APIs or implementations may require 'max_tokens'.
Provide additional inputs to include with every request. Inputs should be in an 'input_name:value' format. Alternatively, a string representing a json formatted dict can be provided.
Default: []
Adds a custom header to the requests. Headers must be specified as 'Header:Value' pairs. Alternatively, a string representing a json formatted dict can be provided.
Default: []
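Both the extra-inputs and custom-header options accept the same two forms: a list of 'name:value' pairs or a single JSON-formatted dict string. A minimal sketch of how such values can be parsed (the function name is hypothetical, not AIPerf's API):

```python
# Illustrative parser for 'name:value' pairs or a JSON dict string.
import json

def parse_pairs(values):
    """Parse a list of 'name:value' strings, or a single JSON dict string."""
    if len(values) == 1 and values[0].lstrip().startswith("{"):
        return json.loads(values[0])  # JSON-formatted dict form
    result = {}
    for item in values:
        # Split on the first colon only, so values may contain colons.
        name, _, value = item.partition(":")
        result[name] = value
    return result

print(parse_pairs(["Authorization:Bearer abc123"]))
print(parse_pairs(['{"temperature": 0.5, "top_p": 0.9}']))
```

Splitting on the first colon only matters for headers like `Authorization:Bearer abc123`, where the value itself contains no colon but other values (e.g. URLs) might.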
The file or directory path that contains the dataset to use for profiling. This parameter is used in conjunction with the custom_dataset_type parameter to support different types of user-provided datasets.
Runs a fixed schedule of requests. This is normally inferred from the --input-file parameter, but can be set manually here.
Automatically offsets the timestamps in the fixed schedule so that the first timestamp becomes 0 and the rest are shifted accordingly. If disabled, the timestamps are assumed to be relative to 0.
Specifies the offset in milliseconds at which to start the fixed schedule. By default, the schedule starts at 0, but this option can be used to start at a reference point further into the schedule. This option cannot be used in conjunction with --fixed-schedule-auto-offset. The schedule will include any requests at the start offset.
Specifies the offset in milliseconds at which to end the fixed schedule. By default, the schedule ends at the last timestamp in the trace dataset, but this option can be used to run only a subset of the trace. The schedule will include any requests at the end offset.
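The interaction of auto-offset with the start/end offsets can be sketched as follows (illustrative only, not AIPerf's code; note that both window boundaries are inclusive):

```python
# Illustrative sketch of fixed-schedule offset semantics.
def apply_schedule_options(timestamps_ms, auto_offset=False, start=None, end=None):
    """Return the subset of schedule timestamps to run, in order."""
    ts = sorted(timestamps_ms)
    if auto_offset:
        first = ts[0]
        ts = [t - first for t in ts]  # shift so the first timestamp is 0
    if start is not None:
        ts = [t for t in ts if t >= start]  # requests at the start offset included
    if end is not None:
        ts = [t for t in ts if t <= end]    # requests at the end offset included
    return ts

print(apply_schedule_options([500, 1500, 2500], auto_offset=True))
# [0, 1000, 2000]
print(apply_schedule_options([0, 1000, 2000, 3000], start=1000, end=2000))
# [1000, 2000]
```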
The public dataset to use for the requests.
Choices: [sharegpt]
The type of custom dataset to use. This parameter is used in conjunction with the --input-file parameter.
Choices: [single_turn, multi_turn, random_pool, mooncake_trace]
The strategy to use for sampling the dataset. sequential: iterate through the dataset in order, wrapping around to the beginning. random: randomly select a conversation from the dataset, sampling with replacement. shuffle: shuffle the dataset and iterate through it, sampling without replacement; once the end of the dataset is reached, the dataset is shuffled again and iteration starts over.
Choices: [sequential, random, shuffle]
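The practical difference between the strategies is whether and when duplicates can appear. A minimal sketch (not AIPerf's implementation):

```python
# Illustrative sketch of the three dataset sampling strategies.
import random

def sample(dataset, strategy, n, seed=0):
    rng = random.Random(seed)
    if strategy == "sequential":
        # Iterate in order, wrapping around to the beginning.
        return [dataset[i % len(dataset)] for i in range(n)]
    if strategy == "random":
        # With replacement: duplicates may appear before the dataset is exhausted.
        return [rng.choice(dataset) for _ in range(n)]
    if strategy == "shuffle":
        # Without replacement: reshuffle only after the dataset is exhausted.
        out, pool = [], []
        for _ in range(n):
            if not pool:
                pool = dataset[:]
                rng.shuffle(pool)
            out.append(pool.pop())
        return out
    raise ValueError(strategy)

data = ["a", "b", "c"]
print(sample(data, "sequential", 5))  # ['a', 'b', 'c', 'a', 'b']
```

With shuffle, every pass through the dataset covers each entry exactly once before any entry repeats; with random, repeats can occur immediately.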
The seed used to generate random values. Set this to make synthetic data generation deterministic. If not provided, the system default is used.
Specify service level objectives (SLOs) for goodput as space-separated 'KEY:VALUE' pairs, where KEY is a metric tag and VALUE is a number in the metric's display unit (falling back to its base unit if no display unit is defined). Examples: 'request_latency:250' (ms), 'inter_token_latency:10' (ms), 'output_token_throughput_per_user:600' (tokens/s). Only metrics applicable to the current endpoint/config are considered. For more context on the definition of goodput, refer to the DistServe paper: https://arxiv.org/pdf/2401.09670 and the blog: https://hao-ai-lab.github.io/blogs/distserve.
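Goodput counts a request only if it meets every specified SLO. A minimal sketch of that definition (not AIPerf's code; the sketch assumes latency-style metrics where lower is better — a throughput SLO would use >= instead):

```python
# Illustrative goodput calculation: a request is "good" only if it
# satisfies ALL SLO thresholds (lower-is-better metrics assumed here).
def goodput(requests, slos):
    """Fraction of requests whose metrics satisfy every SLO threshold."""
    def meets_all(metrics):
        return all(metrics[key] <= limit for key, limit in slos.items())
    good = sum(1 for metrics in requests if meets_all(metrics))
    return good / len(requests)

requests = [
    {"request_latency": 200, "inter_token_latency": 8},   # meets both SLOs
    {"request_latency": 300, "inter_token_latency": 8},   # latency SLO violated
    {"request_latency": 240, "inter_token_latency": 12},  # ITL SLO violated
    {"request_latency": 100, "inter_token_latency": 5},   # meets both SLOs
]
slos = {"request_latency": 250, "inter_token_latency": 10}
print(goodput(requests, slos))  # 0.5
```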
The batch size of audio requests AIPerf should send. This is currently supported with the OpenAI chat endpoint type.
Default: 1
The mean length of the audio in seconds.
Default: 0.0
The standard deviation of the length of the audio in seconds.
Default: 0.0
The format of the audio files (wav or mp3).
Choices: [wav, mp3]
Default: wav
A list of audio bit depths (in bits) to randomly select from.
Default: [16]
A list of audio sample rates (in kHz) to randomly select from. Common sample rates are 16, 44.1, 48, 96, etc.
Default: [16.0]
The number of audio channels to use for the audio data generation.
Default: 1
The mean width of images when generating synthetic image data.
Default: 0.0
The standard deviation of width of images when generating synthetic image data.
Default: 0.0
The mean height of images when generating synthetic image data.
Default: 0.0
The standard deviation of height of images when generating synthetic image data.
Default: 0.0
The image batch size of the requests AIPerf should send.
Default: 1
The compression format of the images.
Choices: [png, jpeg, random]
Default: png
The video batch size of the requests AIPerf should send.
Default: 1
Seconds per clip.
Default: 5.0
Frames per second. The recommended value for Cosmos is 4.
Default: 4
Video width in pixels.
Video height in pixels.
Synthetic generator type.
Choices: [moving_shapes, grid_clock]
Default: moving_shapes
The video format of the generated files.
Choices: [mp4, webm]
Default: webm
The video codec to use for encoding. Common options: libvpx-vp9 (CPU, BSD-licensed, default for WebM), libx264 (CPU, GPL-licensed, widely compatible), libx265 (CPU, GPL-licensed, smaller files), h264_nvenc (NVIDIA GPU), hevc_nvenc (NVIDIA GPU, smaller files). Any FFmpeg-supported codec can be used.
Default: libvpx-vp9
The batch size of text requests AIPerf should send. This is currently supported with the embeddings and rankings endpoint types.
Default: 1
The mean number of tokens in the generated prompts when using synthetic data.
Default: 550
The standard deviation of the number of tokens in the generated prompts when using synthetic data.
Default: 0.0
The block size of the prompt.
Default: 512
Sequence length distribution specification for varying input/output sequence length (ISL/OSL) pairs.
The mean number of tokens in each output.
The standard deviation of the number of tokens in each output.
Default: 0
The total size of the prefix prompt pool to select prefixes from. If this value is not zero, these prompts are prepended to input prompts. This is useful for benchmarking models that use a KV cache.
Default: 0
The number of tokens in each prefix prompt. This is only used if the prefix prompt pool size is greater than zero. Note that because the prefix and user prompts are concatenated, the number of tokens in the final prompt may be off by one.
Default: 0
Length of shared system prompt in tokens. This prompt is identical across all sessions and appears as a system message. Mutually exclusive with --prefix-prompt-length/--prefix-prompt-pool-size.
Length of per-session user context prompt in tokens. Each session gets a unique user context prompt. Requires --num-sessions to be specified. Mutually exclusive with --prefix-prompt-length/--prefix-prompt-pool-size.
Mean number of passages per rankings entry (per query).
Default: 1
Standard deviation of the number of passages per rankings entry.
Default: 0
Mean number of tokens in a passage entry for rankings.
Default: 550
Standard deviation of the number of tokens in a passage entry for rankings.
Default: 0
Mean number of tokens in a query entry for rankings.
Default: 550
Standard deviation of the number of tokens in a query entry for rankings.
Default: 0
The total number of unique conversations to generate. Each conversation represents a single request session between client and server. Supported in synthetic mode and with the custom random_pool dataset. The number of conversations determines the number of entries in both the custom random_pool and synthetic datasets; entries are reused until benchmarking is complete.
The total number of unique dataset entries to generate for the dataset. Each entry represents a single turn used in a request.
Default: 100
The mean number of turns within a conversation.
Default: 1
The standard deviation of the number of turns within a conversation.
Default: 0
The mean delay between turns within a conversation in milliseconds.
Default: 0.0
The standard deviation of the delay between turns within a conversation in milliseconds.
Default: 0.0
A ratio to scale multi-turn delays.
Default: 1.0
The directory to store all the (output) artifacts generated by AIPerf.
Default: artifacts
The prefix for the profile export file names. It will be suffixed with .csv, .json, .jsonl, and _raw.jsonl. If not provided, the default profile export file names are used: profile_export_aiperf.csv, profile_export_aiperf.json, profile_export.jsonl, and profile_export_raw.jsonl.
The level of profile export files to create.
Choices: [summary, records, raw]
Default: records
The duration (in seconds) of an individual time slice to be used post-benchmark in time-slicing mode.
The HuggingFace tokenizer to use to interpret token metrics from prompts and responses. The value can be the name of a tokenizer or the filepath of the tokenizer. The default value is the model name.
The specific model version to use. It can be a branch name, tag name, or commit ID.
Default: main
Allows a custom tokenizer to be downloaded and executed. This carries security risks and should only be used for repositories you trust. It is only necessary for custom tokenizers stored in the HuggingFace Hub.
The duration in seconds for benchmarking.
The grace period in seconds to wait for responses after benchmark duration ends. Only applies when --benchmark-duration is set. Responses received within this period are included in metrics.
Default: 30.0
The concurrency value to benchmark.
Sets the request rate for the load generated by AIPerf. Unit: requests/second.
Sets the request rate mode for the load generated by AIPerf. constant: generate requests at a fixed rate. poisson: generate requests using a Poisson distribution.
Choices: [constant, poisson]
Default: poisson
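The two modes differ in the spacing between requests: constant uses a fixed gap of 1/rate, while poisson draws exponentially distributed gaps with mean 1/rate, so the average rate matches but arrivals are bursty. A minimal sketch (not AIPerf's scheduler):

```python
# Illustrative inter-arrival gap generation for the two request-rate modes.
import random

def inter_arrival_times(rate, n, mode="poisson", seed=0):
    """Return n gaps (in seconds) between consecutive requests."""
    rng = random.Random(seed)
    if mode == "constant":
        return [1.0 / rate] * n  # fixed gap between requests
    if mode == "poisson":
        # Exponential gaps with mean 1/rate model a Poisson arrival process.
        return [rng.expovariate(rate) for _ in range(n)]
    raise ValueError(mode)

constant = inter_arrival_times(10.0, 1000, mode="constant")
poisson = inter_arrival_times(10.0, 1000, mode="poisson")
print(constant[0])                         # 0.1 (every gap is identical)
print(sum(poisson) / len(poisson))         # close to 0.1, but gaps vary
```

Poisson mode better approximates real traffic, where independent clients produce clustered and sparse intervals rather than perfectly even spacing.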
The number of requests to use for measurement.
Default: 10
The number of warmup requests to send before benchmarking.
Default: 0
The percentage of requests to cancel.
Default: 0.0
The delay in seconds before cancelling requests. This is used when --request-cancellation-rate is greater than 0.
Default: 0.0
Enable GPU telemetry console display and optionally specify: (1) 'dashboard' for real-time dashboard mode, (2) custom DCGM exporter URLs (e.g., http://node1:9401/metrics), (3) a custom metrics CSV file (e.g., custom_gpu_metrics.csv). Default endpoints localhost:9400 and localhost:9401 are always attempted. Example: --gpu-telemetry dashboard node1:9400 custom.csv.
Disable GPU telemetry collection entirely.
Server metrics collection (ENABLED BY DEFAULT). Automatically collects from inference endpoint base_url + /metrics. Optionally specify additional custom Prometheus-compatible endpoint URLs (e.g., http://node1:8081/metrics, http://node2:9090/metrics). Use --no-server-metrics to disable collection. Example: --server-metrics node1:8081 node2:9090/metrics for additional endpoints.
Disable server metrics collection entirely.
Specify which output formats to generate for server metrics. Options: json, csv, jsonl, and parquet. Default is json and csv (jsonl excluded due to large file size, parquet is opt-in only). Example: --server-metrics-formats json csv parquet.
Default: [json, csv]
Host address for TCP connections.
Default: 127.0.0.1
Path for IPC sockets.
Maximum number of workers to create. If not specified, the number of workers will be determined by the formula min(concurrency, (num CPUs * 0.75) - 1), with a default max cap of 32. Any value provided will still be capped by the concurrency value (if specified), but not by the max cap.
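The documented default formula can be sketched as follows (illustrative only, not AIPerf's code; rounding of the fractional CPU count is an assumption here):

```python
# Illustrative sketch of the default worker-count formula:
# min(concurrency, num_cpus * 0.75 - 1), with a default max cap of 32.
def default_max_workers(num_cpus, concurrency=None, max_cap=32):
    workers = int(num_cpus * 0.75) - 1  # leave headroom for other processes
    if concurrency is not None:
        workers = min(concurrency, workers)  # never exceed the concurrency value
    return max(1, min(workers, max_cap))

print(default_max_workers(8))                   # 5  (8 * 0.75 - 1)
print(default_max_workers(64))                  # 32 (capped by the default max)
print(default_max_workers(64, concurrency=10))  # 10 (limited by concurrency)
```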
Logging level.
Choices: [TRACE, DEBUG, INFO, NOTICE, WARNING, SUCCESS, ERROR, CRITICAL]
Default: INFO
Equivalent to --log-level DEBUG. Enables more verbose logging output, but lacks some raw message logging.
Equivalent to --log-level TRACE. Enables the most verbose logging output possible.
Number of services to spawn for processing records. The higher the request rate, the more services should be spawned in order to keep up with the incoming records. If not specified, the number of services will be automatically determined based on the worker count.
Type of UI to use.
Choices: [none, simple, dashboard]
Default: dashboard
Paths to profiling run directories. Defaults to ./artifacts if not specified.
Directory to save generated plots. Defaults to <first_path>/plots if not specified.
Plot theme to use: 'light' (white background) or 'dark' (dark background). Defaults to 'light'.
Default: light
Path to custom plot configuration YAML file. If not specified, auto-creates and uses ~/.aiperf/plot_config.yaml.
Show detailed error tracebacks in console (errors are always logged to ~/.aiperf/plot.log).
Launch interactive dashboard server instead of generating static PNGs.
Port for dashboard server (only used with --dashboard). Defaults to 8050.
Default: 8050