Goodput in Inference Perf

Goodput is a measure of successful requests that meet specific Quality of Service (QoS) constraints, per unit of time. Unlike raw throughput, which counts all completed requests, goodput only counts requests that are "good" according to your criteria.

Why is Goodput Important?

In production serving of LLMs, it's not enough to just process requests quickly. Requests must also be processed with acceptable latency to ensure a good user experience. For example, if a user has to wait 10 seconds for the first token, or if the generation speed is too slow, the service may be unusable even if the system is completing many requests per second.

Goodput helps you understand the true capacity of your system while maintaining SLA/SLO targets.

How We Measure It

In inference-perf, a request is considered "good" if it completes successfully AND meets all specified constraints.

  • Request Goodput: Number of good requests / Total benchmark time.
  • Token Goodput: Total tokens (input + output) across good requests / Total benchmark time.
  • Goodput %: Percentage of total successful requests that met the constraints.
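The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not inference-perf's actual implementation: the `RequestResult` class, its field names, and the `goodput_summary` helper are hypothetical names introduced here, and the sample requests are made up.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    # Hypothetical record; not inference-perf's internal data model.
    success: bool        # request completed without error
    is_good: bool        # request met every applicable constraint
    input_tokens: int
    output_tokens: int


def goodput_summary(results, duration_s):
    """Compute the three goodput metrics over a benchmark of duration_s seconds."""
    successful = [r for r in results if r.success]
    # A request only counts as good if it also completed successfully.
    good = [r for r in successful if r.is_good]
    good_tokens = sum(r.input_tokens + r.output_tokens for r in good)
    return {
        "request_goodput": len(good) / duration_s,
        "token_goodput": good_tokens / duration_s,
        # Percentage is taken over successful requests, as defined above.
        "goodput_percent": 100.0 * len(good) / len(successful) if successful else 0.0,
    }


# Made-up sample: 3 successful requests, 2 of them good, over 10 seconds.
results = [
    RequestResult(success=True, is_good=True, input_tokens=100, output_tokens=50),
    RequestResult(success=True, is_good=False, input_tokens=100, output_tokens=50),
    RequestResult(success=False, is_good=False, input_tokens=100, output_tokens=0),
    RequestResult(success=True, is_good=True, input_tokens=200, output_tokens=100),
]
summary = goodput_summary(results, duration_s=10.0)
print(summary)
```

Note that the failed request affects neither the numerator nor the denominator of Goodput %, since that metric is defined over successful requests only.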

Specifying Goodput Constraints

You can specify goodput constraints in two ways:

1. Globally via Configuration File

You can set global constraints in your configuration file under the report.goodput.constraints section. Supported constraints include:

  • ttft: Time to first token (in seconds).
  • tpot: Time per output token (in seconds).
  • itl: Inter-token latency (in seconds).
  • ntpot: Normalized time per output token (in seconds).
  • request_latency: End-to-end request latency (in seconds).

Example configuration:

report:
  goodput:
    constraints:
      ttft: 0.2
      tpot: 0.02

With this configuration, a request is considered "good" if its time to first token is at most 0.2 seconds AND its time per output token is at most 0.02 seconds. inference-perf then reports goodput metrics in the CLI summary and in the generated JSON reports. You can use these metrics to find the optimal operating point that meets your SLOs, or to compare how different model server configurations perform under high load.

2. Per-Request via Headers

You can also specify SLOs on a per-request basis using HTTP headers if your model server supports them or if you want to simulate dynamic SLOs. To use this, you need to configure the header names in the api section:

api:
  slo_unit: "ms" # Options: s, ms, us. Defaults to ms.
  slo_ttft_header: "x-slo-ttft-ms"
  slo_tpot_header: "x-slo-tpot-ms"
  headers:
    x-slo-ttft-ms: "200"
    x-slo-tpot-ms: "20"

If these headers are present in the request (or set globally in api.headers), inference-perf will extract them and use them as the constraints for that specific request.
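To make the extraction step concrete, here is a rough sketch of how header values might be turned into per-request constraints in seconds. The `slo_from_headers` function and the `header_to_metric` mapping are hypothetical names for illustration only; the real extraction logic in inference-perf may differ. The sketch assumes header values are plain numbers expressed in `slo_unit`, and it matches header names case-insensitively because HTTP header names are case-insensitive.

```python
# Divisors to convert a value in the configured slo_unit into seconds.
UNIT_DIVISOR = {"s": 1, "ms": 1_000, "us": 1_000_000}


def slo_from_headers(headers, header_to_metric, slo_unit="ms"):
    """Return {metric: limit_in_seconds} for every SLO header that is present.

    header_to_metric mirrors the slo_*_header settings, e.g.
    {"x-slo-ttft-ms": "ttft"} (hypothetical shape, for illustration).
    """
    divisor = UNIT_DIVISOR[slo_unit]
    lowered = {name.lower(): value for name, value in headers.items()}
    slos = {}
    for header_name, metric in header_to_metric.items():
        raw = lowered.get(header_name.lower())
        if raw is not None:
            slos[metric] = float(raw) / divisor
    return slos


per_request = slo_from_headers(
    headers={"x-slo-ttft-ms": "200", "x-slo-tpot-ms": "20"},
    header_to_metric={"x-slo-ttft-ms": "ttft", "x-slo-tpot-ms": "tpot"},
    slo_unit="ms",
)
print(per_request)
```

With the example headers from the configuration above, this yields a 0.2 second limit for ttft and a 0.02 second limit for tpot, matching the units used by the global constraints.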

How They Interact

  • If a request has a per-request SLO specified via headers, it will override the corresponding global constraint from the goodput config for that specific request.
  • If no per-request SLO is specified for a metric, it falls back to the global constraint defined in report.goodput.constraints.
  • If a constraint is defined in neither place, it is not checked.

A request is considered "good" if it completes successfully AND meets ALL applicable constraints (either per-request overrides or global defaults).
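The override and fallback rules can be expressed as a dictionary merge followed by a check over every resulting constraint. The sketch below is illustrative only: `effective_constraints` and `is_good` are hypothetical helper names, all values are assumed to be in seconds, and treating a constrained metric with no measurement as a violation is an assumption of this sketch rather than documented behavior.

```python
def effective_constraints(global_constraints, per_request_slos):
    """Per-request SLOs override global constraints; the rest fall back to global."""
    return {**global_constraints, **per_request_slos}


def is_good(success, measured, constraints):
    """A request is good if it succeeded AND meets every applicable constraint."""
    if not success:
        return False
    # Metrics with no constraint in either place are simply never checked.
    # Assumption: a constrained metric with no measurement counts as a violation.
    return all(
        metric in measured and measured[metric] <= limit
        for metric, limit in constraints.items()
    )


global_constraints = {"ttft": 0.2, "tpot": 0.02}
measured = {"ttft": 0.3, "tpot": 0.015, "request_latency": 4.0}

# Global constraints only: ttft of 0.3 s exceeds the 0.2 s limit.
print(is_good(True, measured, effective_constraints(global_constraints, {})))

# A per-request ttft SLO of 0.5 s overrides the global 0.2 s; tpot falls back.
relaxed = effective_constraints(global_constraints, {"ttft": 0.5})
print(is_good(True, measured, relaxed))
```

In this example request_latency is measured but never checked, because no constraint for it is defined in either place.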

Analyzing Goodput

When you run a benchmark with goodput constraints configured, inference-perf will report goodput metrics in the CLI summary and in the generated JSON reports.

You can also analyze goodput across different load levels using the charts generated by the --analyze option. If you run a multi-stage benchmark with varying request rates, the analyze tool can plot goodput against QPS to help you find the optimal operating point for your system.

inference-perf --analyze <path-to-reports-dir>
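If you prefer to pick an operating point programmatically rather than reading it off a chart, one possible approach is to take the stage with the highest request goodput among those whose Goodput % stays above a target. The sketch below does not read inference-perf's report files: the per-stage dictionaries, their keys, the `best_operating_point` helper, and the 95% threshold are all hypothetical, and the numbers are made up for illustration.

```python
def best_operating_point(stages, min_goodput_percent=95.0):
    """Return the stage with the highest request goodput that still meets the target.

    Each stage is a dict with hypothetical keys: "qps", "request_goodput",
    and "goodput_percent". Returns None if no stage meets the target.
    """
    acceptable = [s for s in stages if s["goodput_percent"] >= min_goodput_percent]
    if not acceptable:
        return None
    return max(acceptable, key=lambda s: s["request_goodput"])


# Made-up stages from a multi-stage run with increasing request rates.
stages = [
    {"qps": 5, "request_goodput": 5.0, "goodput_percent": 100.0},
    {"qps": 10, "request_goodput": 9.8, "goodput_percent": 98.0},
    {"qps": 20, "request_goodput": 14.0, "goodput_percent": 70.0},
]
best = best_operating_point(stages)
print(best)
```

Here the 20 QPS stage has the highest raw goodput but falls below the target percentage, so the 10 QPS stage is selected.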