
Profile Hugging Face TGI Models with AIPerf

AIPerf can benchmark Large Language Models (LLMs) served through the Hugging Face Text Generation Inference (TGI) generate API. TGI exposes two standard HTTP endpoints for text generation:

| Endpoint | Description | AIPerf Flag |
|---|---|---|
| /generate | Returns the full text completion in one response (non-streaming). | (default) |
| /generate_stream | Streams generated tokens as they are produced (SSE). | --streaming |

Start a Hugging Face TGI Server

To launch a Hugging Face TGI server, use the official ghcr.io image:

# Run the official TGI container, exposing the server on host port 8080
docker run --gpus all --rm -it \
  -p 8080:80 \
  -e MODEL_ID=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  ghcr.io/huggingface/text-generation-inference:latest

# Verify the server is running
curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs":"Hello world"}' | jq

Profile with AIPerf

You can benchmark TGI models in either non-streaming or streaming mode, and with either synthetic inputs or a custom input file.

Non-Streaming (/generate)

Profile with synthetic inputs

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --request-count 10

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     Using Hugging Face TGI /generate endpoint (non-streaming)
INFO     AIPerf System is PROFILING

Profiling: 10/10 |████████████████████████| 100% [00:08<00:00]

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃                      Metric ┃     avg ┃    min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 1234.56 │ 987.34 │ 1567.89 │ 1567.89 │ 1198.45 │
│ Output Token Count (tokens) │  256.00 │ 200.00 │  300.00 │  300.00 │  254.00 │
│  Request Throughput (req/s) │    2.34 │      - │       - │       - │       - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/profile_export_aiperf.json
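
The JSON export holds the raw data behind the summary table. Its schema is not reproduced in this guide, so a schema-agnostic way to start exploring it is to list its top-level keys with jq:

# Inspect the export without assuming its schema
jq 'keys' artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/profile_export_aiperf.json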

Profile with a custom input file

You can also provide your own prompts using the --input-file option. The file must be in JSONL format, with one JSON object containing a "text" field per line.

cat > inputs.jsonl <<'EOF'
{"text": "Hello TinyLlama!"}
{"text": "Tell me a joke."}
EOF
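
For larger runs, you can generate the file programmatically instead of typing it by hand; a minimal sketch that emits the same one-object-per-line {"text": ...} schema (the prompt wording is arbitrary):

# Emit 100 prompts in the same JSONL schema used above
for i in $(seq 1 100); do
  echo "{\"text\": \"Write a one-sentence fun fact about the number $i.\"}"
done > inputs.jsonl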

Then run:

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --input-file ./inputs.jsonl \
    --custom-dataset-type single_turn \
    --request-count 10

Streaming (/generate_stream)

When the --streaming flag is enabled, AIPerf automatically sends requests to the /generate_stream endpoint of the TGI server.
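You can watch the stream AIPerf consumes by calling the endpoint directly. TGI replies with Server-Sent Events, emitting a data: line for each generated token:

# -N disables output buffering so tokens appear as they arrive
curl -s -N http://localhost:8080/generate_stream \
  -H "Content-Type: application/json" \
  -d '{"inputs":"Hello world"}'
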

Profile with synthetic inputs

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --streaming \
    --request-count 10

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     Using Hugging Face TGI /generate_stream endpoint (streaming)
INFO     AIPerf System is PROFILING

Profiling: 10/10 |████████████████████████| 100% [00:09<00:00]

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃                      Metric ┃     avg ┃    min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 1189.45 │ 945.67 │ 1498.34 │ 1498.34 │ 1156.78 │
│    Time to First Token (ms) │  234.56 │ 189.34 │  298.45 │  298.45 │  228.90 │
│    Inter Token Latency (ms) │   14.23 │  11.45 │   18.90 │   18.90 │   13.89 │
│ Output Token Count (tokens) │  256.00 │ 200.00 │  300.00 │  300.00 │  254.00 │
│  Request Throughput (req/s) │    2.56 │      - │       - │       - │       - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/profile_export_aiperf.json
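
The artifact directory names above end in concurrency1 because these examples keep a single request in flight at a time. To apply more load, raise the concurrency; a sketch assuming AIPerf's --concurrency option:

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --streaming \
    --concurrency 4 \
    --request-count 40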

Profile with a custom input file

Create your own prompt file in JSONL format:

cat > inputs.jsonl <<'EOF'
{"text": "Explain quantum computing in simple terms."}
{"text": "Write a haiku about rain."}
{"text": "Summarize the causes of the French Revolution."}
EOF

Then run:

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --input-file ./inputs.jsonl \
    --custom-dataset-type single_turn \
    --streaming \
    --request-count 10
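
To compare streaming and non-streaming latency on identical prompts, the two invocations can be run back-to-back; a minimal sketch that reuses only the flags shown above:

# Run the same benchmark in non-streaming mode, then streaming mode
for mode in "" "--streaming"; do
  aiperf profile \
      -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
      --endpoint-type huggingface_generate \
      --url localhost:8080 \
      --input-file ./inputs.jsonl \
      --custom-dataset-type single_turn \
      $mode \
      --request-count 10
done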