
Profile Hugging Face TGI Models with AIPerf

AIPerf can benchmark Large Language Models (LLMs) served through the Hugging Face Text Generation Inference (TGI) generate API. TGI exposes two standard HTTP endpoints for text generation:

| Endpoint | Description | AIPerf Flag |
|---|---|---|
| /generate | Returns the full text completion in one response (non-streaming). | (default) |
| /generate_stream | Streams generated tokens as they are produced (SSE). | --streaming |

Start a Hugging Face TGI Server

To launch a Hugging Face TGI server, use the official ghcr.io image:

# Run the official TGI container, exposing the server on host port 8080
docker run --gpus all --rm -it \
  -p 8080:80 \
  -e MODEL_ID=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  ghcr.io/huggingface/text-generation-inference:latest

# Verify the server is running
curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs":"Hello world"}' | jq

Profile with AIPerf

You can benchmark TGI models in either non-streaming or streaming mode, and with either synthetic inputs or a custom input file.

Non-Streaming (/generate)

Profile with synthetic inputs

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --request-count 10

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     Using Hugging Face TGI /generate endpoint (non-streaming)
INFO     AIPerf System is PROFILING

Profiling: 10/10 |████████████████████████| 100% [00:08<00:00]

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃                      Metric ┃     avg ┃    min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 1234.56 │ 987.34 │ 1567.89 │ 1567.89 │ 1198.45 │
│ Output Token Count (tokens) │  256.00 │ 200.00 │  300.00 │  300.00 │  254.00 │
│  Request Throughput (req/s) │    2.34 │      - │       - │       - │       - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/profile_export_aiperf.json
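
The JSON export holds the raw data behind the summary table. Its schema is not reproduced in this guide, so a schema-agnostic way to start exploring it is to list its top-level keys with jq:

# Inspect the export without assuming its schema
jq 'keys' artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/profile_export_aiperf.json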

Profile with a custom input file

You can also provide your own prompts using the --input-file option. The file must be in JSONL format, with one JSON object containing a "text" field per line.

cat > inputs.jsonl <<'EOF'
{"text": "Hello TinyLlama!"}
{"text": "Tell me a joke."}
EOF
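
For larger runs, you can generate the file programmatically instead of typing it by hand; a minimal sketch that emits the same one-object-per-line {"text": ...} schema (the prompt wording is arbitrary):

# Emit 100 prompts in the same JSONL schema used above
for i in $(seq 1 100); do
  echo "{\"text\": \"Write a one-sentence fun fact about the number $i.\"}"
done > inputs.jsonl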

Then run:

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --input-file ./inputs.jsonl \
    --custom-dataset-type single_turn \
    --request-count 10

Streaming (/generate_stream)

When the --streaming flag is enabled, AIPerf automatically sends requests to the /generate_stream endpoint of the TGI server.
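You can watch the stream AIPerf consumes by calling the endpoint directly. TGI replies with Server-Sent Events, emitting a data: line for each generated token:

# -N disables output buffering so tokens appear as they arrive
curl -s -N http://localhost:8080/generate_stream \
  -H "Content-Type: application/json" \
  -d '{"inputs":"Hello world"}'
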

Profile with synthetic inputs

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --streaming \
    --request-count 10

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     Using Hugging Face TGI /generate_stream endpoint (streaming)
INFO     AIPerf System is PROFILING

Profiling: 10/10 |████████████████████████| 100% [00:09<00:00]

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃                      Metric ┃     avg ┃    min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 1189.45 │ 945.67 │ 1498.34 │ 1498.34 │ 1156.78 │
│    Time to First Token (ms) │  234.56 │ 189.34 │  298.45 │  298.45 │  228.90 │
│    Inter Token Latency (ms) │   14.23 │  11.45 │   18.90 │   18.90 │   13.89 │
│ Output Token Count (tokens) │  256.00 │ 200.00 │  300.00 │  300.00 │  254.00 │
│  Request Throughput (req/s) │    2.56 │      - │       - │       - │       - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/profile_export_aiperf.json
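
The artifact directory names above end in concurrency1 because these examples keep a single request in flight at a time. To apply more load, raise the concurrency; a sketch assuming AIPerf's --concurrency option:

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --streaming \
    --concurrency 4 \
    --request-count 40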

Profile with a custom input file

Create your own prompt file in JSONL format:

cat > inputs.jsonl <<'EOF'
{"text": "Explain quantum computing in simple terms."}
{"text": "Write a haiku about rain."}
{"text": "Summarize the causes of the French Revolution."}
EOF

Then run:

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --input-file ./inputs.jsonl \
    --custom-dataset-type single_turn \
    --streaming \
    --request-count 10
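
To compare streaming and non-streaming latency on identical prompts, the two invocations can be run back-to-back; a minimal sketch that reuses only the flags shown above:

# Run the same benchmark in non-streaming mode, then streaming mode
for mode in "" "--streaming"; do
  aiperf profile \
      -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
      --endpoint-type huggingface_generate \
      --url localhost:8080 \
      --input-file ./inputs.jsonl \
      --custom-dataset-type single_turn \
      $mode \
      --request-count 10
done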