<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
---
# Brainstorm

## Persona

A model developer who wants to evaluate their model implementation on a dataset, or a model "trainer" who wants to run inference for their GRPO policy.

Prerequisites to understand the docs:
- knows what the KV Cache is
- familiarity with transformers and inference

## What we want to include in the doc

- CB usage examples
- CB API reference
- light refresher on what CB is + links to blog post
- installation / setup instructions
- OpenTelemetry support
- subsection in Transformers > Inference
- supported & unsupported features
- performance considerations
  - note on benchmarks (CI + space)
  - CUDA graphs
  - compile
  - attention implementation
- explicit intended use cases, the why of CB in transformers
- integration with serving

---
# Continuous Batching

Continuous Batching (CB) is an advanced technique that optimizes the inference of transformer models by dynamically grouping multiple requests into batches: as soon as one sequence in the batch finishes generating, a new request takes its slot. This maximizes GPU utilization and throughput, specifically for workloads with many variable-length inputs.

We are particularly interested in having Continuous Batching in transformers for the following use cases:
- Evaluation of models on large datasets with variable-length inputs
- Generating outputs for multiple sequences for GRPO policies

CB is what makes inference engines like vLLM or SGLang efficient. That said, transformers does not aim to be a production-ready inference engine, but a complete framework for model development. For this reason, CB is also available in `transformers serve`.

If you are not familiar with some of the core concepts CB is built upon, we invite you to read the associated blog post: [Continuous Batching: Efficient Inference for Large Language Models](https://huggingface.co/blog/continuous-batching). _broken link for now_

## Installation

Nothing to do, it comes built-in with `transformers`!

## API Reference

## Usage Examples

The main way to use CB in transformers is via the `generate_batch` method.

Unlike `generate`, CB takes already tokenized inputs, known as input IDs. Each sequence of input IDs is represented as a list of integers, in Python: `list[int]`. Since the inputs are already tokenized, you are responsible for tokenizing the prompts and decoding the generated tokens yourself.

For a more detailed example, please refer to: [examples/continuous_batching](./path/to/example)

### `generate_batch` example
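
`generate_batch` is the blocking API: you pass all requests up front and get every result back at once. Below is a minimal sketch of what a call could look like. The checkpoint name, the `sdpa_paged` attention implementation string, and the shape of the returned results (a mapping from request ID to an output object carrying `generated_tokens`) are assumptions based on the implementation at the time of writing, not a definitive reference.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

checkpoint = "Qwen/Qwen2.5-0.5B-Instruct"  # any causal LM checkpoint should work
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa_paged",  # assumption: CB relies on a paged attention backend
)

prompts = [
    "What is the capital of France?",
    "Explain continuous batching in one sentence.",
]
# CB takes already tokenized inputs: one list[int] per sequence, no padding needed
inputs = tokenizer(prompts)["input_ids"]

generation_config = GenerationConfig(max_new_tokens=32, do_sample=False)

# Blocks until every request has finished generating
batch_outputs = model.generate_batch(inputs=inputs, generation_config=generation_config)
for request_id, output in batch_outputs.items():
    print(request_id, tokenizer.decode(output.generated_tokens, skip_special_tokens=True))
```

Note that the inputs are not padded to a common length: CB schedules each sequence independently, which is what makes it efficient on variable-length workloads.
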
### `ContinuousBatchingManager` example
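
`generate_batch` waits for all requests to complete. When requests arrive over time, or when you want results as soon as they are ready (e.g. inside a GRPO training loop), the `ContinuousBatchingManager` offers an asynchronous interface: a background thread runs the batching loop while you submit requests and collect results. The method names below (`init_continuous_batching`, `add_request`, `get_result`, `stop`) are assumptions based on the implementation at the time of writing; check the API reference above for the authoritative signatures.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

checkpoint = "Qwen/Qwen2.5-0.5B-Instruct"  # any causal LM checkpoint should work
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, dtype=torch.bfloat16, device_map="auto")

# Assumption: the manager is obtained from the model and runs generation in a background thread
manager = model.init_continuous_batching(generation_config=GenerationConfig(max_new_tokens=32))
manager.start()

prompts = ["What is the KV cache?", "Why does continuous batching help throughput?"]
# Requests can be added at any time, even while generation is already running
request_ids = [manager.add_request(tokenizer(p)["input_ids"]) for p in prompts]

# Results come back as each request completes, not necessarily in submission order
for _ in request_ids:
    result = manager.get_result()
    print(result.request_id, tokenizer.decode(result.generated_tokens, skip_special_tokens=True))

manager.stop()
```

The manager pattern decouples request submission from result consumption, which is exactly what a training loop that generates rollouts on the fly needs.
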
## Supported & Unsupported Features

### Supported Features

### Unsupported Features

## Performance Considerations

## Integration with Serving