
Commit 66a350b

Gasoonjia authored and facebook-github-bot committed

add dynamic export into llm manual (#3202)

Summary:
Pull Request resolved: #3202

This diff adds dynamic export to the LLM manual, including code and related comments. It also updates other documentation for better understanding.

Reviewed By: dbort

Differential Revision: D56365041

fbshipit-source-id: 5ce4c15206a2923c4d54811cefca03f72869c719
1 parent 2dac5f3 commit 66a350b

File tree: 1 file changed (+89 −17 lines)

docs/source/llm/getting-started.md

Lines changed: 89 additions & 17 deletions
@@ -1,5 +1,18 @@
 # Getting Started with LLMs via ExecuTorch
 
+Welcome to the LLM Manual! This manual is designed to provide a practical example of how to leverage
+ExecuTorch in onboarding your own Large Language Models (LLMs). Our primary goal is to offer
+a clear and concise guideline on how to integrate our system with your own LLMs.
+
+Please note that this project is intended as a demonstration and not as a fully functional
+example with optimal performance. As such, certain components such as the sampler, tokenizer,
+and others are provided in their bare minimum versions solely for demonstration purposes.
+Consequently, the results produced by the model may vary and might not always be optimal.
+
+We encourage users to use this project as a starting point and adapt it to their specific needs,
+which includes creating your own versions of the tokenizer, sampler, acceleration backends, and
+other components. We hope this project serves as a useful guide in your journey with LLMs and ExecuTorch.
+
 ### Table Of Contents
 
 
@@ -141,13 +154,24 @@ model = GPT.from_pretrained('gpt2')
 
 # Create example inputs. This is used in the export process to provide
 # hints on the expected shape of the model input.
-example_inputs = (torch.randint(0, 100, (1, 8), dtype=torch.long), )
+example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long), )
+
+# Set up dynamic shape configuration. This allows the sizes of the input tensors
+# to differ from the sizes of the tensors in `example_inputs` during runtime, as
+# long as they adhere to the rules specified in the dynamic shape configuration.
+# Here we set the range of the 0th model input's 1st dimension as
+# [0, model.config.block_size].
+# See https://pytorch.org/executorch/main/concepts.html#dynamic-shapes
+# for details about creating dynamic shapes.
+dynamic_shape = (
+    {1: torch.export.Dim("token_dim", max=model.config.block_size)},
+)
 
 # Trace the model, converting it to a portable intermediate representation.
 # The torch.no_grad() call tells PyTorch to exclude training-specific logic.
 with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
-    m = capture_pre_autograd_graph(model, example_inputs)
-    traced_model = export(m, example_inputs)
+    m = capture_pre_autograd_graph(model, example_inputs, dynamic_shapes=dynamic_shape)
+    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)
 
 # Convert the model into a runnable ExecuTorch program.
 edge_config = EdgeCompileConfig(_check_ir_validity=False)
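
For reference, a minimal sketch (not part of this commit) of how the dynamically shaped export above can be sanity-checked. It assumes `model` and `traced_model` from the snippet above, and uses the standard `ExportedProgram.module()` call to re-run the traced graph at a few different sequence lengths:

```python
import torch

# Hypothetical sanity check: the exported graph should accept any sequence
# length allowed by the "token_dim" dynamic dimension, not just the example
# input's length. `model` and `traced_model` come from the snippet above.
for seq_len in (2, 16, 128):
    tokens = torch.randint(0, 100, (1, seq_len), dtype=torch.long)
    traced_model.module()(tokens)  # re-run the exported graph with a new shape
    print(f"sequence length {seq_len}: ok")
```
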
@@ -204,11 +228,15 @@ output token by token. Each generated token is passed as input for the next run.
 ```cpp
 // main.cpp
 
+// The value of the gpt2 `<|endoftext|>` token.
+#define ENDOFTEXT_TOKEN 50256
+
 std::string generate(
     Module& llm_model,
     std::string& prompt,
     BasicTokenizer& tokenizer,
     BasicSampler& sampler,
+    size_t max_input_length,
     size_t max_output_length) {
 
   // Convert the input text into a list of integers (tokens) that represents
@@ -237,14 +265,23 @@ std::string generate(
 
     // Sample the next token from the logits.
     int64_t next_token = sampler.sample(logits);
+
+    // Break if we reached the end of the text.
+    if (next_token == ENDOFTEXT_TOKEN) {
+      break;
+    }
+
+    // Add the next token to the output.
     output_tokens.push_back(next_token);
 
     std::cout << tokenizer.decode({ next_token });
     std::cout.flush();
 
     // Update next input.
-    input_tokens.erase(input_tokens.begin());
     input_tokens.push_back(next_token);
+    if (input_tokens.size() > max_input_length) {
+      input_tokens.erase(input_tokens.begin());
+    }
   }
 
   std::cout << std::endl;
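
To make the added control flow easier to follow, here is a minimal Python sketch (not part of the diff) of the same loop shape: stop once GPT-2's `<|endoftext|>` token appears, and keep the input window at or below `max_input_length` by dropping the oldest token. `run_model` and `sample` are hypothetical stand-ins for the ExecuTorch `Module` forward call and `BasicSampler::sample`:

```python
ENDOFTEXT_TOKEN = 50256  # GPT-2 `<|endoftext|>`

def generate_tokens(input_tokens, run_model, sample, max_input_length, max_output_length):
    """Sliding-window generation loop mirroring the C++ changes above."""
    output_tokens = []
    for _ in range(max_output_length):
        logits = run_model(input_tokens)          # forward pass over the current window
        next_token = sample(logits)               # pick the next token from the logits
        if next_token == ENDOFTEXT_TOKEN:         # break if we reached the end of the text
            break
        output_tokens.append(next_token)
        input_tokens.append(next_token)           # update next input
        if len(input_tokens) > max_input_length:  # keep the window within the exported range
            input_tokens.pop(0)
    return output_tokens

# Trivial stand-ins, only to exercise the control flow; a real runner calls the model.
print(generate_tokens([1, 2, 3], run_model=lambda t: list(t), sample=lambda lg: lg[-1] + 1,
                      max_input_length=4, max_output_length=5))
```
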
@@ -278,7 +315,9 @@ penalties for repeated tokens, and biases to prioritize or de-prioritize specifi
 
 int main() {
   // Set up the prompt. This provides the seed text for the model to elaborate.
-  std::string prompt = "Once upon a time, there was a";
+  std::cout << "Enter model prompt: ";
+  std::string prompt;
+  std::getline(std::cin, prompt);
 
   // The tokenizer is used to convert between tokens (used by the model) and
   // human-readable strings.
@@ -290,19 +329,19 @@ int main() {
   // Load the exported nanoGPT program, which was generated via the previous steps.
   Module model("nanogpt.pte", torch::executor::Module::MlockConfig::UseMlockIgnoreErrors);
 
+  const auto max_input_tokens = 1024;
   const auto max_output_tokens = 30;
   std::cout << prompt;
-  generate(model, prompt, tokenizer, sampler, max_output_tokens);
+  generate(model, prompt, tokenizer, sampler, max_input_tokens, max_output_tokens);
 }
 ```
 
 Finally, download the following files into the same directory as main.h:
 
-TODO: This is a placeholder.
 ```
-curl -O https://raw.githubusercontent.com/GregoryComer/et-tutorials/quantization/nanogpt/managed_tensor.h
-curl -O https://raw.githubusercontent.com/GregoryComer/et-tutorials/quantization/nanogpt/basic_tokenizer.h
-curl -O https://raw.githubusercontent.com/GregoryComer/et-tutorials/quantization/nanogpt/basic_sampler.h
+curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_sampler.h
+curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_tokenizer.h
+curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/managed_tensor.h
 ```
 
 To learn more, see [Running an ExecuTorch Model in C++](https://pytorch.org/executorch/main/running-a-model-cpp-tutorial.html)
@@ -363,10 +402,20 @@ cmake --build cmake-out -j10
 ./cmake-out/nanogpt_runner
 ```
 
-You should see something like the following:
+You should see the message:
+
+```
+Enter model prompt:
+```
+
+Type some seed text for the model and press enter. Here we use "Hello world!" as
+an example prompt:
 
 ```
-Once upon a time, there was a man who was a member of the military...
+Enter model prompt: Hello world!
+Hello world!
+
+I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in
 ```
 
 At this point, it is likely to run very slowly. This is because ExecuTorch hasn't been told to optimize for
@@ -423,14 +472,25 @@ model = GPT.from_pretrained('gpt2')
 # Create example inputs. This is used in the export process to provide
 # hints on the expected shape of the model input.
 example_inputs = (
-    torch.randint(0, 100, (1, 8), dtype=torch.long),
+    torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long),
 )
 
+# Set up dynamic shape configuration. This allows the sizes of the input tensors
+# to differ from the sizes of the tensors in `example_inputs` during runtime, as
+# long as they adhere to the rules specified in the dynamic shape configuration.
+# Here we set the range of the 0th model input's 1st dimension as
+# [0, model.config.block_size - 1].
+# See https://pytorch.org/executorch/main/concepts.html#dynamic-shapes
+# for details about creating dynamic shapes.
+dynamic_shape = (
+    {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)},
+)
+
 # Trace the model, converting it to a portable intermediate representation.
 # The torch.no_grad() call tells PyTorch to exclude training-specific logic.
 with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
-    m = capture_pre_autograd_graph(model, example_inputs)
-    traced_model = export(m, example_inputs)
+    m = capture_pre_autograd_graph(model, example_inputs, dynamic_shapes=dynamic_shape)
+    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)
 
 # Convert the model into a runnable ExecuTorch program.
 # To be further lowered to Xnnpack backend, `traced_model` needs xnnpack-specific edge compile config
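
As a rough sketch (not taken from this diff) of the follow-on lowering step that the last comment refers to, the traced program can be converted to edge dialect, delegated to XNNPACK, and serialized. The import paths, partitioner name, and compile-config flag below are assumptions based on the ExecuTorch XNNPACK examples:

```python
# Sketch of the lowering step referenced above; names are assumptions, not
# lines from getting-started.md.
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import EdgeCompileConfig, to_edge

edge_config = EdgeCompileConfig(_check_ir_validity=False)
edge_manager = to_edge(traced_model, compile_config=edge_config)  # `traced_model` from above
edge_manager = edge_manager.to_backend(XnnpackPartitioner())      # delegate supported ops to XNNPACK
et_program = edge_manager.to_executorch()

with open("nanogpt.pte", "wb") as f:
    f.write(et_program.buffer)
```
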
@@ -512,12 +572,24 @@ cmake --build cmake-out -j10
 ./cmake-out/nanogpt_runner
 ```
 
-You should see something like the following:
+
+You should see the message:
+
+```
+Enter model prompt:
+```
+
+Type some seed text for the model and press enter. Here we use "Hello world!" as
+an example prompt:
 
 ```
-Once upon a time, there was a man who was a member of the military...
+Enter model prompt: Hello world!
+Hello world!
+
+I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in
 ```
 
+The delegated model should be noticeably faster compared to the non-delegated model.
 
 For more information regarding backend delegation, see the ExecuTorch guides
 for the
