# Getting Started with LLMs via ExecuTorch

+ Welcome to the LLM Manual! This manual provides a practical example of how to leverage
+ ExecuTorch when onboarding your own Large Language Models (LLMs). Our primary goal is to offer
+ a clear and concise guideline on how to integrate our system with your own LLMs.
+
+ Please note that this project is intended as a demonstration and not as a fully functional
+ example with optimal performance. As such, certain components such as the sampler, tokenizer,
+ and others are provided in their bare minimum versions solely for demonstration purposes.
+ Consequently, the results produced by the model may vary and might not always be optimal.
+
+ We encourage users to use this project as a starting point and adapt it to their specific needs,
+ which includes creating your own versions of the tokenizer, sampler, acceleration backends, and
+ other components. We hope this project serves as a useful guide in your journey with LLMs and ExecuTorch.
+

### Table Of Contents

@@ -141,13 +154,24 @@ model = GPT.from_pretrained('gpt2')
# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
- example_inputs = (torch.randint(0, 100, (1, 8), dtype=torch.long),)
+ example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long),)
+
+ # Set up dynamic shape configuration. This allows the sizes of the input tensors
+ # to differ from the sizes of the tensors in `example_inputs` during runtime, as
+ # long as they adhere to the rules specified in the dynamic shape configuration.
+ # Here we set the range of the 0th model input's 1st dimension as
+ # [0, model.config.block_size].
+ # See https://pytorch.org/executorch/main/concepts.html#dynamic-shapes
+ # for details about creating dynamic shapes.
+ dynamic_shape = (
+     {1: torch.export.Dim("token_dim", max=model.config.block_size)},
+ )
# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
-     m = capture_pre_autograd_graph(model, example_inputs)
-     traced_model = export(m, example_inputs)
+     m = capture_pre_autograd_graph(model, example_inputs, dynamic_shapes=dynamic_shape)
+     traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)
# Convert the model into a runnable ExecuTorch program.
edge_config = EdgeCompileConfig(_check_ir_validity=False)
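
From here, the traced program is converted to the Edge dialect and then to an ExecuTorch program that is serialized as `nanogpt.pte` for the runtime to load. Below is a minimal sketch of those remaining steps, assuming the standard `to_edge`/`to_executorch` flow (import paths and helper names may vary across ExecuTorch versions):

```python
from executorch.exir import to_edge

# Convert the traced graph to the Edge dialect, then to an ExecuTorch program.
edge_manager = to_edge(traced_model, compile_config=edge_config)
et_program = edge_manager.to_executorch()

# Serialize the program so it can be loaded by the runtime.
with open("nanogpt.pte", "wb") as f:
    f.write(et_program.buffer)
```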
@@ -204,11 +228,15 @@ output token by token. Each generated token is passed as input for the next run.
```cpp
// main.cpp

+ // The value of the gpt2 `<|endoftext|>` token.
+ #define ENDOFTEXT_TOKEN 50256
+
std::string generate(
    Module& llm_model,
    std::string& prompt,
    BasicTokenizer& tokenizer,
    BasicSampler& sampler,
+     size_t max_input_length,
    size_t max_output_length) {

  // Convert the input text into a list of integers (tokens) that represents
@@ -237,14 +265,23 @@ std::string generate(
    // Sample the next token from the logits.
    int64_t next_token = sampler.sample(logits);
+
+     // Break if we reached the end of the text.
+     if (next_token == ENDOFTEXT_TOKEN) {
+       break;
+     }
+
+     // Add the next token to the output.
    output_tokens.push_back(next_token);

    std::cout << tokenizer.decode({ next_token });
    std::cout.flush();

    // Update next input.
-     input_tokens.erase(input_tokens.begin());
    input_tokens.push_back(next_token);
+     if (input_tokens.size() > max_input_length) {
+       input_tokens.erase(input_tokens.begin());
+     }
  }

  std::cout << std::endl;
@@ -278,7 +315,9 @@ penalties for repeated tokens, and biases to prioritize or de-prioritize specifi
int main() {
  // Set up the prompt. This provides the seed text for the model to elaborate.
-   std::string prompt = "Once upon a time, there was a";
+   std::cout << "Enter model prompt: ";
+   std::string prompt;
+   std::getline(std::cin, prompt);

  // The tokenizer is used to convert between tokens (used by the model) and
  // human-readable strings.
@@ -290,19 +329,19 @@ int main() {
  // Load the exported nanoGPT program, which was generated via the previous steps.
  Module model("nanogpt.pte", torch::executor::Module::MlockConfig::UseMlockIgnoreErrors);

+   const auto max_input_tokens = 1024;
  const auto max_output_tokens = 30;
  std::cout << prompt;
-   generate(model, prompt, tokenizer, sampler, max_output_tokens);
+   generate(model, prompt, tokenizer, sampler, max_input_tokens, max_output_tokens);
}
```
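
The `sampler.sample(logits)` call in the generation loop can be as simple as greedy (argmax) decoding. Below is an illustrative sketch of that idea; the actual `BasicSampler` lives in `basic_sampler.h` (downloaded in the next step) and its interface may differ, and `sample_argmax` is a hypothetical name. More elaborate samplers add temperature scaling, repetition penalties, or per-token biases.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Greedy sampling: return the index of the highest-scoring logit.
int64_t sample_argmax(const std::vector<float>& logits) {
  const auto max_it = std::max_element(logits.begin(), logits.end());
  return static_cast<int64_t>(std::distance(logits.begin(), max_it));
}
```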
Finally, download the following files into the same directory as main.cpp:

- TODO: This is a placeholder.
```
- curl -O https://raw.githubusercontent.com/GregoryComer/et-tutorials/quantization/nanogpt/managed_tensor.h
- curl -O https://raw.githubusercontent.com/GregoryComer/et-tutorials/quantization/nanogpt/basic_tokenizer.h
- curl -O https://raw.githubusercontent.com/GregoryComer/et-tutorials/quantization/nanogpt/basic_sampler.h
+ curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_sampler.h
+ curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_tokenizer.h
+ curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/managed_tensor.h
```

To learn more, see [Running an ExecuTorch Model in C++](https://pytorch.org/executorch/main/running-a-model-cpp-tutorial.html)
@@ -363,10 +402,20 @@ cmake --build cmake-out -j10
./cmake-out/nanogpt_runner
```

- You should see something like the following:
+ You should see the message:
+
+ ```
+ Enter model prompt:
+ ```
+
+ Type some seed text for the model and press enter. Here we use "Hello world!" as
+ an example prompt:

```
- Once upon a time, there was a man who was a member of the military...
+ Enter model prompt: Hello world!
+ Hello world!
+
+ I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in
```

At this point, it is likely to run very slowly. This is because ExecuTorch hasn't been told to optimize for
@@ -423,14 +472,25 @@ model = GPT.from_pretrained('gpt2')
# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (
-     torch.randint(0, 100, (1, 8), dtype=torch.long),
+     torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long),
)

+ # Set up dynamic shape configuration. This allows the sizes of the input tensors
+ # to differ from the sizes of the tensors in `example_inputs` during runtime, as
+ # long as they adhere to the rules specified in the dynamic shape configuration.
+ # Here we set the range of the 0th model input's 1st dimension as
+ # [0, model.config.block_size - 1].
+ # See https://pytorch.org/executorch/main/concepts.html#dynamic-shapes
+ # for details about creating dynamic shapes.
+ dynamic_shape = (
+     {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)},
+ )
+
# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
-     m = capture_pre_autograd_graph(model, example_inputs)
-     traced_model = export(m, example_inputs)
+     m = capture_pre_autograd_graph(model, example_inputs, dynamic_shapes=dynamic_shape)
+     traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
# To be further lowered to Xnnpack backend, `traced_model` needs xnnpack-specific edge compile config
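
From this point the lowering typically converts the traced program to the Edge dialect with that xnnpack-specific config, delegates the supported subgraphs to the XNNPACK backend, and serializes the result. Below is a minimal sketch under those assumptions; the `XnnpackPartitioner` import path and the `edge_config` variable holding the xnnpack-specific compile config are assumptions that may differ between ExecuTorch versions:

```python
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge

# Convert to the Edge dialect using the xnnpack-specific compile config
# (assumed here to be held in `edge_config`).
edge_manager = to_edge(traced_model, compile_config=edge_config)

# Delegate every subgraph that XNNPACK supports to the XNNPACK backend.
edge_manager = edge_manager.to_backend(XnnpackPartitioner())
et_program = edge_manager.to_executorch()

# Serialize the delegated program for the runner to load.
with open("nanogpt.pte", "wb") as f:
    f.write(et_program.buffer)
```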
@@ -512,12 +572,24 @@ cmake --build cmake-out -j10
./cmake-out/nanogpt_runner
```

- You should see something like the following:
+
+ You should see the message:
+
+ ```
+ Enter model prompt:
+ ```
+
+ Type some seed text for the model and press enter. Here we use "Hello world!" as
+ an example prompt:

```
- Once upon a time, there was a man who was a member of the military...
+ Enter model prompt: Hello world!
+ Hello world!
+
+ I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in
```

+ The delegated model should be noticeably faster compared to the non-delegated model.

For more information regarding backend delegation, see the ExecuTorch guides
for the