# Getting Started with LLMs via ExecuTorch

Welcome to the LLM Manual! This manual provides a practical example of how to leverage
ExecuTorch to onboard your own Large Language Models (LLMs). Our primary goal is to offer
clear and concise guidelines on how to integrate our system with your own LLMs.

Please note that this project is intended as a demonstration and not as a fully functional
example with optimal performance. As such, certain components such as the sampler, tokenizer,
and others are provided in their bare-minimum versions solely for demonstration purposes.
Consequently, the results produced by the model may vary and might not always be optimal.

We encourage users to use this project as a starting point and adapt it to their specific needs,
which includes creating their own versions of the tokenizer, sampler, acceleration backends, and
other components. We hope this project serves as a useful guide in your journey with LLMs and ExecuTorch.

### Table Of Contents

With the model instantiated, trace and export it with dynamic input shapes:

```python
model = GPT.from_pretrained('gpt2')

# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long), )

# Set up the dynamic shape configuration. This allows the sizes of the input tensors
# at runtime to differ from the sizes of the tensors in `example_inputs`, as long as
# they follow the rules the dynamic shape configuration specifies.
# Here we set the range of the 0th model input's 1st dimension to [0, model.config.block_size - 1].
# Details of dynamic shapes, and how to create custom configurations, can be found in the
# [ExecuTorch concepts documentation](https://pytorch.org/executorch/0.2/concepts.html#dynamic-shapes).
dynamic_shape = (
    {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)},
)

# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = capture_pre_autograd_graph(model, example_inputs, dynamic_shapes=dynamic_shape)
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
edge_config = EdgeCompileConfig(_check_ir_validity=False)
```
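The snippet above stops at the edge compile config; the rest of the lowering is not reproduced
in this excerpt. As a rough sketch of how the export typically finishes, assuming the standard
`to_edge` / `to_executorch` flow from `executorch.exir`, the traced model can be converted and
saved to `nanogpt.pte` roughly as follows:

```python
from executorch.exir import to_edge

# Lower the traced graph to the Edge dialect, then to an ExecuTorch program.
edge_manager = to_edge(traced_model, compile_config=edge_config)
et_program = edge_manager.to_executorch()

# Serialize the program so the C++ runner can load it later.
with open("nanogpt.pte", "wb") as f:
    f.write(et_program.buffer)
```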

The model generates output token by token. Each generated token is passed as input for the next run.

```cpp
// main.cpp

// The token id of GPT-2's end-of-text token.
#define ENDOFTEXT 50256

std::string generate(
    Module& llm_model,
    std::string& prompt,
    BasicTokenizer& tokenizer,
    BasicSampler& sampler,
    size_t max_input_length,
    size_t max_output_length) {

  // Convert the input text into a list of integers (tokens) that represents it.
  // ... (tokenization and the per-token model execution are omitted in this excerpt) ...

    // Sample the next token from the logits.
    int64_t next_token = sampler.sample(logits);

    // Break if we reached the end of the text.
    if (next_token == ENDOFTEXT) {
      break;
    }

    // Add the next token to the output.
    output_tokens.push_back(next_token);

    // Print the generated token as soon as it is available.
    std::cout << tokenizer.decode({ next_token });
    std::cout.flush();

    // Update the next input: append the new token and, once the context is full,
    // drop the oldest token so the input stays within max_input_length.
    input_tokens.push_back(next_token);
    if (input_tokens.size() > max_input_length) {
      input_tokens.erase(input_tokens.begin());
    }
  }

  std::cout << std::endl;

  // ... (the accumulated output_tokens are decoded into the returned string) ...
}
```
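The `BasicSampler` driving `sampler.sample(logits)` above is deliberately bare-minimum; more
elaborate samplers add temperature scaling, penalties for repeated tokens, and biases to
prioritize or de-prioritize specific tokens. As a rough illustration (not necessarily the exact
implementation shipped with the example), a greedy sampler that always picks the
highest-probability token can be as small as the following, assuming the logits arrive as a
`std::vector<float>` and using the hypothetical class name `GreedySampler`:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Illustrative greedy sampler: always pick the token with the largest logit.
class GreedySampler {
 public:
  int64_t sample(const std::vector<float>& logits) {
    // The index of the largest logit is taken as the next token id.
    return std::distance(
        logits.begin(), std::max_element(logits.begin(), logits.end()));
  }
};
```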

The `main()` function ties these pieces together. It reads the prompt from standard input, loads the exported program, and calls `generate()`:

```cpp
int main() {
  // Set up the prompt. This provides the seed text for the model to elaborate.
  std::cout << "Prompt: ";
  std::string prompt;
  std::getline(std::cin, prompt);

  // The tokenizer is used to convert between tokens (used by the model) and
  // human-readable strings.
  // ... (tokenizer and sampler construction omitted in this excerpt) ...

  // Load the exported nanoGPT program, which was generated via the previous steps.
  Module model("nanogpt.pte", torch::executor::Module::MlockConfig::UseMlockIgnoreErrors);

  const auto max_input_tokens = 1024;
  const auto max_output_tokens = 30;
  std::cout << prompt;
  generate(model, prompt, tokenizer, sampler, max_input_tokens, max_output_tokens);
}
```

Finally, download the following files into the same directory as main.cpp:

```
curl -O https://raw.githubusercontent.com/pytorch/executorch/release/0.2/examples/llm_manual/basic_sampler.h
curl -O https://raw.githubusercontent.com/pytorch/executorch/release/0.2/examples/llm_manual/basic_tokenizer.h
curl -O https://raw.githubusercontent.com/pytorch/executorch/release/0.2/examples/llm_manual/managed_tensor.h
```

To learn more, see [Running an ExecuTorch Model in C++](https://pytorch.org/executorch/main/running-a-model-cpp-tutorial.html).
Build and run the runner:

```
cmake --build cmake-out -j10
./cmake-out/nanogpt_runner
```

You should see a prompt like the following, asking you to enter the initial text:

```
Prompt:
```

Here we use "Hello world!" as the example prompt. After you type your prompt and press enter, you should see output like the following:

```
Prompt: Hello world!
Hello world!

I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in
```

At this point, it is likely to run very slowly. This is because ExecuTorch hasn't been told to optimize for specific hardware.
To delegate the model to the XNNPACK backend, the export step changes slightly:

```python
model = GPT.from_pretrained('gpt2')

# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (
    torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long),
)

# Set up the dynamic shape configuration. This allows the sizes of the input tensors
# at runtime to differ from the sizes of the tensors in `example_inputs`, as long as
# they follow the rules the dynamic shape configuration specifies.
# Here we set the range of the 0th model input's 1st dimension to [0, model.config.block_size - 1].
# Details of dynamic shapes, and how to create custom configurations, can be found in the
# [ExecuTorch concepts documentation](https://pytorch.org/executorch/0.2/concepts.html#dynamic-shapes).
dynamic_shape = (
    {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)},
)

# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = capture_pre_autograd_graph(model, example_inputs, dynamic_shapes=dynamic_shape)
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
# To be further lowered to the XNNPACK backend, `traced_model` needs an
# XNNPACK-specific edge compile config.
```
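The XNNPACK-specific lowering itself is not reproduced in this excerpt. As a rough sketch,
assuming the `XnnpackPartitioner` from `executorch.backends.xnnpack` and that `edge_config`
holds the XNNPACK-specific edge compile config mentioned in the comment above, delegation adds
a single `to_backend` call to the otherwise unchanged flow:

```python
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge

# `edge_config` is assumed to be the XNNPACK-specific edge compile config
# referred to in the comment above.
edge_manager = to_edge(traced_model, compile_config=edge_config)

# Hand the supported portions of the graph to the XNNPACK backend.
edge_manager = edge_manager.to_backend(XnnpackPartitioner())
et_program = edge_manager.to_executorch()

# Save the delegated program; the C++ runner loads it exactly as before.
with open("nanogpt.pte", "wb") as f:
    f.write(et_program.buffer)
```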
Rebuild and run the runner:

```
cmake --build cmake-out -j10
./cmake-out/nanogpt_runner
```

You should again see a prompt asking you to enter the initial text:

```
Prompt:
```

Here we use "Hello world!" as the example prompt. After you type your prompt and press enter, you should see output like the following:

```
Prompt: Hello world!
Hello world!

I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in
```

You should now notice that generation is significantly faster than it was without delegation.

For more information regarding backend delegation, see the ExecuTorch guides
for the