Trace cache start pos #1

Open
wants to merge 1 commit into nlp/cache-start-pos
Conversation

nlpfollower (Owner)

Command:
DEBUG_CACHE=True python3 torchchat.py chat --checkpoint-path ../torchtitan/models/Llama3.1-8B-Instruct/consolidated.0.pth --tokenizer-path ../torchtitan/models/Llama3.1-8B-Instruct/tokenizer.model --params-path torchchat/model_params/Meta-Llama-3.1-8B.json
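
With DEBUG_CACHE=True, every call into the Transformer's forward prints the input_pos tensor it receives, along with each sampled token, which is what produces the "Transformer forward input pos" lines in the traces below. A minimal sketch of that kind of instrumentation, assuming a forward(tokens, input_pos) signature in the gpt-fast style that torchchat follows (the exact hook point is an assumption, not torchchat's actual code):

```python
import os
import torch

# Mirror the DEBUG_CACHE=True environment toggle from the command above.
DEBUG_CACHE = os.environ.get("DEBUG_CACHE", "").lower() in ("1", "true")

def trace_forward(forward_fn):
    """Wrap a forward(tokens, input_pos) callable to log KV-cache positions."""
    def wrapped(tokens: torch.Tensor, input_pos: torch.Tensor):
        if DEBUG_CACHE:
            print(f"Transformer forward input pos {input_pos}")
        return forward_fn(tokens, input_pos)
    return wrapped

# Hypothetical usage: model.forward = trace_forward(model.forward)
```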

Without start_pos += encoded.size(0):

Entering Chat Mode. Will continue chatting back and forth with the language model until the model's max context length of 8192 tokens is hit or until the user says /bye
Do you want to enter a system prompt? Enter y for yes and anything else for no. 
y
What is your system prompt? 
You are a helpful assistant.
User: Can you remember this word for me: "berryjuice"? Reply briefly.
Model: Transformer forward input pos tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([37], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([15717], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([38], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([8783], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([39], device='cuda:0', dtype=torch.int32)
berryjuiceToken tensor: tensor([560], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([40], device='cuda:0', dtype=torch.int32)
final token input_pos: tensor([41], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([128009], device='cuda:0', dtype=torch.int32)
local_token_tensor: [15717, 8783, 560, 128009]
berryjuice<|eot_id|>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 4 tokens                 
Time for inference 1: 0.5418 sec total                 
Time to first token: 0.1843 sec with parallel prefill.                

      Total throughput: 9.2289 tokens/sec, 0.1084 s/token                 
First token throughput: 5.4250 tokens/sec, 0.1843 s/token                 
 Next token throughput: 11.1905 tokens/sec, 0.0894 s/token                     

Bandwidth achieved: 148.22 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================

User: What word did I ask you to remember? Reply briefly.
Model: Transformer forward input pos tensor([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
        22, 23, 24, 25], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([26], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([2028], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([27], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([10652], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([28], device='cuda:0', dtype=torch.int32)
This conversation just startedToken tensor: tensor([1120], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([29], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([3940], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([30], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([13], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([31], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([358], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([32], device='cuda:0', dtype=torch.int32)
. I don'tToken tensor: tensor([1541], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([33], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([956], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([34], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([19635], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([35], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([499], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([36], device='cuda:0', dtype=torch.int32)
 recall you asking meToken tensor: tensor([10371], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([37], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([757], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([38], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([311], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([39], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([6227], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([40], device='cuda:0', dtype=torch.int32)
 to remember any wordToken tensor: tensor([904], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([41], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([3492], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([42], device='cuda:0', dtype=torch.int32)
.Token tensor: tensor([13], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([43], device='cuda:0', dtype=torch.int32)
final token input_pos: tensor([44], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([128009], device='cuda:0', dtype=torch.int32)
local_token_tensor: [2028, 10652, 1120, 3940, 13, 358, 1541, 956, 19635, 499, 10371, 757, 311, 6227, 904, 3492, 13, 128009]
This conversation just started. I don't recall you asking me to remember any word.<|eot_id|>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 18 tokens                 
Time for inference 2: 1.2454 sec total                 
Time to first token: 0.0276 sec with parallel prefill.                

      Total throughput: 15.2556 tokens/sec, 0.0655 s/token                 
First token throughput: 36.2249 tokens/sec, 0.0276 s/token                 
 Next token throughput: 14.7803 tokens/sec, 0.0677 s/token                     

Bandwidth achieved: 245.01 GB/s

========================================

User: 
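
Note what went wrong above: start_pos was advanced only by the number of generated tokens (4 after the first reply), never by the prompt tokens consumed during prefill, so the second prefill begins at position 4 instead of 41. Cache slots 4-25 are then overwritten, clobbering the cached first exchange, and the model no longer knows about "berryjuice". A self-contained sketch of the accounting, with turn lengths read off the traces above (fix_prompt_advance stands in for the start_pos += encoded.size(0) line this PR traces):

```python
def next_prefill_starts(turns: list[tuple[int, int]], fix_prompt_advance: bool) -> list[int]:
    """Return the KV-cache position where each turn's prefill begins."""
    start_pos, starts = 0, []
    for prompt_len, gen_len in turns:
        starts.append(start_pos)
        if fix_prompt_advance:
            start_pos += prompt_len  # start_pos += encoded.size(0): count the prompt
        start_pos += gen_len         # advanced once per generated token
    return starts

# (prompt tokens, generated tokens) per turn, taken from the logs:
print(next_prefill_starts([(37, 4), (22, 18)], fix_prompt_advance=False))  # [0, 4]  -> overwrites cache
print(next_prefill_starts([(37, 11), (22, 6)], fix_prompt_advance=True))   # [0, 48] -> contiguous
```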

With start_pos += encoded.size(0):

Entering Chat Mode. Will continue chatting back and forth with the language model until the model's max context length of 8192 tokens is hit or until the user says /bye
Do you want to enter a system prompt? Enter y for yes and anything else for no. 
y
What is your system prompt? 
You are a helpful assistant.
User: Can you remember this word for me: "berryjuice"? Reply briefly.
Model: Transformer forward input pos tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([37], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([40], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([38], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([3077], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([39], device='cuda:0', dtype=torch.int32)
I've got itToken tensor: tensor([2751], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([40], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([433], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([41], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([25], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([42], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([330], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([43], device='cuda:0', dtype=torch.int32)
: "berryjuToken tensor: tensor([15717], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([44], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([8783], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([45], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([560], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([46], device='cuda:0', dtype=torch.int32)
ice".Token tensor: tensor([3343], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([47], device='cuda:0', dtype=torch.int32)
final token input_pos: tensor([48], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([128009], device='cuda:0', dtype=torch.int32)
local_token_tensor: [40, 3077, 2751, 433, 25, 330, 15717, 8783, 560, 3343, 128009]
I've got it: "berryjuice".<|eot_id|>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 11 tokens                 
Time for inference 1: 1.0131 sec total                 
Time to first token: 0.1842 sec with parallel prefill.                

      Total throughput: 11.8449 tokens/sec, 0.0844 s/token                 
First token throughput: 5.4291 tokens/sec, 0.1842 s/token                 
 Next token throughput: 13.2706 tokens/sec, 0.0754 s/token                     

Bandwidth achieved: 190.24 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================

User: What word did I ask you to remember? Reply briefly.
Model: Transformer forward input pos tensor([48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
        66, 67, 68, 69], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([70], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([1], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([71], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([15717], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([72], device='cuda:0', dtype=torch.int32)
"berryjuiceToken tensor: tensor([8783], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([73], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([560], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([74], device='cuda:0', dtype=torch.int32)
".Token tensor: tensor([3343], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([75], device='cuda:0', dtype=torch.int32)
final token input_pos: tensor([76], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([128009], device='cuda:0', dtype=torch.int32)
local_token_tensor: [1, 15717, 8783, 560, 3343, 128009]
"berryjuice".<|eot_id|>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 6 tokens                 
Time for inference 2: 0.4490 sec total                 
Time to first token: 0.0265 sec with parallel prefill.                

      Total throughput: 15.5912 tokens/sec, 0.0641 s/token                 
First token throughput: 37.7502 tokens/sec, 0.0265 s/token                 
 Next token throughput: 14.2018 tokens/sec, 0.0704 s/token                     

Bandwidth achieved: 250.40 GB/s

========================================

User: 
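
With the fix, each prefill resumes exactly where the previous turn ended: the second prefill starts at position 48 = 37 prompt tokens + 11 generated tokens from the first turn, the cache stays contiguous, and the model recalls "berryjuice". If one wanted to guard this invariant under DEBUG_CACHE, a check along these lines would do (check_prefill_contiguous and cached_len are hypothetical helpers, not part of torchchat):

```python
import torch

def check_prefill_contiguous(input_pos: torch.Tensor, cached_len: int) -> None:
    """Assert that a prefill starts at the next free KV-cache slot."""
    first = int(input_pos[0].item())
    assert first == cached_len, (
        f"prefill starts at {first} but {cached_len} tokens are already cached; "
        "earlier cache entries would be overwritten"
    )
```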

nlpfollower changed the title from "Debug cache start pos" to "Trace cache start pos" on Jan 20, 2025