Trace cache start pos #1

Open
wants to merge 1 commit into nlp/cache-start-pos
Conversation

nlpfollower (Owner)

Command:
DEBUG_CACHE=True python3 torchchat.py chat --checkpoint-path ../torchtitan/models/Llama3.1-8B-Instruct/consolidated.0.pth --tokenizer-path ../torchtitan/models/Llama3.1-8B-Instruct/tokenizer.model --params-path torchchat/model_params/Meta-Llama-3.1-8B.json
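
With DEBUG_CACHE=True, every call into the Transformer's forward prints the input_pos tensor it receives, along with each sampled token, which is what produces the "Transformer forward input pos" lines in the traces below. A minimal sketch of that kind of instrumentation, assuming a forward(tokens, input_pos) signature in the gpt-fast style that torchchat follows (the exact hook point is an assumption, not torchchat's actual code):

```python
import os
import torch

# Mirror the DEBUG_CACHE=True environment toggle from the command above.
DEBUG_CACHE = os.environ.get("DEBUG_CACHE", "").lower() in ("1", "true")

def trace_forward(forward_fn):
    """Wrap a forward(tokens, input_pos) callable to log KV-cache positions."""
    def wrapped(tokens: torch.Tensor, input_pos: torch.Tensor):
        if DEBUG_CACHE:
            print(f"Transformer forward input pos {input_pos}")
        return forward_fn(tokens, input_pos)
    return wrapped

# Hypothetical usage: model.forward = trace_forward(model.forward)
```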

Without start_pos += encoded.size(0):

Entering Chat Mode. Will continue chatting back and forth with the language model until the model's max context length of 8192 tokens is hit or until the user says /bye
Do you want to enter a system prompt? Enter y for yes and anything else for no. 
y
What is your system prompt? 
You are a helpful assistant.
User: Can you remember this word for me: "berryjuice"? Reply briefly.
Model: Transformer forward input pos tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([37], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([15717], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([38], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([8783], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([39], device='cuda:0', dtype=torch.int32)
berryjuiceToken tensor: tensor([560], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([40], device='cuda:0', dtype=torch.int32)
final token input_pos: tensor([41], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([128009], device='cuda:0', dtype=torch.int32)
local_token_tensor: [15717, 8783, 560, 128009]
berryjuice<|eot_id|>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 4 tokens                 
Time for inference 1: 0.5418 sec total                 
Time to first token: 0.1843 sec with parallel prefill.                

      Total throughput: 9.2289 tokens/sec, 0.1084 s/token                 
First token throughput: 5.4250 tokens/sec, 0.1843 s/token                 
 Next token throughput: 11.1905 tokens/sec, 0.0894 s/token                     

Bandwidth achieved: 148.22 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================

User: What word did I ask you to remember? Reply briefly.
Model: Transformer forward input pos tensor([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
        22, 23, 24, 25], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([26], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([2028], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([27], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([10652], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([28], device='cuda:0', dtype=torch.int32)
This conversation just startedToken tensor: tensor([1120], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([29], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([3940], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([30], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([13], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([31], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([358], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([32], device='cuda:0', dtype=torch.int32)
. I don'tToken tensor: tensor([1541], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([33], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([956], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([34], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([19635], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([35], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([499], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([36], device='cuda:0', dtype=torch.int32)
 recall you asking meToken tensor: tensor([10371], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([37], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([757], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([38], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([311], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([39], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([6227], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([40], device='cuda:0', dtype=torch.int32)
 to remember any wordToken tensor: tensor([904], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([41], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([3492], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([42], device='cuda:0', dtype=torch.int32)
.Token tensor: tensor([13], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([43], device='cuda:0', dtype=torch.int32)
final token input_pos: tensor([44], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([128009], device='cuda:0', dtype=torch.int32)
local_token_tensor: [2028, 10652, 1120, 3940, 13, 358, 1541, 956, 19635, 499, 10371, 757, 311, 6227, 904, 3492, 13, 128009]
This conversation just started. I don't recall you asking me to remember any word.<|eot_id|>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 18 tokens                 
Time for inference 2: 1.2454 sec total                 
Time to first token: 0.0276 sec with parallel prefill.                

      Total throughput: 15.2556 tokens/sec, 0.0655 s/token                 
First token throughput: 36.2249 tokens/sec, 0.0276 s/token                 
 Next token throughput: 14.7803 tokens/sec, 0.0677 s/token                     

Bandwidth achieved: 245.01 GB/s

========================================

User: 
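
Note what went wrong above: start_pos was advanced only by the number of generated tokens (4 after the first reply), never by the prompt tokens consumed during prefill, so the second prefill begins at position 4 instead of 41. Cache slots 4-25 are then overwritten, clobbering the cached first exchange, and the model no longer knows about "berryjuice". A self-contained sketch of the accounting, with turn lengths read off the traces above (fix_prompt_advance stands in for the start_pos += encoded.size(0) line this PR traces):

```python
def next_prefill_starts(turns: list[tuple[int, int]], fix_prompt_advance: bool) -> list[int]:
    """Return the KV-cache position where each turn's prefill begins."""
    start_pos, starts = 0, []
    for prompt_len, gen_len in turns:
        starts.append(start_pos)
        if fix_prompt_advance:
            start_pos += prompt_len  # start_pos += encoded.size(0): count the prompt
        start_pos += gen_len         # advanced once per generated token
    return starts

# (prompt tokens, generated tokens) per turn, taken from the logs:
print(next_prefill_starts([(37, 4), (22, 18)], fix_prompt_advance=False))  # [0, 4]  -> overwrites cache
print(next_prefill_starts([(37, 11), (22, 6)], fix_prompt_advance=True))   # [0, 48] -> contiguous
```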

With start_pos += encoded.size(0):

Entering Chat Mode. Will continue chatting back and forth with the language model until the model's max context length of 8192 tokens is hit or until the user says /bye
Do you want to enter a system prompt? Enter y for yes and anything else for no. 
y
What is your system prompt? 
You are a helpful assistant.
User: Can you remember this word for me: "berryjuice"? Reply briefly.
Model: Transformer forward input pos tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([37], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([40], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([38], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([3077], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([39], device='cuda:0', dtype=torch.int32)
I've got itToken tensor: tensor([2751], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([40], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([433], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([41], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([25], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([42], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([330], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([43], device='cuda:0', dtype=torch.int32)
: "berryjuToken tensor: tensor([15717], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([44], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([8783], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([45], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([560], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([46], device='cuda:0', dtype=torch.int32)
ice".Token tensor: tensor([3343], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([47], device='cuda:0', dtype=torch.int32)
final token input_pos: tensor([48], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([128009], device='cuda:0', dtype=torch.int32)
local_token_tensor: [40, 3077, 2751, 433, 25, 330, 15717, 8783, 560, 3343, 128009]
I've got it: "berryjuice".<|eot_id|>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 11 tokens                 
Time for inference 1: 1.0131 sec total                 
Time to first token: 0.1842 sec with parallel prefill.                

      Total throughput: 11.8449 tokens/sec, 0.0844 s/token                 
First token throughput: 5.4291 tokens/sec, 0.1842 s/token                 
 Next token throughput: 13.2706 tokens/sec, 0.0754 s/token                     

Bandwidth achieved: 190.24 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================

User: What word did I ask you to remember? Reply briefly.
Model: Transformer forward input pos tensor([48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
        66, 67, 68, 69], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([70], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([1], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([71], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([15717], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([72], device='cuda:0', dtype=torch.int32)
"berryjuiceToken tensor: tensor([8783], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([73], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([560], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([74], device='cuda:0', dtype=torch.int32)
".Token tensor: tensor([3343], device='cuda:0', dtype=torch.int32)
Transformer forward input pos tensor([75], device='cuda:0', dtype=torch.int32)
final token input_pos: tensor([76], device='cuda:0', dtype=torch.int32)
Token tensor: tensor([128009], device='cuda:0', dtype=torch.int32)
local_token_tensor: [1, 15717, 8783, 560, 3343, 128009]
"berryjuice".<|eot_id|>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 6 tokens                 
Time for inference 2: 0.4490 sec total                 
Time to first token: 0.0265 sec with parallel prefill.                

      Total throughput: 15.5912 tokens/sec, 0.0641 s/token                 
First token throughput: 37.7502 tokens/sec, 0.0265 s/token                 
 Next token throughput: 14.2018 tokens/sec, 0.0704 s/token                     

Bandwidth achieved: 250.40 GB/s

========================================

User: 
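
With the fix, each prefill resumes exactly where the previous turn ended: the second prefill starts at position 48 = 37 prompt tokens + 11 generated tokens from the first turn, the cache stays contiguous, and the model recalls "berryjuice". If one wanted to guard this invariant under DEBUG_CACHE, a check along these lines would do (check_prefill_contiguous and cached_len are hypothetical helpers, not part of torchchat):

```python
import torch

def check_prefill_contiguous(input_pos: torch.Tensor, cached_len: int) -> None:
    """Assert that a prefill starts at the next free KV-cache slot."""
    first = int(input_pos[0].item())
    assert first == cached_len, (
        f"prefill starts at {first} but {cached_len} tokens are already cached; "
        "earlier cache entries would be overwritten"
    )
```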

nlpfollower changed the title from "Debug cache start pos" to "Trace cache start pos" on Jan 20, 2025