Load parallel.cpp -f file.txt external prompt file #3416

Merged
merged 36 commits into ggml-org:master
Oct 6, 2023

Conversation

pudepiedj
Contributor

This branch includes amendments to three files necessary to implement the external prompt file option -f file.txt that appears in ./bin/parallel --help. The three affected files are:

common.h: add a new params.prompt_file field to the gpt_params definition
common.cpp: read the file name supplied on the command line via argv[i] into params.prompt_file
parallel.cpp: add a slicing function to segment params.prompt_file into individual prompts; assign those segments to k_prompts, overwriting the default values; display the contents of the external prompt file; and add a datetime stamp (via #include <ctime> etc.), including the name of the external file, to the final report (a standalone sketch of the approach follows this list)
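
For reference, the sketch below is a standalone illustration of the approach rather than the code as merged: the split_string helper name, the numbering format, and the default file name are assumptions made for the example.

// Standalone sketch: read a newline-separated prompt file, split it into
// individual prompts (as parallel.cpp does before overwriting k_prompts),
// print them, and emit a <ctime> datetime stamp like the one in the final report.
#include <cstdio>
#include <ctime>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// split the file contents on a delimiter, dropping empty segments
static std::vector<std::string> split_string(const std::string & input, char delim) {
    std::vector<std::string> parts;
    std::istringstream stream(input);
    std::string segment;
    while (std::getline(stream, segment, delim)) {
        if (!segment.empty()) {
            parts.push_back(segment);
        }
    }
    return parts;
}

int main(int argc, char ** argv) {
    // hypothetical default; parallel.cpp takes the name from params.prompt_file (-f)
    const std::string prompt_file = argc > 1 ? argv[1] : "ParallelQuestions.txt";

    std::ifstream file(prompt_file);
    if (!file) {
        fprintf(stderr, "failed to open %s\n", prompt_file.c_str());
        return 1;
    }
    std::stringstream buffer;
    buffer << file.rdbuf();

    // these segments would replace the built-in default prompts
    const std::vector<std::string> prompts = split_string(buffer.str(), '\n');

    printf("Now printing the external prompt file %s\n\n", prompt_file.c_str());
    for (size_t i = 0; i < prompts.size(); ++i) {
        printf("%2zu prompt: %s\n", i + 1, prompts[i].c_str());
    }

    // datetime stamp for the final report (std::ctime appends its own newline)
    const std::time_t now = std::time(nullptr);
    printf("\nRUN PARAMETERS as at %s", std::ctime(&now));

    return 0;
}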

Command line (output from a second run with -ns 128 appears further below):

% ./build/bin/parallel -m ./models/llama-2-13b/ggml-model-q8_0.gguf -f "ParallelQuestions.txt" -n 512 -t 1 -s 3456 -ngl 100 -c 8192 -np 4 -ns 16 -cb

Example output from two different runs on an M2 Max (32 GB) running macOS Sonoma 14.0 (initialisation omitted):

llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 6400.00 MB
llama_new_context_with_model: compute buffer total size = 691.88 MB
llama_new_context_with_model: max tensor size =   166.02 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 13190.58 MB, (13191.20 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  6402.00 MB, (19593.20 / 21845.34)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   686.02 MB, (20279.22 / 21845.34)

Now printing the external prompt file ParallelQuestions.txt

 1 prompt: What do you know about Hobbits?
 2 prompt: What is quantum field theory?
 3 prompt: Why did the chicken cross the road?
 4 prompt: Who is the president of the United States?
 5 prompt: How do I run CMake on MacOS?
 6 prompt: Do you agree that C++ is a really finicky language compared with Python3?
 7 prompt: Is it a good idea to invest in technology?
 8 prompt: Do you like Wagner's Ring?
 9 prompt: Do you think this file input option is really neat?
10 prompt: What should we all do about climate change?
11 prompt: Is time-travel possible within the laws of current physics?
12 prompt: Is it like anything to be a bat?
13 prompt: Once the chicken has crossed the road, does it try to go back?
14 prompt: Who is the greatest of all musical composers?
15 prompt: What is art?
16 prompt: Is there life elsewhere in the universe?
17 prompt: What is intelligence?
18 prompt: What is the difference between knowledge and intelligence?
19 prompt: Will religion ever die?
20 prompt: Do we understand ourselves?
21 prompt: What is the best way to cook eggs?
22 prompt: If you cannot see things, on what basis do you evaluate them?
23 prompt: Explain the role of the np junction in photovoltaic cells?
24 prompt: Is professional sport a good or bad influence on human behaviour?
25 prompt: Is capital punishment immoral?
26 prompt: Should we care about other people?
27 prompt: Who are you?
28 prompt: Which sense would you surrender if you could?
29 prompt: Was Henry Ford a hero or a villain?
30 prompt: Do we need leaders?
31 prompt: What is nucleosynthesis?
32 prompt: Who is the greatest scientist of all time so far?


main: Simulating parallel requests from clients:
main: n_parallel = 4, n_sequences = 16, cont_batching = 1, system tokens = 305

main: Evaluating the system prompt ...

Processing requests ...

main: clearing the KV cache
Client   0, seq    0, started decoding ...
Client   1, seq    1, started decoding ...
Client   2, seq    2, started decoding ...
Client   3, seq    3, started decoding ...
Client   2, seq   2/ 16, prompt   15 t, response   63 t, time  9.33 s, speed  8.36 t/s, cache miss 0  
Input:    Is it a good idea to invest in technology?
Response: It depends on your personal circumstances and financial situation. Investing in technology can be a way to diversify your portfolio and potentially generate higher returns than traditional investments like stocks and bonds. However, it is important to do your research and understand the risks before making any investment decisions.

Client   2, seq    4, started decoding ...
Client   3, seq   3/ 16, prompt   14 t, response   65 t, time  9.63 s, speed  8.20 t/s, cache miss 0  
Input:    What should we all do about climate change?
Response: We should all do something about climate change. Everyone can make a difference by reducing their carbon footprint, conserving energy and water, eating less meat, and using public transportation or biking instead of driving whenever possible. We should also support policies that promote renewable energy sources like solar and wind power.

Many lines omitted. The following is from the end of a run with -ns 128.

Client   0, seq 121/128, prompt   21 t, response  149 t, time 55.81 s, speed  3.05 t/s, cache miss 103  
Input:    Once the chicken has crossed the road, does it try to go back?
Response: It is difficult to say for sure what the chicken's motivations are. Chickens may cross roads for a variety of reasons, such as searching for food or water, escaping predators, or simply exploring their surroundings. If a chicken has crossed a road once, it is possible that it will attempt to go back across the same road in the future if it finds something worth returning to on the other side. However, it is also possible that the chicken may not want to cross the road again, depending on its experiences and motivations at the time. Ultimately, it is hard to know for sure what a particular chicken's thoughts or intentions are when it comes to crossing roads.

main: clearing the KV cache

RUN PARAMETERS as at Sat Sep 30 17:05:27 2023

main: n_parallel = 64, n_sequences = 128, cont_batching = 1, system tokens = 305
external prompt file (if any): ParallelQuestions.txt

Total prompt tokens:   1711, speed: 14.74 t/s
Total gen tokens:      9152, speed: 78.82 t/s
Total speed (AVG):           speed: 93.55 t/s
Cache misses:           103


llama_print_timings:        load time =   647.84 ms
llama_print_timings:      sample time =  6313.92 ms /  9280 runs   (    0.68 ms per token,  1469.77 tokens per second)
llama_print_timings: prompt eval time = 106014.63 ms / 11134 tokens (    9.52 ms per token,   105.02 tokens per second)
llama_print_timings:        eval time =  2807.62 ms /    34 runs   (   82.58 ms per token,    12.11 tokens per second)
llama_print_timings:       total time = 116116.12 ms
ggml_metal_free: deallocating

@pudepiedj changed the title from "Load parallel prompt file" to "Load parallel.cpp -f file.txt external prompt file" on Oct 1, 2023
pudepiedj added a commit to pudepiedj/llama.cpp that referenced this pull request Oct 2, 2023
@pudepiedj
Contributor Author

pudepiedj commented Oct 3, 2023 via email

@cebtenzzre
Collaborator

Is your preference for a new PR after this kind of updating?

No, please keep the existing PR. You may git rebase -i and force-push to your branch if you would like to clean up the history, but it doesn't really matter because the PR will be squashed into a single commit before merging.

@pudepiedj
Contributor Author

Is your preference for a new PR after this kind of updating?

No, please keep the existing PR. You may git rebase -i and force-push to your branch if you would like to clean up the history, but it doesn't really matter because the PR will be squashed into a single commit before merging.

Thank you. All changes to <origin/load-parallel-prompt-file> now pushed to <origin/Update-load-parallel-prompt-file> which I hope I have done correctly this time.

@pudepiedj
Contributor Author

It's interesting to use the 100 questions in examples/jeopardy/questions.txt as the external prompt file to measure the difference in performance between simultaneous and sequential processing.
Without simultaneous processing (-np 1) the file takes roughly 233 seconds on an M2 Max (38-core GPU) running Sonoma 14.0; with -np 64 simultaneous clients and continuous batching (-cb) it takes about 43 seconds.
The command line (-c 16384 is slightly too large to run with f16):

% ./build/bin/parallel -m ./models/llama-2-7b/ggml-model-f16.gguf -f "examples/jeopardy/questions.txt" -n 256 -t 1 -ngl 100 -c 8192 -s 1234 -np 64 -ns 100 -cb

Since memory is critical, it's worth noting the resources used and how parsimonious the system allocation is relative to `device.recommendedMaxWorkingSetSize`:

ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 21845.34 MB
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size = 557.88 MB
llama_new_context_with_model: max tensor size =   250.00 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 12853.73 MB, (12854.36 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  4098.00 MB, (16952.36 / 21845.34)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   552.02 MB, (17504.38 / 21845.34)

@cebtenzzre
Collaborator

Thank you. All changes to <origin/load-parallel-prompt-file> now pushed to <origin/Update-load-parallel-prompt-file> which I hope I have done correctly this time.

Could you please push your changes to the load-parallel-prompt-file branch so they appear here?

@pudepiedj
Contributor Author

Thank you. All changes to <origin/load-parallel-prompt-file> now pushed to <origin/Update-load-parallel-prompt-file> which I hope I have done correctly this time.

Could you please push your changes to the load-parallel-prompt-file branch so they appear here?

OK this is what I did. I hope it's right. I've looked at the files in load-parallel-prompt-file and they appear to have been changed correctly. Please let me know if I have done something wrong (again)!

(base) edsilm2@JCPM2 llama.cpp % git remote add llama.cpp https://github.com/pudepiedj/llama.cpp.git
(base) edsilm2@JCPM2 llama.cpp % git fetch llama.cpp
From https://github.com/pudepiedj/llama.cpp
 * [new branch]      Update-load-parallel-prompt-file -> llama.cpp/Update-load-parallel-prompt-file
 * [new branch]      load-parallel-prompt-file        -> llama.cpp/load-parallel-prompt-file
 * [new branch]      master                           -> llama.cpp/master
(base) edsilm2@JCPM2 llama.cpp % git push llama.cpp Update-load-parallel-prompt-file:load-parallel-prompt-file
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/pudepiedj/llama.cpp.git
   ce10861..bf8c4df  Update-load-parallel-prompt-file -> load-parallel-prompt-file

@ggerganov ggerganov merged commit a8777ad into ggml-org:master Oct 6, 2023
ggerganov added a commit that referenced this pull request Oct 6, 2023
joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 6, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp:
  kv cache slot search improvements (ggml-org#3493)
  prompts : fix editorconfig checks after ggml-org#3416
  parallel : add option to load external prompt file (ggml-org#3416)
  server : reuse llama_sample_token common util (ggml-org#3494)
  llama : correct hparams comparison (ggml-org#3446)
  ci : fix xcodebuild destinations (ggml-org#3491)
  convert : update Falcon script for new HF config (ggml-org#3448)
  build : use std::make_tuple() for compatibility with older GCC versions (ggml-org#3488)
  common : process escape sequences in reverse prompts (ggml-org#3461)
  CLBlast: Fix handling of on-device tensor data
  server : fix incorrect num_tokens_predicted (ggml-org#3480)
  swift : disable ACCELERATE_NEW_LAPACK (ggml-org#3481)
  ci : add swift build via xcodebuild (ggml-org#3482)
@pudepiedj
Contributor Author

pudepiedj commented Oct 6, 2023 via email

yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
* Enable external file and add datestamp

* Add name of external file at end

* Upload ToK2024

* Delete ToK2024.txt

* Experiments with jeopardy

* Move ParallelQuestions to /proimpts and rename

* Interim commit

* Interim commit

* Final revision

* Remove trailing whitespace

* remove cmake_all.sh

* Remove cmake_all.sh

* Changed .gitignore

* Improved reporting and new question files.

* Corrected typo

* More LLM questions

* Update LLM-questions.txt

* Yet more LLM-questions

* Remove jeopardy results file

* Reinstate original jeopardy.sh

* Update examples/parallel/parallel.cpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>
yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023