Add mlperf benchmark scripts in-tree. #148

qihqi · 2024-07-13T00:43:28Z

TODOS:

Try llama 70b.
Modify backend.py to use asyncio instead of threadpool.
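
For the second TODO, here is a minimal sketch of what replacing a thread pool with asyncio might look like. The contents of backend.py are not shown in this PR description, so every name below (`fake_decode`, `process_queries`, `run_batch`, `max_in_flight`) is hypothetical and only illustrates the pattern.

```python
import asyncio


async def fake_decode(prompt: str) -> str:
    # Hypothetical stand-in for a request to the inference engine.
    await asyncio.sleep(0.01)
    return prompt.upper()


async def process_queries(prompts, max_in_flight=4):
    # The semaphore bounds concurrency, the role max_workers plays
    # in a thread pool.
    sem = asyncio.Semaphore(max_in_flight)

    async def run_one(prompt):
        async with sem:
            return await fake_decode(prompt)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(p) for p in prompts))


def run_batch(prompts):
    # Synchronous entry point, e.g. for a callback that is not async.
    return asyncio.run(process_queries(prompts))
```

Whether this is a good fit depends on whether the engine calls in backend.py can be awaited; blocking calls would still need `loop.run_in_executor` or similar.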

===
All Python files are copied from https://github.com/tpu-inference/inference_mlperf4.1/tree/mixtral_loadgen/language/mixtral-8x7b
and the shell scripts are adapted from https://docs.google.com/document/d/112oYiFB_hkbb0_Kfcm0iDaDVcsq3JwLYprc_1zh_FCo/edit?tab=t.0

With the following changes:

Don't install mlperf loadgen from source; install it from pip instead.
Install scripts and run scripts respect the active virtualenv.
Remove the 'warmup' feature from the loadgen scripts and do the warmup in a separate script instead.
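
As a rough illustration of the first two changes, the sketch below shows one way a script could detect an active virtualenv and build a pip command that installs into it. It does not reproduce the shell scripts in this PR. The helper names are hypothetical, and the package name `mlcommons-loadgen` is an assumption about the PyPI distribution, not something stated in this PR.

```python
import sys


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base interpreter.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def pip_install_command(package: str) -> list:
    # Invoking pip through the current interpreter makes the install
    # land in whichever environment is active.
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    # Assumed package name; check the name on PyPI before relying on it.
    cmd = pip_install_command("mlcommons-loadgen")
    print("virtualenv active:", in_virtualenv())
    print("would run:", " ".join(cmd))
```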

* Almost working except mask, need to rebase to main to pick up the ring buffer support then fix the mask. Int8 updates also included but not tested.
* Fixed the test_model_impl for llama, but test_llama_e2e is still failing.
* Adds lazy_cache_update and restructure the cache flags.
* Disable all the prints. Fix create engine.
* Fix typos and minor errors.
* Fixes create engine.
* Adds new_cache_stacked and fixes cache update.
* Fix cache update when new_cache_stacked is False.
* Fix the cache manager and make unit tests pass except for 1.
* Updates the exportable model to return cache.
* Removed the fori loop in cache finalize. Moves the cache.finalize() to the end of existing cache attention.
* Try to use shard_map for cache update.
* Fix update single cache line in cache.finalize()
* Adds int8 support.
* Int8 left aligned lazy cache update working, performance still not good enough.
* Fix the stacked cache introduced in the previous couple of commits.
* Put original ragged attention back.
* Add the original ragged attention kernel.
* Fixes the bf16/int8 cache stack.
* Fix int8 stacked cache insertion in engine and finalization.
* Fixes int8 with lazy cache update.
* Updates the int8 test.
* Fix the int8 ragged attention output sharding.
* Fix group query attention broadcasting issue.
* Fix shard map input issue. Variables not listed as inputs are frozen into the jit function.
* Fix the flash attention mask shape; Fix the update single cache line quant version
* Adds the kv cache test.
* Replace quantized cache "pos" with "input_pos" to align with bf16 cache. Fix the kv cache quantization test.
* Fix prefill cache insertion issue for stacked cache; Changes reduce dim for quantization from 1,3 to -3,-1 to make it more robust;
* Adds lazy cache update with generate cache stacked new cache unstacked for performance validation.
* Fix the shard map sharding for stacked generate cache and unstacked new cache.
* Use the JAX API for slicing instead of PyTorch index slicing.
* Adds stacked cache support in ragged attention reference kernel.
* Adds stacked cache support for the modified ragged kernel.
* Llama2 70b int8 optimization done. Output not correct yet.
* Remove testing temp output files.
* Fix the llama 70b output accuracy resulting from gqa.
* Fixes the attention output slicing issue when not using flash attention. Refactor to use only 1 flash attention kernel. Changes the modified ring buffer ragged attention kernel with quantization, layer, etc.
* Fix the pallas kernel OOB issue
* Fix tests; Fix lint issues;
* Fix the interactive script.
* Add mlperf benchmark scripts in-tree. (#148)
* Fix lint errors.
* Fix errors.
* Fix the comments.
* Fix based on comments; Fix all the unit tests.
* Fix the remaining pylint errors.
* Default ring buffer back to true so that all the test_run_server and run_interactive in CPU mode can work. When we default ring buffer to false, should add additional flags to run_interactive CI to set test mode to true so that pallas kernel can run.
* Fix all the lint errors.
* Fix run_offline script.
* Fix lint errors.

---------

Co-authored-by: qihqi <[email protected]>

qihqi requested a review from sixiang-google July 13, 2024 00:45

qihqi force-pushed the hanq_mlperf branch from 2b63ef3 to ecb423a Compare July 15, 2024 18:08

Add mlperf benchmark scripts in-tree.

f05f03a

qihqi force-pushed the hanq_mlperf branch from ecb423a to f05f03a Compare July 15, 2024 18:45

FanhaiLu1 approved these changes Jul 15, 2024

qihqi merged commit b2e5106 into main Jul 15, 2024
4 checks passed

wang2yn84 pushed a commit that referenced this pull request Jul 18, 2024

Add mlperf benchmark scripts in-tree. (#148)

4da0bbf

wang2yn84 pushed a commit that referenced this pull request Jul 20, 2024

Add mlperf benchmark scripts in-tree. (#148)

dc0921e
