Fix TPU head resource name for v4 and v5e #165

richardsliu · 2024-08-01T21:19:43Z

No description provided.

allenwang28 · 2024-08-01T21:22:12Z

run_ray_serve_interleave.py

@@ -40,7 +40,11 @@


 def create_head_resource_name(generation, tpu_chips):
-  return f"TPU-{generation}-{tpu_chips}-head"
+  if generation == 'v5litepod':


maybe just check if lite is in the name?

actually perhaps it doesn't matter that much for now
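
For reference, the substring check suggested above could look roughly like the sketch below. The diff shown here is cut off after the `if` line, so both branch bodies are assumptions rather than the code that was merged. They rest on the Cloud TPU naming convention in which v5e accelerator types (`v5litepod-N`) count chips while v4 types (`v4-N`) count TensorCores, two per chip.

```python
def create_head_resource_name(generation: str, tpu_chips: int) -> str:
  """Build the Ray resource name that marks the head worker of a TPU slice."""
  if "lite" in generation:
    # Assumption: "lite" generations (e.g. v5litepod) are named by chip count.
    count = tpu_chips
  else:
    # Assumption: other generations (e.g. v4) are named by TensorCore count,
    # with two TensorCores per chip.
    count = tpu_chips * 2
  return f"TPU-{generation}-{count}-head"


print(create_head_resource_name("v5litepod", 8))  # TPU-v5litepod-8-head
print(create_head_resource_name("v4", 4))         # TPU-v4-8-head
```

Matching on the substring rather than the exact string `'v5litepod'` would also cover any other generation whose name contains "lite", which is the trade-off the second comment above weighs.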

FanhaiLu1

Thanks for the changes!

* Fix TPU head resource name for v4 and v5e
* fix format

* Almost working except mask, need to rebase to main to pick up the ring buffer support then fix the mask. Int8 updates also included but not tested.
* Fixed the test_model_impl for llama, but test_llama_e2e is still failing.
* Adds lazy_cache_update and restructure the cache flags.
* Disable all the prints. Fix create engine.
* Fix typos and minor errors.
* Fixes create engine.
* Adds new_cache_stacked and fixes cache update.
* Fix cache update when new_cach_stacked is False.
* Fix the cache manager and make unit tests pass except for 1.
* Updates the exportable model to return cache.
* Removed the fori loop in cache finalize. Moves the cache.finalize() to the end of existing cache attention.
* Try to use shard_map for cache update.
* Fix update single cache line in cache.finalize()
* Adds int8 support.
* Int8 left aligned lazy cache update working, performance still not good enough.
* Fix the stacked cache introduced in the previous couple of commits.
* Put original ragged attention back.
* Add the original ragged attention kernel.
* Fixes the bf16/int8 cache stack.
* Fix int8 stacked cache insertion in engine and finalization.
* Fixes int8 with lazy cache update.
* Updates the int8 test.
* Fix the int8 ragged attention output sharding.
* Fix group query attention broadcasting issue.
* Fix shard map input issue. Variables not listed as inputs are frozen into the jit function.
* Fix the flash attention mask shape; Fix the update single cache line quant version
* Adds the kv cache test.
* Replace quantized cache "pos" with "input_pos" to align with bf16 cache. Fix the kv cache quantization test.
* Fix prefill cache insertion issue for stacked cache; Changes reduce dim for quantization from 1,3 to -3,-1 to make it more robust;
* Adds lazy cache update with generate cache stacked new cache unstacked for performance validation.
* Fix the shard map sharding for stacked generate cache and unstacked new cache.
* Use the JAX API for slicing instead of PyTorch index slicing.
* Adds stacked cache support in ragged attention reference kernel.
* Adds stacked cache support for the modified ragged kernel.
* Llama2 70b int8 optimization done. Output not correct yet.
* Remove testing temp output files.
* Fix the llama 70b output accuracy resulting from gqa.
* Fixes the attention output slicing issue when not using flash attention. Refactor to use only 1 flash attention kernel. Changes the modified ring buffer ragged attention kernel with quantization, layer, etc.
* Fix the pallas kernel OOB issue
* Fix tests; Fix lint issues;
* Fix the interactive script.
* Fix lint errors.
* Fix errors.
* Fix the comments.
* Fix based on comments; Fix all the unit tests.
* Fix the remaining pylint errors.
* Default ring buffer back to true so that all the test_run_server and run_interactive in CPU mode can work. When we default ring buffer to false, should add additional flags to run_interactive CI to set test mode to true so that pallas kernel can run.
* Fix all the lint errors.
* Fix run_offline script.
* Fix the ring buffer mode long latency issue.
* Rebase to main.
* Fix all the lint issues.
* Fix Ray engine crash on multihost (#164)
* Fix TPU head resource name for v4 and v5e (#165)
* Fix TPU head resource name for v4 and v5e
* fix format
* Fixed exhausted bug between head and workers (#163)
* add xla2 fix
* update jax version
* revert jax TPU version
* Fix test_run_server issue from fixing the lint; Fix run_interactive from merge; Fix lints;
* Revert xla changes.

---------

Co-authored-by: Richard Liu <[email protected]>
Co-authored-by: Fanhai Lu <[email protected]>

Fix TPU head resource name for v4 and v5e

6d08b8a

allenwang28 reviewed Aug 1, 2024


fix format

d4709fc

allenwang28 approved these changes Aug 1, 2024


richardsliu merged commit f494514 into AI-Hypercomputer:main Aug 1, 2024
4 checks passed

FanhaiLu1 reviewed Aug 1, 2024


wang2yn84 pushed a commit that referenced this pull request Aug 6, 2024

Fix TPU head resource name for v4 and v5e (#165)

743c0e5

* Fix TPU head resource name for v4 and v5e
* fix format
