[RFC]: KV Offloading Roadmap #33689

@orozery

Description

Currently supported features:

  • Pluggable out-of-tree offloading backend
  • CPU-GPU Offloading (NVIDIA+AMD)
  • Custom offloading block size
  • Fully asynchronous (cross-engine-steps) offloading / loading
  • Immediate Offloading (as opposed to spill-over)
  • LRU Eviction
  • Cross layer blocks
  • Metrics
  • Onloading preempted requests
  • KV Events
  • Kernel-block-size support
  • Abstract eviction strategy (LRU, ARC, etc.)
  • HMA Support
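
The "LRU Eviction" and "Abstract eviction strategy" items above suggest a policy interface with interchangeable strategies. A minimal sketch of what that could look like — the class and method names here are hypothetical illustrations, not the actual vLLM API:

```python
from abc import ABC, abstractmethod
from collections import OrderedDict


class EvictionPolicy(ABC):
    """Hypothetical abstraction over eviction strategies (LRU, ARC, ...)."""

    @abstractmethod
    def touch(self, block_id: int) -> None:
        """Record that a block was accessed."""

    @abstractmethod
    def evict(self) -> int:
        """Choose and return the block to evict next."""


class LRUPolicy(EvictionPolicy):
    """Least-recently-used: evict the block accessed longest ago."""

    def __init__(self) -> None:
        self._order: OrderedDict[int, None] = OrderedDict()

    def touch(self, block_id: int) -> None:
        # Insert (or move) the block at the most-recently-used end.
        self._order[block_id] = None
        self._order.move_to_end(block_id)

    def evict(self) -> int:
        # Pop from the least-recently-used end.
        block_id, _ = self._order.popitem(last=False)
        return block_id
```

An ARC variant would implement the same two methods, which is what makes the strategy swappable behind the manager.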

Upcoming:

  • Documentation
  • Extend cross layers to hybrid models
  • Support tiering (e.g. CPU -> Pluggable backend)
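
The tiering item could amount to chaining backends so that a store that does not fit in one tier spills to the next, and loads search tiers in order. A rough sketch under invented names (the real pluggable-backend interface may differ):

```python
from typing import Optional


class TierBackend:
    """Hypothetical single-tier KV block store with bounded capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._blocks: dict[int, bytes] = {}

    def put(self, block_id: int, data: bytes) -> bool:
        if len(self._blocks) >= self.capacity:
            return False  # full: caller spills to the next tier
        self._blocks[block_id] = data
        return True

    def get(self, block_id: int) -> Optional[bytes]:
        return self._blocks.get(block_id)


class TieredBackend:
    """Chain tiers (e.g. CPU -> pluggable backend): store in the first
    tier with room, look blocks up in tier order."""

    def __init__(self, tiers: list[TierBackend]) -> None:
        self.tiers = tiers

    def put(self, block_id: int, data: bytes) -> bool:
        # any() short-circuits, so later tiers are tried only on spill.
        return any(t.put(block_id, data) for t in self.tiers)

    def get(self, block_id: int) -> Optional[bytes]:
        for t in self.tiers:
            data = t.get(block_id)
            if data is not None:
                return data
        return None
```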

Backlog:

Tactical tasks:

  • Move worker/cpu_gpu.py to cpu/gpu_worker.py
  • Remove mediums.py. Move GPULoadStoreSpec to abstract.py, CPULoadStoreSpec to cpu/spec.py.
  • Remove kv_offload/spec.py. Move to abstract.py.
  • Rename abstract.py to base.py (as well as in cpu/policies).
  • Move reuse_manager.py into cpu/manager.py.
  • Add request_finished method to OffloadingManager.
  • Remove kv_offload/worker/worker.py. In abstract.py, replace OffloadingHandler with OffloadingWorker, exposing abstract submit_store and submit_load methods (replacing transfer_async).
  • Remove GPULoadStoreSpec.group_sizes, and instead have a LoadStoreSpec per group.
  • Replace GPULoadStore.block_indices with an offloaded-side LoadStoreSpec.set_gpu_block_offset(...)
  • Create OffloadPolicy(ABC).get_blocks_to_store. Move existing _get_reqs_to_store to StoreOnComputePolicy(OffloadPolicy). Rename RequestOffloadState to RequestKVState, and move next_stored_block_idx to StoreOnComputePolicy.
  • Make the offloading connector prefer HND layout. (get_required_kvcache_layout)
  • Support reset_cache.
  • Move code from wait_for_save to get_finished (wait_for_save may be skipped in some cases -> potential bug).
  • Kick off offload transfers in request_finished.
  • Add a CPU usage metric (percentage of CPU blocks currently being stored/loaded).
  • Move all hardware-agnostic parts of cpu/gpu_worker.py to cpu/worker.py, including creating abstractions for hardware specific parts (like swap_blocks_batch and tensor allocation/pinning).
  • Add request ID to RequestContext, to allow handling an infinite per-request loop of get_num_new_matched_tokens returning None.
  • In OffloadingConnectorScheduler.build_connector_meta, if there are no active requests (self._req_status is empty), flush all pending jobs so they complete on this engine step (which may be the last until new requests arrive).
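
Several tactical tasks above propose new abstractions (OffloadingWorker with submit_store/submit_load, OffloadPolicy with a StoreOnComputePolicy holding next_stored_block_idx). A rough sketch of how they might fit together — all signatures here are assumptions inferred from the task descriptions, not actual code:

```python
from abc import ABC, abstractmethod


class LoadStoreSpec(ABC):
    """Describes where a set of KV blocks lives on one side of a transfer."""


class OffloadingWorker(ABC):
    """Proposed replacement for OffloadingHandler: transfers split into
    explicit store/load submissions instead of a single transfer_async."""

    @abstractmethod
    def submit_store(self, src: LoadStoreSpec, dst: LoadStoreSpec) -> int:
        """Kick off an async GPU -> offload-medium copy; return a job id."""

    @abstractmethod
    def submit_load(self, src: LoadStoreSpec, dst: LoadStoreSpec) -> int:
        """Kick off an async offload-medium -> GPU copy; return a job id."""


class OffloadPolicy(ABC):
    """Decides which computed blocks are worth storing."""

    @abstractmethod
    def get_blocks_to_store(
        self, request_id: str, block_ids: list[int]
    ) -> list[int]:
        """Return the subset of a request's blocks to offload now."""


class StoreOnComputePolicy(OffloadPolicy):
    """Store every newly computed block immediately (per the task list,
    the existing _get_reqs_to_store behaviour moves here)."""

    def __init__(self) -> None:
        # Per-request index of the next block not yet stored
        # (next_stored_block_idx, relocated per the task list).
        self._next_stored_block_idx: dict[str, int] = {}

    def get_blocks_to_store(
        self, request_id: str, block_ids: list[int]
    ) -> list[int]:
        start = self._next_stored_block_idx.get(request_id, 0)
        self._next_stored_block_idx[request_id] = len(block_ids)
        return block_ids[start:]
```

Other policies (e.g. store-on-evict) would subclass OffloadPolicy with a different get_blocks_to_store, leaving the connector code unchanged.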
