[RFC]: KV Offloading Roadmap #33689

@orozery

Description

Currently supported features:

  • Pluggable out-of-tree offloading backend
  • CPU-GPU Offloading (NVIDIA+AMD)
  • Custom offloading block size
  • Fully asynchronous (cross-engine-steps) offloading / loading
  • Immediate Offloading (as opposed to spill-over)
  • LRU Eviction
  • Cross layer blocks
  • Metrics
  • Onloading preempted requests
  • KV Events
  • Kernel-block-size support
  • Abstract eviction strategy (LRU, ARC, etc.)
  • HMA Support
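
The "LRU Eviction" and "Abstract eviction strategy" items above suggest a policy interface with interchangeable strategies. A minimal sketch of what that could look like — the class and method names here are hypothetical illustrations, not the actual vLLM API:

```python
from abc import ABC, abstractmethod
from collections import OrderedDict


class EvictionPolicy(ABC):
    """Hypothetical abstraction over eviction strategies (LRU, ARC, ...)."""

    @abstractmethod
    def touch(self, block_id: int) -> None:
        """Record that a block was accessed."""

    @abstractmethod
    def evict(self) -> int:
        """Choose and return the block to evict next."""


class LRUPolicy(EvictionPolicy):
    """Least-recently-used: evict the block accessed longest ago."""

    def __init__(self) -> None:
        self._order: OrderedDict[int, None] = OrderedDict()

    def touch(self, block_id: int) -> None:
        # Insert (or move) the block at the most-recently-used end.
        self._order[block_id] = None
        self._order.move_to_end(block_id)

    def evict(self) -> int:
        # Pop from the least-recently-used end.
        block_id, _ = self._order.popitem(last=False)
        return block_id
```

An ARC variant would implement the same two methods, which is what makes the strategy swappable behind the manager.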

Upcoming:

  • Documentation
  • Extend cross layers to hybrid models
  • Support tiering (e.g. CPU -> Pluggable backend)
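
The tiering item could amount to chaining backends so that a store that does not fit in one tier spills to the next, and loads search tiers in order. A rough sketch under invented names (the real pluggable-backend interface may differ):

```python
from typing import Optional


class TierBackend:
    """Hypothetical single-tier KV block store with bounded capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._blocks: dict[int, bytes] = {}

    def put(self, block_id: int, data: bytes) -> bool:
        if len(self._blocks) >= self.capacity:
            return False  # full: caller spills to the next tier
        self._blocks[block_id] = data
        return True

    def get(self, block_id: int) -> Optional[bytes]:
        return self._blocks.get(block_id)


class TieredBackend:
    """Chain tiers (e.g. CPU -> pluggable backend): store in the first
    tier with room, look blocks up in tier order."""

    def __init__(self, tiers: list[TierBackend]) -> None:
        self.tiers = tiers

    def put(self, block_id: int, data: bytes) -> bool:
        # any() short-circuits, so later tiers are tried only on spill.
        return any(t.put(block_id, data) for t in self.tiers)

    def get(self, block_id: int) -> Optional[bytes]:
        for t in self.tiers:
            data = t.get(block_id)
            if data is not None:
                return data
        return None
```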

Backlog:

Tactical tasks:

  • Move worker/cpu_gpu.py to cpu/gpu_worker.py
  • Remove mediums.py. Move GPULoadStoreSpec to abstract.py, CPULoadStoreSpec to cpu/spec.py.
  • Remove kv_offload/spec.py. Move to abstract.py.
  • Rename abstract.py to base.py (as well as in cpu/policies).
  • Move reuse_manager.py into cpu/manager.py.
  • Add request_finished method to OffloadingManager.
  • Remove kv_offload/worker/worker.py. In abstract.py, replace OffloadingHandler with OffloadingWorker, exposing abstract submit_store and submit_load methods (replacing transfer_async).
  • Remove GPULoadStoreSpec.group_sizes, and instead have a LoadStoreSpec per group.
  • Replace GPULoadStore.block_indices with an offloaded-side LoadStoreSpec.set_gpu_block_offset(...)
  • Create OffloadPolicy(ABC).get_blocks_to_store. Move existing _get_reqs_to_store to StoreOnComputePolicy(OffloadPolicy). Rename RequestOffloadState to RequestKVState, and move next_stored_block_idx to StoreOnComputePolicy.
  • Make the offloading connector prefer HND layout. (get_required_kvcache_layout)
  • Support reset_cache.
  • Move code from wait_for_save to get_finished (wait_for_save may be skipped in some cases -> potential bug).
  • Kick off offload transfers in request_finished.
  • Add a CPU usage metric (percentage of CPU blocks currently being stored/loaded).
  • Move all hardware-agnostic parts of cpu/gpu_worker.py to cpu/worker.py, including creating abstractions for hardware specific parts (like swap_blocks_batch and tensor allocation/pinning).
  • Add request ID to RequestContext, to allow handling an infinite per-request loop of get_num_new_matched_tokens returning None.
  • In OffloadingConnectorScheduler.build_connector_meta, if there are no active requests (self._req_status is empty), flush all pending jobs so they complete on this engine step (which may be the last until new requests arrive).
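
Several tactical tasks above propose new abstractions (OffloadingWorker with submit_store/submit_load, OffloadPolicy with a StoreOnComputePolicy holding next_stored_block_idx). A rough sketch of how they might fit together — all signatures here are assumptions inferred from the task descriptions, not actual code:

```python
from abc import ABC, abstractmethod


class LoadStoreSpec(ABC):
    """Describes where a set of KV blocks lives on one side of a transfer."""


class OffloadingWorker(ABC):
    """Proposed replacement for OffloadingHandler: transfers split into
    explicit store/load submissions instead of a single transfer_async."""

    @abstractmethod
    def submit_store(self, src: LoadStoreSpec, dst: LoadStoreSpec) -> int:
        """Kick off an async GPU -> offload-medium copy; return a job id."""

    @abstractmethod
    def submit_load(self, src: LoadStoreSpec, dst: LoadStoreSpec) -> int:
        """Kick off an async offload-medium -> GPU copy; return a job id."""


class OffloadPolicy(ABC):
    """Decides which computed blocks are worth storing."""

    @abstractmethod
    def get_blocks_to_store(
        self, request_id: str, block_ids: list[int]
    ) -> list[int]:
        """Return the subset of a request's blocks to offload now."""


class StoreOnComputePolicy(OffloadPolicy):
    """Store every newly computed block immediately (per the task list,
    the existing _get_reqs_to_store behaviour moves here)."""

    def __init__(self) -> None:
        # Per-request index of the next block not yet stored
        # (next_stored_block_idx, relocated per the task list).
        self._next_stored_block_idx: dict[str, int] = {}

    def get_blocks_to_store(
        self, request_id: str, block_ids: list[int]
    ) -> list[int]:
        start = self._next_stored_block_idx.get(request_id, 0)
        self._next_stored_block_idx[request_id] = len(block_ids)
        return block_ids[start:]
```

Other policies (e.g. store-on-evict) would subclass OffloadPolicy with a different get_blocks_to_store, leaving the connector code unchanged.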
