- Remove `mediums.py`. Move `GPULoadStoreSpec` to `abstract.py` and `CPULoadStoreSpec` to `cpu/spec.py`.
- Remove `kv_offload/spec.py`. Move its contents to `abstract.py`.
- Rename `abstract.py` to `base.py` (likewise in `cpu/policies`).
- Move `reuse_manager.py` into `cpu/manager.py`.
- Add a `request_finished` method to `OffloadingManager`.
- Remove `kv_offload/worker/worker.py`. Replace `OffloadingHandler` with an `OffloadingWorker` in `abstract.py`, with abstract functions `submit_store` and `submit_load` (replacing `transfer_async`).
- Remove `GPULoadStoreSpec.group_sizes`; instead, have a `LoadStoreSpec` per group.
- Replace `GPULoadStore.block_indices` with an offloaded-side `LoadStoreSpec.set_gpu_block_offset(...)`.
- Create `OffloadPolicy(ABC).get_blocks_to_store`. Move the existing `_get_reqs_to_store` to `StoreOnComputePolicy(OffloadPolicy)`. Rename `RequestOffloadState` to `RequestKVState`, and move `next_stored_block_idx` to `StoreOnComputePolicy`.
- Make the offloading connector prefer the HND layout (`get_required_kvcache_layout`).
- Support `reset_cache`.
- Move code from `wait_for_save` to `get_finished` (`wait_for_save` may be skipped in some cases -> potential bug).
- Kick off offload transfers in `request_finished`.
- Support a CPU usage metric: the current percentage of CPU blocks being stored/loaded.
- Move all hardware-agnostic parts of `cpu/gpu_worker.py` to `cpu/worker.py`, including creating abstractions for the hardware-specific parts (such as `swap_blocks_batch` and tensor allocation/pinning).
- Add a request ID to `RequestContext`, to allow handling of a per-request infinite loop where `get_num_new_matched_tokens` keeps returning `None`.
- In `OffloadingConnectorScheduler.build_connector_meta`, if there are no active requests (`self._req_status` is empty), flush all pending jobs so they complete on this engine step (which may be the last until new requests arrive).
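The `OffloadingWorker` / `request_finished` items above could be sketched as minimal ABCs. This is only an illustration of the intended interfaces: argument names such as `job_id`, the `bool` return convention, and the empty `LoadStoreSpec` base are assumptions, not the actual signatures.

```python
from abc import ABC, abstractmethod


class LoadStoreSpec(ABC):
    """Describes one side (source or destination) of a KV-cache transfer."""


class OffloadingWorker(ABC):
    """Replaces OffloadingHandler; submit_store/submit_load replace transfer_async."""

    @abstractmethod
    def submit_store(self, job_id: int, src: LoadStoreSpec,
                     dst: LoadStoreSpec) -> bool:
        """Kick off an async GPU -> offload-medium copy; False on failure."""

    @abstractmethod
    def submit_load(self, job_id: int, src: LoadStoreSpec,
                    dst: LoadStoreSpec) -> bool:
        """Kick off an async offload-medium -> GPU copy; False on failure."""


class OffloadingManager(ABC):
    @abstractmethod
    def request_finished(self, request_id: str) -> None:
        """Release per-request tracking state once the request completes."""
```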
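The policy split (`OffloadPolicy` / `StoreOnComputePolicy`, with `next_stored_block_idx` moved into the policy) might look like the following sketch. The fields of `RequestKVState` (a request ID and a list of block hashes) are assumptions made so the example is self-contained.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class RequestKVState:
    """Per-request KV bookkeeping (renamed from RequestOffloadState)."""
    request_id: str
    block_hashes: list = field(default_factory=list)


class OffloadPolicy(ABC):
    """Decides which GPU blocks should be written to the offload medium."""

    @abstractmethod
    def get_blocks_to_store(self, state: RequestKVState) -> list:
        ...


class StoreOnComputePolicy(OffloadPolicy):
    """Stores blocks as soon as they are computed."""

    def __init__(self) -> None:
        # next_stored_block_idx lives in the policy now, keyed per request.
        self._next_stored_block_idx: dict = {}

    def get_blocks_to_store(self, state: RequestKVState) -> list:
        start = self._next_stored_block_idx.get(state.request_id, 0)
        new_blocks = state.block_hashes[start:]
        self._next_stored_block_idx[state.request_id] = len(state.block_hashes)
        return new_blocks
```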
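The CPU usage metric reduces to a ratio of in-use blocks to total blocks. A minimal sketch, where the function name and its inputs are hypothetical:

```python
def cpu_usage_percent(num_used_blocks: int, num_total_blocks: int) -> float:
    """Percentage of CPU cache blocks currently being stored/loaded."""
    if num_total_blocks == 0:
        return 0.0  # avoid division by zero before the cache is allocated
    return 100.0 * num_used_blocks / num_total_blocks
```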
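The flush-on-idle behavior described for `build_connector_meta` could be sketched as below; the job representation and the non-idle dispatch path are assumptions and are elided here.

```python
class OffloadingConnectorScheduler:
    """Sketch of the flush-on-idle behavior only."""

    def __init__(self) -> None:
        self._req_status: dict = {}    # request_id -> per-request state
        self._pending_jobs: list = []  # queued store/load jobs

    def build_connector_meta(self) -> list:
        if not self._req_status:
            # No active requests: this engine step may be the last until new
            # requests arrive, so flush every pending job now.
            jobs, self._pending_jobs = self._pending_jobs, []
            return jobs
        # Otherwise jobs go through the normal per-step dispatch path
        # (not shown in this sketch).
        return []
```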
Currently supported features:
Upcoming:
Backlog:
Tactical tasks: