Labels
- feature: Categorizes issue or PR as related to a new feature.
- needs-kind: Indicates a PR lacks a label and requires one.
- needs-priority: Indicates a PR lacks a label and requires one.
- needs-triage: Indicates an issue or PR lacks a label and requires one.
Description
What would you like to be added:
Why is this needed:
The inference engine batches requests for processing and maintains an internal queue for best-effort scheduling. As a result, when a pod is killed, some requests may still be sitting in that queue. We need to inject a preStop command that waits until no requests are running or waiting before the container terminates. It could look like this:
# Poll the vLLM metrics endpoint until no requests are running or waiting.
while true; do
  # Fetch the metrics once per iteration so both gauges come from the same snapshot.
  METRICS=$(curl -s http://localhost:8000/metrics)
  RUNNING=$(echo "$METRICS" | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
  WAITING=$(echo "$METRICS" | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
  if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
    # Write to PID 1's stdout so the message shows up in the container logs.
    echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
    exit 0
  else
    echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
    sleep 5
  fi
done
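The loop above compares each gauge against the literal string "0.0". That can spin until the pod's `terminationGracePeriodSeconds` expires if the metrics endpoint is unreachable, or if a gauge is reported once per label set and the extracted value spans several lines. A slightly more defensive variant might look like the sketch below. The function names `sum_metric`, `is_drained`, and `wait_for_drain`, the `METRICS_URL` variable, and the 300-second default are all hypothetical and not part of the proposal. It also assumes metric lines carry no trailing timestamp, so the last field is the value.

```shell
#!/bin/sh
# Sum every sample of one Prometheus metric read from stdin.
# Prints nothing when the metric is absent, so callers can tell "missing" from "zero".
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                         # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)            # drop the {label="..."} suffix, if any
      if (metric == name) { total += $NF; seen = 1 }
    }
    END { if (seen) printf "%d\n", total }
  '
}

# Succeed only when both gauges are present and sum to zero.
is_drained() {
  body=$1
  running=$(printf '%s\n' "$body" | sum_metric 'vllm:num_requests_running')
  waiting=$(printf '%s\n' "$body" | sum_metric 'vllm:num_requests_waiting')
  [ -n "$running" ] && [ -n "$waiting" ] && [ "$running" -eq 0 ] && [ "$waiting" -eq 0 ]
}

# Poll until drained or until the deadline passes (hypothetical default: 300 seconds).
wait_for_drain() {
  deadline=$(( $(date +%s) + ${1:-300} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # METRICS_URL is an assumed override; a failed fetch is treated as "not drained".
    metrics=$(curl -s --max-time 2 "${METRICS_URL:-http://localhost:8000/metrics}") || metrics=""
    if is_drained "$metrics"; then
      return 0
    fi
    sleep 5
  done
  return 1
}
```

A preStop hook could end with `wait_for_drain 300`, choosing a timeout somewhat below the pod's `terminationGracePeriodSeconds` so the hook returns before the kubelet force-kills the container.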
Completion requirements:
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.