PreStop command injection for inference engines #288

@kerthcet

Description

What would you like to be added:

Why is this needed:

The inference engine batches requests for processing and maintains an internal queue for best-effort scheduling. This means that when we kill the Pod, there may still be requests sitting in the queue. We need to inject a preStop command that detects when no requests remain before allowing termination; it looks like:

    while true; do
      # Scrape vLLM's Prometheus metrics, drop the '# HELP' / '# TYPE' comment
      # lines, and keep only the gauge value (the second whitespace-separated field).
      RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
      WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
      if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
        # Write to the main container process's stdout so the message shows up in pod logs.
        echo "Terminating: no active or waiting requests, safe to terminate" >> /proc/1/fd/1
        exit 0
      else
        echo "Terminating: running: $RUNNING, waiting: $WAITING" >> /proc/1/fd/1
        sleep 5
      fi
    done
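As a quick sanity check on the parsing above, the same grep/awk pipeline can be run against a canned sample of vLLM's /metrics output instead of a live curl. The metric names match vLLM's real gauges; the label and values below are illustrative:

```shell
# Illustrative sample of vLLM's Prometheus exposition output.
METRICS='# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="m"} 0.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="m"} 0.0'

# Same extraction as the preStop loop, with curl replaced by the sample text.
RUNNING=$(printf '%s\n' "$METRICS" | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
WAITING=$(printf '%s\n' "$METRICS" | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')

if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
  echo "drained"
else
  echo "busy: running=$RUNNING waiting=$WAITING"
fi
```

Note that the `grep -v '#'` step is what strips the `# HELP` and `# TYPE` comment lines, so only the gauge sample line survives for awk to split.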

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Metadata

Assignees

No one assigned

    Labels

    feature: Categorizes issue or PR as related to a new feature.
    needs-kind: Indicates a PR lacks a label and requires one.
    needs-priority: Indicates a PR lacks a label and requires one.
    needs-triage: Indicates an issue or PR lacks a label and requires one.
