PreStop command injection for inference engines #288

@kerthcet

Description

What would you like to be added:

Why is this needed:

The inference engine batches requests for processing and maintains an internal queue for best-effort scheduling. This means that when we kill the Pod, there may still be requests sitting in the queue. We need to inject a preStop command that detects when no requests remain before allowing termination; it looks like:

    while true; do
      # Scrape vLLM's Prometheus metrics, drop the '# HELP' / '# TYPE' comment
      # lines, and keep only the gauge value (the second whitespace-separated field).
      RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
      WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
      if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
        # Write to the main container process's stdout so the message shows up in pod logs.
        echo "Terminating: no active or waiting requests, safe to terminate" >> /proc/1/fd/1
        exit 0
      else
        echo "Terminating: running: $RUNNING, waiting: $WAITING" >> /proc/1/fd/1
        sleep 5
      fi
    done
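As a quick sanity check on the parsing above, the same grep/awk pipeline can be run against a canned sample of vLLM's /metrics output instead of a live curl. The metric names match vLLM's real gauges; the label and values below are illustrative:

```shell
# Illustrative sample of vLLM's Prometheus exposition output.
METRICS='# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="m"} 0.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="m"} 0.0'

# Same extraction as the preStop loop, with curl replaced by the sample text.
RUNNING=$(printf '%s\n' "$METRICS" | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
WAITING=$(printf '%s\n' "$METRICS" | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')

if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
  echo "drained"
else
  echo "busy: running=$RUNNING waiting=$WAITING"
fi
```

Note that the `grep -v '#'` step is what strips the `# HELP` and `# TYPE` comment lines, so only the gauge sample line survives for awk to split.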

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Metadata

Assignees

No one assigned

    Labels

    feature: Categorizes issue or PR as related to a new feature.
    needs-kind: Indicates a PR lacks a label and requires one.
    needs-priority: Indicates a PR lacks a label and requires one.
    needs-triage: Indicates an issue or PR lacks a label and requires one.
