Cost-based autoscaler #18819

Fly-Style · 2025-12-05T22:42:07Z

Cost-Based Autoscaler for Seekable Stream Supervisors

Overview

Implements a cost-based autoscaling algorithm for seekable stream supervisor tasks that optimizes task count by balancing lag reduction against resource efficiency.

Algorithm Design

Cost Function

The autoscaler uses a weighted cost function to evaluate different task count configurations:

cost = lagWeight × normalizedLag + idleWeight × predictedIdleRatio

Components of the cost function:

Lag component: Measures how quickly the system works through its backlog
Idle component: Measures resource efficiency (whether tasks are busy or idle waiting for data)
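As a minimal sketch of the formula above (not the PR's `WeightedCostFunction`), the cost can be computed as a plain weighted sum. The class name `CostSketch`, the parameter order, and the assumption that both inputs are already scaled to the range 0..1 are illustrative only.

```java
// Hypothetical helper, not taken from the PR: evaluates the weighted cost for one candidate.
final class CostSketch {
  private CostSketch() {}

  /**
   * cost = lagWeight * normalizedLag + idleWeight * predictedIdleRatio.
   * Lower values are better. Inputs are assumed to be normalized to [0, 1].
   */
  static double cost(double lagWeight, double normalizedLag,
                     double idleWeight, double predictedIdleRatio) {
    return lagWeight * normalizedLag + idleWeight * predictedIdleRatio;
  }
}
```

With equal weights of 0.5, a candidate with normalized lag 0.8 and predicted idle ratio 0.2 gets the same cost as one with lag 0.2 and idle 0.8, so the weights alone decide which side of the trade-off dominates.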

Key Features

1. Predictive Cost Calculation

Predicts lag and idle ratio for candidate task counts using linear scaling
Evaluates multiple task count options within ±2 positions of the current count
Selects the configuration with the minimum cost
Tracks historically observed lag values to provide stable normalization even when lag magnitudes vary
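The selection steps above can be sketched roughly as follows. Several details are assumptions made for illustration and are not confirmed by the PR: "positions" is read as adjacent integer task counts, lag and busy time are assumed to scale inversely with task count, and lag is normalized by the largest lag observed so far. The class and method names are hypothetical.

```java
// Hypothetical sketch of minimum-cost candidate selection; not the PR's implementation.
final class CandidateSelectionSketch {
  private CandidateSelectionSketch() {}

  static int pickTaskCount(int current, int maxTasks, double lag, double maxObservedLag,
                           double idleRatio, double lagWeight, double idleWeight) {
    int best = current;
    double bestCost = Double.POSITIVE_INFINITY;
    int low = Math.max(1, current - 2);          // assumed: +-2 adjacent task counts
    int high = Math.min(maxTasks, current + 2);
    for (int candidate = low; candidate <= high; candidate++) {
      double scale = (double) current / candidate;   // linear scaling factor
      double predictedLag = lag * scale;             // assumed: lag shrinks as tasks are added
      double normalizedLag =
          maxObservedLag > 0 ? Math.min(1.0, predictedLag / maxObservedLag) : 0.0;
      // assumed: busy fraction (1 - idle) is spread across the candidate task count
      double predictedIdle = Math.max(0.0, Math.min(1.0, 1.0 - (1.0 - idleRatio) * scale));
      double cost = lagWeight * normalizedLag + idleWeight * predictedIdle;
      if (cost < bestCost) {                         // ties keep the smaller task count
        bestCost = cost;
        best = candidate;
      }
    }
    return best;
  }
}
```

With the idle weight set to zero the sketch always moves toward more tasks, and with the lag weight set to zero it moves toward fewer, which is the balance the weights are meant to control.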

2. Lag-Aware Idle Estimation

High lag (>2M): Inverse relationship - more tasks = less idle (processing backlog)
Low lag (<2M): Normal relationship - more tasks = more idle (waiting for data)
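One way to picture the two regimes, as a sketch under stated assumptions rather than the PR's code: above the threshold the idle ratio is divided by the task-count ratio, and otherwise it is multiplied by it. The 2M threshold comes from the description; the proportional scaling, the treatment of a lag of exactly 2M as "low", and all names below are assumptions.

```java
// Hypothetical sketch of lag-aware idle prediction; not the PR's implementation.
final class LagAwareIdleSketch {
  static final long HIGH_LAG_THRESHOLD = 2_000_000L; // ">2M" from the description

  private LagAwareIdleSketch() {}

  static double predictIdleRatio(double currentIdle, int currentTasks,
                                 int candidateTasks, long lag) {
    double ratio = (double) candidateTasks / currentTasks;
    double predicted = lag > HIGH_LAG_THRESHOLD
        ? currentIdle / ratio   // high lag: more tasks = less idle (backlog keeps them busy)
        : currentIdle * ratio;  // low lag: more tasks = more idle (waiting for data)
    return Math.max(0.0, Math.min(1.0, predicted)); // keep the ratio in [0, 1]
  }
}
```

Doubling the task count from 4 to 8 at an idle ratio of 0.2 gives a lower prediction when lag is 3M and a higher one when lag is 1,000 in this sketch.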

Key Files Changed

Core Implementation:

WeightedCostFunction.java - Cost function and adaptive bounds
CostBasedAutoScaler.java - Autoscaler orchestration
CostMetrics.java - Metrics data class

Tests:

WeightedCostFunctionTest.java - Comprehensive unit tests
CostBasedAutoScalerTest.java - Integration tests

This PR has:

...java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/CostBasedAutoScaler.java

+
+      lock.lock();
+      try {
+        metricsQueue.offer(metrics);


jtuglu1 · 2025-12-06T00:52:02Z

While I think this will be very useful, the primary issue we've run into with the current scaler is that it needs to shut down tasks in order to scale, which causes a lot of lag during the process. #18466 was working on a way to fix this.

I think this will help the scaler be smarter on each scale, but each scale action still costs a lot to do.

Commit 58b4e2c: Cost-based autoscaler: first raw version

github-actions bot added the Area - Ingestion label Dec 5, 2025

Fly-Style changed the title ~~Cost-based autoscaler: first raw version~~ Cost-based autoscaler Dec 5, 2025

github-advanced-security bot found potential problems Dec 5, 2025


Check notice: Code scanning / CodeQL

Ignored error status of call (Note): Method run ignores exceptional return value of CircularFifoQueue.offer.
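A notice of this kind is usually addressed by checking the boolean that `offer` returns instead of discarding it. The sketch below illustrates that pattern with a lock-guarded, fixed-size buffer. It uses `java.util.ArrayDeque` as a stand-in so the example has no external dependency; the class `MetricsBufferSketch` and its methods are hypothetical and do not reflect how the PR resolves the notice.

```java
import java.util.ArrayDeque;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a lock-guarded metrics buffer; not the PR's code.
final class MetricsBufferSketch<T> {
  private final ReentrantLock lock = new ReentrantLock();
  private final ArrayDeque<T> queue = new ArrayDeque<>();
  private final int capacity;

  MetricsBufferSketch(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  /** Adds an entry, evicting the oldest when full. Returns true if something was evicted. */
  boolean record(T metrics) {
    lock.lock();
    try {
      boolean evicted = false;
      if (queue.size() == capacity) {
        queue.poll();              // drop the oldest entry to make room
        evicted = true;
      }
      // Check the result explicitly rather than ignoring it.
      if (!queue.offer(metrics)) {
        throw new IllegalStateException("metrics entry was rejected by the queue");
      }
      return evicted;
    } finally {
      lock.unlock();               // always release, even if offer throws
    }
  }

  int size() {
    lock.lock();
    try {
      return queue.size();
    } finally {
      lock.unlock();
    }
  }
}
```

Returning the eviction flag lets a caller log or count dropped samples, which is one option for making the otherwise silent overwrite visible.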
