@Fly-Style
Contributor

@Fly-Style Fly-Style commented Dec 5, 2025

Cost-Based Autoscaler for Seekable Stream Supervisors

Overview

Implements a cost-based autoscaling algorithm for seekable stream supervisor tasks that optimizes task count by balancing lag reduction against resource efficiency.

Algorithm Design

Cost Function

The autoscaler uses a weighted cost function to evaluate different task count configurations:

cost = lagWeight × normalizedLag + idleWeight × predictedIdleRatio

Components of cost function:

  • Lag component: Measures how quickly the system works through its backlog
  • Idle component: Measures resource efficiency (the share of time tasks spend waiting for data rather than processing it)
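
As a rough illustration of the weighted sum above, here is a minimal sketch. It assumes both inputs have already been normalized into [0, 1]; the class name, method name, and example weights are hypothetical and are not taken from WeightedCostFunction.java.

```java
/** Illustrative only: not the WeightedCostFunction class from this PR. */
final class CostSketch {
    private CostSketch() {}

    /**
     * cost = lagWeight * normalizedLag + idleWeight * predictedIdleRatio.
     * Assumption: both normalizedLag and predictedIdleRatio lie in [0, 1].
     */
    static double cost(double lagWeight, double normalizedLag,
                       double idleWeight, double predictedIdleRatio) {
        if (normalizedLag < 0 || normalizedLag > 1
            || predictedIdleRatio < 0 || predictedIdleRatio > 1) {
            throw new IllegalArgumentException("inputs must be in [0, 1]");
        }
        return lagWeight * normalizedLag + idleWeight * predictedIdleRatio;
    }

    public static void main(String[] args) {
        // Hypothetical weights: lag counts three times as much as idleness.
        System.out.println(cost(0.75, 0.5, 0.25, 0.5));
    }
}
```

A lower cost is better, so raising lagWeight relative to idleWeight biases the choice toward more tasks.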

Key Features

1. Predictive Cost Calculation

  • Predicts lag and idle ratio for candidate task counts using linear scaling
  • Evaluates multiple task count options within ±2 positions of current count
  • Selects configuration with minimum cost
  • Tracks historically observed lag values to provide stable normalization even when lag magnitudes vary
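
The selection steps above could be sketched roughly as follows. This is an assumption-heavy illustration, not the code in CostBasedAutoScaler.java: it treats "positions" as indexes into a sorted list of valid task counts, normalizes lag against the largest lag observed so far, and uses one simple linear model for idle ratio (ignoring the lag-aware adjustment described under feature 2). All names are hypothetical.

```java
import java.util.List;

/** Illustrative only: candidate evaluation within a +/-2 position window. */
final class CandidateSelectionSketch {
    static final int WINDOW = 2; // positions on each side of the current count

    private CandidateSelectionSketch() {}

    static double clamp01(double v) {
        return Math.max(0.0, Math.min(1.0, v));
    }

    /**
     * Returns the task count with the lowest predicted cost.
     * Assumptions: validCounts is sorted ascending, currentIndex points at the
     * current task count, and lag and busy time scale linearly with
     * current / candidate.
     */
    static int pickTaskCount(List<Integer> validCounts, int currentIndex,
                             double currentLag, double currentIdleRatio,
                             double maxObservedLag,
                             double lagWeight, double idleWeight) {
        int current = validCounts.get(currentIndex);
        int best = current;
        double bestCost = Double.POSITIVE_INFINITY;
        int from = Math.max(0, currentIndex - WINDOW);
        int to = Math.min(validCounts.size() - 1, currentIndex + WINDOW);
        for (int i = from; i <= to; i++) {
            int candidate = validCounts.get(i);
            double scale = (double) current / candidate;
            // Linear scaling: twice the tasks, half the predicted lag.
            double predictedLag = currentLag * scale;
            double normalizedLag =
                maxObservedLag > 0 ? clamp01(predictedLag / maxObservedLag) : 0.0;
            // Busy share shrinks with more tasks, so idle share grows.
            double predictedIdle = clamp01(1.0 - (1.0 - currentIdleRatio) * scale);
            double cost = lagWeight * normalizedLag + idleWeight * predictedIdle;
            if (cost < bestCost) { // ties keep the smaller task count
                bestCost = cost;
                best = candidate;
            }
        }
        return best;
    }
}
```

With only the lag term weighted, this sketch always moves to the largest candidate in the window; with only the idle term weighted, it moves toward the smallest. The weights decide where the two pressures balance.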

2. Lag-Aware Idle Estimation

  • High lag (>2M): Inverse relationship - more tasks = less idle (processing backlog)
  • Low lag (<2M): Normal relationship - more tasks = more idle (waiting for data)
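
One way to picture the two regimes is the sketch below. Only the 2M threshold and the direction of each relationship come from the description above; the specific formulas, the handling of a lag of exactly 2M (treated here as low lag), and all names are assumptions made for illustration.

```java
/** Illustrative only: lag-aware idle ratio prediction with assumed formulas. */
final class IdleEstimateSketch {
    static final long HIGH_LAG_THRESHOLD = 2_000_000L; // the "2M" from the description

    private IdleEstimateSketch() {}

    /**
     * Predicts the idle ratio if the task count changed from currentTasks to
     * candidateTasks. The result is clamped to [0, 1].
     */
    static double predictIdleRatio(long lag, double currentIdleRatio,
                                   int currentTasks, int candidateTasks) {
        if (currentTasks <= 0 || candidateTasks <= 0) {
            throw new IllegalArgumentException("task counts must be positive");
        }
        double ratio = (double) candidateTasks / currentTasks;
        double predicted;
        if (lag > HIGH_LAG_THRESHOLD) {
            // High lag: added tasks have backlog to process, so idle time falls.
            predicted = currentIdleRatio / ratio;
        } else {
            // Low lag: the same work is spread over more tasks, so idle time rises.
            predicted = 1.0 - (1.0 - currentIdleRatio) / ratio;
        }
        return Math.max(0.0, Math.min(1.0, predicted));
    }
}
```

Under these assumed formulas, doubling the task count halves the idle ratio in the high-lag regime and halves the busy share in the low-lag regime.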

Key Files Changed

Core Implementation:

  • WeightedCostFunction.java - Cost function and adaptive bounds
  • CostBasedAutoScaler.java - Autoscaler orchestration
  • CostMetrics.java - Metrics data class

Tests:

  • WeightedCostFunctionTest.java - Comprehensive unit tests
  • CostBasedAutoScalerTest.java - Integration tests

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@Fly-Style Fly-Style changed the title from "Cost-based autoscaler: first raw version" to "Cost-based autoscaler" Dec 5, 2025

lock.lock();
try {
metricsQueue.offer(metrics);

Check notice

Code scanning / CodeQL

Ignored error status of call Note

Method run ignores exceptional return value of CircularFifoQueue.offer.
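
For context on the notice above: to my knowledge, CircularFifoQueue.offer in Apache Commons Collections always returns true, because it evicts the oldest element when the queue is full, so the ignored value carries no failure signal there. If the warning should still be addressed, one option is to consume the return value explicitly. The sketch below is a hypothetical stand-in built on java.util.ArrayDeque rather than the PR's code; the class and method names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative stand-in for a lock-guarded, evict-oldest metrics buffer. */
final class BoundedMetricsBuffer<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final ArrayDeque<T> queue = new ArrayDeque<>();
    private final int capacity;

    BoundedMetricsBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Adds a sample, evicting the oldest one when full. Returns true if an eviction happened. */
    boolean add(T metrics) {
        lock.lock();
        try {
            boolean evicted = false;
            if (queue.size() == capacity) {
                queue.pollFirst();
                evicted = true;
            }
            // Use the return value instead of dropping it silently.
            if (!queue.offerLast(metrics)) {
                throw new IllegalStateException("metrics sample was rejected");
            }
            return evicted;
        } finally {
            lock.unlock(); // always release, even if offerLast throws
        }
    }

    int size() {
        lock.lock();
        try {
            return queue.size();
        } finally {
            lock.unlock();
        }
    }
}
```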

@jtuglu1
Contributor

jtuglu1 commented Dec 6, 2025

While I think this will be very useful, the primary issue we've run into with the current scaler is that it needs to shut down tasks in order to scale (causes a lot of lag during this process). #18466 was working on a way to fix this.

I think this will help the scaler be smarter on each scale, but each scale action still costs a lot to do.


2 participants