Skip to content

feat: implement Balanced consolidation policy#2962

Open
jamesmt-aws wants to merge 2 commits intokubernetes-sigs:mainfrom
jamesmt-aws:balanced-impl-pr
Open

feat: implement Balanced consolidation policy#2962
jamesmt-aws wants to merge 2 commits intokubernetes-sigs:mainfrom
jamesmt-aws:balanced-impl-pr

Conversation

@jamesmt-aws
Copy link
Copy Markdown
Contributor

Summary

Implements the Balanced consolidation policy from the RFC in #2942. Gated behind --feature-gates BalancedConsolidation=true, disabled by default.

Balanced scores each consolidation move by comparing savings and disruption as fractions of NodePool totals. A move is approved when score >= 1/consolidationThreshold. The scoring step is a filter inserted after scheduling feasibility and price comparison. It can only reject moves, never create them.

Changes

API (21 lines in nodepool.go, nodepool_status.go):

  • consolidationPolicy: Balanced (new enum value)
  • consolidationThreshold: 1-3 (default 2, CEL-validated to require Balanced)
  • ConsolidationPolicyUnsupported status condition (set when gate is disabled)

Scoring (252 lines in new balanced.go):

  • ScoreMove: the RFC formula (savings_fraction / disruption_fraction)
  • ComputeNodePoolTotals: computed from all candidates before ShouldDisrupt filtering
  • EvaluateBalancedMove: cross-NodePool support (each Balanced pool scored independently)
  • candidateSavingsRatio: scoring-consistent sort key for candidate ranking

Integration (~30 lines each in singlenodeconsolidation.go, multinodeconsolidation.go):

  • Scoring check after computeConsolidation / firstNConsolidationOption
  • Events emitted on Node+NodeClaim (single-node) or NodePool (multi-node)
  • Metrics recorded (histogram + counter)

Observability (71 lines in events.go, 19 lines in metrics.go):

  • ConsolidationApproved / ConsolidationRejected events with score, threshold, savings%, disruption%
  • karpenter_consolidation_score histogram (buckets: 0.1, 0.25, 0.33, 0.5, 1.0, 2.0, 5.0, 10.0)
  • karpenter_consolidation_moves_total counter by decision and NodePool

Lines changed in existing code paths when Balanced is NOT used: 4 (new _ return values from GetCandidates). Everything else is behind ConsolidationPolicyBalanced or BalancedConsolidation feature gate checks.

Test plan

31 new tests:

  • 15 unit tests covering all RFC worked examples (Oversized Node 2.81, Marginal Move 0.42, scale invariance, heterogeneous disruption, etc.)
  • 9 integration tests (NodePool totals, cross-pool evaluation, candidate price edge cases)
  • 3 feature gate tests (rejected when off, accepted when on, WhenEmptyOrUnderutilized unaffected)
  • 5 validation tests (threshold requires Balanced, boundary values [1,3])
  • 4 defaulting tests (nil -> 2 for Balanced, preserved when explicit, nil for non-Balanced)
  • 4 score-based ranking tests (sort order with/without Balanced candidates)
  • 1 ConsolidationPolicyUnsupported status condition test
ok  sigs.k8s.io/karpenter/pkg/controllers/disruption  63.219s
ok  sigs.k8s.io/karpenter/pkg/apis/v1                 6.694s

Related

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 10, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jamesmt-aws
Once this PR has been reviewed and has the lgtm label, please assign njtran for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 10, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @jamesmt-aws. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 10, 2026
@jamesmt-aws
Copy link
Copy Markdown
Contributor Author

Note: the RFC discussion (#2942) has an open question about whether this should be a new `Balanced` policy or a `consolidationThreshold` parameter on the existing `WhenEmptyOrUnderutilized`. The implementation supports either direction with minimal changes. If the decision is to go with a threshold on `WhenEmptyOrUnderutilized`, the scoring logic and tests are unchanged. The API surface changes are: remove the `Balanced` enum value, allow `consolidationThreshold` on `WhenEmptyOrUnderutilized`, and update `ShouldDisrupt` to check the threshold instead of the policy.

@jamesmt-aws jamesmt-aws force-pushed the balanced-impl-pr branch 3 times, most recently from cdd5075 to 6644310 Compare April 11, 2026 04:06
Adds a new consolidationPolicy value, Balanced, that scores each
consolidation move and rejects moves where the disruption outweighs the
savings. Gated behind --feature-gates BalancedConsolidation=true.

The scoring formula compares savings and disruption as fractions of
NodePool totals: score = savings_fraction / disruption_fraction. A move
is approved when score >= 1/consolidationThreshold (default 2). The
scoring step is a filter inserted after scheduling feasibility and price
comparison. It can only reject moves, never create them. If scoring has
a bug that incorrectly approves, the move was already feasible and
cost-saving. If it incorrectly rejects, the cluster is less optimized
but not disrupted.

API:
- consolidationPolicy: Balanced (new enum value)
- consolidationThreshold: 1-3 (default 2, requires Balanced)

Implementation:
- balanced.go: scoring formula, NodePool totals, candidate pre-filter,
  cross-NodePool move evaluation
- Feature gate, API validation (CEL + runtime), defaulting
- ShouldDisrupt accepts Balanced, sets ConsolidationPolicyUnsupported
  status condition when gate is disabled
- Score-based candidate ranking for single-node consolidation
- Events (ConsolidationApproved/Rejected on Node+NodeClaim for
  single-node, NodePool for multi-node)
- Metrics (consolidation_score histogram, consolidation_moves_total
  counter)

Tests (31 new):
- 15 unit tests covering all RFC worked examples
- 9 integration tests (NodePool totals, cross-pool, candidate price)
- 3 feature gate tests
- 5 validation + 4 defaulting tests
- 4 score-based ranking tests
- 1 status condition test

See designs/balanced-consolidation.md (PR kubernetes-sigs#2942) for the full RFC.
…ents

- EmitBalancedMultiNodeEvents emits per Balanced NodePool in cross-pool
  batches, not just the first pool
- Thread ctx into sortCandidates/candidateSavingsRatio (was using
  context.Background, inconsistent with every other EvictionCost caller)
- Emit ConsolidationRejected event when balanced scoring rejects a batch
  inside the multi-node binary search, so operators can see why a batch
  size was tried and rejected

Co-Authored-By: James Thompson <jamesmt@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants