feat: implement Balanced consolidation policy #2962
jamesmt-aws wants to merge 2 commits into kubernetes-sigs:main
Note: the RFC discussion (#2942) has an open question about whether this should be a new `Balanced` policy or a `consolidationThreshold` parameter on the existing `WhenEmptyOrUnderutilized`. The implementation supports either direction with minimal changes. If the decision is to go with a threshold on `WhenEmptyOrUnderutilized`, the scoring logic and tests are unchanged. The API surface changes are: remove the `Balanced` enum value, allow `consolidationThreshold` on `WhenEmptyOrUnderutilized`, and update `ShouldDisrupt` to check the threshold instead of the policy. |
Adds a new consolidationPolicy value, Balanced, that scores each consolidation move and rejects moves where the disruption outweighs the savings. Gated behind --feature-gates BalancedConsolidation=true.

The scoring formula compares savings and disruption as fractions of NodePool totals: score = savings_fraction / disruption_fraction. A move is approved when score >= 1/consolidationThreshold (default 2).

The scoring step is a filter inserted after scheduling feasibility and price comparison. It can only reject moves, never create them. If scoring has a bug that incorrectly approves, the move was already feasible and cost-saving. If it incorrectly rejects, the cluster is less optimized but not disrupted.

API:
- consolidationPolicy: Balanced (new enum value)
- consolidationThreshold: 1-3 (default 2, requires Balanced)

Implementation:
- balanced.go: scoring formula, NodePool totals, candidate pre-filter, cross-NodePool move evaluation
- Feature gate, API validation (CEL + runtime), defaulting
- ShouldDisrupt accepts Balanced, sets ConsolidationPolicyUnsupported status condition when gate is disabled
- Score-based candidate ranking for single-node consolidation
- Events (ConsolidationApproved/Rejected on Node+NodeClaim for single-node, NodePool for multi-node)
- Metrics (consolidation_score histogram, consolidation_moves_total counter)

Tests (31 new):
- 15 unit tests covering all RFC worked examples
- 9 integration tests (NodePool totals, cross-pool, candidate price)
- 3 feature gate tests
- 5 validation + 4 defaulting tests
- 4 score-based ranking tests
- 1 status condition test

See designs/balanced-consolidation.md (PR kubernetes-sigs#2942) for the full RFC.
…ents

- EmitBalancedMultiNodeEvents emits per Balanced NodePool in cross-pool batches, not just the first pool
- Thread ctx into sortCandidates/candidateSavingsRatio (was using context.Background, inconsistent with every other EvictionCost caller)
- Emit ConsolidationRejected event when balanced scoring rejects a batch inside the multi-node binary search, so operators can see why a batch size was tried and rejected

Co-Authored-By: James Thompson <jamesmt@amazon.com>
Summary

Implements the Balanced consolidation policy from the RFC in #2942. Gated behind `--feature-gates BalancedConsolidation=true`, disabled by default.

Balanced scores each consolidation move by comparing savings and disruption as fractions of NodePool totals. A move is approved when `score >= 1/consolidationThreshold`. The scoring step is a filter inserted after scheduling feasibility and price comparison. It can only reject moves, never create them.

Changes

API (21 lines in `nodepool.go`, `nodepool_status.go`):
- `consolidationPolicy: Balanced` (new enum value)
- `consolidationThreshold: 1-3` (default 2, CEL-validated to require Balanced)
- `ConsolidationPolicyUnsupported` status condition (set when gate is disabled)

Scoring (252 lines in new `balanced.go`):
- `ScoreMove`: the RFC formula (savings_fraction / disruption_fraction)
- `ComputeNodePoolTotals`: computed from all candidates before ShouldDisrupt filtering
- `EvaluateBalancedMove`: cross-NodePool support (each Balanced pool scored independently)
- `candidateSavingsRatio`: scoring-consistent sort key for candidate ranking

Integration (~30 lines each in `singlenodeconsolidation.go`, `multinodeconsolidation.go`):
- `computeConsolidation`/`firstNConsolidationOption`

Observability (71 lines in `events.go`, 19 lines in `metrics.go`):
- `ConsolidationApproved`/`ConsolidationRejected` events with score, threshold, savings%, disruption%
- `karpenter_consolidation_score` histogram (buckets: 0.1, 0.25, 0.33, 0.5, 1.0, 2.0, 5.0, 10.0)
- `karpenter_consolidation_moves_total` counter by decision and NodePool

Lines changed in existing code paths when Balanced is NOT used: 4 (new `_` return values from `GetCandidates`). Everything else is behind `ConsolidationPolicyBalanced` or `BalancedConsolidation` feature gate checks.

Test plan
31 new tests:
Related