diff --git a/designs/balanced-consolidation.md b/designs/balanced-consolidation.md new file mode 100644 index 0000000000..b2cad2432a --- /dev/null +++ b/designs/balanced-consolidation.md @@ -0,0 +1,445 @@ +# Balanced Consolidation: Scoring Moves by Savings and Disruption + +## Motivation + +Today, Karpenter's consolidation is all-or-nothing. `WhenEmptyOrUnderutilized` consolidates any node where pods can be repacked more cheaply, regardless of how little is saved or how many pods are disrupted. `WhenEmpty` consolidates only nodes with no pods. A move that saves $0.02/day by evicting a pod with a 30-minute warm-up cache is treated the same as a move that saves $5/day by evicting a stateless proxy. + +Terminating a non-empty node requires evicting running pods and starting replacements. Customers report cases where the disruption is not worth the savings. Related issues: + +- Nodes at 93-99% CPU utilization disrupted instead of lightly utilized ones ([aws#8868](https://github.com/aws/karpenter-provider-aws/issues/8868), [kubernetes-sigs#2319](https://github.com/kubernetes-sigs/karpenter/issues/2319)) +- Multi-hour consolidation loops replacing the same instance types with no net savings ([aws#8536](https://github.com/aws/karpenter-provider-aws/issues/8536), [aws#6642](https://github.com/aws/karpenter-provider-aws/issues/6642), [aws#7146](https://github.com/aws/karpenter-provider-aws/issues/7146)) +- Rapid node churn where consolidation deletes nodes that are immediately re-provisioned ([kubernetes-sigs#1019](https://github.com/kubernetes-sigs/karpenter/issues/1019), [kubernetes-sigs#735](https://github.com/kubernetes-sigs/karpenter/issues/735), [kubernetes-sigs#1851](https://github.com/kubernetes-sigs/karpenter/issues/1851)) +- `consolidateAfter` not preventing disruption of well-packed nodes 
([kubernetes-sigs#2705](https://github.com/kubernetes-sigs/karpenter/issues/2705), [aws#3577](https://github.com/aws/karpenter-provider-aws/issues/3577)) +- Direct requests for a savings threshold or utilization-based consolidation gating ([kubernetes-sigs#2883](https://github.com/kubernetes-sigs/karpenter/issues/2883), [kubernetes-sigs#1440](https://github.com/kubernetes-sigs/karpenter/issues/1440), [kubernetes-sigs#1686](https://github.com/kubernetes-sigs/karpenter/issues/1686), [kubernetes-sigs#1430](https://github.com/kubernetes-sigs/karpenter/issues/1430), [aws#5218](https://github.com/aws/karpenter-provider-aws/issues/5218)) + +This RFC calls each consolidation action a *move*. A move deletes one or more nodes, along with pod eviction and optional replacement node creation. We propose new `consolidationPolicy` values that score each move and reject moves where the disruption outweighs the savings. `Balanced` (k=2) is the recommended default. Integer values (1-3) provide direct control over the threshold. + +## Alternatives Considered + +Five approaches were considered and rejected. Each fails to account for something the scoring approach captures. + +### Cost Improvement Factor + +Require a minimum price improvement ratio (e.g., old_price / new_price >= 2). Considered for spot consolidation ([spot-consolidation.md](https://github.com/kubernetes-sigs/karpenter/blob/main/designs/spot-consolidation.md)). A move that saves 50% of a node's cost passes a 2x factor whether it disrupts 2 default-cost pods or 20 high-cost pods. The factor ignores disruption entirely. 
+ +### Absolute Dollar Threshold + +Require savings to exceed a fixed dollar amount (e.g., $1/day) ([kubernetes-sigs#2883](https://github.com/kubernetes-sigs/karpenter/issues/2883), [kubernetes-sigs#1440](https://github.com/kubernetes-sigs/karpenter/issues/1440)). Two moves that each save $1/day and disrupt 4 default-cost pods: on a $50/day NodePool with 40 pods, this saves 2% of cost for 10% of disruption. On a $5,000/day NodePool with 4000 pods, the same $1 saves 0.02% for 0.1% of disruption. A $1 threshold approves both. The threshold does not scale with NodePool size. + +### Utilization-Based Threshold + +Exclude nodes above a resource utilization percentage (e.g., 70%) from consolidation, like CA's `scale-down-utilization-threshold` ([kubernetes-sigs#1686](https://github.com/kubernetes-sigs/karpenter/issues/1686), [aws#5218](https://github.com/aws/karpenter-provider-aws/issues/5218)). This is the most frequently requested alternative. A node at 40% utilization running one pod with `pod-deletion-cost: 2147483647` (a model-serving pod with a 2-hour warm-up cache) and a node at 40% running ten stateless pods both pass a 70% threshold. The utilization threshold cannot distinguish them. + +### Selective Consolidation Type Disable + +Disable single-node consolidation (replace) while keeping multi-node and emptiness consolidation ([kubernetes-sigs#1430](https://github.com/kubernetes-sigs/karpenter/issues/1430), [kubernetes-sigs#684](https://github.com/kubernetes-sigs/karpenter/issues/684), [PR #1433](https://github.com/kubernetes-sigs/karpenter/pull/1433)). An m7i.2xlarge ($9.68/day) running 2 pods requesting 2 vCPU total could replace to an m7i.large ($2.42/day), saving $7.26/day for 2 pods of disruption. Disabling replace blocks this move along with every other replace, regardless of the savings-to-disruption ratio. 
+ +### Separate Disruption Cost Annotation + +A dedicated `karpenter.sh/disruption-cost` annotation separate from the existing `EvictionCost` inputs. This would let application developers independently control eviction ordering and consolidation gating. We prefer reusing existing parameters. `controller.kubernetes.io/pod-deletion-cost` and pod priority already express disruption cost. A separate annotation could be introduced later if eviction ordering and consolidation gating need to diverge. + +### Related Work + +- [PR #2562](https://github.com/kubernetes-sigs/karpenter/pull/2562): `ConsolidationPriceImprovementFactor` field (0.0-1.0) with operator-level default and NodePool override. Cost Improvement Factor with a different UX. +- [PR #2893](https://github.com/kubernetes-sigs/karpenter/pull/2893): Decision ratio with a configurable `DecisionRatioThreshold` (default 1.0). Same scoring approach as this RFC but exposes the threshold from day one. +- [PR #2901](https://github.com/kubernetes-sigs/karpenter/pull/2901): External health signal probes on NodePools that block disruption when a probe fails. Orthogonal and complementary. +- [PR #2894](https://github.com/kubernetes-sigs/karpenter/pull/2894): Controller that automatically manages `controller.kubernetes.io/pod-deletion-cost` based on pluggable ranking strategies. Complementary. + +## Proposal + +### Proposed Spec + +```yaml +apiVersion: karpenter.sh/v1 +kind: NodePool +metadata: + name: default +spec: + disruption: + consolidationPolicy: Balanced + consolidateAfter: 30s + budgets: + - nodes: 10% +``` + +`consolidationPolicy` is an `IntOrString` field. 
All values are expressed through the disruption cost model: + +| Value | Behavior | +|---|---| +| `WhenEmpty` | Approve only when move disruption cost equals the per-node disruption cost (no pod contributes positive disruption cost) | +| `1` | Scoring with break-even threshold (deletes only, no replaces in uniform pools) | +| `Balanced` | Scoring with k=2 (within-family replaces viable) | +| `3` | Scoring with k=3 (adds cross-family replace pairs) | +| `WhenEmptyOrUnderutilized` | Any positive savings (k=+inf) | + +`Balanced` is shorthand for k=2. An integer value uses the scoring formula directly with that k. `WhenEmptyOrUnderutilized` is equivalent to k=+inf (any move with positive savings passes). Validation rejects integers outside 1-3. + +`WhenEmpty` approves a move when its disruption cost equals the per-node disruption cost, meaning no pod on the candidate node has positive disruption cost. This is a behavioral change from today's `WhenEmpty`, which checks for literally zero pods. Under the new definition, a node whose pods all have large negative `pod-deletion-cost` (driving their disruption cost to 0 after clamping) qualifies as "empty" for consolidation purposes. This is an improvement: pods that declared themselves free to disrupt should not block consolidation. + +A move is approved when `score >= 1/k`. At the default `Balanced` (k=2), `score >= 0.5`. + +If an operator enables `BalancedConsolidation`, sets `consolidationPolicy: Balanced` (or an integer), then disables the feature gate during rollback, the controller falls back to `WhenEmptyOrUnderutilized` behavior and sets a `ConsolidationPolicyUnsupported` status condition on the NodePool. The condition message directs the operator to change the policy or re-enable the gate. This avoids reconcile failures while making the fallback visible. + +### How Scoring Works + +The score compares a move's savings and disruption as fractions of NodePool totals. 
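
Before the formula itself, the policy-to-threshold mapping in the table above can be sketched in a few lines. This is an illustrative sketch only; the function names are hypothetical, not Karpenter's implementation:

```python
import math

def policy_to_k(policy):
    # Hypothetical mapping from consolidationPolicy to k, per the table above.
    if policy == "Balanced":
        return 2                      # recommended default
    if policy == "WhenEmptyOrUnderutilized":
        return math.inf               # any positive savings passes
    if isinstance(policy, int):
        if not 1 <= policy <= 3:
            raise ValueError("integer consolidationPolicy must be in 1-3")
        return policy
    raise ValueError(f"{policy} is not scored")  # WhenEmpty is a separate check

def approval_threshold(policy):
    # A move is approved when score >= 1/k.
    return 1.0 / policy_to_k(policy)
```

At `Balanced`, `approval_threshold` returns 0.5; at `WhenEmptyOrUnderutilized` it returns 0.0, so any positive score passes (zero-savings moves are rejected separately, per the edge-case table below).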
+ +#### Per-Pod Disruption Cost + +[`EvictionCost`](../pkg/utils/disruption/disruption.go) starts at 1.0 per pod and adds two terms: + +1. **Pod deletion cost** ([`controller.kubernetes.io/pod-deletion-cost`](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/#pod-deletion-cost)), divided by 2^27, range -16 to +16. Default 0. The ReplicaSet controller uses this for scale-down ordering; Karpenter reuses it as a disruption signal. +2. **Pod priority**, divided by 2^25, range -64 to +30. Default 0. Higher-priority pods increase their node's disruption cost. + +With neither set, per-pod disruption cost is 1.0. `EvictionCost` clamps to [-10, 10]. The scoring path clamps negative values to 0 via `max(0, EvictionCost(pod))` in the per-node formula (see [NodePool Totals](#nodepool-totals)). Other consumers of `EvictionCost` (eviction ordering) still see negatives. Scoring range per pod: [0, 10]. + +#### Default Behavior: Savings vs. Pod Count + +Most clusters today do not set `pod-deletion-cost`. When every pod has default disruption cost 1.0, the score reduces to savings-versus-pod-count. It still rejects marginal moves, but the ability to distinguish expensive-to-restart pods from cheap ones is inactive until operators or automation ([PR #2894](https://github.com/kubernetes-sigs/karpenter/pull/2894)) set per-pod costs. + +#### NodePool Totals + +The score normalizes savings and disruption against NodePool totals. + +``` +nodepool_cost = sum(node.price for node in nodepool.nodes) +nodepool_total_disruption_cost = sum(node.disruption_cost for node in nodepool.nodes) +``` + +Each node's disruption cost is 1.0 (per-node cost) plus `sum(max(0, EvictionCost(pod)))` for its pods. Nodes have a disruption cost independent of their pods (cordoning, draining, API calls, replacement latency). We have not modeled this precisely and 1.0 is a placeholder. 
It eliminates division-by-zero for empty nodes and gives room to refine later (e.g., if GPU nodes should carry higher inherent disruption cost than CPU nodes). A 10-pod node has disruption cost 11; the per-node cost is 9% of the total. A node whose only pod has `EvictionCost = -10` has disruption cost 1.0, same as an empty node.

For cross-NodePool moves, the source pool's totals, policy, and budget govern scoring. Cross-pool DELETEs require scheduling simulation to confirm pods fit on the destination. The destination pool's budget is not consumed (it absorbs pods; it does not lose a node). The source pool's policy governs regardless of the destination's policy; otherwise a permissive destination could override a conservative source.

NodePool totals are snapshotted once per cycle. Later moves in the same cycle use stale totals. A move that scored 1.5 against a 100-node snapshot still passes against a 98-node pool. Moves near the boundary could flip; the next cycle corrects.

#### Calculation

```
savings = sum(deleted_node.price) - sum(created_node.price)  # price = Karpenter's pricing model (on-demand, current spot, or on-demand/10M for ODCRs)
disruption_cost = sum(deleted_node.disruption_cost)          # per-node cost of 1.0 plus sum(max(0, EvictionCost(pod))) for that node's pods

savings_fraction = savings / nodepool_cost
disruption_fraction = disruption_cost / nodepool_total_disruption_cost

score = savings_fraction / disruption_fraction
```

The disruption cost counts every pod on a deleted source node, plus the per-node cost of each deleted node. Every pod is evicted regardless of where it lands.

A move is approved when `score >= 1/k`, where k comes from `consolidationPolicy` (`Balanced` = 2, or an integer value directly). At k=2, `score >= 0.5`. Both sides are dimensionless fractions, so the score is scale-invariant (see [Scale Invariance](#scale-invariance)).
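
The calculation can be sketched end-to-end in Python. This is a hedged illustration with hypothetical `Pod`/`Node` containers (not Karpenter's types) that folds in the per-pod `EvictionCost` terms and the per-node disruption cost of 1.0 described above:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pod:                        # hypothetical container for illustration
    deletion_cost: int = 0        # controller.kubernetes.io/pod-deletion-cost
    priority: int = 0             # pod priority

@dataclass
class Node:                       # hypothetical container for illustration
    price: float                  # Karpenter's pricing model, $/day
    pods: List[Pod] = field(default_factory=list)

def eviction_cost(pod: Pod) -> float:
    # 1.0 base plus the two scaled signals, clamped to [-10, 10].
    cost = 1.0 + pod.deletion_cost / 2**27 + pod.priority / 2**25
    return max(-10.0, min(10.0, cost))

def disruption_cost(node: Node) -> float:
    # Per-node cost of 1.0 plus non-negative per-pod costs.
    return 1.0 + sum(max(0.0, eviction_cost(p)) for p in node.pods)

def score(deleted: List[Node], created_prices: List[float], pool: List[Node]) -> float:
    savings = sum(n.price for n in deleted) - sum(created_prices)
    if savings <= 0:
        return 0.0                # zero or negative savings: never approved
    savings_fraction = savings / sum(n.price for n in pool)
    disruption_fraction = (sum(disruption_cost(n) for n in deleted)
                           / sum(disruption_cost(n) for n in pool))
    return savings_fraction / disruption_fraction

def approved(move_score: float, k: int = 2) -> bool:
    return move_score >= 1.0 / k
```

In this sketch, an empty-node DELETE in a uniform pool scores 1.0 and passes, while a same-price replace scores 0 and is rejected at any k, matching the edge cases below.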
+ +**Division-by-zero handling.** With the per-node disruption cost of 1.0, `disruption_cost` is always positive for any node, eliminating the zero-disruption special case. The remaining edge cases: + +| savings | nodepool_total_cost | nodepool_total_disruption_cost | decision | +|---|---|---|---| +| positive | positive | positive | compute score normally | +| zero | any | any | reject (no benefit) | +| negative | any | any | reject (net loss) | +| near-zero | near-zero (ODCR pool) | any | compute normally; the 1/10M divisor cancels and the score equals the on-demand price ratio | +| any | any | zero | cannot happen (per-node disruption cost ensures > 0) | + +See [Edge Cases](#edge-cases) for worked examples. + +Feasibility checks (PDBs, `karpenter.sh/do-not-disrupt`, scheduling constraints) filter which moves can be generated. `consolidateAfter` determines which nodes are candidates. Scoring evaluates which feasible moves are worth executing. Disruption budgets gate how many execute per cycle. + +### Move Score as Ranking Function + +When multiple moves pass the threshold and a disruption budget limits how many execute, the score determines execution order. Today, `WhenEmptyOrUnderutilized` ranks by disruption alone (lowest first). Score-based ranking accounts for both savings and disruption. Greedy-by-ratio is the standard knapsack heuristic, [optimal for the continuous case](https://en.wikipedia.org/wiki/Continuous_knapsack_problem#Solution) and near-optimal for the discrete case when items are small relative to the budget. + +![Ranking consolidation moves: score vs. single-dimension ranking](ranking-strategies.png) + +The graphs show REPLACE and DELETE moves from a simulated cluster: 5000 pods with log-normal CPU/memory requests, packed across c7i/m7i/r7i instances, after 10 rounds of workload churn. Each curve shows cumulative savings vs. cumulative disruption under a different ranking strategy. 
At every disruption level, score-based ranking delivers more savings than the alternatives (see [`balanced-consolidation-ranking.py`](scripts/balanced-consolidation-ranking.py)). + +### Edge Cases + +#### Empty Nodes + +An empty node has disruption cost 1.0 (per-node cost only). A DELETE saves the full node cost against a small disruption fraction. Empty DELETEs always pass. No special case needed. + +#### Single-Node Pool + +A single-node pool DELETE scores exactly 1.0 (`savings_fraction = 1.0`, `disruption_fraction = 1.0`). This passes the default threshold. Deleted pods become unschedulable and trigger a new provisioning cycle. See [Open Questions](#open-questions) for whether the formula should have a floor on pool size. + +#### Near-Zero-Cost Nodes (ODCRs, Reserved Capacity) + +Karpenter prices ODCRs and reserved instances at on-demand price divided by a large constant, keeping the most expensive ODCR cheaper than the cheapest spot node (see the provider's pricing implementation for the exact divisor). Within an ODCR-only pool, the divisor cancels in the score (it divides both savings and total cost). Scores equal the on-demand price ratios, so ODCR pools consolidate normally. This is correct: replacing an expensive reservation with a cheaper one frees capacity. The divided values are small but well within float64 precision. + +When a positive-cost source node is consolidated and its pods land on an ODCR destination node, this is a DELETE from the source pool's perspective. The score reflects the source pool's cost structure. The destination node's near-zero cost does not affect the score. + +### Candidate Filtering + +Move generation is expensive (find a destination, compute replacement costs, verify scheduling). A node's best possible score is its delete ratio: a DELETE saving the full node cost with no replacement. 
+ +``` +delete_ratio = (node.price / nodepool_cost) / (node.disruption_cost / nodepool_total_disruption_cost) +``` + +If this ratio is below 1/k, this node is not a good consolidation candidate. A DELETE saves the full node cost — a REPLACE saves strictly less because the replacement has positive cost. If the best case (DELETE) doesn't pass, nothing will. The system skips move generation for that node. + +This filter applies to single-node consolidation. A group of individually-failing nodes could produce a passing batch if their combined savings outweigh combined disruption. The filter misses these opportunities. Evaluating all multi-node combinations is exponential, so the implementation takes single-node savings first and attempts multi-node moves only when single-node opportunities are exhausted. + +The NodePool totals only need to be sensible relative to each other. The implementation may cache totals or estimate them from a subset of nodes, as long as cost and disruption are estimated from the same sample. + +### Interaction with Existing Features + +Existing feasibility checks (disruption budgets, PDBs, `consolidateAfter`, `do-not-disrupt`) are unchanged. Scoring applies only to consolidation, not to spot interruptions, expiration, or drift. + +**Consolidation cycling loops.** The scoring filter is the primary mechanism that breaks cycling loops ([aws#8536](https://github.com/aws/karpenter-provider-aws/issues/8536), [aws#6642](https://github.com/aws/karpenter-provider-aws/issues/6642), [aws#7146](https://github.com/aws/karpenter-provider-aws/issues/7146)). A same-type REPLACE that saves $0 scores 0 and is rejected at any threshold. A near-zero-savings move with meaningful disruption scores below 1/k and is also rejected. These are the moves that produce multi-hour loops under `WhenEmptyOrUnderutilized`, where any positive savings (including rounding-error savings) triggers a replace. 
The `consolidateAfter` cooldown on destination nodes (described below) provides a secondary damping effect between rounds but is not sufficient on its own to prevent cycling. + +`consolidateAfter` determines candidacy: a node whose last pod event is within the `consolidateAfter` window is not a candidate. Non-candidate nodes still contribute to the denominators, preventing the post-replacement cycle from being artificially aggressive. When a move executes, pods landing on the destination node reset its `consolidateAfter` timer, temporarily removing it from candidacy. This provides a natural cooldown between consolidation rounds. + +### Observability + +Approved and rejected moves are surfaced as events. Single-node moves emit on the NodeClaim. Multi-node moves emit on the NodePool (the score describes the move, not any single node). + +- `ConsolidationApproved`: `"score %.2f >= threshold %.2f (k: %d, savings %.1f%%, disruption %.1f%%)"` +- `ConsolidationRejected`: `"score %.2f < threshold %.2f (k: %d, savings %.1f%%, disruption %.1f%%)"` + +Scored moves are also logged at DEBUG level. + +A Prometheus histogram `karpenter_consolidation_score` records scores by decision and NodePool, with buckets {0.1, 0.25, 0.33, 0.5, 1.0, 2.0, 5.0, 10.0}. A counter `karpenter_consolidation_moves_total` by decision and NodePool tracks move volume. + +How to use these: if `moves_total{decision="rejected"}` is high and the histogram shows most rejected scores clustered just below the threshold, the operator's k is slightly too conservative — raising it by 1 would capture those moves. If `moves_total{decision="approved"}` is zero and `moves_total{decision="rejected"}` is also zero, there are no candidates (the cluster is well-packed or `consolidateAfter` hasn't elapsed). If approved moves are high but savings aren't materializing, kube-scheduler divergence may be undoing the moves (check for rapidly churning NodeClaims). 
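
As a sketch of how the event pair above could be rendered (illustrative only; the controller's actual emission code may differ):

```python
def consolidation_event(score: float, k: int,
                        savings_fraction: float, disruption_fraction: float):
    # Mirrors the ConsolidationApproved / ConsolidationRejected templates above.
    threshold = 1.0 / k
    passed = score >= threshold
    reason = "ConsolidationApproved" if passed else "ConsolidationRejected"
    op = ">=" if passed else "<"
    message = (f"score {score:.2f} {op} threshold {threshold:.2f} "
               f"(k: {k}, savings {savings_fraction * 100:.1f}%, "
               f"disruption {disruption_fraction * 100:.1f}%)")
    return reason, message
```

For an approved move with score 3.33 at k=2, this yields `ConsolidationApproved` with a message beginning `score 3.33 >= threshold 0.50`.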
+ +## Examples + +All examples use `consolidationPolicy: Balanced` (k=2). The NodePool has 10 nodes: eight m7i.xlarge (4 vCPU, 16 GiB, $4.84/day) and two m7i.2xlarge (8 vCPU, 32 GiB, $9.68/day). Total NodePool cost is $58.08/day. The NodePool runs 80 pods with total disruption cost 80. + +### Oversized Node (approved) + +One m7i.2xlarge runs 3 pods requesting 1.5 vCPU and 6 GiB total. Disruption cost is 3. These pods fit on an m7i.large at $2.42/day. Savings is $7.26. + +``` +savings_fraction = 7.26 / 58.08 = 12.5% +disruption_fraction = 3 / 80 = 3.75% +score = 0.125 / 0.0375 = 3.33 > 0.5 --> approved +``` + +### Spare Capacity Delete (approved) + +One m7i.xlarge runs 4 pods requesting 1.5 vCPU and 6 GiB. Disruption cost is 4. Another node has spare capacity. Savings is $4.84 (full node cost, no replacement needed). + +``` +savings_fraction = 4.84 / 58.08 = 8.3% +disruption_fraction = 4 / 80 = 5.0% +score = 0.083 / 0.05 = 1.67 > 0.5 --> approved +``` + +### Marginal Move (rejected) + +One m7i.xlarge runs 8 pods requesting 1.8 vCPU and 7 GiB. Disruption cost is 8. The pods fit on an m7i.large at $2.42/day. Savings is $2.42. + +``` +savings_fraction = 2.42 / 58.08 = 4.2% +disruption_fraction = 8 / 80 = 10.0% +score = 0.042 / 0.10 = 0.42 < 0.5 --> rejected +``` + +### Well-Packed Node (rejected) + +One m7i.xlarge runs 10 pods requesting 3.5 vCPU and 14 GiB. The smallest fitting replacement is another m7i.xlarge. Savings is $0. No threshold approves this move. + +### Uniform Pool Replace (approved) + +All 10 nodes are m7i.xlarge ($4.84/day each, $48.40/day total, 80 pods, disruption cost 80). One node's 8 pods fit on an m7i.large ($2.42/day). Savings is $2.42. + +``` +savings_fraction = 2.42 / 48.40 = 5.0% +disruption_fraction = 8 / 80 = 10.0% +score = 0.05 / 0.10 = 0.50 >= 0.5 --> approved +``` + +The replacement costs exactly half the original. In a uniform pool, this is the boundary at k=2: a replace that saves less than half the source node's cost is rejected. 
In a heterogeneous pool, the threshold depends on pool-level fractions, not per-node percentage. At k=1, the score simplifies to `1 - replacement_price / node_price`, which never reaches 1.0. k=2 is the smallest value that makes uniform-pool REPLACEs viable. + +### Scale Invariance + +The same oversized-node scenario on a 100-node NodePool ($580.80/day total cost, 800 total disruption cost) produces the same score: + +``` +savings_fraction = 7.26 / 580.80 = 1.25% +disruption_fraction = 3 / 800 = 0.375% +score = 0.0125 / 0.00375 = 3.33 +``` + +The threshold produces the same decision regardless of NodePool size. + +### Heterogeneous Disruption Cost + +Two m7i.xlarge nodes each run 4 pods and can be deleted (pods fit on other nodes). Both save $4.84/day. Node A runs 4 stateless proxies with default disruption cost (total 4). Node B runs 1 stateless proxy (cost 1) and 3 model-serving pods with `pod-deletion-cost: 2147483647` (cost ~10 each, total 31). The NodePool total disruption cost is 107 (76 default-cost pods + node B's 31). + +**Node A (approved):** + +``` +savings_fraction = 4.84 / 58.08 = 8.3% +disruption_fraction = 4 / 107 = 3.7% +score = 0.083 / 0.037 = 2.24 > 0.5 --> approved +``` + +**Node B (rejected):** + +``` +savings_fraction = 4.84 / 58.08 = 8.3% +disruption_fraction = 31 / 107 = 29.0% +score = 0.083 / 0.29 = 0.29 < 0.5 --> rejected +``` + +Same savings, same node count, same pod count. The score rejects node B because the model-serving pods are expensive to restart. This is the score's main advantage over alternatives that ignore disruption cost: it distinguishes nodes where disruption is cheap from nodes where it is not. + +### Cross-NodePool: On-Demand and Spot + +A cluster has two NodePools. The On-Demand pool has 10 m7i.xlarge nodes at $4.84/day each ($48.40/day total, 80 pods, total disruption cost 80). The Spot pool has 10 m7i.xlarge nodes at $1.45/day each ($14.50/day total, 80 pods, total disruption cost 80). 
+ +One node in each pool runs 3 pods requesting 1 vCPU. Disruption cost is 3. The pods can be absorbed by other nodes. This is a DELETE of the source node. + +**On-Demand pool DELETE:** + +``` +savings_fraction = 4.84 / 48.40 = 10.0% +disruption_fraction = 3 / 80 = 3.75% +score = 0.10 / 0.0375 = 2.67 > 0.5 --> approved +``` + +**Spot pool DELETE:** + +``` +savings_fraction = 1.45 / 14.50 = 10.0% +disruption_fraction = 3 / 80 = 3.75% +score = 0.10 / 0.0375 = 2.67 > 0.5 --> approved +``` + +Both moves score identically because each node represents the same fraction of its pool's cost and disrupts the same fraction of its pool's pods. + +If the Spot pool node instead runs 8 pods with disruption cost 8: + +``` +savings_fraction = 1.45 / 14.50 = 10.0% +disruption_fraction = 8 / 80 = 10.0% +score = 0.10 / 0.10 = 1.0 > 0.5 --> approved +``` + +The move is approved. To reach the 0.5 boundary, each of the 8 pods would need disruption cost 2 (disruption fraction 20%, score 0.50). + +## Why k=2 + +The scoring formula has one free parameter: k (exposed via `consolidationPolicy`). We chose k=2 as the default by exhaustive enumeration. + +### State Space + +We enumerate configurations in a bounded space using on-demand prices from three instance families in us-east-1: c7i (compute-optimized), m7i (general-purpose), and r7i (memory-optimized), medium through 4xlarge (15 price points). 1 to 6 nodes per pool, 0 to 4 pods per node, per-pod disruption cost in {1, 2, 5, 10} (max per-node disruption cost: 4 pods x 10 = 40). For each configuration, we evaluate every candidate move (Delete and Replace to every cheaper price) at every k from 1 through 5. Three families matter because cross-family replacement ratios are not power-of-2. + +Properties 1-4, 6, and 7 are algebraic properties of the ratio formula and hold for any pod count — the enumeration is a sanity check, not the proof. Properties 5 and 8 are empirical results bounded by the enumerated price structure and pod counts. 
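
The algebraic properties can be sanity-checked without the full enumeration. This sketch (a hypothetical helper, not taken from the enumeration scripts) computes the score of a single REPLACE in a uniform pool of `n` nodes at price `p` with `m` default-cost pods per node; `n` cancels, leaving `1 - r / p`:

```python
def uniform_replace_score(n: int, p: float, r: float, m: int) -> float:
    # One node (price p, m pods of cost 1.0 each, per-node cost 1.0)
    # is replaced by a node of price r; all other nodes are identical.
    savings_fraction = (p - r) / (n * p)
    disruption_fraction = (1.0 + m) / (n * (1.0 + m))  # = 1/n
    return savings_fraction / disruption_fraction       # = 1 - r/p
```

Because the score is `1 - r/p`, it reaches 1.0 only when the replacement is free, which is why no replace passes at k=1, while a half-price replace scores exactly 0.5 and passes at k=2. Pool size dropping out is the fleet-size-independence property.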
+ +### What a good scoring function does + +Eight properties define what we want from the scoring formula. The first seven are about the formula itself. The eighth is about what happens over time: a REPLACE creates a new node, and that node might itself be a consolidation candidate. An m7i.xlarge replaces to m7i.large, which replaces to c7i.medium, and so on. We call this sequence a **churn chain**. A good scoring function produces churn chains that converge (reach a stable instance type) rather than cycle. + +1. **Monotonicity in savings.** Cheaper replacement never makes approval harder. +2. **Monotonicity in disruption.** Higher disruption never makes approval easier. +3. **Empty nodes always deletable.** Zero disruption, positive savings: always approved. +4. **Zero-savings moves never approved.** Same-price replacement scores zero. +5. **Replaces work in uniform pools.** The minimum useful k is the smallest value where meaningful replaces pass. +6. **Skewed disruption differentiates.** High-disruption pods make their node harder to approve. +7. **Fleet size independence.** Pool size cancels algebraically in uniform pools. +8. **Bounded churn.** Churn chains converge and terminate quickly. + +Properties 1-4, 6, and 7 hold at all k values (structural properties of the ratio). Properties 5 and 8 select k. + +### Results + +| k | min score (1/k) | approved replace pairs | new pairs vs k-1 | max churn chain | +|---|-----------------|----------------------|-----------------|-----------------| +| 1 | 1.000 | 0 | — | 0 | +| 2 | 0.500 | 78 | 78 | 4 steps | +| 3 | 0.333 | 86 | 8 | 4 steps | +| 4 | 0.250 | 95 | 9 | 9 steps | +| 5 | 0.200 | 100 | 5 | 9 steps | + +At k=1, no replace is ever approved in a uniform pool. The score for a uniform-pool replace simplifies to `1 - replacement_price / node_price`, which requires a free replacement to reach 1.0. + +k=2 is the smallest integer where uniform-pool REPLACEs pass. 
Within a single family, prices follow power-of-2 scaling, so every replacement ratio is 0.5 or less and k>=3 adds nothing. Across families, k=3 opens 8 additional cross-family pairs (e.g., c7i.large → m7i.medium at 43% savings, score 0.43) without increasing the max churn chain. k=4 opens 9 more pairs but allows 9-step churn chains that zigzag through all three families. Max chain lengths are confirmed by exhaustive DFS over all approved replacement paths, not a greedy heuristic (see [`balanced-consolidation-properties.py`](scripts/balanced-consolidation-properties.py)). + +k=2 is the right default. It is the smallest value that makes within-family REPLACEs viable, and it captures all cross-family pairs where the replacement costs less than half the original. The 8 additional cross-family pairs at k=3 are available to operators who set `consolidationPolicy: 3` (see [`balanced-consolidation-properties.py`](scripts/balanced-consolidation-properties.py)). + +## API Choices + +### Consolidation Aggressiveness Tuning [Recommended: IntOrString consolidationPolicy] + +`consolidationPolicy` is an `IntOrString` field. `Balanced` maps to k=2. Integer values (1-3) pass k directly. This gives most users a single named value while preserving an escape hatch for operators who need a different threshold. + +The formally motivated values are all integers: k=1 (break-even, deletes only), k=2 (within-family replaces viable, 4-step max churn), k=3 (adds cross-family pairs, same 4-step churn). At k=4, churn chains jump to 9 steps and the formal analysis argues against higher values. + +Two alternatives were considered: + +**Separate `consolidationThreshold` field.** A dedicated integer field alongside `consolidationPolicy: Balanced`. This works but adds a field that only applies to one policy value. Folding k into `consolidationPolicy` keeps the API surface smaller. + +**Named presets (Low/Medium/High).** A `consolidationAggressiveness` enum mapping to k values. 
Karpenter does not have an existing ordinal enum pattern, and picking names that age well is hard. "Conservative/Balanced/Aggressive" reuses "Balanced" which is already the policy name. IntOrString is simpler. + +### Per-NodePool vs. Per-Cluster Normalization [Recommended: Per-NodePool] + +The score denominators (total cost, total disruption cost) could be computed per-NodePool or across the entire cluster. This proposal recommends per-NodePool. + +Per-NodePool normalization matches Karpenter's existing architecture (policies, budgets, `consolidateAfter` are already per-NodePool). Per-cluster normalization would dilute small pool scores: a single node representing 10% of its own pool becomes 0.1% of the cluster. + +Con: scores reflect relative efficiency within a pool, not absolute dollar impact. A score of 2.0 in a $50/hr pool and 2.0 in a $10,000/hr pool look identical. This does not affect behavior but limits cross-pool comparison. + +### consolidationPolicy Naming + +No behavior change for existing users. Migration: change `WhenEmptyOrUnderutilized` to `Balanced`. + +Con: `WhenEmpty` and `WhenEmptyOrUnderutilized` describe behavior; `Balanced` describes character. Alternatives: `Scored` (exposes implementation), `CostWeighted` (implies cost-only), `WhenWorthIt` (too informal). `Balanced` is the least-bad option. + +## Backward Compatibility + +- `WhenEmptyOrUnderutilized` and `WhenEmpty` are unchanged. Existing NodePool specs continue to work. +- Per-pod disruption cost is computed from the existing `controller.kubernetes.io/pod-deletion-cost` annotation and pod priority via `EvictionCost`. Pods without these inputs default to disruption cost 1.0, matching current behavior (all pods are equal). + +## Open Questions + +- **Is k=2 the right default?** [Why k=2](#why-k2) explains why k=2 is the smallest value that makes uniform-pool REPLACEs viable. We do not know whether k=2 works well across diverse workloads. 
The feature gate and opt-in rollout exist to answer this empirically. + +- **Should single-node pools be exempt from scoring?** A single-node pool DELETE scores 1.0 and passes, making all pods unschedulable until provisioning creates a replacement. Disruption budgets (`nodes: 0`) prevent this today. The formula could refuse by requiring pool size > 1, but that adds a special case for something disruption budgets already handle. + +- **How many customers use `pod-deletion-cost` today?** If few do, the score reduces to savings-versus-pod-count. [PR #2894](https://github.com/kubernetes-sigs/karpenter/pull/2894) would automate annotation. + +- **Move quality tracking.** Annotate moved pods with the move's score. Track re-disruption rate (pods evicted again within minutes). High re-disruption means the threshold is too aggressive. + +- **Async move generation.** Scoring enables a shift from synchronous per-cycle consolidation to a continuous pipeline: generate, score, enqueue by priority, execute as budget allows. This eliminates the per-cycle timeout problem. + + +## Frequently Asked Questions + +### What happens in a uniformly inefficient cluster where no single REPLACE clears the threshold? + +Every REPLACE scores low (e.g., 0.2 for a one-tier downsize) and is rejected. DELETEs still work: a DELETE of a node whose pods fit elsewhere scores 1.0 in a uniform pool. The system consolidates by deleting nodes, not replacing them. Operators who want all feasible moves can use `WhenEmptyOrUnderutilized`. + +### Does the score account for kube-scheduler pod placement? + +No. The score assumes pods land on the intended destination. In practice, kube-scheduler may scatter pods across existing nodes instead of packing onto the replacement. The replacement ends up nearly empty and becomes a consolidation candidate itself. + +This exists in all consolidation modes. The cost threshold concentrates the remaining moves onto higher-impact candidates. 
The system self-corrects: a nearly-empty replacement scores as a trivial DELETE next cycle. Cascades terminate because each round has strictly fewer displaced nodes. + +Configuring kube-scheduler with `MostAllocated` scoring reduces divergence. + +### Why doesn't the score account for reserved instance or ODCR opportunity cost? + +See [Near-Zero-Cost Nodes](#near-zero-cost-nodes-odcrs-reserved-capacity). The 1/10M factor cancels in the score, so ODCR pools consolidate using on-demand price ratios. Real opportunity cost (what freed capacity is worth to the organization) is a different question that requires billing API integration. This RFC defers opportunity-cost modeling. + +### Where is the score visible? + +See [Observability](#observability). DEBUG logs, NodeClaim events, and a Prometheus histogram. + +### Will this become the default consolidation policy? + +Not at launch. `Balanced` is opt-in behind a feature gate. Whether it becomes the default is a community decision deferred to GA graduation. + +### Does constraining maximum node size improve this proposal? + +Rejections are driven by high disruption fraction relative to savings fraction. A 50-pod node in a 500-pod pool has 10% disruption fraction regardless of instance type. Constraining maximum node size reduces the per-node share of pool disruption, making moves easier to approve. The formula works correctly regardless of node size. Operators can manage node size through NodePool instance-type constraints today. + +## Rollout + +Standard feature gate pattern (SpotToSpotConsolidation, NodeOverlay). + +- **Alpha**: disabled by default. `--feature-gates BalancedConsolidation=true` to opt in. +- **Beta**: enabled by default. Disable if issues arise. +- **GA**: gate removed. Whether `Balanced` becomes the default policy is a separate decision. 
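To make the decision rule concrete, here is a minimal sketch (not the controller implementation) of the scoring check as exercised by [`balanced-consolidation-properties.py`](scripts/balanced-consolidation-properties.py), using the cross-multiplied form so the comparison stays in integer arithmetic:

```python
def approved(k, savings, move_disruption, pool_cost, pool_disruption):
    """A move passes iff k * savings_fraction >= disruption_fraction.

    Cross-multiplied to avoid division:
    k * savings * pool_disruption >= move_disruption * pool_cost.
    Prices are $/hr scaled by 10000, as in the properties script.
    """
    if move_disruption == 0:
        return savings > 0  # empty node: deleting it is always a win
    return k * savings * pool_disruption >= move_disruption * pool_cost


# Single-node pool: c7i.large (892) -> m7i.medium (504), one default-cost pod.
# Savings fraction = 388/892 ~= 0.43; disruption fraction = 1.0.
print(approved(2, 892 - 504, 1, 892, 1))  # False: 2 * 0.43 < 1, rejected at k=2
print(approved(3, 892 - 504, 1, 892, 1))  # True: 3 * 0.43 >= 1, approved at k=3
```

This is the same pair cited in [Why k=2](#why-k2): a 43% cross-family savings that only clears the threshold once `consolidationPolicy: 3` is set.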
diff --git a/designs/ranking-strategies.png b/designs/ranking-strategies.png new file mode 100644 index 0000000000..0a40673087 Binary files /dev/null and b/designs/ranking-strategies.png differ diff --git a/designs/scripts/balanced-consolidation-properties.py b/designs/scripts/balanced-consolidation-properties.py new file mode 100644 index 0000000000..f792a0a81d --- /dev/null +++ b/designs/scripts/balanced-consolidation-properties.py @@ -0,0 +1,415 @@ +#!/usr/bin/env python3 +""" +Exhaustive check of consolidation scoring properties. + +Decision rule: approved iff k * savings_fraction >= disruption_fraction +Cross-multiplied: k * savings * pool_disruption >= move_disruption * pool_cost + +State space: + - 1-6 nodes + - Prices from c7i, m7i, r7i on-demand (us-east-1), scaled to integers + - 0-4 pods per node, disruption cost in {1, 2, 5, 10} + - Actions: Delete(node), Replace(node, new_price) + - k: the decision constant (1 = break-even, 2 = savings count 2x) +""" + +from itertools import product as cartesian +from collections import defaultdict + +# On-demand prices in us-east-1, $/hr × 10000 for exact integer arithmetic. +# c7i (compute-optimized) +# c7i.medium $0.0446/hr → 446 +# c7i.large $0.0892/hr → 892 +# c7i.xlarge $0.1785/hr → 1785 +# c7i.2xlarge $0.3570/hr → 3570 +# c7i.4xlarge $0.7140/hr → 7140 +# m7i (general-purpose) +# m7i.medium $0.0504/hr → 504 +# m7i.large $0.1008/hr → 1008 +# m7i.xlarge $0.2016/hr → 2016 +# m7i.2xlarge $0.4032/hr → 4032 +# m7i.4xlarge $0.8064/hr → 8064 +# r7i (memory-optimized) +# r7i.medium $0.0661/hr → 661 +# r7i.large $0.1323/hr → 1323 +# r7i.xlarge $0.2646/hr → 2646 +# r7i.2xlarge $0.5292/hr → 5292 +# r7i.4xlarge $1.0584/hr → 10584 +PRICES = sorted([ + 446, 892, 1785, 3570, 7140, # c7i + 504, 1008, 2016, 4032, 8064, # m7i + 661, 1323, 2646, 5292, 10584, # r7i +]) +DCOSTS = [1, 2, 5, 10] +K_VALUES = [1, 2, 3] + + +def approved(k, savings, move_disruption, pool_cost, pool_disruption): + """Cross-multiplied decision rule. 
Handles zero-disruption edge case.""" + if move_disruption == 0: + return savings > 0 + if pool_disruption == 0: + return savings > 0 + if pool_cost == 0: + return False + return k * savings * pool_disruption >= move_disruption * pool_cost + + +def score(savings, move_disruption, pool_cost, pool_disruption): + """Actual score (float). For reporting only, not for decisions.""" + if pool_cost == 0 or pool_disruption == 0: + return float('inf') if savings > 0 else 0.0 + sf = savings / pool_cost + df = move_disruption / pool_disruption + if df == 0: + return float('inf') if sf > 0 else 0.0 + return sf / df + + +# ─── P1: Monotonicity in savings ─── + +def check_p1(): + """If replace(n, q) is approved and p < q (more savings), then replace(n, p) is approved.""" + violations = [] + for k in K_VALUES: + for n_nodes in range(1, 5): + for node_price in PRICES: + for pod_config in pod_configs(max_pods=4): + move_disruption = sum(pod_config) + # Other nodes: try a few representative configs + for other_price in PRICES[:3]: + for other_pods in [0, 2, 4]: + other_disruption = other_pods * 1 # default cost + pool_cost = node_price + (n_nodes - 1) * other_price + pool_disruption = move_disruption + (n_nodes - 1) * other_disruption + for q in PRICES: + if q >= node_price: + continue + savings_q = node_price - q + if not approved(k, savings_q, move_disruption, pool_cost, pool_disruption): + continue + # q is approved. Check all p < q. 
+ for p in PRICES: + if p >= q: + continue + savings_p = node_price - p + if not approved(k, savings_p, move_disruption, pool_cost, pool_disruption): + violations.append((k, node_price, q, p, n_nodes)) + return violations + + +# ─── P2: Monotonicity in disruption ─── + +def check_p2(): + """Higher disruption on a node should make approval at least as hard.""" + violations = [] + for k in K_VALUES: + for n_nodes in range(2, 5): + for price in PRICES: + for other_price in PRICES[:3]: + for other_pods in [0, 2, 4]: + other_disruption = other_pods * 1 + base_pool_cost = price + (n_nodes - 1) * other_price + base_other_disruption = (n_nodes - 1) * other_disruption + for d_high in range(1, 41): + for d_low in range(0, d_high): + pool_d_high = d_high + base_other_disruption + pool_d_low = d_low + base_other_disruption + # Delete savings = price for both + app_high = approved(k, price, d_high, base_pool_cost, pool_d_high) + app_low = approved(k, price, d_low, base_pool_cost, pool_d_low) + if app_high and not app_low: + violations.append((k, price, d_high, d_low, n_nodes)) + return violations + + +# ─── P3: Empty nodes always approved for delete ─── + +def check_p3(): + """Node with 0 pods, positive price: delete always approved.""" + violations = [] + for k in K_VALUES: + for n_nodes in range(1, 7): + for price in PRICES: + for other_price in PRICES: + for other_pods in range(0, 5): + other_disruption = other_pods * 1 + pool_cost = price + (n_nodes - 1) * other_price + pool_disruption = 0 + (n_nodes - 1) * other_disruption + if not approved(k, price, 0, pool_cost, pool_disruption): + violations.append((k, price, n_nodes)) + return violations + + +# ─── P4: Zero-savings moves never approved ─── + +def check_p4(): + """Replace with same price: savings = 0, never approved.""" + violations = [] + for k in K_VALUES: + for price in PRICES: + for pod_config in pod_configs(max_pods=4): + move_disruption = sum(pod_config) + for n_nodes in range(1, 5): + pool_cost = n_nodes * price 
+ pool_disruption = n_nodes * move_disruption + if approved(k, 0, move_disruption, pool_cost, pool_disruption): + violations.append((k, price, pod_config)) + return violations + + +# ─── P5: Single-node pool replaces ─── + +def check_p5(): + """For each k, find all approved replaces in single-node pools.""" + results = defaultdict(list) + for k in K_VALUES: + for price in PRICES: + for pod_config in pod_configs(max_pods=4): + if not pod_config: + continue # empty node, trivial + move_disruption = sum(pod_config) + pool_cost = price + pool_disruption = move_disruption + for np in PRICES: + if np >= price: + continue + savings = price - np + if approved(k, savings, move_disruption, pool_cost, pool_disruption): + s = score(savings, move_disruption, pool_cost, pool_disruption) + results[k].append({ + 'price': price, 'replacement': np, + 'pods': pod_config, 'disruption': move_disruption, + 'savings_pct': 100 * savings / price, + 'score': s, + }) + return results + + +# ─── P6: Skewed disruption cost ─── + +def check_p6(): + """Find pools where high-disruption node is rejected but low-disruption is approved.""" + examples = [] + for k in [1, 2, 3]: + for n_nodes in range(2, 5): + for price in PRICES: + for d_high_pods in pod_configs(max_pods=4): + for d_low_pods in pod_configs(max_pods=4): + d_high = sum(d_high_pods) + d_low = sum(d_low_pods) + if d_high <= d_low or d_low == 0: + continue + other_disruption = (n_nodes - 1) * 4 * 1 # other nodes: 4 default pods + pool_cost = n_nodes * price + # Both nodes are in the pool + pool_disruption = d_high + d_low + (n_nodes - 2) * 4 * 1 if n_nodes > 2 else d_high + d_low + app_high = approved(k, price, d_high, pool_cost, pool_disruption) + app_low = approved(k, price, d_low, pool_cost, pool_disruption) + if app_low and not app_high: + examples.append({ + 'k': k, 'price': price, 'n_nodes': n_nodes, + 'high_pods': d_high_pods, 'low_pods': d_low_pods, + 'd_high': d_high, 'd_low': d_low, + }) + if len(examples) >= 3: + return 
examples + return examples + + +# ─── P7: Fleet size independence ─── + +def check_p7(): + """For uniform pools, check that the decision doesn't change with pool size.""" + violations = [] + for k in K_VALUES: + for price in PRICES: + for pod_config in pod_configs(max_pods=4): + if not pod_config: + continue + move_disruption = sum(pod_config) + for np in PRICES: + if np >= price: + continue + savings = price - np + decisions = {} + for n_nodes in range(1, 7): + pool_cost = n_nodes * price + pool_disruption = n_nodes * move_disruption + decisions[n_nodes] = approved(k, savings, move_disruption, pool_cost, pool_disruption) + vals = set(decisions.values()) + if len(vals) > 1: + violations.append((k, price, np, pod_config, decisions)) + return violations + + +# ─── P8: Churn chains ─── + +def check_p8(): + """Find the longest replace chain via exhaustive DFS over all approved next-steps. + + At each step, every cheaper price that the scoring formula approves is a valid + next node in the chain. 
DFS explores all of them and returns the true maximum + chain length — not just the greedy "least aggressive step" path, which could + miss longer cross-family zigzag paths.""" + results = defaultdict(list) + for k in K_VALUES: + for pod_config in pod_configs(max_pods=4): + if not pod_config: + continue + move_disruption = sum(pod_config) + for n_nodes in range(1, 5): + for start_price in PRICES: + other_cost = (n_nodes - 1) * start_price + other_disruption = (n_nodes - 1) * move_disruption + longest_chain = _dfs_longest_chain( + k, start_price, [start_price], + move_disruption, other_cost, other_disruption, + ) + if len(longest_chain) > 1: + results[k].append({ + 'n_nodes': n_nodes, 'pods': pod_config, + 'chain': longest_chain, 'steps': len(longest_chain) - 1, + }) + # Keep only the longest chains per k + for k in results: + results[k].sort(key=lambda x: -x['steps']) + results[k] = results[k][:5] + return results + + +def _dfs_longest_chain(k, current_price, chain, move_disruption, other_cost, other_disruption): + """Return the longest chain reachable from current_price via any sequence of approved replaces.""" + best = list(chain) + for np in PRICES: + if np >= current_price: + continue + savings = current_price - np + pool_cost = current_price + other_cost + pool_disruption = move_disruption + other_disruption + if approved(k, savings, move_disruption, pool_cost, pool_disruption): + candidate = _dfs_longest_chain( + k, np, chain + [np], + move_disruption, other_cost, other_disruption, + ) + if len(candidate) > len(best): + best = candidate + return best + + +# ─── Utility ─── + +def pod_configs(max_pods=4): + """Generate representative pod configurations (tuples of disruption costs).""" + configs = [()] # empty + for n in range(1, max_pods + 1): + # Don't enumerate all combos — use representative configs + for d in DCOSTS: + configs.append(tuple([d] * n)) # uniform + if n >= 2: + configs.append((1, 10)) # mixed + configs.append((1, 1, 10)) # mostly low, one high 
+ configs.append((10, 10, 1)) # mostly high, one low + if n >= 3: + configs.append((1, 5, 10)) # spread + if n == 4: + configs.append((1, 1, 1, 10)) + configs.append((1, 2, 5, 10)) + configs.append((10, 10, 10, 1)) + return configs + + +# ─── Main ─── + +def main(): + print("=" * 70) + print("Consolidation Scoring Properties — Exhaustive Check") + print("=" * 70) + + print("\n── P1: Monotonicity in savings ──") + v = check_p1() + if v: + print(f" FAIL: {len(v)} violations found!") + for x in v[:3]: + print(f" k={x[0]} price={x[1]} approved_at={x[2]} rejected_at={x[3]} nodes={x[4]}") + else: + print(" PASS: no violations") + + print("\n── P2: Monotonicity in disruption ──") + v = check_p2() + if v: + print(f" FAIL: {len(v)} violations found!") + for x in v[:3]: + print(f" k={x[0]} price={x[1]} d_high={x[2]} d_low={x[3]} nodes={x[4]}") + else: + print(" PASS: no violations") + + print("\n── P3: Empty nodes always approved for delete ──") + v = check_p3() + if v: + print(f" FAIL: {len(v)} violations found!") + for x in v[:3]: + print(f" k={x[0]} price={x[1]} nodes={x[2]}") + else: + print(" PASS: no violations") + + print("\n── P4: Zero-savings moves never approved ──") + v = check_p4() + if v: + print(f" FAIL: {len(v)} violations found!") + for x in v[:3]: + print(f" k={x[0]} price={x[1]} pods={x[2]}") + else: + print(" PASS: no violations") + + print("\n── P5: Single-node pool replaces ──") + results = check_p5() + for k in K_VALUES: + moves = results.get(k, []) + pairs = set((m['price'], m['replacement']) for m in moves) + print(f" k={k}: {len(pairs)} unique price pairs, {len(moves)} configurations") + for m in moves[:5]: + print(f" {m['price']} → {m['replacement']} " + f"saves {m['savings_pct']:.0f}% " + f"disruption={m['disruption']} " + f"score={m['score']:.2f}") + + print("\n── P6: Skewed disruption cost ──") + examples = check_p6() + if examples: + for ex in examples[:3]: + print(f" k={ex['k']} price={ex['price']}¢ nodes={ex['n_nodes']}") + print(f" 
high-disruption pods {ex['high_pods']} (d={ex['d_high']}): REJECTED") + print(f" low-disruption pods {ex['low_pods']} (d={ex['d_low']}): APPROVED") + else: + print(" No examples found") + + print("\n── P7: Fleet size independence (uniform pools) ──") + v = check_p7() + if v: + print(f" FAIL: {len(v)} violations!") + for x in v[:3]: + print(f" k={x[0]} price={x[1]} replacement={x[2]} pods={x[3]}") + print(f" decisions by pool size: {x[4]}") + else: + print(" PASS: decision is identical across pool sizes 1-6") + + print("\n── P8: Churn chains (with actual post-replace pool updates) ──") + results = check_p8() + for k in K_VALUES: + chains = results.get(k, []) + if chains: + longest = chains[0] + print(f" k={k}: longest chain = {longest['steps']} steps") + for c in chains[:3]: + arrow = " → ".join(f"{p}¢" for p in c['chain']) + print(f" nodes={c['n_nodes']} pods={c['pods']} chain: {arrow}") + else: + print(f" k={k}: no churn (no replaces approved)") + + print("\n" + "=" * 70) + print("Done.") + + +if __name__ == "__main__": + main() diff --git a/designs/scripts/balanced-consolidation-ranking.py b/designs/scripts/balanced-consolidation-ranking.py new file mode 100644 index 0000000000..d868064717 --- /dev/null +++ b/designs/scripts/balanced-consolidation-ranking.py @@ -0,0 +1,475 @@ +#!/usr/bin/env python3 +""" +Generate ranking-strategies charts for the consolidation cost threshold RFC. + +Simulates a realistic cluster by: +1. Sampling pod workloads from AWS Fargate task sizes (vCPU, memory pairs) +2. Bin-packing pods onto EC2 instance types (c7i, m7i, r7i families) +3. Killing a variable fraction of pods per node to simulate scale-down +4. Generating REPLACE moves (reprovision remaining pods on cheapest instance) +5. Generating DELETE moves (scatter pods onto existing spare capacity) + +Produces two charts showing cumulative savings vs. cumulative disruption +under four ranking strategies: score, savings-only, disruption-only, random. 
+
+Usage:
+    python3 designs/scripts/balanced-consolidation-ranking.py
+
+Output:
+    designs/ranking-strategies.png
+    designs/scripts/ranking-strategies-{all,replace,delete}.csv
+"""
+
+import numpy as np
+import matplotlib.pyplot as plt
+from dataclasses import dataclass, field
+from pathlib import Path
+
+SEED = 42
+OUTPUT_DIR = Path(__file__).parent.parent
+
+# ---------------------------------------------------------------------------
+# EC2 instance types: c7i, m7i, r7i families (us-east-1, Linux, on-demand)
+# Source: https://instances.vantage.sh
+# ---------------------------------------------------------------------------
+EC2_INSTANCES = [
+    # c7i: compute optimized (2:1 memory:vCPU)
+    ("c7i.large", 2, 4, 0.0893),
+    ("c7i.xlarge", 4, 8, 0.1785),
+    ("c7i.2xlarge", 8, 16, 0.3570),
+    ("c7i.4xlarge", 16, 32, 0.7140),
+    ("c7i.8xlarge", 32, 64, 1.4280),
+    ("c7i.12xlarge", 48, 96, 2.1420),
+    ("c7i.16xlarge", 64, 128, 2.8560),
+    ("c7i.24xlarge", 96, 192, 4.2840),
+    # m7i: general purpose (4:1 memory:vCPU)
+    ("m7i.large", 2, 8, 0.1008),
+    ("m7i.xlarge", 4, 16, 0.2016),
+    ("m7i.2xlarge", 8, 32, 0.4032),
+    ("m7i.4xlarge", 16, 64, 0.8064),
+    ("m7i.8xlarge", 32, 128, 1.6128),
+    ("m7i.12xlarge", 48, 192, 2.4190),
+    ("m7i.16xlarge", 64, 256, 3.2260),
+    ("m7i.24xlarge", 96, 384, 4.8384),
+    ("m7i.48xlarge", 192, 768, 9.6768),
+    # r7i: memory optimized (8:1 memory:vCPU)
+    ("r7i.large", 2, 16, 0.1323),
+    ("r7i.xlarge", 4, 32, 0.2646),
+    ("r7i.2xlarge", 8, 64, 0.5292),
+    ("r7i.4xlarge", 16, 128, 1.0584),
+    ("r7i.8xlarge", 32, 256, 2.1168),
+    ("r7i.12xlarge", 48, 384, 3.1750),
+    ("r7i.16xlarge", 64, 512, 4.2336),
+    ("r7i.24xlarge", 96, 768, 6.3500),
+    ("r7i.48xlarge", 192, 1536, 12.7008),
+]
+
+# Fixed overhead of consolidating any node (independent of pod count)
+NODE_BASELINE_DISRUPTION = 1.0
+
+
+@dataclass
+class Pod:
+    cpu: float
+    mem: float
+    disruption_cost: float = 1.0
+
+
+@dataclass
+class Node:
+    name: str
+    cpu_cap: float
+    mem_cap: float
+    cost_per_hr: float
+    pods: list = field(default_factory=list)
+
+    @property
+    def cpu_used(self):
+        return sum(p.cpu for p in self.pods)
+
+    @property
+    def mem_used(self):
+        return sum(p.mem for p in self.pods)
+
+    def fits(self, pod):
+        return (pod.cpu <= self.cpu_cap - self.cpu_used and
+                pod.mem <= self.mem_cap - self.mem_used)
+
+    @property
+    def disruption_cost(self):
+        return NODE_BASELINE_DISRUPTION + sum(
+            max(0, p.disruption_cost) for p in self.pods)
+
+
+def find_cheapest_fitting(cpu_needed, mem_needed):
+    """Find the cheapest EC2 instance that fits the given resources."""
+    best = None
+    for name, vcpu, mem, cost in EC2_INSTANCES:
+        if vcpu >= cpu_needed and mem >= mem_needed:
+            if best is None or cost < best[3]:
+                best = (name, vcpu, mem, cost)
+    return best
+
+
+def cheapest_instance_for(pods):
+    """Find the cheapest instance type that fits all pods."""
+    total_cpu = sum(p.cpu for p in pods)
+    total_mem = sum(p.mem for p in pods)
+    return find_cheapest_fitting(total_cpu, total_mem)
+
+
+def build_cluster(rng, n_pods=20000):
+    """
+    Build a cluster by bin-packing pods onto EC2 instances.
+
+    Uses first-fit-decreasing: sort all pods by resource footprint (cpu * mem)
+    descending, then greedily place each pod onto the first existing node that
+    fits. When no node fits, provision the cheapest instance that can hold
+    the pod. This mirrors Karpenter's provisioning behavior and produces a
+    realistic instance type distribution.
+    """
+    # Generate pods with random resource requests. CPU and memory are drawn
+    # independently from log-normal distributions, producing a realistic mix
+    # of compute-heavy, memory-heavy, and balanced pods. This naturally
+    # spreads pods across different instance families (c7i for compute-heavy,
+    # r7i for memory-heavy, m7i for balanced), creating cost diversity.
+ cpu_raw = rng.lognormal(mean=0.0, sigma=1.0, size=n_pods) + mem_raw = rng.lognormal(mean=1.5, sigma=1.2, size=n_pods) + + # Clamp to reasonable ranges and round to Fargate-like granularity + all_pods = [] + for c, m in zip(cpu_raw, mem_raw): + cpu = float(np.clip(round(c * 2) / 2, 0.25, 16)) # 0.25 to 16, step 0.5 + mem = float(np.clip(round(m), 1, 120)) # 1 to 120 GiB, step 1 + dc = float(rng.choice([0, 1, 1, 1, 1, 1, 10])) + all_pods.append(Pod(cpu=cpu, mem=mem, disruption_cost=dc)) + + # First-fit-decreasing bin packing: sort all pods by resource footprint + # descending, then greedily place each onto the first existing node that + # fits. When no node fits, provision the cheapest instance that can hold + # the pod. This mirrors Karpenter's provisioning behavior. + all_pods.sort(key=lambda p: p.cpu * p.mem, reverse=True) + + nodes = [] + for pod in all_pods: + placed = False + for node in nodes: + if node.fits(pod): + node.pods.append(pod) + placed = True + break + if not placed: + chosen = find_cheapest_fitting(pod.cpu, pod.mem) + if chosen is None: + raise ValueError( + f"No instance fits pod ({pod.cpu}, {pod.mem})") + new_node = Node( + name=chosen[0], + cpu_cap=chosen[1], mem_cap=chosen[2], + cost_per_hr=chosen[3], + ) + new_node.pods.append(pod) + nodes.append(new_node) + + return nodes + + +def churn_pods(rng, nodes): + """Simulate workload churn: kill some pods, add some new ones. + + For each node, kill 0-80% of pods (simulating scale-down or + redeployment), then add 0-3 new randomly-sized pods if they fit + (simulating new deployments landing on nodes with spare capacity). + This changes the resource mix on each node over time, creating + mismatches between the node's instance type and its actual workload + -- exactly the situation that consolidation is designed to fix. 
+ """ + for node in nodes: + # Kill phase: remove a random fraction of pods + if len(node.pods) > 1: + kill_fraction = rng.uniform(0.0, 0.8) + n_kill = int(len(node.pods) * kill_fraction) + if n_kill > 0: + kill_indices = set( + rng.choice(len(node.pods), size=n_kill, replace=False)) + node.pods = [p for i, p in enumerate(node.pods) + if i not in kill_indices] + + # Add phase: place 0-3 new pods if they fit + n_add = rng.integers(0, 4) + for _ in range(n_add): + cpu = float(np.clip(round(rng.lognormal(0.0, 1.0) * 2) / 2, 0.25, 16)) + mem = float(np.clip(round(rng.lognormal(1.5, 1.2)), 1, 120)) + dc = float(rng.choice([0, 1, 1, 1, 1, 1, 10])) + pod = Pod(cpu=cpu, mem=mem, disruption_cost=dc) + if node.fits(pod): + node.pods.append(pod) + + +def generate_replace_moves(rng, nodes): + """ + For each node, find the cheapest instance that fits its current pods. + If cheaper than the current node, that's a REPLACE move. + """ + moves = [] + nodepool_cost = sum(n.cost_per_hr for n in nodes) + nodepool_disruption = sum(n.disruption_cost for n in nodes) + + for node in nodes: + if len(node.pods) == 0: + continue + + result = cheapest_instance_for(node.pods) + if result is None: + continue + + _, _, _, new_cost = result + savings = node.cost_per_hr - new_cost + disruption = node.disruption_cost + + if savings <= 0 or disruption <= 0: + continue + if nodepool_cost <= 0 or nodepool_disruption <= 0: + continue + + savings_frac = savings / nodepool_cost + disruption_frac = disruption / nodepool_disruption + score = savings_frac / disruption_frac + + moves.append({ + "savings": savings, + "disruption": disruption, + "savings_frac": savings_frac, + "disruption_frac": disruption_frac, + "score": score, + }) + + return moves + + +def generate_delete_moves(rng, nodes): + """ + For each node, check if its pods could fit on other nodes' spare capacity. + If so, generate a DELETE move (full node cost saved, all pods disrupted). 
+ """ + moves = [] + nodepool_cost = sum(n.cost_per_hr for n in nodes) + nodepool_disruption = sum(n.disruption_cost for n in nodes) + + for i, node in enumerate(nodes): + if len(node.pods) == 0: + continue + + # Check if all pods fit on spare capacity of other nodes + sorted_pods = sorted(node.pods, key=lambda p: (p.cpu, p.mem), + reverse=True) + + remaining = {} + for j in range(len(nodes)): + if j != i: + remaining[j] = ( + nodes[j].cpu_cap - nodes[j].cpu_used, + nodes[j].mem_cap - nodes[j].mem_used, + ) + + can_fit = True + for pod in sorted_pods: + best_j = None + best_waste = float("inf") + for j, (cpu_f, mem_f) in remaining.items(): + if cpu_f >= pod.cpu and mem_f >= pod.mem: + waste = (cpu_f - pod.cpu) + (mem_f - pod.mem) + if waste < best_waste: + best_waste = waste + best_j = j + if best_j is not None: + cpu_f, mem_f = remaining[best_j] + remaining[best_j] = (cpu_f - pod.cpu, mem_f - pod.mem) + else: + can_fit = False + break + + if not can_fit: + continue + + savings = node.cost_per_hr + disruption = node.disruption_cost + + if savings <= 0 or disruption <= 0: + continue + if nodepool_cost <= 0 or nodepool_disruption <= 0: + continue + + savings_frac = savings / nodepool_cost + disruption_frac = disruption / nodepool_disruption + score = savings_frac / disruption_frac + + moves.append({ + "savings": savings, + "disruption": disruption, + "savings_frac": savings_frac, + "disruption_frac": disruption_frac, + "score": score, + }) + + return moves + + +def plot_ranking(moves, title, output_path): + """Plot cumulative savings vs disruption for four ranking strategies.""" + if not moves: + print(f" No moves for {title}, skipping") + return + + rng = np.random.default_rng(SEED + 1) + n = len(moves) + + savings = np.array([m["savings_frac"] for m in moves]) + disruption = np.array([m["disruption_frac"] for m in moves]) + scores = np.array([m["score"] for m in moves]) + + strategies = { + "Score (benefit/cost)": np.argsort(-scores), + "Savings only": 
np.argsort(-savings), + "Disruption only (asc)": np.argsort(disruption), + "Random": rng.permutation(n), + } + + fig, ax = plt.subplots(figsize=(8, 6)) + + for label, order in strategies.items(): + cum_disruption = np.cumsum(disruption[order]) + cum_savings = np.cumsum(savings[order]) + if cum_disruption[-1] > 0: + cum_disruption = cum_disruption / cum_disruption[-1] + if cum_savings[-1] > 0: + cum_savings = cum_savings / cum_savings[-1] + ax.plot(cum_disruption, cum_savings, label=label, linewidth=2) + + ax.set_xlabel("Cumulative disruption (fraction of total)") + ax.set_ylabel("Cumulative savings (fraction of total)") + ax.set_title(title) + ax.legend(loc="lower right") + ax.set_xlim(0, 1) + ax.set_ylim(0, 1) + ax.set_aspect("equal") + ax.grid(True, alpha=0.3) + + fig.tight_layout() + fig.savefig(output_path, dpi=150) + print(f" Saved {output_path}") + + +def _plot_panel(ax, moves, title): + """Plot one panel of cumulative savings vs disruption.""" + rng = np.random.default_rng(SEED + 1) + n = len(moves) + + savings = np.array([m["savings_frac"] for m in moves]) + disruption = np.array([m["disruption_frac"] for m in moves]) + scores = np.array([m["score"] for m in moves]) + + strategies = { + "Score (benefit/cost)": np.argsort(-scores), + "Savings only": np.argsort(-savings), + "Disruption only (asc)": np.argsort(disruption), + "Random": rng.permutation(n), + } + + for label, order in strategies.items(): + cum_disruption = np.cumsum(disruption[order]) + cum_savings = np.cumsum(savings[order]) + if cum_disruption[-1] > 0: + cum_disruption = cum_disruption / cum_disruption[-1] + if cum_savings[-1] > 0: + cum_savings = cum_savings / cum_savings[-1] + ax.plot(cum_disruption, cum_savings, label=label, linewidth=2) + + ax.set_xlabel("Cumulative disruption (fraction of total)") + ax.set_ylabel("Cumulative savings (fraction of total)") + ax.set_title(title) + ax.legend(loc="lower right", fontsize=8) + ax.set_xlim(0, 1) + ax.set_ylim(0, 1) + ax.set_aspect("equal") + 
ax.grid(True, alpha=0.3) + + +def plot_ranking_sidebyside(replace_moves, delete_moves, output_path): + """Plot REPLACE and DELETE ranking charts side by side.""" + fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6)) + + if replace_moves: + _plot_panel(ax1, replace_moves, "REPLACE moves") + if delete_moves: + _plot_panel(ax2, delete_moves, "DELETE moves") + + fig.tight_layout() + fig.savefig(output_path, dpi=150) + print(f" Saved {output_path}") + + +def main(): + rng = np.random.default_rng(SEED) + + print("Building cluster...") + nodes = build_cluster(rng, n_pods=5000) + total_pods = sum(len(n.pods) for n in nodes) + print(f" {len(nodes)} nodes, {total_pods} pods, " + f"{total_pods/len(nodes):.1f} pods/node") + print(f" NodePool cost: ${sum(n.cost_per_hr for n in nodes):.2f}/hr") + + # Run multiple rounds of churn to accumulate consolidation candidates. + # Each round kills some pods and adds new ones, then we collect moves. + # This simulates a cluster that has been running for a while with + # ongoing deployments and scale-downs. 
+ all_replace_moves = [] + all_delete_moves = [] + n_rounds = 10 + print(f"Simulating {n_rounds} rounds of workload churn...") + for round_num in range(n_rounds): + churn_pods(rng, nodes) + nodes = [n for n in nodes if len(n.pods) > 0] + replace_moves = generate_replace_moves(rng, nodes) + delete_moves = generate_delete_moves(rng, nodes) + all_replace_moves.extend(replace_moves) + all_delete_moves.extend(delete_moves) + if round_num == 0 or round_num == n_rounds - 1: + total_pods = sum(len(n.pods) for n in nodes) + print(f" Round {round_num+1}: {len(nodes)} nodes, {total_pods} pods, " + f"{len(replace_moves)} REPLACE, {len(delete_moves)} DELETE") + + replace_moves = all_replace_moves + delete_moves = all_delete_moves + print(f" Total: {len(replace_moves)} REPLACE, {len(delete_moves)} DELETE") + + # Combine all moves -- this is what the consolidation controller evaluates + all_moves = replace_moves + delete_moves + print(f" Combined: {len(all_moves)} moves") + + print("Plotting...") + plot_ranking_sidebyside( + replace_moves, delete_moves, + OUTPUT_DIR / "ranking-strategies.png", + ) + + # Write CSVs for inspection + for label, moves in [("all", all_moves), + ("replace", replace_moves), + ("delete", delete_moves)]: + csv_path = Path(__file__).parent / f"ranking-strategies-{label}.csv" + with open(csv_path, "w") as f: + f.write("savings_per_hr,disruption,savings_frac,disruption_frac,score\n") + for m in sorted(moves, key=lambda m: -m["score"]): + f.write(f"{m['savings']:.4f},{m['disruption']:.1f}," + f"{m['savings_frac']:.6f},{m['disruption_frac']:.6f}," + f"{m['score']:.4f}\n") + print(f" Saved {csv_path}") + + print("Done.") + + +if __name__ == "__main__": + main()