Commit 89d9045
[data] Add DownstreamCapacityBackpressurePolicy based on downstream processing capacity (ray-project#55463)
## Summary
Implement a downstream processing capacity-based backpressure mechanism
to address stability and performance issues caused by unbalanced
processing speeds across pipeline operators due to user
misconfigurations, instance preemptions, and cluster resource scaling.
## Problem Statement
Current Ray Data pipelines face several critical challenges:
### 1. Performance & Stability Issues
- Large numbers of objects accumulate in memory when downstream
operators cannot consume them in time
- Wasted memory and potential object spilling lead to significant
performance degradation
- Pipeline instability due to memory pressure
### 2. Resource Waste in Dynamic Environments
- In preemption scenarios the situation worsens, because large numbers
of objects must be rebuilt each time workers are preempted
- Inefficient resource utilization due to upstream-downstream speed
mismatch
- Wasted compute resources on processing data that cannot be consumed
### 3. Complex Configuration Requirements
- Users find it difficult to configure reasonable parallelism ratios
- Inappropriate configurations lead to resource waste or insufficient
throughput
- Especially challenging on elastic resources where capacity changes
dynamically
## Solution
This PR introduces `DownstreamCapacityBackpressurePolicy`, which provides:
### 1. Simplified User Configuration with Adaptive Concurrency
- Ray Data automatically adjusts parallelism based on actual pipeline
performance
- When an upstream operator is blocked by backpressure, its resources
are released so that downstream operators can scale up
- Self-adaptive mechanism reduces the need for manual tuning and complex
configuration
### 2. Consistent Pipeline Throughput
- Objects output by upstream operators are consumed by downstream as
quickly as possible
- Improves stability, saves memory, and reduces the risk of unnecessary
object rebuilding
- Maintains balanced flow throughout the entire pipeline
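The balancing effect described above can be seen in a toy simulation. This is a sketch under simplifying assumptions, not Ray Data's scheduler: `simulate`, `produce_rate`, `consume_rate`, and `max_queued` are hypothetical names, and the upstream is simply paused whenever the queue already holds at least `max_queued` items.

```python
from typing import Optional


def simulate(
    steps: int,
    produce_rate: int,
    consume_rate: int,
    max_queued: Optional[int] = None,
) -> int:
    """Return the peak queue length between two operators over `steps` ticks.

    Hypothetical model: each tick the upstream emits `produce_rate` items
    unless it is backpressured, then the downstream consumes up to
    `consume_rate` items.
    """
    queued = 0
    peak = 0
    for _ in range(steps):
        # The upstream runs only while the queue is below the cap (if any).
        if max_queued is None or queued < max_queued:
            queued += produce_rate
        peak = max(peak, queued)
        # The downstream drains what it can this tick.
        queued -= min(queued, consume_rate)
    return peak


unbounded_peak = simulate(steps=100, produce_rate=5, consume_rate=2)
bounded_peak = simulate(steps=100, produce_rate=5, consume_rate=2, max_queued=10)
```

Without a cap, the peak grows with the number of steps whenever the upstream is faster than the downstream. With a cap, the peak can never exceed `max_queued - 1 + produce_rate`, regardless of how long the pipeline runs.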
## Key Benefits
### 🚀 Performance Improvements
- Prevents memory bloat and reduces object spilling
- Maintains optimal memory utilization across the pipeline
- Eliminates performance degradation from memory pressure
### 🛡️ Enhanced Stability
- Handles instance preemptions gracefully
- Reduces object rebuilding in dynamic environments
- Maintains pipeline stability under varying cluster conditions
### ⚙️ Simplified Operations
- Reduces complex configuration requirements
- Provides self-adaptive parallelism adjustment
- Works effectively on elastic resources
### 💰 Resource Efficiency
- Prevents resource waste from unbalanced processing
- Optimizes resource allocation across pipeline stages
- Reduces unnecessary compute overhead
## Configuration
Users can configure the backpressure behavior via `DataContext`:
```python
import ray

ctx = ray.data.DataContext.get_current()
# Set ratio threshold (default: inf, disabled)
ctx.downstream_capacity_backpressure_ratio = 2.0
# Set absolute threshold (default: sys.maxsize, disabled)
ctx.downstream_capacity_backpressure_max_queued_bundles = 4000
```
### Default Behavior
- By default, backpressure is disabled (the ratio threshold defaults to
infinity and the absolute threshold to `sys.maxsize`) to maintain
backward compatibility
- Users can enable it by setting appropriate threshold values
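How the two thresholds and their disabled defaults might interact can be illustrated with a small standalone sketch. It is not the code in this PR: the function `should_backpressure` and its parameters `queued_bundles` and `downstream_capacity` are hypothetical names, and the exact comparison the policy performs is an assumption made for illustration.

```python
import math
import sys


def should_backpressure(
    queued_bundles: int,
    downstream_capacity: int,
    ratio: float = math.inf,
    max_queued_bundles: int = sys.maxsize,
) -> bool:
    """Return True if the upstream operator should pause (hypothetical rule).

    Assumption: the upstream is throttled when the downstream queue exceeds
    either `ratio * downstream_capacity` or the absolute cap.
    """
    if queued_bundles > max_queued_bundles:
        return True
    # With ratio == inf the product is inf (or nan when capacity is 0), so
    # the comparison is False and the ratio check is effectively disabled.
    return queued_bundles > ratio * downstream_capacity


# Defaults: nothing is throttled, matching the backward-compatible behavior.
disabled = should_backpressure(queued_bundles=10**9, downstream_capacity=1)
# Ratio of 2.0 with capacity 4: more than 8 queued bundles triggers a pause.
by_ratio = should_backpressure(9, 4, ratio=2.0)
# Absolute cap of 4000 applies even when the ratio check is disabled.
by_cap = should_backpressure(4001, 10**6, max_queued_bundles=4000)
```

Under this reading, either threshold can be enabled on its own, and leaving both at their defaults keeps existing pipelines unchanged.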
## Impact & Results
This implementation successfully addresses the core challenges:
✅ **Performance & Stability**: Eliminates memory pressure and spilling
issues
✅ **Resource Efficiency**: Prevents waste in preemption scenarios and
dynamic environments
✅ **Configuration Simplicity**: Reduces complex user configuration
requirements
✅ **Adaptive Throughput**: Maintains consistent pipeline performance
The solution provides a foundation for more intelligent, self-adaptive
Ray Data pipelines that can handle dynamic cluster conditions while
maintaining optimal performance and resource utilization.
Signed-off-by: dragongu <andrewgu@vip.qq.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

1 parent d9d84d1 commit 89d9045
File tree
6 files changed, +246 -0 lines changed
- python/ray/data
  - _internal
    - actor_autoscaler
    - execution/backpressure_policy
  - tests

| File | Lines changed | Added lines (new numbering) |
|---|---|---|
| (path not shown) | 14 additions | 1326-1339 |
| (path not shown) | 3 additions & 0 deletions | 112-114 |
| (path not shown) | 5 additions & 0 deletions | 5-7, 20, 40 |
| python/ray/data/_internal/execution/backpressure_policy/downstream_capacity_backpressure_policy.py | 94 additions & 0 deletions | 1-94 |
| (path not shown) | 3 additions | 537-539 |
| (path not shown) | 127 additions & 0 deletions | 1-127 |