[Data] Revisiting OpResourceAllocator to make data flow explicit#57788
[Data] Revisiting OpResourceAllocator to make data flow explicit#57788alexeykudinkin merged 7 commits intoray-project:masterfrom
OpResourceAllocator to make data flow explicit#57788Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
OpResourceAllocator APIsOpResourceAllocator APIs
OpResourceAllocator APIsOpResourceAllocator to make data flow explicit
python/ray/data/_internal/execution/backpressure_policy/resource_budget_backpressure_policy.py
Outdated
Show resolved
Hide resolved
| return self._actor_pool.get_actor_info() | ||
|
|
||
| def get_max_concurrency_limit(self) -> Optional[int]: | ||
| return self._actor_pool.max_size() * self._actor_pool.max_actor_concurrency() |
There was a problem hiding this comment.
Out of scope for this PR since this is an existing issue, but if self._actor_pool.max_size() is float("inf"), I think we'd probably want to return None rather than float("inf") for consistency with the return type
There was a problem hiding this comment.
Good call
- Looked t/h the code and we need to holistically clean this up (since we define max_size as int)
| 5000.0, | ||
| ] | ||
| task_completion_time: float = metric_field( | ||
| task_completion_time_s: float = metric_field( |
There was a problem hiding this comment.
Do we need to update test_stats.py and the dashboard code after renaming these metrics?
There was a problem hiding this comment.
Yep, will do
| @@ -266,6 +267,19 @@ def total_enqueued_input_bundles(self) -> int: | |||
|
|
|||
| return self._pending_dispatch_input_bundles_count() + internal_queue_size | |||
There was a problem hiding this comment.
If we change the internal queue size to represent blocks rather than bundles, then total_enqueued_input_bundles will return incorrect values, and DownstreamCapacityBackpressurePolicy will break.
I think even if we update total_enqueued_input_bundles to represent blocks, we'd still need to update the DownstreamCapacityBackpressurePolicy logic:
There was a problem hiding this comment.
Yeah, we need to fix that across the board
| """Returns Operator's internal queue size""" | ||
| """Returns Operator's internal queue size (in blocks)""" | ||
| ... |
There was a problem hiding this comment.
What are we hoping to achieve by changing the unit of internal_queue_size from bundles to blocks?
There was a problem hiding this comment.
I just realized that we're assuming that every bundle holds just 1 block, which is not enforced
python/ray/data/_internal/execution/streaming_executor_state.py
Outdated
Show resolved
Hide resolved
| def __init__(self, topology: "Topology"): | ||
| self._topology = topology | ||
| self._idle_detector = self.IdleDetector() | ||
| self._ticker = 0 |
There was a problem hiding this comment.
I know this is updated in update_budgets, but is it used anywhere else?
Are subclasses required to increment this? If so, I think this should be an explicit part of the interface
There was a problem hiding this comment.
Missed to clean up
| @abstractmethod | ||
| def can_submit_new_task(self, op: PhysicalOperator) -> bool: | ||
| """Return whether the given operator can submit a new task.""" | ||
| ... |
There was a problem hiding this comment.
What's the motivation for copying this from the backpressure policy interface to here? Would the implementation ever be non-trivial?
If the implementation of this method is always going to be like below, it might be better to remove the method to make the OpResourceAllocator interface deeper and simpler
def can_submit_new_task(self, op):
return op.incremental_resource_usage().satisfies_limit(budget)There was a problem hiding this comment.
Idea here is that the logic whether task can be scheduled should live w/ Resource Allocator (it will be more complicated than the one you referred above)
e271ecb to
e763d0c
Compare
5b23ecb to
9d757b7
Compare
bb87078 to
ea982b3
Compare
bveeramani
left a comment
There was a problem hiding this comment.
Looks reasonable to me.
Let's merge #58030 first to minimize size of the diff, and then merge this one?
| @ray.remote | ||
| def test_import(): | ||
| import file_module | ||
|
|
5379f9c to
e8ab2c8
Compare
b7ec22e to
7881742
Compare
bad2cd7 to
65e7295
Compare
c42f2c3 to
0b1eb1b
Compare
65e7295 to
7165108
Compare
0b1eb1b to
6d9160a
Compare
Signed-off-by: Alexey Kudinkin <ak@anyscale.com> # Conflicts: # python/ray/data/_internal/execution/streaming_executor_state.py Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
6d9160a to
eeb01a7
Compare
| op, | ||
| task_resource_usage=self._op_usages, | ||
| output_object_store_usage=self._mem_op_outputs, | ||
| ) |
There was a problem hiding this comment.
Bug: Inconsistent Return Types in Resource Management
Type mismatch bug: ResourceManager.max_task_output_bytes_to_read() declares return type as int but calls self._op_resource_allocator.max_task_output_bytes_to_read() which returns Optional[int]. The abstract method in OpResourceAllocator and its implementation in ReservationOpResourceAllocator can return None, but the wrapper method signature promises to always return int. This will cause runtime type errors when None is returned but an int is expected by callers.
…ay-project#57788) <!-- Thank you for contributing to Ray! 🚀 --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description This change primarily converts `OpResourceAllocator` APIs to make data flow explicit by exposing required params in the APIs. Additionally: 1. Abstracting common methods inside `OpResourceAllocator` base-class. 2. Adding allocation to progress bar in verbose mode logging budgets & allocations. 3. Adding byte-size of all enqueued blocks to the progress bar ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [ ] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes⚠️ - [ ] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [ ] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [ ] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…ay-project#57788) <!-- Thank you for contributing to Ray! 🚀 --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description This change primarily converts `OpResourceAllocator` APIs to make data flow explicit by exposing required params in the APIs. Additionally: 1. Abstracting common methods inside `OpResourceAllocator` base-class. 2. Adding allocation to progress bar in verbose mode logging budgets & allocations. 3. Adding byte-size of all enqueued blocks to the progress bar ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [ ] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes⚠️ - [ ] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [ ] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [ ] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…ay-project#57788) <!-- Thank you for contributing to Ray! 🚀 --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description This change primarily converts `OpResourceAllocator` APIs to make data flow explicit by exposing required params in the APIs. Additionally: 1. Abstracting common methods inside `OpResourceAllocator` base-class. 2. Adding allocation to progress bar in verbose mode logging budgets & allocations. 3. Adding byte-size of all enqueued blocks to the progress bar ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [ ] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes⚠️ - [ ] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [ ] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [ ] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
… concurrency (#59331) #57788 renamed the `max_task_concurrency()` as `get_max_concurrency_limit()`. However, the PR didn't remove the original ``max_task_concurrency()` method on the `PhysicalOperator` base class (probably an oversight?). This PR fixes the issue by removing the dead method. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…59672) #57788 added `task_resource_usage` and `output_object_store_usage` to `OpResourceAllocator.max_task_output_bytes_to_read`. But, the parameters aren't actually used anywhere, so this PR removes them. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
…ay-project#59672) ray-project#57788 added `task_resource_usage` and `output_object_store_usage` to `OpResourceAllocator.max_task_output_bytes_to_read`. But, the parameters aren't actually used anywhere, so this PR removes them. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
Update the panel's Prometheus expr to use ray_data_task_completion_time_excl_backpressure_s instead of ray_data_task_completion_time_without_backpressure. The metric was renamed in ray-project#57788 (op_runtime_metrics) but the data_dashboard_panels.py panel was not updated, causing the chart to show no data.
Update the panel's Prometheus expr to use ray_data_task_completion_time_excl_backpressure_s instead of ray_data_task_completion_time_without_backpressure. The metric was renamed in ray-project#57788 (op_runtime_metrics) but the data_dashboard_panels.py panel was not updated, causing the chart to show no data. Signed-off-by: kriyanshii <kriyanshishah06@gmail.com>
…tric name (#60481) ## Description Fixes the broken **"Task Completion Time Without Backpressure"** metrics chart in the Ray Data Grafana dashboard. The panel was querying `ray_data_task_completion_time_without_backpressure`, which no longer exists. PR #57788 renamed the underlying metric to `task_completion_time_excl_backpressure_s` in `op_runtime_metrics.py`, but the Grafana panel in `data_dashboard_panels.py` was not updated. This PR updates the panel's Prometheus `expr` to use `ray_data_task_completion_time_excl_backpressure_s` so the chart displays data again. **Change:** Single-line fix in `data_dashboard_panels.py` — replace the old metric name with the correct one in the panel's `expr`. The formula (average task completion time excluding backpressure over a 5-minute window) is unchanged. ## Related issues Fixes the regression from #57788 (metric rename). Related to Ray Data monitoring / dashboard. Closes: #60163 ## Additional information - **Metric flow:** `op_runtime_metrics.task_completion_time_excl_backpressure_s` → Stats uses `data_{name}` → Metrics agent adds `ray_` namespace → **`ray_data_task_completion_time_excl_backpressure_s`** - **Manual verification:** Run a Ray Data job with Grafana + Prometheus (see [cluster metrics](https://docs.ray.io/en/latest/cluster/metrics.html)), then confirm the "Task Completion Time Without Backpressure" panel shows data. Signed-off-by: kriyanshii <kriyanshishah06@gmail.com>
…tric name (ray-project#60481) ## Description Fixes the broken **"Task Completion Time Without Backpressure"** metrics chart in the Ray Data Grafana dashboard. The panel was querying `ray_data_task_completion_time_without_backpressure`, which no longer exists. PR ray-project#57788 renamed the underlying metric to `task_completion_time_excl_backpressure_s` in `op_runtime_metrics.py`, but the Grafana panel in `data_dashboard_panels.py` was not updated. This PR updates the panel's Prometheus `expr` to use `ray_data_task_completion_time_excl_backpressure_s` so the chart displays data again. **Change:** Single-line fix in `data_dashboard_panels.py` — replace the old metric name with the correct one in the panel's `expr`. The formula (average task completion time excluding backpressure over a 5-minute window) is unchanged. ## Related issues Fixes the regression from ray-project#57788 (metric rename). Related to Ray Data monitoring / dashboard. Closes: ray-project#60163 ## Additional information - **Metric flow:** `op_runtime_metrics.task_completion_time_excl_backpressure_s` → Stats uses `data_{name}` → Metrics agent adds `ray_` namespace → **`ray_data_task_completion_time_excl_backpressure_s`** - **Manual verification:** Run a Ray Data job with Grafana + Prometheus (see [cluster metrics](https://docs.ray.io/en/latest/cluster/metrics.html)), then confirm the "Task Completion Time Without Backpressure" panel shows data. Signed-off-by: kriyanshii <kriyanshishah06@gmail.com> Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
…tric name (ray-project#60481) ## Description Fixes the broken **"Task Completion Time Without Backpressure"** metrics chart in the Ray Data Grafana dashboard. The panel was querying `ray_data_task_completion_time_without_backpressure`, which no longer exists. PR ray-project#57788 renamed the underlying metric to `task_completion_time_excl_backpressure_s` in `op_runtime_metrics.py`, but the Grafana panel in `data_dashboard_panels.py` was not updated. This PR updates the panel's Prometheus `expr` to use `ray_data_task_completion_time_excl_backpressure_s` so the chart displays data again. **Change:** Single-line fix in `data_dashboard_panels.py` — replace the old metric name with the correct one in the panel's `expr`. The formula (average task completion time excluding backpressure over a 5-minute window) is unchanged. ## Related issues Fixes the regression from ray-project#57788 (metric rename). Related to Ray Data monitoring / dashboard. Closes: ray-project#60163 ## Additional information - **Metric flow:** `op_runtime_metrics.task_completion_time_excl_backpressure_s` → Stats uses `data_{name}` → Metrics agent adds `ray_` namespace → **`ray_data_task_completion_time_excl_backpressure_s`** - **Manual verification:** Run a Ray Data job with Grafana + Prometheus (see [cluster metrics](https://docs.ray.io/en/latest/cluster/metrics.html)), then confirm the "Task Completion Time Without Backpressure" panel shows data. Signed-off-by: kriyanshii <kriyanshishah06@gmail.com>
…tric name (ray-project#60481) ## Description Fixes the broken **"Task Completion Time Without Backpressure"** metrics chart in the Ray Data Grafana dashboard. The panel was querying `ray_data_task_completion_time_without_backpressure`, which no longer exists. PR ray-project#57788 renamed the underlying metric to `task_completion_time_excl_backpressure_s` in `op_runtime_metrics.py`, but the Grafana panel in `data_dashboard_panels.py` was not updated. This PR updates the panel's Prometheus `expr` to use `ray_data_task_completion_time_excl_backpressure_s` so the chart displays data again. **Change:** Single-line fix in `data_dashboard_panels.py` — replace the old metric name with the correct one in the panel's `expr`. The formula (average task completion time excluding backpressure over a 5-minute window) is unchanged. ## Related issues Fixes the regression from ray-project#57788 (metric rename). Related to Ray Data monitoring / dashboard. Closes: ray-project#60163 ## Additional information - **Metric flow:** `op_runtime_metrics.task_completion_time_excl_backpressure_s` → Stats uses `data_{name}` → Metrics agent adds `ray_` namespace → **`ray_data_task_completion_time_excl_backpressure_s`** - **Manual verification:** Run a Ray Data job with Grafana + Prometheus (see [cluster metrics](https://docs.ray.io/en/latest/cluster/metrics.html)), then confirm the "Task Completion Time Without Backpressure" panel shows data. Signed-off-by: kriyanshii <kriyanshishah06@gmail.com> Signed-off-by: 400Ping <jiekaichang@apache.org>
…ay-project#59672) ray-project#57788 added `task_resource_usage` and `output_object_store_usage` to `OpResourceAllocator.max_task_output_bytes_to_read`. But, the parameters aren't actually used anywhere, so this PR removes them. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
…tric name (ray-project#60481) ## Description Fixes the broken **"Task Completion Time Without Backpressure"** metrics chart in the Ray Data Grafana dashboard. The panel was querying `ray_data_task_completion_time_without_backpressure`, which no longer exists. PR ray-project#57788 renamed the underlying metric to `task_completion_time_excl_backpressure_s` in `op_runtime_metrics.py`, but the Grafana panel in `data_dashboard_panels.py` was not updated. This PR updates the panel's Prometheus `expr` to use `ray_data_task_completion_time_excl_backpressure_s` so the chart displays data again. **Change:** Single-line fix in `data_dashboard_panels.py` — replace the old metric name with the correct one in the panel's `expr`. The formula (average task completion time excluding backpressure over a 5-minute window) is unchanged. ## Related issues Fixes the regression from ray-project#57788 (metric rename). Related to Ray Data monitoring / dashboard. Closes: ray-project#60163 ## Additional information - **Metric flow:** `op_runtime_metrics.task_completion_time_excl_backpressure_s` → Stats uses `data_{name}` → Metrics agent adds `ray_` namespace → **`ray_data_task_completion_time_excl_backpressure_s`** - **Manual verification:** Run a Ray Data job with Grafana + Prometheus (see [cluster metrics](https://docs.ray.io/en/latest/cluster/metrics.html)), then confirm the "Task Completion Time Without Backpressure" panel shows data. Signed-off-by: kriyanshii <kriyanshishah06@gmail.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>
…ay-project#57788) <!-- Thank you for contributing to Ray! 🚀 --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description This change primarily converts `OpResourceAllocator` APIs to make data flow explicit by exposing required params in the APIs. Additionally: 1. Abstracting common methods inside `OpResourceAllocator` base-class. 2. Adding allocation to progress bar in verbose mode logging budgets & allocations. 3. Adding byte-size of all enqueued blocks to the progress bar ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [ ] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes⚠️ - [ ] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [ ] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [ ] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
… concurrency (ray-project#59331) ray-project#57788 renamed the `max_task_concurrency()` as `get_max_concurrency_limit()`. However, the PR didn't remove the original ``max_task_concurrency()` method on the `PhysicalOperator` base class (probably an oversight?). This PR fixes the issue by removing the dead method. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: peterxcli <peterxcli@gmail.com>
…ay-project#59672) ray-project#57788 added `task_resource_usage` and `output_object_store_usage` to `OpResourceAllocator.max_task_output_bytes_to_read`. But, the parameters aren't actually used anywhere, so this PR removes them. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
…tric name (ray-project#60481) ## Description Fixes the broken **"Task Completion Time Without Backpressure"** metrics chart in the Ray Data Grafana dashboard. The panel was querying `ray_data_task_completion_time_without_backpressure`, which no longer exists. PR ray-project#57788 renamed the underlying metric to `task_completion_time_excl_backpressure_s` in `op_runtime_metrics.py`, but the Grafana panel in `data_dashboard_panels.py` was not updated. This PR updates the panel's Prometheus `expr` to use `ray_data_task_completion_time_excl_backpressure_s` so the chart displays data again. **Change:** Single-line fix in `data_dashboard_panels.py` — replace the old metric name with the correct one in the panel's `expr`. The formula (average task completion time excluding backpressure over a 5-minute window) is unchanged. ## Related issues Fixes the regression from ray-project#57788 (metric rename). Related to Ray Data monitoring / dashboard. Closes: ray-project#60163 ## Additional information - **Metric flow:** `op_runtime_metrics.task_completion_time_excl_backpressure_s` → Stats uses `data_{name}` → Metrics agent adds `ray_` namespace → **`ray_data_task_completion_time_excl_backpressure_s`** - **Manual verification:** Run a Ray Data job with Grafana + Prometheus (see [cluster metrics](https://docs.ray.io/en/latest/cluster/metrics.html)), then confirm the "Task Completion Time Without Backpressure" panel shows data. Signed-off-by: kriyanshii <kriyanshishah06@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
…tric name (ray-project#60481) ## Description Fixes the broken **"Task Completion Time Without Backpressure"** metrics chart in the Ray Data Grafana dashboard. The panel was querying `ray_data_task_completion_time_without_backpressure`, which no longer exists. PR ray-project#57788 renamed the underlying metric to `task_completion_time_excl_backpressure_s` in `op_runtime_metrics.py`, but the Grafana panel in `data_dashboard_panels.py` was not updated. This PR updates the panel's Prometheus `expr` to use `ray_data_task_completion_time_excl_backpressure_s` so the chart displays data again. **Change:** Single-line fix in `data_dashboard_panels.py` — replace the old metric name with the correct one in the panel's `expr`. The formula (average task completion time excluding backpressure over a 5-minute window) is unchanged. ## Related issues Fixes the regression from ray-project#57788 (metric rename). Related to Ray Data monitoring / dashboard. Closes: ray-project#60163 ## Additional information - **Metric flow:** `op_runtime_metrics.task_completion_time_excl_backpressure_s` → Stats uses `data_{name}` → Metrics agent adds `ray_` namespace → **`ray_data_task_completion_time_excl_backpressure_s`** - **Manual verification:** Run a Ray Data job with Grafana + Prometheus (see [cluster metrics](https://docs.ray.io/en/latest/cluster/metrics.html)), then confirm the "Task Completion Time Without Backpressure" panel shows data. Signed-off-by: kriyanshii <kriyanshishah06@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
This change primarily converts
OpResourceAllocatorAPIs to make data flow explicit by exposing required params in the APIs.Additionally:
OpResourceAllocatorbase-class.Related issues
Types of change
Checklist
Does this PR introduce breaking changes?
Testing:
Code Quality:
git commit -s)Documentation:
doc/source/(if applicable)Additional context