[data] [docs] Adding unstructured data templates from ray summit 2025 #57063

Merged
angelinalg merged 77 commits into ray-project:master from Aydin-ab:add-etl-tpch-template
Nov 26, 2025

@Aydin-ab (Contributor) commented Sep 30, 2025

Why are these changes needed?

Converting the Ray Data training content from Ray Summit 2025 into an example in the docs and an Anyscale template in the console.

author: @soffer-anyscale

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Aydin-ab Aydin-ab requested review from a team as code owners September 30, 2025 23:35
@Aydin-ab Aydin-ab marked this pull request as draft September 30, 2025 23:35
cursor[bot]

This comment was marked as outdated.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an excellent beginner-friendly example for Ray Data ETL using the TPC-H benchmark. The notebook is well-structured and provides a good overview of the Extract-Transform-Load process with practical examples. I've identified a few critical issues in the notebook's code snippets that would cause CI failures, and a broken link in the documentation. Additionally, I've provided some suggestions to enhance code quality and clarity. Once these issues are addressed, this will be a great addition to the Ray documentation.

@Aydin-ab Aydin-ab marked this pull request as ready for review October 2, 2025 17:29
@Aydin-ab Aydin-ab requested a review from a team as a code owner October 2, 2025 17:29
@angelinalg angelinalg added the go add ONLY when ready to merge, run all tests label Oct 2, 2025
cursor[bot]

This comment was marked as outdated.

@Aydin-ab Aydin-ab force-pushed the add-etl-tpch-template branch from c6b5efa to 7fb5273 Compare October 2, 2025 18:18
@ray-gardener ray-gardener bot added docs An issue or change related to documentation data Ray Data-related issues release-test release test labels Oct 2, 2025
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
- Convert to second person, active voice throughout
- Update all headings to sentence case formatting
- Replace HTML alert blocks with Docusaurus admonitions (:::note, :::tip, :::caution)
- Fix list punctuation and sentence structure per style guide
- Improve grammar and readability while preserving all technical content

Applied rules: Google Dev Docs style, Docusaurus formatting, accessibility improvements

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
@Aydin-ab Aydin-ab force-pushed the add-etl-tpch-template branch from b911442 to da27fb3 Compare October 2, 2025 22:43
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Aydin-ab and others added 4 commits November 24, 2025 09:54
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…ray)

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…y 2.52.0 images, we have to reinstall pandas after installing unstructured; otherwise pandas is not compiled with the right numpy binaries (ray compiles with numpy 1 and unstructured installs numpy 2). This example is supposed to be a template on the Anyscale console, and we need to make it a good experience for the user. Using Runtime Dependencies on the console won't work because Anyscale does not propagate the --force-reinstall and --no-cache-dir flags to the workers. We can either:

1. Create a customized image using ray 2.52.0 as the base, then install unstructured and afterwards pandas.
2. Use runtime envs in ray.init() to install them across workers.

Solution 1 might confuse Anyscale users, who might wonder why this specific template needs a new image instead of using runtime dependencies (let's avoid telling them the --force-reinstall flag doesn't work, which might look bad). Solution 2 is better because it introduces a good ray pattern (using runtime envs). In the future, when ray supports numpy 2, we could remove it and instead tell the user to pip install unstructured directly (with a notebook cell, for example). If this example weren't supposed to be an Anyscale template, I would have gone for solution 1, because this issue would only happen in CI testing anyway.

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
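
For readers unfamiliar with the runtime env pattern that solution 2 refers to, here is a minimal sketch. It is not the code from this PR: the helper names (build_runtime_env, start_ray), the package list, and the absence of version pins are all assumptions made for illustration. It also does not reproduce the --force-reinstall / --no-cache-dir behavior discussed in the commit message; whether a plain package list is enough to get a pandas build that matches the installed numpy is left as an open assumption.

```python
from typing import Any, Dict, List


def build_runtime_env(packages: List[str]) -> Dict[str, Any]:
    """Return a runtime_env dict asking Ray to pip-install `packages` for the job."""
    if not packages:
        raise ValueError("expected at least one package")
    # "pip" is the runtime_env field Ray uses for per-job pip dependencies.
    return {"pip": list(packages)}


# Hypothetical package list: the names come from the commit message,
# but the lack of version pins is an assumption of this sketch.
TEMPLATE_PACKAGES = ["unstructured", "pandas"]
RUNTIME_ENV = build_runtime_env(TEMPLATE_PACKAGES)


def start_ray() -> None:
    """Start or connect to Ray so that workers install the packages above.

    ray is imported lazily so this module can be loaded without Ray installed.
    """
    import ray

    ray.init(runtime_env=RUNTIME_ENV)
```

Calling start_ray() at the top of a notebook would hand the dependency list to Ray, which then sets up the environment on the workers instead of relying on the console's Runtime Dependencies setting.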
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
@angelinalg angelinalg merged commit 76be448 into ray-project:master Nov 26, 2025
6 checks passed
@Aydin-ab Aydin-ab deleted the add-etl-tpch-template branch November 26, 2025 18:15
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

  • data: Ray Data-related issues
  • docs: An issue or change related to documentation
  • go: add ONLY when ready to merge, run all tests
  • release-test: release test
  • unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.

6 participants