[data] [docs] Adding unstructured data templates from ray summit 2025 #57063
Merged
angelinalg merged 77 commits into ray-project:master on Nov 26, 2025
Code Review
This pull request introduces an excellent beginner-friendly example for Ray Data ETL using the TPC-H benchmark. The notebook is well-structured and provides a good overview of the Extract-Transform-Load process with practical examples. I've identified a few critical issues in the notebook's code snippets that would cause CI failures, and a broken link in the documentation. Additionally, I've provided some suggestions to enhance code quality and clarity. Once these issues are addressed, this will be a great addition to the Ray documentation.
c6b5efa to 7fb5273
Aydin-ab commented Oct 2, 2025
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
- Convert to second person, active voice throughout
- Update all headings to sentence case formatting
- Replace HTML alert blocks with Docusaurus admonitions (:::note, :::tip, :::caution)
- Fix list punctuation and sentence structure per style guide
- Improve grammar and readability while preserving all technical content

Applied rules: Google Dev Docs style, Docusaurus formatting, accessibility improvements

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
b911442 to da27fb3
doc/source/data/examples/unstructured-data-ingestion/content/unstructured-data-ingestion.ipynb
…ray) Signed-off-by: Aydin Abiar <aydin@anyscale.com>
angelinalg reviewed Nov 24, 2025
doc/source/data/examples/unstructured-data-ingestion/content/unstructured-data-ingestion.ipynb
Outdated
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
angelinalg reviewed Nov 24, 2025
doc/source/data/examples/unstructured-data-ingestion/content/unstructured-data-ingestion.ipynb
Outdated
angelinalg approved these changes Nov 24, 2025
doc/source/data/examples/unstructured-data-ingestion/content/unstructured-data-ingestion.ipynb
Outdated
doc/source/data/examples/unstructured-data-ingestion/ci/build.sh
Outdated
…y 2.52.0 images, we have to reinstall pandas after installing unstructured, otherwise pandas is not compiled with the right numpy binaries (ray compiles with numpy 1 and unstructured installs numpy 2).

This example is supposed to be a template on the Anyscale console and we need to make it a good experience for the user. Using Runtime Dependencies on the console won't work because the --force-reinstall and --no-cache-dir flags will not be propagated to the workers by Anyscale. We can either:
1. Create a customized image using ray 2.52.0 as the base, then install unstructured and later pandas.
2. Use runtime envs in ray.init() to install them across workers.

Solution 1 might be confusing to Anyscale users, who might wonder why we make a new image for this specific template instead of using runtime dependencies (let's avoid telling them the --force-reinstall flag doesn't work, which might give a bad image). Solution 2 is better because it introduces a good ray pattern (using runtime envs). In the future, when ray supports numpy 2, we could remove it and instead tell the user to pip install unstructured directly (with a notebook cell, for example). If this example wasn't supposed to be an Anyscale template then I would have gone for solution 1, because this issue would just happen with the CI testing anyway.

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
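For readers unfamiliar with solution 2, here is a minimal sketch of the runtime env pattern, not the code from the notebook itself. It assumes the two packages named in the commit message are listed unpinned under the `pip` field of `runtime_env`; the helper names `build_runtime_env` and `start_ray` are hypothetical, and whether this alone resolves the numpy mismatch (versus needing pins or a specific install order) is not something the commit message spells out.

```python
from typing import Dict, List


def build_runtime_env(packages: List[str]) -> Dict[str, List[str]]:
    """Build a Ray runtime_env dict that asks workers to pip-install `packages`."""
    return {"pip": list(packages)}


# Package names come from the commit message; leaving them unpinned is an assumption.
RUNTIME_ENV = build_runtime_env(["unstructured", "pandas"])


def start_ray():
    """Start (or connect to) Ray with the runtime environment attached."""
    # Imported lazily so the dict above can be inspected without Ray installed.
    import ray

    return ray.init(runtime_env=RUNTIME_ENV)
```

Calling `start_ray()` in the first notebook cell would hand the package list to Ray, which sets up the environment for the workers instead of relying on the console's Runtime Dependencies setting.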
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…ray-project#57063) Signed-off-by: peterxcli <peterxcli@gmail.com>
Why are these changes needed?
Converting Ray Data training content from Ray Summit 2025 into an example in the docs and an Anyscale template in the console
author: @soffer-anyscale
Checks
…git commit -s) in this PR.
…method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.