[data] [docs] Adding unstructured data templates from ray summit 2025 #57063
Merged: angelinalg merged 77 commits into ray-project:master from Aydin-ab:add-etl-tpch-template on Nov 26, 2025
Changes from 70 commits
7a38e31 remove non existent xgboost example (Aydin-ab)
6d54a5a init readme (Aydin-ab)
32e9ce1 use jupyter nbconvert (Aydin-ab)
a0bb0f3 add configs/ folder (Aydin-ab)
565865f add ci/ folder (Aydin-ab)
1ec3fbc add release test configs (Aydin-ab)
2d3ac7f add to examples.yml index (Aydin-ab)
6163337 discover to bazel build (Aydin-ab)
a37713f fix typo (Aydin-ab)
0415662 new notebook content (Aydin-ab)
977fe80 fix: use correct bazel build file (Aydin-ab)
437be2c rename notebook to avoid bazel error + vale compliant (Aydin-ab)
f92e626 add new framework and use case (Aydin-ab)
9220d66 change byod type to cpu (Aydin-ab)
da27fb3 [cursor] Apply documentation style guide to Ray Data ETL notebook (Aydin-ab)
6cac6fc fix test notebook file name (Aydin-ab)
274a718 remove docusaurus formatting (cant display it on anyscale template fr… (Aydin-ab)
67e0345 angelina review: Apply suggestions from code review (Aydin-ab)
5ef92c8 disable tests for now (Aydin-ab)
1498811 updated readme (Aydin-ab)
657a9b9 rename folder + add Run on anyscale button (Aydin-ab)
2e5ae2f removing reference to xgboost tutorial that doesnt exist anymore (Aydin-ab)
8834f48 refactor to new folder name' (Aydin-ab)
353301e lint issues (Aydin-ab)
5227d8e Merge branch 'master' into add-etl-tpch-template (Aydin-ab)
e28d352 add groupby to vale vocab (Aydin-ab)
0d3dc9b rename filegroup (Aydin-ab)
867187d move to content/ folder (Aydin-ab)
ea068ae remove batch-inference-optimization (Aydin-ab)
fff72a8 move to content folder (Aydin-ab)
2ab8cdf nitpicks (Aydin-ab)
b5b92c4 sync README.md (Aydin-ab)
7745556 updated examples.yml (Aydin-ab)
80ca720 update build.sh to minimal (Aydin-ab)
e22fae3 update tests (Aydin-ab)
eeba5c3 update byod_* images (Aydin-ab)
1abf37b configure tests in releaase_tests.yaml (Aydin-ab)
38de677 adding general examples filegroup for ci config yaml files (Aydin-ab)
3c92745 move filegroup to doc BUILD.bazel (Aydin-ab)
f678fa5 vale compliant (Aydin-ab)
ee0b5e0 fix ref to new header titles (Aydin-ab)
df8e2ec add use case and framework for new templates (Aydin-ab)
f118a85 remove .ipynb and .md from toctree (generated separately with example… (Aydin-ab)
f9623ad fix path to README.md in conf.py (Aydin-ab)
372ac13 use closer BUILD.bazel (Aydin-ab)
ecb994d remove . from glob patterns (Aydin-ab)
e6d06c1 add back ray-overview examples configs (Aydin-ab)
56f8df5 use orphan metadata insteaed of sclude (Aydin-ab)
e3129f6 remove etl template (Aydin-ab)
a25ba24 adding debug logs (TODO: remove) (Aydin-ab)
e95c723 sync README.md (Aydin-ab)
60a6041 finish removing etl template (Aydin-ab)
33715fb Revert "adding debug logs (TODO: remove)" (Aydin-ab)
82d5abe remove superfluous/LLM verbosity (Aydin-ab)
5dc0be4 update section ref (Aydin-ab)
0c7dbfb change to ray llm image (Aydin-ab)
785ad62 Merge branch 'master' into add-etl-tpch-template (Aydin-ab)
81eab1c fix cut config (Aydin-ab)
b1353b3 readding xgboost example (not purpose of this PR) (Aydin-ab)
216a0de Merge branch 'master' into add-etl-tpch-template (Aydin-ab)
35439e1 remove concurrency/num cpu + adding with_columns() suggestion (Aydin-ab)
df3f4f1 bugfix adding lit() (Aydin-ab)
38cb428 Merge branch 'master' into add-etl-tpch-template (Aydin-ab)
1270bef declutter vale now that we removed the etl tempalte (Aydin-ab)
14cf60f Merge branch 'master' into add-etl-tpch-template (Aydin-ab)
f638121 remove quickstart heading + remove STEP comments (Aydin-ab)
66f3d15 unfreeze unstructured version (unstable backward compat with nightly … (Aydin-ab)
d147bc0 adding buttong to anyscale tmeplate and github (Aydin-ab)
03de24d Apply suggestions from code review (Aydin-ab)
886690d Apply suggestions from code review (Aydin-ab)
1bfc1f0 sync markdwon (Aydin-ab)
422437b add helper script (Aydin-ab)
218d531 change to cpu image (Aydin-ab)
214dd48 use [pdf] (Aydin-ab)
f166431 reinstall pandas to avoid deps errors (Aydin-ab)
17217f6 using runtime envs in ray init(): At this point, with our anyscale/ra… (Aydin-ab)
46281bb nitpick remove todo comment (Aydin-ab)
17 changes: 17 additions & 0 deletions
doc/source/data/examples/unstructured-data-ingestion/ci/aws.yaml
@@ -0,0 +1,17 @@

    cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
    region: us-west-2

    head_node_type:
      name: head
      instance_type: m5.2xlarge
      resources:
        cpu: 0
        gpu: 0
    worker_node_types:
      - name: 8CPU-32GB
        instance_type: m5.2xlarge
        min_workers: 10
        max_workers: 10

    flags:
      allow-cross-zone-autoscaling: true
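
The `cloud_id` line in this CI config uses a template placeholder, `{{env["ANYSCALE_CLOUD_ID"]}}`, that has to be filled from the environment before the file is usable. The PR doesn't show how the release-test tooling renders it, so the following is only a minimal sketch of the idea. The function name `render_env_placeholders` and the regular expression are assumptions made for illustration, not part of Ray or Anyscale.

```python
import os
import re

# Matches placeholders of the form {{env["NAME"]}}, as seen in ci/aws.yaml.
# The pattern is an illustrative assumption, not the real templating engine.
_ENV_PLACEHOLDER = re.compile(r'\{\{\s*env\["([A-Za-z_][A-Za-z0-9_]*)"\]\s*\}\}')


def render_env_placeholders(text: str, env=None) -> str:
    """Replace each {{env["NAME"]}} with the value of NAME; fail if NAME is unset."""
    env = os.environ if env is None else env

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_PLACEHOLDER.sub(substitute, text)


if __name__ == "__main__":
    template = 'cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}\nregion: us-west-2\n'
    # "cld_example" is a made-up value used only for this demonstration.
    print(render_env_placeholders(template, {"ANYSCALE_CLOUD_ID": "cld_example"}))
```

Failing loudly on a missing variable mirrors the `set -e` style used by the CI scripts in this PR: a misconfigured environment stops the run instead of producing a config with an empty `cloud_id`.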
6 changes: 6 additions & 0 deletions
doc/source/data/examples/unstructured-data-ingestion/ci/build.sh
@@ -0,0 +1,6 @@

    #!/bin/bash

    set -exo pipefail

    # Install runtime deps
    pip install unstructured
Aydin-ab marked this conversation as resolved.
17 changes: 17 additions & 0 deletions
doc/source/data/examples/unstructured-data-ingestion/ci/gce.yaml
@@ -0,0 +1,17 @@

    cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
    region: us-central1

    head_node_type:
      name: head
      instance_type: n1-standard-8
      resources:
        cpu: 0
        gpu: 0
    worker_node_types:
      - name: 8CPU-32GB
        instance_type: n1-standard-8
        min_workers: 10
        max_workers: 10

    flags:
      allow-cross-zone-autoscaling: true
86 changes: 86 additions & 0 deletions
doc/source/data/examples/unstructured-data-ingestion/ci/nb2py.py
@@ -0,0 +1,86 @@

    #!/usr/bin/env python3
    import argparse
    import nbformat


    def convert_notebook(
        input_path: str, output_path: str, ignore_cmds: bool = False
    ) -> None:
        """
        Read a Jupyter notebook and write a Python script, converting all %%bash
        cells and IPython "!" commands into subprocess.run calls that raise on error.
        Cells that load or autoreload extensions are ignored.
        """
        nb = nbformat.read(input_path, as_version=4)
        with open(output_path, "w") as out:
            for cell in nb.cells:
                # Only process code cells
                if cell.cell_type != "code":
                    continue

                lines = cell.source.splitlines()
                # Skip cells that load or autoreload extensions
                if any(
                    l.strip().startswith("%load_ext autoreload")
                    or l.strip().startswith("%autoreload all")
                    for l in lines
                ):
                    continue

                # Detect a %%bash cell
                if lines and lines[0].strip().startswith("%%bash"):
                    if ignore_cmds:
                        continue
                    bash_script = "\n".join(lines[1:]).rstrip()
                    out.write("import subprocess\n")
                    out.write(
                        f"subprocess.run(r'''{bash_script}''',\n"
                        " shell=True,\n"
                        " check=True,\n"
                        " executable='/bin/bash')\n\n"
                    )
                else:
                    # Detect any IPython '!' shell commands in code lines
                    has_bang = any(line.lstrip().startswith("!") for line in lines)
                    if has_bang:
                        if ignore_cmds:
                            continue
                        out.write("import subprocess\n")
                        for line in lines:
                            stripped = line.lstrip()
                            if stripped.startswith("!"):
                                cmd = stripped[1:].lstrip()
                                out.write(
                                    f"subprocess.run(r'''{cmd}''',\n"
                                    " shell=True,\n"
                                    " check=True,\n"
                                    " executable='/bin/bash')\n"
                                )
                            else:
                                out.write(line.rstrip() + "\n")
                        out.write("\n")
                    else:
                        # Regular Python cell:
                        code = cell.source.rstrip()
                        # Example of filtering cells by content
                        # if "client.chat.completions.create" in code:
                        # continue # Model isn't deployed in CI so skip cells calling the service
                        # else, dump as-is
                        out.write(cell.source.rstrip() + "\n\n")


    def main() -> None:
        parser = argparse.ArgumentParser(
            description="Convert a Jupyter notebook to a Python script, preserving bash cells and '!' commands as subprocess calls unless ignored with --ignore-cmds."
        )
        parser.add_argument("input_nb", help="Path to the input .ipynb file")
        parser.add_argument("output_py", help="Path for the output .py script")
        parser.add_argument(
            "--ignore-cmds", action="store_true", help="Ignore bash cells and '!' commands"
        )
        args = parser.parse_args()
        convert_notebook(args.input_nb, args.output_py, ignore_cmds=args.ignore_cmds)


    if __name__ == "__main__":
        main()
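
One edge case in the converter above: it pastes each shell command into a raw triple-quoted literal (`r'''...'''`). A command that itself contains `'''`, or that ends in a backslash, can therefore produce a script that doesn't parse as intended. The sketch below shows one possible alternative, not what the PR does: emitting the command with `repr()`, which always yields a valid Python string literal. The helper name `emit_subprocess_call` is hypothetical.

```python
def emit_subprocess_call(cmd: str) -> str:
    """Return Python source that runs `cmd` under bash and raises on failure.

    Hypothetical helper: it mirrors the subprocess.run call that nb2py.py
    writes, but quotes the command with repr() instead of r'''...'''.
    """
    # {cmd!r} inserts repr(cmd), a valid string literal whatever quotes or
    # backslashes the command contains.
    return (
        f"subprocess.run({cmd!r},\n"
        "               shell=True,\n"
        "               check=True,\n"
        "               executable='/bin/bash')\n"
    )


if __name__ == "__main__":
    # A command with triple quotes and a trailing backslash, chosen to stress
    # the quoting.
    tricky = "echo '''quoted''' \\"
    source = "import subprocess\n" + emit_subprocess_call(tricky)
    # compile() raises SyntaxError if the generated script is malformed.
    compile(source, "<generated>", "exec")
    print(source)
```

The generated call keeps `shell=True`, `check=True`, and `executable='/bin/bash'`, so it behaves like the original for ordinary commands. Only the quoting of the command string differs.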
12 changes: 12 additions & 0 deletions
doc/source/data/examples/unstructured-data-ingestion/ci/tests.sh
@@ -0,0 +1,12 @@

    #!/bin/bash

    # Don't use nbconvert or jupytext unless you're willing
    # to check each subprocess unit and validate that errors
    # aren't being consumed/hidden

    set -exo pipefail

    # TODO once runnable on nightly, uncomment these lines to properly test
    python ci/nb2py.py content/unstructured-data-ingestion.ipynb unstructured-data-ingestion.py
    python unstructured-data-ingestion.py
    rm unstructured-data-ingestion.py
Aydin-ab marked this conversation as resolved.
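
The comment in `tests.sh` warns about errors being consumed or hidden, which is why the converter emits `subprocess.run(..., check=True)`. As a small illustration of the difference, assuming only the Python standard library, the sketch below runs a child process that exits with a non-zero status, first without `check=True` and then with it. The child command is a stand-in for a failing notebook cell, not something taken from the PR.

```python
import subprocess
import sys

# A child process that exits with status 3, standing in for a failing cell.
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]

# Without check=True the failure is only visible if the caller inspects
# returncode; the script itself carries on.
result = subprocess.run(failing)
print("returncode without check:", result.returncode)

# With check=True the same failure raises CalledProcessError, so an
# unhandled error stops the generated script (and the CI job) right away.
try:
    subprocess.run(failing, check=True)
except subprocess.CalledProcessError as err:
    print("raised with check=True, returncode:", err.returncode)
```

Combined with `set -exo pipefail` in the shell wrapper, a failing command in any converted cell causes the whole test to fail rather than pass silently.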
14 changes: 14 additions & 0 deletions
doc/source/data/examples/unstructured-data-ingestion/configs/aws.yaml
@@ -0,0 +1,14 @@

    head_node_type:
      name: head
      instance_type: m5.2xlarge
      resources:
        cpu: 0
        gpu: 0
    worker_node_types:
      - name: 8CPU-32GB
        instance_type: m5.2xlarge
        min_workers: 10
        max_workers: 10

    flags:
      allow-cross-zone-autoscaling: true
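
To make the sizing of this compute config concrete, the sketch below works out how many CPUs are available for scheduling. It rests on assumptions that are not stated in the PR: that an explicit `resources.cpu` value overrides the instance's default CPU count (so the head node advertises none), and that each `m5.2xlarge` worker contributes 8 vCPUs. The `VCPUS` table and the function `schedulable_cpus` are illustrative names, and the config is written as a Python dict to keep the example free of third-party dependencies.

```python
# vCPU counts per instance type, hard-coded for illustration only.
VCPUS = {"m5.2xlarge": 8, "n1-standard-8": 8}

# The configs/aws.yaml content, transcribed as a plain dict.
config = {
    "head_node_type": {
        "name": "head",
        "instance_type": "m5.2xlarge",
        "resources": {"cpu": 0, "gpu": 0},
    },
    "worker_node_types": [
        {
            "name": "8CPU-32GB",
            "instance_type": "m5.2xlarge",
            "min_workers": 10,
            "max_workers": 10,
        },
    ],
}


def cpus_per_node(node: dict, vcpus: dict) -> int:
    """Use the explicit resources.cpu override if present, else the instance default."""
    resources = node.get("resources", {})
    if "cpu" in resources:
        return resources["cpu"]
    return vcpus[node["instance_type"]]


def schedulable_cpus(cfg: dict, vcpus: dict = VCPUS) -> tuple:
    """Return (minimum, maximum) CPUs across the head node and all worker groups."""
    low = high = cpus_per_node(cfg["head_node_type"], vcpus)
    for group in cfg["worker_node_types"]:
        per_node = cpus_per_node(group, vcpus)
        low += per_node * group["min_workers"]
        high += per_node * group["max_workers"]
    return low, high


print(schedulable_cpus(config))  # (80, 80): 0 on the head + 10 workers x 8 vCPUs
```

Because `min_workers` equals `max_workers`, the cluster size is fixed at 10 workers and the minimum and maximum coincide.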
14 changes: 14 additions & 0 deletions
doc/source/data/examples/unstructured-data-ingestion/configs/gce.yaml
@@ -0,0 +1,14 @@

    head_node_type:
      name: head
      instance_type: n1-standard-8
      resources:
        cpu: 0
        gpu: 0
    worker_node_types:
      - name: 8CPU-32GB
        instance_type: n1-standard-8
        min_workers: 10
        max_workers: 10

    flags:
      allow-cross-zone-autoscaling: true