Merged
77 commits
7a38e31
remove non existent xgboost example
Aydin-ab Sep 29, 2025
6d54a5a
init readme
Aydin-ab Sep 29, 2025
32e9ce1
use jupyter nbconvert
Aydin-ab Sep 29, 2025
a0bb0f3
add configs/ folder
Aydin-ab Sep 29, 2025
565865f
add ci/ folder
Aydin-ab Sep 29, 2025
1ec3fbc
add release test configs
Aydin-ab Sep 29, 2025
2d3ac7f
add to examples.yml index
Aydin-ab Sep 30, 2025
6163337
discover to bazel build
Aydin-ab Sep 30, 2025
a37713f
fix typo
Aydin-ab Oct 1, 2025
0415662
new notebook content
Aydin-ab Oct 1, 2025
977fe80
fix: use correct bazel build file
Aydin-ab Oct 1, 2025
437be2c
rename notebook to avoid bazel error + vale compliant
Aydin-ab Oct 2, 2025
f92e626
add new framework and use case
Aydin-ab Oct 2, 2025
9220d66
change byod type to cpu
Aydin-ab Oct 2, 2025
da27fb3
[cursor] Apply documentation style guide to Ray Data ETL notebook
Aydin-ab Oct 2, 2025
6cac6fc
fix test notebook file name
Aydin-ab Oct 3, 2025
274a718
remove docusaurus formatting (cant display it on anyscale template fr…
Aydin-ab Oct 3, 2025
67e0345
angelina review: Apply suggestions from code review
Aydin-ab Oct 4, 2025
5ef92c8
disable tests for now
Aydin-ab Oct 7, 2025
1498811
updated readme
Aydin-ab Oct 7, 2025
657a9b9
rename folder + add Run on anyscale button
Aydin-ab Oct 7, 2025
2e5ae2f
removing reference to xgboost tutorial that doesnt exist anymore
Aydin-ab Oct 7, 2025
8834f48
refactor to new folder name'
Aydin-ab Oct 7, 2025
353301e
lint issues
Aydin-ab Oct 7, 2025
5227d8e
Merge branch 'master' into add-etl-tpch-template
Aydin-ab Oct 7, 2025
e28d352
add groupby to vale vocab
Aydin-ab Oct 7, 2025
0d3dc9b
rename filegroup
Aydin-ab Oct 21, 2025
867187d
move to content/ folder
Aydin-ab Nov 5, 2025
ea068ae
remove batch-inference-optimization
Aydin-ab Nov 6, 2025
fff72a8
move to content folder
Aydin-ab Nov 6, 2025
2ab8cdf
nitpicks
Aydin-ab Nov 6, 2025
b5b92c4
sync README.md
Aydin-ab Nov 6, 2025
7745556
updated examples.yml
Aydin-ab Nov 6, 2025
80ca720
update build.sh to minimal
Aydin-ab Nov 6, 2025
e22fae3
update tests
Aydin-ab Nov 6, 2025
eeba5c3
update byod_* images
Aydin-ab Nov 6, 2025
1abf37b
configure tests in releaase_tests.yaml
Aydin-ab Nov 6, 2025
38de677
adding general examples filegroup for ci config yaml files
Aydin-ab Nov 6, 2025
3c92745
move filegroup to doc BUILD.bazel
Aydin-ab Nov 6, 2025
f678fa5
vale compliant
Aydin-ab Nov 6, 2025
ee0b5e0
fix ref to new header titles
Aydin-ab Nov 7, 2025
df8e2ec
add use case and framework for new templates
Aydin-ab Nov 7, 2025
f118a85
remove .ipynb and .md from toctree (generated separately with example…
Aydin-ab Nov 7, 2025
f9623ad
fix path to README.md in conf.py
Aydin-ab Nov 7, 2025
372ac13
use closer BUILD.bazel
Aydin-ab Nov 7, 2025
ecb994d
remove . from glob patterns
Aydin-ab Nov 7, 2025
e6d06c1
add back ray-overview examples configs
Aydin-ab Nov 7, 2025
56f8df5
use orphan metadata insteaed of sclude
Aydin-ab Nov 10, 2025
e3129f6
remove etl template
Aydin-ab Nov 10, 2025
a25ba24
adding debug logs (TODO: remove)
Aydin-ab Nov 10, 2025
e95c723
sync README.md
Aydin-ab Nov 10, 2025
60a6041
finish removing etl template
Aydin-ab Nov 10, 2025
33715fb
Revert "adding debug logs (TODO: remove)"
Aydin-ab Nov 10, 2025
82d5abe
remove superfluous/LLM verbosity
Aydin-ab Nov 10, 2025
5dc0be4
update section ref
Aydin-ab Nov 10, 2025
0c7dbfb
change to ray llm image
Aydin-ab Nov 13, 2025
785ad62
Merge branch 'master' into add-etl-tpch-template
Aydin-ab Nov 13, 2025
81eab1c
fix cut config
Aydin-ab Nov 13, 2025
b1353b3
readding xgboost example (not purpose of this PR)
Aydin-ab Nov 13, 2025
216a0de
Merge branch 'master' into add-etl-tpch-template
Aydin-ab Nov 14, 2025
35439e1
remove concurrency/num cpu + adding with_columns() suggestion
Aydin-ab Nov 17, 2025
df3f4f1
bugfix adding lit()
Aydin-ab Nov 18, 2025
38cb428
Merge branch 'master' into add-etl-tpch-template
Aydin-ab Nov 21, 2025
1270bef
declutter vale now that we removed the etl tempalte
Aydin-ab Nov 22, 2025
14cf60f
Merge branch 'master' into add-etl-tpch-template
Aydin-ab Nov 24, 2025
f638121
remove quickstart heading + remove STEP comments
Aydin-ab Nov 24, 2025
66f3d15
unfreeze unstructured version (unstable backward compat with nightly …
Aydin-ab Nov 24, 2025
d147bc0
adding buttong to anyscale tmeplate and github
Aydin-ab Nov 24, 2025
03de24d
Apply suggestions from code review
Aydin-ab Nov 24, 2025
886690d
Apply suggestions from code review
Aydin-ab Nov 24, 2025
1bfc1f0
sync markdwon
Aydin-ab Nov 24, 2025
422437b
add helper script
Aydin-ab Nov 24, 2025
218d531
change to cpu image
Aydin-ab Nov 24, 2025
214dd48
use [pdf]
Aydin-ab Nov 25, 2025
f166431
reinstall pandas to avoid deps errors
Aydin-ab Nov 25, 2025
17217f6
using runtime envs in ray init(): At this point, with our anyscale/ra…
Aydin-ab Nov 26, 2025
46281bb
nitpick remove todo comment
Aydin-ab Nov 26, 2025
10 changes: 7 additions & 3 deletions .vale/styles/config/vocabularies/Data/accept.txt
@@ -1,13 +1,15 @@
assess_quality
autoscaler
[Aa]vro
[Bb]ackpressure
Dask
[Dd]ata('s)?
[Dd]atasource(s)?
[Dd]iscretizer(s)?
docstrings
dtype
FLAC
[Gg]roupby
[Gg]roup[bB]y
[Hh]asher(s)?
[Hh]udi
[Ii]ndexable
@@ -19,14 +21,18 @@ MCAP
Modin
[Mm]ultiget(s)?
ndarray(s)?
NLP
[Oo]utqueue(s)?
PDFs
[Pp]ipelined
Predibase('s)?
[Pp]refetch
[Pp]refetching
[Pp]reprocess
[Pp]reprocessor(s)?
process_file
[Pp]ushdown
queryable
RGB
runai
[Ss]calers
@@ -36,5 +42,3 @@ UDF(s)?
VLM(s)?
XGBoost
YOLO
[Ss]harding
[Ss]harded
8 changes: 7 additions & 1 deletion doc/BUILD.bazel
@@ -700,9 +700,15 @@ doctest(
    env = {"RAY_TRAIN_V2_ENABLED": "1"},
)

# --------------------------------------------------------------------
# Discover the Anyscale Jobs compute configs .yaml for release tests in CI
# --------------------------------------------------------------------

filegroup(
    name = "example_configs",
    srcs = glob(["source/ray-overview/examples/**/*.yaml"]),
    srcs = glob([
        "source/ray-overview/examples/**/*.yaml"
    ]),
    visibility = ["//release:__pkg__"],
)
1 change: 1 addition & 0 deletions doc/source/conf.py
@@ -235,6 +235,7 @@ def __init__(self, version: str):
    "train/examples/**/README.md",
    "serve/tutorials/deployment-serve-llm/README.*",
    "serve/tutorials/deployment-serve-llm/*/notebook.ipynb",
    "data/examples/**/content/README.md",
    "ray-overview/examples/llamafactory-llm-fine-tune/README.ipynb",
    "ray-overview/examples/llamafactory-llm-fine-tune/**/*.ipynb",
] + autogen_files
6 changes: 6 additions & 0 deletions doc/source/custom_directives.py
@@ -452,6 +452,10 @@ class UseCase(ExampleEnum):
    GENERATIVE_AI = "Generative AI"
    COMPUTER_VISION = "Computer Vision"
    NATURAL_LANGUAGE_PROCESSING = "Natural Language Processing"
    ETL = "ETL"
    DATA_INGESTION = "Data Ingestion"
    DATA_WAREHOUSING = "Data Warehousing"
    DOCUMENT_PROCESSING = "Document Processing"

    @classmethod
    def formatted_name(cls):
@@ -493,7 +497,9 @@ class Framework(ExampleEnum):
    HUGGINGFACE = "Hugging Face"
    DATAJUICER = "Data-Juicer"
    VLLM = "vLLM"
    PANDAS = "Pandas"
    ANY = "Any"
    UNSTRUCTURED = "Unstructured"

    @classmethod
    def formatted_name(cls):
9 changes: 9 additions & 0 deletions doc/source/data/examples.yml
@@ -45,3 +45,12 @@ examples:
    frameworks:
      - xgboost
    link: ../train/examples/xgboost/distributed-xgboost-lightgbm
  - title: Unstructured Data Ingestion and Processing
    skill_level: advanced
    frameworks:
      - Transformers
      - Unstructured
    use_cases:
      - document processing
      - data ingestion
    link: examples/unstructured-data-ingestion/content/unstructured-data-ingestion
9 changes: 9 additions & 0 deletions doc/source/data/examples/BUILD.bazel
@@ -17,3 +17,12 @@ py_test_run_all_notebooks(
    data = ["//doc/source/data/examples:data_examples"],
    tags = ["exclusive", "team:data", "gpu"],
)

filegroup(
    name = "data_example_configs",
    srcs = glob([
        "**/ci/aws.yaml",
        "**/ci/gce.yaml"
    ]),
    visibility = ["//release:__pkg__"],
)
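The two glob patterns in `data_example_configs` can be approximated outside Bazel to see which files they would pick up. The sketch below uses Python's `pathlib` on a throwaway directory; the folder names (`example-a`, `group/example-b`) are hypothetical, and `pathlib` globbing is only an approximation of Bazel's `glob`, which, for instance, doesn't descend into directories that are themselves Bazel packages.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical example layout; only files under a ci/ folder named
    # aws.yaml or gce.yaml should be selected.
    for rel in [
        "example-a/ci/aws.yaml",
        "example-a/ci/gce.yaml",
        "example-a/configs/aws.yaml",
        "group/example-b/ci/aws.yaml",
        "example-a/ci/build.sh",
    ]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    # Same patterns as the filegroup above.
    patterns = ["**/ci/aws.yaml", "**/ci/gce.yaml"]
    matched = sorted(
        p.relative_to(root).as_posix()
        for pattern in patterns
        for p in root.glob(pattern)
    )

print(matched)
```

In this layout the `configs/aws.yaml` file and the `build.sh` script are left out, which is the narrowing the patterns appear intended to achieve.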
@@ -0,0 +1,17 @@
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-west-2

head_node_type:
  name: head
  instance_type: m5.2xlarge
  resources:
    cpu: 0
    gpu: 0
worker_node_types:
  - name: 8CPU-32GB
    instance_type: m5.2xlarge
    min_workers: 10
    max_workers: 10

flags:
  allow-cross-zone-autoscaling: true
@@ -0,0 +1,6 @@
#!/bin/bash

set -exo pipefail

# Install runtime deps
pip install unstructured
@@ -0,0 +1,17 @@
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-central1

head_node_type:
  name: head
  instance_type: n1-standard-8
  resources:
    cpu: 0
    gpu: 0
worker_node_types:
  - name: 8CPU-32GB
    instance_type: n1-standard-8
    min_workers: 10
    max_workers: 10

flags:
  allow-cross-zone-autoscaling: true
@@ -0,0 +1,86 @@
#!/usr/bin/env python3
import argparse
import nbformat


def convert_notebook(
    input_path: str, output_path: str, ignore_cmds: bool = False
) -> None:
    """
    Read a Jupyter notebook and write a Python script, converting all %%bash
    cells and IPython "!" commands into subprocess.run calls that raise on error.
    Cells that load or autoreload extensions are ignored.
    """
    nb = nbformat.read(input_path, as_version=4)
    with open(output_path, "w") as out:
        for cell in nb.cells:
            # Only process code cells
            if cell.cell_type != "code":
                continue

            lines = cell.source.splitlines()
            # Skip cells that load or autoreload extensions
            if any(
                l.strip().startswith("%load_ext autoreload")
                or l.strip().startswith("%autoreload all")
                for l in lines
            ):
                continue

            # Detect a %%bash cell
            if lines and lines[0].strip().startswith("%%bash"):
                if ignore_cmds:
                    continue
                bash_script = "\n".join(lines[1:]).rstrip()
                out.write("import subprocess\n")
                out.write(
                    f"subprocess.run(r'''{bash_script}''',\n"
                    " shell=True,\n"
                    " check=True,\n"
                    " executable='/bin/bash')\n\n"
                )
            else:
                # Detect any IPython '!' shell commands in code lines
                has_bang = any(line.lstrip().startswith("!") for line in lines)
                if has_bang:
                    if ignore_cmds:
                        continue
                    out.write("import subprocess\n")
                    for line in lines:
                        stripped = line.lstrip()
                        if stripped.startswith("!"):
                            cmd = stripped[1:].lstrip()
                            out.write(
                                f"subprocess.run(r'''{cmd}''',\n"
                                " shell=True,\n"
                                " check=True,\n"
                                " executable='/bin/bash')\n"
                            )
                        else:
                            out.write(line.rstrip() + "\n")
                    out.write("\n")
                else:
                    # Regular Python cell:
                    code = cell.source.rstrip()
                    # Example of filtering cells by content
                    # if "client.chat.completions.create" in code:
                    #     continue  # Model isn't deployed in CI so skip cells calling the service
                    # else, dump as-is
                    out.write(cell.source.rstrip() + "\n\n")


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Convert a Jupyter notebook to a Python script, preserving bash cells and '!' commands as subprocess calls unless ignored with --ignore-cmds."
    )
    parser.add_argument("input_nb", help="Path to the input .ipynb file")
    parser.add_argument("output_py", help="Path for the output .py script")
    parser.add_argument(
        "--ignore-cmds", action="store_true", help="Ignore bash cells and '!' commands"
    )
    args = parser.parse_args()
    convert_notebook(args.input_nb, args.output_py, ignore_cmds=args.ignore_cmds)


if __name__ == "__main__":
    main()
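To make the `!` handling in `convert_notebook` easier to follow, here is a minimal, standard-library-only sketch of the per-line rewrite. `translate_line` is a hypothetical helper written for illustration, not a function from this PR, and the cell source is made up; it mirrors the branch that turns an IPython `!` command into a `subprocess.run(..., check=True)` call and passes other lines through unchanged.

```python
def translate_line(line: str) -> str:
    """Hypothetical helper mirroring the '!' branch of convert_notebook."""
    stripped = line.lstrip()
    if stripped.startswith("!"):
        # Drop the leading '!' and wrap the command in a subprocess call
        # that raises on a non-zero exit status.
        cmd = stripped[1:].lstrip()
        return (
            f"subprocess.run(r'''{cmd}''',\n"
            " shell=True,\n"
            " check=True,\n"
            " executable='/bin/bash')\n"
        )
    # Plain Python lines are written back with trailing whitespace removed.
    return line.rstrip() + "\n"


# Made-up cell source: one shell command and one Python statement.
cell_source = "!echo hello\nx = 1 + 1"
script = "import subprocess\n" + "".join(
    translate_line(line) for line in cell_source.splitlines()
)
print(script)
```

Because the line is left-stripped before the check, a `!` command nested inside an indented Python block would lose its indentation in this sketch, so it's best suited to cells where shell commands sit at the top level.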
@@ -0,0 +1,12 @@
#!/bin/bash

# Don't use nbconvert or jupytext unless you're willing
# to check each subprocess unit and validate that errors
# aren't being consumed/hidden

set -exo pipefail

# TODO once runnable on nightly, uncomment these lines to properly test
python ci/nb2py.py content/unstructured-data-ingestion.ipynb unstructured-data-ingestion.py
python unstructured-data-ingestion.py
rm unstructured-data-ingestion.py
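The warning about errors being "consumed/hidden" comes down to whether a failing subprocess propagates its exit status. As a small illustration (not part of the PR), the sketch below runs a child process that exits with status 3, first without and then with `check=True`; the child command is a stand-in for a failing notebook cell.

```python
import subprocess
import sys

# A child process that exits with a non-zero status, standing in for a
# failing notebook cell.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]

# Without check=True the failure is only visible if the caller inspects
# returncode; the script itself keeps running.
result = subprocess.run(failing_cmd)
silent_returncode = result.returncode

# With check=True the same failure raises CalledProcessError, so the
# generated script exits non-zero and a shell running `set -e` stops.
try:
    subprocess.run(failing_cmd, check=True)
    raised = False
except subprocess.CalledProcessError as err:
    raised = True
    raised_returncode = err.returncode

print(silent_returncode, raised)
```

This is the behavior the generated `subprocess.run(..., check=True)` calls rely on to surface a failed cell to the `set -exo pipefail` test script.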
@@ -0,0 +1,14 @@
head_node_type:
  name: head
  instance_type: m5.2xlarge
  resources:
    cpu: 0
    gpu: 0
worker_node_types:
  - name: 8CPU-32GB
    instance_type: m5.2xlarge
    min_workers: 10
    max_workers: 10

flags:
  allow-cross-zone-autoscaling: true
@@ -0,0 +1,14 @@
head_node_type:
  name: head
  instance_type: n1-standard-8
  resources:
    cpu: 0
    gpu: 0
worker_node_types:
  - name: 8CPU-32GB
    instance_type: n1-standard-8
    min_workers: 10
    max_workers: 10

flags:
  allow-cross-zone-autoscaling: true