Commit 69927ec

bveeramani and dstrodtman authored and committed

[Data] Fix broken code snippets in user guides (#55519)

In #51334, we discovered we weren't actually testing code snippets in our user guides. As a result, there are several broken code snippets in our guides. This PR fixes some of those code snippets and re-enables testing on the user guides.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>

1 parent 1fbef17 commit 69927ec

File tree

9 files changed (+97, -51 lines)


bazel/python.bzl

Lines changed: 18 additions & 0 deletions

@@ -17,6 +17,24 @@ def _convert_target_to_import_path(t):
     # 3) Replace '/' with '.' to form an import path.
     return t.replace("/", ".")
 
+def doctest_each(files, gpu = False, deps=[], srcs=[], data=[], args=[], size="medium", tags=[], pytest_plugin_file="//bazel:default_doctest_pytest_plugin.py", **kwargs):
+    # Unlike the `doctest` macro, `doctest_each` runs `pytest` on each file separately.
+    # This is useful to run tests in parallel and more clearly report the test results.
+    for file in files:
+        doctest(
+            files = [file],
+            gpu = gpu,
+            name = paths.split_extension(file)[0],
+            deps = deps,
+            srcs = srcs,
+            data = data,
+            args = args,
+            size = size,
+            tags = tags,
+            pytest_plugin_file = pytest_plugin_file,
+            **kwargs
+        )
+
 def doctest(files, gpu = False, name="doctest", deps=[], srcs=[], data=[], args=[], size="medium", tags=[], pytest_plugin_file="//bazel:default_doctest_pytest_plugin.py", **kwargs):
     # NOTE: If you run `pytest` on `__init__.py`, it tries to test all files in that
     # package. We don't want that, so we exclude it from the list of input files.
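The macro above names each generated target after its file path with the extension stripped, via `paths.split_extension(file)[0]`. A minimal Python sketch of that naming scheme (`doctest_target_names` is a hypothetical helper mirroring the Starlark call, not part of the repo):

```python
import os

def doctest_target_names(files):
    # Mirror Starlark's `paths.split_extension(file)[0]`: one target per
    # file, named after the file path without its extension.
    return [os.path.splitext(f)[0] for f in files]

names = doctest_target_names([
    "source/data/loading-data.rst",
    "source/data/saving-data.rst",
])
print(names)  # ['source/data/loading-data', 'source/data/saving-data']
```

Because each file becomes its own target, a failure in one guide no longer hides failures in the others, and Bazel can schedule the per-file tests in parallel.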

doc/BUILD.bazel

Lines changed: 4 additions & 11 deletions

@@ -1,6 +1,6 @@
 load("@py_deps_buildkite//:requirements.bzl", ci_require = "requirement")
 load("@rules_python//python:defs.bzl", "py_test")
-load("//bazel:python.bzl", "doctest", "py_test_run_all_notebooks", "py_test_run_all_subdirectory")
+load("//bazel:python.bzl", "doctest", "doctest_each", "py_test_run_all_notebooks", "py_test_run_all_subdirectory")
 
 exports_files(["test_myst_doc.py"])
 
@@ -480,8 +480,7 @@ doctest(
     tags = ["team:core"],
 )
 
-doctest(
-    name = "doctest[data]",
+doctest_each(
     files = glob(
         include = [
             "source/data/**/*.md",
@@ -492,15 +491,9 @@ doctest(
             "source/data/batch_inference.rst",
             "source/data/transforming-data.rst",
             # These tests are currently failing.
-            "source/data/loading-data.rst",
-            "source/data/data-internals.rst",
-            "source/data/inspecting-data.rst",
-            "source/data/loading-data.rst",
-            "source/data/performance-tips.rst",
-            "source/data/saving-data.rst",
-            "source/data/working-with-images.rst",
             "source/data/working-with-llms.rst",
-            "source/data/working-with-pytorch.rst",
+            # These don't contain code snippets.
+            "source/data/api/**/*.rst",
         ],
     ),
     pytest_plugin_file = "//python/ray/data:tests/doctest_pytest_plugin.py",

doc/source/data/data-internals.rst

Lines changed: 8 additions & 1 deletion

@@ -179,12 +179,19 @@ To add custom optimization rules, implement a class that extends ``Rule`` and co
 
     import ray
     from ray.data._internal.logical.interfaces import Rule
+    from ray.data._internal.logical.optimizers import get_logical_ruleset
 
     class CustomRule(Rule):
        def apply(self, plan):
            ...
 
-    ray.data._internal.logical.optimizers.DEFAULT_LOGICAL_RULES.append(CustomRule)
+    logical_ruleset = get_logical_ruleset()
+    logical_ruleset.add(CustomRule)
+
+.. testcode::
+    :hide:
+
+    logical_ruleset.remove(CustomRule)
 
 Types of physical operators
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
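The updated doc snippet moves from appending to a module-level rule list to a ruleset object with `add`/`remove`, which lets the hidden cleanup testcode undo the registration. A toy Python sketch of that interface (the `Ruleset` class here is hypothetical, not Ray's internal implementation):

```python
class Rule:
    """Base class for the sketch; real rules extend Ray's internal Rule."""
    def apply(self, plan):
        raise NotImplementedError

class Ruleset:
    """Toy registry exposing the add/remove interface used in the doc snippet."""
    def __init__(self):
        self._rules = []

    def add(self, rule_cls):
        self._rules.append(rule_cls)

    def remove(self, rule_cls):
        self._rules.remove(rule_cls)

    def optimize(self, plan):
        # Apply each registered rule in order.
        for rule_cls in self._rules:
            plan = rule_cls().apply(plan)
        return plan

class CustomRule(Rule):
    def apply(self, plan):
        return plan  # no-op rule for the sketch

ruleset = Ruleset()
ruleset.add(CustomRule)
assert ruleset.optimize("plan") == "plan"
ruleset.remove(CustomRule)  # mirrors the hidden cleanup testcode
```

The `remove` step matters for doctests specifically: without it, a rule registered by one snippet would leak into every later snippet in the same test process.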

doc/source/data/inspecting-data.rst

Lines changed: 4 additions & 2 deletions

@@ -123,12 +123,11 @@ of the returned batch, set ``batch_format``.
     print(batch)
 
 .. testoutput::
-    :options: +NORMALIZE_WHITESPACE
+    :options: +MOCK
 
        sepal length (cm)  sepal width (cm)  ...  petal width (cm)  target
     0                5.1               3.5  ...               0.2       0
     1                4.9               3.0  ...               0.2       0
-    <BLANKLINE>
 
 For more information on working with batches, see
 :ref:`Transforming batches <transforming_batches>` and
@@ -143,7 +142,10 @@ Ray Data calculates statistics during execution for each operator, such as wall
 To view stats about your :class:`Datasets <ray.data.Dataset>`, call :meth:`Dataset.stats() <ray.data.Dataset.stats>` on an executed dataset. The stats are also persisted under `/tmp/ray/session_*/logs/ray-data/ray-data.log`.
 For more on how to read this output, see :ref:`Monitoring Your Workload with the Ray Data Dashboard <monitoring-your-workload>`.
 
+.. This snippet below is skipped because of https://github.com/ray-project/ray/issues/54101.
+
 .. testcode::
+    :skipif: True
 
     import ray
     import datasets

doc/source/data/loading-data.rst

Lines changed: 23 additions & 11 deletions

@@ -486,13 +486,16 @@ Ray Data interoperates with distributed data processing frameworks like `Daft <h
 :func:`~ray.data.from_daft`. This function executes the Daft dataframe and constructs a ``Dataset`` backed by the resultant arrow data produced
 by your Daft query.
 
+.. warning::
+    :func:`~ray.data.from_daft` doesn't work with PyArrow 14 and later. For more
+    information, see `this issue <https://github.com/ray-project/ray/issues/54837>`__.
+
 .. testcode::
+    :skipif: True
 
     import daft
     import ray
 
-    ray.init()
-
     df = daft.from_pydict({"int_col": [i for i in range(10000)], "str_col": [str(i) for i in range(10000)]})
     ds = ray.data.from_daft(df)
 
@@ -512,7 +515,12 @@ Ray Data interoperates with distributed data processing frameworks like `Daft <h
 ``Dataset`` backed by the distributed Pandas DataFrame partitions that underly
 the Dask DataFrame.
 
+..
+    We skip the code snippet below because `from_dask` doesn't work with PyArrow
+    14 and later. For more information, see https://github.com/ray-project/ray/issues/54837
+
 .. testcode::
+    :skipif: True
 
     import dask.dataframe as dd
     import pandas as pd
@@ -569,21 +577,21 @@ Ray Data interoperates with distributed data processing frameworks like `Daft <h
 call :func:`~ray.data.read_iceberg`. This function creates a ``Dataset`` backed by
 the distributed files that underlie the Iceberg table.
 
-..
-
 .. testcode::
     :skipif: True
 
-    >>> import ray
-    >>> from pyiceberg.expressions import EqualTo
-    >>> ds = ray.data.read_iceberg(
-    ...     table_identifier="db_name.table_name",
-    ...     row_filter=EqualTo("column_name", "literal_value"),
-    ...     catalog_kwargs={"name": "default", "type": "glue"}
-    ... )
+    import ray
+    from pyiceberg.expressions import EqualTo
 
+    ds = ray.data.read_iceberg(
+        table_identifier="db_name.table_name",
+        row_filter=EqualTo("column_name", "literal_value"),
+        catalog_kwargs={"name": "default", "type": "glue"}
+    )
+    ds.show(3)
 
 .. testoutput::
+    :options: +MOCK
 
     {'col1': 0, 'col2': '0'}
     {'col1': 1, 'col2': '1'}
@@ -622,6 +630,7 @@ Ray Data interoperates with distributed data processing frameworks like `Daft <h
 DataFrame.
 
 .. testcode::
+    :skipif: True
 
     import mars
     import mars.dataframe as md
@@ -668,7 +677,10 @@ Ray Data interoperates with HuggingFace, PyTorch, and TensorFlow datasets.
 `IterableDatasetDict <https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.IterableDatasetDict>`_
 objects aren't supported.
 
+.. This snippet below is skipped because of https://github.com/ray-project/ray/issues/54837.
+
 .. testcode::
+    :skipif: True
 
     import ray.data
     from datasets import load_dataset
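The Daft snippet in the first hunk builds a 10,000-row, two-column table from a dict of columns. In plain Python, the columnar layout passed to `daft.from_pydict` is just a dict mapping column names to equal-length lists:

```python
# Columnar layout like the one passed to `daft.from_pydict` above:
# column name -> list of values, all columns the same length.
data = {
    "int_col": [i for i in range(10000)],
    "str_col": [str(i) for i in range(10000)],
}

num_rows = len(data["int_col"])
# Every column must have the same number of rows.
assert all(len(col) == num_rows for col in data.values())
print(num_rows)  # 10000
```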

doc/source/data/performance-tips.rst

Lines changed: 12 additions & 6 deletions

@@ -51,7 +51,7 @@ For example, the following code batches multiple files into the same read task t
     ray.init(num_cpus=2)
 
     # Repeat the iris.csv file 16 times.
-    ds = ray.data.read_csv(["example://iris.csv"] * 16)
+    ds = ray.data.read_csv(["s3://anonymous@ray-example-data/iris.csv"] * 16)
     print(ds.materialize())
 
 .. testoutput::
@@ -81,7 +81,7 @@ Notice how the number of output blocks is equal to ``override_num_blocks`` in th
     ray.init(num_cpus=2)
 
     # Repeat the iris.csv file 16 times.
-    ds = ray.data.read_csv(["example://iris.csv"] * 16, override_num_blocks=16)
+    ds = ray.data.read_csv(["s3://anonymous@ray-example-data/iris.csv"] * 16, override_num_blocks=16)
     print(ds.materialize())
 
 .. testoutput::
@@ -143,7 +143,7 @@ For example, the following code executes :func:`~ray.data.read_csv` with only on
     # Pretend there are two CPUs.
     ray.init(num_cpus=2)
 
-    ds = ray.data.read_csv("example://iris.csv").map(lambda row: row)
+    ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv").map(lambda row: row)
     print(ds.materialize().stats())
 
 .. testoutput::
@@ -171,7 +171,7 @@ For example, this code sets the number of files equal to ``override_num_blocks``
     # Pretend there are two CPUs.
     ray.init(num_cpus=2)
 
-    ds = ray.data.read_csv("example://iris.csv", override_num_blocks=1).map(lambda row: row)
+    ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv", override_num_blocks=1).map(lambda row: row)
     print(ds.materialize().stats())
 
 .. testoutput::
@@ -205,15 +205,21 @@ calling :func:`~ray.data.Dataset.select_columns`, since column selection is push
 .. testcode::
 
     import ray
+
     # Read just two of the five columns of the Iris dataset.
-    ray.data.read_parquet(
+    ds = ray.data.read_parquet(
         "s3://anonymous@ray-example-data/iris.parquet",
         columns=["sepal.length", "variety"],
    )
+
+    print(ds.schema())
 
 .. testoutput::
 
-    Dataset(num_rows=150, schema={sepal.length: double, variety: string})
+    Column        Type
+    ------        ----
+    sepal.length  double
+    variety       string
 
 
 .. _data_memory:
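The two ``override_num_blocks`` snippets above contrast default batching (several files per read task) with one block per file. A toy partitioner illustrates the difference; `split_into_blocks` is a hypothetical helper for illustration, not a Ray API:

```python
def split_into_blocks(paths, override_num_blocks):
    # Distribute file paths round-robin into the requested number of
    # blocks, loosely mimicking how a reader splits its inputs when the
    # block count is overridden.
    blocks = [[] for _ in range(override_num_blocks)]
    for i, path in enumerate(paths):
        blocks[i % override_num_blocks].append(path)
    return blocks

paths = ["s3://anonymous@ray-example-data/iris.csv"] * 16

# Default-style batching: fewer blocks, several files per block.
print(len(split_into_blocks(paths, 4)))   # 4

# override_num_blocks=16: one file per block.
print(len(split_into_blocks(paths, 16)))  # 16
```

Fewer blocks mean less scheduling overhead; more blocks mean more parallelism, which is the trade-off the guide's two examples demonstrate.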

doc/source/data/saving-data.rst

Lines changed: 25 additions & 17 deletions

@@ -228,7 +228,7 @@ number of files & their sizes (since every block could potentially carry the row
     print_directory_tree("/tmp/sales_partitioned")
 
 .. testoutput::
-    :options: +NORMALIZE_WHITESPACE
+    :options: +MOCK
 
     sales_partitioned/
         city=NYC/
@@ -301,24 +301,10 @@ Ray Data interoperates with distributed data processing frameworks like `Daft <h
             ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
 
             df = ds.to_daft()
-
-    .. tab-item:: Dask
-
-        To convert a :class:`~ray.data.dataset.Dataset` to a
-        `Dask DataFrame <https://docs.dask.org/en/stable/dataframe.html>`__, call
-        :meth:`Dataset.to_dask() <ray.data.Dataset.to_dask>`.
-
-        .. testcode::
-
-            import ray
-
-            ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
-
-            df = ds.to_dask()
-
-            df
+            print(df)
 
         .. testoutput::
+            :options: +MOCK
 
             ╭───────────────────┬──────────────────┬───────────────────┬──────────────────┬────────╮
             │ sepal length (cm) ┆ sepal width (cm) ┆ petal length (cm) ┆ petal width (cm) ┆ target │
@@ -345,13 +331,33 @@ Ray Data interoperates with distributed data processing frameworks like `Daft <h
             (Showing first 8 of 150 rows)
 
 
+    .. tab-item:: Dask
+
+        To convert a :class:`~ray.data.dataset.Dataset` to a
+        `Dask DataFrame <https://docs.dask.org/en/stable/dataframe.html>`__, call
+        :meth:`Dataset.to_dask() <ray.data.Dataset.to_dask>`.
+
+        ..
+            We skip the code snippet below because `to_dask` doesn't work with PyArrow
+            14 and later. For more information, see https://github.com/ray-project/ray/issues/54837
+
+        .. testcode::
+            :skipif: True
+
+            import ray
+
+            ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
+
+            df = ds.to_dask()
+
     .. tab-item:: Spark
 
         To convert a :class:`~ray.data.dataset.Dataset` to a `Spark DataFrame
         <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html>`__,
         call :meth:`Dataset.to_spark() <ray.data.Dataset.to_spark>`.
 
         .. testcode::
+            :skipif: True
 
             import ray
             import raydp
@@ -367,6 +373,7 @@ Ray Data interoperates with distributed data processing frameworks like `Daft <h
             df = ds.to_spark(spark)
 
         .. testcode::
+            :skipif: True
             :hide:
 
             raydp.stop_spark()
@@ -390,6 +397,7 @@ Ray Data interoperates with distributed data processing frameworks like `Daft <h
         :meth:`Dataset.to_mars() <ray.data.Dataset.to_mars>`.
 
         .. testcode::
+            :skipif: True
 
             import ray

doc/source/data/working-with-images.rst

Lines changed: 1 addition & 1 deletion

@@ -147,7 +147,7 @@ To view the full list of supported file formats, see the
 
     Column  Type
     ------  ----
-    image   numpy.ndarray(shape=(32, 32, 3), dtype=uint8)
+    img     struct<bytes: binary, path: string>
     label   int64

doc/source/data/working-with-pytorch.rst

Lines changed: 2 additions & 2 deletions

@@ -229,8 +229,8 @@ You can use built-in Torch transforms from ``torchvision``, ``torchtext``, and `
 
     Column          Type
     ------          ----
-    text            <class 'object'>
-    tokenized_text  <class 'object'>
+    text            string
+    tokenized_text  list<item: string>
 
 .. _batch_inference_pytorch:
