
Commit 2f2363d

omatthew98 and angelinalg authored and committed

[Data] [Docs] Update Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages (ray-project#44093)

---------
Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
1 parent 0d4c7bd commit 2f2363d

File tree

6 files changed: +188, -116 lines changed


doc/source/data/custom-datasource-example.rst

Lines changed: 2 additions & 0 deletions

@@ -1,3 +1,5 @@
+.. _custom_datasource:
+
 Advanced: Read and Write Custom File Types
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 

doc/source/data/inspecting-data.rst

Lines changed: 42 additions & 30 deletions

@@ -11,7 +11,7 @@ This guide shows you how to:
 * `Describe datasets <#describing-datasets>`_
 * `Inspect rows <#inspecting-rows>`_
 * `Inspect batches <#inspecting-batches>`_
-* `Inspect execution statistics <#inspecting-stats>`_
+* `Inspect execution statistics <#inspecting-execution-statistics>`_
 
 .. _describing-datasets:
 
@@ -151,46 +151,58 @@ Inspecting execution statistics
 Ray Data calculates statistics during execution for each operator, such as wall clock time and memory usage.
 
 To view stats about your :class:`Datasets <ray.data.Dataset>`, call :meth:`Dataset.stats() <ray.data.Dataset.stats>` on an executed dataset. The stats are also persisted under `/tmp/ray/session_*/logs/ray-data.log`.
+For more on how to read this output, see :ref:`Monitoring Your Workload with the Ray Data Dashboard <monitoring-your-workload>`.
 
 .. testcode::
+
     import ray
-    import time
+    import datasets
+
+    def f(batch):
+        return batch
 
-    def pause(x):
-        time.sleep(.0001)
-        return x
+    def g(row):
+        return True
 
+    hf_ds = datasets.load_dataset("mnist", "mnist")
     ds = (
-        ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
-        .map(lambda x: x)
-        .map(pause)
+        ray.data.from_huggingface(hf_ds["train"])
+        .map_batches(f)
+        .filter(g)
+        .materialize()
     )
 
-    for batch in ds.iter_batches():
-        pass
-
     print(ds.stats())
 
 .. testoutput::
     :options: +MOCK
 
-    Operator 1 ReadCSV->SplitBlocks(4): 1 tasks executed, 4 blocks produced in 0.22s
-    * Remote wall time: 222.1ms min, 222.1ms max, 222.1ms mean, 222.1ms total
-    * Remote cpu time: 15.6ms min, 15.6ms max, 15.6ms mean, 15.6ms total
-    * Peak heap memory usage (MiB): 157953.12 min, 157953.12 max, 157953 mean
-    * Output num rows: 150 min, 150 max, 150 mean, 150 total
-    * Output size bytes: 6000 min, 6000 max, 6000 mean, 6000 total
+    Operator 1 ReadParquet->SplitBlocks(32): 1 tasks executed, 32 blocks produced in 2.92s
+    * Remote wall time: 103.38us min, 1.34s max, 42.14ms mean, 1.35s total
+    * Remote cpu time: 102.0us min, 164.66ms max, 5.37ms mean, 171.72ms total
+    * UDF time: 0us min, 0us max, 0.0us mean, 0us total
+    * Peak heap memory usage (MiB): 266375.0 min, 281875.0 max, 274491 mean
+    * Output num rows per block: 1875 min, 1875 max, 1875 mean, 60000 total
+    * Output size bytes per block: 537986 min, 555360 max, 545963 mean, 17470820 total
+    * Output rows per task: 60000 min, 60000 max, 60000 mean, 1 tasks used
     * Tasks per node: 1 min, 1 max, 1 mean; 1 nodes used
-    * Extra metrics: {'obj_store_mem_freed': 5761}
-
-    Dataset iterator time breakdown:
-    * Total time user code is blocked: 5.68ms
-    * Total time in user code: 0.96us
-    * Total time overall: 238.93ms
-    * Num blocks local: 0
-    * Num blocks remote: 0
-    * Num blocks unknown location: 1
-    * Batch iteration time breakdown (summed across prefetch threads):
-        * In ray.get(): 2.16ms min, 2.16ms max, 2.16ms avg, 2.16ms total
-        * In batch creation: 897.67us min, 897.67us max, 897.67us avg, 897.67us total
-        * In batch formatting: 836.87us min, 836.87us max, 836.87us avg, 836.87us total
+    * Operator throughput:
+        * Ray Data throughput: 20579.80984833993 rows/s
+        * Estimated single node throughput: 44492.67361278733 rows/s
+
+    Operator 2 MapBatches(f)->Filter(g): 32 tasks executed, 32 blocks produced in 3.63s
+    * Remote wall time: 675.48ms min, 1.0s max, 797.07ms mean, 25.51s total
+    * Remote cpu time: 673.41ms min, 897.32ms max, 768.09ms mean, 24.58s total
+    * UDF time: 661.65ms min, 978.04ms max, 778.13ms mean, 24.9s total
+    * Peak heap memory usage (MiB): 152281.25 min, 286796.88 max, 164231 mean
+    * Output num rows per block: 1875 min, 1875 max, 1875 mean, 60000 total
+    * Output size bytes per block: 530251 min, 547625 max, 538228 mean, 17223300 total
+    * Output rows per task: 1875 min, 1875 max, 1875 mean, 32 tasks used
+    * Tasks per node: 32 min, 32 max, 32 mean; 1 nodes used
+    * Operator throughput:
+        * Ray Data throughput: 16512.364546087643 rows/s
+        * Estimated single node throughput: 2352.3683708977856 rows/s
+
+    Dataset throughput:
+    * Ray Data throughput: 11463.372316361854 rows/s
+    * Estimated single node throughput: 25580.963670075285 rows/s
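As an aside on reading the stats output above: each throughput figure is simply rows produced divided by wall-clock seconds. A minimal sketch of that arithmetic (a hypothetical helper, not Ray Data's actual implementation):

```python
def throughput_rows_per_s(num_rows: int, wall_time_s: float) -> float:
    """Rows produced per second of wall-clock time."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return num_rows / wall_time_s

# For example, an operator that produces 60000 rows in 3.0 seconds:
print(throughput_rows_per_s(60000, 3.0))  # 20000.0
```

The "Estimated single node throughput" lines use the same rows-per-second formula, but against the summed per-task wall time rather than elapsed time, which is why it can be higher or lower than the Ray Data figure.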

doc/source/data/loading-data.rst

Lines changed: 75 additions & 42 deletions

@@ -243,7 +243,7 @@ To read formats other than Parquet, see the :ref:`Input/Output reference <input-
         petal.width double
         variety string
 
-    .. tab-item:: ABL
+    .. tab-item:: ABS
 
         To read files from Azure Blob Storage, install the
         `Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage <https://pypi.org/project/adlfs/>`_
@@ -454,6 +454,11 @@ Ray Data interoperates with distributed data processing frameworks like
 :ref:`Dask <dask-on-ray>`, :ref:`Spark <spark-on-ray>`, :ref:`Modin <modin-on-ray>`, and
 :ref:`Mars <mars-on-ray>`.
 
+.. note::
+
+    The Ray Community provides these operations but may not actively maintain them. If you run into issues,
+    create a GitHub issue `here <https://github.com/ray-project/ray/issues>`__.
+
 .. tab-set::
 
     .. tab-item:: Dask
@@ -573,21 +578,25 @@ Ray Data interoperates with distributed data processing frameworks like
 Loading data from ML libraries
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Ray Data interoperates with HuggingFace and TensorFlow datasets.
+Ray Data interoperates with HuggingFace, PyTorch, and TensorFlow datasets.
 
 .. tab-set::
 
     .. tab-item:: HuggingFace
 
-        To convert a 🤗 Dataset to a Ray Datasets, call
+        To convert a HuggingFace Dataset to a Ray Datasets, call
         :func:`~ray.data.from_huggingface`. This function accesses the underlying Arrow
         table and converts it to a Dataset directly.
 
        .. warning::
-            :class:`~ray.data.from_huggingface` doesn't support parallel
-            reads. This isn't an issue with in-memory 🤗 Datasets, but may fail with
-            large memory-mapped 🤗 Datasets. Also, 🤗 ``IterableDataset`` objects aren't
-            supported.
+            :class:`~ray.data.from_huggingface` only supports parallel reads in certain
+            instances, namely for untransformed public HuggingFace Datasets. For those datasets,
+            Ray Data uses `hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
+            to perform a distributed read; otherwise, Ray Data uses a single node read.
+            This behavior shouldn't be an issue with in-memory HuggingFace Datasets, but may cause a failure with
+            large memory-mapped HuggingFace Datasets. Additionally, HuggingFace `DatasetDict <https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.DatasetDict>`_ and
+            `IterableDatasetDict <https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.IterableDatasetDict>`_
+            objects aren't supported.
 
        .. testcode::
 
@@ -603,6 +612,31 @@ Ray Data interoperates with HuggingFace and TensorFlow datasets.
 
            [{'text': ''}, {'text': ' = Valkyria Chronicles III = \n'}]
 
+    .. tab-item:: PyTorch
+
+        To convert a PyTorch dataset to a Ray Dataset, call :func:`~ray.data.from_torch`.
+
+        .. testcode::
+
+            import ray
+            from torch.utils.data import Dataset
+            from torchvision import datasets
+            from torchvision.transforms import ToTensor
+
+            tds = datasets.CIFAR10(root="data", train=True, download=True, transform=ToTensor())
+            ds = ray.data.from_torch(tds)
+
+            print(ds)
+
+        .. testoutput::
+            :options: +MOCK
+
+            Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to data/cifar-10-python.tar.gz
+            100%|███████████████████████| 170498071/170498071 [00:07<00:00, 23494838.54it/s]
+            Extracting data/cifar-10-python.tar.gz to data
+            Dataset(num_rows=50000, schema={item: object})
+
+
    .. tab-item:: TensorFlow
 
        To convert a TensorFlow dataset to a Ray Dataset, call :func:`~ray.data.from_tf`.
@@ -799,45 +833,41 @@ Call :func:`~ray.data.read_sql` to read data from a database that provides a
            query="SELECT title, score FROM movie WHERE year >= 1980",
        )
 
-.. _reading_bigquery:
+    .. tab-item:: BigQuery
 
-Reading BigQuery
-~~~~~~~~~~~~~~~~
+        To read from BigQuery, install the
+        `Python Client for Google BigQuery <https://cloud.google.com/python/docs/reference/bigquery/latest>`_ and the `Python Client for Google BigQueryStorage <https://cloud.google.com/python/docs/reference/bigquerystorage/latest>`_.
 
-To read from BigQuery, install the
-`Python Client for Google BigQuery <https://cloud.google.com/python/docs/reference/bigquery/latest>`_ and the `Python Client for Google BigQueryStorage <https://cloud.google.com/python/docs/reference/bigquerystorage/latest>`_.
-
-.. code-block:: console
-
-    pip install google-cloud-bigquery
-    pip install google-cloud-bigquery-storage
+        .. code-block:: console
 
-To read data from BigQuery, call :func:`~ray.data.read_bigquery` and specify the project id, dataset, and query (if applicable).
+            pip install google-cloud-bigquery
+            pip install google-cloud-bigquery-storage
 
-.. testcode::
-    :skipif: True
+        To read data from BigQuery, call :func:`~ray.data.read_bigquery` and specify the project id, dataset, and query (if applicable).
 
-    import ray
+        .. testcode::
+            :skipif: True
 
-    # Read the entire dataset (do not specify query)
-    ds = ray.data.read_bigquery(
-        project_id="my_gcloud_project_id",
-        dataset="bigquery-public-data.ml_datasets.iris",
-    )
+            import ray
 
-    # Read from a SQL query of the dataset (do not specify dataset)
-    ds = ray.data.read_bigquery(
-        project_id="my_gcloud_project_id",
-        query = "SELECT * FROM `bigquery-public-data.ml_datasets.iris` LIMIT 50",
-    )
+            # Read the entire dataset. Do not specify query.
+            ds = ray.data.read_bigquery(
+                project_id="my_gcloud_project_id",
+                dataset="bigquery-public-data.ml_datasets.iris",
+            )
 
-    # Write back to BigQuery
-    ds.write_bigquery(
-        project_id="my_gcloud_project_id",
-        dataset="destination_dataset.destination_table",
-        overwrite_table=True,
-    )
+            # Read from a SQL query of the dataset. Do not specify dataset.
+            ds = ray.data.read_bigquery(
+                project_id="my_gcloud_project_id",
+                query = "SELECT * FROM `bigquery-public-data.ml_datasets.iris` LIMIT 50",
+            )
 
+            # Write back to BigQuery
+            ds.write_bigquery(
+                project_id="my_gcloud_project_id",
+                dataset="destination_dataset.destination_table",
+                overwrite_table=True,
+            )
 
 .. _reading_mongodb:
 
@@ -928,16 +958,19 @@ Loading other datasources
 
 If Ray Data can't load your data, subclass
 :class:`~ray.data.Datasource`. Then, construct an instance of your custom
-datasource and pass it to :func:`~ray.data.read_datasource`.
+datasource and pass it to :func:`~ray.data.read_datasource`. To write results, you might
+also need to subclass :class:`ray.data.Datasink`. Then, create an instance of your custom
+datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details, see
+:ref:`Advanced: Read and Write Custom File Types <custom_datasource>`.
 
 .. testcode::
    :skipif: True
 
    # Read from a custom datasource.
    ds = ray.data.read_datasource(YourCustomDatasource(), **read_args)
 
-   # Write to a custom datasource.
-   ds.write_datasource(YourCustomDatasource(), **write_args)
+   # Write to a custom datasink.
+   ds.write_datasink(YourCustomDatasink())
 
 Performance considerations
 ==========================
@@ -950,5 +983,5 @@ utilize the cluster, ranging from ``1...override_num_blocks`` tasks. In other wo
 the higher the ``override_num_blocks``, the smaller the data blocks in the Dataset and
 hence more opportunities for parallel execution.
 
-For more information on how to tune the number of output blocks, see
-:ref:`Tuning output blocks for read <read_output_blocks>`.
+For more information on how to tune the number of output blocks and other suggestions
+for optimizing read performance, see `Optimizing reads <performance-tips.html#optimizing-reads>`__.
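The final hunk's point about ``override_num_blocks`` is easy to picture: splitting the same number of rows across more blocks yields smaller blocks and therefore more independent units of parallel work. A toy sketch of that trade-off (illustrative only, not Ray Data's block-splitting code):

```python
def split_into_blocks(total_rows: int, override_num_blocks: int) -> list:
    """Split total_rows into override_num_blocks roughly equal block sizes."""
    if override_num_blocks < 1:
        raise ValueError("override_num_blocks must be at least 1")
    base, extra = divmod(total_rows, override_num_blocks)
    # The first `extra` blocks get one additional row each.
    return [base + 1 if i < extra else base for i in range(override_num_blocks)]

# More blocks -> smaller blocks -> more opportunities for parallel execution.
print(split_into_blocks(150, 4))   # [38, 38, 37, 37]
print(split_into_blocks(150, 10))  # ten blocks of 15 rows each
```

In practice, Ray Data chooses the effective number of read tasks within ``1...override_num_blocks`` based on cluster resources, so this sketch only shows the block-size arithmetic, not the scheduling.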

doc/source/data/saving-data.rst

Lines changed: 1 addition & 1 deletion

@@ -83,7 +83,7 @@ the appropriate scheme. URI can point to buckets or folders.
        filesystem = gcsfs.GCSFileSystem(project="my-google-project")
        ds.write_parquet("gcs://my-bucket/my-folder", filesystem=filesystem)
 
-    .. tab-item:: ABL
+    .. tab-item:: ABS
 
        To save data to Azure Blob Storage, install the
        `Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage <https://pypi.org/project/adlfs/>`_
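The hunk's context line mentions saving with "the appropriate scheme": the URI's scheme (``s3://``, ``gcs://``, and so on) determines which filesystem implementation handles the write. A simplified sketch of that dispatch idea (a hypothetical lookup table, not Ray Data's or fsspec's actual resolver; the class names here are just labels):

```python
from urllib.parse import urlparse

# Hypothetical scheme-to-filesystem table, loosely mirroring
# fsspec-style resolution. "AzureBlobFileSystem" is what the
# adlfs package mentioned above provides.
SCHEME_TO_FILESYSTEM = {
    "s3": "S3FileSystem",
    "gcs": "GCSFileSystem",
    "az": "AzureBlobFileSystem",
}

def resolve_filesystem(uri: str) -> str:
    """Return the filesystem name registered for the URI's scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in SCHEME_TO_FILESYSTEM:
        raise ValueError(f"No filesystem registered for scheme {scheme!r}")
    return SCHEME_TO_FILESYSTEM[scheme]

print(resolve_filesystem("gcs://my-bucket/my-folder"))  # GCSFileSystem
```

In the real docs flow, you construct the filesystem object yourself (for example ``gcsfs.GCSFileSystem``) and pass it to ``ds.write_parquet`` along with the URI, as the context lines in the diff show.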
