Ray Data calculates statistics during execution for each operator, such as wall clock time and memory usage.
To view stats about your :class:`Datasets <ray.data.Dataset>`, call :meth:`Dataset.stats() <ray.data.Dataset.stats>` on an executed dataset. The stats are also persisted under ``/tmp/ray/session_*/logs/ray-data.log``.
For more on how to read this output, see :ref:`Monitoring Your Workload with the Ray Data Dashboard <monitoring-your-workload>`.
        petal.width  double
        variety      string

    .. tab-item:: ABS

        To read files from Azure Blob Storage, install the
        `Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage <https://pypi.org/project/adlfs/>`_
Ray Data interoperates with distributed data processing frameworks like
:ref:`Dask <dask-on-ray>`, :ref:`Spark <spark-on-ray>`, :ref:`Modin <modin-on-ray>`, and
:ref:`Mars <mars-on-ray>`.

.. note::

    The Ray Community provides these operations but may not actively maintain them. If you run
    into issues, create a GitHub issue `here <https://github.com/ray-project/ray/issues>`__.

.. tab-set::

    .. tab-item:: Dask
Loading data from ML libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ray Data interoperates with HuggingFace, PyTorch, and TensorFlow datasets.

.. tab-set::

    .. tab-item:: HuggingFace

        To convert a HuggingFace Dataset to a Ray Dataset, call
        :func:`~ray.data.from_huggingface`. This function accesses the underlying Arrow
        table and converts it to a Dataset directly.
        .. warning::

            :class:`~ray.data.from_huggingface` only supports parallel reads in certain
            instances, namely for untransformed public HuggingFace Datasets. For those datasets,
            Ray Data uses `hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
            to perform a distributed read; otherwise, Ray Data uses a single node read.
            This behavior shouldn't be an issue with in-memory HuggingFace Datasets, but may cause a failure with
            large memory-mapped HuggingFace Datasets. Additionally, HuggingFace `DatasetDict <https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.DatasetDict>`_ and
            ``IterableDataset`` objects aren't supported.

    .. tab-item:: TensorFlow

        To convert a TensorFlow dataset to a Ray Dataset, call :func:`~ray.data.from_tf`.
            query="SELECT title, score FROM movie WHERE year >= 1980",
        )

    .. tab-item:: BigQuery

        To read from BigQuery, install the
        `Python Client for Google BigQuery <https://cloud.google.com/python/docs/reference/bigquery/latest>`_ and the `Python Client for Google BigQueryStorage <https://cloud.google.com/python/docs/reference/bigquerystorage/latest>`_.

        .. code-block:: console

            pip install google-cloud-bigquery
            pip install google-cloud-bigquery-storage

        To read data from BigQuery, call :func:`~ray.data.read_bigquery` and specify the project id, dataset, and query (if applicable).

        .. testcode::
            :skipif: True

            import ray

            # Read the entire dataset. Do not specify query.
            ds = ray.data.read_bigquery(
                project_id="my_gcloud_project_id",
                dataset="bigquery-public-data.ml_datasets.iris",
            )

            # Read from a SQL query of the dataset. Do not specify dataset.
            ds = ray.data.read_bigquery(
                project_id="my_gcloud_project_id",
                query="SELECT * FROM `bigquery-public-data.ml_datasets.iris` LIMIT 50",
            )

            # Write back to BigQuery
            ds.write_bigquery(
                project_id="my_gcloud_project_id",
                dataset="destination_dataset.destination_table",
                overwrite_table=True,
            )

.. _reading_mongodb:
Loading other datasources
~~~~~~~~~~~~~~~~~~~~~~~~~

If Ray Data can't load your data, subclass
:class:`~ray.data.Datasource`. Then, construct an instance of your custom
datasource and pass it to :func:`~ray.data.read_datasource`. To write results, you might
also need to subclass :class:`ray.data.Datasink`. Then, create an instance of your custom
datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details, see
:ref:`Advanced: Read and Write Custom File Types <custom_datasource>`.