[Data] [Docs] Update Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages#44093
doc/source/data/loading-data.rst
| will be used to perform a distributed read, otherwise a single node read will be used. | |
| will be used to perform a distributed read; otherwise, a single node read will be used. |
doc/source/data/loading-data.rst
Can we include links to the HF docs for DatasetDict and IterableDatasetDict?
doc/source/data/loading-data.rst
| large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` or ``IteraableDatasetDict`` | |
| large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` and ``IterableDatasetDict`` |
doc/source/data/loading-data.rst
| To convert a PyTorch dataset to a Ray Dataset, call ::func:`~ray.data.from_torch`. | |
| To convert a PyTorch dataset to a Ray Dataset, call :func:`~ray.data.from_torch`. |
doc/source/data/loading-data.rst
| datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details see the guide | |
| datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details, see |
| The user defined function passed to :meth:`~ray.data.Dataset.map_batches` is more flexible. As atches | |
| The user defined function passed to :meth:`~ray.data.Dataset.map_batches` is more flexible. As batches |
| can be represented in multiple ways (more on this in :ref:`Configuring batch format <configure_batch_format>`), so the function should be of type | |
| ``Callable[DataBatch, DataBatch]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In | |
| can be represented in multiple ways (more on this in :ref:`Configuring batch format <configure_batch_format>`), the function should be of type | |
| ``Callable[[DataBatch], DataBatch]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In |
| other words your function should input and output a batch of data which can be represented as a | |
| pandas dataframe or a dictionary with string keys and NumPy ndarrays values. Your function does not need | |
| other words, your function should take as input and output a batch of data which can be represented as a | |
| pandas DataFrame or a dictionary with string keys and NumPy ndarrays values. Your function does not need |
| to return a batch in the same format as it is input, so you could input a pandas dataframe and output a | |
| dictionary of NumPy ndarrays. For example your function might look like: | |
| to return a batch in the same format as its input, so you could input a pandas DataFrame and output a | |
| dictionary of NumPy ndarrays. For example, your function might look like: |
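A minimal sketch of such a function, assuming the dictionary-of-NumPy-arrays batch format. The function name `add_squared` and the column names `"id"` and `"id_squared"` are hypothetical and don't come from the PR. In Ray Data the function would be passed to `Dataset.map_batches`; here it's called directly on a hand-built batch so the example runs without Ray installed.

```python
from typing import Dict

import numpy as np


def add_squared(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """Return the batch with an extra column holding the square of "id"."""
    # Shallow-copy the mapping so the caller's batch isn't mutated.
    out = dict(batch)
    out["id_squared"] = batch["id"] ** 2
    return out


# Hand-built batch standing in for what map_batches would pass in.
batch = {"id": np.arange(4)}
result = add_squared(batch)
print(result["id_squared"])
```

The input and output here share a format, but per the text above the return value could just as well be a pandas DataFrame.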
| The user defined function can also return an iterator that yields batches, so the function can also | |
| be of type ``Callable[DataBatch, Iterator[[DataBatch]]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. | |
| In this case your function would look like: | |
| The user defined function can also be a Python generator that yields batches, so the function can also | |
| be of type ``Callable[[DataBatch], Iterator[DataBatch]]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. | |
| In this case, your function would look like: |
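A hedged sketch of the generator form, again assuming dictionary-of-NumPy-arrays batches. The name `split_in_chunks`, its `chunk_size` parameter, and the `"id"` column are illustrative assumptions, not taken from the PR. The generator is consumed directly here; in Ray Data it would be handed to `Dataset.map_batches`.

```python
from typing import Dict, Iterator

import numpy as np


def split_in_chunks(
    batch: Dict[str, np.ndarray], chunk_size: int = 2
) -> Iterator[Dict[str, np.ndarray]]:
    """Yield the input batch as several smaller batches."""
    # Every column in a batch has the same number of rows;
    # an empty batch yields nothing.
    num_rows = len(next(iter(batch.values()), ()))
    for start in range(0, num_rows, chunk_size):
        yield {
            name: column[start:start + chunk_size]
            for name, column in batch.items()
        }


# Consume the generator on a hand-built five-row batch.
chunks = list(split_in_chunks({"id": np.arange(5)}))
print([chunk["id"].tolist() for chunk in chunks])
```

Yielding smaller batches one at a time keeps only one output chunk in memory at once, which is the usual reason to prefer a generator when a function expands its input.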
Signed-off-by: Matthew Owen <mowen@anyscale.com>
bb0802a to e55ca12
angelinalg left a comment
Just some nits. Excuse any mangling in the suggestions when I tried to change passive voice to active voice. Please correct as needed. Very nice job overall. Consider using Vale to catch some of these copy edits I made. (go/vale)
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu>
| .. tab-item:: ABS | ||
|
|
||
| To save data to Azure Blob Storage, install the | ||
| `Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage <https://pypi.org/project/adlfs/>`_ |
Same as for read. Also, add a tip on how to tune configs for write failure retries.
Discussed offline, will add the tip on configs later.
3fb69d0 to 94ac4b4
This breaks a data doc test. I'm putting up a revert to double-check (https://buildkite.com/ray-project/postmerge/builds/3645).
#44093 broke one of the data doc tests that was not run on premerge (example here: https://buildkite.com/ray-project/postmerge/builds/3645). This adds missing imports to fix that. Signed-off-by: Matthew Owen <mowen@anyscale.com>
…r, and Saving Data pages (ray-project#44093) --------- Signed-off-by: Matthew Owen <mowen@anyscale.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
ray-project#44093 broke one of the data doc tests that was not run on premerge (example here: https://buildkite.com/ray-project/postmerge/builds/3645). This adds missing imports to fix that. Signed-off-by: Matthew Owen <mowen@anyscale.com>
…r, and Saving Data pages (#44093) (#44221)

Docs only cherry pick for release.

Note: this cherry-pick includes four commits which are all related to changing the Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages. They are rolled together to reduce cherry-picking overhead and are all part of the logical update to these pages.

The PRs included in this cherry-pick:
- Main overhaul of listed pages
- Two fixes to doc tests that were broken by the above (fix 1, fix 2)
- Additional small change to explain how to use credentials that was added after initial merge of main overhaul

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com> Signed-off-by: Matthew Owen <omatthew98@berkeley.edu> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Why are these changes needed?
This PR updates the Ray Data documentation for the Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages, as discussed offline.