To remove the overloaded semantics of `concurrency`, we will un-deprecate the
`compute` parameter, deprecate the `concurrency` parameter, and ask users to
pass `ActorPoolStrategy` and `TaskPoolStrategy` directly, which makes the
dispatch logic straightforward.
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
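The overloaded semantics being removed can be modeled without Ray at all. The sketch below is hypothetical: `TaskPoolStrategy`, `ActorPoolStrategy`, and `resolve_legacy_concurrency` are illustrative stand-ins mirroring only the parameters named in this PR, not the real `ray.data` classes or their internals.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


# Illustrative stand-ins for ray.data.TaskPoolStrategy / ActorPoolStrategy;
# they model only the parameters discussed in this PR.
@dataclass
class TaskPoolStrategy:
    size: Optional[int] = None  # None: as many tasks as resources allow


@dataclass
class ActorPoolStrategy:
    min_size: Optional[int] = None
    max_size: Optional[int] = None
    initial_size: Optional[int] = None


def resolve_legacy_concurrency(
    fn_is_class: bool,
    concurrency: Union[None, int, Tuple[int, ...]],
):
    """Model of the overloaded ``concurrency`` semantics this PR deprecates:
    the same argument meant a task cap for functions but an actor pool for
    classes, so its meaning flipped based on the type of ``fn``."""
    if not fn_is_class:
        if concurrency is None:
            return TaskPoolStrategy()              # implicit concurrency
        return TaskPoolStrategy(size=concurrency)  # at most n tasks
    if concurrency is None:
        raise ValueError("`concurrency` must be set when fn is a class")
    if isinstance(concurrency, int):
        # int meant a *fixed* actor pool when fn was a class
        return ActorPoolStrategy(min_size=concurrency, max_size=concurrency)
    if len(concurrency) == 2:
        m, n = concurrency
        return ActorPoolStrategy(min_size=m, max_size=n)
    m, n, initial = concurrency
    return ActorPoolStrategy(min_size=m, max_size=n, initial_size=initial)
```

Because the meaning of an int flips on the type of `fn`, the PR instead asks callers to construct the strategy object themselves, making the intent explicit at the call site.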
doc/source/data/transforming-data.rst (10 additions, 5 deletions)
@@ -259,9 +259,11 @@ To transform data with a Python class, complete these steps:
 1. Implement a class. Perform setup in ``__init__`` and transform data in ``__call__``.

 2. Call :meth:`~ray.data.Dataset.map_batches`, :meth:`~ray.data.Dataset.map`, or
-   :meth:`~ray.data.Dataset.flat_map`. Pass the number of concurrent workers to use with the ``concurrency`` argument. Each worker transforms a partition of data in parallel.
-   Fixing the number of concurrent workers gives the most predictable performance, but you can also pass a tuple of ``(min, max)`` to allow Ray Data to automatically
-   scale the number of concurrent workers.
+   :meth:`~ray.data.Dataset.flat_map`. Pass a compute strategy with the ``compute``
+   argument to control how many workers Ray uses. Each worker transforms a partition
+   of data in parallel. Use ``ray.data.TaskPoolStrategy(size=n)`` to cap the number of
+   concurrent tasks, or ``ray.data.ActorPoolStrategy(...)`` to run callable classes on
+   a fixed or autoscaling actor pool.

 .. tab-set::
@@ -288,7 +290,10 @@ To transform data with a Python class, complete these steps:
 ds = (
     ray.data.from_numpy(np.ones((32, 100)))
-    .map_batches(TorchPredictor, concurrency=2)
+    .map_batches(
+        TorchPredictor,
+        compute=ray.data.ActorPoolStrategy(size=2),
+    )
 )

 .. testcode::
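The "at most ``n`` concurrent tasks" contract that ``TaskPoolStrategy(size=n)`` documents can be illustrated without Ray. The sketch below uses a thread pool as a stand-in for Ray's task scheduler; `map_with_task_cap` is a hypothetical helper, not part of the Ray API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def map_with_task_cap(fn, items, size):
    """Apply fn over items with at most `size` concurrent workers,
    mimicking the 'at most n concurrent tasks' cap. Returns the
    results plus the peak observed concurrency."""
    peak = 0
    active = 0
    lock = threading.Lock()

    def tracked(x):
        nonlocal peak, active
        with lock:
            active += 1
            peak = max(peak, active)
        try:
            return fn(x)
        finally:
            with lock:
                active -= 1

    # max_workers enforces the cap, like TaskPoolStrategy(size=n) would.
    with ThreadPoolExecutor(max_workers=size) as pool:
        results = list(pool.map(tracked, items))
    return results, peak


results, peak = map_with_task_cap(lambda x: x * 2, range(100), size=4)
```

The cap bounds concurrency from above only; fewer than `size` workers may be active when there is little work or few input partitions.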
@@ -322,7 +327,7 @@ To transform data with a Python class, complete these steps:
     fn: The function to apply to each row, or a class type
         that can be instantiated to create such a callable.
-    compute: This argument is deprecated. Use ``concurrency`` argument.
+    compute: The compute strategy to use for the map operation.
+
+        * If ``compute`` is not specified, Ray Data uses ``ray.data.TaskPoolStrategy()`` to launch concurrent tasks based on the available resources and the number of input blocks.
+
+        * Use ``ray.data.TaskPoolStrategy(size=n)`` to launch at most ``n`` concurrent Ray tasks.
+
+        * Use ``ray.data.ActorPoolStrategy(size=n)`` to use a fixed-size actor pool of ``n`` workers.
+
+        * Use ``ray.data.ActorPoolStrategy(min_size=m, max_size=n)`` to use an autoscaling actor pool from ``m`` to ``n`` workers.
+
+        * Use ``ray.data.ActorPoolStrategy(min_size=m, max_size=n, initial_size=initial)`` to use an autoscaling actor pool from ``m`` to ``n`` workers, with an initial size of ``initial``.
+
     fn_args: Positional arguments to pass to ``fn`` after the first argument.
         These arguments are top-level arguments to the underlying Ray task.
     fn_kwargs: Keyword arguments to pass to ``fn``. These arguments are
     The actual size of the batch provided to ``fn`` may be smaller than
         ``batch_size`` if ``batch_size`` doesn't evenly divide the block(s) sent
         to a given map task. Default ``batch_size`` is ``None``.
-    compute: This argument is deprecated. Use ``concurrency`` argument.
+    compute: The compute strategy to use for the map operation.
+
+        * If ``compute`` is not specified, Ray Data uses ``ray.data.TaskPoolStrategy()`` to launch concurrent tasks based on the available resources and the number of input blocks.
+
+        * Use ``ray.data.TaskPoolStrategy(size=n)`` to launch at most ``n`` concurrent Ray tasks.
+
+        * Use ``ray.data.ActorPoolStrategy(size=n)`` to use a fixed-size actor pool of ``n`` workers.
+
+        * Use ``ray.data.ActorPoolStrategy(min_size=m, max_size=n)`` to use an autoscaling actor pool from ``m`` to ``n`` workers.
+
+        * Use ``ray.data.ActorPoolStrategy(min_size=m, max_size=n, initial_size=initial)`` to use an autoscaling actor pool from ``m`` to ``n`` workers, with an initial size of ``initial``.
+
     batch_format: If ``"default"`` or ``"numpy"``, batches are
         ``Dict[str, numpy.ndarray]``. If ``"pandas"``, batches are
         ``pandas.DataFrame``. If ``"pyarrow"``, batches are
     fn: The function or generator to apply to each record, or a class type
         that can be instantiated to create such a callable.
-    compute: This argument is deprecated. Use ``concurrency`` argument.
+    compute: The compute strategy to use for the map operation.
+
+        * If ``compute`` is not specified, Ray Data uses ``ray.data.TaskPoolStrategy()`` to launch concurrent tasks based on the available resources and the number of input blocks.
+
+        * Use ``ray.data.TaskPoolStrategy(size=n)`` to launch at most ``n`` concurrent Ray tasks.
+
+        * Use ``ray.data.ActorPoolStrategy(size=n)`` to use a fixed-size actor pool of ``n`` workers.
+
+        * Use ``ray.data.ActorPoolStrategy(min_size=m, max_size=n)`` to use an autoscaling actor pool from ``m`` to ``n`` workers.
+
+        * Use ``ray.data.ActorPoolStrategy(min_size=m, max_size=n, initial_size=initial)`` to use an autoscaling actor pool from ``m`` to ``n`` workers, with an initial size of ``initial``.
+
     fn_args: Positional arguments to pass to ``fn`` after the first argument.
         These arguments are top-level arguments to the underlying Ray task.
     fn_kwargs: Keyword arguments to pass to ``fn``. These arguments are
     example, specify `num_gpus=1` to request 1 GPU for each parallel map
         worker.
     memory: The heap memory in bytes to reserve for each parallel map worker.
-    concurrency: The semantics of this argument depend on the type of ``fn``:
-
-        * If ``fn`` is a function and ``concurrency`` isn't set (default), the
-          actual concurrency is implicitly determined by the available
-          resources and number of input blocks.
-
-        * If ``fn`` is a function and ``concurrency`` is an int ``n``, Ray Data
-          launches *at most* ``n`` concurrent tasks.
-
-        * If ``fn`` is a class and ``concurrency`` is an int ``n``, Ray Data
-          uses an actor pool with *exactly* ``n`` workers.
-
-        * If ``fn`` is a class and ``concurrency`` is a tuple ``(m, n)``, Ray
-          Data uses an autoscaling actor pool from ``m`` to ``n`` workers.
-
-        * If ``fn`` is a class and ``concurrency`` is a tuple ``(m, n, initial)``, Ray
-          Data uses an autoscaling actor pool from ``m`` to ``n`` workers, with an initial size of ``initial``.
-
-        * If ``fn`` is a class and ``concurrency`` isn't set (default), this
-          method raises an error.
-
+    concurrency: This argument is deprecated. Use the ``compute`` argument instead.
     ray_remote_args_fn: A function that returns a dictionary of remote args
         passed to each map worker. The purpose of this argument is to generate
         dynamic arguments for each actor/task, and will be called each time
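The ``(min_size, max_size, initial_size)`` parameters described in the new docstring bullets imply ordering constraints between the three values. The check below is a hypothetical sketch of those constraints; the real `ActorPoolStrategy` may validate differently or use different defaults.

```python
def validate_actor_pool(min_size, max_size, initial_size=None):
    """Check the autoscaling bounds implied by the docstring bullets:
    1 <= min_size <= initial_size <= max_size. Illustrative only."""
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    if max_size < min_size:
        raise ValueError("max_size must be >= min_size")
    if initial_size is None:
        initial_size = min_size  # assumed default for this sketch
    if not (min_size <= initial_size <= max_size):
        raise ValueError("initial_size must lie within [min_size, max_size]")
    return initial_size
```

A fixed-size pool (``ActorPoolStrategy(size=n)``) is then just the degenerate case ``min_size == max_size == n``.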
@@ -1451,33 +1424,24 @@ def filter(
     fn_constructor_kwargs: Keyword arguments to pass to ``fn``'s constructor.
         This can only be provided if ``fn`` is a callable class. These arguments
         are top-level arguments in the underlying Ray actor construction task.
-    compute: This argument is deprecated. Use ``concurrency`` argument.
-    num_cpus: The number of CPUs to reserve for each parallel map worker.
-    num_gpus: The number of GPUs to reserve for each parallel map worker. For
-        example, specify `num_gpus=1` to request 1 GPU for each parallel map
-        worker.
-    memory: The heap memory in bytes to reserve for each parallel map worker.
-    concurrency: The semantics of this argument depend on the type of ``fn``:
+    compute: The compute strategy to use for the map operation.

-        * If ``fn`` is a function and ``concurrency`` isn't set (default), the
-          actual concurrency is implicitly determined by the available
-          resources and number of input blocks.
+        * If ``compute`` is not specified, Ray Data uses ``ray.data.TaskPoolStrategy()`` to launch concurrent tasks based on the available resources and the number of input blocks.

-        * If ``fn`` is a function and ``concurrency`` is an int ``n``, Ray Data
-          launches *at most* ``n`` concurrent tasks.
+        * Use ``ray.data.TaskPoolStrategy(size=n)`` to launch at most ``n`` concurrent Ray tasks.

-        * If ``fn`` is a class and ``concurrency`` is an int ``n``, Ray Data
-          uses an actor pool with *exactly* ``n`` workers.
+        * Use ``ray.data.ActorPoolStrategy(size=n)`` to use a fixed-size actor pool of ``n`` workers.

-        * If ``fn`` is a class and ``concurrency`` is a tuple ``(m, n)``, Ray
-          Data uses an autoscaling actor pool from ``m`` to ``n`` workers.
+        * Use ``ray.data.ActorPoolStrategy(min_size=m, max_size=n)`` to use an autoscaling actor pool from ``m`` to ``n`` workers.

-        * If ``fn`` is a class and ``concurrency`` is a tuple ``(m, n, initial)``, Ray
-          Data uses an autoscaling actor pool from ``m`` to ``n`` workers, with an initial size of ``initial``.
-
-        * If ``fn`` is a class and ``concurrency`` isn't set (default), this
-          method raises an error.
+        * Use ``ray.data.ActorPoolStrategy(min_size=m, max_size=n, initial_size=initial)`` to use an autoscaling actor pool from ``m`` to ``n`` workers, with an initial size of ``initial``.

+    num_cpus: The number of CPUs to reserve for each parallel map worker.
+    num_gpus: The number of GPUs to reserve for each parallel map worker. For
+        example, specify `num_gpus=1` to request 1 GPU for each parallel map
+        worker.
+    memory: The heap memory in bytes to reserve for each parallel map worker.
+    concurrency: This argument is deprecated. Use the ``compute`` argument instead.
     ray_remote_args_fn: A function that returns a dictionary of remote args
         passed to each map worker. The purpose of this argument is to generate
         dynamic arguments for each actor/task, and will be called each time