Fuse pipelines with different numbers of tasks #284

@TomNicholas

Last week @tomwhite and @dcherian and I discussed possible future optimizations for Cubed - this is my attempt to elucidate what I was suggesting.

Motivation

The best-case scenario for a Cubed computation is that all sequential operations get fused, because then no writing to intermediate stores is required. With no intermediate writes, every chunk moves through the whole calculation in parallel, even though multiple operations are applied to it along the way. In general we can't guarantee zero intermediate stores, because we also want to guarantee predictable memory usage during a full shuffle, but we might nevertheless aspire to fuse everything else 😁

Idea

Currently Cubed's optimization pass fuses some blockwise operations together, but it can only fuse blockwise operations that have the same number of tasks. If we could find a way to fuse blockwise operations with different numbers of tasks then potentially anything up to a full shuffle (see #282) could be fused.

Use cases

It's possible to construct cubed plans in which blockwise operations with different numbers of tasks occur sequentially.

import cubed
from cubed.core.plan import visit_nodes

def print_num_tasks_per_pipeline(plan: cubed.core.Plan, optimize_graph: bool = False):
    """Print the number of tasks needed to execute each pipeline in this plan."""
    dag = plan.optimize().dag if optimize_graph else plan.dag.copy()
    for _, node in visit_nodes(dag, resume=None):
        print(f"{node['name']}: op_name = {node['op_name']}, num_tasks = {node['pipeline'].num_tasks}")

This can happen with concat:

import cubed.array_api as xp

a = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], chunks=(2, 2))
b = xp.asarray([[1, 1, 1], [1, 1, 1], [1, 1, 1]], chunks=(2, 2))
c = xp.concat([a + 1, b])
c.visualize()

[visualized plan for c]

(I don't really understand what all the side inputs are in these graphs - I hope they don't invalidate what I'm suggesting!)

print_num_tasks_per_pipeline(c.plan)
array-010: op_name = blockwise, num_tasks = 4
array-013: op_name = blockwise, num_tasks = 6

(4 tasks for the 2x2 chunk grid of a + 1; 6 tasks for the 3x2 chunk grid of the 6x3 concatenated result - so these two blockwise pipelines can't currently be fused.)

Or matmul:

import cubed
import cubed.random

spec = cubed.Spec(allowed_mem=2_000_000_000)

a = cubed.random.random(
    (50000, 50000), chunks=(5000, 5000), spec=spec
)  # 200MB chunks
b = cubed.random.random(
    (50000, 50000), chunks=(5000, 5000), spec=spec
)  # 200MB chunks
c = xp.astype(a, xp.float32)
d = xp.astype(b, xp.float32)
e = xp.matmul(c, d)
e.visualize(optimize_graph=True)

[visualized (optimized) plan for e]

print_num_tasks_per_pipeline(e.plan, optimize_graph=True)
array-139: op_name = blockwise, num_tasks = 100
array-140: op_name = blockwise, num_tasks = 100
array-141: op_name = blockwise, num_tasks = 1000
array-145: op_name = blockwise, num_tasks = 300
array-150: op_name = blockwise, num_tasks = 100

Implementation ideas

By definition, 1 task == processing one Cubed chunk, but Cubed also currently assumes that 1 Zarr chunk == 1 Cubed chunk. This is generally what sets the number of tasks in a stage, and hence which pipelines can be fused. To fuse other pipelines we have to generalize this relationship. We can't open multiple Cubed chunks per Zarr chunk, because having different tasks read from and write to different parts of the same Zarr chunk would sacrifice idempotence.

However, we could imagine opening multiple Zarr chunks for one Cubed chunk (as long as the total size of the Zarr chunks opened for one Cubed chunk is < allowed_mem). This would make the number of tasks for a pipeline choosable (within some range), and we could then choose how many Zarr chunks to open so that the number of tasks matches between two consecutive pipelines. A minimal sketch of choosing this batch factor is given below.
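
For concreteness, here is a minimal sketch of how such a batch factor might be chosen. Both functions are hypothetical illustrations, not Cubed's API, and the memory accounting is deliberately simplistic (Cubed's real model involves projected memory, reserved memory, compression buffers, etc.):

def max_batch_factor(chunk_nbytes: int, allowed_mem: int, n_inputs: int = 1) -> int:
    """Hypothetical upper bound on how many Zarr chunks one task could open,
    assuming (simplistically) that each task holds n_inputs input buffers
    plus one output buffer per Zarr chunk it processes."""
    per_chunk = (n_inputs + 1) * chunk_nbytes
    return max(1, allowed_mem // per_chunk)

def batch_factor_to_match(num_tasks_a: int, num_tasks_b: int) -> int | None:
    """How many chunks each task of the smaller pipeline would need to open
    for the task counts of two consecutive pipelines to match, or None if
    the counts don't divide evenly."""
    bigger, smaller = max(num_tasks_a, num_tasks_b), min(num_tasks_a, num_tasks_b)
    return bigger // smaller if bigger % smaller == 0 else None

Note that in the concat example above the counts (4 and 6) don't divide evenly, so choosing a batch factor isn't always straightforward.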

Another way to think about this: if at some point during your computation you have smaller chunks than your allowed_mem budget was sized for, then since you still only load one chunk per container, you are potentially "wasting" all the extra RAM overhead you requested. Opening more chunks per container uses that spare RAM in some cases, and if all the extra chunks you need to get from one pipeline to another fit within the budget, you could now just fuse those two pipelines. A rough numeric illustration follows.
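
As a rough illustration, using the numbers from the matmul example and the simplistic accounting from the sketch above:

allowed_mem = 2_000_000_000  # the 2 GB budget from the matmul example
chunk_nbytes = 200_000_000   # 200 MB chunks

# With one input buffer and one output buffer per chunk (a big
# simplification), each task could afford to open roughly:
print(max_batch_factor(chunk_nbytes, allowed_mem))  # -> 5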

Questions

  1. Does this make any sense?
  2. Is there actually a realistic use case for this?
  3. Is this different from what's suggested in Fuse connected blockwise subgraphs #136? I tried recreating that example and noticed that most of the arrays in that subgraph have the same number of tasks, but wasn't sure if that was a coincidence.
  4. What would the actual fusion look like now? Calling each operation on a larger array (1 Cubed chunk, corresponding to multiple Zarr chunks), then doing blockwise fusion as normal? (See the hand-wavy sketch after this list.)
  5. How does this scale? It's not much use if we can only fuse by having allowed_mem >> chunksize. But if "batches" of chunks can be submitted per task then it might work?
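
To make question 4 concrete, here is a hand-wavy sketch of what a fused task might do if one Cubed chunk spanned several Zarr chunks. fused_task, op_a and op_b are hypothetical, not Cubed's API, and the sketch assumes the Zarr chunks tile along a single axis:

import numpy as np

def fused_task(blocks, op_a, op_b):
    """Hypothetical fused task over one Cubed chunk spanning several Zarr chunks.

    blocks: the numpy arrays read from the underlying Zarr chunks.
    op_a, op_b: two blockwise functions that previously ran as separate
    pipelines with an intermediate store between them.
    """
    # Apply both operations to each piece in memory -- nothing is written
    # to storage between op_a and op_b.
    processed = [op_b(op_a(b)) for b in blocks]
    # Assemble the single, larger output block this task writes back.
    return np.concatenate(processed)

# Toy usage: two 2x3 "Zarr chunks" forming one 4x3 "Cubed chunk".
blocks = [np.ones((2, 3)), np.zeros((2, 3))]
out = fused_task(blocks, lambda x: x + 1, lambda x: x * 2)
print(out.shape)  # (4, 3)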
