1 change: 1 addition & 0 deletions docs/index.md
@@ -113,6 +113,7 @@ However, once you have too much data to fit into memory, for whatever reason, th
:maxdepth: 1

api.md
zarr-configuration.md
changelog.md
contributing.md
references.md
46 changes: 46 additions & 0 deletions docs/zarr-configuration.md
@@ -0,0 +1,46 @@
# Zarr Configuration

If you are using a local file system, use {doc}`zarrs-python <zarrs:index>`:

```python
import zarr
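# Route chunk encoding/decoding through the Rust-based zarrs codec pipeline.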
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})
```

Otherwise, use {mod}`zarr` normally without {doc}`zarrs-python <zarrs:index>`, which does not support, for example, remote stores.

## `zarrs` Performance

Please see the {doc}`zarrs-python <zarrs:index>` docs for more information, but there are two important settings to consider:

```python
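# The two settings discussed below: the shared thread pool size and
# whether reads bypass the page cache.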
zarr.config.set({
"threading.max_workers": None,
"codec_pipeline": {
"direct_io": False
}
})
```

**Collaborator:** Can we run a quick benchmark to see to what extent this affects performance?

**Collaborator (author):** I added a gist that highlights the issue; I don't think we can do much more than make people aware of the problem. As I said, `direct_io` should not harm performance.

The `threading.max_workers` setting controls how many threads `zarrs` uses, and by extension, our data loader.
This parameter is global and controls both the Rust parallelism and the Python parallelism.
If you notice thrashing or similar thread-oversubscription behavior, please open an issue.
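
Purely as an illustration (the fraction of cores below is an assumption, not a recommendation from this guide), the shared pool could be capped explicitly if you suspect oversubscription:

```python
import os

import zarr

# Illustrative cap: use roughly half of the available cores for the shared
# zarrs/data-loader thread pool (adjust to your workload).
zarr.config.set({"threading.max_workers": max(1, (os.cpu_count() or 2) // 2)})
```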

Some **Linux** file systems' [performance may suffer][] from the high level of parallelism combined with a full page cache in RAM.
To bypass the page cache, use `direct_io`; it should not otherwise affect performance.
If this setting is enabled on a system that does not support `direct_io`, file reading falls back to normal buffered I/O.
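
For example, on a local Linux file system the two settings might be combined as follows; this is only a sketch, and whether `direct_io` helps depends on your file system:

```python
import zarr

# Illustrative settings for a local Linux file system (not a recommendation):
# keep the default thread count and bypass the page cache via direct I/O;
# reads fall back to buffered I/O if the file system does not support it.
zarr.config.set({
    "threading.max_workers": None,
    "codec_pipeline": {
        "path": "zarrs.ZarrsCodecPipeline",
        "direct_io": True,
    },
})
```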

## `zarr-python` Performance

In this case, the store of interest is likely in the cloud.
Please see zarr-python's {doc}`zarr:user-guide/config` for more information; aside from the above-mentioned `threading.max_workers`, the setting most likely to be of interest is `async.concurrency` (shown in the snippet below).

Comment on lines +35 to +36

**Collaborator:** We should quickly check the performance without zarrs. Last time I checked, you need much bigger chunk sizes without zarrs. This will probably be the case as well if you work with a store in the cloud (much higher latency, so a bigger package size could be beneficial). Just as some guidelines for the user.

**Collaborator (author):** How do you want to proceed here? Create new benchmarks? Can we make an issue for this? I don't think this guide is making any recommendations; it's more just so users have the information.

```python
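# Maximum number of concurrent async operations; 64 is the default.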
zarr.config.set({"async.concurrency": 64})
```

It is 64 by default.
See the [zarr page on concurrency][] for more information.
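
Purely as an illustrative sketch (the value 128 is an assumption, not a benchmarked recommendation), the concurrency could be raised when reading from a high-latency remote store:

```python
import zarr

# Illustrative value only: allow more in-flight requests against a
# high-latency remote store; the default is 64.
zarr.config.set({"async.concurrency": 128})
```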

[performance may suffer]: https://gist.github.com/ilan-gold/705bd36329b0e19542286385b09b421b
[zarr page on concurrency]: https://zarr.readthedocs.io/en/latest/user-guide/consolidated_metadata/#synchronization-and-concurrency