Extension proposal: multiscale arrays v0.1 #50

@joshmoore

This issue has been migrated to an image.sc topic after the 2020-05-06 community discussion. Authors are still encouraged to make use of the specification in their own libraries. As the v3 extension mechanism matures, the specification will be updated and registered as appropriate. Feedback and change requests are welcome either on this repository or on image.sc.


As a first draft of support for the multiscale use-case (#23), this issue proposes an intermediate nomenclature for describing groups of Zarr arrays which are scaled-down versions of one another, e.g.:

example/
├── 0    # Full-sized array
├── 1    # Scaled down 0, e.g. 0.5; for images, in the X&Y dimensions
├── 2    # Scaled down 1, ...
├── 3    # Scaled down 2, ...
└── 4    # Etc.

This layout was independently developed in a number of implementations and has since been implemented in others, including:

Using a common metadata representation across implementations:

  1. fosters a common vocabulary between existing implementations
  2. enables other implementations to reliably detect multiscale arrays
  3. permits the upgrade of v0.1 arrays to future versions of this or other extensions
  4. tests this extension for limitations against multiple use cases

A basic example of the metadata that is added to the containing Zarr group is seen here:

{
  "multiscales": [
    {
      "datasets": [
          {"path": "0"},
          {"path": "1"},
          {"path": "2"},
          {"path": "3"},
          {"path": "4"}
        ],
      "version": "0.1"
    }
     // See the detailed example below for optional metadata
  ]
}
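
As a rough illustration of how a reader might detect such a block, here is a minimal sketch that works on the group attributes as a plain dictionary (for example, the parsed contents of .zattrs). The function name `multiscale_paths` is hypothetical and not part of any library; skipping entries without "datasets" is an assumption based on the MUST rule in the detailed example.

```python
import json


def multiscale_paths(group_attrs):
    """Return one list of dataset paths per detectable multiscale series.

    `group_attrs` is the group's attribute dictionary. Entries without a
    "datasets" list are skipped, since they cannot be detected.
    """
    series = []
    for entry in group_attrs.get("multiscales", []):
        datasets = entry.get("datasets")
        if not datasets:
            continue
        series.append([d["path"] for d in datasets])
    return series


# Illustrative attributes, shortened from the basic example.
attrs = json.loads("""
{
  "multiscales": [
    {
      "datasets": [{"path": "0"}, {"path": "1"}, {"path": "2"}],
      "version": "0.1"
    }
  ]
}
""")

paths = multiscale_paths(attrs)
print(paths)  # [['0', '1', '2']]
```

A group without a "multiscales" key simply yields an empty list, which lets callers fall back to treating the group as ordinary arrays.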

Process

An RFC process for Zarr does not yet exist. Additionally, the v3 spec is a work in progress. However, since the implementations listed above, as well as others, are already being developed, I'd propose that if a consensus can be reached here, this issue be turned into an .rst file similar to those in the v3 branches (e.g. filters) and used as a temporary spec for defining arrays, with the understanding that this is a prototype intended to be amended and brought into the general extension mechanism as it develops.

I'd welcome any suggestions/feedback, but especially around:

  • Better terms for "multiscale" and "series"
  • The most useful enum values
  • Is this already too complicated? (Limit to one series per group?) or on the flip side:
  • Are there existing use cases that aren't supported? (Note: I'm aware of some examples like BDV's N5 format but I'd suggest they are higher-level than just "multiscale arrays".)

Deadline for a first round of comments: March 15, 2020
Deadline for a second round of comments: April 15, 2020

Detailed example

Color key (according to https://www.ietf.org/rfc/rfc2119.txt):

- MUST     : If these values are not present, the multiscale series will not be detected.
! SHOULD   : Missing values may cause issues in future versions.
+ MAY      : Optional values which can be readily omitted.
# UNPARSED : When updating between versions, no transformation will be performed on these values.

Color-coded example:

-{
-  "multiscales": [
-    {
!      "version": "0.1",
!      "name": "example",
-      "datasets": [
-        {"path": "0"},
-        {"path": "1"},
-        {"path": "2"}
-      ],
!      "type": "gaussian",
!      "metadata": {
+        "method":
#          "skimage.transform.pyramid_gaussian",
+        "version":
#          "0.16.1",
+        "args":
#          [true],
+        "kwargs":
#          {"multichannel": true}
!      }
-    }
-  ]
-}
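
The color key could be turned into a small check along the following lines. This is only a sketch of one possible reading: the function name `check_multiscale` is hypothetical, and treating an empty "datasets" list as an error is an assumption rather than something the draft states.

```python
def check_multiscale(entry):
    """Return (errors, warnings) for one entry of the "multiscales" list.

    errors   : violated MUST rules (the series would not be detected)
    warnings : missing SHOULD keys (may cause issues in future versions)
    """
    errors, warnings = [], []

    datasets = entry.get("datasets")
    if not isinstance(datasets, list) or not datasets:
        # Assumption: an empty list is treated like a missing one.
        errors.append('"datasets" MUST be a non-empty list')
    else:
        for i, dataset in enumerate(datasets):
            if not isinstance(dataset, dict) or not isinstance(dataset.get("path"), str):
                errors.append('datasets[%d] MUST have a string "path"' % i)

    # Keys marked SHOULD in the color-coded example.
    for key in ("version", "name", "type", "metadata"):
        if key not in entry:
            warnings.append('"%s" SHOULD be present' % key)

    return errors, warnings


minimal = {"datasets": [{"path": "0"}, {"path": "1"}]}
errors, warnings = check_multiscale(minimal)
print(errors)    # []
print(warnings)  # four warnings, one per missing SHOULD key
```

The MAY keys inside "metadata" are deliberately not inspected, which matches the intent that they can be omitted or left unparsed.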

Explanation

  • Multiple multiscale series of datasets can be present in a single group.
  • By convention, the first multiscale should be chosen if all else is equal.
  • Alternatively, a multiscale can be chosen by name or, with slightly more effort, by the zarray metadata such as chunk size.
  • The paths to the arrays are ordered from largest to smallest.
  • These paths could potentially point to datasets in other groups via “../foo/0” in the future. For now, the identifiers MUST be local to the annotated group.
  • The "type" values SHOULD (MUST?) come from the enumeration below.
  • The metadata example is taken from https://scikit-image.org/docs/dev/api/skimage.transform.html#skimage.transform.pyramid_reduce
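
The selection and path rules above might be sketched as follows. Both helper names are hypothetical, and `is_local_path` encodes just one interpretation of "local to the annotated group" (no absolute paths and no ".." components); the draft does not define this precisely.

```python
def select_multiscale(multiscales, name=None):
    """Pick a series by name, or the first one if no name is given."""
    if not multiscales:
        raise ValueError("no multiscale series present")
    if name is None:
        # By convention, the first multiscale is the default choice.
        return multiscales[0]
    for entry in multiscales:
        if entry.get("name") == name:
            return entry
    raise KeyError(name)


def is_local_path(path):
    """Assumed rule: non-empty, not absolute, and no '..' component."""
    parts = path.split("/")
    return bool(path) and not path.startswith("/") and ".." not in parts


multiscales = [
    {"name": "gaussian", "type": "gaussian", "datasets": [{"path": "base"}, {"path": "G1"}]},
    {"name": "laplacian", "type": "laplacian", "datasets": [{"path": "base"}, {"path": "L1"}]},
]

default = select_multiscale(multiscales)
chosen = select_multiscale(multiscales, name="laplacian")
print(default["name"], chosen["name"])  # gaussian laplacian
print(is_local_path("0"), is_local_path("../foo/0"))  # True False
```

Choosing by zarray metadata such as chunk size would additionally require opening each array, which is why it is left out of this sketch.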

Type enumeration:

Sample code

#!/usr/bin/env python
import argparse
import zarr
import numpy as np
from skimage import data
from skimage.transform import pyramid_gaussian, pyramid_laplacian

parser = argparse.ArgumentParser()
parser.add_argument("zarr_directory")
ns = parser.parse_args()


# 1. Setup of data and Zarr directory
base = np.tile(data.astronaut(), (2, 2, 1))

gaussian = list(
    pyramid_gaussian(base, downscale=2, max_layer=4, multichannel=True)
)

laplacian = list(
    pyramid_laplacian(base, downscale=2, max_layer=4, multichannel=True)
)

store = zarr.DirectoryStore(ns.zarr_directory)
grp = zarr.group(store)
grp.create_dataset("base", data=base)


# 2. Generate datasets
series_G = []
for g, dataset in enumerate(gaussian):
    if g == 0:
        # Level 0 is recorded under the "base" array written above.
        path = "base"
    else:
        path = "G%s" % g
        grp.create_dataset(path, data=dataset)
    series_G.append({"path": path})

series_L = []
for i, dataset in enumerate(laplacian):
    if i == 0:
        # As for the Gaussian series, level 0 points at "base".
        path = "base"
    else:
        path = "L%s" % i
        grp.create_dataset(path, data=dataset)
    series_L.append({"path": path})


# 3. Generate metadata block
multiscales = []
for name, series in (("gaussian", series_G),
                     ("laplacian", series_L)):
    multiscale = {
      "version": "0.1",
      "name": name,
      "datasets": series,
      "type": name,
    }
    multiscales.append(multiscale)
grp.attrs["multiscales"] = multiscales

which results in a .zattrs file of the form:

{
    "multiscales": [
        {
            "datasets": [
                {
                    "path": "base"
                },
                {
                    "path": "G1"
                },
                {
                    "path": "G2"
                },
                {
                    "path": "G3"
                },
                {
                    "path": "G4"
                }
            ],
            "name": "gaussian",
            "type": "gaussian",
            "version": "0.1"
        },
        {
            "datasets": [
                {
                    "path": "base"
                },
                {
                    "path": "L1"
                },
                {
                    "path": "L2"
                },
                {
                    "path": "L3"
                },
                {
                    "path": "L4"
                }
            ],
            "name": "laplacian",
            "type": "laplacian",
            "version": "0.1"
        }
    ]
}

and the following on-disk layout:

/var/folders/z5/txc_jj6x5l5cm81r56ck1n9c0000gn/T/tmp77n1ga3r.zarr
├── G1
│   ├── 0.0.0
...
│   └── 3.1.1
├── G2
│   ├── 0.0.0
│   ├── 0.1.0
│   ├── 1.0.0
│   └── 1.1.0
├── G3
│   ├── 0.0.0
│   └── 1.0.0
├── G4
│   └── 0.0.0
├── L1
│   ├── 0.0.0
...
│   └── 3.1.1
├── L2
│   ├── 0.0.0
│   ├── 0.1.0
│   ├── 1.0.0
│   └── 1.1.0
├── L3
│   ├── 0.0.0
│   └── 1.0.0
├── L4
│   └── 0.0.0
└── base
    ├── 0.0.0
...
    └── 1.1.1

9 directories, 54 files
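
For orientation, the array shape at each level of the sample pyramids can be estimated with a few lines of arithmetic. This sketch assumes a 1024 x 1024 x 3 base (the 512 x 512 RGB astronaut image tiled 2 x 2) and assumes each level divides the spatial dimensions by the downscale factor, rounding up, while leaving the channel axis untouched. The function `pyramid_shapes` and its parameters are hypothetical; it does not call scikit-image and says nothing about the chunk layout Zarr chooses.

```python
import math


def pyramid_shapes(base_shape, downscale=2, max_layer=4, channel_axis=-1):
    """Estimate the shape of every pyramid level, largest first."""
    axis = channel_axis % len(base_shape)
    shapes = [tuple(base_shape)]
    for _ in range(max_layer):
        previous = shapes[-1]
        shapes.append(tuple(
            # Keep the channel axis; shrink every other axis, rounding up.
            n if i == axis else math.ceil(n / downscale)
            for i, n in enumerate(previous)
        ))
    return shapes


shapes = pyramid_shapes((1024, 1024, 3))
for level, shape in enumerate(shapes):
    print(level, shape)
# 0 (1024, 1024, 3)
# 1 (512, 512, 3)
# 2 (256, 256, 3)
# 3 (128, 128, 3)
# 4 (64, 64, 3)
```

The list is ordered from largest to smallest, which is the same order the "datasets" paths are expected to follow.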

| Revision | Source | Date | Description |
|----------|--------|------|-------------|
| 6 | External feedback on twitter and image.sc | 2020-05-06 | Remove "scale"; clarify ordering and naming |
| 5 | External bug report from @mtbc | 2020-04-21 | Fixed error in the simple example |
| 4 | #50 (comment) | 2020-04-08 | Changed "name" to "path" |
| 3 | Discussions up through #50 (comment) | 2020-04-01 | Updated naming schema |
| 2 | #50 (comment) | 2020-03-07 | Fixed typo |
| 1 | @joshmoore | 2020-03-06 | Original text from in person discussions |

Thanks to @ryan-williams, @jakirkham, @freeman-lab, @petebankhead, @jni, @sofroniewn, @chris-allan, and anyone else whose GitHub account I've forgotten for the preliminary discussions.
