Core protocol v3.0 - stores and storage protocol #30


alimanfoo

This PR has work towards straw man content for sections on stores and storage protocol.

alimanfoo commented May 14, 2019

This isn't really ready for review yet, but comments and questions are welcome if anyone is looking.

I was originally going to stick close to the v2 protocol, but in trying to write this I've had a deeper think about inefficiencies in the v2 protocol when layered on top of storage with high latency (e.g., cloud object storage). There are in fact some horrible inefficiencies with v2, which can be worked around with the consolidated metadata extension, but which need not be so bad if the protocol is redesigned with the limitations of cloud storage in mind. So this PR currently has something quite different from the v2 protocol, which should be much more efficient for cloud storage and distributed use in general. I'll try to explain the thinking behind the currently proposed solution over the coming days.

@alimanfoo

OK, here is an attempt to illustrate the problems with the current v2.0 protocol when used with high latency stores like cloud object stores. Basically, certain common tasks, like listing the children of a group, browsing a hierarchy, or creating groups or arrays, end up requiring lots of store operations. That works fine on a local file system, where each store operation has low latency, but badly on high latency stores, because each operation requires network communication.

I believe the proposal for v3.0 currently in this PR solves these problems, but I need a bit more time to work that through and unpack why and how. Will try to follow up with more on that next week.

@alimanfoo

Btw just to add, I think consolidated metadata is an important feature and should end up being a protocol extension. However, the core protocol could be redesigned so that at least certain basic tasks like exploring hierarchies work much better on cloud storage without always requiring consolidated metadata.

@alimanfoo alimanfoo force-pushed the core-protocol-v3.0-stores branch from 8c14a6a to 420f438 on May 16, 2019 11:29
@alimanfoo

Just to say I've pushed a different solution, still aiming to minimise store operations but maybe a little more intuitive.

tam203 commented May 17, 2019

I see that list is a required method on a store, which makes me feel a little concerned, since we have been working with S3-backed Zarrs with tens of thousands of keys. A list on these does not perform well.

Is the reasoning behind moving the metadata and data out into data/root/... and meta/root/... hoping that you only have to perform a full list under meta/root and can avoid listing all the data chunks?

My understanding is that listing on S3-like stores being faster with a prefix (such as meta/) than over a whole bucket is an implementation detail rather than a given, so I'm slightly worried if we need to do any list operations.

I guess the alternative is to have this information (such as what groups there are) in some central place (such as the root metadata), or tree-ing, so that each parent knows its children but not its grandchildren.

I need to spend some time writing my thoughts up more coherently, but I'd be keen to hear your thinking.

@alimanfoo

Hi @tam203, a few quick responses, I'll follow up with more detail next week...

I see that list is a required method on a store, this makes me feel a little concerned since we have been working with s3 backed Zarrs with 10s of 1000s of keys. A list on this does not perform well.

Yes, listing all keys will not work well with thousands of keys, and is to be avoided if possible. Object stores do support list with a prefix, and also list with a prefix and delimiter (analogous to a directory listing); I think we could design the protocol so those can be used where available.
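To make that concrete, here's a rough sketch (purely illustrative, not from any spec) of what list-with-prefix and list-with-prefix-and-delimiter mean over a flat key space, emulated with a plain Python dict standing in for a bucket; the key names are invented:

```python
def list_prefix(store, prefix):
    # like S3 ListObjectsV2 with Prefix: every key starting with prefix
    return sorted(k for k in store if k.startswith(prefix))

def list_dir(store, prefix, delimiter="/"):
    # like Prefix + Delimiter: immediate children only; a trailing
    # delimiter marks a "common prefix" (pseudo-directory)
    children = set()
    for k in store:
        if k.startswith(prefix):
            suffix = k[len(prefix):]
            if delimiter in suffix:
                children.add(suffix.split(delimiter, 1)[0] + delimiter)
            else:
                children.add(suffix)
    return sorted(children)

# invented key names, following the meta/ and data/ prefixes in this PR
store = {
    "meta/root/foo/bar/baz.array.json": b"...",
    "meta/root/foo/bar/qux.array.json": b"...",
    "data/root/foo/bar/baz/0.0": b"...",
}
print(list_dir(store, "meta/root/foo/bar/"))
# ['baz.array.json', 'qux.array.json']
```

Whether such calls are actually efficient on a real bucket with very many keys is exactly the performance question raised above.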

Is the reasoning behind moving the metadata and data out into data/root/... and meta/root/... hoping that you only have to perform a full list under meta/root and can avoid listing all the data chunks?

Yes, list with "meta/" as prefix gets you all the metadata keys, which is a way to get the whole hierarchy if you need it.

My understanding is that listing on S3 like stores being faster based on a prefix (such as meta/) rather than a whole bucket is an implementation detail rather than a given so I'm slightly worried if we need to do any list operations.

I'd be interested in any performance experience of using list with a prefix or list with a prefix and delimiter on buckets with lots of keys.

I guess the alternative is to have this information (such as what groups there are in some central place (such in the root metadata) or tree-ing so parent knows their children but not their parents of grandchildren.

I think that we can balance these needs between the core protocol, which provides some basic functionality and is robust to lots of things happening in parallel (including adding new groups and arrays), and protocol extensions like consolidated metadata and your proposal for groups to list their children, which provide ways of "snapshotting" either the whole hierarchy or parts of the hierarchy.

But will be good to unpack all that and work it through carefully.

@alimanfoo

Here's an attempt to explain why it's worth reconsidering some aspects
of the zarr core storage protocol, particularly how storage keys are
constructed and how hierarchy structure is created and discovered.

Creating hierarchies; implicit groups

Consider a user that is creating a hierarchy. They are creating groups
and arrays, with some nesting. E.g., a user might want to create a
hierarchy with arrays at paths "/foo/bar/baz" and "/foo/bar/qux".

The current zarr protocol (v2) says that if creating a node at some
path P, then groups MUST be created at all ancestor paths of P. E.g.,
if an array is created at path "/foo/bar/baz" then groups must also be
created at paths "/foo/bar", "/foo" and "/". In the v2 protocol the
existence of a group at a path "/foo/bar" is marked by the presence of
a group metadata document under the storage key "foo/bar/.zgroup". So
checking if a group "/foo/bar" exists means performing
get("foo/bar/.zgroup"), and creating the group means performing
set("foo/bar/.zgroup", value) where value is a group metadata
document. Similar for the ancestor groups "/foo" and "/".

There's a couple of potential problems with this. First, when creating
an array at path "/foo/bar/baz", the zarr implementation has to first
check that groups exist at all ancestor paths, and create them if they
don't exist. That's at least 3 and up to 6 operations against the
store API. If the store has a relatively high latency for each
operation, then that can start to take up noticeable time, although
possibly not the end of the world.
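To illustrate, here's a rough sketch of that check-then-create dance, using a plain dict as the store; create_array_v2 is an invented helper for illustration, not actual zarr API:

```python
import json

def create_array_v2(store, path, array_meta):
    # ensure a group exists at every ancestor path: one get() and
    # possibly one set() per ancestor, before the array itself
    ancestors = path.strip("/").split("/")[:-1]
    for i in range(len(ancestors) + 1):
        prefix = "/".join(ancestors[:i])
        key = (prefix + "/" if prefix else "") + ".zgroup"
        if key not in store:                             # get()
            store[key] = json.dumps({"zarr_format": 2})  # set()
    store[path.strip("/") + "/.zarray"] = json.dumps(array_meta)

store = {}
create_array_v2(store, "/foo/bar/baz", {"shape": [10]})
print(sorted(store))
# ['.zgroup', 'foo/.zgroup', 'foo/bar/.zgroup', 'foo/bar/baz/.zarray']
```

That's three existence checks and up to three group writes before the array metadata is even written; run from two workers concurrently, the `if key not in store` test is exactly the race window described in the next paragraph.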

However, what if the arrays "/foo/bar/baz" and "/foo/bar/qux" are
being created by two separate worker processes working in parallel?
Each process will check the ancestor paths, trying to create any
ancestor groups that don't already exist. But there is a race
condition, because in between checking if a group exists and creating
the group, another process may have created the same group. So two
processes can end up overwriting each other. Again possibly not the
end of the world if group metadata documents don't contain
anything. But if there was any information in those documents that
differed between the two workers, then one could get lost.

Instead of this, I've been thinking about an approach that (1)
requires less store operations, and (2) is a bit more robust to nodes
being created in parallel.

The protocol change would be to say that, if a node is being created at
some path P, then it is not necessary to explicitly create groups
at all ancestor paths of P. Instead, all ancestor groups are
implicitly created.

E.g., if a user creates an array at path "/foo/bar/baz", they do not
need to check if groups exist at ancestor paths "/foo/bar", "/foo" and
"/". These groups are implicitly created. So to be concrete, to create
an array at path "/foo/bar/baz" only a single store operation is
required, which is set(key, value), where key is the array metadata
key and value is an array metadata document.
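As a sketch (with invented key names, since the exact v3 key layout is still being worked out in this PR), creating the array is one set(), and the implied groups can be recovered from the keys themselves:

```python
import json

def create_array_v3(store, path, array_meta):
    # a single set(); no ancestor checks, no ancestor writes
    store["meta/root" + path + ".array.json"] = json.dumps(array_meta)

def implied_groups(store):
    # every ancestor path of every metadata key is an implicit group
    groups = {"/"}
    for key in store:
        if key.startswith("meta/root/"):
            path = key[len("meta/root"):].rsplit("/", 1)[0]
            while path:
                groups.add(path)
                path = path.rsplit("/", 1)[0]
    return sorted(groups)

store = {}
create_array_v3(store, "/foo/bar/baz", {"shape": [10]})
create_array_v3(store, "/foo/bar/qux", {"shape": [10]})
print(implied_groups(store))
# ['/', '/foo', '/foo/bar']
```

Two workers creating "baz" and "qux" in parallel each touch only their own key, so there is nothing to race over.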

Hierarchy node discovery

Say a user is given enough information to open a zarr hierarchy, but is
not given any information about the names of the groups or arrays
within the hierarchy. I.e., they want to start exploring the hierarchy
to discover what it contains.

Let's assume the user explores from the top down. I.e., they ask, what
are the children of the root node called, and are they arrays or
groups. They then might pick a child group, and ask about its
children, etc.

With the current v2 protocol, there are two problems, and which one is
encountered depends on what operations the store supports.

If the store supports a directory-style listing operation
listdir(prefix), then with the v2 protocol, finding the children
of a node takes a lot of store operations. This is illustrated in the
notebook I posted previously.

If the store does not support directory-style listing operation,
but only supports list() which returns all store keys, then that's
only a single store operation. However, all the data chunk keys will
be returned as well as the metadata keys, which will be impractical
for any modest number of arrays and chunks.
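A toy illustration of the scale problem, assuming the meta/ and data/ prefixes proposed in this PR: with one modest array the chunk keys swamp everything, so a bare list() touches thousands of keys while a prefix-restricted listing touches one.

```python
store = {}
store["meta/root/foo.array.json"] = b"{}"      # 1 metadata key
for i in range(10_000):                        # 10,000 chunk keys
    store[f"data/root/foo/{i}"] = b"\x00"

all_keys = list(store)                                   # bare list()
meta_keys = [k for k in store if k.startswith("meta/")]  # list with prefix
print(len(all_keys), len(meta_keys))
# 10001 1
```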

There is also a third problem, which is that some stores might not
support any type of list operation at all. E.g., Ryan previously
created an HTTP store, which does not support any list operations. You
have to use it with the consolidated metadata extension, and the user
opening it has to know that consolidated metadata has been used. This
is all fine, but what do we say about this in the core protocol? Does
list() become an optional operation in the store interface? If so,
then how does anyone discover nodes in the hierarchy?

There are several possible solutions for this, and I don't have a
strong opinion yet about the best approach. What's currently in this
PR is actually not a complete solution, so needs to be revisited I
think.

Conclusions

It would be good if v3.0 could come up with an approach that resolves
both of the above problems. I.e., (1) parallel creation of hierarchy
nodes requires as little synchronisation between workers as possible,
and (2) we have a solution for discovering hierarchy structure that
works well on high latency stores.

My proposal for (1) is to allow implicit groups, as described above.

I don't have a strong opinion about a solution to (2) yet. If we say
that list() is an optional part of the store interface, then we
need to support something like consolidated metadata and/or allowing
groups to list their children within the group metadata document. Then
if so, do either of these become part of the core protocol, or do we
leave them to a protocol extension? And how do we explain the fact
that the consolidated metadata might be out of sync with the actual
state of the hierarchy, and so some consolidation process needs to be
manually run?

Just thinking out loud, interested in thoughts if anyone has managed
to follow this far, tricky :-)

@alimanfoo

Just to mention I've reverted the draft in this PR to something simpler and a little closer to current zarr v2 and n5. For now I've still left in the prefixes "meta" and "data" for metadata and chunks respectively, I think that's still worth considering. But still need to figure out how the protocol supports discovery of the hierarchy structure.

tam203 commented May 21, 2019

@alimanfoo super write-up of the problem... I'm going to go away and think some more about it.

I'd be really keen not to require list (the HTTP store is one good reason, as is inefficient listing on object stores) and I'm really keen to support implicit groups, but how to reconcile the two is certainly not clear.

For what it's worth, I think splitting metadata and data under different prefixes is a good idea, to potentially help with listing if required.

@alimanfoo

Just to have a straw man, I've pushed a possible solution here, which addresses the issues described above. Not precious about this at all, just wanted to illustrate one possible approach. Here's the essence:

  • Use separate prefixes "meta" and "data" for metadata and chunks respectively. This allows for more efficient listing of metadata keys.
  • Allow implicit groups, i.e., groups whose existence is implied by some descendant, but don't have a metadata document in the store themselves.
  • Define store operations list, listpre and listdir, all of which are optional. listdir provides functionality a bit like listing directories on a file system. It would be possible to natively support listdir on file system storage (obviously) and also on cloud object storage via listing a bucket with prefix and delimiter arguments.
  • Use metadata keys that allow for discovering the children of any group and differentiating which children are arrays or groups via a single listdir call, if that operation is supported by the store.
  • If a store does not support any of the list... operations then discovery of nodes in a hierarchy is not possible, and the contents of the hierarchy have to be communicated via some other means, e.g., protocol extension or out of band communication.
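To sketch the child-discovery point, suppose (hypothetically, the exact suffixes are up for discussion) that array metadata lives at meta/root/<path>.array.json and explicit group metadata at meta/root/<path>.group.json. Then one listdir call under a group's metadata prefix reveals every child and whether it is an array, an explicit group, or an implicit group:

```python
def listdir(store, prefix, delimiter="/"):
    # emulate directory-style listing over a flat key space
    out = set()
    for k in store:
        if k.startswith(prefix):
            rest = k[len(prefix):]
            if delimiter in rest:
                out.add(rest.split(delimiter, 1)[0] + delimiter)
            else:
                out.add(rest)
    return sorted(out)

def children(store, group_path):
    prefix = "meta/root" + group_path.rstrip("/") + "/"
    result = {}
    for entry in listdir(store, prefix):
        if entry.endswith(".array.json"):
            result[entry[: -len(".array.json")]] = "array"
        elif entry.endswith(".group.json"):
            result[entry[: -len(".group.json")]] = "explicit group"
        elif entry.endswith("/"):
            # a bare prefix with no metadata document: an implicit group
            result.setdefault(entry[:-1], "implicit group")
    return result

store = {
    "meta/root/foo/bar/baz.array.json": b"{}",
    "meta/root/foo/bar/qux.array.json": b"{}",
    "meta/root/foo/spam.group.json": b"{}",
}
print(children(store, "/foo"))
# {'bar': 'implicit group', 'spam': 'explicit group'}
```

The setdefault means that if a group has both an explicit metadata document and descendants, the explicit entry (which sorts first) wins over the implicit marker.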

Very happy to discuss or consider alternatives.

| Parameters: `key`
| Output: `value`

``set`` - Store a (`key`, `value`) pair.
@alimanfoo

Perhaps set and delete should be required for writeable stores? I.e., you could have read-only stores.

@martindurant commented May 22, 2019

Yes, definitely - the contract should specify that any store opened in read mode will only touch get
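As a sketch of how that contract might look in Python (invented class names, not actual zarr-python API): get on a readable interface, with set and delete only on the writeable one.

```python
from typing import Protocol

class ReadableStore(Protocol):
    def get(self, key: str) -> bytes: ...

class WriteableStore(ReadableStore, Protocol):
    def set(self, key: str, value: bytes) -> None: ...
    def delete(self, key: str) -> None: ...

class DictStore:
    """Trivial in-memory store satisfying the writeable interface."""
    def __init__(self):
        self._d = {}
    def get(self, key):
        return self._d[key]
    def set(self, key, value):
        self._d[key] = value
    def delete(self, key):
        del self._d[key]

s = DictStore()
s.set("meta/root.group.json", b"{}")
print(s.get("meta/root.group.json"))
# b'{}'
```

A store opened read-only would then only be expected to satisfy the readable interface.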

@alimanfoo

I think this is ready for wider discussion, @zarr-developers/core-devs any comments welcome.

@alimanfoo

@tam203 very interested in your thoughts here.

| Parameters: none
| Output: set of `keys`

``listpre`` - Retrieve all keys with a given prefix.

I don't think any of my filesystem mapper layers have this, but clearly they can and should.


(This subsection is not normative.)

TODO describe possible implementations.

A python dict? Probably the simplest, except the methods are not named as above

@alimanfoo

Yes that would be a good example. I was thinking to use this subsection just to describe briefly a small variety of different possible implementations, to reinforce that this protocol is not just about file system storage.

tam203 commented May 24, 2019

@alimanfoo Sorry for being slow.

So I think this is a good idea; I certainly prefer it to list being mandatory. If you want to avoid any list-like operation but you do want a hierarchy then you can use 'explicit' groups, is that correct?

tam203 commented May 24, 2019

@alimanfoo I was considering a different approach that I guess people would feel is maybe outside the core of zarr, but I would be keen to discuss it. I've not had the time to properly work it through, but I thought I'd share.

What I was thinking is somewhat a version of metadata consolidation. I would allow implicit groups, but also include in the root some item of metadata that points to all the child groups. The twist is that this could be a mapping rather than a list. So:

root attrs: meta/group.json

...
sets: {
  "v1": {"foo": "foo_v1"},
  "v2": {
    "foo": "foo_v2",
    "foo/bar": "bar_v2"
  }
}
...

In the version above there are two "sets" (not the best word, just made up for now). A set would be a collection of "semantic paths" mapped to "real paths". Thinking about it, maybe it would actually be better to only have one "set" per zarr. Let's call that a "mapping":

root attrs my_zarr_v1: meta/group.json

...
mapping: {
    "foo": "foo",
    "foo/bar": "bar"
}
...

root attrs my_zarr_v2: meta/group.json

...
mapping: {
   "foo": "foo_v2",
   "foo/bar": "bar"
}
...

Above I've created two zarrs that share some underlying data.

It's probably clear that the other motivation for this approach is that it allows live additions to zarrs in a "safe-ish" way. Additions can be made to the zarr on disk, but until those modifications are reflected in the root metadata it's as if they haven't happened. In this way it could also allow for versioning parts of a big zarr group and updating those versions. The old version would still work, and if done correctly the eventually consistent nature of some backends doesn't matter (v1 looks under a different key to v2).

It also would allow sharing of zarrays between multiple zarr groups (useful for example with common metadata such as lat-lon grid or maybe a land mask).
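If I've understood the idea, a minimal sketch (all names invented) of resolving semantic paths through a root mapping might look like this; two roots share the array stored under "bar" while "foo" differs between versions:

```python
import json

store = {
    "meta/root_v1.group.json": json.dumps(
        {"mapping": {"foo": "foo", "foo/bar": "bar"}}),
    "meta/root_v2.group.json": json.dumps(
        {"mapping": {"foo": "foo_v2", "foo/bar": "bar"}}),  # "bar" shared
}

def resolve(store, root_key, semantic_path):
    # translate a user-facing path into the real storage path
    mapping = json.loads(store[root_key])["mapping"]
    return mapping[semantic_path]

print(resolve(store, "meta/root_v1.group.json", "foo"))      # 'foo'
print(resolve(store, "meta/root_v2.group.json", "foo"))      # 'foo_v2'
print(resolve(store, "meta/root_v2.group.json", "foo/bar"))  # 'bar'
```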

My current thinking is to keep the actual metadata in the groups but to provide a route to discover them.

I'm sorry this is a bit of a rant/stream of consciousness; I just wanted to get something down before the weekend. I'm going to try to create a notebook explaining this better, but in the meantime any thoughts are welcome.

@alimanfoo

Thanks @tam203, those are all great thoughts.

FWIW I think it would be very doable to define these kinds of features via one or more protocol extensions. Within a protocol extension I think it would be fine to define whatever rules and semantics you like for embedding additional metadata within a group metadata document. So currently I think my favoured approach would be to have a minimal core protocol that provides support for a core feature set without any additional metadata, then allow freedom for these other needs to be addressed via extensions. If the extensions are widely useful then hopefully they would get implemented widely, but decoupling them from the core would just give us a way to work through the basics first.

Very happy to discuss though.

@alimanfoo

This PR has become a bit of a beast I'm afraid; lots of editing on related pieces has ended up here. I think probably the best thing for now will be to merge into the dev branch, then I'll write up some information on where this has got to, and we can review, revisit and rework any parts as needed.

@alimanfoo alimanfoo merged commit 88c4617 into zarr-developers:core-protocol-v3.0-dev Jun 11, 2019
@alimanfoo alimanfoo changed the title WIP: Core protocol v3.0 - stores and storage protocol Core protocol v3.0 - stores and storage protocol Jun 11, 2019
@alimanfoo alimanfoo deleted the core-protocol-v3.0-stores branch June 11, 2019 20:59
@alimanfoo

I've merged but will write up some information over on #16 and very happy to revisit any aspect of this.
