-
-
Notifications
You must be signed in to change notification settings - Fork 329
Fixed '/' prepend bug in ABSStore #526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @shikharsg, these are questions really more than comments, generally LGTM.
def _append_path_to_prefix(path, prefix): | ||
return '/'.join([normalize_storage_path(prefix), | ||
normalize_storage_path(path)]) | ||
def _append_path_to_prefix(self, path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering, what happens here if self.prefix
is None
(default)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During instantiation in the __init__
function, any prefix
supplied is passed to the zarr.util.normalize_storage_path
function before storing it in self.prefix
, so if prefix is None
it will be stored as ''
in self.prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah cool, thanks. In which case maybe the default value for the prefix parameter could the empty string rather than None, might be a little more obvious. Also if prefix is normalised in the constructor then probably could remove calls to normalize self.prefix in the methods.
zarr/tests/test_storage.py
Outdated
account_key='bar', blob_service_kwargs={'is_emulated': True}) | ||
store.rmdir() | ||
return store | ||
|
||
def test_iterators_with_prefix(self): | ||
store = self.create_store(prefix='test_prefix') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure this makes sense as haven't looked super closely at the implementation, but maybe worth testing multiple prefixes here, with and without leading or trailing slashes, just to confirm everything behaves as you expect? E.g.:
for prefix in 'test_prefix', '/test_prefix', 'test_prefix/', 'test/prefix', '', None:
store = self.create_store(prefix=prefix)
# ... tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since prefix
is passed to normalize_storage_path
before storing in self.prefix
any leading or trailing slashes will be removed.
Putting in some print statements in the tests:
def test_iterators_with_prefix(self):
for prefix in ['test_prefix', '/test_prefix', 'test_prefix/', 'test/prefix', '', None]:
print(prefix)
store = self.create_store(prefix=prefix)
# test iterator methods on empty store
assert 0 == len(store)
assert set() == set(store)
assert set() == set(store.keys())
assert set() == set(store.values())
assert set() == set(store.items())
# setup some values
store['a'] = b'aaa'
store['b'] = b'bbb'
store['c/d'] = b'ddd'
store['c/e/f'] = b'fff'
print([blob for blob in store.client.list_blob_names('test')])
# test iterators on store with data
assert 4 == len(store)
assert {'a', 'b', 'c/d', 'c/e/f'} == set(store)
assert {'a', 'b', 'c/d', 'c/e/f'} == set(store.keys())
assert {b'aaa', b'bbb', b'ddd', b'fff'} == set(store.values())
assert ({('a', b'aaa'), ('b', b'bbb'), ('c/d', b'ddd'), ('c/e/f', b'fff')} ==
set(store.items()))
Output is this:
test_prefix
['test_prefix/a', 'test_prefix/b', 'test_prefix/c/d', 'test_prefix/c/e/f']
/test_prefix
['test_prefix/a', 'test_prefix/b', 'test_prefix/c/d', 'test_prefix/c/e/f']
test_prefix/
['test_prefix/a', 'test_prefix/b', 'test_prefix/c/d', 'test_prefix/c/e/f']
test/prefix
['test/prefix/a', 'test/prefix/b', 'test/prefix/c/d', 'test/prefix/c/e/f']
['a', 'b', 'c/d', 'c/e/f']
None
['a', 'b', 'c/d', 'c/e/f']
Which is what you would expect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks.
Sorry if I am a little confused by this. But are you trying to put a Zarr group inside of a blob storage account not inside a blob container? |
@tjcrone sorry if that wasn't clear. All metadata (.zgroup, .zarray...) and data(chunks) will be blobs inside a particular container. Nothing is at the storage account level. The issue was that ABSStore prepends an extra '/' to all blob names. So if you opened a group in the root of the container the blobs would be named as ['/.zgroup', '/foo/.zgroup'], instead of ['.zgroup', 'foo/.zgroup']. The former is also awkward to look at in the storage explorer(see pictures in issue: #525). |
I don't think we ever contemplated the idea of turning a blob container into a Zarr group, which is what you are doing here by passing prefix=None. I guess we always thought that Zarr groups (sometimes referred to as Zarr "files") would have names, and live inside blob containers, which seems like the right thing to do. Maybe it is fine to do this, but at the moment I remain skeptical. @alimanfoo, do you have an opinion here? Would you recommend treating a blob container like a Zarr group? More info on Azure blob containers here: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction. |
Ah i understand now. Yes, this PR will allow a zarr container to become a group(also array for that matter). Before this PR we would probably have a structure like this: |
@martindurant does this functionality exist in |
In your example group1 is the name of the Zarr "file", which is typically given a .zarr extension, but that's not required, and it is inside of a blob container. I think the idea of treating a blob container as a Zarr "file" might be a bad idea, primarily because a container is not a blob. They are created and destroyed differently. I think if you tried to create a Zarr file that was a container rather than a blob inside a container, it would not work. Have you tried doing this? |
FWIW I don't see any problem in putting a zarr group (or array) at the root of a bucket, if that's what you want to do, i.e., you don't want to use the bucket for anything else. Should work fine on S3 and GCS. |
@shikharsg , I think so? Should try! |
I think we should maintain as much parity with the S3/GCS functionality as possible. So if it is possible to treat an S3 bucket as a Zarr group or array, and it has the functionality of creating a bucket just like it would create a file when the Zarr object needs creating, I'm fine with this. |
@@ -1994,8 +1994,7 @@ def _append_path_to_prefix(self, path): | |||
if self.prefix == '': | |||
return normalize_storage_path(path) | |||
else: | |||
return '/'.join([normalize_storage_path(self.prefix), | |||
normalize_storage_path(path)]) | |||
return '/'.join([self.prefix, normalize_storage_path(path)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what happens here if prefix='/'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the prefix
is normalized in the constructor, if you pass in '/'
for the prefix
it will still be stored as ''
in self.prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay great. Thanks.
Have we confirmed that the S3 and GCS will allow placing a Zarr group or array at the root of a bucket? If they do, we should merge this. I really think that treating an entire bucket or container as a Zarr file is an odd thing to do, perhaps because it removes a layer of abstraction that I think is useful. But maybe it's me that's odd. I will obviously defer to @alimanfoo, on this. |
I guess it is sortof like putting the contents of a file at the root of a file system (not as a file in /, but actually at the root, because you don't plan to use the file system anyway. I don't think most file systems will let you do this, because the root of the file system is a container for storing files and other containers, so doing this would be weird. But honestly if y'all think that storing Zarr collections at containers not in them is a good idea, then let's try it and see how it goes. |
With S3 it works: >>> s3 = s3fs.S3FileSystem(secret='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY', key='AKIAIOSFODNN7EXAMPLE', anon=False, client_kwargs={'endpoint_url':'http://localhost:9000'})
>>> store = s3fs.S3Map(root='test', s3=s3, check=False)
>>> root = zarr.group(store=store) Couldn't try GCS as I don't have an account and there doesn't seem to be an emulator. |
Thanks for the helpful comments @alimanfoo @tjcrone @martindurant . If there are no further reviews/objections can I go ahead and merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Fixes '/' prepend bug in
ABSStore
, issue: #525TODO: