Database sources where each array element is a separate database row #438
Comments
I dug into this a bit. The quick route I was hoping for is not possible. The current interface between the array class and the store deals only in string keys and compressed whole-chunk byte-blobs. In an example like:

    store = zarr.DBElementStore(…)
    z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
    z[3, 3] = 42

ideally I'd like the last line to just perform an update of the one corresponding database row. My storage layer could, for each chunk, look up the corresponding rows and synthesize a chunk on the fly; otherwise, some more radical change to the core array/store interface would be needed.

This seems relevant to zarr-developers/zarr-specs#30.
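For context, zarr (v2) stores implement the `MutableMapping` interface, mapping string keys (e.g. "0.0" for a chunk, ".zarray" for metadata) to opaque byte blobs. A minimal sketch of a DB-backed store under that contract (`KeyBlobStore` is a hypothetical name) shows why a single-element write can't reach the database as a single-row update:

```python
import sqlite3
from collections.abc import MutableMapping

class KeyBlobStore(MutableMapping):
    """Hypothetical sqlite-backed store satisfying zarr's v2 contract:
    string keys map to opaque byte blobs. __setitem__ only ever receives
    a whole encoded chunk -- there is no hook for an element-level update."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")

    def __getitem__(self, key):
        row = self.db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        # `value` is the entire serialized chunk, even when the caller
        # only assigned one element (e.g. z[3, 3] = 42).
        self.db.execute("REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
        self.db.commit()

    def __delitem__(self, key):
        cur = self.db.execute("DELETE FROM kv WHERE k = ?", (key,))
        self.db.commit()
        if cur.rowcount == 0:
            raise KeyError(key)

    def __iter__(self):
        return (k for (k,) in self.db.execute("SELECT k FROM kv"))

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
```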
Have you played with structured arrays at all? Zarr also supports these and this sounds like a potentially good match for what you are describing, but maybe I'm missing something.
Hi Ryan, I think what you are describing is more like a persistence mechanism for sparse arrays, where there is no need for any concept of chunking or compression (or dtype endianness). Possibly related: #424 which links to https://github.com/daletovar/zsparse
Just to add, I think you're describing using a database to store a sparse matrix in COO (coordinate) format. Cf. scipy's coo_matrix.
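For reference, a COO matrix stores one (row, column, value) triple per nonzero entry, which is structurally the same as one database row per element:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Three nonzero entries of a 10x10 matrix, as (row, col, value) triples.
rows = np.array([3, 5, 9])
cols = np.array([3, 0, 7])
vals = np.array([42.0, 1.5, 7.0])
m = coo_matrix((vals, (rows, cols)), shape=(10, 10))
print(m.toarray()[3, 3])  # 42.0
```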
Thanks all.

@jakirkham I don't think structured arrays get at my goal here, but it's possible I'm missing something.

@alimanfoo zsparse seems like a wrapper on top of zarr for storing one logical sparse array as three underlying 1-D Zarr arrays. That's definitely a useful abstraction I've wanted in other contexts, so I'm glad to learn of it.

What I want here diverges a bit deeper in the stack. Imagine: I have an existing database table with records that conceptually form a 2-D array (e.g. (idx1, idx2, value) triples), and I want to query that table as if it were a 2-D Zarr array. The concept of chunks can still be meaningful in this world, but they would be virtual: the storage layer just gives a uniform API for accessing all the elements in the array, but each call-site could nevertheless interact with that formless layer in terms of a chunk shape (reflecting, e.g., a desire about how to parallelize over the full array).

Compressed-chunk byte-blobs are not meaningful in this context, but the current Zarr storage interface is implemented entirely in terms of them, so that's what I'm wrestling with. It's interesting to think about how Zarr can encompass arrays where no physical manifestation of [compressed, whole-chunk blobs] exists in the underlying storage medium. I hoped I could just splice into the storage layer and support this, but now I'm understanding that more changes would be required.
FWIW I think what you're after is essentially a scipy sparse COO matrix but using database columns to store the rows, columns and values, rather than numpy arrays. In this case I think the zarr abstractions of chunks and the key/value storage interface are just getting in the way. You might as well try to implement the numpy array API (at least `__getitem__` and `__setitem__`) directly on top of the database.
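A minimal sketch of that suggestion, assuming a sqlite table of (i, j, value) rows and zero as the fill value; `DBArray2D` and its `region` helper are illustrative names, not an existing API:

```python
import sqlite3
import numpy as np

class DBArray2D:
    """Hypothetical 2-D array view over a table of (i, j, value) rows;
    missing rows read as zero, as in a sparse matrix."""

    def __init__(self, path, shape):
        self.shape = shape
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cells ("
            " i INTEGER, j INTEGER, value REAL, PRIMARY KEY (i, j))")

    def __setitem__(self, idx, value):
        i, j = idx  # scalar assignment only, for brevity
        self.db.execute(
            "REPLACE INTO cells (i, j, value) VALUES (?, ?, ?)", (i, j, value))
        self.db.commit()

    def __getitem__(self, idx):
        i, j = idx
        row = self.db.execute(
            "SELECT value FROM cells WHERE i = ? AND j = ?", (i, j)).fetchone()
        return row[0] if row else 0.0

    def region(self, i0, i1, j0, j1):
        """Materialize the half-open rectangle [i0:i1, j0:j1) densely."""
        out = np.zeros((i1 - i0, j1 - j0))
        rows = self.db.execute(
            "SELECT i, j, value FROM cells"
            " WHERE i >= ? AND i < ? AND j >= ? AND j < ?",
            (i0, i1, j0, j1))
        for i, j, v in rows:
            out[i - i0, j - j0] = v
        return out

a = DBArray2D(":memory:", (10, 10))
a[3, 3] = 42
print(a[3, 3])               # 42.0
print(a.region(0, 5, 0, 5))  # dense 5x5 block containing that element
```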
Original issue:

My impression is that the existing DB backends for Zarr use small DB tables of chunks: each row has a string key (that would otherwise be a filesystem path) and a binary blob (that would otherwise be the compressed chunk file at that path).
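For example, assuming zarr 2.x, the built-in `SQLiteStore` works exactly this way:

```python
import zarr

# One table of (key, blob) pairs, where each blob is a whole compressed chunk.
store = zarr.SQLiteStore("example.sqldb")
z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
z[3, 3] = 42  # rewrites the entire blob stored under the key "0.0"
store.close()
```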
I wanted to flag a need I keep seeing in single-cell, and have discussed with various folks (incl. @alasla, @mckinsel, @tomwhite, @laserson), which is to put lots of gene-expression matrices in a database (instead of storing each one in an HDF5 file, CSV, or Zarr directory), where each entry in these 2-D sparse matrices is stored as a database row (likely a (cell ID, gene ID, count) triple).
Generalizing, an N-D Zarr dataset can have each entry mapped to a DB row with N integer "key" columns and one "value" column (storing elements of the given Zarr dtype).
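Concretely, for the 2-D expression-matrix case, such a table might look like this (hypothetical schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One row per stored element: N = 2 integer "key" columns plus a "value" column.
db.execute(
    "CREATE TABLE entries ("
    " cell_id INTEGER, gene_id INTEGER, count INTEGER,"
    " PRIMARY KEY (cell_id, gene_id))")
db.execute("INSERT INTO entries VALUES (?, ?, ?)", (3, 7, 42))
db.commit()
```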
You can straightforwardly support existing Zarr access-patterns by indexing such a table on the "key" columns, and letting Zarr page full chunks into memory to operate on, as usual. Fetching a chunk from such a DB table is a simple DB query against an index (with appropriate chunk-size-multiple bounds on each dimension-column), and downstream code need not care that it is being fed a chunk that is entirely virtual.
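A sketch of fetching one virtual chunk from the hypothetical `entries` table above, where `(ch, cw)` is the chunk shape and `(ci, cj)` is the chunk's position in the chunk grid:

```python
import numpy as np

def read_chunk(db, ci, cj, ch, cw):
    """Materialize the virtual chunk at grid position (ci, cj): a range
    query against the (cell_id, gene_id) index, with chunk-size-multiple
    bounds on each key column; absent rows read as zero."""
    i0, j0 = ci * ch, cj * cw
    chunk = np.zeros((ch, cw), dtype=np.int64)
    rows = db.execute(
        "SELECT cell_id, gene_id, count FROM entries"
        " WHERE cell_id >= ? AND cell_id < ?"
        "   AND gene_id >= ? AND gene_id < ?",
        (i0, i0 + ch, j0, j0 + cw))
    for i, j, v in rows:
        chunk[i - i0, j - j0] = v
    return chunk
```

Downstream code receives an ordinary in-memory numpy chunk and never learns that no serialized blob exists anywhere.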
This model can also trivially simulate concatenation, splitting, and re-chunking of Zarr trees, potentially obviating a host of related problems (e.g. #297, #323, #392). It also raises a lot of broader questions: when should you ever store things in a filesystem instead of a database (possibly never 😝)? How core are filesystem assumptions to the essence of Zarr (not very, IMO, though we haven't really hashed this out)?
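For instance, under the `read_chunk` sketch above, re-chunking is just a change of query bounds; no stored bytes are rewritten:

```python
# Two different virtual chunkings of the same table -- nothing is rewritten.
a = read_chunk(db, ci=0, cj=0, ch=5, cw=5)     # chunks of shape (5, 5)
b = read_chunk(db, ci=0, cj=0, ch=100, cw=10)  # chunks of shape (100, 10)
```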
In any case, I am eager to make a Zarr backend for "entry"-level DBs like this, and will post any progress here. Any thoughts are welcome!