Database sources where each array element is a separate database row #438
Comments
I dug into this a bit. The quick route I was hoping for is not possible. The current interface between the array class and the store deals only in string keys and compressed whole-chunk byte-blobs. In an example like:

    store = zarr.DBElementStore(…)
    z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
    z[3, 3] = 42

ideally I'd like the last line to just perform an update of the one corresponding database row. My storage layer could, for each chunk, look up the corresponding rows and synthesize a chunk on the fly; otherwise, some more radical change to the core array/store interface would be needed.

This seems relevant to zarr-developers/zarr-specs#30.
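For context, zarr (v2) stores implement the `MutableMapping` interface, mapping string keys (e.g. "0.0" for a chunk, ".zarray" for metadata) to opaque byte blobs. A minimal sketch of a DB-backed store under that contract (`KeyBlobStore` is a hypothetical name) shows why a single-element write can't reach the database as a single-row update:

```python
import sqlite3
from collections.abc import MutableMapping

class KeyBlobStore(MutableMapping):
    """Hypothetical sqlite-backed store satisfying zarr's v2 contract:
    string keys map to opaque byte blobs. __setitem__ only ever receives
    a whole encoded chunk -- there is no hook for an element-level update."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")

    def __getitem__(self, key):
        row = self.db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        # `value` is the entire serialized chunk, even when the caller
        # only assigned one element (e.g. z[3, 3] = 42).
        self.db.execute("REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
        self.db.commit()

    def __delitem__(self, key):
        cur = self.db.execute("DELETE FROM kv WHERE k = ?", (key,))
        self.db.commit()
        if cur.rowcount == 0:
            raise KeyError(key)

    def __iter__(self):
        return (k for (k,) in self.db.execute("SELECT k FROM kv"))

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
```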
Have you played with structured arrays at all? Zarr also supports these and this sounds like a potentially good match for what you are describing, but maybe I'm missing something.
Hi Ryan, I think what you are describing is more like a persistence mechanism for sparse arrays, where there is no need for any concept of chunking or compression (or dtype endianness). Possibly related: #424 which links to https://github.com/daletovar/zsparse
Just to add, I think you're describing using a database to store a sparse matrix in COO (coordinate) format. Cf. scipy's coo_matrix.
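For reference, a COO matrix stores one (row, column, value) triple per nonzero entry, which is structurally the same as one database row per element:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Three nonzero entries of a 10x10 matrix, as (row, col, value) triples.
rows = np.array([3, 5, 9])
cols = np.array([3, 0, 7])
vals = np.array([42.0, 1.5, 7.0])
m = coo_matrix((vals, (rows, cols)), shape=(10, 10))
print(m.toarray()[3, 3])  # 42.0
```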
Thanks all.

@jakirkham I don't think structured arrays get at my goal here, but it's possible I'm missing something.

@alimanfoo zsparse seems like a wrapper on top of zarr for storing one logical sparse array as three underlying 1-D Zarr arrays. That's definitely a useful abstraction I've wanted in other contexts, so I'm glad to learn of it.

What I want here diverges a bit deeper in the stack. Imagine: I have an existing database table with records that conceptually form a 2-D array (e.g. (idx1, idx2, value) triples), and I want to query that table as if it were a 2-D Zarr array. The concept of chunks can still be meaningful in this world, but they would be virtual: the storage layer just gives a uniform API for accessing all the elements in the array, but each call-site could nevertheless interact with that formless layer in terms of a chunk shape (reflecting, e.g., a desire about how to parallelize over the full array).

Compressed-chunk byte-blobs are not meaningful in this context, but the current Zarr storage interface is implemented entirely in terms of them, so that's what I'm wrestling with. It's interesting to think about how Zarr can encompass arrays where no physical manifestation of [compressed, whole-chunk blobs] exists in the underlying storage medium. I hoped I could just splice into the storage layer and support this, but now I'm understanding that more changes would be required.
FWIW I think what you're after is essentially a scipy sparse COO matrix but using database columns to store the rows, columns and values, rather than numpy arrays. In this case I think the zarr abstractions of chunks and the key/value storage interface are just getting in the way. You might as well try to implement the numpy array API (at least `__getitem__` and `__setitem__`) directly on top of the database.
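A minimal sketch of that suggestion, assuming a sqlite table of (i, j, value) rows and zero as the fill value; `DBArray2D` and its `region` helper are illustrative names, not an existing API:

```python
import sqlite3
import numpy as np

class DBArray2D:
    """Hypothetical 2-D array view over a table of (i, j, value) rows;
    missing rows read as zero, as in a sparse matrix."""

    def __init__(self, path, shape):
        self.shape = shape
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cells ("
            " i INTEGER, j INTEGER, value REAL, PRIMARY KEY (i, j))")

    def __setitem__(self, idx, value):
        i, j = idx  # scalar assignment only, for brevity
        self.db.execute(
            "REPLACE INTO cells (i, j, value) VALUES (?, ?, ?)", (i, j, value))
        self.db.commit()

    def __getitem__(self, idx):
        i, j = idx
        row = self.db.execute(
            "SELECT value FROM cells WHERE i = ? AND j = ?", (i, j)).fetchone()
        return row[0] if row else 0.0

    def region(self, i0, i1, j0, j1):
        """Materialize the half-open rectangle [i0:i1, j0:j1) densely."""
        out = np.zeros((i1 - i0, j1 - j0))
        rows = self.db.execute(
            "SELECT i, j, value FROM cells"
            " WHERE i >= ? AND i < ? AND j >= ? AND j < ?",
            (i0, i1, j0, j1))
        for i, j, v in rows:
            out[i - i0, j - j0] = v
        return out

a = DBArray2D(":memory:", (10, 10))
a[3, 3] = 42
print(a[3, 3])               # 42.0
print(a.region(0, 5, 0, 5))  # dense 5x5 block containing that element
```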
Original issue:

My impression is that the existing DB backends for Zarr use small DB tables of chunks: each row has a string key (that would otherwise be a filesystem path) and a binary blob (that would otherwise be the compressed chunk file at that path).
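For example, assuming zarr 2.x, the built-in `SQLiteStore` works exactly this way:

```python
import zarr

# One table of (key, blob) pairs, where each blob is a whole compressed chunk.
store = zarr.SQLiteStore("example.sqldb")
z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
z[3, 3] = 42  # rewrites the entire blob stored under the key "0.0"
store.close()
```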
I wanted to flag a need I keep seeing in single-cell, and have discussed with various folks (incl. @alasla, @mckinsel, @tomwhite, @laserson), which is to put lots of gene-expression matrices in a database (instead of storing each one in an HDF5 file, CSV, or Zarr directory), where each entry in these 2-D sparse matrices is stored as a database row (likely a (cell ID, gene ID, count) triple).
Generalizing, an N-D Zarr dataset can have each entry mapped to a DB row with N integer "key" columns and one "value" column (storing elements of the given Zarr dtype).
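Concretely, for the 2-D expression-matrix case, such a table might look like this (hypothetical schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One row per stored element: N = 2 integer "key" columns plus a "value" column.
db.execute(
    "CREATE TABLE entries ("
    " cell_id INTEGER, gene_id INTEGER, count INTEGER,"
    " PRIMARY KEY (cell_id, gene_id))")
db.execute("INSERT INTO entries VALUES (?, ?, ?)", (3, 7, 42))
db.commit()
```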
You can straightforwardly support existing Zarr access-patterns by indexing such a table on the "key" columns, and letting Zarr page full chunks into memory to operate on, as usual. Fetching a chunk from such a DB table is a simple DB query against an index (with appropriate chunk-size-multiple bounds on each dimension-column), and downstream code need not care that it is being fed a chunk that is entirely virtual.
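A sketch of fetching one virtual chunk from the hypothetical `entries` table above, where `(ch, cw)` is the chunk shape and `(ci, cj)` is the chunk's position in the chunk grid:

```python
import numpy as np

def read_chunk(db, ci, cj, ch, cw):
    """Materialize the virtual chunk at grid position (ci, cj): a range
    query against the (cell_id, gene_id) index, with chunk-size-multiple
    bounds on each key column; absent rows read as zero."""
    i0, j0 = ci * ch, cj * cw
    chunk = np.zeros((ch, cw), dtype=np.int64)
    rows = db.execute(
        "SELECT cell_id, gene_id, count FROM entries"
        " WHERE cell_id >= ? AND cell_id < ?"
        "   AND gene_id >= ? AND gene_id < ?",
        (i0, i0 + ch, j0, j0 + cw))
    for i, j, v in rows:
        chunk[i - i0, j - j0] = v
    return chunk
```

Downstream code receives an ordinary in-memory numpy chunk and never learns that no serialized blob exists anywhere.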
This model can also trivially simulate concatenation, splitting, and re-chunking of Zarr trees, potentially obviating a host of related problems (e.g. #297, #323, #392). It also raises a lot of broader questions: when should you ever store things in a filesystem instead of a database (possibly never 😝)? How core are filesystem assumptions to the essence of Zarr (not very, IMO, though we haven't really hashed this out)?
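For instance, under the `read_chunk` sketch above, re-chunking is just a change of query bounds; no stored bytes are rewritten:

```python
# Two different virtual chunkings of the same table -- nothing is rewritten.
a = read_chunk(db, ci=0, cj=0, ch=5, cw=5)     # chunks of shape (5, 5)
b = read_chunk(db, ci=0, cj=0, ch=100, cw=10)  # chunks of shape (100, 10)
```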
In any case, I am eager to make a Zarr backend for "entry"-level DBs like this, and will post any progress here. Any thoughts are welcome!