WIP: DataFrame support #84
Conversation
cc @alimanfoo
Cool to see work in this direction. Two questions: Why the need for a separate Frame class? Is this in scope for Zarr? I wonder if something like this should be external. It would be nice to see Zarr stay minimal and somewhat general.
Frame is essentially a Group that is laid out in a particular way, but there are a couple of issues that make it special.
Sure, this could be a separate package, zframe or whatever. Though it re-uses lots of the zarr machinery (which does multiple things, e.g. storage backends, hierarchies, and arrays).
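For context, a minimal sketch (not the API in this PR) of the layout being described: a DataFrame stored as a plain zarr Group of per-column Arrays, with table-level metadata kept in the group's attributes. It assumes the zarr v2-style API; names and chunk sizes are illustrative.

```python
import numpy as np
import pandas as pd
import zarr

df = pd.DataFrame({
    "price": np.random.rand(1_000_000),
    "volume": np.random.randint(0, 1_000, 1_000_000),
})

# one group per "frame", one array per column
g = zarr.group()
for name in df.columns:
    g.create_dataset(name, data=df[name].values, chunks=(100_000,))

# table-level metadata lives in the group's attributes
g.attrs["columns"] = list(df.columns)
g.attrs["nrows"] = len(df)

# read back a single column, or reassemble the whole frame
volume = g["volume"][:]
df2 = pd.DataFrame({name: g[name][:] for name in g.attrs["columns"]})
```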
updated the top. forgot the object column :<
Perhaps things like logical vs physical dtype mappings could be stored in the group's metadata? Your point about filtering across all columns does serve as pretty good motivation for a separate class. Some special API on top of groups would definitely be desired. Perhaps the

I think in my mind Pandas is mired in the unfortunate mess that is the real world while Zarr is still fairly clean. I'm somewhat in favor of building messy things on top of clean things, rather than within them. However the corollary here is that clean things need to make sure that people can extend them externally with ease. Perhaps there are lessons to be learned from this work?
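A sketch of that attrs idea, assuming the zarr v2-style API: the stored arrays hold the physical dtype, and a hypothetical metadata key records the logical pandas dtype so a reader can restore it.

```python
import pandas as pd
import zarr

g = zarr.group()

ts = pd.date_range("2016-01-01", periods=1000, freq="min")
# physically stored as int64 nanoseconds
g.create_dataset("timestamp", data=ts.values.view("i8"), chunks=(1000,))

# hypothetical metadata convention: column -> logical / physical dtype
g.attrs["dtype_map"] = {
    "timestamp": {"logical": "datetime64[ns]", "physical": "int64"},
}

# a reader uses the mapping to restore the logical dtype on the way out
logical = g.attrs["dtype_map"]["timestamp"]["logical"]
restored = g["timestamp"][:].view(logical)
assert (restored == ts.values).all()
```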
fix normalized array path on frame storage
add pandas as dep (for now)
Any thoughts from others on this topic?
Just a note -- it looks like you don't make use of date/time types directly, but I realized that these are ambiguously described in the zarr spec (#85). I'm also not sure this makes sense for zarr proper. Adding a query language to zarr (
@jreback indeed very cool to see work in this direction. Here's a few thoughts, very open to discussion on any of these points.

I have a strong desire to keep zarr lean and mean. Partly that's because I am cautious about taking on any more code commitments, but also because I want to make sure zarr fills a well-defined niche and complements other libraries as much as possible. In particular, the original motivation for zarr was to provide a storage partner to dask.array that made the most of the parallel computing capabilities of dask. That helped to give a clear scope, i.e., it was clear what problem I was trying to solve, which was how to design a chunked compressed N-dimensional array store that enabled writing in parallel to different regions of an array. There were also some simplifying assumptions I could make which really helped to reduce complexity. Much of the original inspiration for zarr came from the carray class in bcolz, but because I was targeting a dask.array usage context, I could do away with some things, e.g., carray has several optimisations for iterating over items and for appending to an array one item at a time, which I did not need to include in zarr.

I say this because the interesting question that @mrocklin has now raised is, what would an ideal storage partner for dask.dataframe look like? I know a lot less about dask.dataframe and about out-of-core processing of tabular data in general, and so I have much less of a feel for what the problem we're trying to solve is.

I think what I would like to achieve in the short term is some set of minimal functionalities or enhancements or example code, either within zarr or pandas or dask or some sandbox package, that would allow @mrocklin to do some experimentation with dask.dataframe and zarr, and get some insight into where there could be significant benefits and synergy. I.e., where is the niche; what are the key problems not solved, or at least not solved well, by existing storage libraries? I am hoping @mrocklin and others can advise on what further minimal work would be required to enable this experimentation. I'd like to proceed slowly because, as I say, I don't understand the use cases and algorithms very well, so I don't have a good sense of where there could be a lot of bang for buck, and anyway I'm sure some experiments are required and there may be some surprises.

Having said all that, I am open to the idea that zarr might be extended to include some kind of Frame or Table class. I'd be happy to explore this, but I think there are a number of questions that need careful discussion.

One question is, what would be the public API for a zarr Frame class? I.e., what are the public methods, including features supported via getitem and setitem? Clearly the natural thing to do is stick as close as possible to pandas, but where does zarr stop and dask.dataframe take over? To give an analogy, zarr.Array only supports indexing with slices; fancy indexing is left for dask.array or some other higher-level library. So what is the minimal set of features for Frame that would fill an important niche?

Another question is, what type(s) of parallel processing is zarr.Frame anticipated to be used for, and how would zarr.Frame implement synchronization/locking within the public API to provide maximum support? To give another analogy, zarr.Array was designed for use as the data destination for chunked algorithms, and so allows writing of contiguous regions via setitem, with write locking done at the chunk level so different regions of the array can be written to in parallel. What are the major parallel use cases and synchronisation issues for a Frame class?

A third question is, technically, how to implement a Frame class, and what (if any) changes to the storage spec are required? In this PR you created a new class sibling to Array and Group. An alternative suggested by @mrocklin would be to sub-class or wrap the Group class. I tend to favour sub-classing Group as it could imply little or no changes to the storage spec, layering the functionality on the existing Group and metadata infrastructure. But there may be good architectural and/or performance/scalability reasons for doing otherwise that I am not aware of, so I don't want to advocate strongly for any particular direction at this time and would rather delay making any commitment until knowing more.

A final question is, what would zarr.Frame do that bcolz.ctable doesn't do? Are they attempting to fill the same niche, or not? If not, what are the differences in target use cases? If so, then why is it better to implement in Zarr rather than work from the ctable class? Again I don't want to stifle any discussion, but I think I'd like to understand more about what problem(s) we are trying to provide a better solution for, and then much of this would become clear. In any case it would be good to include @FrancescAlted in the conversation.

Sorry for the long-winded response, again very cool to see this work, and very interested to hear thoughts.
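To make the zarr.Array analogy concrete, here is a small sketch (zarr v2-style API) of the pattern described above: several workers writing to distinct, chunk-aligned regions of the same array, with a thread synchronizer guarding chunk-level writes. Sizes and worker counts are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import zarr

# chunked array; each task will own a distinct range of chunks
z = zarr.zeros(1_000_000, chunks=100_000, dtype="f8",
               synchronizer=zarr.ThreadSynchronizer())

def write_region(i):
    # each task writes a contiguous, chunk-aligned region via __setitem__
    start, stop = i * 100_000, (i + 1) * 100_000
    z[start:stop] = np.random.rand(stop - start)

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(write_region, range(10)))

assert z[:].shape == (1_000_000,)
```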
@alimanfoo thanks for your response. So here's some philosophy and rationale that hopefully answers some questions (and brings up some more!). This is probably TL;DR, but DO read it!

I have recently been using

To put some numbers on this: I have about 2000 files, which when expanded are 1.5B rows, say 20 columns (several TBs in total). However at any one time I only really load up a couple of columns into distributed memory (currently using about 1.5TB of memory across a few machines with 150 cores ATM). This works pretty well now. IOW I can load it up, do some analysis, then

In an ideal scenario, I would perform a pre-processing step where I read in the data, clean it (including dtype conversions), maybe re-partition, and re-store to s3 in a binary format. Some conversions are pretty cheap, so doing them after loading up from storage is cheap, and post-processing is not a big deal. IOW things like

Bottom line is any 'real' post-load transforms require a storage format to:
So I did a mini-test of some storage formats. Here are my thoughts (in no particular order), obviously repeating some known things.
So every format has various tradeoffs. Note that I am not being critical of any author here, rather the reverse. I am looking at a use case which I think is actually pretty common, but not fully addressed. So I liked the base that zarr provides. I am indifferent where to put this:
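A rough sketch of the preprocessing pass described at the top of this comment, using dask.dataframe on a distributed cluster; the scheduler address, bucket paths, column names, and partition count are purely illustrative.

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client("scheduler-address:8786")  # hypothetical cluster

# read the raw csv files from s3 (paths are illustrative)
df = dd.read_csv("s3://my-bucket/raw/*.csv")

# clean / convert dtypes up front so downstream analysis is cheap
df = df.astype({"volume": "int64"})
df["timestamp"] = dd.to_datetime(df["timestamp"])

# re-partition to a saner chunking for the cluster
df = df.repartition(npartitions=200)

# keep the cleaned columns in distributed memory for interactive analysis
df = df.persist()
```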
Thanks @jreback, very interesting. FWIW I think working on this in a
going to close this for now. parquet support is now fast & first class: https://github.com/martindurant/fastparquet/ (very recent), thanks @martindurant. I know @wesm is integrating this into

So in theory this is a good idea (prob as a separate package, maybe

I do think that this is a nice alternative format.
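For reference, a minimal sketch of the fastparquet round trip mentioned above; the file name and column names are illustrative.

```python
import pandas as pd
from fastparquet import ParquetFile, write

df = pd.DataFrame({"price": [1.0, 2.0, 3.0], "symbol": ["a", "b", "c"]})

# write a parquet file
write("quotes.parq", df)

# read back only the columns needed
pf = ParquetFile("quotes.parq")
subset = pf.to_pandas(columns=["price"])
```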
Thanks @jreback, happy to revisit in future if there is interest.
very preliminary WIP for DataFrame support directly in zarr. This adds a new meta type, Frame, alongside Array and Group. It has top-level metadata (nrows, columns, dtypes) and stores the data in individual Arrays. These can be individually compressed via different compressors; otherwise a default is used. (I would propose the default be what @mrocklin mentioned in castra, to start.)
Uses @alimanfoo's pickle filter directly. Chunking is done (but no selection yet). Currently just supporting set/getitem access for columnar access.
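For illustration, one way an object-dtype column can be stored with the pickle codec; the exact spelling depends on the zarr/numcodecs version (in current zarr v2 it is passed as object_codec), and the data here is made up.

```python
import numpy as np
import zarr
from numcodecs import Pickle

# an object-dtype "column" (mixed python objects)
col = np.array(["foo", "bar", None, {"k": 1}], dtype=object)

# object arrays need an object codec; Pickle is the simplest choice
z = zarr.array(col, chunks=(2,), object_codec=Pickle())

roundtrip = z[:]
assert roundtrip[0] == "foo" and roundtrip[3] == {"k": 1}
```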
I would propose adding .loc for setting / getting (mainly for chunking support). This preserves the pandas semantics of [] for columnar access, while .loc is for both rows/columns. This deviates a bit from zarr semantics though.

Minimal testing ATM; should be able to support all numpy dtypes that pandas supports, categorical soon. I had to add the notion of a storage dtype and a semantic one (e.g. M8[ns] is actually stored as i8, so the actual store shows that above). This is an implementation detail but I don't think a big deal. Need this, for example, for category support (where we actually have a sub-group with codes/categories).

I was trying to isolate pandas so that it would not be a requirement unless you actually tried to save a DataFrame. Currently it must be there to import zarr, but I think that could be relaxed.
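The storage-vs-semantic dtype point amounts to a cheap, zero-copy view between datetime64[ns] and int64; a small sketch of the round trip (values are illustrative):

```python
import numpy as np

ts = np.array(["2016-10-19T12:00", "2016-10-19T12:01"], dtype="M8[ns]")

# physical storage dtype: plain int64 (nanoseconds since the epoch)
stored = ts.view("i8")

# semantic dtype restored on read with a zero-copy view
restored = stored.view("M8[ns]")
assert (restored == ts).all()
```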
I extracted some things from Array and created a Base, more could be done here.

Finally I think it might be worth adding a filter which does arrow encoding for object arrays (in reality just a pair of cython functions could do it without making feather a dep, e.g. https://github.com/wesm/feather/blob/master/python/feather/interop.h) and then saving as values, offsets (like how we do for categories). Might be interesting / better performing than pickle. IDK.
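A sketch of the values/offsets layout alluded to above (pure numpy, no feather/arrow dependency): variable-length strings flattened into one byte buffer plus an offsets array, which could then be stored as two ordinary zarr arrays.

```python
import numpy as np

strings = ["apple", "banana", "", "cherry"]

# encode: concatenate utf-8 bytes and record the end offset of each element
encoded = [s.encode("utf-8") for s in strings]
values = np.frombuffer(b"".join(encoded), dtype="u1")
offsets = np.cumsum([0] + [len(b) for b in encoded])  # length n+1

# decode: slice the byte buffer back into individual strings
decoded = [values[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8")
           for i in range(len(strings))]
assert decoded == strings
```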
xref dask/dask#1599