
wide table support #10243


Closed · wants to merge 2 commits into from

Conversation

hhuuggoo

addresses #6245

@jreback this isn't quite working yet for older arrays - the reason is because I have to address values_block_kind_0 and things like that. Currently what I'm doing is creating a group under self.group of the Table class, named 'attr', which has all the attributes as CArrays. However the issue is that values_block_kind_0 is actually written as an attr of the table dataset, not the group. I can definitely setup the code to read that stuff for backwards compatability, but is it ok if I put all carray attrs as datasets under the attr group, or is there some namespace collisioning that I'm not aware of?

@hhuuggoo
Author

Actually I just fixed the issue, but I still need to know if the approach is sound (also I need to write tests).

@jreback
Contributor

jreback commented May 31, 2015

don't pickle things - just store as a direct carray

@jreback
Contributor

jreback commented May 31, 2015

don't change attribute names - leave them the same

@hhuuggoo
Author

Wait, I'm confused - I didn't think I changed any attribute names. What I meant was that some attrs are stored on the group and some on the table, but I need a place to put the CArrays. I can create an attr group under the group housing the Table, but the attrs on the table are on a table dataset; there isn't a place to keep those separately, so I was just going to put them in the same attr group.

Also, I don't know how to store it as a direct carray. non_index_axes is something like this:

[(1, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])]  

@jreback
Contributor

jreback commented Jun 1, 2015

you need to use a CArray - that's the entire point of this. These are all just lists, so you can store them directly in a CArray. I don't recall exactly the format of non_index_axes; I think it was a pair of (axes, list-of-values). You can then store this as a CArray for the values and a new attribute for the list of axes (as this can actually be multiple axes for higher ndims).
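For illustration, a minimal PyTables sketch of that split - the node and attribute names here (`non_index_axes_values`, `non_index_axes_axis`) are made up for the example, not what the patch actually uses:

```python
import numpy as np
import tables

# non_index_axes is something like [(1, [0, 1, ..., 9])]: store the
# (potentially huge) list of values as a CArray child node, and keep the
# small axis number as an ordinary HDF5 attribute.
with tables.open_file("store.h5", mode="a") as handle:
    group = handle.create_group("/", "df")

    axis, values = 1, list(range(10))

    handle.create_carray(group, "non_index_axes_values",
                         obj=np.asarray(values))   # no 64 KiB attribute limit
    group._v_attrs.non_index_axes_axis = axis      # small, stays an attribute
```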

@hhuuggoo
Author

hhuuggoo commented Jun 1, 2015

OK I think I get that

what about values_block_kind_0 - are you ok with storing that under
/<datasetname>/attrs/values_block_kind_0?

Is there a particular reason why values_block_kind_0 is stored on the table, rather than on the group housing the table?

@jreback
Contributor

jreback commented Jun 1, 2015

these attributes need to be stored on the actual table node (this is actually not a table, but a Fixed store, FYI), as they are private to it (and not public to the group). You shouldn't be creating a new group (attrs) at all, as that is not back-compat.

@hhuuggoo
Author

hhuuggoo commented Jun 1, 2015

In this case, values_block_kind_0 is a list of column names, and so if we only make the change to non_index_axes, we won't be able to support wide tables, as we'll just push the 64k limit problem to values_block_kind_x

If you don't want an attrs group created, where do you want the carrays to be created? I have it set up now so that the carray-based attrs are created inside the attrs group, but for backwards compatibility we read them from the old location if the carray does not exist.
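An illustrative sketch of the fallback read being described (the `attrs` group and `non_index_axes` names are placeholders, not the actual patch):

```python
def read_non_index_axes(group):
    """Prefer the new CArray location; fall back to the legacy node attribute.
    `group` is the tables.Group housing the stored object."""
    # new layout: values stored as a CArray under an 'attrs' child group
    if "attrs" in group:
        attr_group = group._f_get_child("attrs")
        if "non_index_axes" in attr_group:
            return attr_group._f_get_child("non_index_axes").read().tolist()
    # old layout: the whole structure stored as a node attribute (pre-change files)
    return getattr(group._v_attrs, "non_index_axes", None)
```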

@jreback
Contributor

jreback commented Jun 1, 2015

no no no.

the point of a Carray is to store this data. Let me be more clear: you simply store the CArray as a value (not as an attribute) - maybe I was not clear on that.

@hhuuggoo
Author

hhuuggoo commented Jun 1, 2015

I understand that the carray is a value - under which path do you want it to live?

@jreback
Contributor

jreback commented Jun 1, 2015

I think it should live UNDER the Table (so it's again hidden); sorry, you do need to create this group.

@hhuuggoo
Author

hhuuggoo commented Jun 1, 2015

I'm sorry - I understand the confusion now. I thought the table was a dataset - it is not; it is a group containing datasets.

@jreback
Contributor

jreback commented Jun 1, 2015

yes, you are free to put additional attributes there (e.g. the axis and so on, whatever). You just need to be able to read this in a back-compat way.

@jreback
Contributor

jreback commented Jul 28, 2015

@hhuuggoo want to update this?

@hhuuggoo
Author

I do - but I won't be able to in the short term - do you want to close and I can re-submit?

@hhuuggoo
Author

@jreback I re-did this, doing what you suggested:

  • some of this can be cleaned up a bit, but I wanted to make sure you agreed with the approach before I did more
  • the crappy thing is that DataCol also needs to do this, because things like values_block_0_kind overflow otherwise. That means they need access to the underlying _handle to create CArrays; I have some hacks in there right now to accomplish that (see the sketch below)
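Roughly what those hacks work around, as an illustrative sketch only - the class and names below are invented for the example and are not pandas internals: the column object needs a reference to the open PyTables file so it can write its oversized metadata as a CArray.

```python
import numpy as np

class ColumnMeta(object):
    """Hypothetical stand-in for DataCol: given the open PyTables handle,
    it can write a long list like values_block_0_kind as a CArray child
    node instead of overflowing a 64 KiB node attribute."""

    def __init__(self, name, kinds, handle, group):
        self.name = name
        self.kinds = kinds      # e.g. a very long list of column names
        self._handle = handle   # the open tables.File, passed in explicitly
        self._group = group     # the group housing the table's datasets

    def write_kinds(self):
        # PyTables stores fixed-width bytes, so encode unicode names first
        data = np.asarray([k.encode("utf-8") for k in self.kinds])
        self._handle.create_carray(self._group, self.name + "_kind", obj=data)
```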

@@ -11,7 +11,8 @@
import itertools
import warnings
import os

from six import string_types
Contributor

use compat.string_types

@jreback
Contributor

jreback commented Aug 22, 2015

@hhuuggoo approach looks just fine. A couple of minor corrections. Also pls generate a couple of test hdf files using 0.16.2 for back-compat testing; they go in the pandas/io/tests/data/legacy_hdf dir.

@jreback jreback added Enhancement IO HDF5 read_hdf, HDFStore labels Aug 22, 2015
@hhuuggoo
Author

cool thanks! I'll finish and update tomorrow

@hhuuggoo
Author

hhuuggoo commented Oct 4, 2015

@jreback so these tests are failing because on a second write we attempt to write/create the carrays I created again. I can mirror the semantics of the hdf5 attrs, where we basically overwrite the old value - but I wanted to verify that this is the correct thing to do?

thanks

@jreback
Contributor

jreback commented Oct 4, 2015

these are normally written once only
if they are there you can remove and rewrite though
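i.e. something along these lines - an illustrative sketch only, with placeholder names:

```python
import numpy as np

def write_carray_value(handle, group, name, values):
    """Normally written once; on a re-write, drop the old node first so the
    new values replace it (mirroring how HDF5 attrs overwrite in place).
    `handle` is the open tables.File, `group` the node housing the table."""
    if name in group:
        handle.remove_node(group, name)
    # values assumed to be something PyTables can store directly (ints, bytes, ...)
    handle.create_carray(group, name, obj=np.asarray(values))
```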

@jreback
Contributor

jreback commented Oct 25, 2015

@hhuuggoo can you rebase / update

@jreback
Contributor

jreback commented Nov 25, 2015

closing, but if you want to update, pls reopen

@hhuuggoo so close!!!

@jreback jreback closed this Nov 25, 2015
@MJuddBooth

@hhuuggoo & @jreback, I've been watching as I also have need of this. I've made the suggested changes. Currently two tests fail, test_append_frame_column_oriented and test_column_multiindex. It looks like the first errors on dtype=object and the second cannot match the existing table structure. I presume the former error is a limitation of carray (but I haven't looked) and the latter appears to be the tolist() extraction of the carray, which needs to generate tuples for a MultiIndex (see the sketch below).
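Presumably the MultiIndex part needs something along these lines when reading the CArray back - a sketch under that assumption, not the actual patch:

```python
import numpy as np
import pandas as pd

def index_from_carray_values(values, nlevels):
    """A CArray reads back as an array of rows; rebuilding a MultiIndex
    needs those rows converted to tuples rather than lists."""
    items = np.asarray(values).tolist()
    if nlevels > 1:
        return pd.MultiIndex.from_tuples([tuple(row) for row in items])
    return pd.Index(items)
```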

@jreback
Contributor

jreback commented Dec 6, 2015

why don't you put up your changes and I'll have a look

@MJuddBooth

I submitted a pull request, legacy test included. The code still needs some cleanup and, as I noted in the pull request, one test still fails. Also, you suggested in your comments above that one could get _handle from the table, but I don't see any way to find that via self.table, so for now I have the handle property as a placeholder.
