BUG: reshape of categorical via unstack/to_panel #8704

Closed
davidrpugh opened this issue Nov 1, 2014 · 7 comments · Fixed by #8705
Labels
Bug Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@davidrpugh

I have a hierarchical pandas.DataFrame that looks as follows...

In [12]: data.tail()
Out[12]: 
                         country    currency_unit         rgdpe  \
countrycode year                                                  
ZWE         2007-01-01  Zimbabwe  Zimbabwe Dollar  45666.082031   
            2008-01-01  Zimbabwe  Zimbabwe Dollar  35789.878906   
            2009-01-01  Zimbabwe  Zimbabwe Dollar  49294.878906   
            2010-01-01  Zimbabwe  Zimbabwe Dollar  51259.152344   
            2011-01-01  Zimbabwe  Zimbabwe Dollar  55453.312500   

I would like to turn data into a pandas.Panel object. Previously I would do this using the to_panel method without issue. However, after upgrading to Pandas 0.15, this approach no longer works...

In [13]: data.to_panel()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-dfa9fb7a4e83> in <module>()
----> 1 data.to_panel()

/Users/drpugh/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in to_panel(self)
   1061                 placement=block.mgr_locs, shape=shape,
   1062                 labels=[major_labels, minor_labels],
-> 1063                 ref_items=selfsorted.columns)
   1064             new_blocks.append(newb)
   1065 

/Users/drpugh/anaconda/lib/python2.7/site-packages/pandas/core/reshape.pyc in block2d_to_blocknd(values, placement, shape, labels, ref_items)
   1206 
   1207     if mask.all():
-> 1208         pvalues = np.empty(panel_shape, dtype=values.dtype)
   1209     else:
   1210         dtype, fill_value = _maybe_promote(values.dtype)

TypeError: data type not understood

Has there been an API change? I could not find anything in the release notes to suggest that the above would no longer work.

@davidrpugh
Author

When I drop into the pdb I get the following...

TypeError: data type not understood
> /Users/drpugh/anaconda/lib/python2.7/site-packages/pandas/core/reshape.py(1208)block2d_to_blocknd()
   1207     if mask.all():
-> 1208         pvalues = np.empty(panel_shape, dtype=values.dtype)
   1209     else:

ipdb> values
[NaN, NaN, NaN, NaN, NaN, ..., Extrapolated, Extrapolated, Extrapolated, Extrapolated, Extrapolated]
Length: 10354
Categories (3, object): [Benchmark < Extrapolated < Interpolated]
ipdb> values.dtype
category
ipdb> 
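
The pdb output above suggests the failure comes from handing a pandas categorical dtype to NumPy, which has no such dtype. A minimal sketch of that interaction, assuming a recent pandas and NumPy (the exact error message differs between versions, and the small `values` array here is a toy stand-in for the real column):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the categorical column seen in the debugger.
values = pd.Categorical(["Benchmark", "Extrapolated", None])

try:
    # NumPy does not know the pandas "category" dtype.
    np.empty((2, 3), dtype=values.dtype)
    numpy_accepts_category = True
except TypeError as exc:
    numpy_accepts_category = False
    print("np.empty rejected the dtype:", exc)

# An object array, by contrast, can hold the category labels.
fallback = np.empty((2, 3), dtype=object)
```
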

@jreback
Contributor

jreback commented Nov 1, 2014

you will need to show the construction of your frame (in code) and df.dtypes

@jreback
Contributor

jreback commented Nov 1, 2014

Categorical is a new type in 0.15, but you have to use it explicitly, so it is puzzling that it shows up in this expression (you didn't mention using it).

It is probably a bug that to_panel() doesn't work with Categorical, but as I said, you have to actually convert your data to categorical in the first place.

@davidrpugh
Author

@jreback

When creating my frame I am not explicitly making use of the categorical type. I have written a function that takes some Stata .dta files as inputs and creates a pandas.Panel object out of them.

    try:
        pwt_raw_data = pd.read_stata('pwt' + str(version) + '.dta')
        dep_rates_raw_data = pd.read_stata('depreciation_rates.dta')

    except IOError:
        _download_pwt_data(base_url, version)
        pwt_raw_data = pd.read_stata('pwt' + str(version) + '.dta')
        dep_rates_raw_data = pd.read_stata('depreciation_rates.dta')

    # merge the data
    pwt_merged_data = pd.merge(pwt_raw_data, dep_rates_raw_data, how='outer',
                               on=['countrycode', 'year'])

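    # create the hierarchical index (parse year from the merged frame so the
    # values stay aligned with its rows after the outer merge)
    pwt_merged_data['year'] = pd.to_datetime(pwt_merged_data['year'], format='%Y')
    pwt_merged_data.set_index(['countrycode', 'year'], inplace=True)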

    # coerce into a panel
    pwt_panel_data = pwt_merged_data.to_panel()

    return pwt_panel_data

The dataframe called data in my original comment is actually the pwt_merged_data defined in the code snippet. Here are the dtypes for pwt_merged_data:

In [4]: data.dtypes
Out[4]: 
country            object
currency_unit      object
rgdpe             float32
rgdpo             float32
pop               float64
emp               float32
avh               float64
hc                float32
cgdpe             float32
cgdpo             float32
ck                float32
ctfp              float32
rgdpna            float64
rkna              float32
rtfpna            float32
labsh             float32
xr                float64
pl_gdpe           float32
pl_gdpo           float32
i_cig            category
i_xm             category
i_xr             category
i_outlier        category
cor_exp           float64
statcap           float64
csh_c             float32
csh_i             float32
csh_g             float32
csh_x             float32
csh_m             float32
csh_r             float32
pl_c              float64
pl_i              float64
pl_g              float64
pl_x              float32
pl_m              float32
pl_k              float32
delta_k           float32
dtype: object

As you suspected, four of the variables (i_cig, i_xm, i_xr and i_outlier) have been cast as categorical. I wonder if this assignment is being done by read_stata?

@jreback
Contributor

jreback commented Nov 2, 2014

read_stata DOES correctly return categorical data.
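
A small round-trip sketch of that behaviour, assuming a recent pandas: a value-labelled Stata column comes back as `category` by default, while `convert_categoricals=False` returns the underlying codes instead. The file name and toy frame below are illustrative only, not the PWT data.

```python
import os
import tempfile

import pandas as pd

# Toy frame with one categorical column (illustrative values only).
frame = pd.DataFrame({
    "rgdpe": [45666.08, 35789.88, 49294.88],
    "i_cig": pd.Categorical(["Benchmark", "Extrapolated", "Interpolated"]),
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.dta")
    # Categorical columns are written as Stata value labels.
    frame.to_stata(path, write_index=False)

    # Default: value labels are converted back into a categorical column.
    labelled = pd.read_stata(path)
    # Opt out: keep the raw codes rather than a categorical.
    raw_codes = pd.read_stata(path, convert_categoricals=False)

print(labelled.dtypes)
print(raw_codes.dtypes)
```
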

I am working on a fix for this kind of reshaping. Though support of mixed-type data in a Panel is very limited ATM. You are almost certainly better off keeping it in a multi-indexed frame.

Is there a reason you need a Panel?

As a workaround ATM, you can simply do: df['i_cig'] = df['i_cig'].astype('object') (for each of the category columns).
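
The workaround can be applied to every categorical column in one pass. This is only a sketch on a toy frame: the column names mirror the ones in the dtypes listing above, but the values are made up.

```python
import pandas as pd

# Toy frame standing in for the merged PWT data.
df = pd.DataFrame({
    "rgdpe": [45666.08, 35789.88],
    "i_cig": pd.Categorical(["Benchmark", "Extrapolated"]),
    "i_xm": pd.Categorical(["Extrapolated", "Extrapolated"]),
})

# Find every categorical column and cast it back to plain objects.
category_columns = df.select_dtypes(include="category").columns
for column in category_columns:
    df[column] = df[column].astype("object")

print(df.dtypes)
```
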

@jreback jreback added Bug Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 2, 2014
@jreback jreback added this to the 0.15.1 milestone Nov 2, 2014
@davidrpugh
Author

@jreback

I suppose no data set needs to be a Panel. The data set being loaded is the Penn World Tables data set, which is widely used to study economic growth across countries over time. It seems to be a natural use case for a Panel object.

@jreback
Contributor

jreback commented Nov 2, 2014

ok, we'll try to fix this. There is a 'technical' issue, so not sure we can get it in 0.15.1, but we'll see.

My point was that the manipulation tools are currently much better for multi-indexed frames than for Panels. And if the data is not too dense, a mi frame is more compact in representation. But Panels do have their uses.

Thanks for the report.
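
As a rough illustration of staying with a multi-indexed frame, the sketch below takes a per-country slice and a year-by-country table without going through a Panel. The index names follow the frame in the original report. The ZMB rows are made-up toy values and the ZWE figures are rounded from the output above.

```python
import pandas as pd

# Two-level (countrycode, year) index like the one in the report.
index = pd.MultiIndex.from_product(
    [["ZMB", "ZWE"], pd.to_datetime(["2010", "2011"], format="%Y")],
    names=["countrycode", "year"],
)
data = pd.DataFrame({"rgdpe": [1.0, 2.0, 51259.15, 55453.31]}, index=index)

# All years for one country (roughly one "item" of a Panel).
zwe = data.xs("ZWE", level="countrycode")

# One variable as a year-by-country table.
wide = data["rgdpe"].unstack("countrycode")

print(zwe)
print(wide)
```
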

@jreback jreback changed the title Possible bug in the to_panel method of pandas.DataFrame BUG: reshape of categorical via unstack/to_panel Nov 2, 2014