BUG: reshape of categorical via unstack/to_panel #8704

davidrpugh · 2014-11-01T21:01:31Z

I have a hierarchical pandas.DataFrame that looks as follows...

In [12]: data.tail()
Out[12]: 
                         country    currency_unit         rgdpe  \
countrycode year                                                  
ZWE         2007-01-01  Zimbabwe  Zimbabwe Dollar  45666.082031   
            2008-01-01  Zimbabwe  Zimbabwe Dollar  35789.878906   
            2009-01-01  Zimbabwe  Zimbabwe Dollar  49294.878906   
            2010-01-01  Zimbabwe  Zimbabwe Dollar  51259.152344   
            2011-01-01  Zimbabwe  Zimbabwe Dollar  55453.312500

I would like to turn data into a pandas.Panel object. Previously I would do this using the to_panel method without issue. However, after upgrading to Pandas 0.15, this approach no longer works...

In [13]: data.to_panel()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-dfa9fb7a4e83> in <module>()
----> 1 data.to_panel()

/Users/drpugh/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in to_panel(self)
   1061                 placement=block.mgr_locs, shape=shape,
   1062                 labels=[major_labels, minor_labels],
-> 1063                 ref_items=selfsorted.columns)
   1064             new_blocks.append(newb)
   1065 

/Users/drpugh/anaconda/lib/python2.7/site-packages/pandas/core/reshape.pyc in block2d_to_blocknd(values, placement, shape, labels, ref_items)
   1206 
   1207     if mask.all():
-> 1208         pvalues = np.empty(panel_shape, dtype=values.dtype)
   1209     else:
   1210         dtype, fill_value = _maybe_promote(values.dtype)

TypeError: data type not understood

Has there been an API change? I could not find anything in the release notes to suggest that the above would no longer work.

davidrpugh · 2014-11-01T21:14:27Z

When I drop into the pdb I get the following...

TypeError: data type not understood
> /Users/drpugh/anaconda/lib/python2.7/site-packages/pandas/core/reshape.py(1208)block2d_to_blocknd()
   1207     if mask.all():
-> 1208         pvalues = np.empty(panel_shape, dtype=values.dtype)
   1209     else:

ipdb> values
[NaN, NaN, NaN, NaN, NaN, ..., Extrapolated, Extrapolated, Extrapolated, Extrapolated, Extrapolated]
Length: 10354
Categories (3, object): [Benchmark < Extrapolated < Interpolated]
ipdb> values.dtype
category
ipdb>
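
The pdb session above points at the allocation call: NumPy has no categorical dtype of its own, so handing it the pandas `category` dtype fails. A minimal sketch of that failure, assuming a recent pandas and NumPy (the exact error message differs between versions, and the toy `values` below is a hypothetical stand-in for the real column):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the categorical columns in the report.
values = pd.Categorical(
    ["Extrapolated", None, "Benchmark"],
    categories=["Benchmark", "Extrapolated", "Interpolated"],
)

try:
    # NumPy cannot interpret the pandas categorical dtype,
    # so the allocation is rejected before any memory is reserved.
    np.empty((2, 3), dtype=values.dtype)
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)
```

Any reshaping path that allocates its output with `np.empty(..., dtype=values.dtype)` would hit the same problem for a categorical block.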

jreback · 2014-11-01T21:55:12Z

you will need to show the construction of your frame (in code) and df.dtypes

jreback · 2014-11-01T22:48:21Z

Categorical is a new type in 0.15 but you have to explicitly use it so puzzling why you have it in this expression (you didn't mention using it)

it is prob a bug that to_panel() doesn't work with Categorical - but as I said you have to actually convert your data in the first place

davidrpugh · 2014-11-02T01:50:00Z

@jreback

When creating my frame I am not explicitly making use of the categorical type. I have written a function that takes some Stata .dta files as inputs and creates a pandas.Panel object out of them.

    try:
        pwt_raw_data = pd.read_stata('pwt' + str(version) + '.dta')
        dep_rates_raw_data = pd.read_stata('depreciation_rates.dta')

    except IOError:
        _download_pwt_data(base_url, version)
        pwt_raw_data = pd.read_stata('pwt' + str(version) + '.dta')
        dep_rates_raw_data = pd.read_stata('depreciation_rates.dta')

    # merge the data
    pwt_merged_data = pd.merge(pwt_raw_data, dep_rates_raw_data, how='outer',
                               on=['countrycode', 'year'])

    # create the hierarchical index
    pwt_merged_data.year = pd.to_datetime(pwt_raw_data.year, format='%Y')
    pwt_merged_data.set_index(['countrycode', 'year'], inplace=True)

    # coerce into a panel
    pwt_panel_data = pwt_merged_data.to_panel()

    return pwt_panel_data

The dataframe called data in my original comment is actually the pwt_merged_data defined in the code snippet. Here are the dtypes for pwt_merged_data:

In [4]: data.dtypes
Out[4]: 
country            object
currency_unit      object
rgdpe             float32
rgdpo             float32
pop               float64
emp               float32
avh               float64
hc                float32
cgdpe             float32
cgdpo             float32
ck                float32
ctfp              float32
rgdpna            float64
rkna              float32
rtfpna            float32
labsh             float32
xr                float64
pl_gdpe           float32
pl_gdpo           float32
i_cig            category
i_xm             category
i_xr             category
i_outlier        category
cor_exp           float64
statcap           float64
csh_c             float32
csh_i             float32
csh_g             float32
csh_x             float32
csh_m             float32
csh_r             float32
pl_c              float64
pl_i              float64
pl_g              float64
pl_x              float32
pl_m              float32
pl_k              float32
delta_k           float32
dtype: object

As you suspected, four of the variables (i_cig, i_xm, i_xr and i_outlier) have been cast as categorical. I wonder if this assignment is being done by read_stata?

jreback · 2014-11-02T01:56:45Z

read_stata DOES correctly return categorical data.
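
For anyone who wants to check this locally, here is a small round-trip sketch. It assumes a pandas version whose `to_stata` can write categorical columns (the toy frame and file name are made up for illustration), and it uses the `convert_categoricals` argument of `read_stata` to show the difference between labelled values and raw codes:

```python
import os
import tempfile

import pandas as pd

# Toy frame with one value-labelled column; names are illustrative only.
frame = pd.DataFrame({
    "rgdpe": [1.0, 2.0, 3.0],
    "i_cig": pd.Categorical(["Benchmark", "Extrapolated", "Benchmark"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dta")
    frame.to_stata(path, write_index=False)

    # Default behaviour: Stata value labels come back as a category column.
    labelled = pd.read_stata(path)

    # Opting out: the column comes back as the underlying integer codes.
    raw_codes = pd.read_stata(path, convert_categoricals=False)

print(labelled.dtypes)
print(raw_codes.dtypes)
```

Passing `convert_categoricals=False` avoids the categorical dtype altogether, at the cost of losing the labels.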

I am working on a fix for this kind of reshaping. Though support of mixed-type data in a Panel is very limited ATM. You are almost certainly better off keeping it in a multi-indexed frame.

Is there a reason you need a Panel?

As a workaround ATM, you can simply do: df['i_cig'] = df['i_cig'].astype('object') (for each of the category columns).
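
The same workaround can be applied to every category column in one pass. This is only a sketch: the helper name `categoricals_to_object` and the toy frame are hypothetical, not part of pandas.

```python
import pandas as pd


def categoricals_to_object(df):
    """Return a copy of df with every category column cast to object."""
    out = df.copy()
    # select_dtypes picks out the columns whose dtype is 'category'.
    cat_cols = out.select_dtypes(include=["category"]).columns
    for col in cat_cols:
        out[col] = out[col].astype("object")
    return out


# Toy frame standing in for the merged Stata data.
df = pd.DataFrame({
    "i_cig": pd.Categorical(["Benchmark", None, "Extrapolated"]),
    "rgdpe": [1.0, 2.0, 3.0],
})

converted = categoricals_to_object(df)
print(converted.dtypes)
```

The original frame is left untouched, so the categorical columns are still available if they are needed later.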

davidrpugh · 2014-11-02T02:15:35Z

@jreback

I suppose no data set needs to be a Panel. The data set being loaded is the Penn World Tables data set, which is widely used to study economic growth across countries over time. It seems to be a natural use case for a Panel object.

jreback · 2014-11-02T02:21:51Z

ok, we'll try to fix this. There is a 'technical' issue so not sure we can get it in 0.15.1, but we'll see.

My point was that the manipulation tools are currently much better for multi-indexed frames than for Panels. And if the data is not too dense, a mi frame is more compact in representation. But Panels do have their uses.
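
As a rough illustration of what working with a multi-indexed frame looks like, here is a sketch on a tiny made-up frame that borrows the countrycode/year index from the report (the numbers are invented):

```python
import pandas as pd

# Two countries by two years, indexed the same way as the frame in the report.
index = pd.MultiIndex.from_product(
    [["USA", "ZWE"], pd.to_datetime(["2010", "2011"], format="%Y")],
    names=["countrycode", "year"],
)
frame = pd.DataFrame(
    {"rgdpe": [1.0, 2.0, 3.0, 4.0], "pop": [10.0, 11.0, 12.0, 13.0]},
    index=index,
)

# All years for a single country (similar to slicing one axis of a Panel).
zwe = frame.xs("ZWE", level="countrycode")

# One variable laid out as a year-by-country table.
rgdpe_wide = frame["rgdpe"].unstack("countrycode")

# Per-country averages over time.
mean_by_country = frame.groupby(level="countrycode").mean()

print(rgdpe_wide)
```

Cross-sections, reshaping and group-wise statistics all work directly on the indexed frame, without converting to a Panel first.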

Thanks for the report.

jreback mentioned this issue Nov 2, 2014

ENH/BUG: support Categorical in to_panel reshaping (GH8704) #8705

Merged

jreback added the Bug, Categorical and Reshaping labels Nov 2, 2014

jreback added this to the 0.15.1 milestone Nov 2, 2014

jreback changed the title ~~Possible bug in the to_panel method of pandas.DataFrame~~ BUG: reshape of categorical via unstack/to_panel Nov 2, 2014

jreback closed this as completed in #8705 Nov 2, 2014
