make DataFrame.to_dict(orient='list') output native python elements #9108

Closed

adamgreenhall
Contributor

No description provided.

@shoyer
Member

shoyer commented Dec 19, 2014

@adamgreenhall Could you please explain the motivation behind this change in a little more detail?

This will also need tests.

@adamgreenhall
Contributor Author

This change is about exporting a DataFrame as one part of a larger JSON document (I know to_json exists, but I want to add other things to the document as well). Currently, the lists created by DataFrame.to_dict(orient='list') are made up of numpy scalars rather than native Python objects. This causes json.dumps to raise a TypeError if the DataFrame has a bool column, because its elements are numpy.bool_ (np.float64 and np.int64 work). See the example of the issue below.

This change fixes the immediate symptom by converting all list elements to native Python types. Perhaps the underlying issue is really that numpy.bool_ values are not JSON serializable, but I wasn't sure how to address that.

In [1]: import pandas as pd

In [2]: import json

In [3]: df = pd.DataFrame({'a': [1.1, 1.2, 1.3], 'b': [2, 3, 4], 'c': [True, False, True]})

In [4]: print df.dtypes
a    float64
b      int64
c       bool
dtype: object

In [5]: blob = dict(data=df.to_dict(orient='list'), description='this is some data')

In [6]: print type(blob['data']['c'][0])
<type 'numpy.bool_'>

In [7]: print json.dumps(blob)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-3bd1feb8b602> in <module>()
----> 1 print json.dumps(blob)

./python2.7/json/__init__.pyc in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, encoding, default, sort_keys, **kw)
    241         cls is None and indent is None and separators is None and
    242         encoding == 'utf-8' and default is None and not sort_keys and not kw):
--> 243         return _default_encoder.encode(obj)
    244     if cls is None:
    245         cls = JSONEncoder

./python2.7/json/encoder.pyc in encode(self, o)
    205         # exceptions aren't as detailed.  The list call should be roughly
    206         # equivalent to the PySequence_Fast that ''.join() would do.
--> 207         chunks = self.iterencode(o, _one_shot=True)
    208         if not isinstance(chunks, (list, tuple)):
    209             chunks = list(chunks)

./python2.7/json/encoder.pyc in iterencode(self, o, _one_shot)
    268                 self.key_separator, self.item_separator, self.sort_keys,
    269                 self.skipkeys, _one_shot)
--> 270         return _iterencode(o, 0)
    271
    272 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

./python2.7/json/encoder.pyc in default(self, o)
    182
    183         """
--> 184         raise TypeError(repr(o) + " is not JSON serializable")
    185
    186     def encode(self, o):

TypeError: True is not JSON serializable
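
One way to work around the error above without touching to_dict is to give json.dumps a `default` hook that unwraps scalars. The following is only a minimal sketch: `to_native` and `FakeBool` are hypothetical names introduced for illustration, and `FakeBool` merely stands in for numpy.bool_ (which exposes the same `.item()` method) so that the snippet runs without numpy installed.

```python
import json


def to_native(obj):
    """Fallback for json.dumps: unwrap scalar-like objects exposing .item()."""
    item = getattr(obj, "item", None)
    if callable(item):
        return item()
    # Mirror the standard library's behaviour for anything else.
    raise TypeError(repr(obj) + " is not JSON serializable")


class FakeBool(object):
    """Hypothetical stand-in for numpy.bool_, used only for this demo."""

    def __init__(self, value):
        self._value = value

    def item(self):
        # numpy scalars return the equivalent native Python object here.
        return self._value


blob = {
    "data": {"c": [FakeBool(True), FakeBool(False), FakeBool(True)]},
    "description": "this is some data",
}

# json.dumps calls to_native only for objects it cannot serialize itself.
encoded = json.dumps(blob, default=to_native, sort_keys=True)
print(encoded)
```

The hook is consulted only for objects the encoder does not already understand, so native floats, ints and strings are unaffected.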

@cpcloud
Member

cpcloud commented Dec 19, 2014

If you don't need to do anything special with your frames, you can use pandas.io.json.dumps:

In [8]: from pandas.io.json import dumps

In [9]: dumps(['a', 1, ['a', 2, [3]], {'frame': pd.DataFrame(np.random.rand(2, 2))}])
Out[9]: '["a",1,["a",2,[3]],{"frame":{"0":{"0":0.369918913,"1":0.4624219221},"1":{"0":0.9272272068,"1":0.7450566582}}}]'

@cpcloud
Member

cpcloud commented Dec 19, 2014

to_dict will be much less efficient than to_json, because to_json loops in C whereas to_dict loops in Python.

@cpcloud
Member

cpcloud commented Dec 19, 2014

dumps could be exposed in the top-level API, though I haven't thought about what additional work that might require.

@adamgreenhall
Contributor Author

I like the idea of using pandas.io.json.dumps, but I would also like to keep the orient='list' layout. It sounds like that would require altering dumps.

@cpcloud
Member

cpcloud commented Dec 19, 2014

how comfortable are you with C?

@cpcloud
Member

cpcloud commented Dec 19, 2014

you can also do this with dumps right now:

In [16]: import pandas.util.testing as tm

In [17]: from pandas.io.json import dumps

In [18]: df = tm.makeTimeDataFrame().reset_index().rename(columns={'index': 'date'}).head(5)

In [19]: df
Out[19]:
        date         A         B         C         D
0 2000-01-03  0.229303 -1.394965  0.156741  1.233180
1 2000-01-04 -0.611819  0.616925 -0.063782  0.455711
2 2000-01-05  2.387436  0.552139 -0.000982 -0.478749
3 2000-01-06 -0.694529  0.472475  0.924082 -1.544734
4 2000-01-07 -0.794539 -0.597034 -1.734419 -0.104073

In [21]: dumps({k: v for k, v in df.iteritems()},orient='values')
Out[21]: '{"date":[946857600000,946944000000,947030400000,947116800000,947203200000],"A":[0.2293025925,-0.611819198,2.387435883,-0.6945293878,-0.7945391792],"C":[0.1567408883,-0.0637816997,-0.0009824659,0.9240820459,-1.7344187482],"B":[-1.3949645486,0.6169250907,0.5521388533,0.4724746145,-0.5970341471],"D":[1.2331802175,0.4557113376,-0.4787493278,-1.5447336653,-0.1040725235]}'
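
For readers whose pandas version does not provide pandas.io.json.dumps, a similar column-to-list layout can be sketched with the standard json module by calling tolist() on each column. This is an illustrative alternative under stated assumptions, not the approach taken in this thread: it assumes only bool, int and float columns (datetime columns such as `date` above would need separate handling) and reuses the small frame from the original report.

```python
import json

import pandas as pd

df = pd.DataFrame({'a': [1.1, 1.2, 1.3], 'b': [2, 3, 4], 'c': [True, False, True]})

# Series.tolist() returns native Python scalars for bool, int and float
# columns, so the result can be embedded in a larger document.
data = {name: df[name].tolist() for name in df.columns}

blob = dict(data=data, description='this is some data')
encoded = json.dumps(blob)
print(encoded)
```

As noted earlier in the thread, this loops in Python and will be slower than the C-backed serializer on large frames.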

@adamgreenhall
Contributor Author

@cpcloud - that works for me - thanks!
