
ujson __json__ attribute logic #12739


Closed

Conversation

TurnaevEvgeny

  • ./test_fast.sh works fine
    Ran 8463 tests in 127.338s
    OK (SKIP=592)
  • passes git diff upstream/master | flake8 --diff

A port of a ujson 1.35 feature: an object can define a __json__ attribute for custom serialization. See
ultrajson/ultrajson@a8f0f0f

class ujson_as_is(object):
    # wraps a pre-encoded JSON string; the encoder splices it in verbatim
    def __init__(self, value):
        self.value = value
    def __json__(self):
        return self.value

df = pd.DataFrame([{"foo": ujson_as_is('{"parrot": 42.0}')}])
df.to_json(orient = 'records')

result [{"foo":{"parrot": 42.0}}]

@jreback
Contributor

jreback commented Mar 30, 2016

Could you add some tests specifically for this?

How standard is the dunder __json__ tag? Do other libraries use it? Can you give some examples?

@jreback jreback added Enhancement IO JSON read_json, to_json, json_normalize labels Mar 30, 2016
@TurnaevEvgeny
Author

Added some tests. I don't think __json__ is used anywhere except ujson.
I am building a RESTful API and use pandas DataFrames heavily, so I would like to skip Python loops as much as possible.
Primary use cases for me:

  • put a raw JSON dump of another DataFrame into a DataFrame, so that it is not double-encoded.
  • create nested JSON from a DataFrame, skipping Python-level looping.
  • different parts of a nested structure might require different decimal places & datetime formatting.

I can write a small section in io.html#json with an example of how I am going to use it in a REST API, like:

df_long = pd.DataFrame(..., index=[1, 1, 1, 2, 2, 2])
df_result = pd.DataFrame(columns=["result", "rest_api_code"], index=[1, 2, 3])

df_result["rest_api_code"] = "not_found"
df_result.loc[df_long.index.unique(), "rest_api_code"] = "ok"

for idx, grp in df_long.groupby(level=0):
    df_result.loc[idx, "result"] = grp.to_json(orient='records')

df_result.to_json(orient='records')

I am actually using multiple levels of such DataFrame JSON nesting.

Do you think it's worth noting in the docs? It's not a typical use case and not related to core functionality.

@jreback
Contributor

jreback commented Mar 30, 2016

@TurnaevEvgeny I think what people want is an easy way to do this (then we would expose pd.to_json(...) as a top-level general serializing function):

In [2]: pd.io.json.to_json(None, df)
Out[2]: '{"A":{"0":1,"1":2,"2":3},"B":{"0":"a","1":"b","2":"c"}}'

In [3]: pd.io.json.to_json(None, {'foo' : df})
NotImplementedError: 'obj' should be a Series or a DataFrame

IIRC this was a very easy fix
cc @cpcloud

@jreback
Contributor

jreback commented Mar 30, 2016

xref to #9166

@TurnaevEvgeny
Author

@jreback Sorry, I didn't get why #9166 is referenced; it seems irrelevant. I also didn't completely follow your reply that people want pd.to_json(), although it aligns with my point that __json__ has nothing to do with DataFrame serialization in general. It's just a way to store pre-dumped JSON in a DataFrame and then output it without double/triple encoding, plus ujson compatibility. So what's the status of this pull? I can add docs if needed.

@jreback
Contributor

jreback commented Mar 31, 2016

Serializing nested structures that include pandas objects can almost be done now.

I would rather fix to_json than add a dunder method.

@TurnaevEvgeny
Author

I would argue that __json__ helps in different scenarios than serializing a DataFrame with nested structures.
With __json__ one can store JSON dumps in an arbitrary format in a DataFrame, e.g. df.loc[...] = ujson_as_is(df.to_json(orient='records')), each with its own orientation and float formatting, and store anything: not only another DataFrame dump but any arbitrary JSON dump. It also helps to store instances of objects that know how to dump themselves and keep a cached dump representation at hand.

@jreback
Contributor

jreback commented Mar 31, 2016

Can you post a short compelling example (with output), as if you were writing docs?

@TurnaevEvgeny
Author

In [16]: class ujson_as_is(object):
   ....:     def __init__(self, value):
   ....:         self.value = value
   ....:     def __json__(self):
   ....:         return self.value
   ....:     __repr__ = __json__
   ....:

In [17]: df_company_info = pd.DataFrame([{'name': 'Google', 'est': '1998-09-04'},
   ....:                                 {'name': 'Microsoft', 'est': '1975-04-04'},
   ....:                                 {'name': 'Apple', 'est': '1976-04-01'}],
   ....:                                 index=['goog', 'msft', 'aapl'])

In [18]: df_company_info
Out[18]:
             est       name
goog  1998-09-04     Google
msft  1975-04-04  Microsoft
aapl  1976-04-01      Apple

In [19]: names = np.random.choice(['aapl', 'goog', 'msft'], 10)

In [20]: dates = pd.date_range('1/1/2000', periods=10, freq='D')

In [21]: df = pd.DataFrame({'date': dates, 'price': np.random.random(10)}, index = names)

In [22]: df
Out[22]:
           date     price
msft 2000-01-01  0.280379
aapl 2000-01-02  0.120819
aapl 2000-01-03  0.471827
aapl 2000-01-04  0.789162
aapl 2000-01-05  0.649434
msft 2000-01-06  0.858836
goog 2000-01-07  0.440876
aapl 2000-01-08  0.523965
aapl 2000-01-09  0.860230
msft 2000-01-10  0.215722

In [23]: for company,grp in df.groupby(level=0):
   ....:     df_company_info.loc[company, "stock_history"] = ujson_as_is(grp.to_json(orient='records', double_precision = 3, date_format = 'iso'))
   ....:

In [24]: df_company_info
Out[24]:
             est       name                                      stock_history
goog  1998-09-04     Google  [{"date":"2000-01-07T00:00:00.000Z","price":0....
msft  1975-04-04  Microsoft  [{"date":"2000-01-01T00:00:00.000Z","price":0....
aapl  1976-04-01      Apple  [{"date":"2000-01-02T00:00:00.000Z","price":0....

In [25]: df_company_info.to_json(orient='records')
Out[25]: '[{"est":"1998-09-04","name":"Google","stock_history":[{"date":"2000-01-07T00:00:00.000Z","price":0.441}]},{"est":"1975-04-04","name":"Microsoft","stock_history":[{"date":"2000-01-01T00:00:00.000Z","price":0.28},{"date":"2000-01-06T00:00:00.000Z","price":0.859},{"date":"2000-01-10T00:00:00.000Z","price":0.216}]},{"est":"1976-04-01","name":"Apple","stock_history":[{"date":"2000-01-02T00:00:00.000Z","price":0.121},{"date":"2000-01-03T00:00:00.000Z","price":0.472},{"date":"2000-01-04T00:00:00.000Z","price":0.789},{"date":"2000-01-05T00:00:00.000Z","price":0.649},{"date":"2000-01-08T00:00:00.000Z","price":0.524},{"date":"2000-01-09T00:00:00.000Z","price":0.86}]}]'


@TurnaevEvgeny
Author

Is there anything further I can help with?

@jreback
Contributor

jreback commented Mar 31, 2016

I am still confused about what problem this solves.
In the example above, why wouldn't you simply df.to_json(...)?

@TurnaevEvgeny
Author

In the example above, if I do the same without __json__ support in pandas, the output will be

Out[194]: '[{"est":"1998-09-04","name":"Google","stock_history":{"value":"[{\\"date\\":\\"2000-01-03T00:00:00.000Z\\",\\"price\\":0.294},{\\"date\\":\\"2000-01-06T00:00:00.000Z\\",\\"price\\":0.472}]"}},{"est":"1975-04-04","name":"Microsoft","stock_history":{"value":"[{\\"date\\":\\"2000-01-01T00:00:00.000Z\\",\\"price\\":0.489},{\\"date\\":\\"2000-01-04T00:00:00.000Z\\",\\"price\\":0.739},{\\"date\\":\\"2000-01-08T00:00:00.000Z\\",\\"price\\":0.057}]"}},{"est":"1976-04-01","name":"Microsoft","stock_history":{"value":"[{\\"date\\":\\"2000-01-02T00:00:00.000Z\\",\\"price\\":0.718},{\\"date\\":\\"2000-01-05T00:00:00.000Z\\",\\"price\\":0.074},{\\"date\\":\\"2000-01-07T00:00:00.000Z\\",\\"price\\":0.517},{\\"date\\":\\"2000-01-09T00:00:00.000Z\\",\\"price\\":0.902},{\\"date\\":\\"2000-01-10T00:00:00.000Z\\",\\"price\\":0.245}]"}}]'

If I decode this back:

tmp[0]['stock_history']
Out[10]: {u'value': u'[{"date":"2000-01-03T00:00:00.000Z","price":0.294},{"date":"2000-01-06T00:00:00.000Z","price":0.472}]'}

The stock_history column becomes double-JSON-encoded, like json.dumps(json.dumps({'foo': 'bar'})). It also gets put into an artificial {'value': stock_history_str} dict that didn't exist in the DataFrame.
to_dict() partially solves this in the example above, but to_dict() doesn't help with different representation formats. Also, I bet to_dict() will be slower because of Python object creation, while to_json() can convert pretty wide frames to a string, skipping intermediate Python object creation. Besides, with __json__ it's just handier to build the final JSON dump gradually.
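
As a minimal illustration with only the standard library, this is what double encoding looks like:

import json

payload = {"parrot": 42.0}
once = json.dumps(payload)     # '{"parrot": 42.0}' -- a JSON object
twice = json.dumps(once)       # '"{\\"parrot\\": 42.0}"' -- a JSON string
json.loads(twice) == once      # True: one loads only recovers the string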

@TurnaevEvgeny
Author

Does that explain the problem & motivation?

@jreback
Contributor

jreback commented Apr 1, 2016

I still don't see any utility in doing that. You can simply construct the frame as you like, then serialize the entire thing.

In [21]: df_company_info['history'] = df.groupby(level=0).apply(lambda x: x.to_dict())

In [22]: df_company_info
Out[22]: 
             est       name                                            history
goog  1998-09-04     Google  {u'date': {u'goog': 2000-01-10 00:00:00}, u'pr...
msft  1975-04-04  Microsoft  {u'date': {u'msft': 2000-01-08 00:00:00}, u'pr...
aapl  1976-04-01      Apple  {u'date': {u'aapl': 2000-01-09 00:00:00}, u'pr...

In [23]: df_company_info.to_json(orient='records')
Out[23]: '[{"est":"1998-09-04","name":"Google","history":{"date":{"goog":947462400000},"price":{"goog":0.4752710922}}},{"est":"1975-04-04","name":"Microsoft","history":{"date":{"msft":947289600000},"price":{"msft":0.1898955714}}},{"est":"1976-04-01","name":"Apple","history":{"date":{"aapl":947376000000},"price":{"aapl":0.7407572394}}}]'

@TurnaevEvgeny
Author

See "date":{"goog":947462400000} in your example; that's my first point: different formatting at different levels.
My second point is performance:

In [26]: %timeit -n 100 df.groupby(level=0).apply(lambda x: x.to_dict())
100 loops, best of 3: 21.8 ms per loop

In [27]: %timeit -n 100 df.groupby(level=0).apply(lambda x: ujson_as_is(x.to_json(orient='records')))
100 loops, best of 3: 8.21 ms per loop

Not a big deal, but it's not a big df either, and there is no further to_dict() higher in the call stack.
In the above examples it's just one level of nesting, but for the API I am building I am going to have: {'a': {'b': [ {'c': {'d': df_slice_dump_here}, 'e': other_df_slice_dump_here } ] }, 'z': other_dump_here }. So suppose I have a function that prepares the 'e': other_df_slice_dump_here part of the output. What I really want to do in that function is finalize its work as intermediate JSON. If I go through to_dict() in that function, I will have trouble with formatting later up the stack, and it will also hurt performance. I also want to be able to put arbitrary JSON dumps into that result, not only another DataFrame's. It's like gradually building the final JSON bottom-up, as in the sketch below. I think that was exactly the intention behind ujson adding __json__.
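
As a sketch of that bottom-up pattern (ujson 1.35 itself honours __json__, so plain ujson.dumps can splice the fragments; the toy DataFrame here is made up):

import pandas as pd
import ujson  # ujson >= 1.35 honours __json__

class ujson_as_is(object):
    def __init__(self, value):
        self.value = value
    def __json__(self):
        # the encoder splices this raw JSON fragment in verbatim
        return self.value

df = pd.DataFrame({'price': [1.0, 2.0]})
# finalize one branch with its own formatting params
part_e = ujson_as_is(df.to_json(orient='records'))

ujson.dumps({'a': {'b': [{'e': part_e}]},
             'z': ujson_as_is('{"raw": true}')})
# -> '{"a":{"b":[{"e":[{"price":1.0},{"price":2.0}]}]},"z":{"raw": true}}'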

@jreback
Contributor

jreback commented Apr 1, 2016

Why are you putting arbitrary JSON in DataFrames? This is not useful or performant at all.

You are just adding an artifice which doesn't enable much of anything, and will cause user confusion.

I still don't see how this is actually useful: where are you exporting this JSON to, and what are you doing with it?

@TurnaevEvgeny
Author

Ok, let me start from the beginning. I am building a REST API (a Flask application) that is supposed to return JSON. Internally I am working with DataFrames fetched from different sources or cached in memory. While processing API calls I often have 2-3 DataFrames that are somehow related (often not in a straightforward merge way). My options for producing the final JSON: 1) loop through those 2-3 DataFrames, construct a giant dict, and then json.dumps it; 2) use pandas DataFrames and df.to_json() and build the JSON gradually.
The first option is a mix of .apply(), to_dict(), and .iterrows(), which I would like to skip for performance and readability reasons.
I understand this is not a pandas problem at all, and this was my original doubt about whether this functionality should go into the docs, since it is specific to JSON dumping. But then I found ujson's __json__ implementation and the idea behind it. The code is much simpler, smoother, and more performant with the second approach.
I understand your concern about bringing unneeded complexity into the codebase; on the other hand, from my perspective this is just ujson compliance that potentially opens a new way to prepare JSON output with pandas DataFrames for cases like a REST API.

@jreback
Contributor

jreback commented Apr 9, 2016

cc @Komnomnomnom any thoughts here

@Komnomnomnom
Contributor

I'd be more in favour of supporting a default handler which can return encoded / raw JSON, similar to the cls arg in the built-in json.dumps, e.g. something like

df.to_json(default_enc=lambda obj: json.dumps(obj))

@jreback
Contributor

jreback commented Apr 12, 2016

@Komnomnomnom isn't that what default_handler is doing?

@Komnomnomnom
Contributor

No, it converts an unsupported object into one that can be JSON serialised, i.e. by converting it to a dict or a string.

default_enc would be expected to return valid JSON.

@jreback
Contributor

jreback commented Apr 12, 2016

Why two different ways? Couldn't default_handler just return valid JSON (and if it's not, it would be an error)?

@Komnomnomnom
Contributor

If default_handler returned JSON it would just be treated as a regular string and thus end up double-encoded. default_handler must convert the object into a different JSON-serializable object, i.e. MyClass -> dict representation of MyClass.

The idea is that

  • default_handler allows you to easily serialize objects that are not natively supported by the JSON serializer, but you must return a new, supported object to the encoder, and it will encode the new object using the params you invoked it with (ISO dates, fast numpy and so on).
  • default_enc would be required to return valid JSON and would allow you to override the encoder for a subset of your data. So you could change encoder params or use a different encoder entirely, along with supporting the use case above and avoiding double encoding. The caveat is it would only get invoked for objects that the regular encoder doesn't support. (See the sketch below for the contrast with today's default_handler.)
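
For contrast, a small runnable sketch of today's default_handler (a real to_json parameter; the Point class is just a stand-in for an unsupported object):

import pandas as pd

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

df = pd.DataFrame({'p': [Point(1, 2)]})

# default_handler must return a JSON-serializable object (here a dict);
# the regular encoder then serializes it with the usual params.
df.to_json(default_handler=lambda o: {'x': o.x, 'y': o.y})
# -> '{"p":{"0":{"x":1,"y":2}}}'

# Returning a pre-encoded JSON string here would come out double-encoded;
# that is the gap the proposed default_enc would fill.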

@jreback
Contributor

jreback commented May 13, 2016

Closing as won't fix.

default_handler covers these cases

@jreback jreback closed this May 13, 2016
@jreback jreback added this to the No action milestone May 13, 2016