
ujson __json__ attribute logic #12739


Closed

Conversation

TurnaevEvgeny

  • ./test_fast.sh works fine
    Ran 8463 tests in 127.338s
    OK (SKIP=592)
  • passes git diff upstream/master | flake8 --diff

A port of a ujson 1.35 feature: an object can define a __json__ attribute for custom serialization. See
ultrajson/ultrajson@a8f0f0f

class ujson_as_is(object):
    # wraps a pre-encoded JSON string; the encoder splices it in verbatim
    def __init__(self, value):
        self.value = value
    def __json__(self):
        return self.value

df = pd.DataFrame([{"foo": ujson_as_is('{"parrot": 42.0}')}])
df.to_json(orient = 'records')

result [{"foo":{"parrot": 42.0}}]

@jreback
Contributor

jreback commented Mar 30, 2016

Could you add some tests specifically for this?

How standard is the dunder __json__ tag? Do other libraries use it? Can you give some examples?

@jreback jreback added Enhancement IO JSON read_json, to_json, json_normalize labels Mar 30, 2016
@TurnaevEvgeny
Author

Added some tests. I don't think __json__ is used anywhere except ujson.
I am building a RESTful API and use pandas DataFrames heavily, so I would like to skip Python loops as much as possible.
Primary use cases for me:

  • put a raw JSON dump of another DataFrame into a DataFrame, so that it is not double-encoded.
  • create nested JSON from a DataFrame, skipping Python-level looping.
  • different parts of a nested structure might require different decimal places & datetime formatting.

I can write a small section in io.html#json with an example of how I am going to use it in a REST API, like:

df_long = pd.DataFrame(..., index=[1, 1, 1, 2, 2, 2])
df_result = pd.DataFrame(columns=["result", "rest_api_code"], index=[1, 2, 3])

df_result["rest_api_code"] = "not_found"
df_result.loc[df_long.index.unique(), "rest_api_code"] = "ok"

for idx, grp in df_long.groupby(level=0):
    df_result.loc[idx, "result"] = grp.to_json(orient='records')

df_result.to_json(orient='records')

I am actually using multiple levels of such DataFrame JSON nesting.

Do you think it's worth noting in the docs? It's not a typical use case and not related to core functionality.

@jreback
Contributor

jreback commented Mar 30, 2016

@TurnaevEvgeny I think what people want is an easy way to do this (then we would expose pd.to_json(...) as a top-level general serializing function):

In [2]: pd.io.json.to_json(None, df)
Out[2]: '{"A":{"0":1,"1":2,"2":3},"B":{"0":"a","1":"b","2":"c"}}'

In [3]: pd.io.json.to_json(None, {'foo' : df})
NotImplementedError: 'obj' should be a Series or a DataFrame

IIRC this was a very easy fix
cc @cpcloud

@jreback
Contributor

jreback commented Mar 30, 2016

xref to #9166

@TurnaevEvgeny
Author

@jreback Sorry, I didn't get why #9166 is referenced; it seems irrelevant. I also didn't completely follow your reply that people want pd.to_json(), although it aligns with my point that __json__ has nothing to do with DataFrame serialization in general. It's just a way to store pre-dumped JSON in a DataFrame and then output it without double/triple encoding, plus ujson compatibility. So what's the status of this pull? I can add docs if needed.

@jreback
Contributor

jreback commented Mar 31, 2016

Serializing nested structures that include pandas objects can almost be done now.

I would rather fix to_json than add a dunder method.

@TurnaevEvgeny
Author

I would argue that __json__ helps in different scenarios than serializing a DataFrame with nested structures.
With __json__ one can store JSON dumps in an arbitrary format in a DataFrame, e.g. df.loc[...] = ujson_as_is(df.to_json(orient='records')), each with its own orientation and float formatting, and store anything: not only another DataFrame dump but any arbitrary JSON dump. It also helps to store instances of objects that know how to dump themselves and keep a cached dump representation at hand.

@jreback
Contributor

jreback commented Mar 31, 2016

Can you post a short compelling example (with output), as if you were writing docs?

@TurnaevEvgeny
Author

In [16]: class ujson_as_is(object):
   ....:     def __init__(self, value):
   ....:         self.value = value
   ....:     def __json__(self):
   ....:         return self.value
   ....:     __repr__ = __json__
   ....:

In [17]: df_company_info = pd.DataFrame([{'name': 'Google', 'est': '1998-09-04'},
   ....:                                 {'name': 'Microsoft', 'est': '1975-04-04'},
   ....:                                 {'name': 'Apple', 'est': '1976-04-01'}],
   ....:                                 index=['goog', 'msft', 'aapl'])

In [18]: df_company_info
Out[18]:
             est       name
goog  1998-09-04     Google
msft  1975-04-04  Microsoft
aapl  1976-04-01      Apple

In [19]: names = np.random.choice(['aapl', 'goog', 'msft'], 10)

In [20]: dates = pd.date_range('1/1/2000', periods=10, freq='D')

In [21]: df = pd.DataFrame({'date': dates, 'price': np.random.random(10)}, index = names)

In [22]: df
Out[22]:
           date     price
msft 2000-01-01  0.280379
aapl 2000-01-02  0.120819
aapl 2000-01-03  0.471827
aapl 2000-01-04  0.789162
aapl 2000-01-05  0.649434
msft 2000-01-06  0.858836
goog 2000-01-07  0.440876
aapl 2000-01-08  0.523965
aapl 2000-01-09  0.860230
msft 2000-01-10  0.215722

In [23]: for company,grp in df.groupby(level=0):
   ....:     df_company_info.loc[company, "stock_history"] = ujson_as_is(grp.to_json(orient='records', double_precision = 3, date_format = 'iso'))
   ....:

In [24]: df_company_info
Out[24]:
             est       name                                      stock_history
goog  1998-09-04     Google  [{"date":"2000-01-07T00:00:00.000Z","price":0....
msft  1975-04-04  Microsoft  [{"date":"2000-01-01T00:00:00.000Z","price":0....
aapl  1976-04-01      Apple  [{"date":"2000-01-02T00:00:00.000Z","price":0....

In [25]: df_company_info.to_json(orient='records')
Out[25]: '[{"est":"1998-09-04","name":"Google","stock_history":[{"date":"2000-01-07T00:00:00.000Z","price":0.441}]},{"est":"1975-04-04","name":"Microsoft","stock_history":[{"date":"2000-01-01T00:00:00.000Z","price":0.28},{"date":"2000-01-06T00:00:00.000Z","price":0.859},{"date":"2000-01-10T00:00:00.000Z","price":0.216}]},{"est":"1976-04-01","name":"Apple","stock_history":[{"date":"2000-01-02T00:00:00.000Z","price":0.121},{"date":"2000-01-03T00:00:00.000Z","price":0.472},{"date":"2000-01-04T00:00:00.000Z","price":0.789},{"date":"2000-01-05T00:00:00.000Z","price":0.649},{"date":"2000-01-08T00:00:00.000Z","price":0.524},{"date":"2000-01-09T00:00:00.000Z","price":0.86}]}]'


@TurnaevEvgeny
Author

Is there anything further I can help with?

@jreback
Contributor

jreback commented Mar 31, 2016

I am still confused about what problem this solves.
In the example above, why wouldn't you simply df.to_json(...)?

@TurnaevEvgeny
Author

In the example above, if I do the same without __json__ support in pandas, the output will be

Out[194]: '[{"est":"1998-09-04","name":"Google","stock_history":{"value":"[{\\"date\\":\\"2000-01-03T00:00:00.000Z\\",\\"price\\":0.294},{\\"date\\":\\"2000-01-06T00:00:00.000Z\\",\\"price\\":0.472}]"}},{"est":"1975-04-04","name":"Microsoft","stock_history":{"value":"[{\\"date\\":\\"2000-01-01T00:00:00.000Z\\",\\"price\\":0.489},{\\"date\\":\\"2000-01-04T00:00:00.000Z\\",\\"price\\":0.739},{\\"date\\":\\"2000-01-08T00:00:00.000Z\\",\\"price\\":0.057}]"}},{"est":"1976-04-01","name":"Microsoft","stock_history":{"value":"[{\\"date\\":\\"2000-01-02T00:00:00.000Z\\",\\"price\\":0.718},{\\"date\\":\\"2000-01-05T00:00:00.000Z\\",\\"price\\":0.074},{\\"date\\":\\"2000-01-07T00:00:00.000Z\\",\\"price\\":0.517},{\\"date\\":\\"2000-01-09T00:00:00.000Z\\",\\"price\\":0.902},{\\"date\\":\\"2000-01-10T00:00:00.000Z\\",\\"price\\":0.245}]"}}]'

If I decode this back:

tmp[0]['stock_history']
Out[10]: {u'value': u'[{"date":"2000-01-03T00:00:00.000Z","price":0.294},{"date":"2000-01-06T00:00:00.000Z","price":0.472}]'}

The stock_history column becomes double-JSON-encoded, like json.dumps(json.dumps({'foo': 'bar'})). It also gets put into an artificial {'value': stock_history_str} dict that didn't exist in the DataFrame.
to_dict() partially solves this in the example above, but to_dict() doesn't help with different representation formats. Also, I bet to_dict() will be slower because of Python object creation, while to_json() can convert pretty wide frames to a string, skipping intermediate Python object creation. Besides, with __json__ it's just handier to build the final JSON dump gradually.
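
As a minimal illustration with only the standard library, this is what double encoding looks like:

import json

payload = {"parrot": 42.0}
once = json.dumps(payload)     # '{"parrot": 42.0}' -- a JSON object
twice = json.dumps(once)       # '"{\\"parrot\\": 42.0}"' -- a JSON string
json.loads(twice) == once      # True: one loads only recovers the string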

@TurnaevEvgeny
Author

Does that explain the problem & motivation?

@jreback
Contributor

jreback commented Apr 1, 2016

I still don't see any utility in doing that. You can simply construct the frame as you like, then serialize the entire thing.

In [21]: df_company_info['history'] = df.groupby(level=0).apply(lambda x: x.to_dict())

In [22]: df_company_info
Out[22]: 
             est       name                                            history
goog  1998-09-04     Google  {u'date': {u'goog': 2000-01-10 00:00:00}, u'pr...
msft  1975-04-04  Microsoft  {u'date': {u'msft': 2000-01-08 00:00:00}, u'pr...
aapl  1976-04-01      Apple  {u'date': {u'aapl': 2000-01-09 00:00:00}, u'pr...

In [23]: df_company_info.to_json(orient='records')
Out[23]: '[{"est":"1998-09-04","name":"Google","history":{"date":{"goog":947462400000},"price":{"goog":0.4752710922}}},{"est":"1975-04-04","name":"Microsoft","history":{"date":{"msft":947289600000},"price":{"msft":0.1898955714}}},{"est":"1976-04-01","name":"Apple","history":{"date":{"aapl":947376000000},"price":{"aapl":0.7407572394}}}]'

@TurnaevEvgeny
Author

See "date":{"goog":947462400000} in your example; that's my first point: different formatting at different levels.
My second point is performance:

In [26]: %timeit -n 100 df.groupby(level=0).apply(lambda x: x.to_dict())
100 loops, best of 3: 21.8 ms per loop

In [27]: %timeit -n 100 df.groupby(level=0).apply(lambda x: ujson_as_is(x.to_json(orient='records')))
100 loops, best of 3: 8.21 ms per loop

Not a big deal, but it's not a big df either, and there is no further to_dict() higher in the call stack.
In the above examples it's just one level of nesting, but for the API I am building I am going to have: {'a': {'b': [ {'c': {'d': df_slice_dump_here}, 'e': other_df_slice_dump_here } ] }, 'z': other_dump_here }. So suppose I have a function that prepares the 'e': other_df_slice_dump_here part of the output. What I really want to do in that function is finalize its work as intermediate JSON. If I go through to_dict() in that function, I will have trouble with formatting later up the stack, and it will also hurt performance. I also want to be able to put arbitrary JSON dumps into that result, not only another DataFrame's. It's like gradually building the final JSON bottom-up, as in the sketch below. I think that was exactly the intention behind ujson adding __json__.
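
As a sketch of that bottom-up pattern (ujson 1.35 itself honours __json__, so plain ujson.dumps can splice the fragments; the toy DataFrame here is made up):

import pandas as pd
import ujson  # ujson >= 1.35 honours __json__

class ujson_as_is(object):
    def __init__(self, value):
        self.value = value
    def __json__(self):
        # the encoder splices this raw JSON fragment in verbatim
        return self.value

df = pd.DataFrame({'price': [1.0, 2.0]})
# finalize one branch with its own formatting params
part_e = ujson_as_is(df.to_json(orient='records'))

ujson.dumps({'a': {'b': [{'e': part_e}]},
             'z': ujson_as_is('{"raw": true}')})
# -> '{"a":{"b":[{"e":[{"price":1.0},{"price":2.0}]}]},"z":{"raw": true}}'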

@jreback
Contributor

jreback commented Apr 1, 2016

Why are you putting arbitrary JSON in DataFrames? This is not useful or performant at all.

You are just adding an artifice which doesn't enable much of anything, and will cause user confusion.

I still don't see how this is actually useful: where are you exporting this JSON to, and what are you doing with it?

@TurnaevEvgeny
Author

Ok, let me start from the beginning. I am building a REST API (a Flask application) that is supposed to return JSON. Internally I am working with DataFrames fetched from different sources or cached in memory. While processing API calls I often have 2-3 DataFrames that are somehow related (often not in a straightforward merge way). My options for producing the final JSON: 1) loop through those 2-3 DataFrames, construct a giant dict, and then json.dumps it; 2) use pandas DataFrames and df.to_json() and build the JSON gradually.
The first option is a mix of .apply(), to_dict(), and .iterrows(), which I would like to skip for performance and readability reasons.
I understand this is not a pandas problem at all, and this was my original doubt about whether this functionality should go into the docs, since it is specific to JSON dumping. But then I found ujson's __json__ implementation and the idea behind it. The code is much simpler, smoother, and more performant with the second approach.
I understand your concern about bringing unneeded complexity into the codebase; on the other hand, from my perspective this is just ujson compliance that potentially opens a new way to prepare JSON output with pandas DataFrames for cases like a REST API.

@jreback
Contributor

jreback commented Apr 9, 2016

cc @Komnomnomnom any thoughts here

@Komnomnomnom
Contributor

I'd be more in favour of supporting a default handler which can return encoded / raw JSON, similar to the cls arg in the built-in json.dumps, e.g. something like

df.to_json(default_enc=lambda obj: json.dumps(obj))

@jreback
Contributor

jreback commented Apr 12, 2016

@Komnomnomnom isn't that what default_handler is doing?

@Komnomnomnom
Contributor

No, it converts an unsupported object into one that can be JSON serialised, i.e. by converting it to a dict or a string.

default_enc would be expected to return valid JSON.

@jreback
Contributor

jreback commented Apr 12, 2016

Why two different ways? Couldn't default_handler just return valid JSON (and if it's not, it would be an error)?

@Komnomnomnom
Contributor

If default_handler returned JSON it would just be treated as a regular string and thus end up double-encoded. default_handler must convert the object into a different JSON-serializable object, i.e. MyClass -> dict representation of MyClass.

The idea is that

  • default_handler allows you to easily serialize objects that are not natively supported by the JSON serializer, but you must return a new, supported object to the encoder, and it will encode the new object using the params you invoked it with (ISO dates, fast numpy and so on).
  • default_enc would be required to return valid JSON and would allow you to override the encoder for a subset of your data. So you could change encoder params or use a different encoder entirely, along with supporting the use case above and avoiding double encoding. The caveat is it would only get invoked for objects that the regular encoder doesn't support. (See the sketch below for the contrast with today's default_handler.)
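
For contrast, a small runnable sketch of today's default_handler (a real to_json parameter; the Point class is just a stand-in for an unsupported object):

import pandas as pd

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

df = pd.DataFrame({'p': [Point(1, 2)]})

# default_handler must return a JSON-serializable object (here a dict);
# the regular encoder then serializes it with the usual params.
df.to_json(default_handler=lambda o: {'x': o.x, 'y': o.y})
# -> '{"p":{"0":{"x":1,"y":2}}}'

# Returning a pre-encoded JSON string here would come out double-encoded;
# that is the gap the proposed default_enc would fill.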

@jreback
Contributor

jreback commented May 13, 2016

Closing as won't fix.

default_handler covers these cases

@jreback jreback closed this May 13, 2016
@jreback jreback added this to the No action milestone May 13, 2016