Skip to content

gh-95382: Improve performance of json encoder with indent #118105

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
May 6, 2024

Conversation

eendebakpt
Copy link
Contributor

@eendebakpt eendebakpt commented Apr 19, 2024

In this PR we implement the indent argument for the C implemention of the JSON encoder.

Benchmark:

json.dumps indent=4: Mean +- std dev: [main] 27.0 us +- 1.5 us -> [pr] 11.0 us +- 0.6 us: 2.45x faster
json.dumps: Mean +- std dev: [main] 8.90 us +- 0.30 us -> [pr] 8.73 us +- 0.35 us: 1.02x faster

Geometric mean: 1.58x faster

Updated benchmark after using PySequence_Repeat :

json.dumps indent=4: Mean +- std dev: [main] 27.0 us +- 1.5 us -> [pr_v2] 10.3 us +- 0.3 us: 2.62x faster
json.dumps: Mean +- std dev: [main] 8.90 us +- 0.30 us -> [pr_v2] 9.02 us +- 0.37 us: 1.01x slower

Geometric mean: 1.61x faster
Benchmark script
import json
import pyperf

obj = {'list': [1, 2, 3, 4, 5., 6., 7.], 'dict': {'a': 'A', 'b': 'A', 'c': 'C'},
       'tuple': tuple(ii**2/3 for ii in range(20)), 'bool': True, 'float': float('nan'), }

json_dumps = lambda data: json.dumps(data)
json_dumps4 = lambda data: json.dumps(data, indent=4)

runner = pyperf.Runner()
runner.bench_func('json.dumps indent=4', json_dumps4, obj)
runner.bench_func('json.dumps', json_dumps, obj)

Notes

  • The implementation of _create_newline_indent is not very efficient. We can improve that either in a followup PR or in this PR by adding a cache (probably with python dict). Even without this improvement the C implementation is now much faster than the Python implementation.
  • We keep the Python implementation of the encoder unchanged (https://peps.python.org/pep-0399/)

@nineteendo

This comment was marked as resolved.

@eendebakpt eendebakpt changed the title Draft: gh-95382: Improve performance of json encoder with indent gh-95382: Improve performance of json encoder with indent Apr 19, 2024
Modules/_json.c Outdated
PyObject* start = PyUnicode_FromString("\n");
PyObject* newline_indent = NULL;
if (start == 0) {
goto end;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mul_name will be used uninitialized if this branch is taken.

Also, == 0 looks odd for pointer comparisons; compare against NULL instead.

Modules/_json.c Outdated
if (start == 0) {
goto end;
}
// there is no public PyUnicode_Repeat?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, great suggestion!

Modules/_json.c Outdated
goto bail;
}

separator = PyUnicode_Concat(separator, newline_indent);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this leaks a reference to the previous contents of separator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous contents of separator was a borrowed reference, so it should be fine. But correct me if I am wrong!

@nineteendo
Copy link
Contributor

Don't forget to resolve the conversations.

@nineteendo
Copy link
Contributor

Could you also configure pre-commit? https://devguide.python.org/getting-started/setup-building/#install-pre-commit

@nineteendo
Copy link
Contributor

It blocks you from committing trailing whitespace:

diff --git a/Modules/_json.c b/Modules/_json.c
index 7bc23d1..828c35c 100644
--- a/Modules/_json.c
+++ b/Modules/_json.c
@@ -1360,7 +1360,7 @@ _steal_accumulate(_PyUnicodeWriter *writer, PyObject *stolen)
 PyObject* _create_newline_indent(PyObject* indent, Py_ssize_t indent_level)
 {
     PyObject* newline_indent = NULL;
- 
+
     PyObject* _current_indent = PySequence_Repeat(indent, indent_level);
     if (_current_indent == NULL) {
         goto end;

@eendebakpt
Copy link
Contributor Author

Don't forget to resolve the conversations.

Typically I leave that to the reviewers (some reviewers prefer it that way)

@nineteendo
Copy link
Contributor

nineteendo commented Apr 21, 2024

Also note that regular reviewers (like me) can't resolve conversations. So, you (or someone else) willl have to resolve that conversation.

@eendebakpt
Copy link
Contributor Author

eendebakpt commented Apr 21, 2024

Also note that regular reviewers (like me) can't resolve conversations. So, you (or someone else) willl have to resolve that conversation.

Ah, I did not know that. Well, we can agree among the two of us it is ok to resolve each others comments when addressed.

Modules/_json.c Outdated

PyObject* _current_indent = PySequence_Repeat(indent, indent_level);
if (_current_indent == NULL) {
goto end;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goto will use start before it is defined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initialise it with NULL like newline_indent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just return NULL.

Modules/_json.c Outdated
goto bail;
}

separator = PyUnicode_Concat(separator, newline_indent); // non-borrowed reference
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that "non-borrowed reference" means here but it looks like we're leaking a reference to the previous contents of separator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is okay in normal case. separator is a borrowed reference when s->indent == Py_None and a non-borrowed reference when s->indent != Py_None. It is balanced out in normal flow. But if an error happens and we go to the bail label, separator can be leaked. It should be decrefed if needed.

Modules/_json.c Outdated

PyObject* _current_indent = PySequence_Repeat(indent, indent_level);
if (_current_indent == NULL) {
goto end;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just return NULL.

Modules/_json.c Outdated
goto bail;
}

separator = PyUnicode_Concat(separator, newline_indent); // non-borrowed reference
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is okay in normal case. separator is a borrowed reference when s->indent == Py_None and a non-borrowed reference when s->indent != Py_None. It is balanced out in normal flow. But if an error happens and we go to the bail label, separator can be leaked. It should be decrefed if needed.

Modules/_json.c Outdated
indent_level -= 1;
indent_level--;
Py_DECREF(separator);
newline_indent = _create_newline_indent(s->indent, indent_level);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newline_indent is calculated twice per every list or dict. First for nested items, second for the closing ] or }. The latter is the same as newline_indent for items of the outer list or dict.

In the following example the indent for the nested ] is the same as the indent for 1, 2 and the nested [:

[
    1,
    2,
    [
        3,
        4
    ]
]

I think that we can just pass it as an argument of encoder_listencode_list() and encoder_listencode_dict().

The inner value can then be calculated as a simple concatenation, without multiplication:

item_newline_indent = PyUnicode_Concat(newline_indent, s->ident);

nineteendo

This comment was marked as duplicate.

@nineteendo
Copy link
Contributor

nineteendo commented Apr 24, 2024

Sorry, I accidentally added these as single comments. I thought I was still editing my review.

Co-authored-by: Nice Zombies <[email protected]>
Modules/_json.c Outdated
return NULL;
}
}
if (encoder_listencode_obj(self, &writer, obj, indent_level,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to pass indent_level if we pass current_newline_indent? I suppose that it is no longer used.

Modules/_json.c Outdated
Comment on lines 1494 to 1495
PyObject *current_newline_indent,
PyObject *current_item_separator)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current_newline_indent and current_item_separator are pretty long names. What if remove the current_ prefix from them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we need new_newline_indent then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the prefix works for this method, but in the other methods that take the same argument there already was a newline_indent. I took nineteendo's suggestion to rename those to new_newline_indent

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please add a NEWS entry, and it will be ready for merge.

Your initial idea about caching looks interesting. It does not even need a dict, it can use a list. And the implementation may be not more complex than this PR, maybe even simpler.

@serhiy-storchaka
Copy link
Member

I have implemented caching. It is not simpler than the current PR, so I'll open a separate PR after merging this PR.

eendebakpt and others added 2 commits May 3, 2024 17:18
Rewrite _creat_newline_indent using PyUnicode_AppendAndDel

Co-authored-by: Serhiy Storchaka <[email protected]>
@serhiy-storchaka serhiy-storchaka merged commit 05adfbb into python:main May 6, 2024
36 checks passed
@serhiy-storchaka
Copy link
Member

Thank you for your contribution @eendebakpt. The main advantage of this PR is that it is simple, so while serialization with indentations is not used in performance critical code (because indentation increases the volume of the output, and therefore makes it slower), it is nice to have such speed up almost for free.

@eendebakpt
Copy link
Contributor Author

Nice to have this in just before the beta. Thanks to all the reviewers for suggestions and refleak hunting skills!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants