-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
gh-95382: Improve performance of json encoder with indent #118105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
Co-authored-by: Nice Zombies <[email protected]>
Modules/_json.c
Outdated
PyObject* start = PyUnicode_FromString("\n"); | ||
PyObject* newline_indent = NULL; | ||
if (start == 0) { | ||
goto end; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mul_name
will be used uninitialized if this branch is taken.
Also, == 0
looks odd for pointer comparisons; compare against NULL instead.
Modules/_json.c
Outdated
if (start == 0) { | ||
goto end; | ||
} | ||
// there is no public PyUnicode_Repeat? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, great suggestion!
Modules/_json.c
Outdated
goto bail; | ||
} | ||
|
||
separator = PyUnicode_Concat(separator, newline_indent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this leaks a reference to the previous contents of separator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous contents of separator was a borrowed reference, so it should be fine. But correct me if I am wrong!
Don't forget to resolve the conversations. |
Could you also configure pre-commit? https://devguide.python.org/getting-started/setup-building/#install-pre-commit |
It blocks you from committing trailing whitespace: diff --git a/Modules/_json.c b/Modules/_json.c
index 7bc23d1..828c35c 100644
--- a/Modules/_json.c
+++ b/Modules/_json.c
@@ -1360,7 +1360,7 @@ _steal_accumulate(_PyUnicodeWriter *writer, PyObject *stolen)
PyObject* _create_newline_indent(PyObject* indent, Py_ssize_t indent_level)
{
PyObject* newline_indent = NULL;
-
+
PyObject* _current_indent = PySequence_Repeat(indent, indent_level);
if (_current_indent == NULL) {
goto end; |
Typically I leave that to the reviewers (some reviewers prefer it that way) |
Also note that regular reviewers (like me) can't resolve conversations. So, you (or someone else) willl have to resolve that conversation. |
Ah, I did not know that. Well, we can agree among the two of us it is ok to resolve each others comments when addressed. |
Modules/_json.c
Outdated
|
||
PyObject* _current_indent = PySequence_Repeat(indent, indent_level); | ||
if (_current_indent == NULL) { | ||
goto end; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This goto will use start
before it is defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initialise it with NULL
like newline_indent
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just return NULL
.
Modules/_json.c
Outdated
goto bail; | ||
} | ||
|
||
separator = PyUnicode_Concat(separator, newline_indent); // non-borrowed reference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that "non-borrowed reference" means here but it looks like we're leaking a reference to the previous contents of separator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is okay in normal case. separator
is a borrowed reference when s->indent == Py_None
and a non-borrowed reference when s->indent != Py_None
. It is balanced out in normal flow. But if an error happens and we go to the bail
label, separator
can be leaked. It should be decrefed if needed.
Modules/_json.c
Outdated
|
||
PyObject* _current_indent = PySequence_Repeat(indent, indent_level); | ||
if (_current_indent == NULL) { | ||
goto end; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just return NULL
.
Modules/_json.c
Outdated
goto bail; | ||
} | ||
|
||
separator = PyUnicode_Concat(separator, newline_indent); // non-borrowed reference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is okay in normal case. separator
is a borrowed reference when s->indent == Py_None
and a non-borrowed reference when s->indent != Py_None
. It is balanced out in normal flow. But if an error happens and we go to the bail
label, separator
can be leaked. It should be decrefed if needed.
Modules/_json.c
Outdated
indent_level -= 1; | ||
indent_level--; | ||
Py_DECREF(separator); | ||
newline_indent = _create_newline_indent(s->indent, indent_level); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newline_indent
is calculated twice per every list or dict. First for nested items, second for the closing ]
or }
. The latter is the same as newline_indent
for items of the outer list or dict.
In the following example the indent for the nested ]
is the same as the indent for 1
, 2
and the nested [
:
[
1,
2,
[
3,
4
]
]
I think that we can just pass it as an argument of encoder_listencode_list()
and encoder_listencode_dict()
.
The inner value can then be calculated as a simple concatenation, without multiplication:
item_newline_indent = PyUnicode_Concat(newline_indent, s->ident);
Sorry, I accidentally added these as single comments. I thought I was still editing my review. |
Co-authored-by: Nice Zombies <[email protected]>
Modules/_json.c
Outdated
return NULL; | ||
} | ||
} | ||
if (encoder_listencode_obj(self, &writer, obj, indent_level, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to pass indent_level
if we pass current_newline_indent
? I suppose that it is no longer used.
Modules/_json.c
Outdated
PyObject *current_newline_indent, | ||
PyObject *current_item_separator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
current_newline_indent
and current_item_separator
are pretty long names. What if remove the current_
prefix from them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we need new_newline_indent
then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the prefix works for this method, but in the other methods that take the same argument there already was a newline_indent
. I took nineteendo's suggestion to rename those to new_newline_indent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please add a NEWS entry, and it will be ready for merge.
Your initial idea about caching looks interesting. It does not even need a dict, it can use a list. And the implementation may be not more complex than this PR, maybe even simpler.
I have implemented caching. It is not simpler than the current PR, so I'll open a separate PR after merging this PR. |
Rewrite _creat_newline_indent using PyUnicode_AppendAndDel Co-authored-by: Serhiy Storchaka <[email protected]>
Misc/NEWS.d/next/Core and Builtins/2024-05-03-18-01-26.gh-issue-95382.73FSEv.rst
Outdated
Show resolved
Hide resolved
…e-95382.73FSEv.rst Co-authored-by: Jelle Zijlstra <[email protected]>
Thank you for your contribution @eendebakpt. The main advantage of this PR is that it is simple, so while serialization with indentations is not used in performance critical code (because indentation increases the volume of the output, and therefore makes it slower), it is nice to have such speed up almost for free. |
Nice to have this in just before the beta. Thanks to all the reviewers for suggestions and refleak hunting skills! |
In this PR we implement the
indent
argument for the C implemention of the JSON encoder.Benchmark:
Updated benchmark after using
PySequence_Repeat
:Benchmark script
Notes
_create_newline_indent
is not very efficient. We can improve that either in a followup PR or in this PR by adding a cache (probably with python dict). Even without this improvement the C implementation is now much faster than the Python implementation.