BUG: to_json segfaults when exception occurs in UTF8 encoding of string #50324

Merged 1 commit into pandas-dev:main on Dec 18, 2022

Conversation

Opened by @lithomas1 (Member)

@lithomas1 lithomas1 requested a review from WillAyd December 17, 2022 19:17
@lithomas1 lithomas1 added the IO JSON read_json, to_json, json_normalize label Dec 17, 2022
@lithomas1 lithomas1 marked this pull request as ready for review December 17, 2022 19:20
@WillAyd (Member) left a comment


Very nice

@gfyoung gfyoung added Bug Segfault Non-Recoverable Error labels Dec 18, 2022
@gfyoung (Member) commented Dec 18, 2022

LGTM pending CI

@WillAyd WillAyd merged commit cac2e87 into pandas-dev:main Dec 18, 2022
@WillAyd (Member) commented Dec 18, 2022

Awesome, thanks @lithomas1. It would be worth auditing any of these to support #49756.

@phofl (Member) commented Dec 21, 2022

It looks like this causes the 32-bit build to time out on main. Any ideas why this is happening?

@lithomas1 lithomas1 deleted the fix-json-segfault branch December 21, 2022 18:48
@lithomas1 (Member, Author)

Not sure, do you know the exact test that is timing out?

@MarcoGorelli (Member) commented Dec 21, 2022

Pretty sure it's pandas/tests/io/json/test_ujson.py

CI passes if I skip it #50381

@lithomas1 (Member, Author)

I can confirm that it hangs on the test added in this PR. I'm going to debug more, but feel free to revert this PR in the meantime.

@WillAyd (Member) commented Dec 21, 2022

Are there any warnings being generated in the logs? I would also maybe check what happens if we get rid of some of the casts; generally we do this too much in the C code.

@lithomas1 (Member, Author)

I've pulled the 32-bit docker container onto my machine, so hopefully I'll be able to debug with gdb soon.

@WillAyd (Member) commented Dec 21, 2022

My other guess is that enc->errorMsg = "Encoding failed."; is problematic. I think the rhs of this is a char[] with automatic storage, and it is possible that its memory has been reused after the function exits, by the time ultrajson actually evaluates it. This could cause undefined behavior.

You could try heap allocating that string and see if it allows tests to pass (though it will cause a memory leak)

@WillAyd (Member) commented Dec 21, 2022

Looks like I was wrong about the duration of the string literal

https://stackoverflow.com/questions/9970295/life-time-of-a-string-literal-in-c

@lithomas1 (Member, Author)

Looks like it's getting stuck in Buffer_Realloc:

#0  Buffer_Realloc (enc=0xffa2ac0c, cbNeeded=3079809510)
    at pandas/_libs/src/ujson/lib/ultrajsonenc.c:373
#1  0xf3ede5f8 in encode (obj=0xf1ace110, enc=0xffa2ac0c, name=0x0, cbName=0)
    at pandas/_libs/src/ujson/lib/ultrajsonenc.c:1083
#2  0xf3ede240 in encode (obj=0xf1a55828, enc=0xffa2ac0c, name=0x0, cbName=0)
    at pandas/_libs/src/ujson/lib/ultrajsonenc.c:993
#3  0xf3edea7d in JSON_EncodeObject (obj=0xf1a55828, enc=0xffa2ac0c, 
    _buffer=0xffa2acc4 "[", _cbBuffer=65536)

Maybe I need to set the length of the string to zero.

@lithomas1 (Member, Author) commented Dec 21, 2022

Looks like that did it. PR incoming.

Successfully merging this pull request may close these issues:
BUG: strange string causes segmentation fault on df.to_json

6 participants