BUG: strange string causes segmentation fault on df.to_json #50307

Closed
3 tasks done
naterush opened this issue Dec 17, 2022 · 3 comments · Fixed by #50324
Labels: Bug; IO JSON (read_json, to_json, json_normalize); Segfault (Non-Recoverable Error)

naterush commented Dec 17, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
string = chr(56000)
print(repr(string)) # This is a valid Python string, but can't be printed normally
df = pd.DataFrame({'A': [string]})
df.to_json()

Issue Description

When I run this from the command line, it causes a segmentation fault:

zsh: segmentation fault  python main.py

Expected Behavior

It should not segfault. Raising an error saying the value can't be serialized by to_json would be fine, but it shouldn't crash the whole Python runtime (you can't even recover with try/except).
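Because a segfault kills the interpreter, try/except in the same process cannot help. One way to observe the crash safely is to run the reproducer in a child process and inspect its return code. This is a minimal sketch, not part of pandas: `REPRO`, `run_isolated`, and `describe` are hypothetical names chosen for illustration, and it assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys

# The reproducer from this report, kept as a string so that it only ever
# runs inside a child process (hypothetical constant name).
REPRO = """
import pandas as pd
df = pd.DataFrame({'A': [chr(56000)]})
df.to_json()
"""


def run_isolated(code: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run a snippet in a fresh interpreter so a crash cannot take down the caller."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe(returncode: int) -> str:
    """Turn a child's return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"
```

Calling `describe(run_isolated(REPRO).returncode)` on an affected build would be expected to report that the child was killed by a signal, while the calling process keeps running. On a build with the fix, or without pandas installed, the child exits normally or with an ordinary non-zero status.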

Installed Versions

INSTALLED VERSIONS

commit : 8dab54d
python : 3.9.9.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Wed Aug 10 14:28:23 PDT 2022; root:xnu-8020.141.5~2/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 56.0.0
pip : 21.2.4
Cython : None
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.7.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

naterush added the Bug and Needs Triage labels Dec 17, 2022
lithomas1 added the IO JSON and Segfault labels and removed the Needs Triage label Dec 17, 2022
lithomas1 (Member) commented Dec 17, 2022

Hi @naterush,
Thanks for the report. It looks like we're not handling errors correctly within the JSON C code.
For reference, we are calling PyUnicode_AsUTF8AndSize here

static char *PyUnicodeToUTF8(JSOBJ _obj, JSONTypeContext *Py_UNUSED(tc),
                             size_t *_outLen) {
    return (char *)PyUnicode_AsUTF8AndSize(_obj, (Py_ssize_t *)_outLen);
}

and it is raising
UnicodeEncodeError: 'utf-8' codec can't encode character '\udac0' in position 0: surrogates not allowed.
(The error never surfaces because the process segfaults first.)
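The C-level behavior can be observed from Python without crashing anything. The sketch below assumes CPython on a platform where `ctypes.pythonapi` can resolve `PyUnicode_AsUTF8AndSize`. `utf8_via_c_api` is a hypothetical helper name, not something pandas provides. On failure the C function returns NULL with a Python exception set. `ctypes.pythonapi` checks the error flag after the call and re-raises that exception, whereas C code that uses the returned pointer without a NULL check would go on to read from address zero.

```python
import ctypes

# Assumption: CPython, where ctypes.pythonapi exposes the interpreter's C API.
_as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
_as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
_as_utf8.restype = ctypes.c_void_p  # raw pointer, so a NULL result stays visible


def utf8_via_c_api(text: str) -> bytes:
    """Encode text through the same C API call as the snippet quoted above.

    If the C function fails, ctypes.pythonapi raises the pending Python
    exception here instead of handing back a NULL pointer.
    """
    size = ctypes.c_ssize_t(0)
    ptr = _as_utf8(text, ctypes.byref(size))
    # Copy exactly `size` bytes; the buffer belongs to `text`, which is alive.
    return ctypes.string_at(ptr, size.value)


if __name__ == "__main__":
    print(utf8_via_c_api("plain text"))
    try:
        utf8_via_c_api(chr(56000))
    except UnicodeEncodeError as exc:
        print("C API reported:", exc)
```

The second call fails with the same "surrogates not allowed" error quoted above, which suggests the encoder needs to check for a NULL return and propagate the exception rather than use the pointer.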

I'll try to submit a PR for this soon.

lithomas1 self-assigned this Dec 17, 2022
lithomas1 (Member) commented Dec 17, 2022

Looking into this further, it seems like the UnicodeEncodeError is expected.

In your example, chr(56000) is a lone surrogate, so both
string.encode("utf-8") and

df.squeeze().encode("utf-8") raise the UnicodeEncodeError in plain Python.
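The encoding behavior itself is easy to check in plain Python. A minimal sketch: under the default strict error handler, UTF-8 encoding rejects code points in the surrogate range (U+D800 to U+DFFF), while the standard `surrogatepass` handler lets them through. `has_surrogate` and `scrub_surrogates` are hypothetical helpers for illustration, showing one possible way to detect or replace such characters before serializing. They are not pandas functions and not necessarily the approach taken in the fix.

```python
def has_surrogate(text: str) -> bool:
    """Return True if text contains any code point in the surrogate range."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def scrub_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Replace surrogate code points so the result encodes as strict UTF-8."""
    return "".join(
        replacement if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )


if __name__ == "__main__":
    string = chr(56000)  # '\udac0', the value from the report

    try:
        string.encode("utf-8")  # strict mode is the default
    except UnicodeEncodeError as exc:
        print("strict encode failed:", exc.reason)

    # The surrogatepass handler encodes the code point as three raw bytes.
    print(string.encode("utf-8", "surrogatepass"))  # b'\xed\xab\x80'

    print(has_surrogate(string))                    # True
    print(scrub_surrogates("A" + string).encode("utf-8"))
```

Scrubbing changes the data, so it only makes sense as a caller-side workaround when the surrogate characters carry no meaning.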

naterush (Author) commented

@lithomas1 thanks for the quick turnaround, this was epic!
