Faster ASCII and possibly UTF-8 decoding of text files with TextIOWrapper #101289

Closed
rhpvorderman opened this issue Jan 24, 2023 · 1 comment
Labels
performance Performance or resource usage topic-IO type-feature A feature request or enhancement

Comments

@rhpvorderman
Contributor

Feature or enhancement

Do an ASCII check on the entire buffer for TextIOWrapper and save the result in a variable, such as self->buffer_is_ascii. Use this knowledge to create strings more quickly.

Pitch

PyUnicode_Decode* functions scan the data to determine its maximum character. For instance, PyUnicode_DecodeLatin1 still scans the input, and if the data turns out to be pure ASCII, an ASCII string is created. A similar process happens when TextIOWrapper decodes a text file.

However, in the ASCII case, all characters are ASCII. In the UTF-8 case, possibly all characters are ASCII. When they are, a PyUnicode_New call to initialize an ASCII string followed by a simple memcpy of the data is much faster than the alternative. This is utilized in the dnaio parser for FASTQ files.

The following code runs at 20 GB/s https://github.com/rhpvorderman/ascii-check/blob/main/ascii_check.h#L41 and is therefore almost cost-free when running on io.DEFAULT_BUFFER_SIZE chunks (8 KiB, IIRC). An SSE2 implementation is also provided in the same repository.
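
To make the idea concrete, here is a minimal sketch of a word-at-a-time ASCII check in plain C. It is not the code from the linked repository and not CPython code. The function name `buffer_is_ascii` is hypothetical, borrowed from the variable name suggested above. It relies only on the fact that every ASCII byte has its high bit clear.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 0x80 bit per byte lane: any set bit means a non-ASCII byte. */
#define ASCII_HIGH_BITS UINT64_C(0x8080808080808080)

/* Hypothetical helper (illustrative name, not a CPython API).
 * Returns 1 if all `len` bytes at `data` are below 0x80, otherwise 0. */
static int
buffer_is_ascii(const unsigned char *data, size_t len)
{
    uint64_t acc = 0;
    size_t i = 0;

    /* Main loop: OR together 8 bytes per iteration. memcpy is used so
     * that unaligned input is handled without undefined behaviour. */
    for (; i + 8 <= len; i += 8) {
        uint64_t word;
        memcpy(&word, data + i, sizeof(word));
        acc |= word;
    }

    /* Tail: the remaining 0 to 7 bytes, handled one at a time. */
    unsigned char tail = 0;
    for (; i < len; i++) {
        tail |= data[i];
    }

    /* The mask is the same in every byte lane, so byte order is irrelevant. */
    return ((acc & ASCII_HIGH_BITS) == 0) && ((tail & 0x80) == 0);
}
```

Because the loop only accumulates with OR and tests once at the end, it does a single pass over the chunk with no data-dependent branches. When the check returns 1 for a chunk, a decoder could take the PyUnicode_New plus memcpy path described above. When it returns 0, it would fall back to the regular decoder.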

After this step is performed a lot of the translation and decoding can in fact be skipped if the data turns out to be ASCII. Since UTF-8 files are quite common, this can turn out to be a real-world performance benefit.

@rhpvorderman
Contributor Author

I made PR #120212, but the added code complexity was not deemed worth the minor speed increase.

@terryjreedy closed this as not planned on Sep 16, 2024