Faster ASCII and possibly UTF-8 decoding of text files with TextIOWrapper

# Feature or enhancement

Do an ASCII check on the entire buffer for TextIOWrapper and save the result in a variable, such as self->buffer_is_ascii. Use this knowledge to create strings more quickly.

# Pitch
The PyUnicode_Decode* functions check what the maximum character in the data is. For instance, PyUnicode_DecodeLatin1 still scans the input, and if the data is actually ASCII, an ASCII string is created. A similar process happens when TextIOWrapper decodes a text file.

However, in the ASCII case all characters are ASCII, and in the UTF-8 case all characters may be ASCII. When they are, a PyUnicode_New call to initialize an ASCII string followed by a simple memcpy of the data is much faster than the alternative. This approach is used in the [dnaio](https://github.com/marcelm/dnaio) parser for FASTQ files.

The following code runs at 20 GB/s https://github.com/rhpvorderman/ascii-check/blob/main/ascii_check.h#L41 and is therefore almost cost-free when running on io.DEFAULT_BUFFER_SIZE chunks (8 KiB IIRC). An SSE2 implementation is also provided in the same repository.
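For readers who do not want to follow the link, here is a minimal sketch of what a word-at-a-time ASCII check can look like. It is not a copy of the linked code, and the function name `buffer_is_ascii` is only borrowed from the flag name suggested above. The idea is to OR all bytes together eight at a time and test the high bits once at the end, since a byte is ASCII exactly when its top bit is clear.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if every byte in buf[0..len) is below 0x80,
 * otherwise 0. An empty buffer counts as ASCII. */
static int
buffer_is_ascii(const unsigned char *buf, size_t len)
{
    /* The top bit of each of the eight bytes in a 64-bit word. */
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    uint64_t acc = 0;
    unsigned char tail = 0;
    size_t i = 0;

    /* OR eight bytes at a time into an accumulator. memcpy avoids
     * unaligned loads; compilers typically turn it into a plain load. */
    for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, buf + i, sizeof(word));
        acc |= word;
    }

    /* Handle the remaining 0 to 7 bytes one by one. */
    for (; i < len; i++) {
        tail |= buf[i];
    }

    /* Any set high bit means at least one non-ASCII byte was seen. */
    return ((acc & high_bits) == 0) && ((tail & 0x80) == 0);
}
```

The result could be stored once per chunk, so later decoding steps can consult the flag instead of rescanning the data. This sketch has no early exit, which keeps the loop simple for small chunks.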

After this step is performed, much of the translation and decoding can be skipped if the data turns out to be ASCII. Since UTF-8 files are quite common, this could be a real-world performance benefit.



Faster ASCII and possibly UTF-8 decoding of text files with TextIOWrapper #101289
