Feature or enhancement
Do an ASCII check on the entire buffer for TextIOWrapper and save the result as a variable, e.g. self->buffer_is_ascii. Use this knowledge to create strings more quickly.
Pitch
The PyUnicode_Decode* functions check what the maximum character in the data is. For instance, PyUnicode_DecodeLatin1 still scans the input, and if the string is actually ASCII, an ASCII string is made. A similar process happens when TextIOWrapper decodes a text file.
However, in the ASCII case all characters are ASCII, and in the UTF-8 case possibly all characters are ASCII. In that case, a PyUnicode_New call to initialize an ASCII string followed by a simple memcpy of the data is much faster than the alternative. This is utilized in the dnaio parser for FASTQ files.
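For reference, a minimal sketch of what such a fast path could look like, assuming the bytes have already been verified to be pure ASCII (the helper name is only illustrative, not the exact code proposed for _io):

```c
#include <Python.h>
#include <string.h>

/* Illustrative helper: build a compact ASCII str object from bytes that
 * are already known to contain no characters above 0x7F. */
static PyObject *
unicode_from_ascii(const char *data, Py_ssize_t size)
{
    /* maxchar 127 makes PyUnicode_New allocate the compact ASCII layout. */
    PyObject *result = PyUnicode_New(size, 127);
    if (result == NULL) {
        return NULL;
    }
    memcpy(PyUnicode_1BYTE_DATA(result), data, (size_t)size);
    return result;
}
```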
The following code runs at 20 GB/s: https://github.com/rhpvorderman/ascii-check/blob/main/ascii_check.h#L41. It is therefore almost cost-free when running on io.DEFAULT_BUFFER_SIZE chunks (8 KiB). An SSE2 implementation is also provided in the same repository.
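One way to get near-memory-bandwidth speed for such a check is a word-at-a-time (SWAR) scan; a minimal portable sketch of that idea (not the repository's code) could look like this:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A buffer is ASCII iff no byte has its high bit set, so OR-testing
 * eight bytes at a time against 0x80.. reduces the check to roughly
 * one comparison per machine word. */
static bool
buffer_is_ascii(const char *data, size_t size)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;
    for (; i + sizeof(uint64_t) <= size; i += sizeof(uint64_t)) {
        uint64_t chunk;
        memcpy(&chunk, data + i, sizeof(chunk));  /* avoids alignment issues */
        if (chunk & high_bits) {
            return false;
        }
    }
    for (; i < size; i++) {
        if ((unsigned char)data[i] & 0x80) {
            return false;
        }
    }
    return true;
}
```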
After this check is performed, a lot of the translation and decoding can in fact be skipped if the data turns out to be ASCII. Since UTF-8 files are quite common, this can turn into a real-world performance benefit.
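To make that concrete, here is a purely hypothetical sketch of how the cached flag could be consulted on the read path. Apart from the proposed buffer_is_ascii flag, every name here (the textio struct layout, utf8_or_ascii_encoding, needs_newline_translation, call_incremental_decoder, and unicode_from_ascii from the sketch above) is invented for illustration; the real _io.TextIOWrapper code would also have to handle incremental decoder state.

```c
/* Hypothetical fast path in a TextIOWrapper-like object. Everything
 * except the proposed buffer_is_ascii flag is an invented placeholder. */
static PyObject *
decode_chunk(textio *self, const char *buf, Py_ssize_t len)
{
    if (self->buffer_is_ascii && self->utf8_or_ascii_encoding
            && !self->needs_newline_translation) {
        /* The chunk is pure ASCII: decoding is a byte-for-byte copy
         * into a compact ASCII str object. */
        return unicode_from_ascii(buf, len);
    }
    /* Otherwise fall back to the configured incremental decoder. */
    return call_incremental_decoder(self, buf, len);
}
```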