Closed
Description
After asking on the mailing list, I learned that I was wrong to assume that `pcre2_match_16` takes the search buffer size in bytes. I had read https://www.pcre.org/current/doc/html/pcre2api.html#SEC15 beforehand and still misunderstood it.
What also misled me was the explanation that `PCRE2_SIZE` is the same as `size_t`, which is clearly a byte size (otherwise, it could overflow for code units larger than 1 byte).
May I ask that this be explained more clearly, covering these points:
- When using the 16-bit or 32-bit functions, the buffer length must be given in code units, i.e. the buffer's length in bytes divided by 2 or 4, respectively.
- The 16-bit or 32-bit code units are expected at their word alignment. That means that (a) UTF-16BE chars won't be found in a buffer containing UTF-16BE and (b) if strings are searched in binary data (with the option `PCRE2_MATCH_INVALID_UTF`), they also won't be found if they are not aligned. In that case, a workaround might be to search multiple times with the start offset moved by 1 or more bytes.
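To illustrate the two points above, here is a minimal sketch in plain C with no PCRE2 calls. The helper names `units_from_bytes` and `realign_16` are hypothetical and only for illustration. The first converts a byte length into the code-unit length that the 16-bit or 32-bit functions expect. The second copies the data from a chosen byte offset into an aligned `uint16_t` buffer, so that a second search can cover code units that sit at odd byte positions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert a byte length into a length in code units.
 * unit_bytes is 1, 2 or 4 for the 8-, 16- and 32-bit libraries.
 * Trailing bytes that do not fill a whole code unit are dropped. */
static size_t units_from_bytes(size_t byte_len, size_t unit_bytes)
{
    return byte_len / unit_bytes;
}

/* Hypothetical helper: copy the bytes starting at byte_offset into an
 * aligned uint16_t buffer. Returns the number of whole 16-bit code units
 * copied (at most out_cap). The bytes are copied unchanged, so the code
 * units are interpreted in the host's byte order. */
static size_t realign_16(const unsigned char *bytes, size_t byte_len,
                         size_t byte_offset, uint16_t *out, size_t out_cap)
{
    if (byte_offset >= byte_len)
        return 0;

    size_t units = units_from_bytes(byte_len - byte_offset, sizeof(uint16_t));
    if (units > out_cap)
        units = out_cap;

    memcpy(out, bytes + byte_offset, units * sizeof(uint16_t));
    return units;
}
```

Under these assumptions, the buffer filled by `realign_16` and the returned count would be passed as the subject and length arguments of `pcre2_match_16`, once with `byte_offset` 0 and once with `byte_offset` 1. Match offsets are then in code units relative to the realigned buffer, so the byte position in the original data would be `byte_offset + 2 * unit_offset`.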
Also, I am not sure whether this is explained: the search code is not aware of Unicode composition, i.e. when searching for Unicode characters with accents (or umlauts), one should perform multiple searches covering all possible composition forms.
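The composition issue can be seen with a small sketch in plain C. It uses a naive code-unit search as a stand-in for a literal regex match. The names `find_units` and `find_any_form` are hypothetical. The letter "é" is stored either precomposed (U+00E9) or decomposed (U+0065 followed by U+0301), and a search for one form does not find the other.

```c
#include <stddef.h>
#include <stdint.h>

/* "é" precomposed (NFC): U+00E9 */
static const uint16_t E_ACUTE_NFC[] = { 0x00E9 };
/* "é" decomposed (NFD): U+0065 'e' followed by U+0301 combining acute */
static const uint16_t E_ACUTE_NFD[] = { 0x0065, 0x0301 };

/* Naive code-unit search, standing in for a literal pattern match.
 * Returns the index of the first occurrence, or (size_t)-1 if absent. */
static size_t find_units(const uint16_t *hay, size_t hay_len,
                         const uint16_t *needle, size_t needle_len)
{
    if (needle_len == 0 || needle_len > hay_len)
        return (size_t)-1;

    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        size_t j = 0;
        while (j < needle_len && hay[i + j] == needle[j])
            j++;
        if (j == needle_len)
            return i;
    }
    return (size_t)-1;
}

/* Try the precomposed form first, then fall back to the decomposed one. */
static size_t find_any_form(const uint16_t *hay, size_t hay_len)
{
    size_t pos = find_units(hay, hay_len, E_ACUTE_NFC, 1);
    if (pos != (size_t)-1)
        return pos;
    return find_units(hay, hay_len, E_ACUTE_NFD, 2);
}
```

With a regex, the same idea could presumably be expressed as a single alternation such as `(?:\x{e9}|e\x{301})` instead of separate searches. Normalizing both pattern and subject to one form beforehand would be another option, but that needs a Unicode library and is outside this sketch.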