The code unit rules for _16 and _32 functions need to be more clearly explained

After asking on the mailing list, I learned that I was wrong to assume that `pcre2_match_16` takes the search buffer size in bytes. I had read https://www.pcre.org/current/doc/html/pcre2api.html#SEC15 beforehand and still misunderstood it.

What also misled me was the explanation that `PCRE2_SIZE` is the same as `size_t`, which to me clearly suggests a size in bytes (otherwise, it could overflow for units > 1 byte).

May I ask that this is explained more clearly, with these points:

1. When using the 16-bit or 32-bit functions, the buffer length is given in code units, i.e. it is one half or one fourth of the buffer's length in bytes.
2. The 16 or 32 bit unichars are expected at their word-alignment. That means that (a) UTF-16BE chars won't be found in a buffer containing UTF-16BE and (b) if strings are searched in binary data (with the option `PCRE2_MATCH_INVALID_UTF`), they also won't be found if they're not aligned (in this case, a work-around might be to search multiple times with the start offset moved by 1 or more bytes).
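
To illustrate both points, here is a minimal sketch in plain C. It does not call PCRE2 at all: `find_units16` is a hypothetical stand-in that compares whole 16-bit units with `memcmp`, and all helper names are made up for this example. It only shows how a byte count becomes a code-unit count, and why a string that starts at an odd byte offset is invisible to a unit-based search until the start is shifted by one byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: turn a byte count into a count of code units.
   A trailing partial unit is silently dropped. */
size_t code_units_from_bytes(size_t byte_len, size_t unit_size)
{
    assert(unit_size != 0);
    return byte_len / unit_size;
}

/* Hypothetical stand-in for a 16-bit matcher: looks for `needle` in `hay`,
   comparing whole 16-bit units only. Returns the unit index or -1. */
ptrdiff_t find_units16(const uint16_t *hay, size_t hay_len,
                       const uint16_t *needle, size_t needle_len)
{
    if (needle_len == 0 || needle_len > hay_len)
        return -1;
    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        if (memcmp(hay + i, needle, needle_len * sizeof *needle) == 0)
            return (ptrdiff_t)i;
    }
    return -1;
}

/* Search raw bytes as 16-bit units, starting at `byte_offset`.
   The bytes are copied into a properly aligned temporary buffer first,
   so an odd offset does not cause a misaligned access. */
ptrdiff_t find_in_bytes16(const unsigned char *buf, size_t byte_len,
                          size_t byte_offset,
                          const uint16_t *needle, size_t needle_len)
{
    if (byte_offset > byte_len)
        return -1;
    size_t units = code_units_from_bytes(byte_len - byte_offset,
                                         sizeof(uint16_t));
    if (units == 0)
        return -1;
    uint16_t *tmp = malloc(units * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    memcpy(tmp, buf + byte_offset, units * sizeof *tmp);
    ptrdiff_t pos = find_units16(tmp, units, needle, needle_len);
    free(tmp);
    return pos;
}
```

With a buffer that holds one junk byte followed by a two-unit string, `find_in_bytes16(buf, len, 0, ...)` finds nothing, while the same call with a byte offset of 1 finds the string at unit index 0. That is the "search again with the start moved by one byte" work-around from point 2.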

Also, I'm not sure whether this is explained: the search code is not Unicode-composition aware, i.e. when searching for Unicode characters with accents (or umlauts), one should perform multiple searches, one for each possible composition form.
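
As a small illustration of why this matters, the sketch below (plain C, no PCRE2 calls, helper name made up for this example) writes "ä" as UTF-16 code units in its two common forms: precomposed U+00E4, and decomposed U+0061 followed by the combining diaeresis U+0308. A comparison at the code-unit level treats them as different strings, so a pattern written in one form would not be expected to match text stored in the other.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "ä" as a single precomposed code point: U+00E4 */
const uint16_t A_UMLAUT_PRECOMPOSED[] = { 0x00E4 };

/* "ä" as base letter plus combining mark: U+0061 U+0308 */
const uint16_t A_UMLAUT_DECOMPOSED[] = { 0x0061, 0x0308 };

/* Hypothetical helper: returns 1 if both sequences consist of exactly
   the same code units, 0 otherwise. No normalization is applied. */
int same_code_units(const uint16_t *a, size_t a_len,
                    const uint16_t *b, size_t b_len)
{
    return a_len == b_len && memcmp(a, b, a_len * sizeof *a) == 0;
}
```

Calling `same_code_units` on the two arrays returns 0, even though both render as the same character. That is why either the subject has to be normalized beforehand or the search has to be repeated per form.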


The code unit rules for _16 and _32 functions need to be more clearly explained #187
