The code unit rules for _16 and _32 functions need to be more clearly explained #187

Closed
@tempelmann

After asking on the mailing list, I learned that I was wrong to assume that pcre2_match_16 takes the search buffer size in bytes. I had read https://www.pcre.org/current/doc/html/pcre2api.html#SEC15 beforehand, and still misunderstood it.

What also misled me was the statement that PCRE2_SIZE is the same as size_t, which to me clearly implies a byte size (otherwise, it could overflow for units larger than 1 byte).

May I ask that this be explained more clearly, covering these points:

  1. When using the 16-bit or 32-bit functions, the buffer length is given in code units, i.e. it is half or a quarter of the buffer's length in bytes.
  2. The 16-bit or 32-bit unichars are expected at their word alignment. That means that (a) UTF-16BE chars won't be found in a buffer containing UTF-16BE and (b) if strings are searched in binary data (with the option PCRE2_MATCH_INVALID_UTF), they also won't be found if they're not aligned (in this case, a workaround might be to search multiple times with the start offset moved by 1 or more bytes).
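
To illustrate both points, here is a minimal C sketch that does not call PCRE2 at all. The helpers `bytes_to_units16` and `find_units16` are hypothetical names made up for this example. The first shows the length conversion one would apply before passing a length to a 16-bit function. The second imitates a matcher that only looks at 16-bit code unit boundaries, so a string that starts at an odd byte offset is only found when the search is repeated with the start shifted by one byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of whole 16-bit code units in a byte buffer.
   This is the kind of value a 16-bit function expects as its length,
   not the byte count itself. */
size_t bytes_to_units16(size_t byte_len)
{
    return byte_len / sizeof(uint16_t);
}

/* Hypothetical helper: look for `needle` (host byte order) in raw bytes,
   but only at positions byte_shift, byte_shift + 2, byte_shift + 4, ...
   This mimics a search that steps through 16-bit code units.
   Returns the byte offset of the first match, or -1 if there is none. */
ptrdiff_t find_units16(const unsigned char *bytes, size_t byte_len,
                       size_t byte_shift,
                       const uint16_t *needle, size_t needle_units)
{
    assert(needle_units > 0);
    size_t needle_bytes = needle_units * sizeof(uint16_t);

    for (size_t pos = byte_shift; pos + needle_bytes <= byte_len;
         pos += sizeof(uint16_t)) {
        /* memcmp avoids unaligned uint16_t reads when byte_shift is odd. */
        if (memcmp(bytes + pos, needle, needle_bytes) == 0)
            return (ptrdiff_t)pos;
    }
    return -1;
}
```

With a needle stored one byte into the buffer, a call with `byte_shift` 0 misses it and a second call with `byte_shift` 1 finds it. With a real 16-bit matcher the shifted data would probably also have to be copied into an aligned buffer first, which this sketch sidesteps by comparing bytes.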

Also, I'm not sure whether this is explained: the search code is not Unicode-composition aware, i.e. when searching for Unicode characters with accents (or umlauts), one should perform multiple searches covering all possible composition forms.
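
A small C sketch of why the composition form matters, again without calling PCRE2: the helper `contains_units16` and the two constant arrays are hypothetical names for this example. The letter "ä" can be stored as the single precomposed code unit U+00E4 or as "a" (U+0061) followed by the combining diaeresis (U+0308), and a plain code unit comparison treats these as different strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: plain code-unit substring test, with no
   awareness of Unicode composition. Returns 1 if found, else 0. */
int contains_units16(const uint16_t *hay, size_t hay_units,
                     const uint16_t *needle, size_t needle_units)
{
    assert(hay != NULL && needle != NULL);
    if (needle_units == 0 || needle_units > hay_units)
        return 0;

    for (size_t i = 0; i + needle_units <= hay_units; i++) {
        if (memcmp(hay + i, needle, needle_units * sizeof(uint16_t)) == 0)
            return 1;
    }
    return 0;
}

/* "ä" in precomposed form: U+00E4. */
const uint16_t A_UMLAUT_PRECOMPOSED[] = { 0x00E4 };

/* "ä" in decomposed form: U+0061 followed by U+0308. */
const uint16_t A_UMLAUT_DECOMPOSED[] = { 0x0061, 0x0308 };
```

If the text holds the decomposed form, a search for the precomposed needle finds nothing, so both forms have to be searched for (or the text and the pattern normalized to the same form beforehand).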
