Skip to content

Make :utf8 an actually validating layer #23342

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: blead
Choose a base branch
from
Draft

Make :utf8 an actually validating layer #23342

wants to merge 3 commits into from

Conversation

Leont
Copy link
Contributor

@Leont Leont commented May 30, 2025

This replaces :utf8 with an actually secure layer based on :utf8_strict

This is a draft for two reasons:

  1. This does not work well with the :stdio layer. This is because of limitations in stdio that may not be fixable.
  2. It has a known bug when combined with :crlf. I think I know how to fix it, I just haven't gotten around to it.

  • This set of changes requires a perldelta entry, and it is not yet included.

Leont added 3 commits May 29, 2025 23:09
In fact, it hasn't done anything for like a decade
It will check the input for validity, by default strict validity though less strict forms are provided. This also means PerlIO::get_layers doesn't return a "utf8" pseudo-layer anymore, which can break some code making that assumption.
@jkeenan
Copy link
Contributor

jkeenan commented May 30, 2025

Can you suggest some different invocations of sh ./Configure that would either expose what this p.r. does or show the remaining problems (e.g., how would I build a perl for the :crlf problem)?


typedef struct {
PerlIOBuf buf;
STDCHAR leftovers[UTF8_MAX_BYTES];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you seek with something in leftovers?

};

#define UTF8_MAX_BYTES 4

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better bitwise math algebra formula for doing this vs the current solution. At first glance, I thought that table was a duplicate of https://github.com/Perl/perl5/blob/blead/utf8.h#L222

EXTCONST unsigned char PL_utf8skip[] = {
/* 0x00 */ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, /* ascii */

but on the 2nd examination, the table in this commit isn't identical to PL_utf8skip[]. Since the max value in this table is 0x3 or 0x01 | 0x02, or 0x4 depending how you look at it, in other words, a bit vector with 2 or 3 bits per array element will hold the info, This table is leaving alot of dead bits in this char array. My gut instinct says such an ocean of unused bits is "a way" but the wrong way to do it. Atleast pack the lengths into quads. Ideally these bits

    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 0x80-0x8F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 0x90-0x9F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 0xA0-0xAF */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 0xB0-0xBF */
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, /* 0xC0-0xCF */
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, /* 0xD0-0xDF */
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3, /* 0xE0-0xEF */
    4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0, /* 0xF0-0xFF */

can be reduced to 256 bits of const literal operands. More detailed, it can be reduced to 4 or 6 x 64bit integer constants, and 4 or 6 branches, something like if(ch <= 32) len = ( 0xU64_BITVEC_MASK >> ch) & 0x3;. I'll guess doing some SO/Google research will reveal fast and better and more clever algos invented by various ppl vs this 256 byte table.

Changing this table to something better is a very low priority, and should be done as the last thing before this draft is committed to blead as prod code. As a draft/WIP/concept, perf and speed arent the highest priority until consensus and future revisions make this PR commitable to blead.

*d++ = ' ';
}
*d = 0;
Perl_croak(aTHX_ fmt, seq);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use Perl_croak_nocontext() and delete the pTHX_ on the proto. This is an arctic cold panic fn. Its no_return branch should be as small as possible in mach code in its caller frames, since this no_return branch will never execute in reality (prod code env).

| ((U32)cur[1] << 16)
| ((U32)cur[2] << 8)
| ((U32)cur[3]);
/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use htonl() which GCC optimizes to 1 cpu on Linux since 2014 or 2008. Or use dedicated CPU intrinsics or use prewritten perl.h/handy.h tools to do this. I have a another PR open about fixing a soap opera involving htonl() vs the day 1 1993 fundamental design of the Win32/64 platform inside WinPerl.

perl.h/handy.h/Configure has almost non-existent support for ISO C compliant unaligned memory access. But GCC project devs evangelize that combining static inline new_fn() {} + memcpy(&u32, unalgn_u32, 4) is holy water and a universal C grammar synonym for safe hardware unalign memory read/writes on all possible CPU archs GCC supports.

Note, GCC will always turn memcpy(&u32, unalgn_u32, 4) into an inline intrinsic that is 0.25 CPU instructions to 1 CPU instructions long, unlike WinPerl built with MSVC. 0.25 CPU instructions means changing an operand to an existing CPU op, think of Intels lea or Intel's SIB byte or Intel's SSE's movaps vs movups.

goto check;
skip = skip_sequence(cur, end - cur);
if (eof || cur + skip < end)
goto illformed;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can if(eof) goto illformed; test be moved upwards, to happen before skip = skip_sequence(cur, end - cur); statement?

if (v < 0x10000)
v = (v & 0x3F) | (v & 0x1F00) >> 2;
else
v = (v & 0x3F) | (v & 0x1F00) >> 2 | (v & 0x0F0000) >> 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't compile this and look at the asm code, but I suspect low IQ MSVC will have 2 branches, both with identical CPUs op doing expression (v & 0x3F) | (v & 0x1F00) >> 2 so factor that out of the if else branch.

return 0;

begin = SvPV(param, len);
delim = strchr(begin, ',');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no strlen() class fn calls plz, use memchr(). memchr() can use U16/U32/U64/U128/U256 mem read/test/cmp/jmp instructions, strchr() can't advance except 1 byte at a time.

I will pretend that Apple's innovative and controversial valgrind report violating memcmp()/strlen()/etc inside OSX have never been invented or released as gold.

They first compute the next 4096 page boundary, then perform all memory reads, using U32 *s ops inside strlen(), which will cause reading of upto 3 bytes ahead of legally "uninitialized" and legally "undefined" memory, according to all valgrind style dev tools. These std libc functions with "undefined behavior", physically can't ever throw a SEGV doing these over-reads, since 4096 byte units are the smallest granularity available on all Apple and non Apple CPU archs. PS other than SH4/Sega Dreamcast's 1024 byte page size. No other CPUs in my human lifetime with a page smaller than 4096 have ever been made. Its always 4096 or higher, except for obsolete SH4.

SvPV() probably needs the _const variant here.

if (PerlIO_flush(f) != 0)
return -1;
if (PerlIOBase(f)->flags & PERLIO_F_TTY)
PerlIOBase_flush_linebuf(aTHX);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

y a 2nd flush() variant? didnt 1st one do it?

if (u->leftover_length) {
Copy(u->leftovers, b->buf, u->leftover_length, STDCHAR);
b->end = b->buf + u->leftover_length;
read_bytes = u->leftover_length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deref u->leftover_length once and cache it to a C auto. struct member u->leftover_length will prob get 2 mem reads even with -O2 here. Since the write to addr b->end, could happen to be the same addr as u->leftover_length in C abstract machine.

C<:encoding> is a generic conversion layer, that converts a file from a variety
of encodings to perl's internal encoding utf8 and vice versa. C<:utf8> is a
validation layer that checks if input data is correct UTF-8 but doesn't change
the bytestream in any way.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a sentence or phrase, spelling out, the fact that :utf8 is faster/more efficient vs :encoding needs to be added, if that is true. The speed aspect and its pros/cons are not-obvious to every possible IT guru or IT amateur person reading this POD. It took too many seconds (5-10) for me to recognize one is faster than the other reading it for the first time.

the loose matching of encoding names. Also note that currently C<:utf8> is unsafe for
input, because it accepts the data without validating that it is indeed valid
UTF-8; you should instead use C<:encoding(UTF-8)> (with or without a
hyphen).
=======
the loose matching of encoding names.
>>>>>>> f7eed95508 (Made :utf8 an actual layer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git conflict resolve failure here

Comment on lines +401 to +408
<<<<<<< HEAD
the loose matching of encoding names. Also note that currently C<:utf8> is unsafe for
input, because it accepts the data without validating that it is indeed valid
UTF-8; you should instead use C<:encoding(UTF-8)> (with or without a
hyphen).
=======
the loose matching of encoding names.
>>>>>>> f7eed95508 (Made :utf8 an actual layer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some merge conflict detritus

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah rebasing a 12 years old branch was a hell, I thought I had fixed everything but apparently not.

@guest20
Copy link

guest20 commented Jun 5, 2025

How does this change fit with the "instead of utf8 always use utf-8" rule that I've been using forever?

@Grinnz
Copy link
Contributor

Grinnz commented Jun 5, 2025

This should not affect use of any layers other than :utf8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants