Make :utf8 an actually validating layer #23342

Leont · 2025-05-30T23:38:12Z

This replaces :utf8 with an actually secure layer based on :utf8_strict

This is a draft for two reasons:

This does not work well with the :stdio layer. This is because of limitations in stdio that may not be fixable.
It has a known bug when combined with :crlf. I think I know how to fix it, I just haven't gotten around to it.

This set of changes requires a perldelta entry, and it is not yet included.

In fact, it hasn't done anything for like a decade

It will check the input for validity, by default strict validity though less strict forms are provided. This also means PerlIO::get_layers doesn't return a "utf8" pseudo-layer anymore, which can break some code making that assumption.

jkeenan · 2025-05-30T23:41:54Z

Can you suggest some different invocations of sh ./Configure that would either expose what this p.r. does or show the remaining problems (e.g., how would I build a perl for the :crlf problem)?

tonycoz · 2025-06-04T23:31:05Z

perlio.c

+
+typedef struct {
+    PerlIOBuf buf;
+    STDCHAR leftovers[UTF8_MAX_BYTES];


What happens if you seek with something in leftovers?

bulk88 · 2025-06-05T00:44:44Z

perlio.c

+};
+
+#define UTF8_MAX_BYTES 4
+


Is there a better bitwise math algebra formula for doing this vs the current solution. At first glance, I thought that table was a duplicate of https://github.com/Perl/perl5/blob/blead/utf8.h#L222

EXTCONST unsigned char PL_utf8skip[] = { /* 0x00 */ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, /* ascii */

but on the 2nd examination, the table in this commit isn't identical to PL_utf8skip[]. Since the max value in this table is 0x3 or 0x01 | 0x02, or 0x4 depending how you look at it, in other words, a bit vector with 2 or 3 bits per array element will hold the info, This table is leaving alot of dead bits in this char array. My gut instinct says such an ocean of unused bits is "a way" but the wrong way to do it. Atleast pack the lengths into quads. Ideally these bits

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 0x80-0x8F */ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 0x90-0x9F */ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 0xA0-0xAF */ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 0xB0-0xBF */ 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, /* 0xC0-0xCF */ 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, /* 0xD0-0xDF */ 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3, /* 0xE0-0xEF */ 4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0, /* 0xF0-0xFF */

can be reduced to 256 bits of const literal operands. More detailed, it can be reduced to 4 or 6 x 64bit integer constants, and 4 or 6 branches, something like if(ch <= 32) len = ( 0xU64_BITVEC_MASK >> ch) & 0x3;. I'll guess doing some SO/Google research will reveal fast and better and more clever algos invented by various ppl vs this 256 byte table.

Changing this table to something better is a very low priority, and should be done as the last thing before this draft is committed to blead as prod code. As a draft/WIP/concept, perf and speed arent the highest priority until consensus and future revisions make this PR commitable to blead.

bulk88 · 2025-06-05T00:49:12Z

perlio.c

+	    *d++ = ' ';
+    }
+    *d = 0;
+    Perl_croak(aTHX_ fmt, seq);


use Perl_croak_nocontext() and delete the pTHX_ on the proto. This is an arctic cold panic fn. Its no_return branch should be as small as possible in mach code in its caller frames, since this no_return branch will never execute in reality (prod code env).

bulk88 · 2025-06-05T01:05:31Z

perlio.c

+		  | ((U32)cur[1] << 16)
+		  | ((U32)cur[2] <<  8)
+		  | ((U32)cur[3]);
+		/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */


use htonl() which GCC optimizes to 1 cpu on Linux since 2014 or 2008. Or use dedicated CPU intrinsics or use prewritten perl.h/handy.h tools to do this. I have a another PR open about fixing a soap opera involving htonl() vs the day 1 1993 fundamental design of the Win32/64 platform inside WinPerl.

perl.h/handy.h/Configure has almost non-existent support for ISO C compliant unaligned memory access. But GCC project devs evangelize that combining static inline new_fn() {} + memcpy(&u32, unalgn_u32, 4) is holy water and a universal C grammar synonym for safe hardware unalign memory read/writes on all possible CPU archs GCC supports.

Note, GCC will always turn memcpy(&u32, unalgn_u32, 4) into an inline intrinsic that is 0.25 CPU instructions to 1 CPU instructions long, unlike WinPerl built with MSVC. 0.25 CPU instructions means changing an operand to an existing CPU op, think of Intels lea or Intel's SIB byte or Intel's SSE's movaps vs movups.

bulk88 · 2025-06-05T01:07:23Z

perlio.c

+	    goto check;
+	skip = skip_sequence(cur, end - cur);
+	if (eof || cur + skip < end)
+	    goto illformed;


can if(eof) goto illformed; test be moved upwards, to happen before skip = skip_sequence(cur, end - cur); statement?

bulk88 · 2025-06-05T01:10:42Z

perlio.c

+    if (v < 0x10000)
+	v = (v & 0x3F) | (v & 0x1F00) >> 2;
+    else
+	v = (v & 0x3F) | (v & 0x1F00) >> 2 | (v & 0x0F0000) >> 4;


I didn't compile this and look at the asm code, but I suspect low IQ MSVC will have 2 branches, both with identical CPUs op doing expression (v & 0x3F) | (v & 0x1F00) >> 2 so factor that out of the if else branch.

bulk88 · 2025-06-05T01:43:24Z

perlio.c

+	return 0;
+
+    begin = SvPV(param, len);
+    delim = strchr(begin, ',');


no strlen() class fn calls plz, use memchr(). memchr() can use U16/U32/U64/U128/U256 mem read/test/cmp/jmp instructions, strchr() can't advance except 1 byte at a time.

I will pretend that Apple's innovative and controversial valgrind report violating memcmp()/strlen()/etc inside OSX have never been invented or released as gold.

They first compute the next 4096 page boundary, then perform all memory reads, using U32 *s ops inside strlen(), which will cause reading of upto 3 bytes ahead of legally "uninitialized" and legally "undefined" memory, according to all valgrind style dev tools. These std libc functions with "undefined behavior", physically can't ever throw a SEGV doing these over-reads, since 4096 byte units are the smallest granularity available on all Apple and non Apple CPU archs. PS other than SH4/Sega Dreamcast's 1024 byte page size. No other CPUs in my human lifetime with a page smaller than 4096 have ever been made. Its always 4096 or higher, except for obsolete SH4.

SvPV() probably needs the _const variant here.

bulk88 · 2025-06-05T01:46:13Z

perlio.c

+    if (PerlIO_flush(f) != 0)
+	return -1;
+    if (PerlIOBase(f)->flags & PERLIO_F_TTY)
+	PerlIOBase_flush_linebuf(aTHX);


y a 2nd flush() variant? didnt 1st one do it?

bulk88 · 2025-06-05T01:49:38Z

perlio.c

+    if (u->leftover_length) {
+	Copy(u->leftovers, b->buf, u->leftover_length, STDCHAR);
+	b->end = b->buf + u->leftover_length;
+	read_bytes = u->leftover_length;


deref u->leftover_length once and cache it to a C auto. struct member u->leftover_length will prob get 2 mem reads even with -O2 here. Since the write to addr b->end, could happen to be the same addr as u->leftover_length in C abstract machine.

bulk88 · 2025-06-05T01:56:33Z

pod/perlunifaq.pod

+C<:encoding> is a generic conversion layer, that converts a file from a variety
+of encodings to perl's internal encoding utf8 and vice versa. C<:utf8> is a
+validation layer that checks if input data is correct UTF-8 but doesn't change
+the bytestream in any way.


I think a sentence or phrase, spelling out, the fact that :utf8 is faster/more efficient vs :encoding needs to be added, if that is true. The speed aspect and its pros/cons are not-obvious to every possible IT guru or IT amateur person reading this POD. It took too many seconds (5-10) for me to recognize one is faster than the other reading it for the first time.

bulk88 · 2025-06-05T01:56:58Z

pod/perluniintro.pod

 the loose matching of encoding names. Also note that currently C<:utf8> is unsafe for
 input, because it accepts the data without validating that it is indeed valid
 UTF-8; you should instead use C<:encoding(UTF-8)> (with or without a
 hyphen).
+=======
+the loose matching of encoding names.
+>>>>>>> f7eed95508 (Made :utf8 an actual layer)


git conflict resolve failure here

tonycoz · 2025-06-05T06:03:32Z

pod/perluniintro.pod

+<<<<<<< HEAD
 the loose matching of encoding names. Also note that currently C<:utf8> is unsafe for
 input, because it accepts the data without validating that it is indeed valid
 UTF-8; you should instead use C<:encoding(UTF-8)> (with or without a
 hyphen).
+=======
+the loose matching of encoding names.
+>>>>>>> f7eed95508 (Made :utf8 an actual layer)


some merge conflict detritus

Yeah rebasing a 12 years old branch was a hell, I thought I had fixed everything but apparently not.

guest20 · 2025-06-05T09:32:03Z

How does this change fit with the "instead of utf8 always use utf-8" rule that I've been using forever?

Grinnz · 2025-06-05T09:39:12Z

This should not affect use of any layers other than :utf8.

Leont added 3 commits May 29, 2025 23:09

Deprecate PERLIO_K_DUMMY as it doesn't do anything

ffc2a04

In fact, it hasn't done anything for like a decade

Make :bytes actually pop off non-byte layers

ed4d129

Made :utf8 an actual layer

841de31

It will check the input for validity, by default strict validity though less strict forms are provided. This also means PerlIO::get_layers doesn't return a "utf8" pseudo-layer anymore, which can break some code making that assumption.

tonycoz reviewed Jun 4, 2025

View reviewed changes

bulk88 suggested changes Jun 5, 2025

View reviewed changes

tonycoz reviewed Jun 5, 2025

View reviewed changes

		};

		#define UTF8_MAX_BYTES 4

Make :utf8 an actually validating layer #23342

Are you sure you want to change the base?

Make :utf8 an actually validating layer #23342

Uh oh!

Conversation

Leont commented May 30, 2025

Uh oh!

jkeenan commented May 30, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

guest20 commented Jun 5, 2025

Uh oh!

Grinnz commented Jun 5, 2025

Uh oh!

Uh oh!