-
Notifications
You must be signed in to change notification settings - Fork 584
Make :utf8 an actually validating layer #23342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: blead
Are you sure you want to change the base?
Conversation
In fact, it hasn't done anything for like a decade
It will check the input for validity, by default strict validity though less strict forms are provided. This also means PerlIO::get_layers doesn't return a "utf8" pseudo-layer anymore, which can break some code making that assumption.
Can you suggest some different invocations of |
|
||
typedef struct { | ||
PerlIOBuf buf; | ||
STDCHAR leftovers[UTF8_MAX_BYTES]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if you seek with something in leftovers?
}; | ||
|
||
#define UTF8_MAX_BYTES 4 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better bitwise math algebra formula for doing this vs the current solution. At first glance, I thought that table was a duplicate of https://github.com/Perl/perl5/blob/blead/utf8.h#L222
EXTCONST unsigned char PL_utf8skip[] = {
/* 0x00 */ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, /* ascii */
but on the 2nd examination, the table in this commit isn't identical to PL_utf8skip[]
. Since the max value in this table is 0x3
or 0x01 | 0x02
, or 0x4
depending how you look at it, in other words, a bit vector with 2 or 3 bits per array element will hold the info, This table is leaving alot of dead bits in this char array. My gut instinct says such an ocean of unused bits is "a way" but the wrong way to do it. Atleast pack the lengths into quads. Ideally these bits
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 0x80-0x8F */
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 0x90-0x9F */
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 0xA0-0xAF */
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, /* 0xB0-0xBF */
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, /* 0xC0-0xCF */
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, /* 0xD0-0xDF */
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3, /* 0xE0-0xEF */
4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0, /* 0xF0-0xFF */
can be reduced to 256 bits of const literal operands. More detailed, it can be reduced to 4 or 6 x 64bit integer constants, and 4 or 6 branches, something like if(ch <= 32) len = ( 0xU64_BITVEC_MASK >> ch) & 0x3;
. I'll guess doing some SO/Google research will reveal fast and better and more clever algos invented by various ppl vs this 256 byte table.
Changing this table to something better is a very low priority, and should be done as the last thing before this draft is committed to blead as prod code. As a draft/WIP/concept, perf and speed arent the highest priority until consensus and future revisions make this PR commitable to blead.
*d++ = ' '; | ||
} | ||
*d = 0; | ||
Perl_croak(aTHX_ fmt, seq); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use Perl_croak_nocontext()
and delete the pTHX_
on the proto. This is an arctic cold panic fn. Its no_return branch should be as small as possible in mach code in its caller frames, since this no_return branch will never execute in reality (prod code env).
| ((U32)cur[1] << 16) | ||
| ((U32)cur[2] << 8) | ||
| ((U32)cur[3]); | ||
/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use htonl()
which GCC optimizes to 1 cpu on Linux since 2014 or 2008. Or use dedicated CPU intrinsics or use prewritten perl.h
/handy.h
tools to do this. I have a another PR open about fixing a soap opera involving htonl()
vs the day 1 1993 fundamental design of the Win32/64 platform inside WinPerl.
perl.h/handy.h/Configure has almost non-existent support for ISO C compliant unaligned memory access. But GCC project devs evangelize that combining static inline new_fn() {}
+ memcpy(&u32, unalgn_u32, 4)
is holy water and a universal C grammar synonym for safe hardware unalign memory read/writes on all possible CPU archs GCC supports.
Note, GCC will always turn memcpy(&u32, unalgn_u32, 4)
into an inline intrinsic that is 0.25 CPU instructions to 1 CPU instructions long, unlike WinPerl built with MSVC. 0.25 CPU instructions means changing an operand to an existing CPU op, think of Intels lea
or Intel's SIB byte or Intel's SSE's movaps
vs movups
.
goto check; | ||
skip = skip_sequence(cur, end - cur); | ||
if (eof || cur + skip < end) | ||
goto illformed; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can if(eof) goto illformed;
test be moved upwards, to happen before skip = skip_sequence(cur, end - cur);
statement?
if (v < 0x10000) | ||
v = (v & 0x3F) | (v & 0x1F00) >> 2; | ||
else | ||
v = (v & 0x3F) | (v & 0x1F00) >> 2 | (v & 0x0F0000) >> 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't compile this and look at the asm code, but I suspect low IQ MSVC will have 2 branches, both with identical CPUs op doing expression (v & 0x3F) | (v & 0x1F00) >> 2
so factor that out of the if else branch.
return 0; | ||
|
||
begin = SvPV(param, len); | ||
delim = strchr(begin, ','); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no strlen()
class fn calls plz, use memchr()
. memchr()
can use U16/U32/U64/U128/U256 mem read/test/cmp/jmp instructions, strchr()
can't advance except 1 byte at a time.
I will pretend that Apple's innovative and controversial valgrind report violating memcmp()
/strlen()
/etc inside OSX have never been invented or released as gold.
They first compute the next 4096 page boundary, then perform all memory reads, using U32 *s ops inside strlen()
, which will cause reading of upto 3 bytes ahead of legally "uninitialized" and legally "undefined" memory, according to all valgrind style dev tools. These std libc functions with "undefined behavior", physically can't ever throw a SEGV doing these over-reads, since 4096 byte units are the smallest granularity available on all Apple and non Apple CPU archs. PS other than SH4/Sega Dreamcast's 1024 byte page size. No other CPUs in my human lifetime with a page smaller than 4096 have ever been made. Its always 4096 or higher, except for obsolete SH4.
SvPV()
probably needs the _const
variant here.
if (PerlIO_flush(f) != 0) | ||
return -1; | ||
if (PerlIOBase(f)->flags & PERLIO_F_TTY) | ||
PerlIOBase_flush_linebuf(aTHX); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
y a 2nd flush() variant? didnt 1st one do it?
if (u->leftover_length) { | ||
Copy(u->leftovers, b->buf, u->leftover_length, STDCHAR); | ||
b->end = b->buf + u->leftover_length; | ||
read_bytes = u->leftover_length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deref u->leftover_length
once and cache it to a C auto. struct member u->leftover_length
will prob get 2 mem reads even with -O2
here. Since the write to addr b->end
, could happen to be the same addr as u->leftover_length
in C abstract machine.
C<:encoding> is a generic conversion layer, that converts a file from a variety | ||
of encodings to perl's internal encoding utf8 and vice versa. C<:utf8> is a | ||
validation layer that checks if input data is correct UTF-8 but doesn't change | ||
the bytestream in any way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a sentence or phrase, spelling out, the fact that :utf8
is faster/more efficient vs :encoding
needs to be added, if that is true. The speed aspect and its pros/cons are not-obvious to every possible IT guru or IT amateur person reading this POD. It took too many seconds (5-10) for me to recognize one is faster than the other reading it for the first time.
the loose matching of encoding names. Also note that currently C<:utf8> is unsafe for | ||
input, because it accepts the data without validating that it is indeed valid | ||
UTF-8; you should instead use C<:encoding(UTF-8)> (with or without a | ||
hyphen). | ||
======= | ||
the loose matching of encoding names. | ||
>>>>>>> f7eed95508 (Made :utf8 an actual layer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
git conflict resolve failure here
<<<<<<< HEAD | ||
the loose matching of encoding names. Also note that currently C<:utf8> is unsafe for | ||
input, because it accepts the data without validating that it is indeed valid | ||
UTF-8; you should instead use C<:encoding(UTF-8)> (with or without a | ||
hyphen). | ||
======= | ||
the loose matching of encoding names. | ||
>>>>>>> f7eed95508 (Made :utf8 an actual layer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some merge conflict detritus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah rebasing a 12 years old branch was a hell, I thought I had fixed everything but apparently not.
How does this change fit with the "instead of utf8 always use utf-8" rule that I've been using forever? |
This should not affect use of any layers other than |
This replaces
:utf8
with an actually secure layer based on:utf8_strict
This is a draft for two reasons: