-
-
Notifications
You must be signed in to change notification settings - Fork 739
Fix Issue 18384 - std.net.isemail is slow to import due to regex #6129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for your pull request, @n8sh! We are looking forward to reviewing it, and you should be hearing from a maintainer soon. Some tips to help speed things up:
Bear in mind that large or tricky changes may require multiple rounds of review and revision. Please see CONTRIBUTING.md for more information. Bugzilla references
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple nits. Looks good except for that.
if (end >= 2 && s[end-2] == '.') | ||
start = end - 2; | ||
else if (end >= 3 && s[end-3] == '.') | ||
start = end - 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is untested, try an IP address with 2 digits in an octet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
std/net/isemail.d
Outdated
import std.ascii : isHexDigit; | ||
if (s.length > 4) return false; | ||
foreach (i; 0 .. s.length) | ||
if (!isHexDigit(s[i])) return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foreach (c; s)
if (!isHexDigit(c)) return false;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although that is nicer syntax, it would auto-decode which is not necessary (all characters I'm searching for are <= 0x7F) and would be a performance hit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foreach (c; s)
doesn't do auto-decoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does if c
is dchar
and s
is a narrow string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake: I thought c
would default to dchar
unless given a specific type. Changed to use foreach
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure? https://run.dlang.io/is/UGY8jP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought c would default to dchar unless given a specific type.
This is a common misconception -- because Phobos treats arrays of char
weirdly. The language doesn't, it's still an array to the language.
0ac5260
to
4948cbb
Compare
std/net/isemail.d
Outdated
+/ | ||
private static const(Char)[] matchIPSuffix(Char)(return const Char[] s) | ||
{ | ||
if (s.length < "0.0.0.0".length) return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (s.length < 7)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the work of moving forward and getting rid of the slow and bulky RegExp.
However, I would prefer if we don't reinvent parsing and just use the existing and testing std.conv
. I added a few comments and ideas.
else if (end >= 4 && s[end-4] == '.') | ||
start = end - 4; | ||
else | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
start = end - s.byCodeUnit.retro.take(4).until(".").count;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do even one better:
int blocks = s.byCodeUnit
.splitter(".")
.filter!checkValid // check an individual block
.take(4) // avoids problems with infinite strings
.walkLength;
if (blocks < 4) return null;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think that does the right thing. The blocks need to be consecutive and they need to be the tail of the string, and we need the total length of the tail.
uint c = cast(uint) s[i] - '0'; | ||
if (c > 9) return null; | ||
x = x * 10 + c; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint x = s[start + 1 .. end].byCodeUnit.take(2).to!uint.ifThrown(256);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not @nogc
and not nothrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: @nogc
isn't a requirement for Phobos (we prefer nice, readable code instead of mir). In this case it will be fixed once the already merged -dip1008
gets activated by default.
Also the previous matchAll
isn't neither @nogc
nor nothrow
either (and neither is the code which uses it).
Regarding nothrow
, how about scope(failure) return null;
?
That would save even more lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While making things @nogc nothrow
is nice when ever possible, in this specific instance isEmail
is not @nogc nothrow
, so this change will have no effect on the user and there's no benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about converting to ubyte
? then let to
do the work of making sure the octet is in range.
FWIW, I like the idea of tagging internal functions @nogc
and nothrow
, even if the users aren't, because it helps keep GC usage down. Otherwise, little fixes that may add an unnecessary allocation creep in here and there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this specific instance
isEmail
is not@nogc nothrow
It probably should be but I'll leave that for another pull request. It looks like isEmail
is incorrectly using Exceptions rather than assert() to debug logic errors in the function itself (see forum discussion).
--start; | ||
x += 100 * (cast(uint) s[start] - '0'); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s[max(0, end - 3) .. $].byCodeUnit.to!uint.ifThrown(256);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't use this here because unlike above I don't know where the encoded octet starts.
std/net/isemail.d
Outdated
// (TO DETERMINE: is the definition of "word character" ASCII only?) | ||
if (start == 0) return s; | ||
const b = cast(uint) s[start - 1]; | ||
if (b - 'A' < 26 || b - 'a' < 26 || b - '0' < 10 || b == '_') return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use isAlphaNum
from std.ascii
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
std/net/isemail.d
Outdated
so we can return `const(Char)[]` instead of `const(Char)[][]` using a | ||
zero-length string to indicate no match. | ||
+/ | ||
private static const(Char)[] matchIPSuffix(Char)(return const Char[] s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for the use of return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static
is pointless, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add pure nothrow @safe @nogc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
std/net/isemail.d
Outdated
private static const(Char)[] matchIPSuffix(Char)(return const Char[] s) | ||
{ | ||
if (s.length < 7) return null; | ||
size_t end = s.length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: you could now move end
one line above and do if (end < 7) ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
I know it's tempting here to find the "best way" to parse an ip address. When I first saw the function, I spent about 15 minutes writing an alternative one as a suggestion that looked more at the text pattern rather than creating a uint out of the numbers. After a while, I just deleted it, since it's really not insanely complex either way, and the end result isn't different enough to warrant changes to this PR. I think probably we can pull this as-is, and improve the internal details later. I'd like to get the regex problem solved. |
Can't you just manually include the D code generated and mixin()ed by the ctRegex? Leave the ctRegex version(none)ed out to indicate where the code comes from, and to reenable it once the performance situation changes. |
{ | ||
uint c = cast(uint) s[i] - '0'; | ||
if (c > 9) return null; | ||
x = x * 10 + c; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to check for integer overflow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't overflow, we're summing a maximum of 3 digits.
7c42c19
to
f2f1c32
Compare
Yes, this makes sense. @n8sh There's a debug flag in std.regex which will print out the code. Can you see which one is faster? |
@luismarques @JackStouffer I've run a comparison and the new code is dramatically faster. Time to extract either an empty string or an IP address suffix from "addressLiteral" in a list of 32 strings, repeated 10_000 times:
The "addressLiteral" strings used for the benchmark were captured from the current std.net.isemail unittests plus the unittest added by this pull request. I can post the benchmark code if you suggest a convenient place for it (it's a bit long for this comment). |
Note that the way the ctRegex is used in the current code -- which is how I benchmarked it -- includes dynamic memory allocation. It is effectively: auto matchesIp = addressLiteral.matchAll(ipRegex).map!(a => a.hit).array;
string ipSuffix = matchesIp.empty ? null : matchesIp.front; |
@n8sh You can post it here via the following
|
@JackStouffer Thanks. Benchmark source code// Strings taken from current unittests.
immutable string[32] address_literal_dataset =
[
`IPv6:1111:2222:3333:4444:5555:6666::8888`,
`255.255.255.255`,
`255.255.255`,
`255.255.255.255.255`,
`255.255.255.256`,
`1111:2222:3333:4444:5555:6666:7777:8888`,
`IPv6:1111:2222:3333:4444:5555:6666:7777`,
`IPv6:1111:2222:3333:4444:5555:6666:7777:8888`,
`IPv6:1111:2222:3333:4444:5555:6666:7777:8888:9999`,
`IPv6:1111:2222:3333:4444:5555:6666:7777:888G`,
`IPv6:1111:2222:3333:4444:5555:6666::8888`,
`IPv6:1111:2222:3333:4444:5555::8888`,
`IPv6:1111:2222:3333:4444:5555:6666::7777:8888`,
`IPv6::3333:4444:5555:6666:7777:8888`,
`IPv6:::3333:4444:5555:6666:7777:8888`,
`IPv6:1111::4444:5555::8888`,
`IPv6:::`,
`IPv6:1111:2222:3333:4444:5555:255.255.255.255`,
`IPv6:1111:2222:3333:4444:5555:6666:255.255.255.255`,
`IPv6:1111:2222:3333:4444:5555:6666:7777:255.255.255.255`,
`IPv6:1111:2222:3333:4444::255.255.255.255`,
`IPv6:1111:2222:3333:4444:5555:6666::255.255.255.255`,
`IPv6:1111:2222:3333:4444:::255.255.255.255`,
`IPv6::255.255.255.255`,
`255.255.255.255`,
`RFC-5322-domain-literal`,
`RFC-5322`,
`RFC5322domainliteral`,
`RFC-5322-domain-literal`,
`IPv6:1::2:`,
`255.255.255.255`,
`babaev 176.16.0.1`
];
void main(string[] args)
{
import std.stdio : writeln;
import std.datetime.stopwatch : AutoStart, StopWatch;
enum iterations = 10_000;
auto test_data = address_literal_dataset[];
size_t output;
StopWatch sw = StopWatch(AutoStart.no);
// Method 1
sw.reset();
output = 0;
sw.start();
foreach_reverse (_; 0.. iterations)
{
foreach (address_literal; test_data)
output += method1(address_literal).length;
}
sw.stop();
writeln("method1 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");
sw.reset();
output = 0;
sw.start();
foreach_reverse (_; 0.. iterations)
{
foreach (address_literal; test_data)
output += method1(address_literal).length;
}
sw.stop();
writeln("method1 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");
// Method 2
sw.reset();
output = 0;
sw.start();
foreach_reverse (_; 0.. iterations)
{
foreach (address_literal; test_data)
output += method2(address_literal).length;
}
sw.stop();
writeln("method2 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");
sw.reset();
output = 0;
sw.start();
foreach_reverse (_; 0.. iterations)
{
foreach (address_literal; test_data)
output += method2(address_literal).length;
}
sw.stop();
writeln("method2 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");
// Method 1 again
sw.reset();
output = 0;
sw.start();
foreach_reverse (_; 0.. iterations)
{
foreach (address_literal; test_data)
output += method1(address_literal).length;
}
sw.stop();
writeln("method1 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");
sw.reset();
output = 0;
sw.start();
foreach_reverse (_; 0.. iterations)
{
foreach (address_literal; test_data)
output += method1(address_literal).length;
}
sw.stop();
writeln("method1 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");
}
pragma(inline, false)
const(Char)[] method1(Char)(const(Char)[] s)
{
import std.algorithm.iteration : map;
import std.array : array, split;
import std.range.primitives : empty, front;
import std.regex : ctRegex, matchAll;
static ipRegex = ctRegex!(`\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}`~
`(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$`);
auto matchesIp = s.matchAll(ipRegex).map!(a => a.hit).array;
return matchesIp.empty ? null : matchesIp.front;
}
pragma(inline, false)
private const(Char)[] method2(Char)(return const(Char)[] s)
{
size_t end = s.length;
if (end < 7) return null;
// Check the first three `[.]\d{1,3}`
foreach (_; 0 .. 3)
{
size_t start = void;
if (end >= 2 && s[end-2] == '.')
start = end - 2;
else if (end >= 3 && s[end-3] == '.')
start = end - 3;
else if (end >= 4 && s[end-4] == '.')
start = end - 4;
else
return null;
uint x = 0;
foreach (i; start + 1 .. end)
{
uint c = cast(uint) s[i] - '0';
if (c > 9) return null;
x = x * 10 + c;
}
if (x > 255) return null;
end = start;
}
// Check the final `\d{1,3}`.
if (end < 1) return null;
size_t start = end - 1;
uint x = cast(uint) s[start] - '0';
if (x > 9) return null;
if (start > 0 && cast(uint) s[start-1] - '0' <= 9)
{
--start;
x += 10 * (cast(uint) s[start] - '0');
if (start > 0 && cast(uint) s[start-1] - '0' <= 9)
{
--start;
x += 100 * (cast(uint) s[start] - '0');
}
}
if (x > 255) return null;
// Must either be at start of string or preceded by a non-word character.
// (TO DETERMINE: is the definition of "word character" ASCII only?)
if (start == 0) return s;
const b = s[start - 1];
import std.ascii : isAlphaNum;
if (isAlphaNum(b) || b == '_') return null;
return s[start .. $];
} |
Solution is to remove regex from std.net.isemail. May be worth revisiting if std.regex compile times improve.
e5521a9
to
06e4030
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good with this, thanks
{ | ||
import std.ascii : isHexDigit; | ||
if (s.length > 4) return false; | ||
foreach (c; s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return s.length <= 4 && s.all!isHexDigit;
But that would need std.algorithm.searching
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this would involve auto-decoding I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.byCodeUnit
- no autodecoding and in @nogc nothrow pure @safe
I think this is in a good enough state, we can have further PRs to include uses of Phobos and nifty range pipelines as improvements. Given the two approvals, I'll pull. |
Solution is to remove regex from std.net.isemail. May be worth revisiting if std.regex compile times improve. On my machine making the change decreased the time measured by @wilzbach's import benchmark script from 0.15 seconds to 0.03 seconds (consistently over repeated trials).
Link to mention of this issue on the forums.