Fix Issue 18384 - std.net.isemail is slow to import due to regex #6129

n8sh · 2018-02-06T13:47:55Z

Solution is to remove regex from std.net.isemail. May be worth revisiting if std.regex compile times improve. On my machine making the change decreased the time measured by @wilzbach's import benchmark script from 0.15 seconds to 0.03 seconds (consistently over repeated trials).

Link to mention of this issue on the forums.

dlang-bot · 2018-02-06T13:47:56Z

Thanks for your pull request, @n8sh! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.

Some tips to help speed things up:

- smaller, focused PRs are easier to review than big ones
- try not to mix up refactoring or style changes with bug fixes or feature enhancements
- provide helpful commit messages explaining the rationale behind each change

Bear in mind that large or tricky changes may require multiple rounds of review and revision.

Please see CONTRIBUTING.md for more information.

Bugzilla references

| Auto-close | Bugzilla | Description |
|---|---|---|
| ✓ | 18384 | std.net.isemail is slow to import due to regex |

schveiguy

Couple nits. Looks good except for that.

schveiguy · 2018-02-06T15:09:44Z

std/net/isemail.d

+        if (end >= 2 && s[end-2] == '.')
+            start = end - 2;
+        else if (end >= 3 && s[end-3] == '.')
+            start = end - 3;


This line is untested, try an IP address with 2 digits in an octet.

schveiguy · 2018-02-06T15:33:59Z

std/net/isemail.d

+    import std.ascii : isHexDigit;
+    if (s.length > 4) return false;
+    foreach (i; 0 .. s.length)
+        if (!isHexDigit(s[i])) return false;


foreach (c; s) if (!isHexDigit(c)) return false;

Although that is nicer syntax, it would auto-decode which is not necessary (all characters I'm searching for are <= 0x7F) and would be a performance hit.

foreach (c; s) doesn't do auto-decoding.

It does if c is dchar and s is a narrow string.

My mistake: I thought c would default to dchar unless given a specific type. Changed to use foreach.

Are you sure? https://run.dlang.io/is/UGY8jP

I thought c would default to dchar unless given a specific type.

This is a common misconception -- because Phobos treats arrays of char weirdly. The language doesn't, it's still an array to the language.
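The distinction discussed above can be checked directly. The following is a minimal sketch, not code from the pull request; the helper names `countCodeUnits` and `countCodePoints` are made up for illustration. It shows that a plain `foreach (c; s)` over a `string` visits one-byte code units, that decoding only happens when the loop variable is declared as `dchar`, and that the range primitives in Phobos are what treat `string` as a range of `dchar`.

```d
import std.range.primitives : ElementType;

// Phobos range primitives present a string as a range of dchar (auto-decoding).
static assert(is(ElementType!string == dchar));

/// Hypothetical helper: counts UTF-8 code units with a plain foreach.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s)
    {
        // The language infers a one-byte char here, so nothing is decoded.
        static assert(c.sizeof == 1);
        ++n;
    }
    return n;
}

/// Hypothetical helper: counts code points by asking foreach for dchar.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s) // decoding happens only because dchar is requested
        ++n;
    return n;
}

unittest
{
    // "h" is one code unit, U+00E9 is two code units in UTF-8.
    assert(countCodeUnits("h\u00E9") == 3);
    assert(countCodePoints("h\u00E9") == 2);
}
```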

JackStouffer · 2018-02-06T14:35:11Z

std/net/isemail.d

+/
+private static const(Char)[] matchIPSuffix(Char)(return const Char[] s)
+{
+    if (s.length < "0.0.0.0".length) return null;


if (s.length < 7)

wilzbach

Thanks a lot for the work of moving forward and getting rid of the slow and bulky RegExp.
However, I would prefer that we don't reinvent parsing and just use the existing, tested std.conv. I added a few comments and ideas.

wilzbach · 2018-02-06T16:11:48Z

std/net/isemail.d

+        else if (end >= 4 && s[end-4] == '.')
+            start = end - 4;
+        else
+            return null;


How about:

start = end - s.byCodeUnit.retro.take(4).until(".").count;

You can do even one better:

int blocks = s.byCodeUnit
    .splitter(".")
    .filter!checkValid // check an individual block
    .take(4) // avoids problems with infinite strings
    .walkLength;
if (blocks < 4) return null;

Don't think that does the right thing. The blocks need to be consecutive and they need to be the tail of the string, and we need the total length of the tail.

wilzbach · 2018-02-06T16:15:07Z

std/net/isemail.d

+            uint c = cast(uint) s[i] - '0';
+            if (c > 9) return null;
+            x = x * 10 + c;
+        }


uint x = s[start + 1 .. end].byCodeUnit.take(2).to!uint.ifThrown(256);

not @nogc and not nothrow

FYI: @nogc isn't a requirement for Phobos (we prefer nice, readable code instead of mir). In this case it will be fixed once the already merged -dip1008 gets activated by default.
Also, the previous matchAll is neither @nogc nor nothrow (and neither is the code which uses it).

Regarding nothrow, how about scope(failure) return null;?
That would save even more lines.

While making things @nogc nothrow is nice whenever possible, in this specific instance isEmail is not @nogc nothrow, so this change will have no effect on the user and there's no benefit.

What about converting to ubyte? Then let `to` do the work of making sure the octet is in range.

FWIW, I like the idea of tagging internal functions @nogc and nothrow, even if the users aren't, because it helps keep GC usage down. Otherwise, little fixes that may add an unnecessary allocation creep in here and there.

in this specific instance isEmail is not @nogc nothrow

It probably should be but I'll leave that for another pull request. It looks like isEmail is incorrectly using Exceptions rather than assert() to debug logic errors in the function itself (see forum discussion).
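As a rough illustration of the `to!ubyte` idea raised in this thread, here is a sketch under stated assumptions rather than the code that was merged. `parseOctet` is a hypothetical helper that only handles an octet whose bounds are already known. It relies on `std.conv.to` throwing a `ConvException` (or its subclass `ConvOverflowException`) for out-of-range input, which is also why this variant is neither `nothrow` nor `@nogc`.

```d
import std.ascii : isDigit;
import std.conv : ConvException, to;

/// Hypothetical helper: returns the octet value (0 .. 255), or -1 if `s`
/// is not one to three decimal digits that fit in a ubyte.
int parseOctet(const(char)[] s)
{
    if (s.length == 0 || s.length > 3) return -1;
    // Reject anything that is not a plain ASCII digit up front.
    foreach (c; s)
        if (!isDigit(c)) return -1;
    try
    {
        // `to` performs the range check: values above 255 throw.
        return s.to!ubyte;
    }
    catch (ConvException)
    {
        return -1;
    }
}

unittest
{
    assert(parseOctet("0") == 0);
    assert(parseOctet("255") == 255);
    assert(parseOctet("256") == -1);
}
```

The trade-off matches the discussion: the conversion is shorter to read, but the failure path allocates an exception, whereas the hand-written loop in the diff can carry `@nogc nothrow`.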

wilzbach · 2018-02-06T16:17:15Z

std/net/isemail.d

+            --start;
+            x += 100 * (cast(uint) s[start] - '0');
+        }
+    }


s[max(0, end - 3) .. $].byCodeUnit.to!uint.ifThrown(256);

Can't use this here because unlike above I don't know where the encoded octet starts.

wilzbach · 2018-02-06T16:19:05Z

std/net/isemail.d

+    // (TO DETERMINE: is the definition of "word character" ASCII only?)
+    if (start == 0) return s;
+    const b = cast(uint) s[start - 1];
+    if (b - 'A' < 26 || b - 'a' < 26 || b - '0' < 10 || b == '_') return null;


Use isAlphaNum from std.ascii

wilzbach · 2018-02-06T16:23:46Z

std/net/isemail.d

+so we can return `const(Char)[]` instead of `const(Char)[][]` using a
+zero-length string to indicate no match.
+/
+private static const(Char)[] matchIPSuffix(Char)(return const Char[] s)


👍 for the use of return

static is pointless, though.

add pure nothrow @safe @nogc

wilzbach · 2018-02-06T16:24:38Z

std/net/isemail.d

+private static const(Char)[] matchIPSuffix(Char)(return const Char[] s)
+{
+    if (s.length < 7) return null;
+    size_t end = s.length;


Nit: you could now move end one line above and do if (end < 7) ...

schveiguy · 2018-02-06T22:56:21Z

I know it's tempting here to find the "best way" to parse an ip address. When I first saw the function, I spent about 15 minutes writing an alternative one as a suggestion that looked more at the text pattern rather than creating a uint out of the numbers. After a while, I just deleted it, since it's really not insanely complex either way, and the end result isn't different enough to warrant changes to this PR.

I think probably we can pull this as-is, and improve the internal details later. I'd like to get the regex problem solved.

luismarques · 2018-02-06T23:45:10Z

Can't you just manually include the D code generated and mixin()ed by the ctRegex? Leave the ctRegex version(none)ed out to indicate where the code comes from, and to reenable it once the performance situation changes.

WalterBright · 2018-02-06T23:45:54Z

std/net/isemail.d

+        {
+            uint c = cast(uint) s[i] - '0';
+            if (c > 9) return null;
+            x = x * 10 + c;


need to check for integer overflow

Can't overflow, we're summing a maximum of 3 digits.
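The reasoning in the reply can be spelled out with a small sketch. `isAsciiDigit` below is a hypothetical name, not a function in the pull request. It isolates the `cast(uint) ch - '0'` idiom from the diff: for a character below `'0'` the unsigned subtraction wraps around to a very large value, so the single comparison `c > 9` rejects characters on both sides of the digit range. The accumulator is bounded because at most three such digits are combined.

```d
/// Hypothetical helper isolating the digit test used in the diff.
bool isAsciiDigit(char ch) @safe pure nothrow @nogc
{
    // Unsigned arithmetic: '/' - '0' wraps to uint.max, ':' - '0' is 10.
    uint c = cast(uint) ch - '0';
    return c <= 9;
}

static assert(isAsciiDigit('0') && isAsciiDigit('9'));
static assert(!isAsciiDigit('/')); // one below '0', wraps around
static assert(!isAsciiDigit(':')); // one above '9'

// Largest value three decimal digits can produce; far below uint.max.
static assert(9 * 100 + 9 * 10 + 9 == 999);
static assert(999 < uint.max);
```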

JackStouffer · 2018-02-07T00:04:18Z

Can't you just manually include the D code generated and mixin()ed by the ctRegex? Leave the ctRegex version(none)ed out to indicate where the code comes from, and to reenable it once the performance situation changes.

Yes, this makes sense. @n8sh There's a debug flag in std.regex which will print out the code. Can you see which one is faster?

n8sh · 2018-02-07T20:51:08Z

@luismarques @JackStouffer I've run a comparison and the new code is dramatically faster.

Time to extract either an empty string or an IP address suffix from "addressLiteral" in a list of 32 strings, repeated 10_000 times:

| | dmd -O | dmd -O -inline | ldc2 -O2 |
|---|---|---|---|
| ctRegex | 1667 msecs | 889 msecs | 451 msecs |
| matchIPSuffix | 3 msecs | 3 msecs | 2 msecs |

The "addressLiteral" strings used for the benchmark were captured from the current std.net.isemail unittests plus the unittest added by this pull request. I can post the benchmark code if you suggest a convenient place for it (it's a bit long for this comment).

n8sh · 2018-02-07T20:54:53Z

Note that the way the ctRegex is used in the current code -- which is how I benchmarked it -- includes dynamic memory allocation. It is effectively:

auto matchesIp = addressLiteral.matchAll(ipRegex).map!(a => a.hit).array;
string ipSuffix = matchesIp.empty ? null : matchesIp.front;

JackStouffer · 2018-02-07T21:17:19Z

@n8sh You can post it here via the following

<details>

```d
code here
```

</details>

n8sh · 2018-02-07T22:43:05Z

@JackStouffer Thanks.

Benchmark source code

// Strings taken from current unittests.
immutable string[32] address_literal_dataset =
[
    `IPv6:1111:2222:3333:4444:5555:6666::8888`,
    `255.255.255.255`,
    `255.255.255`,
    `255.255.255.255.255`,
    `255.255.255.256`,
    `1111:2222:3333:4444:5555:6666:7777:8888`,
    `IPv6:1111:2222:3333:4444:5555:6666:7777`,
    `IPv6:1111:2222:3333:4444:5555:6666:7777:8888`,
    `IPv6:1111:2222:3333:4444:5555:6666:7777:8888:9999`,
    `IPv6:1111:2222:3333:4444:5555:6666:7777:888G`,
    `IPv6:1111:2222:3333:4444:5555:6666::8888`,
    `IPv6:1111:2222:3333:4444:5555::8888`,
    `IPv6:1111:2222:3333:4444:5555:6666::7777:8888`,
    `IPv6::3333:4444:5555:6666:7777:8888`,
    `IPv6:::3333:4444:5555:6666:7777:8888`,
    `IPv6:1111::4444:5555::8888`,
    `IPv6:::`,
    `IPv6:1111:2222:3333:4444:5555:255.255.255.255`,
    `IPv6:1111:2222:3333:4444:5555:6666:255.255.255.255`,
    `IPv6:1111:2222:3333:4444:5555:6666:7777:255.255.255.255`,
    `IPv6:1111:2222:3333:4444::255.255.255.255`,
    `IPv6:1111:2222:3333:4444:5555:6666::255.255.255.255`,
    `IPv6:1111:2222:3333:4444:::255.255.255.255`,
    `IPv6::255.255.255.255`,
    `255.255.255.255`,
    `RFC-5322-domain-literal`,
    `RFC-5322`,
    `RFC5322domainliteral`,
    `RFC-5322-domain-literal`,
    `IPv6:1::2:`,
    `255.255.255.255`,
    `babaev 176.16.0.1`
];

void main(string[] args)
{
    import std.stdio : writeln;
    import std.datetime.stopwatch : AutoStart, StopWatch;
    enum iterations = 10_000;
    auto test_data = address_literal_dataset[];
    size_t output;
    StopWatch sw = StopWatch(AutoStart.no);
    
    // Method 1

    sw.reset();
    output = 0;
    sw.start();
    foreach_reverse (_; 0.. iterations)
    {
        foreach (address_literal; test_data)
            output += method1(address_literal).length;
    }
    sw.stop();
    writeln("method1 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");

    sw.reset();
    output = 0;
    sw.start();
    foreach_reverse (_; 0.. iterations)
    {
        foreach (address_literal; test_data)
            output += method1(address_literal).length;
    }
    sw.stop();
    writeln("method1 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");

    // Method 2

    sw.reset();
    output = 0;
    sw.start();
    foreach_reverse (_; 0.. iterations)
    {
        foreach (address_literal; test_data)
            output += method2(address_literal).length;
    }
    sw.stop();
    writeln("method2 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");

    sw.reset();
    output = 0;
    sw.start();
    foreach_reverse (_; 0.. iterations)
    {
        foreach (address_literal; test_data)
            output += method2(address_literal).length;
    }
    sw.stop();
    writeln("method2 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");

    // Method 1 again

    sw.reset();
    output = 0;
    sw.start();
    foreach_reverse (_; 0.. iterations)
    {
        foreach (address_literal; test_data)
            output += method1(address_literal).length;
    }
    sw.stop();
    writeln("method1 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");

    sw.reset();
    output = 0;
    sw.start();
    foreach_reverse (_; 0.. iterations)
    {
        foreach (address_literal; test_data)
            output += method1(address_literal).length;
    }
    sw.stop();
    writeln("method1 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");
}

pragma(inline, false)
const(Char)[] method1(Char)(const(Char)[] s)
{
    import std.algorithm.iteration : map;
    import std.array : array, split;
    import std.range.primitives : empty, front;
    import std.regex : ctRegex, matchAll;
    static ipRegex = ctRegex!(`\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}`~
                            `(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$`);
    auto matchesIp = s.matchAll(ipRegex).map!(a => a.hit).array;
    return matchesIp.empty ? null : matchesIp.front;
}

pragma(inline, false)
private const(Char)[] method2(Char)(return const(Char)[] s)
{
    size_t end = s.length;
    if (end < 7) return null;
    // Check the first three `[.]\d{1,3}`
    foreach (_; 0 .. 3)
    {
        size_t start = void;
        if (end >= 2 && s[end-2] == '.')
            start = end - 2;
        else if (end >= 3 && s[end-3] == '.')
            start = end - 3;
        else if (end >= 4 && s[end-4] == '.')
            start = end - 4;
        else
            return null;
        uint x = 0;
        foreach (i; start + 1 .. end)
        {
            uint c = cast(uint) s[i] - '0';
            if (c > 9) return null;
            x = x * 10 + c;
        }
        if (x > 255) return null;
        end = start;
    }
    // Check the final `\d{1,3}`.
    if (end < 1) return null;
    size_t start = end - 1;
    uint x = cast(uint) s[start] - '0';
    if (x > 9) return null;
    if (start > 0 && cast(uint) s[start-1] - '0' <= 9)
    {
        --start;
        x += 10 * (cast(uint) s[start] - '0');
        if (start > 0 && cast(uint) s[start-1] - '0' <= 9)
        {
            --start;
            x += 100 * (cast(uint) s[start] - '0');
        }
    }
    if (x > 255) return null;
    // Must either be at start of string or preceded by a non-word character.
    // (TO DETERMINE: is the definition of "word character" ASCII only?)
    if (start == 0) return s;
    const b = s[start - 1];
    import std.ascii : isAlphaNum;
    if (isAlphaNum(b) || b == '_') return null;
    return s[start .. $];
}
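The six timing blocks in `main` above all follow the same pattern, so the harness could be condensed with a small helper. The following is only a sketch of that idea, not part of the posted benchmark; `timeIt` and the stand-in function `identity` are hypothetical names. It keeps the same measurement (total milliseconds plus a length checksum so the work cannot be optimized away).

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

/// Hypothetical helper: runs `fn` over `data` `iterations` times, prints the
/// elapsed time in the same format as the benchmark, and returns the checksum.
size_t timeIt(alias fn)(string label, const string[] data, size_t iterations)
{
    size_t checksum = 0;
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. iterations)
        foreach (s; data)
            checksum += fn(s).length;
    sw.stop();
    writeln(label, " x", iterations, " = ", sw.peek.total!"msecs",
            " msecs [checksum ", checksum, "]");
    return checksum;
}

/// Stand-in for method1/method2 so this sketch compiles on its own.
const(char)[] identity(const(char)[] s) { return s; }

unittest
{
    // 3 passes over strings of length 2 and 3 give a checksum of 15.
    assert(timeIt!identity("identity", ["ab", "cde"], 3) == 15);
}
```

With such a helper, the interleaved runs (method 1 twice, method 2 twice, method 1 twice again) would reduce to six calls like `timeIt!method1("method1", test_data, iterations)`.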

andralex

I'm good with this, thanks

andralex · 2018-02-13T13:18:37Z

std/net/isemail.d

+{
+    import std.ascii : isHexDigit;
+    if (s.length > 4) return false;
+    foreach (c; s)


return s.length <= 4 && s.all!isHexDigit; But that would need std.algorithm.searching :)

Also, this would involve auto-decoding I think.

.byCodeUnit - no autodecoding and in @nogc nothrow pure @safe

https://run.dlang.io/is/PvbTfz
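For reference, the `byCodeUnit` variant suggested here could look like the sketch below. The function name `isHexGroup` is an assumption made for this example, since the diff excerpt does not show the name used in the pull request. Iterating with `byCodeUnit` avoids auto-decoding, and the attributes are written out so the compiler verifies the `@safe pure nothrow @nogc` claim.

```d
import std.algorithm.searching : all;
import std.ascii : isHexDigit;
import std.utf : byCodeUnit;

/// Hypothetical name: true if `s` is at most four ASCII hex digits.
bool isHexGroup(const(char)[] s) @safe pure nothrow @nogc
{
    // byCodeUnit iterates raw chars, so no UTF decoding takes place.
    return s.length <= 4 && s.byCodeUnit.all!isHexDigit;
}

unittest
{
    assert(isHexGroup("1a2F"));
    assert(!isHexGroup("888G"));  // 'G' is not a hex digit
    assert(!isHexGroup("12345")); // too long
}
```

Note that, like the loop in the diff, this accepts an empty string; whether that is desired depends on the caller.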

schveiguy · 2018-02-13T13:41:02Z

I think this is in a good enough state, we can have further PRs to include uses of Phobos and nifty range pipelines as improvements. Given the two approvals, I'll pull.

n8sh requested a review from JackStouffer as a code owner February 6, 2018 13:47

dlang-bot added the Severity:Bug Fix label Feb 6, 2018

schveiguy approved these changes Feb 6, 2018

n8sh force-pushed the isemail-noregex branch 3 times, most recently from 0ac5260 to 4948cbb Compare February 6, 2018 16:09

JackStouffer reviewed Feb 6, 2018

n8sh force-pushed the isemail-noregex branch from 4948cbb to a74b295 Compare February 6, 2018 16:17

wilzbach reviewed Feb 6, 2018

n8sh force-pushed the isemail-noregex branch from a74b295 to f1a2de4 Compare February 6, 2018 16:30

WalterBright reviewed Feb 6, 2018

n8sh force-pushed the isemail-noregex branch 2 times, most recently from 7c42c19 to f2f1c32 Compare February 7, 2018 00:03

n8sh force-pushed the isemail-noregex branch from f2f1c32 to e5521a9 Compare February 7, 2018 01:04

wilzbach mentioned this pull request Feb 9, 2018

Add a global convenience package file #5916

Merged

Fix Issue 18384 - std.net.isemail is slow to import due to regex

06e4030

Solution is to remove regex from std.net.isemail. May be worth revisiting if std.regex compile times improve.

n8sh force-pushed the isemail-noregex branch from e5521a9 to 06e4030 Compare February 11, 2018 11:08

andralex approved these changes Feb 13, 2018

schveiguy added the Merge:auto-merge label Feb 13, 2018

dlang-bot merged commit cbd6cf1 into dlang:master Feb 13, 2018

wilzbach mentioned this pull request Feb 14, 2018

Fix DScanner - same visibility attribute used as defined on line 1713 #6171

Merged
