Skip to content

Fix Issue 18384 - std.net.isemail is slow to import due to regex #6129

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 13, 2018

Conversation

n8sh
Copy link
Member

@n8sh n8sh commented Feb 6, 2018

Solution is to remove regex from std.net.isemail. May be worth revisiting if std.regex compile times improve. On my machine making the change decreased the time measured by @wilzbach's import benchmark script from 0.15 seconds to 0.03 seconds (consistently over repeated trials).

Link to mention of this issue on the forums.

@n8sh n8sh requested a review from JackStouffer as a code owner February 6, 2018 13:47
@dlang-bot
Copy link
Contributor

Thanks for your pull request, @n8sh! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.

Some tips to help speed things up:

  • smaller, focused PRs are easier to review than big ones

  • try not to mix up refactoring or style changes with bug fixes or feature enhancements

  • provide helpful commit messages explaining the rationale behind each change

Bear in mind that large or tricky changes may require multiple rounds of review and revision.

Please see CONTRIBUTING.md for more information.

Bugzilla references

Auto-close Bugzilla Description
18384 std.net.isemail is slow to import due to regex

Copy link
Member

@schveiguy schveiguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple nits. Looks good except for that.

if (end >= 2 && s[end-2] == '.')
start = end - 2;
else if (end >= 3 && s[end-3] == '.')
start = end - 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is untested, try an IP address with 2 digits in an octet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

import std.ascii : isHexDigit;
if (s.length > 4) return false;
foreach (i; 0 .. s.length)
if (!isHexDigit(s[i])) return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

foreach (c; s)
    if (!isHexDigit(c)) return false;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although that is nicer syntax, it would auto-decode which is not necessary (all characters I'm searching for are <= 0x7F) and would be a performance hit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

foreach (c; s) doesn't do auto-decoding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does if c is dchar and s is a narrow string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake: I thought c would default to dchar unless given a specific type. Changed to use foreach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought c would default to dchar unless given a specific type.

This is a common misconception -- because Phobos treats arrays of char weirdly. The language doesn't, it's still an array to the language.

@n8sh n8sh force-pushed the isemail-noregex branch 3 times, most recently from 0ac5260 to 4948cbb Compare February 6, 2018 16:09
+/
private static const(Char)[] matchIPSuffix(Char)(return const Char[] s)
{
if (s.length < "0.0.0.0".length) return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (s.length < 7)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Member

@wilzbach wilzbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the work of moving forward and getting rid of the slow and bulky RegExp.
However, I would prefer if we don't reinvent parsing and just use the existing and testing std.conv. I added a few comments and ideas.

else if (end >= 4 && s[end-4] == '.')
start = end - 4;
else
return null;
Copy link
Member

@wilzbach wilzbach Feb 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

start = end - s.byCodeUnit.retro.take(4).until(".").count;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do even one better:

int blocks = s.byCodeUnit
.splitter(".")
.filter!checkValid // check an individual block
.take(4) // avoids problems with infinite strings
.walkLength;
if (blocks < 4) return null;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think that does the right thing. The blocks need to be consecutive and they need to be the tail of the string, and we need the total length of the tail.

uint c = cast(uint) s[i] - '0';
if (c > 9) return null;
x = x * 10 + c;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint x = s[start + 1 .. end].byCodeUnit.take(2).to!uint.ifThrown(256);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not @nogc and not nothrow

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: @nogc isn't a requirement for Phobos (we prefer nice, readable code instead of mir). In this case it will be fixed once the already merged -dip1008 gets activated by default.
Also the previous matchAll isn't neither @nogc nor nothrow either (and neither is the code which uses it).

Regarding nothrow, how about scope(failure) return null;?
That would save even more lines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While making things @nogc nothrow is nice when ever possible, in this specific instance isEmail is not @nogc nothrow, so this change will have no effect on the user and there's no benefit.

Copy link
Member

@schveiguy schveiguy Feb 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about converting to ubyte? then let to do the work of making sure the octet is in range.

FWIW, I like the idea of tagging internal functions @nogc and nothrow, even if the users aren't, because it helps keep GC usage down. Otherwise, little fixes that may add an unnecessary allocation creep in here and there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this specific instance isEmail is not @nogc nothrow

It probably should be but I'll leave that for another pull request. It looks like isEmail is incorrectly using Exceptions rather than assert() to debug logic errors in the function itself (see forum discussion).

--start;
x += 100 * (cast(uint) s[start] - '0');
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s[max(0, end - 3) .. $].byCodeUnit.to!uint.ifThrown(256);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use this here because unlike above I don't know where the encoded octet starts.

// (TO DETERMINE: is the definition of "word character" ASCII only?)
if (start == 0) return s;
const b = cast(uint) s[start - 1];
if (b - 'A' < 26 || b - 'a' < 26 || b - '0' < 10 || b == '_') return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use isAlphaNum from std.ascii

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

so we can return `const(Char)[]` instead of `const(Char)[][]` using a
zero-length string to indicate no match.
+/
private static const(Char)[] matchIPSuffix(Char)(return const Char[] s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for the use of return

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static is pointless, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add pure nothrow @safe @nogc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

private static const(Char)[] matchIPSuffix(Char)(return const Char[] s)
{
if (s.length < 7) return null;
size_t end = s.length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you could now move end one line above and do if (end < 7) ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@schveiguy
Copy link
Member

I know it's tempting here to find the "best way" to parse an ip address. When I first saw the function, I spent about 15 minutes writing an alternative one as a suggestion that looked more at the text pattern rather than creating a uint out of the numbers. After a while, I just deleted it, since it's really not insanely complex either way, and the end result isn't different enough to warrant changes to this PR.

I think probably we can pull this as-is, and improve the internal details later. I'd like to get the regex problem solved.

@luismarques
Copy link
Contributor

Can't you just manually include the D code generated and mixin()ed by the ctRegex? Leave the ctRegex version(none)ed out to indicate where the code comes from, and to reenable it once the performance situation changes.

{
uint c = cast(uint) s[i] - '0';
if (c > 9) return null;
x = x * 10 + c;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to check for integer overflow

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't overflow, we're summing a maximum of 3 digits.

@n8sh n8sh force-pushed the isemail-noregex branch 2 times, most recently from 7c42c19 to f2f1c32 Compare February 7, 2018 00:03
@JackStouffer
Copy link
Member

Can't you just manually include the D code generated and mixin()ed by the ctRegex? Leave the ctRegex version(none)ed out to indicate where the code comes from, and to reenable it once the performance situation changes.

Yes, this makes sense. @n8sh There's a debug flag in std.regex which will print out the code. Can you see which one is faster?

@n8sh
Copy link
Member Author

n8sh commented Feb 7, 2018

@luismarques @JackStouffer I've run a comparison and the new code is dramatically faster.

Time to extract either an empty string or an IP address suffix from "addressLiteral" in a list of 32 strings, repeated 10_000 times:

dmd -O dmd -O -inline ldc2 -O2
ctRegex 1667 msecs 889 msecs 451 msecs
matchIPSuffix 3 msecs 3 msecs 2 msecs

The "addressLiteral" strings used for the benchmark were captured from the current std.net.isemail unittests plus the unittest added by this pull request. I can post the benchmark code if you suggest a convenient place for it (it's a bit long for this comment).

@n8sh
Copy link
Member Author

n8sh commented Feb 7, 2018

Note that the way the ctRegex is used in the current code -- which is how I benchmarked it -- includes dynamic memory allocation. It is effectively:

auto matchesIp = addressLiteral.matchAll(ipRegex).map!(a => a.hit).array;
string ipSuffix = matchesIp.empty ? null : matchesIp.front;

@JackStouffer
Copy link
Member

JackStouffer commented Feb 7, 2018

@n8sh You can post it here via the following

<details>

```d
code here
```

</details>

@n8sh
Copy link
Member Author

n8sh commented Feb 7, 2018

@JackStouffer Thanks.

Benchmark source code
// Strings taken from current unittests.
immutable string[32] address_literal_dataset =
[
    `IPv6:1111:2222:3333:4444:5555:6666::8888`,
    `255.255.255.255`,
    `255.255.255`,
    `255.255.255.255.255`,
    `255.255.255.256`,
    `1111:2222:3333:4444:5555:6666:7777:8888`,
    `IPv6:1111:2222:3333:4444:5555:6666:7777`,
    `IPv6:1111:2222:3333:4444:5555:6666:7777:8888`,
    `IPv6:1111:2222:3333:4444:5555:6666:7777:8888:9999`,
    `IPv6:1111:2222:3333:4444:5555:6666:7777:888G`,
    `IPv6:1111:2222:3333:4444:5555:6666::8888`,
    `IPv6:1111:2222:3333:4444:5555::8888`,
    `IPv6:1111:2222:3333:4444:5555:6666::7777:8888`,
    `IPv6::3333:4444:5555:6666:7777:8888`,
    `IPv6:::3333:4444:5555:6666:7777:8888`,
    `IPv6:1111::4444:5555::8888`,
    `IPv6:::`,
    `IPv6:1111:2222:3333:4444:5555:255.255.255.255`,
    `IPv6:1111:2222:3333:4444:5555:6666:255.255.255.255`,
    `IPv6:1111:2222:3333:4444:5555:6666:7777:255.255.255.255`,
    `IPv6:1111:2222:3333:4444::255.255.255.255`,
    `IPv6:1111:2222:3333:4444:5555:6666::255.255.255.255`,
    `IPv6:1111:2222:3333:4444:::255.255.255.255`,
    `IPv6::255.255.255.255`,
    `255.255.255.255`,
    `RFC-5322-domain-literal`,
    `RFC-5322`,
    `RFC5322domainliteral`,
    `RFC-5322-domain-literal`,
    `IPv6:1::2:`,
    `255.255.255.255`,
    `babaev 176.16.0.1`
];

void main(string[] args)
{
    import std.stdio : writeln;
    import std.datetime.stopwatch : AutoStart, StopWatch;
    enum iterations = 10_000;
    auto test_data = address_literal_dataset[];
    size_t output;
    StopWatch sw = StopWatch(AutoStart.no);
    
    // Method 1

    sw.reset();
    output = 0;
    sw.start();
    foreach_reverse (_; 0.. iterations)
    {
        foreach (address_literal; test_data)
            output += method1(address_literal).length;
    }
    sw.stop();
    writeln("method1 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");

    sw.reset();
    output = 0;
    sw.start();
    foreach_reverse (_; 0.. iterations)
    {
        foreach (address_literal; test_data)
            output += method1(address_literal).length;
    }
    sw.stop();
    writeln("method1 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");

    // Method 2

    sw.reset();
    output = 0;
    sw.start();
    foreach_reverse (_; 0.. iterations)
    {
        foreach (address_literal; test_data)
            output += method2(address_literal).length;
    }
    sw.stop();
    writeln("method2 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");

    sw.reset();
    output = 0;
    sw.start();
    foreach_reverse (_; 0.. iterations)
    {
        foreach (address_literal; test_data)
            output += method2(address_literal).length;
    }
    sw.stop();
    writeln("method2 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");

    // Method 1 again

    sw.reset();
    output = 0;
    sw.start();
    foreach_reverse (_; 0.. iterations)
    {
        foreach (address_literal; test_data)
            output += method1(address_literal).length;
    }
    sw.stop();
    writeln("method1 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");

    sw.reset();
    output = 0;
    sw.start();
    foreach_reverse (_; 0.. iterations)
    {
        foreach (address_literal; test_data)
            output += method1(address_literal).length;
    }
    sw.stop();
    writeln("method1 x", iterations, " = ", sw.peek().total!"msecs", " msecs [checksum ", output, "]");
}

pragma(inline, false)
const(Char)[] method1(Char)(const(Char)[] s)
{
    import std.algorithm.iteration : map;
    import std.array : array, split;
    import std.range.primitives : empty, front;
    import std.regex : ctRegex, matchAll;
    static ipRegex = ctRegex!(`\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}`~
                            `(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$`);
    auto matchesIp = s.matchAll(ipRegex).map!(a => a.hit).array;
    return matchesIp.empty ? null : matchesIp.front;
}

pragma(inline, false)
private const(Char)[] method2(Char)(return const(Char)[] s)
{
    size_t end = s.length;
    if (end < 7) return null;
    // Check the first three `[.]\d{1,3}`
    foreach (_; 0 .. 3)
    {
        size_t start = void;
        if (end >= 2 && s[end-2] == '.')
            start = end - 2;
        else if (end >= 3 && s[end-3] == '.')
            start = end - 3;
        else if (end >= 4 && s[end-4] == '.')
            start = end - 4;
        else
            return null;
        uint x = 0;
        foreach (i; start + 1 .. end)
        {
            uint c = cast(uint) s[i] - '0';
            if (c > 9) return null;
            x = x * 10 + c;
        }
        if (x > 255) return null;
        end = start;
    }
    // Check the final `\d{1,3}`.
    if (end < 1) return null;
    size_t start = end - 1;
    uint x = cast(uint) s[start] - '0';
    if (x > 9) return null;
    if (start > 0 && cast(uint) s[start-1] - '0' <= 9)
    {
        --start;
        x += 10 * (cast(uint) s[start] - '0');
        if (start > 0 && cast(uint) s[start-1] - '0' <= 9)
        {
            --start;
            x += 100 * (cast(uint) s[start] - '0');
        }
    }
    if (x > 255) return null;
    // Must either be at start of string or preceded by a non-word character.
    // (TO DETERMINE: is the definition of "word character" ASCII only?)
    if (start == 0) return s;
    const b = s[start - 1];
    import std.ascii : isAlphaNum;
    if (isAlphaNum(b) || b == '_') return null;
    return s[start .. $];
}

Solution is to remove regex from std.net.isemail. May be worth revisiting
if std.regex compile times improve.
Copy link
Member

@andralex andralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with this, thanks

{
import std.ascii : isHexDigit;
if (s.length > 4) return false;
foreach (c; s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return s.length <= 4 && s.all!isHexDigit; But that would need std.algorithm.searching :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this would involve auto-decoding I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.byCodeUnit - no autodecoding and in @nogc nothrow pure @safe

https://run.dlang.io/is/PvbTfz

@schveiguy
Copy link
Member

I think this is in a good enough state, we can have further PRs to include uses of Phobos and nifty range pipelines as improvements. Given the two approvals, I'll pull.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.