
SI-7710 fix memory performance of RegexParsers in jdk7u6+ #17


Merged
merged 1 commit into from
Jun 25, 2014

Conversation

gourlaysama
Contributor

Starting with JDK 1.7.0_06 [1], String.substring no longer reuses the internal
char array of the String but makes a copy instead. Since we call
subSequence twice for every input character, this results in horrible
parse performance and GC pressure.
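The fix's approach can be sketched in plain Java (a hypothetical illustration, not this PR's actual Scala code): wrap the input in a CharSequence view that only tracks offsets into the shared underlying String, so repeated subSequence calls stay O(1) instead of copying O(n) characters each time.

```java
// Hypothetical sketch of the technique: a CharSequence view that shares
// the underlying String instead of copying it on every subSequence call.
public class SubSequenceDemo {
    public static final class SubSequence implements CharSequence {
        private final CharSequence underlying;
        private final int start, end;

        public SubSequence(CharSequence underlying, int start, int end) {
            this.underlying = underlying;
            this.start = start;
            this.end = end;
        }

        @Override public int length() { return end - start; }

        @Override public char charAt(int i) {
            if (i < 0 || i >= length()) throw new IndexOutOfBoundsException();
            return underlying.charAt(start + i);
        }

        @Override public CharSequence subSequence(int from, int to) {
            // No copy: just narrow the window over the same underlying String.
            return new SubSequence(underlying, start + from, start + to);
        }

        @Override public String toString() {
            // Copy only when a String is actually required.
            return underlying.subSequence(start, end).toString();
        }
    }

    public static void main(String[] args) {
        CharSequence s = new SubSequence("hello world", 0, 11);
        CharSequence tail = s.subSequence(6, 11); // O(1), no array copy
        System.out.println(tail); // prints "world"
    }
}
```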

With the benchmark from the (duplicate) ticket SI-8542, I get:

BEFORE:

    parseAll(new StringReader(String))
    For 100 items: 49 ms
    For 500 items: 97 ms
    For 1000 items: 155 ms
    For 5000 items: 113 ms
    For 10000 items: 188 ms
    For 50000 items: 1437 ms
    ===
    parseAll(String)
    For 100 items: 4 ms
    For 500 items: 67 ms
    For 1000 items: 372 ms
    For 5000 items: 5693 ms
    For 10000 items: 23126 ms
    For 50000 items: 657665 ms

AFTER:

    parseAll(new StringReader(String))
    For 100 items: 43 ms
    For 500 items: 118 ms
    For 1000 items: 217 ms
    For 5000 items: 192 ms
    For 10000 items: 196 ms
    For 50000 items: 1424 ms
    ===
    parseAll(String)
    For 100 items: 2 ms
    For 500 items: 8 ms
    For 1000 items: 16 ms
    For 5000 items: 79 ms
    For 10000 items: 161 ms
    For 50000 items: 636 ms

[1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6924259

@gourlaysama
Contributor Author

Hmm, Travis still tries to build against 2.11.0-SNAPSHOT...

@gourlaysama gourlaysama mentioned this pull request Apr 28, 2014
@dcsobral
Contributor

+1

@Ichoran

Ichoran commented Apr 28, 2014

Great! Do you know if there is a code path that can have a lot of parsing/recursion without calling regex? If so, it would still exhibit the old O(n^2) behavior. I put off writing a fix until I had time to track it down.

@gourlaysama
Contributor Author

@Ichoran I couldn't find any other heavy user of substring/subSequence in there.

There are still a few, but they are either:

  • required by the public API to return a String (so there is no way around the array copy), or
  • never called repeatedly by anything else in parser-combinators.

@gourlaysama
Contributor Author

@adriaanm: could you take a look at this when you have time? That bug is pretty annoying :)
(or is there a new maintainer for parser-combinators?)

@adriaanm
Contributor

Sorry, been busy with Scaladays etc :-)
I'd love to hand over maintenance, yes. Happy to help you get started ;)

@adriaanm
Contributor

Travis is failing because your PR is based on an older version of master that tested against 2.11.0-SNAPSHOT (which is now 2.11.2-SNAPSHOT). Running, in your branch:

    git pull --rebase https://github.com/scala/scala-parser-combinators.git master
    git push -f $yourRemote

should do the trick.

@gourlaysama
Contributor Author

I rebased it. MiMa also wasn't happy with a private class in a trait, so I moved it out.

        throw new IndexOutOfBoundsException(s"start: ${_start}, end: ${_end}, length: $length")

        new SubSequence(s, start + _start, _end - _start)
    }
Contributor

sorry! one last nitpick: indentation is off-by-one here :)

Contributor Author

Oh. Fixed :-)

@adriaanm
Contributor

Thanks, LGTM!

@adriaanm
Contributor

Great! I've restarted the build on Travis. Looks like failure was on their end.

adriaanm added a commit that referenced this pull request Jun 25, 2014
SI-7710 fix memory performance of RegexParsers in jdk7u6+
@adriaanm adriaanm merged commit 9942de1 into scala:master Jun 25, 2014
@ceilican

@Ichoran @gourlaysama: maybe rep or rep1 also have a hidden O(n^2) behaviour? See this: http://stackoverflow.com/questions/23117635/why-is-scalas-combinator-parsing-slow-when-parsing-large-files-what-can-i-do

Note that, in the example discussed there, the lines parsed via regex were of equal length. I suspect that if the regex parser were replaced by a non-regex parser, the slow-down described there would still happen.

@gourlaysama
Contributor Author

@ceilican in that example, rep calls the regex-based parser line, hence the slowdown. But it doesn't call substring on anything (because it doesn't extract anything from a String; only the repeated parser does). A quick benchmark confirms it.

@ceilican

Great! Thanks! When will Scala 2.11.2 with this fix be released?

(This fix would allow my project to handle much bigger inputs. Right now I am wasting a lot of a cluster's compute resources because of this, so I would really be interested in a Scala 2.11.2 release as soon as possible.)

In case it will not be released soon, is there an easy way (e.g. by simply changing something in my project's build.sbt file) to benefit from this fix before it is released?

@gourlaysama
Contributor Author

scala-parser-combinators is now versioned separately, so
(theoretically) it doesn't have to follow Scala releases; a 1.0.2 could be
released before 2.11.2.

Until then, you can always build it from source, publish it locally and
depend on 1.0.2-SNAPSHOT, like you would any other normal dependency.
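For anyone wanting to try this before a release, the publish-locally route looks roughly like the following (the repository URL is from this thread; the exact snapshot version string is an assumption about what the build currently produces):

```shell
# Build scala-parser-combinators from source and publish it to the local Ivy repo.
git clone https://github.com/scala/scala-parser-combinators.git
cd scala-parser-combinators
sbt publishLocal

# Then, in your project's build.sbt, depend on the locally published snapshot,
# e.g. (group ID and version assumed):
#   libraryDependencies += "org.scala-lang.modules" %% "scala-parser-combinators" % "1.0.2-SNAPSHOT"
```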

@adriaanm I'd be happy to help with maintenance :-)

@adriaanm
Contributor

@gourlaysama, great! Thank you! I've added you to our Community Maintainers team. Use the power wisely ;-) We're always happy to help, so don't hesitate to ask when in doubt.

I'd suggest adding your email address to .travis.yml so that you're notified of build breakage.

Also, @gkossakowski is our infrastructure tsar this year. One task on his TODO list is to make releases tag-driven, so maintainers can cut releases easily. Feel free to prod him on this ;-)

@adriaanm
Contributor

Until we have automated releases, I'm happy to have you, as a maintainer, tag a release when it's ready, and ping @gkossakowski to publish it. This should be done before the PR freeze of the next Scala release (July 14 for 2.11.2) in order to have it included in the distro. While you're at it, please also bump the Scala & sbt versions to the latest stable release.

Tags really shouldn't be undone/changed, so please tread lightly.
