
SI-7710 fix memory performance of RegexParsers in jdk7u6+ #17


Merged
merged 1 commit into from
Jun 25, 2014

Conversation

gourlaysama
Contributor

Starting with JDK 1.7.0_06 [1], String.substring no longer reuses the internal
char array of the String but makes a copy instead. Since we call
subSequence twice for every input character, this results in horrible
parse performance and GC pressure.
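The fix's approach can be sketched in plain Java (a hypothetical illustration, not this PR's actual Scala code): wrap the input in a CharSequence view that only tracks offsets into the shared underlying String, so repeated subSequence calls stay O(1) instead of copying O(n) characters each time.

```java
// Hypothetical sketch of the technique: a CharSequence view that shares
// the underlying String instead of copying it on every subSequence call.
public class SubSequenceDemo {
    public static final class SubSequence implements CharSequence {
        private final CharSequence underlying;
        private final int start, end;

        public SubSequence(CharSequence underlying, int start, int end) {
            this.underlying = underlying;
            this.start = start;
            this.end = end;
        }

        @Override public int length() { return end - start; }

        @Override public char charAt(int i) {
            if (i < 0 || i >= length()) throw new IndexOutOfBoundsException();
            return underlying.charAt(start + i);
        }

        @Override public CharSequence subSequence(int from, int to) {
            // No copy: just narrow the window over the same underlying String.
            return new SubSequence(underlying, start + from, start + to);
        }

        @Override public String toString() {
            // Copy only when a String is actually required.
            return underlying.subSequence(start, end).toString();
        }
    }

    public static void main(String[] args) {
        CharSequence s = new SubSequence("hello world", 0, 11);
        CharSequence tail = s.subSequence(6, 11); // O(1), no array copy
        System.out.println(tail); // prints "world"
    }
}
```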

With the benchmark from the (duplicate) ticket SI-8542, I get:

BEFORE:

    parseAll(new StringReader(String))
    For 100 items: 49 ms
    For 500 items: 97 ms
    For 1000 items: 155 ms
    For 5000 items: 113 ms
    For 10000 items: 188 ms
    For 50000 items: 1437 ms
    ===
    parseAll(String)
    For 100 items: 4 ms
    For 500 items: 67 ms
    For 1000 items: 372 ms
    For 5000 items: 5693 ms
    For 10000 items: 23126 ms
    For 50000 items: 657665 ms

AFTER:

    parseAll(new StringReader(String))
    For 100 items: 43 ms
    For 500 items: 118 ms
    For 1000 items: 217 ms
    For 5000 items: 192 ms
    For 10000 items: 196 ms
    For 50000 items: 1424 ms
    ===
    parseAll(String)
    For 100 items: 2 ms
    For 500 items: 8 ms
    For 1000 items: 16 ms
    For 5000 items: 79 ms
    For 10000 items: 161 ms
    For 50000 items: 636 ms

[1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6924259

@gourlaysama
Contributor Author

Hmm, Travis still tries to build against 2.11.0-SNAPSHOT...

@gourlaysama gourlaysama mentioned this pull request Apr 28, 2014
@dcsobral
Contributor

+1

@Ichoran

Ichoran commented Apr 28, 2014

Great! Do you know if there is a code path that can have a lot of parsing/recursion without calling regex? If so, it would still exhibit the old O(n^2) behavior. I put off writing a fix until I had time to track it down.

@gourlaysama
Contributor Author

@Ichoran I couldn't find any other heavy user of substring/subSequence in there.

There are still a few, but they are either:

  • required by the public API to return a String (so there is no way around the array copy), or
  • never called repeatedly by anything else in parser-combinators.

@gourlaysama
Contributor Author

@adriaanm: could you take a look at this when you have time? That bug is pretty annoying :)
(or is there a new maintainer for parser-combinators?)

@adriaanm
Contributor

Sorry, been busy with Scaladays etc :-)
I'd love to hand over maintenance, yes. Happy to help you get started ;)

@adriaanm
Contributor

Travis is failing because your PR is based on an older version of master that tested against 2.11.0-SNAPSHOT (which is now 2.11.2-SNAPSHOT). Running, in your branch:

    git pull --rebase https://github.com/scala/scala-parser-combinators.git master
    git push -f $yourRemote

should do the trick.

@gourlaysama
Contributor Author

I rebased it. MiMa also wasn't happy with a private class in a trait, so I moved it out.

        throw new IndexOutOfBoundsException(s"start: ${_start}, end: ${_end}, length: $length")

        new SubSequence(s, start + _start, _end - _start)
    }
Contributor

sorry! one last nitpick: indentation is off-by-one here :)

Contributor Author

Oh. Fixed :-)

@adriaanm
Contributor

Thanks, LGTM!

@adriaanm
Contributor

Great! I've restarted the build on Travis. Looks like failure was on their end.

adriaanm added a commit that referenced this pull request Jun 25, 2014
SI-7710 fix memory performance of RegexParsers in jdk7u6+
@adriaanm adriaanm merged commit 9942de1 into scala:master Jun 25, 2014
@ceilican

@Ichoran @gourlaysama: maybe rep or rep1 also have a hidden O(n^2) behaviour? See this: http://stackoverflow.com/questions/23117635/why-is-scalas-combinator-parsing-slow-when-parsing-large-files-what-can-i-do

Note that, in the example discussed there, the lines parsed via regex were of equal length. I suspect that if the regex parser were replaced by a non-regex parser, the slow-down described there would still happen.

@gourlaysama
Contributor Author

@ceilican in that example, rep calls the regex-based parser line, hence the slowdown. But it doesn't call substring on anything (because it doesn't extract anything from a String; only the repeated parser does). A quick benchmark confirms it.

@ceilican

Great! Thanks! When will Scala 2.11.2 with this fix be released?

(This fix would allow my project to handle much bigger inputs. Right now I am wasting a lot of a cluster's compute resources because of this, so I would really be interested in a Scala 2.11.2 release as soon as possible.)

In case it will not be released soon, is there an easy way (e.g. by simply changing something in my project's build.sbt file) to benefit from this fix before it is released?

@gourlaysama
Contributor Author

scala-parser-combinators is now versioned separately, so
(theoretically) it doesn't have to follow Scala releases; a 1.0.2 could be
released before 2.11.2.

Until then, you can always build it from source, publish it locally and
depend on 1.0.2-SNAPSHOT, like you would any other normal dependency.
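For anyone wanting to try this before a release, the publish-locally route looks roughly like the following (the repository URL is from this thread; the exact snapshot version string is an assumption about what the build currently produces):

```shell
# Build scala-parser-combinators from source and publish it to the local Ivy repo.
git clone https://github.com/scala/scala-parser-combinators.git
cd scala-parser-combinators
sbt publishLocal

# Then, in your project's build.sbt, depend on the locally published snapshot,
# e.g. (group ID and version assumed):
#   libraryDependencies += "org.scala-lang.modules" %% "scala-parser-combinators" % "1.0.2-SNAPSHOT"
```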

@adriaanm I'd be happy to help with maintenance :-)

@adriaanm
Contributor

@gourlaysama, great! Thank you! I've added you to our Community Maintainers team. Use the power wisely ;-) We're always happy to help, so don't hesitate to ask when in doubt.

I'd suggest adding your email address to .travis.yml so that you're notified of build breakage.

Also, @gkossakowski is our infrastructure tsar this year. One task on his TODO list is to make releases tag-driven, so maintainers can cut releases easily. Feel free to prod him on this ;-)

@adriaanm
Contributor

Until we have automated releases, I'm happy to have you, as a maintainer, tag a release when it's ready, and ping @gkossakowski to publish it. This should be done before the PR freeze of the next Scala release (July 14 for 2.11.2) in order to have it included in the distro. While you're at it, please also bump the Scala & sbt versions to the latest stable release.

Tags really shouldn't be undone/changed, so please tread lightly.
