-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Improved runtime speed for Vector, restoring previous performance. #5516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -466,24 +466,24 @@ override def companion: GenericCompanion[Vector] = Vector | |||
display5 = copyRange(display5, oldLeft, newLeft) | |||
} | |||
|
|||
private def zeroLeft(array: Array[AnyRef], index: Int): Unit = { | |||
private[this] def zeroLeft(array: Array[AnyRef], index: Int): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
is also nice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either one does the trick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I reran the original tests and I'm not 100% sure it makes any difference at all. If it's important I should rerun them later, maybe with JMH. I've been trying to iterate quickly, so have been using Thyme which is more subject to fluctuations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"it" meaning having the method be final, private[this], or just private like it used to be. In any case the effects are small in my hands (~5%).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bytecode for a private[this]
method is exactly the same as for private
, so this should not make any difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my experiment only avoiding the pre-null-ing of arrays had any performance impact (i.e. the array allocation and copying should be next to each other, otherwise the array is null-ed first).
Before:
VectorBenchmark.Slice.scala_persistent thrpt 4193.769 ops/s
After:
VectorBenchmark.Slice.scala_persistent thrpt 10306.275 ops/s
i.e. 2.5x faster, congrats @Ichoran
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lrytz - Yeah, I see that. It does (or has--not sure about now) occasionally make a difference with fields and it was faster for me to check the benchmark than the bytecode. But it seems there was just a benchmarking fluctuation that made everything I did after the one bad benchmark seem like it was helping (a tiny bit).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -466,24 +466,24 @@ override def companion: GenericCompanion[Vector] = Vector | |||
display5 = copyRange(display5, oldLeft, newLeft) | |||
} | |||
|
|||
private def zeroLeft(array: Array[AnyRef], index: Int): Unit = { | |||
private[this] def zeroLeft(array: Array[AnyRef], index: Int): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my experiment only avoiding the pre-null-ing of arrays had any performance impact (i.e. the array allocation and copying should be next to each other, otherwise the array is null-ed first).
Before:
VectorBenchmark.Slice.scala_persistent thrpt 4193.769 ops/s
After:
VectorBenchmark.Slice.scala_persistent thrpt 10306.275 ops/s
i.e. 2.5x faster, congrats @Ichoran
@paplorinc - Which change specifically was the improvement? I tested the java.util.Arrays.copyOf one extensively, but the others I just threw in at the last moment as a benchmark indicated there was a small residual slowdown. It seemed to work so I left it, but it sounds like it's unnecessary and should be removed? |
HotSpot optimizes The extra null checks and/or other small differences in bytecode shapes between 2.11.8 and 2.12.0 probably hampered this optimization. Would be interesting to see if the 2.11.8 benchmark slows down when this JIT optimization is disabled ( |
Platform.arraycopy(a, 0, b, 0, a.length) | ||
b | ||
} | ||
private[immutable] final def copyOf(a: Array[AnyRef]) = java.util.Arrays.copyOf(a, a.length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will go through:
public static <T,U> T[] copyOf(U[] original, int newLength, Class<? extends T[]> newType) {
@SuppressWarnings("unchecked")
T[] copy = ((Object)newType == (Object)Object[].class)
? (T[]) new Object[newLength]
: (T[]) Array.newInstance(newType.getComponentType(), newLength);
System.arraycopy(original, 0, copy, 0,
Math.min(original.length, newLength));
return copy;
}
It might be better to just call new Array[AnyRef]; System.arrayCopy
here to avoid passing through the generic code in copyOf
, unless we're sure that HotSpot can elide the array zeroing for it as well (in cases where it has been used by other callers so both branches of the conditional are live.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@retronym - On my machine, the copyOf version is actually better-optimized than the new Array/arraycopy. I don't see anything left in the assembly of the object type test, and it looks a lot like the 2.11.8 version.
@retronym - Keeping bulk zeroing does make a big difference on 2.11--about 50% on my machine, which is a majority of the slowdown observed. There are measurable residual effects, but those are a few percent at most. 2.12 with the copyOf = java.util.Arrays.copyOf fix, plus the inlined System.arraycopy, works the same way, while 2.12.0 is barely affected by -XX:-ReduceBulkZeroing. |
Here are some benchmark numbers:
Error is about +- 0.5%. copyOf is just changing copyOf to forward to java.util.Arrays.copyOf, while -Platform is replacing all Platform.arraycopy calls with java.lang.System.arraycopy. I don't think there's a meaningful difference between the two (I don't think I was systematic enough before), so I think @retronym is right that we should just stick to new Array/System.arraycopy. |
All calls to Platform.arraycopy were rewritten as java.lang.System.arraycopy to reduce the work that the JIT compiler has to do to produce optimized bytecode that avoids zeroing just-allocated arrays that are about to be copied over. (Tested with -XX:-ReduceBulkZeroing as suggested by retronym.)
3f70b46
to
e5fd42d
Compare
Here are some more instances of the same pattern that could be changed. scala/src/library/scala/collection/mutable/ArrayBuffer.scala Lines 69 to 70 in 23203f7
scala/src/reflect/scala/reflect/internal/BaseTypeSeqs.scala Lines 102 to 103 in a91c81f
scala/src/reflect/scala/reflect/internal/Names.scala Lines 70 to 71 in 92affd3
I'm also wondering whether we should make the inliner more aggressive (eliminated the null check and maybe even the module load) for objects inlined from some whitelist, e.g |
@retronym - I was just about to mention the library ones there, plus one in mutable.PriorityQueue also that is not exactly the same pattern but could become it if optimized/inlined enough. Not sure if I should fix them all (just by doing Platform -> System) as part of this patch? I guess there's little harm in it. |
+1 for inlining all callers of |
Thanks for the running those benchmarks and presenting the results so clearly! |
…arraycopy to avoid the same kind of slowdowns that Vector was experiencing due to the less aggressive inlining by scalac.
@retronym - Got them all, I think. |
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree that these changes solve the issue, I think this is a lot of workaround for a known root cause (i.e. symptomatic treatment for the Scala inliner working differently).
If we solve this locally, the clients of @inline will still be affected.
Since there is a very specific test for the inlining of Platform.arrayCopy
, I suggest we look there for why it's passing:
def arraycopy(): Unit = { |
@@ -102,7 +101,7 @@ object Array extends FallbackArrayBuilding { | |||
def copy(src: AnyRef, srcPos: Int, dest: AnyRef, destPos: Int, length: Int) { | |||
val srcClass = src.getClass | |||
if (srcClass.isArray && dest.getClass.isAssignableFrom(srcClass)) | |||
arraycopy(src, srcPos, dest, destPos, length) | |||
java.lang.System.arraycopy(src, srcPos, dest, destPos, length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this probably isn't affected anyway, as the array that gets here is already null
-ed out.
But at least we're consistent :) (in which case Platform.arraycopy
should probably be removed).
Also note, that in other cases (e.g.
System.arraycopy(b, offset, dest, 0, len) |
System
reference is not fully qualified.
@@ -955,7 +954,7 @@ private[immutable] trait VectorPointer[T] { | |||
|
|||
private[immutable] final def copyOf(a: Array[AnyRef]) = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: I think that the signature of copyLeft
, copyRight
, copyRange
and copyOf
should be the same (except for the name, of course), so maybe consider removing the [immutable] final
modifiers here (or add them to the rest also)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going for a bug fix with no visible changes (and binary compatibility), so while I agree that it's inconsistent, I don't think this is the time to change it.
@@ -478,12 +477,12 @@ override def companion: GenericCompanion[Vector] = Vector | |||
// if (array eq null) | |||
// println("OUCH!!! " + right + "/" + depth + "/"+startIndex + "/" + endIndex + "/" + focus) | |||
val a2 = new Array[AnyRef](array.length) | |||
Platform.arraycopy(array, 0, a2, 0, right) | |||
java.lang.System.arraycopy(array, 0, a2, 0, right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
a2 | ||
} | ||
private def copyRight(array: Array[AnyRef], left: Int): Array[AnyRef] = { | ||
val a2 = new Array[AnyRef](array.length) | ||
Platform.arraycopy(array, left, a2, left, a2.length - left) | ||
java.lang.System.arraycopy(array, left, a2, left, a2.length - left) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -955,7 +954,7 @@ private[immutable] trait VectorPointer[T] { | |||
|
|||
private[immutable] final def copyOf(a: Array[AnyRef]) = { | |||
val b = new Array[AnyRef](a.length) | |||
Platform.arraycopy(a, 0, b, 0, a.length) | |||
java.lang.System.arraycopy(a, 0, b, 0, a.length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -331,8 +331,8 @@ sealed class PriorityQueue[A](implicit val ord: Ordering[A]) | |||
val pq = new PriorityQueue[A] | |||
val n = resarr.p_size0 | |||
pq.resarr.p_ensureSize(n) | |||
java.lang.System.arraycopy(resarr.p_array, 1, pq.resarr.p_array, 1, n-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will avoid nulling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not now, but it's more amenable to future optimization heuristics if there are no intervening instructions.
@@ -101,7 +101,7 @@ trait ResizableArray[A] extends IndexedSeq[A] | |||
if (newSize > Int.MaxValue) newSize = Int.MaxValue | |||
|
|||
val newArray: Array[AnyRef] = new Array(newSize.toInt) | |||
scala.compat.Platform.arraycopy(array, 0, newArray, 0, size0) | |||
java.lang.System.arraycopy(array, 0, newArray, 0, size0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to change all Platform.arrayCopy
references, consider changing line 120 also
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I thought I did.
@@ -100,7 +100,7 @@ trait BaseTypeSeqs { | |||
|
|||
def copy(head: Type, offset: Int): BaseTypeSeq = { | |||
val arr = new Array[Type](elems.length + offset) | |||
scala.compat.Platform.arraycopy(elems, 0, arr, offset, elems.length) | |||
java.lang.System.arraycopy(elems, 0, arr, offset, elems.length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -68,7 +68,7 @@ trait Names extends api.Names { | |||
while (i < len) { | |||
if (nc + i == chrs.length) { | |||
val newchrs = new Array[Char](chrs.length * 2) | |||
scala.compat.Platform.arraycopy(chrs, 0, newchrs, 0, chrs.length) | |||
java.lang.System.arraycopy(chrs, 0, newchrs, 0, chrs.length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -220,7 +220,7 @@ trait Names extends api.Names { | |||
|
|||
/** Copy bytes of this name to buffer cs, starting at position `offset`. */ | |||
final def copyChars(cs: Array[Char], offset: Int) = | |||
scala.compat.Platform.arraycopy(chrs, index, cs, offset, len) | |||
java.lang.System.arraycopy(chrs, index, cs, offset, len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, no allocation here.
Unless you decide to remove Platform.arraycopy
completely, I don't think the changes where there's no allocation, just a copy are worth the trouble :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't remove Platform.arraycopy
in 2.12.1 for compatibility constraints. But I support the move to uniformly avoid using it, which is a bit simpler than doing so selectively for particular code patterns.
@Ichoran, please update the title and description of the PR to reflect the new changes also :) |
The call is inlined, but there is an additional check that |
Perhaps a dumb question, but with 2.12 being Java8+ wouldn't invokedynamic Cheers, On Nov 10, 2016 12:12, "Jason Zaugg" [email protected] wrote:
|
I think there are two issues here that could probably be addressed separately:
|
Deprecation of this method (and maybe That said, our first priority is to fix the performance regression in 2.12.1, so I'm happy with this PR in it current state to meet that goal. |
Null checks that are always false are basically free in C2 compiled code. This case is really quite special, the null check itself isn't expensive, but it changes the bytecode shape for the JIT optimization of array instantation/copy. |
Yes, that should indeed be the case. But we have to make sure the inliner preserves semantics, so a null check on the receiver is necessary, unless the compiler can prove that the receiver cannot be null. In the case of a module like A simple analysis on module classfiles to know that their initialization doesn't have side-effects (only methods, no constructor statements, superclass Related, there's also scala/scala-dev#112. |
Why must an inliner preserve incidental semantics of side-effects not directly referenced in the method itself? I would be totally fine with this giving 35 and not printing anything: class Foo { println("I just made a foo!"); @inline times(a: Int, b: Int) = a*b }
val f: Foo = null
f.times(5, 7) |
I agree with @Ichoran, people would expect a lot simpler behavior for such a simple concept. |
I think it should, this seems to be also a core principle of the JVM. Making code execute in a way that is not according to the language spec can lead to bugs / behaviors that are surprising / hard to understand. Of course I'm all for optimizing more, but I think we should keep to the spec. I think there's a lot of room to do more even with this restriction. It's true that we're not very consistent at the moment either: scala/scala-dev#112 is an example. But for example we also defer class loading (when inlining a static method, or eliminating unreachable code), which can cause observable differences (defer side-effects, eliminate deadlocks). I think we never really discussed and defined precisely where to draw the line. |
The problem with preserving incidental semantics is that they're incidental. Very often you don't care whether a module is initialized or not. You just want to execute some code, and there is no way to run it without dragging in module initialization. If there is no way to avoid incidentals that you don't want, you're forced to do silly time-wasting stuff like inline code by hand, a purely mechanical process that you're going through because your language won't do it for you. I agree that it's tricky to know what is incidental and what is intended, and I also agree that it can be very surprising when behavior changes based upon inlining or not (because of incidental behavior). But it's also important, as this slowdown has illustrated. |
I'm happy that the followup ideas to improve the optimizer are covered by those tickets @lrytz linked to, so I'm going to merge this one. Thanks for your detailed review, @paplorinc! |
@retronym - I hadn't had a chance to catch a Platform that somehow I missed in ResizableArray, but I guess that's not actually worth waiting for. Not sure that copy method is ever used, and if it is, it's not necessarily going to be next to an allocation. (I should have just used sed rather than an interactive editor....) |
I've checked the rest of the affected methods (i.e. not just Using the Javaslang collection benchmarks with Benchmark (CONTAINER_SIZE) Mode Score Error Units
VectorBenchmark.GroupBy.scala_persistent 100 thrpt 215444.420 ±9500.270 ops/s and Benchmark (CONTAINER_SIZE) Mode Score Error Units
VectorBenchmark.GroupBy.scala_persistent 100 thrpt 173771.549 ±2979.146 ops/s Running the rest of the benchmarks, I noticed that Benchmark (CONTAINER_SIZE) Mode Score Error Units
ListBenchmark.Update.scala_mutable 1000 thrpt 979.478 ±9.109 ops/s and Benchmark (CONTAINER_SIZE) Mode Score Error Units
ListBenchmark.Update.scala_mutable 1000 thrpt 520.756 ±7.598 ops/s I will look into these in the weekend :) |
@paplorinc @lrytz - There are definitely other problems out there. GroupBy is slow in general. I haven't yet tracked down whether it's just builders or whether it's mutable map getOrElseUpdate also, but there's definitely a builder component because the slowdown is a function of the collection type even when there aren't very many distinct keys (so it's mostly key lookup and adding to builders). I'll write a ticket when I have time, and try to fix it this weekend. If it turns out to be more optimizer stuff, I worry that we're going to be putting out small fires all over the place until we have a more aggressive inlining strategy. |
@paplorinc - I have the groupBy reported at https://issues.scala-lang.org/browse/SI-10049 Please comment there if you have details to add |
Thanks, will sign up and comment. Until now I've found that Adding a |
@paplorinc - Let's continue on the ticket. I think there must be some details that differ, as I don't see a slowdown in |
copyOf was redirected to java.util.Arrays.copyOf to reduce the work that the JIT compiler has to do to make this work well.
zeroLeft, zeroRight, copyLeft, and copyRight were made private[this] to improve performance (presumably by encouraging the JIT compiler to try harder despite using the new trait encoding scheme).
Checks of performance were conducted with Thyme to aid iteration speed.