-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Optimized HashTable.index #5537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -1,11 +1,11 @@ | |||
scalaHome := Some(file("../../build/pack")) | |||
scalaVersion := "2.12.0-dev" | |||
scalaVersion := "2.12.1-dev" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these will disappear once #5528 is merged
def withIntegerBitCount(bh: Blackhole) { | ||
for (v <- powersOfTwo) { | ||
val leadingZeros = withIntegerBitCount(v) | ||
// assert (leadingZeros == withLoop(v), s"$leadingZeros != ${withLoop(v)} ($v)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asserts are meant for correctness checking, but there is no easy way to disable assertions yet
// https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn | ||
val multiplyDeBruijnBitPosition2 = Array(32, 31, 4, 30, 3, 18, 8, 29, 2, 10, 12, 17, 7, 15, 28, 24, 1, 5, 19, 9, 11, 13, 16, 25, 6, 20, 14, 26, 21, 27, 22, 23) | ||
|
||
def with2DeBruijn(v: Int) = multiplyDeBruijnBitPosition2((v * 0x077CB531) >>> 27) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this turned out to be the fastest impl
} | ||
} | ||
|
||
def withSumBinSearch(v: Int): Int = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tried this with the last two lines replaced with a lookup table -> no change
} | ||
} | ||
|
||
def withBinSearch(v: Int) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very proud of this contraption :D
cdfa0b9
to
d677823
Compare
val shifted = (improved >> (32 - java.lang.Integer.bitCount(ones))) & ones | ||
shifted | ||
val exponent = exponents((table.length * 0x077CB531) >>> 27) | ||
(improve(hcode, seedvalue) >>> exponent) & (table.length - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum ... this code (before and after) seems to assume that table.length
is an exact power of 2. I believe this is assumed elsewhere in the implementation.
If that is true, then exponent
is in fact equal to java.lang.Integer.numberOfTrailingZeros(table.length)
. Have you tried that? Or also 31 - java.lang.Integer.numberOfLeadingZeros(table.length)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assume that table.length is an exact power of 2
Indeed, d677823#diff-a24c1e91e1a44ff65f3da12201465cebR154 resizes this to be a power of two always (please verify).
numberOfTrailingZeros(table.length). Have you tried that?
Yes, please check out the benchmarks below:
d677823#diff-9ee4c1bdb2df6f45b8c03d48dd2b0b0dR41
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please check out the benchmarks below:
Oops, I had actually had a look at the benchmarks, but somehow my eyes did not see the one I was looking for ^^
This is fun--and I can't imagine the array lookup is actually faster than a proper LZCNT assembly instruction, so it indicates that the JIT compiler isn't actually emitting ideal assembly. That suggests to me that we shouldn't bake in a table lookup because (1) if the JIT compiler improves the table will be slower than the alternatives, and (2) if the loop is less hot, the table lookup will be more expensive since it won't be hanging out in L1 cache the entire time. But why not store tableLengthLeadingZeros and update it only when the table size changes (i.e. rarely)? That should be faster yet. Yes, it takes an extra 4 bytes of storage, but HashMap isn't exactly careful with how much it allocates as it is. |
d677823
to
70dbd3f
Compare
Valid concerns, @Ichoran! Edit: The good news: adding an |
70dbd3f
to
fc5cf70
Compare
@@ -80,7 +80,7 @@ sealed class HashMap[A, +B] extends AbstractMap[A, B] | |||
|
|||
protected def elemHashCode(key: A) = key.## | |||
|
|||
protected final def improve(hcode: Int) = { | |||
protected final def improve(hcode: Int) = { // TODO unify with HashUtils.improve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should change the hashing scheme in point releases for 2.12. Leave it alone until 2.13.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, is it ok if I leave the comment here?
rotated | ||
protected final def improve(hcode: Int, seed: Int): Int = { | ||
val i = scala.util.hashing.byteswap32(hcode) | ||
val rotation = seed & ((1 << 5) - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x % 32
is the same as x & 31
for positive numbers (which should be the case since we're using it for bit shifts below), just faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've deleted all the discussion about hash code refinement. Could you preserve enough so we have some idea of the thought process behind it (record of what worked & didn't)? This is an important choice, so it's nice to have some documentation of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other cases the history of a given method was stored in git, the method documentation or in the issue, corresponding to the change.
Here it seems like there are leftover comments that have nothing to do with the current state, which I found very confusing - especially since the code itself is not very complicated, though a documentation link would be welcome.
Could you please help me out in deciding which parts to keep and which to bury in git history instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The warning about the old bad algorithm seems particularly pertinent. Anything that says, "We already did this but it was bad", and it isn't obvious from inspection that it is bad, should be preserved. Everything else can go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Ichoran, reformatted the old comments as method comment, with separated code parts, hope you like it this way :)
fc5cf70
to
b2ecdc3
Compare
val rotated = (i >>> rotation) | (i << (32 - rotation)) | ||
rotated | ||
/** Murmur hash | ||
* {{{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't think we need most of this.
* var k = hcode * 0x5bd1e995 | ||
* k ^= k >> 24 | ||
* k * 0x5bd1e995 | ||
* }}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code doesn't explain whether this hash is good or bad. Can discard it.
* {{{ | ||
* i ^= i >> 6 | ||
* }}} | ||
* For performance reasons, we avoid this improvement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not using a multiplicative hash any more, so this lengthy explanation is obsolete.
* h ^= (h >> 2) | ||
* h += (h << 7) | ||
* h ^ (h >> 12) | ||
* }}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, doesn't say anything about whether the Jenkins Hash is good or bad. Can get rid of it.
* h = h ^ (h >>> 14) | ||
* h = h + (h << 4) | ||
* h ^ (h >>> 10) | ||
* }}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part is important: a fast and small algorithm that we actually used but in practice works poorly. "Don't do this!" Leave this in.
* h ^ (h >>> 10) | ||
* }}} | ||
* | ||
* the rest of the computation is due to SI-5293 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you checked if this is true? In any case, might just say, "Defer to a high-quality hash in scala.util.hashing
. The goal is to distribute across bins as well as possible even if a hash code has low entropy at some bits."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't part of my PR to come up with a new hashing, I haven't questioned the validity of the code or the comments.
Would gladly do it, if you think I can be of any service in this area :)
Applied your comments, thank you for your insight!
b2ecdc3
to
68ad3a7
Compare
68ad3a7
to
86f1898
Compare
(`ops/s`, smaller is better) `Before (9c5d3f8)`: ```scala [info] # Run complete. Total time: 00:08:15 [info] [info] Benchmark (size) Mode Cnt Score Error Units [info] s.c.immutable.VectorMapBenchmark.groupBy 10 avgt 20 645.594 ± 9.435 ns/op [info] s.c.immutable.VectorMapBenchmark.groupBy 100 avgt 20 2084.216 ± 37.814 ns/op [info] s.c.immutable.VectorMapBenchmark.groupBy 1000 avgt 20 19878.481 ± 262.404 ns/op [info] s.c.mutable.HashMapBenchmark.get 10 avgt 20 689.941 ± 5.850 ns/op [info] s.c.mutable.HashMapBenchmark.get 100 avgt 20 7357.330 ± 45.956 ns/op [info] s.c.mutable.HashMapBenchmark.get 1000 avgt 20 95767.200 ± 1550.771 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 10 avgt 20 509.181 ± 2.683 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 100 avgt 20 5563.301 ± 32.335 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 1000 avgt 20 71965.365 ± 1809.738 ns/op [info] s.c.mutable.HashMapBenchmark.put 10 avgt 20 247.270 ± 3.972 ns/op [info] s.c.mutable.HashMapBenchmark.put 100 avgt 20 5646.185 ± 106.172 ns/op [info] s.c.mutable.HashMapBenchmark.put 1000 avgt 20 81303.663 ± 954.938 ns/op ``` `Changed modulo to bitwise and in hash calculation (4c729fe)`: ```scala [info] Benchmark (size) Mode Cnt Score Error Units [info] s.c.immutable.VectorMapBenchmark.groupBy 10 avgt 20 631.291 ± 9.269 ns/op [info] s.c.immutable.VectorMapBenchmark.groupBy 100 avgt 20 2077.885 ± 59.737 ns/op [info] s.c.immutable.VectorMapBenchmark.groupBy 1000 avgt 20 15458.278 ± 317.347 ns/op [info] s.c.mutable.HashMapBenchmark.get 10 avgt 20 678.013 ± 4.453 ns/op [info] s.c.mutable.HashMapBenchmark.get 100 avgt 20 7258.522 ± 76.088 ns/op [info] s.c.mutable.HashMapBenchmark.get 1000 avgt 20 94748.845 ± 1226.120 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 10 avgt 20 498.042 ± 5.006 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 100 avgt 20 5243.154 ± 110.372 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 1000 avgt 20 68194.752 ± 655.436 ns/op [info] s.c.mutable.HashMapBenchmark.put 10 avgt 20 257.275 ± 1.411 ns/op [info] s.c.mutable.HashMapBenchmark.put 100 avgt 20 5318.532 ± 152.923 ns/op [info] s.c.mutable.HashMapBenchmark.put 1000 avgt 20 79607.160 ± 651.779 ns/op ``` `Optimized HashTable.index (6cc1504)`: ```scala [info] Benchmark (size) Mode Cnt Score Error Units [info] s.c.immutable.VectorMapBenchmark.groupBy 10 avgt 20 616.164 ± 4.712 ns/op [info] s.c.immutable.VectorMapBenchmark.groupBy 100 avgt 20 2034.447 ± 14.495 ns/op [info] s.c.immutable.VectorMapBenchmark.groupBy 1000 avgt 20 14712.164 ± 119.983 ns/op [info] s.c.mutable.HashMapBenchmark.get 10 avgt 20 679.046 ± 6.872 ns/op [info] s.c.mutable.HashMapBenchmark.get 100 avgt 20 7242.097 ± 41.244 ns/op [info] s.c.mutable.HashMapBenchmark.get 1000 avgt 20 95342.919 ± 1521.328 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 10 avgt 20 488.034 ± 4.554 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 100 avgt 20 4883.123 ± 59.268 ns/op [info] s.c.mutable.HashMapBenchmark.getOrElseUpdate 1000 avgt 20 65174.034 ± 496.759 ns/op [info] s.c.mutable.HashMapBenchmark.put 10 avgt 20 267.983 ± 1.797 ns/op [info] s.c.mutable.HashMapBenchmark.put 100 avgt 20 5097.351 ± 104.538 ns/op [info] s.c.mutable.HashMapBenchmark.put 1000 avgt 20 78772.540 ± 543.935 ns/op ``` Summary, i.e. the effect of this PR, according to the benchmarks: * `groupBy` has a `~35%` speedup * `get` didn't change * `getOrElseUpdate` has a `~10%` speedup * `put` has a `~3%` speedup Note: caching the `exponent` to a local private field (`Byte` or `Int`) didn't have any performance advantage (only a minor slowdown was measured, possibly because it's accessed via an interface now)
86f1898
to
a501444
Compare
val improved = improve(hcode, seedvalue) | ||
val shifted = (improved >> (32 - java.lang.Integer.bitCount(ones))) & ones | ||
shifted | ||
val exponent = Integer.numberOfLeadingZeros(ones) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ichoran, numberOfLeadingZeros
call is optimized to LZCNT
assembly call (see http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8045398):
0x00007f65791cdbc5: mov 0x10(%r13,%rbp,4),%r10d
0x00007f65791cdbca: lzcnt %r10d,%r10d ;*invokestatic numberOfLeadingZeros
; - javaslang.collection.VectorBenchmark$Test::scala_withIntegerNumberOfLeadingZeros@25 (line 33)
It remains an open question why DeBruijn
is faster in the benchmarks, though.
BTW, extracting it as a variable didn't improve the speed at all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Added
BitManipulationBenchmark
to measure alternative solutions (ns/op
, smaller is better):(*) previous impl
Integer.numberOfLeadingZeros
was chosen, as it's optimized to a singleLZCNT
assembly call.Benchmarks (
ops/s
, smaller is better)Before (9c5d3f8)
:Changed modulo to bitwise and in hash calculation (4c729fe)
:Optimized HashTable.index (6cc1504)
:Summary, i.e. the effect of this PR, according to the benchmarks:
groupBy
has a~35%
speedupget
didn't changegetOrElseUpdate
has a~10%
speedupput
has a~3%
speedupNote: caching the
exponent
to a local private field (Byte
orInt
) didn't have any performance advantage (only a minor slowdown was measured, possibly because it's accessed via an interface now)comparing it to
2.11.8
viawe get the following (
ops/s
, greater is better):after this PR
:i.e.
~20%
fasterVector.groupBy
than in2.11.8