Skip to content

Optimized HashTable.index #5537

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 30, 2016
Merged

Optimized HashTable.index #5537

merged 2 commits into from
Nov 30, 2016

Conversation

l0rinc
Copy link
Contributor

@l0rinc l0rinc commented Nov 17, 2016

Added BitManipulationBenchmark to measure alternative solutions (ns/op, smaller is better):

Benchmark                                                 Mode  Cnt    Score    Error  Units
BitManipulationBenchmark.with2DeBruijn                    avgt   20  252.940 ±  1.968  ns/op
BitManipulationBenchmark.withSumBinSearch                 avgt   20  295.604 ±  2.026  ns/op
BitManipulationBenchmark.withIntegerNumberOfLeadingZeros  avgt   20  298.597 ±  1.155  ns/op
BitManipulationBenchmark.withIntegerBitCount*             avgt   20  323.815 ±  1.351  ns/op
BitManipulationBenchmark.withBinSearch                    avgt   20  330.712 ± 12.232  ns/op
BitManipulationBenchmark.withMatch                        avgt   20  360.977 ±  3.469  ns/op
BitManipulationBenchmark.withLoop                         avgt   20  613.530 ±  9.326  ns/op

(*) previous impl

Integer.numberOfLeadingZeros was chosen, as it's optimized to a single LZCNT assembly call.


Benchmarks (ops/s, smaller is better)

Before (9c5d3f8):

Benchmark                                     (size)  Mode  Cnt      Score      Error  Units
s.c.immutable.VectorMapBenchmark.groupBy          10  avgt   20    645.594 ±    9.435  ns/op
s.c.immutable.VectorMapBenchmark.groupBy         100  avgt   20   2084.216 ±   37.814  ns/op
s.c.immutable.VectorMapBenchmark.groupBy        1000  avgt   20  19878.481 ±  262.404  ns/op
s.c.mutable.HashMapBenchmark.get                  10  avgt   20    689.941 ±    5.850  ns/op
s.c.mutable.HashMapBenchmark.get                 100  avgt   20   7357.330 ±   45.956  ns/op
s.c.mutable.HashMapBenchmark.get                1000  avgt   20  95767.200 ± 1550.771  ns/op
s.c.mutable.HashMapBenchmark.getOrElseUpdate      10  avgt   20    509.181 ±    2.683  ns/op
s.c.mutable.HashMapBenchmark.getOrElseUpdate     100  avgt   20   5563.301 ±   32.335  ns/op
s.c.mutable.HashMapBenchmark.getOrElseUpdate    1000  avgt   20  71965.365 ± 1809.738  ns/op
s.c.mutable.HashMapBenchmark.put                  10  avgt   20    247.270 ±    3.972  ns/op
s.c.mutable.HashMapBenchmark.put                 100  avgt   20   5646.185 ±  106.172  ns/op
s.c.mutable.HashMapBenchmark.put                1000  avgt   20  81303.663 ±  954.938  ns/op

Changed modulo to bitwise and in hash calculation (4c729fe):

Benchmark                                     (size)  Mode  Cnt      Score      Error  Units
s.c.immutable.VectorMapBenchmark.groupBy          10  avgt   20    631.291 ±    9.269  ns/op
s.c.immutable.VectorMapBenchmark.groupBy         100  avgt   20   2077.885 ±   59.737  ns/op
s.c.immutable.VectorMapBenchmark.groupBy        1000  avgt   20  15458.278 ±  317.347  ns/op
s.c.mutable.HashMapBenchmark.get                  10  avgt   20    678.013 ±    4.453  ns/op
s.c.mutable.HashMapBenchmark.get                 100  avgt   20   7258.522 ±   76.088  ns/op
s.c.mutable.HashMapBenchmark.get                1000  avgt   20  94748.845 ± 1226.120  ns/op
s.c.mutable.HashMapBenchmark.getOrElseUpdate      10  avgt   20    498.042 ±    5.006  ns/op
s.c.mutable.HashMapBenchmark.getOrElseUpdate     100  avgt   20   5243.154 ±  110.372  ns/op
s.c.mutable.HashMapBenchmark.getOrElseUpdate    1000  avgt   20  68194.752 ±  655.436  ns/op
s.c.mutable.HashMapBenchmark.put                  10  avgt   20    257.275 ±    1.411  ns/op
s.c.mutable.HashMapBenchmark.put                 100  avgt   20   5318.532 ±  152.923  ns/op
s.c.mutable.HashMapBenchmark.put                1000  avgt   20  79607.160 ±  651.779  ns/op

Optimized HashTable.index (6cc1504):

Benchmark                                     (size)  Mode  Cnt      Score      Error  Units
s.c.immutable.VectorMapBenchmark.groupBy          10  avgt   20    616.164 ±    4.712  ns/op
s.c.immutable.VectorMapBenchmark.groupBy         100  avgt   20   2034.447 ±   14.495  ns/op
s.c.immutable.VectorMapBenchmark.groupBy        1000  avgt   20  14712.164 ±  119.983  ns/op
s.c.mutable.HashMapBenchmark.get                  10  avgt   20    679.046 ±    6.872  ns/op
s.c.mutable.HashMapBenchmark.get                 100  avgt   20   7242.097 ±   41.244  ns/op
s.c.mutable.HashMapBenchmark.get                1000  avgt   20  95342.919 ± 1521.328  ns/op
s.c.mutable.HashMapBenchmark.getOrElseUpdate      10  avgt   20    488.034 ±    4.554  ns/op
s.c.mutable.HashMapBenchmark.getOrElseUpdate     100  avgt   20   4883.123 ±   59.268  ns/op
s.c.mutable.HashMapBenchmark.getOrElseUpdate    1000  avgt   20  65174.034 ±  496.759  ns/op
s.c.mutable.HashMapBenchmark.put                  10  avgt   20    267.983 ±    1.797  ns/op
s.c.mutable.HashMapBenchmark.put                 100  avgt   20   5097.351 ±  104.538  ns/op
s.c.mutable.HashMapBenchmark.put                1000  avgt   20  78772.540 ±  543.935  ns/op

Summary, i.e. the effect of this PR, according to the benchmarks:

  • groupBy has a ~35% speedup
  • get didn't change
  • getOrElseUpdate has a ~10% speedup
  • put has a ~3% speedup

Note: caching the exponent to a local private field (Byte or Int) didn't have any performance advantage (only a minor slowdown was measured, possibly because it's accessed via an interface now)


comparing it to 2.11.8 via

@Benchmark public Object scala_persistent_some() { return scalaPersistent.groupBy(Integer::bitCount); }

we get the following (ops/s, greater is better):

Benchmark      (CONTAINER_SIZE)   Mode  Cnt       Score      Error  Units
VectorBenchmar               10  thrpt   30  613353.461 ± 5432.843  ops/s
VectorBenchmar              100  thrpt   30  213606.187 ± 5014.126  ops/s
VectorBenchmar             1000  thrpt   30   32443.292 ±  328.302  ops/s

after this PR:

Benchmark      (CONTAINER_SIZE)   Mode  Cnt       Score      Error  Units
VectorBenchmark              10  thrpt   30  658523.151 ± 6456.227  ops/s
VectorBenchmark             100  thrpt   30  241050.196 ± 1416.711  ops/s
VectorBenchmark            1000  thrpt   30   40219.643 ±  268.688  ops/s

i.e. ~20% faster Vector.groupBy than in 2.11.8

@scala-jenkins scala-jenkins added this to the 2.12.2 milestone Nov 17, 2016
@@ -1,11 +1,11 @@
scalaHome := Some(file("../../build/pack"))
scalaVersion := "2.12.0-dev"
scalaVersion := "2.12.1-dev"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these will disappear once #5528 is merged

def withIntegerBitCount(bh: Blackhole) {
for (v <- powersOfTwo) {
val leadingZeros = withIntegerBitCount(v)
// assert (leadingZeros == withLoop(v), s"$leadingZeros != ${withLoop(v)} ($v)")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asserts are meant for correctness checking, but there is no easy way to disable assertions yet

// https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn
val multiplyDeBruijnBitPosition2 = Array(32, 31, 4, 30, 3, 18, 8, 29, 2, 10, 12, 17, 7, 15, 28, 24, 1, 5, 19, 9, 11, 13, 16, 25, 6, 20, 14, 26, 21, 27, 22, 23)

def with2DeBruijn(v: Int) = multiplyDeBruijnBitPosition2((v * 0x077CB531) >>> 27)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this turned out to be the fastest impl

}
}

def withSumBinSearch(v: Int): Int = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried this with the last two lines replaced with a lookup table -> no change

}
}

def withBinSearch(v: Int) =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very proud of this contraption :D

val shifted = (improved >> (32 - java.lang.Integer.bitCount(ones))) & ones
shifted
val exponent = exponents((table.length * 0x077CB531) >>> 27)
(improve(hcode, seedvalue) >>> exponent) & (table.length - 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum ... this code (before and after) seems to assume that table.length is an exact power of 2. I believe this is assumed elsewhere in the implementation.

If that is true, then exponent is in fact equal to java.lang.Integer.numberOfTrailingZeros(table.length). Have you tried that? Or also 31 - java.lang.Integer.numberOfLeadingZeros(table.length)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assume that table.length is an exact power of 2

Indeed, d677823#diff-a24c1e91e1a44ff65f3da12201465cebR154 resizes this to be a power of two always (please verify).

numberOfTrailingZeros(table.length). Have you tried that?

Yes, please check out the benchmarks below:
d677823#diff-9ee4c1bdb2df6f45b8c03d48dd2b0b0dR41

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please check out the benchmarks below:

Oops, I had actually had a look at the benchmarks, but somehow my eyes did not see the one I was looking for ^^

@Ichoran
Copy link
Contributor

Ichoran commented Nov 17, 2016

This is fun--and I can't imagine the array lookup is actually faster than a proper LZCNT assembly instruction, so it indicates that the JIT compiler isn't actually emitting ideal assembly. That suggests to me that we shouldn't bake in a table lookup because (1) if the JIT compiler improves the table will be slower than the alternatives, and (2) if the loop is less hot, the table lookup will be more expensive since it won't be hanging out in L1 cache the entire time.

But why not store tableLengthLeadingZeros and update it only when the table size changes (i.e. rarely)? That should be faster yet. Yes, it takes an extra 4 bytes of storage, but HashMap isn't exactly careful with how much it allocates as it is.

@l0rinc
Copy link
Contributor Author

l0rinc commented Nov 17, 2016

Valid concerns, @Ichoran!
I pushed a version with the built-in leading zeros counter, and will verify whether HashMap's object layout permits an additional byte for free, and post the benchmark results here, so that we can take a informed decision :).
Thanks for all your reviews! :)


Edit: The good news: adding an exponent: Byte doesn't modify the object size, displayed by JOL: new HashMap -> 120 bytes.
The bad news: somehow most methods got slightly slower (except for groupBy). Will try to investigate tomorrow :/

@l0rinc l0rinc changed the title Changed HashTable.index to DeBruijn's integer log2 calculation (~30% faster) Optimized HashTable.index Nov 18, 2016
@@ -80,7 +80,7 @@ sealed class HashMap[A, +B] extends AbstractMap[A, B]

protected def elemHashCode(key: A) = key.##

protected final def improve(hcode: Int) = {
protected final def improve(hcode: Int) = { // TODO unify with HashUtils.improve
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ichoran, @retronym, should I unify the two, leave the comment or delete it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should change the hashing scheme in point releases for 2.12. Leave it alone until 2.13.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, is it ok if I leave the comment here?

rotated
protected final def improve(hcode: Int, seed: Int): Int = {
val i = scala.util.hashing.byteswap32(hcode)
val rotation = seed & ((1 << 5) - 1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x % 32 is the same as x & 31 for positive numbers (which should be the case since we're using it for bit shifts below), just faster

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've deleted all the discussion about hash code refinement. Could you preserve enough so we have some idea of the thought process behind it (record of what worked & didn't)? This is an important choice, so it's nice to have some documentation of it.

Copy link
Contributor Author

@l0rinc l0rinc Nov 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other cases the history of a given method was stored in git, the method documentation or in the issue, corresponding to the change.

Here it seems like there are leftover comments that have nothing to do with the current state, which I found very confusing - especially since the code itself is not very complicated, though a documentation link would be welcome.

Could you please help me out in deciding which parts to keep and which to bury in git history instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning about the old bad algorithm seems particularly pertinent. Anything that says, "We already did this but it was bad", and it isn't obvious from inspection that it is bad, should be preserved. Everything else can go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Ichoran, reformatted the old comments as method comment, with separated code parts, hope you like it this way :)

val rotated = (i >>> rotation) | (i << (32 - rotation))
rotated
/** Murmur hash
* {{{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't think we need most of this.

* var k = hcode * 0x5bd1e995
* k ^= k >> 24
* k * 0x5bd1e995
* }}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code doesn't explain whether this hash is good or bad. Can discard it.

* {{{
* i ^= i >> 6
* }}}
* For performance reasons, we avoid this improvement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not using a multiplicative hash any more, so this lengthy explanation is obsolete.

* h ^= (h >> 2)
* h += (h << 7)
* h ^ (h >> 12)
* }}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, doesn't say anything about whether the Jenkins Hash is good or bad. Can get rid of it.

* h = h ^ (h >>> 14)
* h = h + (h << 4)
* h ^ (h >>> 10)
* }}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is important: a fast and small algorithm that we actually used but in practice works poorly. "Don't do this!" Leave this in.

* h ^ (h >>> 10)
* }}}
*
* the rest of the computation is due to SI-5293
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you checked if this is true? In any case, might just say, "Defer to a high-quality hash in scala.util.hashing. The goal is to distribute across bins as well as possible even if a hash code has low entropy at some bits."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't part of my PR to come up with a new hashing, I haven't questioned the validity of the code or the comments.
Would gladly do it, if you think I can be of any service in this area :)

Applied your comments, thank you for your insight!

(`ops/s`, smaller is better)

`Before (9c5d3f8)`:
```scala
[info] # Run complete. Total time: 00:08:15
[info]
[info] Benchmark                                     (size)  Mode  Cnt      Score      Error  Units
[info] s.c.immutable.VectorMapBenchmark.groupBy          10  avgt   20    645.594 ±    9.435  ns/op
[info] s.c.immutable.VectorMapBenchmark.groupBy         100  avgt   20   2084.216 ±   37.814  ns/op
[info] s.c.immutable.VectorMapBenchmark.groupBy        1000  avgt   20  19878.481 ±  262.404  ns/op
[info] s.c.mutable.HashMapBenchmark.get                  10  avgt   20    689.941 ±    5.850  ns/op
[info] s.c.mutable.HashMapBenchmark.get                 100  avgt   20   7357.330 ±   45.956  ns/op
[info] s.c.mutable.HashMapBenchmark.get                1000  avgt   20  95767.200 ± 1550.771  ns/op
[info] s.c.mutable.HashMapBenchmark.getOrElseUpdate      10  avgt   20    509.181 ±    2.683  ns/op
[info] s.c.mutable.HashMapBenchmark.getOrElseUpdate     100  avgt   20   5563.301 ±   32.335  ns/op
[info] s.c.mutable.HashMapBenchmark.getOrElseUpdate    1000  avgt   20  71965.365 ± 1809.738  ns/op
[info] s.c.mutable.HashMapBenchmark.put                  10  avgt   20    247.270 ±    3.972  ns/op
[info] s.c.mutable.HashMapBenchmark.put                 100  avgt   20   5646.185 ±  106.172  ns/op
[info] s.c.mutable.HashMapBenchmark.put                1000  avgt   20  81303.663 ±  954.938  ns/op
```

`Changed modulo to bitwise and in hash calculation (4c729fe)`:
```scala
[info] Benchmark                                     (size)  Mode  Cnt      Score      Error  Units
[info] s.c.immutable.VectorMapBenchmark.groupBy          10  avgt   20    631.291 ±    9.269  ns/op
[info] s.c.immutable.VectorMapBenchmark.groupBy         100  avgt   20   2077.885 ±   59.737  ns/op
[info] s.c.immutable.VectorMapBenchmark.groupBy        1000  avgt   20  15458.278 ±  317.347  ns/op
[info] s.c.mutable.HashMapBenchmark.get                  10  avgt   20    678.013 ±    4.453  ns/op
[info] s.c.mutable.HashMapBenchmark.get                 100  avgt   20   7258.522 ±   76.088  ns/op
[info] s.c.mutable.HashMapBenchmark.get                1000  avgt   20  94748.845 ± 1226.120  ns/op
[info] s.c.mutable.HashMapBenchmark.getOrElseUpdate      10  avgt   20    498.042 ±    5.006  ns/op
[info] s.c.mutable.HashMapBenchmark.getOrElseUpdate     100  avgt   20   5243.154 ±  110.372  ns/op
[info] s.c.mutable.HashMapBenchmark.getOrElseUpdate    1000  avgt   20  68194.752 ±  655.436  ns/op
[info] s.c.mutable.HashMapBenchmark.put                  10  avgt   20    257.275 ±    1.411  ns/op
[info] s.c.mutable.HashMapBenchmark.put                 100  avgt   20   5318.532 ±  152.923  ns/op
[info] s.c.mutable.HashMapBenchmark.put                1000  avgt   20  79607.160 ±  651.779  ns/op

```

`Optimized HashTable.index (6cc1504)`:
```scala
[info] Benchmark                                     (size)  Mode  Cnt      Score      Error  Units
[info] s.c.immutable.VectorMapBenchmark.groupBy          10  avgt   20    616.164 ±    4.712  ns/op
[info] s.c.immutable.VectorMapBenchmark.groupBy         100  avgt   20   2034.447 ±   14.495  ns/op
[info] s.c.immutable.VectorMapBenchmark.groupBy        1000  avgt   20  14712.164 ±  119.983  ns/op
[info] s.c.mutable.HashMapBenchmark.get                  10  avgt   20    679.046 ±    6.872  ns/op
[info] s.c.mutable.HashMapBenchmark.get                 100  avgt   20   7242.097 ±   41.244  ns/op
[info] s.c.mutable.HashMapBenchmark.get                1000  avgt   20  95342.919 ± 1521.328  ns/op
[info] s.c.mutable.HashMapBenchmark.getOrElseUpdate      10  avgt   20    488.034 ±    4.554  ns/op
[info] s.c.mutable.HashMapBenchmark.getOrElseUpdate     100  avgt   20   4883.123 ±   59.268  ns/op
[info] s.c.mutable.HashMapBenchmark.getOrElseUpdate    1000  avgt   20  65174.034 ±  496.759  ns/op
[info] s.c.mutable.HashMapBenchmark.put                  10  avgt   20    267.983 ±    1.797  ns/op
[info] s.c.mutable.HashMapBenchmark.put                 100  avgt   20   5097.351 ±  104.538  ns/op
[info] s.c.mutable.HashMapBenchmark.put                1000  avgt   20  78772.540 ±  543.935  ns/op
```

Summary, i.e. the effect of this PR, according to the benchmarks:
* `groupBy` has a `~35%` speedup
* `get` didn't change
* `getOrElseUpdate` has a `~10%` speedup
* `put` has a `~3%` speedup

Note: caching the `exponent` to a local private field (`Byte` or `Int`) didn't have any performance advantage (only a minor slowdown was measured, possibly because it's accessed via an interface now)
val improved = improve(hcode, seedvalue)
val shifted = (improved >> (32 - java.lang.Integer.bitCount(ones))) & ones
shifted
val exponent = Integer.numberOfLeadingZeros(ones)
Copy link
Contributor Author

@l0rinc l0rinc Nov 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ichoran, numberOfLeadingZeros call is optimized to LZCNT assembly call (see http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8045398):

0x00007f65791cdbc5: mov    0x10(%r13,%rbp,4),%r10d
0x00007f65791cdbca: lzcnt  %r10d,%r10d        ;*invokestatic numberOfLeadingZeros
                                              ; - javaslang.collection.VectorBenchmark$Test::scala_withIntegerNumberOfLeadingZeros@25 (line 33)

It remains an open question why DeBruijn is faster in the benchmarks, though.
BTW, extracting it as a variable didn't improve the speed at all

Copy link
Contributor

@Ichoran Ichoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@retronym retronym modified the milestones: 2.12.1, 2.12.2 Nov 30, 2016
@retronym retronym merged commit 711e261 into scala:2.12.x Nov 30, 2016
@l0rinc l0rinc deleted the hashTableIndex branch November 30, 2016 08:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants