-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[WIP] Hash table unification #18790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Hash table unification #18790
Conversation
17afb09
to
abab947
Compare
stdlib/public/core/Set.swift
Outdated
|
||
extension _SwiftEmptySetStorage: _NSSetCore { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI we did find it worthwhile in Foundation to have 1 element variants as well in addition to this strategy of a zero element container.
Couldn't this be a singleton?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a singleton -- the sole instance is created by the Swift runtime in _swiftEmptySetStorage
(lowercased). (It's done by the runtime because we can't currently define globals in the stdlib.) It'd be a good idea to name the class _SwiftEmptySetSingleton
; I wonder why I haven't thought of that before!
Our storage classes themselves are really simple -- they are really just there for reference counting their tail-allocated buffers. (And to implement the NSSet/NSDictionary/NSArray interfaces, which is merely a bridging optimization, eliminating the need for allocating a wrapper object around them.)
Most of the native logic is in the wrapper structs/enums that make up most of this file. We don't use subtype polymorphism on the native Swift path, which makes things super efficient as long as we only have an extremely limited number of inlinable fast paths. (1 in most cases, except String which comes in a dozen or so flavors.) :)
Every additional case has to be individually checked for at runtime, so it's only worth adding a new case if it brings a huge potential improvement. Since storage is tail-allocated, there wouldn't be much difference between a single-element _SwiftNativeSetStorage
and a dedicated _SwiftSingleElementSetStorage
class. (Memory overhead might be smaller if the latter removes the hash table bookkeeping info, but that means Set operations wouldn't be inlinable at all, or they all need to start with a dynamic type check -- both of which would probably slow things down too much.)
Instead of defining a single-element storage class, it might be interesting to create a small set representation, where an element can be stored directly in place of the storage reference. (As long as it fits in a pointer-sized word minus some bits, of course.) However, this wouldn't work for Dictionary, so it's probably not worth trying it.
Another similar idea I'm experimenting with is to disable hashing altogether for small sets and dictionaries (say, less than 4-16 elements) -- basically reverting them to unsorted arrays. This could have a large impact, as calculating a single hash value can be many times slower than comparing unequal elements. However, adding that tiny extra condition to every operation changes inlining behavior such that I see significant regressions in Dictionary performance. This PR significantly shrinks Set's inlinable code size; we'll see if this will make this a more viable idea!
stdlib/public/core/Set.swift
Outdated
fatalError(""" | ||
Duplicate elements of type '\(elementType)' were found in a Set. | ||
This usually means that either the type violates Hashable's requirements, or | ||
that members of such a set were mutated after insertion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the case we have seen often in Foundation is where an element of a NSMutableSet has a non-stable hash value. is this function breakpointable to help developers figure out why they are getting this failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the function is declared internal, the @usableFromInline
attribute means there will still be a symbol generated for it -- so it'll be breakpointable. (It essentially makes the symbol public on the ABI level, so lldb will see it.)
I'll have to play with this a little to make sure the idea is sound. Multiline error messages may cause issues. Also, I think fatalError
generates a bunch of stack frames before aborting, so I'm not sure if this will be as visible as I hope. Plus I think it may be useful to print one of the duplicate values (passed in as an Any
parameter), along with the type -- although that may be a step too far.
e77323b
to
3834a2a
Compare
@swift-ci please test |
Build failed |
@swift-ci benchmark |
Build failed |
Build comment file:Optimized (O)Regression (14)
Improvement (15)
No Changes (418)
Unoptimized (Onone)Regression (21)
Improvement (23)
No Changes (403)
Hardware Overview
|
53d1aa4
to
b8f8941
Compare
Recent commits attempt to optimize high-level set operations; let's see how much these will help balance the performance impact of resilient hash tables. |
@swift-ci please benchmark |
(I haven't updated lldb's data formatters to read the new storage class layout yet, so the lldb tests will definitely fail.) |
Build comment file:Optimized (O)Regression (18)
Improvement (20)
No Changes (409)
Unoptimized (Onone)Regression (14)
Improvement (49)
No Changes (384)
Hardware Overview
|
That's funny! These results are the opposite of my own |
Separately benchmarking SetIntersect in detail reveals an important clue -- the benchmark as it currently stands measures the intersection between two non-overlapping sets. I can't locally reproduce the +75% slowdown -- this use-case benchmarks roughly the same before/after this PR. Fixing the benchmark to include a non-empty intersection (with ~25% of elements in both sets) shows clear improvements over the non-overlapping case. (See |
Note how the new algorithm is faster at almost every stage. It gets relatively faster as the % of common elements increase. The intersection of two equal sets is calculated particularly efficiently: The new algorithm detects that there is no need to construct a third set, making it even faster than the 0% case for many sizes. (This is because looking up an existing element is faster than one that's not in the hash table.) |
b8f8941
to
84a405a
Compare
@swift-ci test |
Build failed |
Build failed |
afaa699
to
2504d9a
Compare
@swift-ci please test linux platform |
@swift-ci please smoke benchmark staging |
Build comment file:Performance: -O
Performance: -Osize
How to read the dataThe tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the regressions before you merge the PR. Noise: Sometimes the performance results (not code size!) contain false alarms. Unexpected regressions which are marked with '(?)' are probably noise. If you see regressions which you cannot explain you can try to run the benchmarks again. If regressions still show up, please consult with the performance team (@eeckstein). Hardware Overview
|
- Use bitmaps to eliminate the need for multiple passes on the elements of the input sequence. - When given two sets, prefer calculating the intersection on the native one. When both sets are native, iterate over the one with fewest elements.
- Use bitmaps to eliminate the need for multiple passes on the elements of the input sequence. - When given two sets, prefer calculating the intersection on the native one. When both sets are native, iterate over the one with fewest elements.
- The union operation is odd because it seems best to base it on its mutating variant. - When given the option, start with a native set. This may prevent one round of rehashing. - When both sides are sets, they establish a minimum bound on the result's element count.
Exposing the number of buckets allows _NativeSet.reallocate() to be made inlinable, which seems to be important for performance.
I don’t expect we’ll ever need to change this, but the typealias is a readability win.
Set.union and .symmetricDifference still exhibit quadratic behaviour with similar-sized but almost disjunct input sets. This was due to hash values correlation between hash tables of the same size. Enabling true per-instance seeding fixes this.
Evidently it has no effect on benchmarks.
Leaving this overridable caused a ~2x slowdown on Set<Int>.contains benchmarks. D’oh.
This gets us a nice baseline to start with.
- Reduce the number of hash table buckets by a factor of 8. - Increase the size of buckets from 1 to 8 entries. - Use SIMD within a register to find matching entries, instead of iterating over individual bytes.
- Simplify generated code; eliminate obvious inefficiencies. - Revive guaranteedNative trick; removing the bridged branch does have a measurable impact for contains at least. (Apart from code size benefits) Lookups in sets are still not quite as fast as on master for small sets (up to ~1k elements) when the element type has fast equality checks — I believe this is because the SIMD approach has worse latency than direct lookup, and the lookup chains aren’t long enough to take advantage of higher throughput. (The latency difference only matters when the set is fully cached, which is why the slowdown only occurs for small sets.)
9b0e735
to
d4c4171
Compare
@swift-ci please smoke benchmark staging |
1 similar comment
@swift-ci please smoke benchmark staging |
Build comment file:Build failed before running benchmark. |
@swift-ci please smoke benchmark staging |
_HashTable
struct, parts of which can potentially be made resilient.Set
and (soon)Dictionary
.Dictionary
where only one of the element types may need non-verbatim bridging.)This hasn't been fully debugged yet, and also needs to be extensively benchmarked/fine-tuned.
An lldb data formatter update will need to land before merging this PR.