-
Notifications
You must be signed in to change notification settings - Fork 440
Use custom hash table as node lookup table #11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7db8475
to
2155561
Compare
I haven’t looked at the implementation in detail yet but the idea sounds great. Thanks for addressing this. |
5610c34
to
f49811e
Compare
It's a great idea to use HashTable! My concern is why do we need to maintain our own implementation of it? can we define extensions on some existing stdlib or Foundation types to achieve the same goal? |
I guess the custom part is "if the bucket contains nil, feel free to reuse it". I wonder if existing data structures can allow us to hack it. |
Generally, erasing element in open addressing hash table needs extra work if any hash collided objects exist. But in this "weakly referencing bucket", we can't do that because we don't know when they become I don't think we can reasonably re-use existing data structures. |
Yeah, the stdlib's hash table implementation does not currently support weak keys -- they need modified low-level lookup/insert/removal logic that the stdlib doesn't implement yet. It is not feasible to build a weak-keyed set on top of Dealloc'd entries need to work like tombstones -- lookups need to ignore them, while insertions can reuse their space, and removals/rehashes can compress them away. Implementation note: it's usually a bad idea to mutate storage on lookup operations -- people expect lookups to be thread-safe, and guaranteeing that could be difficult. (It would also mean that lookups could invalidate indices, which would be highly peculiar.) In swiftlang/swift#19213, I'm working on a new |
Thank you for the explanation! @rintaro, @lorentey . If we have to include a non-trivial custom HashTable implementation in SwiftSyntax, i prefer we hold on landing this until we're sure the ever-increasing One potential solution is that server side periodically sends the syntax tree in full with a bit telling the clients to clear the |
I didn't mean to close it. Pushed the wrong button. Sorry! |
The hashtable implementation doesn't seem to be a huge amount of code and once the stdlib provides a WeakSet we'll just replace it with the stdlib's version and keep the same mode of operation. Long-term it seems better to me to go with the weak-ref hashtable since it is a mode of operation that will be future-proof. |
I forgot to mention that Foundation already provides NSHashTable, which could be exactly what you need! |
f49811e
to
041e793
Compare
Makes sense. I removed mutation part from
Thank you! I wasn't aware of We will discuss about this next week. |
041e793
to
0a92126
Compare
We discussed about this, and decided to merge this for now to address immediate problem. |
…kupTable This way we don't continue to retain RawSyntax nodes that are no longer needed for incremental transfer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I only now found the time the PR in detail.
Overall looking very good. I've got a few minor comments. In particular I'd like to document whenever we are using &+
etc. just because of performance reasons because I usually expect to see a expected overflow whenever I see the unsafe operators. Also, do these actually make a performance difference?
/// `nodeLookupTable`. Because `nodeLookupTable` only holds a weak reference | ||
/// to the RawSyntax nodes, all retired `RawSyntax` nodes will be deallocated | ||
/// once we set a new tree. The weak references in `nodeLookupTable` will then | ||
/// become `nil` but will also never be accessed again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment needs to be updated for WeakLookupTable
private static func _bucketCount(for capacity: Int, | ||
from current: Int = 2) -> Int { | ||
// Make sure it's representable. | ||
precondition(capacity <= (Int.max >> 1) + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we have e.g. capacity == Int.max - 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, the condition was wrong.
The condition here is _minimalBucketCount(for: capacity) <= (Int.max >> 1) + 1
.
As I added comments, the max bucket count here is 0b0100_0000_...
because 0b1000_0000_...
is out of bound.
let minimalBucketCount = _minimalBucketCount(for: capacity) | ||
var bucketCount = current | ||
while bucketCount < minimalBucketCount { | ||
bucketCount &*= 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't expect any overflow here, right? Is the &*=
just for performance reasons?
|
||
private var _bucketMask: Int { | ||
@inline(__always) get { | ||
return bucketCount &- 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again is this &-
just for performance reasons?
|
||
/// Finds the bucket where the object with the specified id should be stored | ||
/// to. | ||
private func _findHole(_ id: Element.Identifier) -> (pos: Int, found: Bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think alreadyExists
would be clearer in the return type than found
. found
implies that we found a hole which we will always do.
/// resizing happened. | ||
private func _ensurePlusOneCapacity() -> Bool { | ||
if bucketCount >= WeakLookupTable<Element> | ||
._minimalBucketCount(for: estimatedCount &+ 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same &+
.
|
||
// Slow path. | ||
estimatedCount = _countOccupiedBuckets() | ||
return reserveCapacity(estimatedCount &+ 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&+
pos = _findHole(obj.id).pos | ||
} | ||
buckets[pos].value = obj | ||
estimatedCount &+= 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&+
// know), we can't stop iteration at a hole. So in the worst case (i.e. if | ||
// the object doesn't exist in the table), full linear search is needed. | ||
// However, since we assume the object exists and hasn't been freed yet, | ||
// we expect it's stored near the 'idealBucket' anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could think about adding a tombstone property to WeakReference
so that we can detect when we found a hole.
I.e.
WeakReference starts of as being {occupied: false, value: nil}
. Once a value is set it becomes {occupied: true, value: someValue}
. When someValue
is freed it becomes {occupied: true, value: nil}
. That way we can stop the search whenever we reach a WeakReference
with occupied == false
and thus avoid the full linear search.
However, in our use case, as you mentioned, we always expect to find the element, I think its worth it to make the trade-off to save an extra bit/byte for a linear worst-case complexity that we should never encounter in practice.
TL;DR: No need to change anything unless we are planning to merge this into corelibs-foundation
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't intend to merge this into corelibs-foundation
. :)
NSHashTable
has very different API anyway.
if let obj = buckets[bucket].value, obj.id == id { | ||
return obj | ||
} | ||
bucket = (bucket &+ 1) & _bucketMask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&+
2b0c155
to
c89372d
Compare
Thanks for detailed feedback @ahoppen !
All
Yes, it does affect the performance. I'm not sure how much, but using '&' operation is low-hanging-fruits to minimize the overflow check penalty. |
/// resizing happened. | ||
private func _ensurePlusOneCapacity() -> Bool { | ||
// '&+' for performance. 'estimatedCount' is always less than 'bucketCount' | ||
// which is 0x4000_... or below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not true. estimatedCount
is always less than buckecCount / maxLoadFactor
(it ends up giving the same guarantees for a maxLoadFactor > 0.5
but still.
Consider the following example:
class Foo: Identifiable {
let id: Int = 1
}
let lookupTable = WeakLookupTable<Foo>()
lookupTable.reserveCapacity(32)
while true {
let newValue = Foo()
lookupTable.insert(newValue)
print(lookupTable.estimatedCount) /* need to make estimatedCount public for this test */
}
With each insert
estimatedCount
gets increased until _minimalBucketCount(for: estimatedCount + 1)
, i.e. (estimatedCount + 1) / maxLoadFactor
is greater than bucketCount
at which point the slow path is taken and the lookup table realises that all the previous elements have been released and recounts.
This might influence some of the other comments as well. I haven't looked at them in detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
estimatedCount
is always less thanbuckecCount / maxLoadFactor
(it ends up giving the same guarantees for amaxLoadFactor > 0.5
but still.
I believe this is "estimatedCount
is alway less than or equal to bucketCount * maxLoadFactor
". So, 'estimatedCount' is always less than 'bucketCount'
is true because maxLoadFactor
is less than 1.
Either way, what we want to justify here is that estimatedCount + 1
doesn't overflow. i.e. estimatedCount < Int.max
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes. I see. Sorry for the confusion. I somehow assumed that bucketCount
is 32
in the example above, but its 64
. You're right and it definitely doesn't overflow.
Use Set<T> like custom hash table (WeakLookupTable) that holds weak references for 'RawSyntax' nodes. This hash table doesn't holds ids separately, instead, use 'T.id' via Identifiable protocol. So freeing 'RawSyntax' turns the bucket in the table into just 'nil', so it's reusable for referencing another object.
c89372d
to
336d1d5
Compare
Squashed my commits. |
Build failed |
Could you also close rdar://43516167 if you haven‘t done so already? |
[SR-11115] Missing type annotation on multiple declarations
Attempt to address the problem raised in #3 .
Use
Set<T>
like custom hash table that holds weak references forRawSyntax
nodes.This hash table doesn't holds
id
s separately, instead, useT.id
viaIdentifiable
protocol.So freeing
RawSyntax
turns the bucket in the table into justnil
, so it's reusable for referencing another object.