Skip to content

Use custom hash table as node lookup table #11

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 22, 2018

Conversation

rintaro
Copy link
Member

@rintaro rintaro commented Sep 14, 2018

Attempt to address the problem raised in #3 .

Use Set<T> like custom hash table that holds weak references for RawSyntax nodes.
This hash table doesn't holds ids separately, instead, use T.id via Identifiable protocol.
So freeing RawSyntax turns the bucket in the table into just nil, so it's reusable for referencing another object.

@rintaro
Copy link
Member Author

rintaro commented Sep 14, 2018

@ahoppen @nkcsgexi
What do you think?

@rintaro rintaro force-pushed the lookuptable-custom branch 2 times, most recently from 7db8475 to 2155561 Compare September 14, 2018 13:44
@ahoppen
Copy link
Member

ahoppen commented Sep 14, 2018

I haven’t looked at the implementation in detail yet but the idea sounds great. Thanks for addressing this.

@rintaro rintaro force-pushed the lookuptable-custom branch 2 times, most recently from 5610c34 to f49811e Compare September 14, 2018 16:43
@nkcsgexi
Copy link
Contributor

It's a great idea to use HashTable! My concern is why do we need to maintain our own implementation of it? can we define extensions on some existing stdlib or Foundation types to achieve the same goal?

@nkcsgexi
Copy link
Contributor

I guess the custom part is "if the bucket contains nil, feel free to reuse it". I wonder if existing data structures can allow us to hack it.

@rintaro
Copy link
Member Author

rintaro commented Sep 14, 2018

Generally, erasing element in open addressing hash table needs extra work if any hash collided objects exist. But in this "weakly referencing bucket", we can't do that because we don't know when they become nil. So the lookup logic is very different from normal Set<Element> (see subscript implementation).

I don't think we can reasonably re-use existing data structures.
CC: @lorentey I'd appreciate it if you could give us your opinion about this. Please let me know if you need more context.

@lorentey
Copy link
Member

lorentey commented Sep 14, 2018

Yeah, the stdlib's hash table implementation does not currently support weak keys -- they need modified low-level lookup/insert/removal logic that the stdlib doesn't implement yet. It is not feasible to build a weak-keyed set on top of Set. (However, we should add a WeakSet and multiple WeakDictionary variants to the stdlib at some point!)

Dealloc'd entries need to work like tombstones -- lookups need to ignore them, while insertions can reuse their space, and removals/rehashes can compress them away. Implementation note: it's usually a bad idea to mutate storage on lookup operations -- people expect lookups to be thread-safe, and guaranteeing that could be difficult. (It would also mean that lookups could invalidate indices, which would be highly peculiar.)

In swiftlang/swift#19213, I'm working on a new _HashTable construct to unify low-level hash table operations across Set and Dictionary. (I expect to land that next week, after some additional work.) That struct would simplify creating a WeakSet/WeakDictionary within the standard library, but since it's internal to the stdlib, unfortunately it won't be directly helpful for SwiftSyntax. :(

@nkcsgexi
Copy link
Contributor

Thank you for the explanation! @rintaro, @lorentey . If we have to include a non-trivial custom HashTable implementation in SwiftSyntax, i prefer we hold on landing this until we're sure the ever-increasing nodeLookupTable is an actual problem. With that being said, even if it is an actual problem, i prefer a simpler solution.

One potential solution is that server side periodically sends the syntax tree in full with a bit telling the clients to clear the nodeLookupTable all together. Thus we don't need to maintain the table forever in the SwiftSyntax side. What do you think? @ahoppen

@nkcsgexi nkcsgexi closed this Sep 14, 2018
@nkcsgexi nkcsgexi reopened this Sep 14, 2018
@nkcsgexi
Copy link
Contributor

I didn't mean to close it. Pushed the wrong button. Sorry!

@akyrtzi
Copy link
Contributor

akyrtzi commented Sep 14, 2018

The hashtable implementation doesn't seem to be a huge amount of code and once the stdlib provides a WeakSet we'll just replace it with the stdlib's version and keep the same mode of operation.
If we introduce a different mode of operation, like periodically starting from scratch, and adding the heuristics to figure out when to do that, then this mode of operation should be replaced with the WeakSet model once the stdlib provides it.

Long-term it seems better to me to go with the weak-ref hashtable since it is a mode of operation that will be future-proof.

@lorentey
Copy link
Member

I forgot to mention that Foundation already provides NSHashTable, which could be exactly what you need!

@rintaro
Copy link
Member Author

rintaro commented Sep 15, 2018

Implementation note: it's usually a bad idea to mutate storage on lookup operations -- people expect lookups to be thread-safe, and guaranteeing that could be difficult.

Makes sense. I removed mutation part from subscript.

NSHashTable, which could be exactly what you need!

Thank you! I wasn't aware of NSHashTable. It seems corelibs-foundation doesn't have implementation of it though. So, perhaps what we should do is implementing NSHashTable in corelibs-foundation...

We will discuss about this next week.

@rintaro
Copy link
Member Author

rintaro commented Sep 20, 2018

We discussed about this, and decided to merge this for now to address immediate problem.
I will write unit tests for WeakLookupTable before merging.

…kupTable

This way we don't continue to retain RawSyntax nodes that are no longer
needed for incremental transfer.
Copy link
Member

@ahoppen ahoppen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I only now found the time the PR in detail.

Overall looking very good. I've got a few minor comments. In particular I'd like to document whenever we are using &+ etc. just because of performance reasons because I usually expect to see a expected overflow whenever I see the unsafe operators. Also, do these actually make a performance difference?

/// `nodeLookupTable`. Because `nodeLookupTable` only holds a weak reference
/// to the RawSyntax nodes, all retired `RawSyntax` nodes will be deallocated
/// once we set a new tree. The weak references in `nodeLookupTable` will then
/// become `nil` but will also never be accessed again.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment needs to be updated for WeakLookupTable

private static func _bucketCount(for capacity: Int,
from current: Int = 2) -> Int {
// Make sure it's representable.
precondition(capacity <= (Int.max >> 1) + 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we have e.g. capacity == Int.max - 1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the condition was wrong.
The condition here is _minimalBucketCount(for: capacity) <= (Int.max >> 1) + 1.
As I added comments, the max bucket count here is 0b0100_0000_... because 0b1000_0000_... is out of bound.

let minimalBucketCount = _minimalBucketCount(for: capacity)
var bucketCount = current
while bucketCount < minimalBucketCount {
bucketCount &*= 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't expect any overflow here, right? Is the &*= just for performance reasons?


private var _bucketMask: Int {
@inline(__always) get {
return bucketCount &- 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again is this &- just for performance reasons?


/// Finds the bucket where the object with the specified id should be stored
/// to.
private func _findHole(_ id: Element.Identifier) -> (pos: Int, found: Bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think alreadyExists would be clearer in the return type than found. found implies that we found a hole which we will always do.

/// resizing happened.
private func _ensurePlusOneCapacity() -> Bool {
if bucketCount >= WeakLookupTable<Element>
._minimalBucketCount(for: estimatedCount &+ 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same &+.


// Slow path.
estimatedCount = _countOccupiedBuckets()
return reserveCapacity(estimatedCount &+ 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&+

pos = _findHole(obj.id).pos
}
buckets[pos].value = obj
estimatedCount &+= 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&+

// know), we can't stop iteration at a hole. So in the worst case (i.e. if
// the object doesn't exist in the table), full linear search is needed.
// However, since we assume the object exists and hasn't been freed yet,
// we expect it's stored near the 'idealBucket' anyway.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could think about adding a tombstone property to WeakReference so that we can detect when we found a hole.
I.e.
WeakReference starts of as being {occupied: false, value: nil}. Once a value is set it becomes {occupied: true, value: someValue}. When someValue is freed it becomes {occupied: true, value: nil}. That way we can stop the search whenever we reach a WeakReference with occupied == false and thus avoid the full linear search.

However, in our use case, as you mentioned, we always expect to find the element, I think its worth it to make the trade-off to save an extra bit/byte for a linear worst-case complexity that we should never encounter in practice.

TL;DR: No need to change anything unless we are planning to merge this into corelibs-foundation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't intend to merge this into corelibs-foundation. :)
NSHashTable has very different API anyway.

if let obj = buckets[bucket].value, obj.id == id {
return obj
}
bucket = (bucket &+ 1) & _bucketMask
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&+

@rintaro rintaro force-pushed the lookuptable-custom branch 2 times, most recently from 2b0c155 to c89372d Compare September 21, 2018 05:25
@rintaro
Copy link
Member Author

rintaro commented Sep 21, 2018

Thanks for detailed feedback @ahoppen !

In particular I'd like to document whenever we are using &+ etc. just because of performance reasons because I usually expect to see a expected overflow whenever I see the unsafe operators.

All &+ family operations in this PR are only for performance. I added comments to justify the use of them.

Also, do these actually make a performance difference?

Yes, it does affect the performance. I'm not sure how much, but using '&' operation is low-hanging-fruits to minimize the overflow check penalty.

/// resizing happened.
private func _ensurePlusOneCapacity() -> Bool {
// '&+' for performance. 'estimatedCount' is always less than 'bucketCount'
// which is 0x4000_... or below.
Copy link
Member

@ahoppen ahoppen Sep 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not true. estimatedCount is always less than buckecCount / maxLoadFactor (it ends up giving the same guarantees for a maxLoadFactor > 0.5 but still.

Consider the following example:

class Foo: Identifiable {
  let id: Int = 1
}

let lookupTable = WeakLookupTable<Foo>()
lookupTable.reserveCapacity(32)

while true {
  let newValue = Foo()
  lookupTable.insert(newValue)
  print(lookupTable.estimatedCount) /* need to make estimatedCount public for this test */
}

With each insert estimatedCount gets increased until _minimalBucketCount(for: estimatedCount + 1), i.e. (estimatedCount + 1) / maxLoadFactor is greater than bucketCount at which point the slow path is taken and the lookup table realises that all the previous elements have been released and recounts.

This might influence some of the other comments as well. I haven't looked at them in detail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

estimatedCount is always less than buckecCount / maxLoadFactor (it ends up giving the same guarantees for a maxLoadFactor > 0.5 but still.

I believe this is "estimatedCount is alway less than or equal to bucketCount * maxLoadFactor". So, 'estimatedCount' is always less than 'bucketCount' is true because maxLoadFactor is less than 1.

Either way, what we want to justify here is that estimatedCount + 1 doesn't overflow. i.e. estimatedCount < Int.max, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. I see. Sorry for the confusion. I somehow assumed that bucketCount is 32 in the example above, but its 64. You're right and it definitely doesn't overflow.

Use Set<T> like custom hash table (WeakLookupTable) that holds weak
references for 'RawSyntax' nodes. This hash table doesn't holds ids
separately, instead, use 'T.id' via Identifiable protocol. So freeing
'RawSyntax' turns the bucket in the table into just 'nil', so it's
reusable for referencing another object.
@rintaro
Copy link
Member Author

rintaro commented Sep 22, 2018

Squashed my commits.
@swift-ci Please test

@swift-ci
Copy link
Contributor

Build failed
Swift Test OS X Platform
Git Sha - 9e00b74

@rintaro
Copy link
Member Author

rintaro commented Sep 22, 2018

@rintaro rintaro changed the title [WIP][Experiment] Use custom hash table as node lookup table Use custom hash table as node lookup table Sep 22, 2018
@rintaro rintaro merged commit a997013 into swiftlang:master Sep 22, 2018
@rintaro rintaro deleted the lookuptable-custom branch September 22, 2018 12:45
@ahoppen
Copy link
Member

ahoppen commented Sep 24, 2018

Could you also close rdar://43516167 if you haven‘t done so already?

adevress pushed a commit to adevress/swift-syntax that referenced this pull request Jan 14, 2024
[SR-11115] Missing type annotation on multiple declarations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants