Memory mappable vocab #126
First of all, that's quite a bit to digest and I'm not sure whether I followed everything. Right now, I have a hard time imagining exactly how this chunk would be implemented, and I think we should be careful not to make things too complicated.
As mentioned above, I think we should also consider how complex an implementation turns out to be and whether it's straightforward to implement in other languages.
Although pruning already comes with this downside, I think it'd be nicer to either forbid similarity/analogy queries for vocabularies with indirections or to change the similarity/analogy API.
I have thought about this some more and am now more in favor of an approach based on perfect hash automata, which just brings more benefits:
I now have half an implementation; I am piggy-backing on the […]. The rest has to wait until after the workshop though ;).
This time hopefully without soundness errors ;).
We should be very conservative with adding new chunks. But I think there is a place for a new vocabulary chunk that solves three problems of the current vocab and (bucket) subword vocab chunks:
The proposed vocabulary format would be structured similarly to a fully loaded hash table.
Data structures
The data would be structured as follows (after the chunk identifier and length). Given a vocabulary of size `vocab_len`:

1. `vocab_len` pairs (see notes 1 and 2 below for refinements) of (`u64`, `u64`): the word's hash and its index in the string index;
2. `vocab_len` indexes (`u64`), mapping a storage index to a pair from (1);

Lookups
Lookup by index `i`: get the `i`-th element of the storage->string link table. This is the index of the string in the string index.

Lookup by word `w`: hash `w` and map the hash `h` to the vocabulary space, giving `h_vocab`. Start linear probing at `h_vocab` until a matching `h` is found. Then verify that the strings match (otherwise continue probing).

Notes
1. In practice we'd want to use a number other than the vocabulary length, e.g. one of the next powers of two: first to make the outcomes of hashing uniformly distributed, second to avoid degenerate cases in linear probing. However, without these blank slots, the storage->string link table would not be necessary (since we could sort the storage by hash table order).
2. The birthday paradox square estimation puts the hash collision probability of 0.5 at 2^32 items, so in practice the first actual string match would be a hit.
3. If the table is constructed in word frequency order, the amount of linear probing is a function of the word rank/frequency, since when the most frequent words are inserted, most pairs will still be empty.
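To make the layout and the two lookups concrete, here is a minimal in-memory sketch. It assumes: pairs store (hash, string index); blank slots (note 1) are marked with `u64::MAX`; the capacity is twice the vocabulary size rounded up to a power of two; and `DefaultHasher` stands in for whatever hash function the format would fix. In the real chunk, `pairs` and `links` would be mmapped slices rather than `Vec`s; all names here are hypothetical, not part of the proposal.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Marker for a blank slot (note 1). A real word hash could in principle
// collide with this marker; a real format would reserve it explicitly.
const EMPTY: u64 = u64::MAX;

struct MmapVocab {
    pairs: Vec<(u64, u64)>, // slot -> (word hash, index into `strings`); mmapped in practice
    links: Vec<u64>,        // storage index -> string index (the storage->string link table)
    strings: Vec<String>,   // stand-in for the string index
}

// Placeholder hash; the actual format would pin down a specific function.
fn hash_word(w: &str) -> u64 {
    let mut h = DefaultHasher::new();
    w.hash(&mut h);
    h.finish()
}

impl MmapVocab {
    /// Build with capacity = next power of two >= 2 * vocab_len (note 1),
    /// inserting words in the given (e.g. frequency) order (note 3).
    fn new(words: &[&str]) -> Self {
        let cap = (words.len() * 2).next_power_of_two();
        let mut pairs = vec![(EMPTY, 0u64); cap];
        let mut links = Vec::with_capacity(words.len());
        let mut strings = Vec::with_capacity(words.len());
        for (string_idx, w) in words.iter().enumerate() {
            let h = hash_word(w);
            // Map the hash into the vocabulary space; cap is a power of two.
            let mut slot = (h as usize) & (cap - 1);
            while pairs[slot].0 != EMPTY {
                slot = (slot + 1) & (cap - 1); // linear probing
            }
            pairs[slot] = (h, string_idx as u64);
            links.push(string_idx as u64); // storage order == insertion order here
            strings.push((*w).to_string());
        }
        MmapVocab { pairs, links, strings }
    }

    /// Lookup by storage index: follow the storage->string link table.
    fn word(&self, i: usize) -> &str {
        &self.strings[self.links[i] as usize]
    }

    /// Lookup by word: hash, probe from h_vocab, verify the string on a
    /// hash match (note 2: the first string match is almost always a hit).
    fn idx(&self, w: &str) -> Option<u64> {
        let h = hash_word(w);
        let cap = self.pairs.len();
        let mut slot = (h as usize) & (cap - 1);
        loop {
            let (ph, string_idx) = self.pairs[slot];
            if ph == EMPTY {
                return None; // hit a blank slot: word not present
            }
            if ph == h && self.strings[string_idx as usize] == w {
                return Some(string_idx);
            }
            slot = (slot + 1) & (cap - 1);
        }
    }
}
```

Because all three arrays are fixed-width, the whole chunk can be mapped and used in place; only the hash computation and probing happen at query time.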
Downside
I guess that adding this chunk entails adding three new chunks, since we have combined the vocabs with subword vocabs. Also, it probably requires a redesign of the explicit n-gram chunk, since otherwise it would not be memory mappable.
Footnote
We discussed earlier that it may have been a design mistake not to store storage indices with words, relying instead on the order of words. However, while thinking about this new chunk, I realized that doing this would actually be unsound in our current setup: e.g. if one does a similarity or analogy query, the indices of the top-n results would no longer map to individual words. We would need to change the APIs to return multiple words for a given index.
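To illustrate the footnote's point, here is a hypothetical sketch of what such a changed API could look like (the names `IndirectVocab` and `words_for_index` are illustrative, not from this thread): once several words can share one storage index, a reverse lookup by index must return a set of words rather than a single word.

```rust
/// Hypothetical vocab with indirections: each word carries an explicit
/// storage (embedding) index, and several words may share one index.
struct IndirectVocab {
    words: Vec<(String, u64)>, // (word, storage index)
}

impl IndirectVocab {
    /// All words whose embedding lives at storage index `idx`. A similarity
    /// or analogy result index maps to this set, not to a single word.
    fn words_for_index(&self, idx: u64) -> Vec<&str> {
        self.words
            .iter()
            .filter(|(_, i)| *i == idx)
            .map(|(w, _)| w.as_str())
            .collect()
    }
}
```

A linear scan is only for illustration; a real implementation would precompute the index-to-words mapping.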