Memory mappable vocab #126

Closed
danieldk opened this issue Nov 16, 2019 · 2 comments

@danieldk
Member
This time hopefully without soundness errors ;).


We should be very conservative about adding new chunks. But I think there is a place for a new vocabulary chunk that solves three problems of the current vocab and (bucket) subword vocab chunks:

  • it is memory mappable;
  • it avoids storing every string twice when reading the vocabulary into memory;
  • it deserializes more quickly, since it does not require incrementally building a hash map.

The proposed vocabulary format would be structured similarly to a fully loaded hash table.

Data structures

The data would be structured as follows (after the chunk identifier and length). Given a vocabulary of size `vocab_len` (a Rust sketch of this layout follows the list):

  1. string index: `vocab_len` triples (see notes 1 and 2 below for refinements) of:
  • the offset of the string from the start of the string table (`u64`);
  • the FNV-1a hash of the word (`u64`);
  • the index of the word in the storage (`u64`);
  2. storage->string link table: `vocab_len` indexes (`u64`), mapping a storage index to a triple from (1);
  3. string table: a concatenation of all the vocabulary words in UTF-8, each prefixed by its string length.
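For concreteness, a minimal Rust sketch of how these three tables might be viewed over a memory map. All names are illustrative, and the `u32` length prefix in the string table is an assumption, not part of the proposal:

```rust
/// Hypothetical zero-copy view over a memory-mapped vocabulary chunk,
/// mirroring the three tables listed above. In practice, casting the
/// mapped bytes to these slices would need alignment checks.
struct MmapVocab<'a> {
    /// Hash-ordered slots: (string offset, FNV-1a hash, storage index).
    string_index: &'a [(u64, u64, u64)],
    /// storage index -> slot in `string_index`.
    storage_to_string: &'a [u64],
    /// Concatenated UTF-8 words; assumed to carry a u32 length prefix.
    string_table: &'a [u8],
}
```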

Lookups

  • Look up the word at storage index `i`: get the `i`-th element of the storage->string link table. This is the index of the string's triple in the string index.
  • Look up the index of word `w`: hash `w` and map the hash `h` to the vocabulary space, `h_vocab`. Start linear probing at `h_vocab` until a matching `h` is found. Then verify that the strings match (otherwise continue probing). A sketch of both lookups follows.
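A sketch of both lookups against the hypothetical `MmapVocab` view above. The FNV-1a constants are the standard 64-bit ones; the `u64::MAX` empty-slot sentinel and the little-endian `u32` length prefix are assumptions for illustration:

```rust
use std::convert::TryInto;

/// Standard 64-bit FNV-1a.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash = 0xcbf29ce484222325u64;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

impl<'a> MmapVocab<'a> {
    /// Decode the length-prefixed word starting at `offset` (assumed
    /// encoding: little-endian u32 length followed by UTF-8 bytes).
    fn string_at(&self, offset: u64) -> &'a str {
        let start = offset as usize;
        let len = u32::from_le_bytes(
            self.string_table[start..start + 4].try_into().unwrap(),
        ) as usize;
        std::str::from_utf8(&self.string_table[start + 4..start + 4 + len]).unwrap()
    }

    /// storage index -> word: one indirection through the link table.
    fn word(&self, storage_idx: usize) -> &'a str {
        let slot = self.storage_to_string[storage_idx] as usize;
        let (offset, _hash, _storage_idx) = self.string_index[slot];
        self.string_at(offset)
    }

    /// word -> storage index: linear probing from the hash's home slot.
    /// Assumes blank slots exist (note 1), marked here with u64::MAX.
    fn index(&self, word: &str) -> Option<u64> {
        let h = fnv1a(word.as_bytes());
        let n_slots = self.string_index.len();
        let mut slot = (h as usize) % n_slots;
        loop {
            let (offset, hash, storage_idx) = self.string_index[slot];
            if offset == u64::MAX {
                // Empty slot: the word is not in the vocabulary.
                return None;
            }
            if hash == h && self.string_at(offset) == word {
                return Some(storage_idx);
            }
            slot = (slot + 1) % n_slots;
        }
    }
}
```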

Notes

  1. In practice we'd want to use a number of slots larger than the vocabulary length, e.g. one of the next powers of two (see the sizing sketch after these notes): first to spread the hash outcomes uniformly over the slots, and second to avoid degenerate cases in linear probing. Note, however, that without these blank slots the storage->string link table would not be necessary, since we could simply sort the storage in hash table order.

  2. By the birthday-paradox square approximation, the probability of any two words colliding on a 64-bit hash only reaches 0.5 at around √2^64 = 2^32 items, so in practice the first entry whose hash matches will almost always be the right string.

  3. If the table is constructed in word-frequency order, the amount of linear probing is a function of the word rank/frequency, since when the most frequent words are inserted, most slots will still be empty.
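Regarding note 1, a minimal sketch of one possible sizing policy, assuming we aim for a table at most half full (the function name and the factor are illustrative, not part of any spec):

```rust
/// Choose the number of hash slots for `vocab_len` words: the next
/// power of two that keeps the table at most half full, so probe
/// chains stay short (an illustrative policy, not part of the format).
fn n_slots(vocab_len: usize) -> usize {
    (2 * vocab_len).next_power_of_two()
}
```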

Downside

I guess that adding this chunk entails adding three new chunks, since we have combined the vocabs with subword vocabs. Also, it probably requires a redesign of the explicit n-gram chunk, since otherwise it would not be memory mappable.

Footnote

We discussed earlier that it may have been a design mistake not to store storage indices with words, relying instead on the order of the words. However, while thinking about this new chunk, I realized that doing so would actually be unsound in our current setup: e.g., if one does a similarity or analogy query, the indices of the top-n results would no longer map to individual words. We would need to change the APIs to return multiple words for a given index.

@sebpuetz
Member

sebpuetz commented Nov 16, 2019

First of all, that's quite a bit to digest and I'm not sure whether I followed everything. Right now, I have a hard time imagining exactly how this chunk would be implemented, and I think we should be careful not to make things too complicated.

> This time hopefully without soundness errors ;).
>
> We should be very conservative about adding new chunks. But I think there is a place for a new vocabulary chunk that solves three problems of the current vocab and (bucket) subword vocab chunks:
>
> * it is memory mappable;
> * it avoids storing every string twice when reading the vocabulary into memory;
> * it deserializes more quickly, since it does not require incrementally building a hash map.
>
> The proposed vocabulary format would be structured similarly to a fully loaded hash table.
>
> Data structures
>
> The data would be structured as follows (after the chunk identifier and length). Given a vocabulary of size `vocab_len`:
>
> 1. string index: `vocab_len` triples (see notes 1 and 2 below for refinements) of:
>    * the offset of the string from the start of the string table (`u64`);
>    * the FNV-1a hash of the word (`u64`);
>    * the index of the word in the storage (`u64`);

So, that'd be a vector of `(u64, u64, u64)` with not necessarily `n_words` entries? If a given entry belongs to a word, it provides the exact address in the string table (through the offset), its hash, and the index in the link table?

> 2. storage->string link table: `vocab_len` indexes (`u64`), mapping a storage index to a triple from (1);

Is storage the same as string index?

This would be implemented as a `Vec<u64>` where the `i`-th entry is the index into the `Vec<(u64, u64, u64)>` from above?

> 3. string table: a concatenation of all the vocabulary words in UTF-8, each prefixed by its string length.

Essentially, an on-disk `Vec<String>`? `String` is probably not the correct way to think about this, rather a string with its data inline instead of behind a pointer?

> Lookups
>
> * Look up the word at storage index `i`: get the `i`-th element of the storage->string link table. This is the index of the string's triple in the _string index_.

Index into the `Vec<u64>` from 2. with `i`, then use that index to get the correct `(u64, u64, u64)` from 1., and use the offset to retrieve the value from the `Vec<String>` in 3.?

> * Look up the index of word `w`: hash `w` and map the hash `h` to the vocabulary space, `h_vocab`. Start linear probing at `h_vocab` until a matching `h` is found. Then verify that the strings match (otherwise continue probing).

`fnv1a(word) % n_buckets = idx`; start with the `(u64, u64, u64)` at `idx` in 1. and use the offset to match against the strings stored in the `Vec<String>` from 3.; if they match, return the index?

> Downside
>
> I guess that adding this chunk entails adding three new chunks, since we have combined the vocabs with subword vocabs. Also, it probably requires a redesign of the explicit n-gram chunk, since otherwise it would not be memory mappable.

As mentioned above, I think we should also consider how complex an implementation would turn out and whether it's straightforward to implement in other languages.

> Footnote
>
> We discussed earlier that it may have been a design mistake not to store storage indices with words, relying instead on the order of the words. However, while thinking about this new chunk, I realized that doing so would actually be unsound in our current setup: e.g., if one does a similarity or analogy query, the indices of the top-n results would no longer map to individual words. We would need to change the APIs to return multiple words for a given index.

Pruning already comes with this downside, though. I think it'd be nicer to either forbid similarity/analogy queries for vocabularies with indirections or to change the similarity/analogy API.

@danieldk
Member Author

danieldk commented Dec 2, 2019

I have thought about this some more and am now in favor of an approach based on perfect hash automata instead. It simply brings more benefits:

  • Memory-mappable.
  • Much smaller than explicitly storing strings when loaded into memory.
  • Has no tricky tunables or degenerate cases to minimize collisions.
  • Reverse lookups (index -> word) are easily provided and do not require extra memory overhead.
  • Can be stored as two simple tables (states and transitions), which could easily be read in other languages.

I now have half an implementation. I am piggy-backing on the fst crate to build the (almost) minimal dictionary automaton. The code to compute the state right-language cardinalities is done, as is the code to go from the fst format to simpler, flat tables. A sketch of how lookups over such tables could work is below.
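A rough sketch of what both lookups could look like over such flat state/transition tables. The exact layout and all names here are assumptions for illustration, not the actual implementation:

```rust
/// One automaton state (hypothetical flat layout).
struct State {
    is_final: bool,
    /// Right-language cardinality: the number of words accepted
    /// starting from this state (counting the empty word if final).
    right_card: u64,
    /// Outgoing transitions live at
    /// `transitions[trans_start..trans_start + trans_len]`,
    /// assumed sorted by byte.
    trans_start: usize,
    trans_len: usize,
}

struct Transition {
    byte: u8,
    target: usize, // index into `states`
}

struct Automaton {
    states: Vec<State>, // states[0] is the start state
    transitions: Vec<Transition>,
}

impl Automaton {
    fn trans(&self, state: &State) -> &[Transition] {
        &self.transitions[state.trans_start..state.trans_start + state.trans_len]
    }

    /// Perfect hash: the index of `word` in lexicographic order,
    /// or None if the word is not in the vocabulary.
    fn word_to_index(&self, word: &str) -> Option<u64> {
        let mut state = &self.states[0];
        let mut index = 0u64;
        for &b in word.as_bytes() {
            // Words that end at this state precede `word`.
            if state.is_final {
                index += 1;
            }
            let trans = self.trans(state);
            let pos = trans.iter().position(|t| t.byte == b)?;
            // Skip all words reachable through smaller bytes.
            for t in &trans[..pos] {
                index += self.states[t.target].right_card;
            }
            state = &self.states[trans[pos].target];
        }
        if state.is_final { Some(index) } else { None }
    }

    /// Reverse lookup: reconstruct the word with the given index by
    /// descending the automaton, using the same cardinalities.
    fn index_to_word(&self, mut index: u64) -> Option<String> {
        let mut state = &self.states[0];
        if index >= state.right_card {
            return None;
        }
        let mut word = Vec::new();
        loop {
            if state.is_final {
                if index == 0 {
                    return String::from_utf8(word).ok();
                }
                index -= 1;
            }
            // Find the transition whose subautomaton contains `index`.
            let mut next = None;
            for t in self.trans(state) {
                let card = self.states[t.target].right_card;
                if index < card {
                    next = Some(t);
                    break;
                }
                index -= card;
            }
            let t = next?;
            word.push(t.byte);
            state = &self.states[t.target];
        }
    }
}
```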

The rest has to wait until after the workshop though ;).

@danieldk danieldk closed this as completed Dec 2, 2019