Add prune support #120
Conversation
Force-pushed from d3225d2 to cbab312.
if offset > toss {
    offset = toss;
}
let batch = toss_embeds.slice(s![n..offset, ..]);
clippy complains here, but this seems to be a clippy bug.
You can add `#[allow(clippy::deref_addrof)]` above the line to silence clippy. I think we have used that in other parts as well, though I never looked deeper into why that happens.
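For illustration, a minimal self-contained sketch of where the attribute would go; the function and variable names are placeholders, not code from this PR:

```rust
use ndarray::{s, Array2, ArrayView2};

fn slice_batch(toss_embeds: &Array2<f32>, n: usize, offset: usize) -> ArrayView2<'_, f32> {
    // The attribute applies to this statement only, so the lint stays
    // active for the rest of the crate.
    #[allow(clippy::deref_addrof)]
    let batch = toss_embeds.slice(s![n..offset, ..]);
    batch
}
```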
Btw, this is another case where clippy complained on my laptop and not on CI
And clippy also suggests that we put this into a Box to reduce the size of the enum:

MmapQuantizedArray(Box<MmapQuantizedArray>),
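As a side note on why the Box helps: an enum is at least as large as its largest variant, so boxing a big payload shrinks the whole enum to roughly pointer size for that variant. A standalone illustration with stand-in types (not the crate's actual storage types):

```rust
use std::mem::size_of;

// Stand-in for a large storage variant, e.g. a memory map plus extra fields.
struct BigStorage {
    data: [u64; 32],
}

enum Unboxed {
    Small(u8),
    // Every Unboxed value reserves space for the whole BigStorage.
    Big(BigStorage),
}

enum Boxed {
    Small(u8),
    // Only a pointer is stored inline; the payload lives on the heap.
    Big(Box<BigStorage>),
}

fn main() {
    println!("unboxed enum: {} bytes", size_of::<Unboxed>()); // at least 256
    println!("boxed enum:   {} bytes", size_of::<Boxed>()); // roughly pointer-sized plus tag
}
```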
Force-pushed from 841c302 to 9a6a536.
CI seems to be fine with it... I tested it on my laptop with …
When I encountered this a few months (?) ago, the maximum size was a clippy default. It could of course be that the memory-mapping implementation on Windows is more involved. This seems to be the case: on UNIX, it's a pointer + len; on Windows it also has a …

For me, UNIX is the gold standard. So, maybe you could silence clippy when the target is a Windows platform? (I prefer not to silence it completely, in case we cross the boundary on UNIX.)
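If the lint in question is clippy::large_enum_variant (an assumption on my part), the Windows-only silencing could look something like this, with placeholder types:

```rust
// Placeholder storage types; on Windows the mmap-backed one would be larger.
struct PlainStorage([usize; 2]);
struct MmapStorage([usize; 16]);

// Keep the lint active on UNIX targets; silence it only when compiling
// for Windows.
#[cfg_attr(windows, allow(clippy::large_enum_variant))]
pub enum StorageWrapSketch {
    Plain(PlainStorage),
    Mmap(MmapStorage),
}

fn main() {
    let _ = StorageWrapSketch::Plain(PlainStorage([0; 2]));
}
```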
Ah! Thanks!
Makes sense! I won't silence it :)
Thank you for working on this!
I have done one read over the PR and I think I understand most of it. Before I do another read, would it be possible to:
- Add documentation to traits, trait methods, and other methods?
- Add unit tests for the functionality? In finalfusion, a lot of functionality is covered by unit tests, and it does help us capture bugs.
Additional question:
Pruning currently does not survive a write/read roundtrip, right? (Since the vocab chunk currently does not store storage offsets for tokens.)
@@ -127,6 +127,20 @@ impl WriteChunk for NdNorms {
    }
}

pub trait PruneNorms {
Please add rustdoc for the trait and for the prune_norms method.
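For instance, something along these lines (the signature below is a guess for illustration; the actual method in the PR may differ, and Array1 stands in for the crate's NdNorms):

```rust
use ndarray::Array1;

/// Pruning support for a norms chunk.
pub trait PruneNorms {
    /// Construct the norms for a pruned embedding matrix.
    ///
    /// `keep_indices` lists the rows of the original norms that survive
    /// pruning, in their new order.
    fn prune_norms(&self, keep_indices: &[usize]) -> Array1<f32>;
}
```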
@@ -33,3 +33,13 @@ pub(crate) trait StorageViewMut: Storage {
    /// Get a view of the embedding matrix.
    fn view_mut(&mut self) -> ArrayViewMut2<f32>;
}

pub trait StoragePrune: Storage {
Add rustdoc.
src/chunks/vocab/subword.rs (Outdated)
fn part_indices(&self, n_keep: usize) -> (Vec<usize>, Vec<usize>) {
    let mut keep_indices = vec![0; n_keep];
    let mut toss_indices = vec![0; self.words_len() - n_keep];
    for (n, each_word) in self.words()[0..n_keep].iter().enumerate() {
In these loops: is there initially a difference between `n` and `self.indices.get(each_word)`? I guess so, because the vocabulary may be the result of earlier pruning?
I guess these loops could be simplified with something along the lines of:

let keep_indices: Vec<usize> = self.words().iter().take(n_keep).map(|w| *self.indices.get(w).unwrap()).collect();
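Spelled out for both halves, the suggestion might look roughly like this; VocabSketch and its fields are stand-ins for the vocab type in the diff above:

```rust
use std::collections::HashMap;

// Minimal stand-in for the vocab type in the diff: `words` in storage order
// and `indices` mapping each word to its current storage index.
struct VocabSketch {
    words: Vec<String>,
    indices: HashMap<String, usize>,
}

impl VocabSketch {
    fn words(&self) -> &[String] {
        &self.words
    }

    // Iterator-based version of `part_indices`: the first `n_keep` words
    // yield the keep indices, the remaining words the toss indices.
    fn part_indices(&self, n_keep: usize) -> (Vec<usize>, Vec<usize>) {
        let keep_indices = self
            .words()
            .iter()
            .take(n_keep)
            .map(|w| *self.indices.get(w).unwrap())
            .collect();
        let toss_indices = self
            .words()
            .iter()
            .skip(n_keep)
            .map(|w| *self.indices.get(w).unwrap())
            .collect();
        (keep_indices, toss_indices)
    }
}
```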
src/embeddings.rs (Outdated)
@@ -566,6 +566,44 @@ impl<'a> Iterator for IterWithNorms<'a> {
    }
}

pub trait Prune<V, S> {
    fn simple_prune(&self, n_keep: usize, batch_size: usize) -> Embeddings<VocabWrap, StorageWrap>;
This could just be called Prune.
Force-pushed from 99f2568 to 6ab0371.
Yes! Actually, in addition to that, there is another issue keeping it from surviving the roundtrip. So I think maybe for the pruned embeddings, we need to store the remapping information: how many vectors were pruned off, and the remapped indices of the words whose vectors were tossed.
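To make that concrete, a hypothetical sketch (not part of the PR) of the remapping information such a chunk might carry; the names are illustrative:

```rust
/// Remapping information needed for pruned embeddings to survive a
/// write/read roundtrip; field names are illustrative only.
struct PruneRemapping {
    /// Number of embedding vectors that were tossed from the storage.
    n_tossed: usize,
    /// For each tossed word (in vocabulary order), the index of the kept
    /// vector that now serves as its embedding.
    remapped_indices: Vec<usize>,
}
```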
Indeed; however, such a chunk would require some larger changes across the crate. We didn't really realize that, but as discussed in #126, the problem is that one storage index can map to multiple words.

Adding an explicit mapping from words to indices also requires the addition of four new chunks in our current setup (vocab, bucket vocab, fasttext vocab, explicit ngram vocab). All these things are possible, but need to be worked out carefully. However, the first change would require an API change.

We can proceed in getting this PR in shape, but before merging we should also look at the improvements it brings. So, I am really looking forward to your presentation!
This was another feature that was left hanging last fall; I wrote an implementation for a PrunedVocab chunk in my Python port. I think this approach would allow persisting pruned embeddings while being minimally invasive. The idea is to have a wrapper around the actual vocabulary:

struct PrunedVocab<V> where V: Vocab {
    mapping: Vec<usize>,
    vocab: V, // or VocabWrap fwiw
}

Persistence requires a new chunk identifier. I haven't written any Rust code for this and I don't know if there are additional obstacles, but from my perspective it wouldn't require changes to any existing on-disk formats or existing APIs, apart from the wrappers themselves. Any opinions on that, or has the idea of pruning been thrown out anyways?
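A sketch of how lookups could be routed through such a wrapper; the VocabLookup trait below is a simplified stand-in for the crate's vocab interface, not its actual API:

```rust
/// Simplified lookup interface used only for this sketch.
trait VocabLookup {
    fn idx(&self, word: &str) -> Option<usize>;
}

struct PrunedVocab<V> {
    /// mapping[original_index] = row in the pruned storage.
    mapping: Vec<usize>,
    vocab: V,
}

impl<V: VocabLookup> VocabLookup for PrunedVocab<V> {
    fn idx(&self, word: &str) -> Option<usize> {
        // Look the word up in the wrapped vocab, then redirect its storage
        // index through the pruning mapping.
        self.vocab.idx(word).map(|i| self.mapping[i])
    }
}
```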
Just for my understanding: that would be a mapping from every original storage row to its corresponding index in the pruned storage?
You wouldn't need the mapping then, I think. This has some problems though: for one, the current query implementations rely on storage row n corresponding to word n in the vocab.

Of course, two concerns get mingled here: the serialized representation and the API.

When it comes to the representation: I think a variant of your proposal is possible where we tack the mapping table onto the existing vocab chunks, but give these variants new chunk identifiers. Then we would not need any new data types, but could extend the existing vocab types. For the API, we could have …
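The "extend the existing vocab types" variant could then look roughly like this; the struct is a simplified stand-in, and the optional field is only a guess at how the mapping might be attached:

```rust
use std::collections::HashMap;

// Simplified stand-in for an existing vocab type.
struct SimpleVocabSketch {
    words: Vec<String>,
    indices: HashMap<String, usize>,
    /// Only present for pruned vocabularies, which would be written under
    /// a new chunk identifier: maps each word's original storage index to
    /// its row in the pruned storage.
    mapping: Option<Vec<usize>>,
}
```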
Yes, every original storage row is mapped to its corresponding index in the pruned storage.
That's a good point, but I see one issue: The Hash indexers don't allow updating the indices in memory, so at least for those an in-memory indirection would probably be necessary.
Relying on row_n being word_n in the vocab does complicate things. I guess this could be addressed by changing how the queries are handled; rather than iterating over the storage rows, the implementation could iterate over the words in the vocab and retrieve the embeddings explicitly through the word.
I guess my formulation would introduce a chunk inside a chunk: the PrunedVocab would be the actual chunk, and it would contain the proper Vocab chunk. That way we wouldn't rely on order; they'd simply be tied together.
That's also an option but I think the implementation of the existing Vocabs could become rather complex with the optional mapping being part of it. But I guess that's most easily seen by actually writing it out.
I was just looking at … Maybe the conclusion is that it's not worth the hassle?
But then you lose the benefit of fast matrix-vector multiplication implementations (either through ndarray or third-party BLAS); IIRC, individual dot products were quite a bit slower.
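For reference, a small ndarray sketch of the two query strategies being weighed here; the function names are mine:

```rust
use ndarray::{Array1, Array2};

// One optimized matrix-vector product over the whole storage: fast, but it
// assumes that storage row n corresponds to word n.
fn similarities_matvec(storage: &Array2<f32>, query: &Array1<f32>) -> Array1<f32> {
    storage.dot(query)
}

// Per-word dot products via an explicit word -> row mapping: robust to
// pruning, but gives up the single BLAS/ndarray matrix-vector call.
fn similarities_per_word(
    storage: &Array2<f32>,
    word_rows: &[usize], // storage row for each vocab word
    query: &Array1<f32>,
) -> Vec<f32> {
    word_rows
        .iter()
        .map(|&row| storage.row(row).dot(query))
        .collect()
}
```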
Possibly. I think we should only bite the bullet and add such complexity if there is a very clear gain from pruning. IIRC, quantization generally provides better compression with a smaller L2 loss (but I'd have to recheck Nicole's slides). Quantized embeddings are a fair bit slower, but that is in many cases acceptable when they are the input to some expensive neural net. I think some other projects used pruning because they didn't have quantization, and it's easier to implement than quantization. (I saw that ffp now also supports quantized matrices, nice!)

Also, if downloading and storing a large embedding matrix is not problematic, then you could just as well mmap it and be done with it.