
Add prune support #120

Closed

Conversation

RealNicolasBourbaki

No description provided.

if offset > toss {
    offset = toss;
}
let batch = toss_embeds.slice(s![n..offset, ..]);
Author

@RealNicolasBourbaki Nov 11, 2019

clippy complains here, but this seems to be a bug

Member

You can add #[allow(clippy::deref_addrof)] above the line to silence clippy. I think we have used that in other places as well, though I never looked deeper into why that happens.
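Something like this (a sketch; toss_embeds, n, and offset are the variables from the snippet above):

// Sketch: allow the lint on just this statement; `s!` is ndarray's slice macro,
// and `toss_embeds`, `n`, `offset` come from the surrounding PR code.
#[allow(clippy::deref_addrof)]
let batch = toss_embeds.slice(s![n..offset, ..]);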


Btw, this is another case where clippy complained on my laptop and not on CI

Author

@RealNicolasBourbaki left a comment

And clippy also suggests that we put this into a Box to reduce the size of the enum:

MmapQuantizedArray(Box<MmapQuantizedArray>),

@sebpuetz
Member

And clippy also suggests that we put this into a Box to reduce the size of the enum:

MmapQuantizedArray(Box<MmapQuantizedArray>),

That's odd, did the max size of structs change? I know that @danieldk boxed the other QuantizedArray a while ago because clippy complained. Any idea what's going on with that?

@RealNicolasBourbaki
Author

That's odd, did the max size of structs change? I know that @danieldk boxed the other QuantizedArray a while ago because clippy complained. Any idea what's going on with that?

CI seems to be fine with it... I tested it on my laptop with cargo clippy and it complained. Maybe something to do with my Windows system?

@danieldk
Member

That's odd, did the max size of structs change? I know that @danieldk boxed the other QuantizedArray a while ago because clippy complained. Any idea what's going on with that?

CI seems to be fine with it... I tested it on my laptop with cargo clippy and it complained. Maybe something to do with my Windows system?

When I encountered this a few months (?) ago, the maximum size was a clippy default. It could of course be that the memory mapping implementation on Windows is more involved. This seems to be the case. On UNIX, it's a pointer + len. On Windows it also has a File object and an extra bool:

https://github.com/danburkert/memmap-rs/blob/3b047cc2b04558d8a1de3933be5f573c74bc8e0f/src/windows.rs#L19

For me, UNIX is the gold standard. So maybe you could silence clippy when the target is a Windows platform? (I prefer not to silence it completely, in case we cross the boundary on UNIX.)
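A possible shape, as a sketch with stand-in types (not the crate's actual enum):

// Sketch with stand-in types: the large-variant lint is only allowed on
// Windows, where the memory map representation is bigger.
#[cfg_attr(windows, allow(clippy::large_enum_variant))]
enum StorageExample {
    InMemory(Vec<f32>),
    // Stand-in for the larger Windows mmap representation.
    Mmap([usize; 8]),
}

That way the lint stays active on UNIX, so we would still hear about it if the variant grows past the threshold there.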

@RealNicolasBourbaki
Author

RealNicolasBourbaki commented Nov 11, 2019

This seems to be the case. On UNIX, it's a pointer + len. On Windows it also has a File object and an extra bool:

Ah! Thanks!

For me, UNIX is the gold standard. So maybe you could silence clippy when the target is a Windows platform? (I prefer not to silence it completely, in case we cross the boundary on UNIX.)

Makes sense! I won't silence it :)

Member

@danieldk left a comment

Thank you for working on this!

I have done one read over the PR and I think I understand most of it. Before I do another read, would it be possible to:

  • Add documentation to traits, trait methods, and other methods?
  • Add unit tests for the functionality. In finalfusion a lot of functionality is covered by unit tests, and it does help us catch bugs.

Additional question:

Pruning currently does not survive a write/read roundtrip, right? (Since the vocab chunk currently does not store storage offsets for tokens.)

@@ -127,6 +127,20 @@ impl WriteChunk for NdNorms {
    }
}

pub trait PruneNorms {
Member

Please add rustdoc for the trait and for the prune_norms method.

@@ -33,3 +33,13 @@ pub(crate) trait StorageViewMut: Storage {
    /// Get a view of the embedding matrix.
    fn view_mut(&mut self) -> ArrayViewMut2<f32>;
}

pub trait StoragePrune: Storage {
Member

Add rustdoc.

fn part_indices(&self, n_keep: usize) -> (Vec<usize>, Vec<usize>) {
    let mut keep_indices = vec![0; n_keep];
    let mut toss_indices = vec![0; self.words_len() - n_keep];
    for (n, each_word) in self.words()[0..n_keep].iter().enumerate() {
Member

In these loops: is there initially a difference between n and self.indices.get(each_word)? I guess so, because the vocabulary may be the result of earlier pruning?

I guess these loops could be simplified with something along the lines of:

let keep_indices: Vec<usize> = self.words().iter().take(n_keep).map(|w| *self.indices.get(w).unwrap()).collect();
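and similarly for the toss side (a sketch, assuming self.indices maps each word to its storage index):

// Sketch of the matching toss side; `self.indices` is assumed to be a map
// from word to storage index, as in the keep case above.
let toss_indices: Vec<usize> = self
    .words()
    .iter()
    .skip(n_keep)
    .map(|w| *self.indices.get(w).unwrap())
    .collect();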

@@ -566,6 +566,44 @@ impl<'a> Iterator for IterWithNorms<'a> {
    }
}

pub trait Prune<V, S> {
    fn simple_prune(&self, n_keep: usize, batch_size: usize) -> Embeddings<VocabWrap, StorageWrap>;
Member

This could just be called Prune.

@RealNicolasBourbaki
Author

RealNicolasBourbaki commented Nov 17, 2019

Additional question:

Pruning currently does not survive a write/read roundtrip, right? (Since the vocab chunk currently does not store storage offsets for tokens.)

Yes! Actually, in addition to that, there is another issue keeping it from surviving the roundtrip: the read_chunk function always reads the words first and then creates the indices, which means that after reading a pruned embedding file, the indices will be "recovered" back to whatever they were before pruning.

So I think maybe for the pruned embeddings we need to store the remapping information: how many vectors were pruned off, and the remapped indices of the words whose vectors were tossed.

@danieldk
Member

Yes! Actually, in addition to that, there is another issue keeping it from surviving the roundtrip: the read_chunk function always reads the words first and then creates the indices, which means that after reading a pruned embedding file, the indices will be "recovered" back to whatever they were before pruning.

So I think maybe for the pruned embeddings we need to store the remapping information: how many vectors were pruned off, and the remapped indices of the words whose vectors were tossed.

Indeed. However, such a chunk would require some larger changes across the crate. We didn't really realize that, but as discussed in #126, the problem is that one storage index can map to multiple words. However, Vocab::words currently returns &[String], so every index is mapped to a single word. Also, similarity/analogy queries need to be adjusted to deal with this (though I think that this could be done without any interface changes).

Adding an explicit mapping from words to indices also requires the addition of four new chunks in our current setup (vocab, bucket vocab, fasttext vocab, explicit ngram vocab).

All these things are possible, but need to be worked out carefully. However, the first change would require an API change.

We can proceed in getting this PR in shape, but before merging we should also look at the improvements it brings. So, I am really looking forward to your presentation!

@sebpuetz
Member

This was another feature that was left hanging last fall. I wrote an implementation for a PrunedVocab chunk in my Python port:

https://github.com/sebpuetz/ffp/blob/3ad3d1c51b1f9e028f1d70f7b85fb8635df27e60/ffp/vocab.py#L596

I think this approach would allow persisting pruned embeddings while being minimally invasive. The idea is to have a wrapper around the actual vocabulary:

struct PrunedVocab<V> where V: Vocab {
    mapping: Vec<usize>,
    vocab: V, // or VocabWrap fwiw
}

with mapping.len() == storage.rows(). mapping[i] would translate the original index to the new one.

Persistence requires a new Chunk; this PrunedVocab chunk would be somewhat different from those we have so far. It would precede the original Vocab and call read_chunk() for the vocab chunk following it.

I haven't written any Rust code for this and I don't know if there are additional obstacles, but from my perspective it wouldn't require changes to any existing on-disk formats or existing APIs, apart from the wrappers, which could be made #[non_exhaustive]?

Any opinions on that or has the idea of pruning been thrown out anyways?

@danieldk
Member

danieldk commented Apr 30, 2020

I think this approach would allow persisting pruned embeddings while being minimally invasive. The idea is to have a wrapper around the actual vocabulary:

struct PrunedVocab<V> where V: Vocab {
    mapping: Vec<usize>,
    vocab: V, // or VocabWrap fwiw
}

with mapping.len() == storage.rows(). mapping[i] would translate the original index to the new one.

Just for my understanding, that would be storage.rows() from before pruning, right?

Persistence requires a new Chunk; this PrunedVocab chunk would be somewhat different from those we have so far. It would precede the original Vocab and call read_chunk() for the vocab chunk following it.

You wouldn't need the mapping member here, I think? This could be a struct PrunedVocab(impl Vocab). You would read the vocabulary and then update the indices in the vocabulary. If you place the remapping table after the actual vocabulary, you wouldn't even have to store the mapping in memory; you could update the mapping of the vocabulary as you are reading the table. Then, after reading, PrunedVocab could just forward queries to the actual vocab.
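Roughly like this, as a sketch with a simplified, illustrative Vocab trait (not the crate's actual trait surface):

// Sketch with a simplified, illustrative trait: after read_chunk has already
// rewritten the indices while reading the mapping table, the wrapper only
// forwards lookups to the inner vocab.
trait Vocab {
    fn idx(&self, word: &str) -> Option<usize>;
}

struct PrunedVocab<V>(V);

impl<V: Vocab> Vocab for PrunedVocab<V> {
    fn idx(&self, word: &str) -> Option<usize> {
        self.0.idx(word)
    }
}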

I think this has some problems though:

  • We still have the mapping from index to a word. This mapping is now incomplete, since a storage index can map to multiple words. (E.g. needed for similarity and analogy queries.)
  • If combining the vocab and the mapping is handled by PrunedVocab, it introduces a chunk order (mapping before or after the vocab). We currently assume an ordering in finalfusion-rust, but the file format itself does not prevent an arbitrary ordering of chunks. Once we get such explicit tying between chunks, we either bind ourselves to an ordering or the chunk readers need to become idempotent in some way.

Of course, two concerns get mingled here:

  1. What should the API look like when one storage index can map to multiple words?
  2. How should the one-to-many mapping be represented in terms of chunks?

When it comes to the representation: I think a variant of your proposal is possible where we tack the mapping table onto the existing vocab chunks, but give these variants new chunk identifiers. Then we would not need any new data types, but could extend the existing vocab types. The ReadChunk implementation would then read the optional table based on the identifier. The WriteChunk implementation would write the old identifier if the mapping is None and the new identifier otherwise.
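Schematically (a sketch; the pruned identifier is hypothetical, not an existing chunk):

// Sketch: the vocab keeps an optional mapping table and picks its chunk
// identifier based on whether that table is present. PrunedSimpleVocab is a
// hypothetical identifier, not an existing finalfusion chunk.
enum ChunkIdentifier {
    SimpleVocab,
    PrunedSimpleVocab,
}

fn chunk_identifier(mapping: &Option<Vec<usize>>) -> ChunkIdentifier {
    if mapping.is_some() {
        ChunkIdentifier::PrunedSimpleVocab
    } else {
        ChunkIdentifier::SimpleVocab
    }
}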

For the API, we could have Vocab::words return &[BTreeSet<String>] or &[Vec<String>]. But that would entail another API change and would require quite a bit of extra memory use due to the overhead of BTreeSet or Vec.

@sebpuetz
Member

Just for my understanding, that would be storage.rows() from before pruning, right?

Yes, every original storage row is mapped to its corresponding index in the pruned storage.

You wouldn't need the mapping member here, I think? This could be a struct PrunedVocab(impl Vocab). You would read the vocabulary and then update the indices in the vocabulary. If you place the remapping table after the actual vocabulary, you wouldn't even have to store the mapping in memory; you could update the mapping of the vocabulary as you are reading the table. Then, after reading, PrunedVocab could just forward queries to the actual vocab.

That's a good point, but I see one issue: The Hash indexers don't allow updating the indices in memory, so at least for those an in-memory indirection would probably be necessary.

I think this has some problems though:

* We still have the mapping from index to a word. This mapping is now incomplete, since a storage index can map to multiple words. (E.g. needed for similarity and analogy queries.)

Relying on row_n being word_n in the vocab does complicate things. I guess this could be addressed by changing how the queries are handled; rather than iterating over the storage rows, the implementation could iterate over the words in the vocab and retrieve the embeddings explicitly through the word.
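Something along these lines (a sketch; the accessor names are assumptions rather than the exact finalfusion API):

// Sketch: iterate the vocab's words and fetch each embedding through the
// word, instead of walking storage rows directly. The accessor names here
// (vocab(), words(), embedding()) are assumptions for illustration.
let mut sims: Vec<(String, f32)> = Vec::new();
for word in embeddings.vocab().words() {
    if let Some(embed) = embeddings.embedding(word) {
        sims.push((word.clone(), query.dot(&embed)));
    }
}
sims.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());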

* If combining the vocab and the mapping is handled by `PrunedVocab`, it introduces a chunk order (mapping before or after the vocab). We currently assume an ordering in `finalfusion-rust`, but the file format itself does not prevent an arbitrary ordering of chunks. Once we get such explicit tying between chunks, we either bind ourselves to an ordering or the chunk readers need to become idempotent in some way.

I guess my formulation would introduce a chunk inside a chunk: the PrunedVocab would be the actual Chunk and it would contain the proper Vocab chunk. That way we wouldn't rely on order; they'd simply be tied together.

Of course, two concerns get mingled here:

1. What should the API look like when one storage index can map to multiple words?

2. How should the one-to-many mapping be represented in terms of chunks?

When it comes to the representation: I think a variant of your proposal is possible where we tack the mapping table onto the existing vocab chunks, but give these variants new chunk identifiers. Then we would not need any new data types, but could extend the existing vocab types. The ReadChunk implementation would then read the optional table based on the identifier. The WriteChunk implementation would write the old identifier if the mapping is None and the new identifier otherwise.

That's also an option, but I think the implementation of the existing Vocabs could become rather complex with the optional mapping being part of it. But I guess that's most easily seen by actually writing it out.

For the API, we could have Vocab::words return &[BTreeSet<String>] or &[Vec<String>]. But that would entail another API change and would require quite a bit of extra memory use due to the overhead of BTreeSet or Vec.

I was just looking at ExplicitIndexer::ngrams: here we allow a many-to-one mapping for ngrams -> index, and we're returning the ngrams as a &[String]. We missed that part when I implemented the explicit ngrams. I'm not sure about the correct course of action regarding a two-way mapping for the vocabularies, including the ngram indexer.

Maybe the conclusion is that it's not worth the hassle?

@danieldk
Member

danieldk commented Apr 30, 2020

I guess this could be addressed by changing how the queries are handled; rather than iterating over the storage rows, the implementation could iterate over the words in the vocab and retrieve the embeddings explicitly through the word.

But then you lose the benefit of fast matrix-vector multiplication implementations (either through ndarray or third-party BLAS); IIRC individual dot products were quite a bit slower.

Maybe the conclusion is that it's not worth the hassle?

Possibly. I think we should only bite the bullet and add such complexity if there is a very clear gain from pruning. IIRC quantization generally provides better compression with a smaller L2 loss (but I'd have to recheck Nicole's slides). Quantized embeddings are a fair bit slower, but that is acceptable in many cases, e.g. when they are the input to some expensive neural net. I think some other projects used pruning because they didn't have quantization and it's easier to implement than quantization. (I saw that ffp now also supports quantized matrices, nice!)

Also, if downloading and storing a large embedding matrix is not problematic, then you could just as well mmap it and be done with it.

@danieldk danieldk closed this Nov 8, 2021