Add prune support #120
Conversation
Force-pushed from d3225d2 to cbab312.
if offset > toss {
    offset = toss;
}
let batch = toss_embeds.slice(s![n..offset, ..]);
clippy complains here, but this seems to be a clippy bug.
You can add `#[allow(clippy::deref_addrof)]` above the line to silence clippy. I think we have used that in other parts as well, though I never looked deeper into why that happens.
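For illustration, a minimal self-contained sketch of where the attribute would go; the function and variable names are placeholders, not code from this PR:

```rust
use ndarray::{s, Array2, ArrayView2};

fn slice_batch(toss_embeds: &Array2<f32>, n: usize, offset: usize) -> ArrayView2<'_, f32> {
    // The attribute applies to this statement only, so the lint stays
    // active for the rest of the crate.
    #[allow(clippy::deref_addrof)]
    let batch = toss_embeds.slice(s![n..offset, ..]);
    batch
}
```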
Btw, this is another case where clippy complained on my laptop and not on CI
And clippy also suggests that we put this into a Box to reduce the size of the enum:

MmapQuantizedArray(Box<MmapQuantizedArray>),
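As a side note on why the Box helps: an enum is at least as large as its largest variant, so boxing a big payload shrinks the whole enum to roughly pointer size for that variant. A standalone illustration with stand-in types (not the crate's actual storage types):

```rust
use std::mem::size_of;

// Stand-in for a large storage variant, e.g. a memory map plus extra fields.
struct BigStorage {
    data: [u64; 32],
}

enum Unboxed {
    Small(u8),
    // Every Unboxed value reserves space for the whole BigStorage.
    Big(BigStorage),
}

enum Boxed {
    Small(u8),
    // Only a pointer is stored inline; the payload lives on the heap.
    Big(Box<BigStorage>),
}

fn main() {
    println!("unboxed enum: {} bytes", size_of::<Unboxed>()); // at least 256
    println!("boxed enum:   {} bytes", size_of::<Boxed>()); // roughly pointer-sized plus tag
}
```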
Force-pushed from 841c302 to 9a6a536.
CI seems to be fine with it... I tested it on my laptop with …
When I encountered this a few months (?) ago, the maximum size was a clippy default. It could of course be that the memory-mapping implementation on Windows is more involved. This seems to be the case: on UNIX, it's a pointer + len; on Windows it also has a …

For me, UNIX is the gold standard. So, maybe you could silence clippy when the target is a Windows platform? (I prefer not to silence it completely, in case we cross the boundary on UNIX.)
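If the lint in question is clippy::large_enum_variant (an assumption on my part), the Windows-only silencing could look something like this, with placeholder types:

```rust
// Placeholder storage types; on Windows the mmap-backed one would be larger.
struct PlainStorage([usize; 2]);
struct MmapStorage([usize; 16]);

// Keep the lint active on UNIX targets; silence it only when compiling
// for Windows.
#[cfg_attr(windows, allow(clippy::large_enum_variant))]
pub enum StorageWrapSketch {
    Plain(PlainStorage),
    Mmap(MmapStorage),
}

fn main() {
    let _ = StorageWrapSketch::Plain(PlainStorage([0; 2]));
}
```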
Ah! Thanks!
Makes sense! I won't silence it :)
Thank you for working on this!
I have done one read over the PR and I think I understand most of it. Before I do another read, would it be possible to:
- Add documentation to traits, trait methods, and other methods?
- Add unit tests for the functionality? In finalfusion, a lot of functionality is covered by unit tests, and it does help us capture bugs.
Additional question:
Pruning currently does not survive a write/read roundtrip, right? (Since the vocab chunk currently does not store storage offsets for tokens.)
@@ -127,6 +127,20 @@ impl WriteChunk for NdNorms {
    }
}

pub trait PruneNorms {
Please add rustdoc for the trait and for the prune_norms method.
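For instance, something along these lines (the signature below is a guess for illustration; the actual method in the PR may differ, and Array1 stands in for the crate's NdNorms):

```rust
use ndarray::Array1;

/// Pruning support for a norms chunk.
pub trait PruneNorms {
    /// Construct the norms for a pruned embedding matrix.
    ///
    /// `keep_indices` lists the rows of the original norms that survive
    /// pruning, in their new order.
    fn prune_norms(&self, keep_indices: &[usize]) -> Array1<f32>;
}
```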
@@ -33,3 +33,13 @@ pub(crate) trait StorageViewMut: Storage {
    /// Get a view of the embedding matrix.
    fn view_mut(&mut self) -> ArrayViewMut2<f32>;
}

pub trait StoragePrune: Storage {
Add rustdoc.
src/chunks/vocab/subword.rs (Outdated)
fn part_indices(&self, n_keep: usize) -> (Vec<usize>, Vec<usize>) {
    let mut keep_indices = vec![0; n_keep];
    let mut toss_indices = vec![0; self.words_len() - n_keep];
    for (n, each_word) in self.words()[0..n_keep].iter().enumerate() {
In these loops: is there initially a difference between `n` and `self.indices.get(each_word)`? I guess so, because the vocabulary may be the result of earlier pruning?
I guess these loops could be simplified with something along the lines of:

let keep_indices: Vec<usize> = self.words().iter().take(n_keep).map(|w| *self.indices.get(w).unwrap()).collect();
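Spelled out for both halves, the suggestion might look roughly like this; VocabSketch and its fields are stand-ins for the vocab type in the diff above:

```rust
use std::collections::HashMap;

// Minimal stand-in for the vocab type in the diff: `words` in storage order
// and `indices` mapping each word to its current storage index.
struct VocabSketch {
    words: Vec<String>,
    indices: HashMap<String, usize>,
}

impl VocabSketch {
    fn words(&self) -> &[String] {
        &self.words
    }

    // Iterator-based version of `part_indices`: the first `n_keep` words
    // yield the keep indices, the remaining words the toss indices.
    fn part_indices(&self, n_keep: usize) -> (Vec<usize>, Vec<usize>) {
        let keep_indices = self
            .words()
            .iter()
            .take(n_keep)
            .map(|w| *self.indices.get(w).unwrap())
            .collect();
        let toss_indices = self
            .words()
            .iter()
            .skip(n_keep)
            .map(|w| *self.indices.get(w).unwrap())
            .collect();
        (keep_indices, toss_indices)
    }
}
```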
src/embeddings.rs (Outdated)
@@ -566,6 +566,44 @@ impl<'a> Iterator for IterWithNorms<'a> {
    }
}

pub trait Prune<V, S> {
    fn simple_prune(&self, n_keep: usize, batch_size: usize) -> Embeddings<VocabWrap, StorageWrap>;
This could just be called Prune.
Force-pushed from 99f2568 to 6ab0371.
Yes! Actually, in addition to that, there is another issue keeping it from surviving the roundtrip. So I think maybe for the pruned embeddings, we need to store the remapping information: how many vectors were pruned off, and the remapped indices of the words whose vectors were tossed.
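To make that concrete, a hypothetical sketch (not part of the PR) of the remapping information such a chunk might carry; the names are illustrative:

```rust
/// Remapping information needed for pruned embeddings to survive a
/// write/read roundtrip; field names are illustrative only.
struct PruneRemapping {
    /// Number of embedding vectors that were tossed from the storage.
    n_tossed: usize,
    /// For each tossed word (in vocabulary order), the index of the kept
    /// vector that now serves as its embedding.
    remapped_indices: Vec<usize>,
}
```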
Indeed; however, such a chunk would require some larger changes across the crate. We didn't really realize that, but as discussed in #126, the problem is that one storage index can map to multiple words.

Adding an explicit mapping from words to indices also requires the addition of four new chunks in our current setup (vocab, bucket vocab, fasttext vocab, explicit ngram vocab). All these things are possible, but need to be worked out carefully. However, the first change would require an API change.

We can proceed in getting this PR in shape, but before merging we should also look at the improvements it brings. So, I am really looking forward to your presentation!
This was another feature that was left hanging last fall; I wrote an implementation for a PrunedVocab chunk in my Python port. I think this approach would allow persisting pruned embeddings while being minimally invasive. The idea is to have a wrapper around the actual vocabulary:

struct PrunedVocab<V> where V: Vocab {
    mapping: Vec<usize>,
    vocab: V, // or VocabWrap fwiw
}

Persistence requires a new chunk identifier. I haven't written any Rust code for this and I don't know if there are additional obstacles, but from my perspective it wouldn't require changes to any existing on-disk formats or existing APIs, apart from the wrappers themselves. Any opinions on that, or has the idea of pruning been thrown out anyways?
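A sketch of how lookups could be routed through such a wrapper; the VocabLookup trait below is a simplified stand-in for the crate's vocab interface, not its actual API:

```rust
/// Simplified lookup interface used only for this sketch.
trait VocabLookup {
    fn idx(&self, word: &str) -> Option<usize>;
}

struct PrunedVocab<V> {
    /// mapping[original_index] = row in the pruned storage.
    mapping: Vec<usize>,
    vocab: V,
}

impl<V: VocabLookup> VocabLookup for PrunedVocab<V> {
    fn idx(&self, word: &str) -> Option<usize> {
        // Look the word up in the wrapped vocab, then redirect its storage
        // index through the pruning mapping.
        self.vocab.idx(word).map(|i| self.mapping[i])
    }
}
```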
Just for my understanding: that would be a mapping from every original storage row to its corresponding index in the pruned storage?
You wouldn't need the mapping then, I think. This has some problems though: for one, the current query implementations rely on storage row n corresponding to word n in the vocab.

Of course, two concerns get mingled here: the serialized representation and the API.

When it comes to the representation: I think a variant of your proposal is possible where we tack the mapping table onto the existing vocab chunks, but give these variants new chunk identifiers. Then we would not need any new data types, but could extend the existing vocab types. For the API, we could have …
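The "extend the existing vocab types" variant could then look roughly like this; the struct is a simplified stand-in, and the optional field is only a guess at how the mapping might be attached:

```rust
use std::collections::HashMap;

// Simplified stand-in for an existing vocab type.
struct SimpleVocabSketch {
    words: Vec<String>,
    indices: HashMap<String, usize>,
    /// Only present for pruned vocabularies, which would be written under
    /// a new chunk identifier: maps each word's original storage index to
    /// its row in the pruned storage.
    mapping: Option<Vec<usize>>,
}
```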
Yes, every original storage row is mapped to its corresponding index in the pruned storage.
That's a good point, but I see one issue: The Hash indexers don't allow updating the indices in memory, so at least for those an in-memory indirection would probably be necessary.
Relying on row_n being word_n in the vocab does complicate things. I guess this could be addressed by changing how the queries are handled; rather than iterating over the storage rows, the implementation could iterate over the words in the vocab and retrieve the embeddings explicitly through the word.
I guess my formulation would introduce a chunk inside a chunk: the PrunedVocab would be the actual chunk, and it would contain the proper Vocab chunk. That way we wouldn't rely on order; they'd simply be tied together.
That's also an option but I think the implementation of the existing Vocabs could become rather complex with the optional mapping being part of it. But I guess that's most easily seen by actually writing it out.
I was just looking at … Maybe the conclusion is that it's not worth the hassle?
But then you lose the benefit of fast matrix-vector multiplication implementations (either through ndarray or third-party BLAS); IIRC, individual dot products were quite a bit slower.
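For reference, a small ndarray sketch of the two query strategies being weighed here; the function names are mine:

```rust
use ndarray::{Array1, Array2};

// One optimized matrix-vector product over the whole storage: fast, but it
// assumes that storage row n corresponds to word n.
fn similarities_matvec(storage: &Array2<f32>, query: &Array1<f32>) -> Array1<f32> {
    storage.dot(query)
}

// Per-word dot products via an explicit word -> row mapping: robust to
// pruning, but gives up the single BLAS/ndarray matrix-vector call.
fn similarities_per_word(
    storage: &Array2<f32>,
    word_rows: &[usize], // storage row for each vocab word
    query: &Array1<f32>,
) -> Vec<f32> {
    word_rows
        .iter()
        .map(|&row| storage.row(row).dot(query))
        .collect()
}
```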
Possibly. I think we should only bite the bullet and add such complexity if there is a very clear gain from pruning. IIRC, quantization generally provides better compression with a smaller L2 loss (but I'd have to recheck Nicole's slides). Quantized embeddings are a fair bit slower, but that is in many cases acceptable when they are the input to some expensive neural net. I think some other projects used pruning because they didn't have quantization, and it's easier to implement than quantization. (I saw that ffp now also supports quantized matrices, nice!)

Also, if downloading and storing a large embedding matrix is not problematic, then you could just as well mmap it and be done with it.