Add LEMUR module implementation ref issue 1024 #1068
Vishal35198 wants to merge 5 commits into neuml:master from Vishal35198:enh-pooling-add-lemur
Conversation
```diff
  "tokenizer": kwargs.get("tokenizer"),
  "maxlength": kwargs.get("maxlength"),
- "modelargs": {**kwargs.get("vectors", {}), **{"muvera": None}},
+ "modelargs": {**kwargs.get("vectors", {}), **{"muvera": None}, **{"lemur": None}},
```
You can combine muvera and lemur into a single dictionary.
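For illustration, a minimal sketch of the suggested cleanup (the example `kwargs` value is assumed; only the names from the diff above are real):

```python
# Minimal sketch of the suggested cleanup; the example kwargs value is assumed.
kwargs = {"vectors": {"device": 0}}

# Before: two single-key dictionaries unpacked separately
before = {**kwargs.get("vectors", {}), **{"muvera": None}, **{"lemur": None}}

# After: both keys written directly in one dictionary literal
after = {**kwargs.get("vectors", {}), "muvera": None, "lemur": None}

print(before == after)  # → True
```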
```diff
@@ -0,0 +1,216 @@
+"""
```
Please match the project coding conventions. I realize much of this code is likely ported over from the LEMUR reference implementation.
Yes, I am working on it. The problem is that LEMUR needs to be trained on data first, but the txtai pipeline encodes the query first. What should I do? Can you help me out?
I haven't read the paper closely enough to know what kind of data is necessary for this. Is it just general training? Would this be a separate downstream model that is trained per multi-vector model? Or is it dataset-specific?
Yes, it's a separate model included in the LEMUR implementation itself: a two-layer MLP referred to as the feature encoder. It has to be trained on the data (the document tokens) to estimate the maxsim score between queries and documents. The catch is that encoding a query with the feature encoder requires the encoder to have been trained on the document tokens first, via supervised training to estimate maxsim. Please guide me on this specific part.
I don't really have a good answer, as the pooling module doesn't get a single pass over the data to do something like that. Perhaps there would need to be a separate training process for that outside of the normal indexing flow.
I am confused here. Can you clarify what you mean by separate training? And since it needs the docs to be trained, how can that be done?
I'm suggesting a separate pipeline process that would need to be run first to train a model. Then the pooling module would load the pre-trained model.
How about adding the training-specific part to lateencoder.py in the LateEncoder class?
Also, there is currently no way to choose between lemur and muvera. I'll need to select between the two in the same class. What are your thoughts on this?
Perhaps it makes sense to just put it in its own pipeline class (LateTrainer, LemurTrainer, etc.). You certainly could use the LateEncoder pipeline to do the maxsim calculations.
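To make the discussion concrete, here is a rough, dependency-light sketch of the two pieces involved: the exact MaxSim score a trainer would regress against, and a two-layer MLP feature encoder of the kind described above. The class and parameter names here are hypothetical, not the LEMUR reference implementation:

```python
import numpy as np

def maxsim(query, doc):
    """Late-interaction MaxSim: for every query token embedding, take the
    maximum dot-product similarity over all document token embeddings,
    then sum over query tokens. query: (m, d), doc: (n, d)."""
    return float(np.dot(query, doc.T).max(axis=1).sum())

class FeatureEncoder:
    """Hypothetical two-layer MLP (linear -> ReLU -> linear) standing in
    for the feature encoder described above; weights are random here."""

    def __init__(self, dim, hidden, out, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(dim, hidden))
        self.w2 = rng.normal(scale=0.1, size=(hidden, out))

    def __call__(self, x):
        # Forward pass only; a trainer would fit w1/w2 against maxsim targets
        return np.maximum(x @ self.w1, 0.0) @ self.w2

# Orthonormal toy embeddings: each query token matches one doc token exactly
query = np.eye(4)[:2]
doc = np.eye(4)[:3]
print(maxsim(query, doc))  # → 2.0
```

A hypothetical LemurTrainer pipeline would then fit the encoder so that scores derived from its document features approximate maxsim targets computed this way.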
I wonder if a sample of the dataset is good enough. In other words, if an input dataset is 1 million rows, would training be fine with 10% or even 1% of it?
I would use Safetensors as the vector storage format for this model. Also save the model parameters into a config.json file. And I'd have the loading logic use the standard huggingface_hub interfaces so it supports loading a model both from local storage and from the HF Hub.
With this setup I think a lot of the same scaffolding that's already in the late pooling class (https://github.com/neuml/txtai/blob/master/src/python/txtai/models/pooling/late.py) can be reused.
I'd see applying the pretrained lemur model right here: https://github.com/neuml/txtai/blob/master/src/python/txtai/models/pooling/late.py#L67
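As a dependency-free illustration of the storage layout suggested above: config.json holds the model parameters next to the weights file, and the loader resolves a local directory (for a Hub model id, the standard huggingface_hub download functions would fetch the same filenames; the helper names below are hypothetical):

```python
import json
import os
import tempfile

def save_config(path, config):
    """Write model parameters to config.json. The weights themselves would
    sit next to it as model.safetensors (e.g. written with safetensors'
    save_file, omitted here to keep the sketch dependency-free)."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "config.json"), "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)

def load_config(path):
    """Load config.json from a local model directory. For a Hub model id,
    huggingface_hub's download functions would resolve the file first."""
    with open(os.path.join(path, "config.json"), encoding="utf-8") as f:
        return json.load(f)

# Round-trip through a temporary local "model directory"
folder = tempfile.mkdtemp()
save_config(folder, {"dim": 128, "hidden": 512})
print(load_config(folder))  # → {'dim': 128, 'hidden': 512}
```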
Also I wonder if you could train a general LEMUR model that would generalize to other datasets. It seems like this algorithm is just learning to approximate maxsim, which doesn't seem dataset-specific.
If that were the case, we'd train a general LEMUR model and that would be the default LEMUR implementation. People could choose to customize that but wouldn't always have to.
Thank you for this PR! As you're developing, I recommend running the txtai benchmark scripts as was done here: #1023 (comment). That way we can have an idea of the usefulness of this method.

Any luck in implementing this?

@Vishal35198 Do you plan to continue this work or should I close this PR?
LEMUR module implemented. I need some guidance from your side, @davidmezzetti.