
Add LEMUR module implementation ref issue 1024 #1068

Draft
Vishal35198 wants to merge 5 commits into neuml:master from Vishal35198:enh-pooling-add-lemur

Conversation


@Vishal35198 commented Mar 27, 2026

LEMUR module implemented; I need some guidance from your side, @davidmezzetti.

  1. We need to train the two-layer MLP in the LEMUR implementation. In the current implementation I have not stored and loaded the weights; do I need to do that?
  2. I haven't added the LEMUR implementation to the necessary files I found related to MUVERA, listed below.
  3. late.py builds late pooled vectors using outputs from a transformer model.
  4. lateencoder.py is the pipeline-specific part that computes similarity between a query and a list of texts using the late interaction model.
  5. testpooling.py tests the functionality of the late pooling models.
  6. I will add the tests once I am familiar with the implementation details.
  7. If there's anything extra, you can list it here.

 "tokenizer": kwargs.get("tokenizer"),
 "maxlength": kwargs.get("maxlength"),
-"modelargs": {**kwargs.get("vectors", {}), **{"muvera": None}},
+"modelargs": {**kwargs.get("vectors", {}), **{"muvera": None}, **{"lemur": None}},
Member

You can combine muvera and lemur into a single dictionary.
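A minimal sketch of that suggestion, assuming `kwargs` carries the same keys as in the diff above (the sample values here are hypothetical):

```python
# Hypothetical kwargs, mirroring the keys used in the snippet above
kwargs = {"tokenizer": None, "maxlength": 512, "vectors": {"device": "cpu"}}

config = {
    "tokenizer": kwargs.get("tokenizer"),
    "maxlength": kwargs.get("maxlength"),
    # One merged dictionary instead of unpacking {"muvera": None}
    # and {"lemur": None} separately
    "modelargs": {**kwargs.get("vectors", {}), "muvera": None, "lemur": None},
}
```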

@@ -0,0 +1,216 @@
"""
Member

Please match the project coding conventions. I realize much of this code is likely ported over from the LEMUR reference implementation.

Author

Yes, I am working on it. The problem is that LEMUR needs to be trained on data first, but the txtai pipeline encodes the query first. What should I do? Can you help me out?

Member

I haven't read the paper closely enough to know what kind of data is necessary for this. Is it just general training? Would this be a separate downstream model trained per multi-vector model? Or is it data-specific?

Author

Yes, it's a separate model included in the LEMUR implementation itself: a two-layer MLP referred to as the feature encoder. It needs to be trained on the data (the document tokens) to estimate the maxsim between queries and documents. And the catch is that to encode a query with that feature encoder, it first needs to be trained on the document tokens, via supervised training to estimate maxsim. Please guide me on this specific part.
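For reference, the quantity the feature encoder learns to approximate is MaxSim, and the encoder itself is a two-layer MLP. A dependency-light NumPy sketch of both (the dimensions, ReLU activation, and mean pooling here are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

def maxsim(query, doc):
    """Late-interaction MaxSim: for each query token, take the max dot
    product over document tokens, then sum over query tokens."""
    # query: (n, d), doc: (m, d)
    return float(np.sum(np.max(query @ doc.T, axis=1)))

rng = np.random.default_rng(0)
d, hidden = 8, 16

# Two-layer MLP "feature encoder" with random (untrained) weights.
# In LEMUR this would be trained on document tokens so that pooled
# query/document features approximate the MaxSim score.
w1, b1 = rng.normal(size=(d, hidden)), np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, d)), np.zeros(d)

def encode(tokens):
    h = np.maximum(tokens @ w1 + b1, 0.0)  # ReLU hidden layer
    return np.mean(h @ w2 + b2, axis=0)    # pool token features to one vector

query = rng.normal(size=(4, d))   # 4 query tokens
doc = rng.normal(size=(10, d))    # 10 document tokens

target = maxsim(query, doc)                   # exact MaxSim (training target)
approx = float(encode(query) @ encode(doc))   # single-vector approximation
```

Supervised training would then fit the MLP weights so `approx` tracks `target` over a corpus, which is why the documents have to be seen before queries can be encoded.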

Member

@davidmezzetti Mar 29, 2026

I don't really have a good answer, as the pooling module doesn't get a single pass over the data to do something like that. Perhaps there would need to be a separate training process outside of the normal indexing flow.

Author

I am confused here. Can you clarify what you mean by separate training? And since it needs documents to be trained on, how can that be done?

Member

I'm suggesting a separate pipeline process that would need to be run first to train a model. Then the pooling module would load the pre-trained model.

Author

@Vishal35198 Mar 29, 2026

How about adding the training-specific part to lateencoder.py in the LateEncoder class?
Also, currently there is no option to choose between LEMUR and MUVERA; I'll need to distinguish between the two in the same class. What are your thoughts on this?

Member

Perhaps it makes sense to just put it in its own pipeline class (LateTrainer, LemurTrainer, etc). You certainly could use the LateEncoder pipeline to do the maxsim calculations.

I wonder if a sample of the dataset is good enough. In other words, if an input dataset is 1 million rows, would training be fine with 10% of that, or 1%?

I would use Safetensors as the vector storage format for this model. Also save the model parameters into a config.json file. And I'd have the loading logic use the standard huggingface_hub interfaces so it supports loading a model both from local storage and from the HF Hub.

With this setup I think a lot of the same scaffolding that's already in the late pooling class (https://github.com/neuml/txtai/blob/master/src/python/txtai/models/pooling/late.py) can be reused.

I'd see applying the pretrained lemur model right here: https://github.com/neuml/txtai/blob/master/src/python/txtai/models/pooling/late.py#L67
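As a rough sketch of that save/load layout, with plain `.npy` files standing in for safetensors and a local path standing in for huggingface_hub resolution, so the example stays dependency-light (all function and key names here are assumptions, not the actual txtai API):

```python
import json
import os
import tempfile

import numpy as np

def save_lemur(path, weights, params):
    """Persist model weights plus a config.json of parameters to a directory.
    The suggestion in the thread is safetensors for the weights; .npy files
    stand in here to keep the sketch dependency-light."""
    os.makedirs(path, exist_ok=True)
    for name, tensor in weights.items():
        np.save(os.path.join(path, f"{name}.npy"), tensor)
    with open(os.path.join(path, "config.json"), "w", encoding="utf-8") as f:
        json.dump(params, f)

def load_lemur(path):
    """Load parameters and weights back. A real implementation would resolve
    `path` via huggingface_hub so it could also be a HF Hub model id."""
    with open(os.path.join(path, "config.json"), encoding="utf-8") as f:
        params = json.load(f)
    weights = {
        name[:-4]: np.load(os.path.join(path, name))
        for name in os.listdir(path) if name.endswith(".npy")
    }
    return weights, params

with tempfile.TemporaryDirectory() as path:
    save_lemur(path, {"w1": np.ones((2, 3))}, {"hidden": 3})
    weights, params = load_lemur(path)
```

The pooling class would then call something like `load_lemur` once at construction time and apply the pre-trained encoder inside the vector build step linked above.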

Member

@davidmezzetti Mar 29, 2026

Also I wonder if you could train a general LEMUR model that would generalize to other datasets. It seems like this algorithm is just learning to approximate maxsim, which doesn't seem dataset-specific.

If that were the case, we'd train a general LEMUR model and that would be the default LEMUR implementation. People could choose to customize that but wouldn't always have to.

@davidmezzetti
Member

Thank you for this PR!

As you're developing, I recommend running the txtai benchmark scripts as was done here: #1023 (comment)

That way we can have an idea of the usefulness of this method.

@davidmezzetti
Member

Any luck in implementing this?

@davidmezzetti
Member

@Vishal35198 Do you plan to continue this work or should I close this PR?
