
Add LEMUR module implementation ref issue 1024 #1068

Draft
Vishal35198 wants to merge 5 commits into neuml:master from Vishal35198:enh-pooling-add-lemur

Conversation


@Vishal35198 commented Mar 27, 2026

LEMUR module implemented; I need some guidance from your side, @davidmezzetti.

  1. We need to train the two-layer MLP in the LEMUR implementation. In the current implementation I have not stored and loaded the weights; do I need to do that?
  2. I haven't added the LEMUR implementation to the necessary files I found related to MUVERA, listed below.
  3. late.py builds late pooled vectors using outputs from a transformer model.
  4. lateencoder.py is the pipeline-specific part that computes similarity between a query and a list of texts using the late interaction model.
  5. testpooling.py tests the functionality of the late pooling models.
  6. I will add the tests once I am familiar with the implementation details.
  7. If there's anything extra, you can list it here.

 "tokenizer": kwargs.get("tokenizer"),
 "maxlength": kwargs.get("maxlength"),
-"modelargs": {**kwargs.get("vectors", {}), **{"muvera": None}},
+"modelargs": {**kwargs.get("vectors", {}), **{"muvera": None}, **{"lemur": None}},
Member

You can combine muvera and lemur into a single dictionary.
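A minimal sketch of that suggestion, assuming `kwargs` carries the same keys as in the diff above (the sample values here are hypothetical):

```python
# Hypothetical kwargs, mirroring the keys used in the snippet above
kwargs = {"tokenizer": None, "maxlength": 512, "vectors": {"device": "cpu"}}

config = {
    "tokenizer": kwargs.get("tokenizer"),
    "maxlength": kwargs.get("maxlength"),
    # One merged dictionary instead of unpacking {"muvera": None}
    # and {"lemur": None} separately
    "modelargs": {**kwargs.get("vectors", {}), "muvera": None, "lemur": None},
}
```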

@@ -0,0 +1,216 @@
"""
Member

Please match the project coding conventions. I realize much of this code is likely ported over from the LEMUR reference implementation.

Author

Yes, I am working on it. The problem is that LEMUR needs to be trained on data first, but the txtai pipeline encodes the query first. What should I do? Can you help me out?

Member

I haven't read the paper closely enough to know what kind of data is necessary for this. Is it just general training? Would this be a separate downstream model trained per multi-vector model? Or is it data-specific?

Author

Yes, it's a separate model included in the LEMUR implementation itself: a two-layer MLP referred to as the feature encoder. It needs to be trained on the data (the document tokens) to estimate the maxsim between queries and documents. And the catch is that to encode a query with that feature encoder, it first needs to be trained on the document tokens, via supervised training to estimate maxsim. Please guide me on this specific part.
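For reference, the quantity the feature encoder learns to approximate is MaxSim, and the encoder itself is a two-layer MLP. A dependency-light NumPy sketch of both (the dimensions, ReLU activation, and mean pooling here are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

def maxsim(query, doc):
    """Late-interaction MaxSim: for each query token, take the max dot
    product over document tokens, then sum over query tokens."""
    # query: (n, d), doc: (m, d)
    return float(np.sum(np.max(query @ doc.T, axis=1)))

rng = np.random.default_rng(0)
d, hidden = 8, 16

# Two-layer MLP "feature encoder" with random (untrained) weights.
# In LEMUR this would be trained on document tokens so that pooled
# query/document features approximate the MaxSim score.
w1, b1 = rng.normal(size=(d, hidden)), np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, d)), np.zeros(d)

def encode(tokens):
    h = np.maximum(tokens @ w1 + b1, 0.0)  # ReLU hidden layer
    return np.mean(h @ w2 + b2, axis=0)    # pool token features to one vector

query = rng.normal(size=(4, d))   # 4 query tokens
doc = rng.normal(size=(10, d))    # 10 document tokens

target = maxsim(query, doc)                   # exact MaxSim (training target)
approx = float(encode(query) @ encode(doc))   # single-vector approximation
```

Supervised training would then fit the MLP weights so `approx` tracks `target` over a corpus, which is why the documents have to be seen before queries can be encoded.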

Member

@davidmezzetti Mar 29, 2026

I don't really have a good answer, as the pooling module doesn't get a single pass over the data to do something like that. Perhaps there would need to be a separate training process outside of the normal indexing flow.

Author

I am confused here. Can you clarify what you mean by separate training? And since it needs documents to be trained on, how can that be done?

Member

I'm suggesting a separate pipeline process that would need to be run first to train a model. Then the pooling module would load the pre-trained model.

Author

@Vishal35198 Mar 29, 2026

How about adding the training-specific part to lateencoder.py in the LateEncoder class?
Also, currently there is no option to choose between LEMUR and MUVERA; I'll need to distinguish between the two in the same class. What are your thoughts on this?

Member

Perhaps it makes sense to just put it in its own pipeline class (LateTrainer, LemurTrainer, etc). You certainly could use the LateEncoder pipeline to do the maxsim calculations.

I wonder if a sample of the dataset is good enough. In other words, if an input dataset is 1 million rows, would training be fine with 10% of that, or 1%?

I would use Safetensors as the vector storage format for this model. Also save the model parameters into a config.json file. And I'd have the loading logic use the standard huggingface_hub interfaces so it supports loading a model both from local storage and from the HF Hub.

With this setup I think a lot of the same scaffolding that's already in the late pooling class (https://github.com/neuml/txtai/blob/master/src/python/txtai/models/pooling/late.py) can be reused.

I'd see applying the pretrained lemur model right here: https://github.com/neuml/txtai/blob/master/src/python/txtai/models/pooling/late.py#L67
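As a rough sketch of that save/load layout, with plain `.npy` files standing in for safetensors and a local path standing in for huggingface_hub resolution, so the example stays dependency-light (all function and key names here are assumptions, not the actual txtai API):

```python
import json
import os
import tempfile

import numpy as np

def save_lemur(path, weights, params):
    """Persist model weights plus a config.json of parameters to a directory.
    The suggestion in the thread is safetensors for the weights; .npy files
    stand in here to keep the sketch dependency-light."""
    os.makedirs(path, exist_ok=True)
    for name, tensor in weights.items():
        np.save(os.path.join(path, f"{name}.npy"), tensor)
    with open(os.path.join(path, "config.json"), "w", encoding="utf-8") as f:
        json.dump(params, f)

def load_lemur(path):
    """Load parameters and weights back. A real implementation would resolve
    `path` via huggingface_hub so it could also be a HF Hub model id."""
    with open(os.path.join(path, "config.json"), encoding="utf-8") as f:
        params = json.load(f)
    weights = {
        name[:-4]: np.load(os.path.join(path, name))
        for name in os.listdir(path) if name.endswith(".npy")
    }
    return weights, params

with tempfile.TemporaryDirectory() as path:
    save_lemur(path, {"w1": np.ones((2, 3))}, {"hidden": 3})
    weights, params = load_lemur(path)
```

The pooling class would then call something like `load_lemur` once at construction time and apply the pre-trained encoder inside the vector build step linked above.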

Member

@davidmezzetti Mar 29, 2026

Also I wonder if you could train a general LEMUR model that would generalize to other datasets. It seems like this algorithm is just learning to approximate maxsim, which doesn't seem dataset-specific.

If that were the case, we'd train a general LEMUR model and that would be the default LEMUR implementation. People could choose to customize that but wouldn't always have to.

@davidmezzetti
Member

Thank you for this PR!

As you're developing, I recommend running the txtai benchmark scripts as was done here: #1023 (comment)

That way we can have an idea of the usefulness of this method.

@davidmezzetti
Member

Any luck in implementing this?

@davidmezzetti
Member

@Vishal35198 Do you plan to continue this work or should I close this PR?
