Feature Request: Traversal-Integrated groupBy for Vector Search
Follow-up to #4067, which shipped the groupBy / groupSize options on vector.neighbors and vector.sparseNeighbors as an MVP using post-traversal filtering. This issue tracks the work to push the grouping into the index traversal proper.
Background
The MVP delivered in #4067 implements groupBy as a post-traversal filter:
- Over-fetch
k * groupSize * 5 candidates from JVector (dense) or the sparse-index top-K loop.
- Iterate the candidates in rank order, keeping per-group counters.
- Stop once
k groups are filled at groupSize each.
This is correct and works well for typical RAG workloads (10s of groups, modest k). It has two limits:
- Wasted work: candidates that get filtered out were still scored and (for dense) included in HNSW graph traversal.
- Loose recall guarantee: when group cardinality is high or distribution is skewed, the 5x over-fetch multiplier may not be enough to fill
k groups, and the user has to bump efSearch or accept fewer groups.
Qdrant's query_points_groups integrates grouping into the search loop; Milvus's group_by_field does the same. Both avoid the over-fetch heuristic.
Design
Dense (vector.neighbors)
JVector exposes a per-point Bits filter and a top-K heap. The cleanest integration is a custom search-result collector that:
- Accepts the
groupBy field reader (a RID -> Object lookup, typically record.get("source_file")).
- Maintains a per-group count and a per-group worst-score watermark.
- Admits a candidate to the heap only if
count[groupKey] < groupSize, OR the candidate beats the group's worst member (in which case the worst member is evicted and replaced).
- Stops the HNSW traversal once every group in flight is full.
Two integration points to investigate in JVector:
SearchScoreProvider + custom collector: replace JVector's default heap with a group-aware one, called via the search loop's collect callback.
Bits filter with adaptive bookkeeping: the filter reads from a shared per-group counter table and rejects candidates whose group is already full and whose score wouldn't beat the group's worst member. Cheaper to implement; possibly less effective than option 1.
For the first iteration option 2 is probably enough.
Sparse (vector.sparseNeighbors)
LSMSparseVectorIndex.topK is our own code, so the integration is straightforward: the WAND DAAT loop already maintains a min-heap of size K. Replace it with a per-group min-heap and the same admission logic as above. Pivot upper bound becomes per-group: a candidate's RID can only enter a group's heap if its score can beat that group's worst member.
Acceptance criteria
Out of scope
- Cross-collection grouping (Qdrant's
with_lookup).
- Multi-key grouping (e.g.
groupBy: ['author', 'year']).
- Group-level scoring (e.g. avg score per group as a tiebreaker).
Related
- Parent: #4067 (groupBy MVP, shipped)
- Sibling follow-up: dotted-nested-field groupBy
- Discussion: #4044
Feature Request: Traversal-Integrated
groupByfor Vector SearchFollow-up to #4067, which shipped the
groupBy/groupSizeoptions onvector.neighborsandvector.sparseNeighborsas an MVP using post-traversal filtering. This issue tracks the work to push the grouping into the index traversal proper.Background
The MVP delivered in #4067 implements
groupByas a post-traversal filter:k * groupSize * 5candidates from JVector (dense) or the sparse-index top-K loop.kgroups are filled atgroupSizeeach.This is correct and works well for typical RAG workloads (10s of groups, modest
k). It has two limits:kgroups, and the user has to bumpefSearchor accept fewer groups.Qdrant's
query_points_groupsintegrates grouping into the search loop; Milvus'sgroup_by_fielddoes the same. Both avoid the over-fetch heuristic.Design
Dense (
vector.neighbors)JVector exposes a per-point
Bitsfilter and a top-K heap. The cleanest integration is a custom search-result collector that:groupByfield reader (aRID -> Objectlookup, typicallyrecord.get("source_file")).count[groupKey] < groupSize, OR the candidate beats the group's worst member (in which case the worst member is evicted and replaced).Two integration points to investigate in JVector:
SearchScoreProvider+ custom collector: replace JVector's default heap with a group-aware one, called via the search loop'scollectcallback.Bitsfilter with adaptive bookkeeping: the filter reads from a shared per-group counter table and rejects candidates whose group is already full and whose score wouldn't beat the group's worst member. Cheaper to implement; possibly less effective than option 1.For the first iteration option 2 is probably enough.
Sparse (
vector.sparseNeighbors)LSMSparseVectorIndex.topKis our own code, so the integration is straightforward: the WAND DAAT loop already maintains a min-heap of size K. Replace it with a per-group min-heap and the same admission logic as above. Pivot upper bound becomes per-group: a candidate's RID can only enter a group's heap if its score can beat that group's worst member.Acceptance criteria
vector.neighborsintegrates grouping into the JVector search loop, not as a post-filter on a fixed over-fetch.vector.sparseNeighborsintegrates grouping into the WAND DAAT loop inLSMSparseVectorIndex.topK.SQLFunctionVectorGroupByTest.k=10, groupSize=1on a 100k-doc / 1000-group benchmark.efSearch.Out of scope
with_lookup).groupBy: ['author', 'year']).Related