Skip to content

feat: traversal-integrated groupBy for vector search (follow-up to #4067) #4071

@lvca

Description

@lvca

Feature Request: Traversal-Integrated groupBy for Vector Search

Follow-up to #4067, which shipped the groupBy / groupSize options on vector.neighbors and vector.sparseNeighbors as an MVP using post-traversal filtering. This issue tracks the work to push the grouping into the index traversal proper.

Background

The MVP delivered in #4067 implements groupBy as a post-traversal filter:

  1. Over-fetch k * groupSize * 5 candidates from JVector (dense) or the sparse-index top-K loop.
  2. Iterate the candidates in rank order, keeping per-group counters.
  3. Stop once k groups are filled at groupSize each.

This is correct and works well for typical RAG workloads (10s of groups, modest k). It has two limits:

  • Wasted work: candidates that get filtered out were still scored and (for dense) included in HNSW graph traversal.
  • Loose recall guarantee: when group cardinality is high or distribution is skewed, the 5x over-fetch multiplier may not be enough to fill k groups, and the user has to bump efSearch or accept fewer groups.

Qdrant's query_points_groups integrates grouping into the search loop; Milvus's group_by_field does the same. Both avoid the over-fetch heuristic.

Design

Dense (vector.neighbors)

JVector exposes a per-point Bits filter and a top-K heap. The cleanest integration is a custom search-result collector that:

  • Accepts the groupBy field reader (a RID -> Object lookup, typically record.get("source_file")).
  • Maintains a per-group count and a per-group worst-score watermark.
  • Admits a candidate to the heap only if count[groupKey] < groupSize, OR the candidate beats the group's worst member (in which case the worst member is evicted and replaced).
  • Stops the HNSW traversal once every group in flight is full.

Two integration points to investigate in JVector:

  1. SearchScoreProvider + custom collector: replace JVector's default heap with a group-aware one, called via the search loop's collect callback.
  2. Bits filter with adaptive bookkeeping: the filter reads from a shared per-group counter table and rejects candidates whose group is already full and whose score wouldn't beat the group's worst member. Cheaper to implement; possibly less effective than option 1.

For the first iteration option 2 is probably enough.

Sparse (vector.sparseNeighbors)

LSMSparseVectorIndex.topK is our own code, so the integration is straightforward: the WAND DAAT loop already maintains a min-heap of size K. Replace it with a per-group min-heap and the same admission logic as above. Pivot upper bound becomes per-group: a candidate's RID can only enter a group's heap if its score can beat that group's worst member.

Acceptance criteria

  • vector.neighbors integrates grouping into the JVector search loop, not as a post-filter on a fixed over-fetch.
  • vector.sparseNeighbors integrates grouping into the WAND DAAT loop in LSMSparseVectorIndex.topK.
  • Identical results (RIDs and ordering) to the MVP post-filter implementation on existing tests in SQLFunctionVectorGroupByTest.
  • Lower wall-clock latency than the MVP at k=10, groupSize=1 on a 100k-doc / 1000-group benchmark.
  • No regression in recall vs the MVP at fixed efSearch.
  • Benchmarks vs the MVP on representative shapes (low / medium / high group cardinality).

Out of scope

  • Cross-collection grouping (Qdrant's with_lookup).
  • Multi-key grouping (e.g. groupBy: ['author', 'year']).
  • Group-level scoring (e.g. avg score per group as a tiebreaker).

Related

  • Parent: #4067 (groupBy MVP, shipped)
  • Sibling follow-up: dotted-nested-field groupBy
  • Discussion: #4044

Metadata

Metadata

Assignees

No fields configured for Feature.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions