Background
We run vector search over a ~6.2 M-row corpus of vector(1024)
candidates on PostgreSQL 18 with VectorChord. Most of our production
queries are cohort-shaped: a vector ORDER BY combined with several
mandatory equality / range / bitmask predicates over fixed-width
columns (feed, status, visibility, geo, time bucket, etc.). The
predicates typically cut the working set to 5–30 % of rows, but the
ORDER BY still needs ANN over the survivors.
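For concreteness, a typical cohort-shaped query has roughly this shape (table and column names are illustrative, not our real schema):

```sql
-- Illustrative cohort-shaped query; $1 is the 1024-dim query vector.
SELECT id
FROM items
WHERE feed_id = 42                          -- equality
  AND status  = 1                           -- equality
  AND (visibility_flags & 12) = 12          -- bitmask
  AND bucket BETWEEN 7300 AND 7400          -- range (time bucket)
ORDER BY embedding <#> $1                   -- inner-product ANN over the survivors
LIMIT 100;
```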
Stock vchordrq.prefilter (the existing prefilter from #247, thanks!)
already helps by letting us scan the index instead of falling back to a
quicksort. But it's a heap-side prefilter: the IVF scan returns candidates,
the index fetches each candidate's full f32 vector, then the heap row is
checked, and the vector reads for rejected candidates are wasted I/O.
For us this matters most on memory-constrained instances. With a 33 GB
vchord index that doesn't fit in shared_buffers, every wasted vector
fetch is a random page fault on the index, and at the cohort
selectivities we see (~10 %), that means 90 % of the IVF candidates
get read, scored, and discarded.
Proposed feature
Allow INCLUDE columns of bigint type on vchordrq indexes, store their
values in the H0 leaf tuples, and add a candidate-filter callback that
runs before the vector page fetch. Introduce a GUC
vchordrq.metadata_prefilter with three modes: off, reject_only
(definitely-false predicates short-circuit further work), and
covered_skip_heap (lossless predicates can also skip the heap
recheck).
In the IVF scan: for each H0 candidate, evaluate the compiled metadata
predicates against the candidate's INCLUDE values; if any predicate
proves "definitely false", skip the candidate without faulting in its
vector. Otherwise proceed as today (vector fetch + rerank + heap
recheck if the predicate is lossy).
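In SQL terms, we imagine the surface looking roughly like this (a sketch of the proposed interface, reusing the illustrative schema from the Background section; the plain INCLUDE form here sidesteps the lossless/lossy opclass question raised below):

```sql
-- Sketch of the proposed surface, not a finalized design.
-- bigint metadata columns stored alongside the vector in the H0 leaf tuples:
CREATE INDEX items_vec_idx ON items
  USING vchordrq (embedding vector_ip_ops)
  INCLUDE (visibility_flags, bucket);

-- reject_only: candidates whose INCLUDE values already fail a predicate are
-- dropped before their f32 vector page is fetched; everything else proceeds
-- as today (vector fetch + rerank + heap recheck).
SET vchordrq.metadata_prefilter = 'reject_only';
```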
Prototype + numbers
We have a working prototype in a private fork (it carries some
company-specific tooling and configuration in its history that we can't
publish). It implements all of the above. Happy to share the repo
privately with a maintainer if that would help review; we'd rather not
upstream-PR off that branch directly anyway, since it hardcodes our
domain's column names and would need to be generalized first.
Two boxes, two bottleneck regimes, identity recall verified vs the
no-metadata path on 20 random queries:
Clean attribution test (32 GiB / 4 vCPU, CPU-bound).
Stock 1.0.0 vs the prototype's metadata path on the same hardware,
with vchordrq.prefilter=on, enable_sort=off, and read_stream IO
applied to both runs (the only delta is the metadata feature itself
plus the wide-INCLUDE index):
| config | peak QPS | p95 at peak |
| --- | --- | --- |
| stock 1.0.0, plain index, GUCs tuned | 5.68 (c=8) | 2 324 ms |
| prototype, wide-INCLUDE + metadata_prefilter=reject_only | 6.84 (c=4) | 836 ms |
That's +20 % peak QPS, −64 % p95. The QPS gain is modest because
this box is CPU-bound at c≥4 with only 4 vCPUs. The latency gain is
where the metadata reject earns its keep: sub-second p95 at the same
concurrency where stock-tuned is still over 2 seconds.
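For reference, the two GUCs named above are session-settable; a minimal sketch of the tuned baseline (other site-specific tuning omitted):

```sql
-- GUCs applied to both attribution runs, per the setup above.
SET vchordrq.prefilter = on;  -- index-scan prefilter from #247
SET enable_sort = off;        -- keep the planner off the sort path
```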
Memory-bound regime (16 vCPU host, 16 GiB cgroup cap). Same
prototype, toggling whether the metadata predicates are emitted:
| config | peak QPS @ c=16 | p95 @ c=16 |
| --- | --- | --- |
| prototype, no metadata predicates emitted | 11.86 | 2 635 ms |
| prototype, wide-metadata reject | 37.79 | 732 ms |
+218 % peak QPS, −72 % p95 under memory pressure. The gain is much
larger here because rejecting a candidate before its f32 vector page
faults in is exactly the right thing to do when shared_buffers can't
hold the index. The 33 GB index on this corpus is dominated by per-row
f32 vector copies on VectorTuple pages (~6.2 M rows × 1024 dims × 4 bytes
≈ 25 GB of raw vector payload alone). Those copies exist solely so
rerank can avoid a heap fetch, and they're exactly what the metadata
reject lets us avoid reading in the first place.
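As a quick sanity check of which regime a deployment is in, the index size can be compared against shared_buffers (items_vec_idx is the illustrative index name from the sketch above):

```sql
-- Is the vchordrq index larger than shared_buffers? If so, wasted vector
-- fetches turn into page faults and the memory-bound numbers above apply.
SELECT pg_size_pretty(pg_relation_size('items_vec_idx')) AS index_size;
SHOW shared_buffers;
```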
A side note from the attribution work: the prototype's non-metadata
improvements (read_stream IO, prefetcher batching, search-loop
refactor) produced essentially zero throughput delta over stock 1.0.0
with the same GUCs applied. The metadata-prefilter feature is the only
novel contribution worth upstreaming.
Open questions for the maintainers
A couple of decisions are upstream-shaped and we'd value your steer
before writing the public PR:
- How users mark "lossless" vs "lossy" INCLUDE columns.
Some metadata predicates prove the heap row matches (e.g. a
stored bitmask passing (col & mask) = mask is exact); some only
accelerate the reject case (e.g. a 64-bit hash where a non-match
is conclusive but a match is "probably matches, recheck the heap").
Lossless predicates let us skip the heap recheck; lossy ones can't
(an illustrative query pair follows after this list).
The most natural way to surface this in PG would be per-column
opclass options. We'd define two bigint opclasses,
bigint_lossless_ops and bigint_lossy_ops, and use them on the
INCLUDE columns:
```sql
CREATE INDEX … USING vchordrq (vec vector_ip_ops)
  INCLUDE (flags_col bigint_lossless_ops,
           hash_col bigint_lossy_ops);
```
That keeps the SQL declarative and uses standard PG plumbing. Any
pushback on that approach, or a different shape you'd prefer?
- Tuple format change. The prototype adds a versioned metadata
tail to FrozenTuple and AppendableTuple
(METADATA_TAIL_VERSION = 2). For indexes built without INCLUDE
columns, attr_count = 0 ⇒ no tail emitted ⇒ existing 1.x indexes
read identically. Is that backward-compat shape acceptable, or
would you rather the entire feature be gated behind a new index
reloption / opclass so the on-disk format is unchanged for existing
users?
- Tests. We'd plan to add ≥6 sqllogictest files in
tests/vchordrq/ covering eq / IN / range / bitmask / NULL / mixed
with ORDER BY, plus an identity-recall test against the
no-metadata path. Anything else you'd want before review?
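To make the lossless vs lossy distinction from the first question concrete, here is the kind of query pair we have in mind. This is a sketch under assumptions: flags_col stores the exact heap bitmask, hash_col stores a 64-bit hash of a hypothetical region_code column; neither name comes from the prototype.

```sql
-- Illustrative sketch, not the prototype's code; schema names are assumptions.

-- Lossless case: the index's flags_col holds the exact heap bitmask, so a
-- candidate passing (flags_col & 12) = 12 in the index is proven to match,
-- and under covered_skip_heap the heap recheck could be skipped.
SELECT id FROM items
WHERE (visibility_flags & 12) = 12
ORDER BY embedding <#> $1
LIMIT 100;

-- Lossy case: the index's hash_col holds a 64-bit hash of region_code. A hash
-- mismatch conclusively rejects a candidate before its vector is read, but a
-- hash match only means "probably matches", so the heap row is still rechecked
-- (reject_only behaviour).
SELECT id FROM items
WHERE region_code = 'eu-west'
ORDER BY embedding <#> $1
LIMIT 100;
```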
If this is on your roadmap or you'd rather take a different shape,
please let us know. We'd rather align early than rebuild later.
Acknowledgements
The prototype was implemented by @SamORichards (cc'd). I (@spenc-r) am
happy to do the upstream-generalization work and own the public PR; I'll
add Sam as Co-authored-by: on each commit.