Proposal: collection-level data freshness gate for RAG retrieval pipelines #2591

SomeshZanwar · 2026-05-26T05:24:04Z

SomeshZanwar
May 26, 2026

Summary

agent-rag-governance already covers important retrieval-time controls: collection access policy, rate limiting, content scanning, and audit logging.

One adjacent governance question is not currently represented clearly:

Is the collection being queried still trustworthy at the time of retrieval?

This proposal is about exploring a collection-level freshness / quality gate for RAG retrieval pipelines.

The goal is not to require a specific data quality provider or add mandatory behavior. The goal is to validate whether AGT should support an optional pattern for blocking retrieval when the source collection is stale, failing validation, or below an acceptable quality threshold.

Problem

A RAG pipeline can pass the existing governance checks and still retrieve from a weak source collection.

For example:

the agent is allowed to query the collection
the request is under the rate limit
retrieved chunks do not contain PII
retrieved chunks do not contain prompt-injection patterns

But the collection itself may still be problematic:

last refreshed several weeks ago
failed the latest validation run
below a quality threshold for the workflow
flagged by a catalog or data quality system as unfit for use

In that case, the retrieval may be governance-compliant at the access/content layer, but still unreliable at the data-source layer.

Proposed direction

Add an optional collection-level data freshness / quality gate to the RAG governance pattern.

Conceptually:

RAG request
  -> collection access check
  -> rate-limit check
  -> optional collection freshness / quality check
  -> retrieval
  -> content scan
  -> audit log

A possible example shape could look like:

from datetime import datetime

freshness_record = CollectionFreshnessRecord(
    collection="user_events",
    last_validated_at=datetime(2026, 5, 14, 8, 0),
    freshness_threshold_hours=6.0,
    validation_status="fail",
    quality_score=0.72,
    quality_threshold=0.85,
    failed_tests=["not_null_user_id", "accepted_values_event_type"],
)

If the collection fails the freshness / quality gate, retrieval would be blocked before chunks are fetched or passed to the LLM.

The audit record could then explain that the denial happened at the collection freshness layer, not at the access-policy or content-scanning layer.

Why this fits the RAG governance model

RAG governance does not only depend on who can retrieve.

It also depends on whether the retrieved source material is fit to use.

This is especially relevant for analytics, compliance, finance, healthcare, and operational decision-support workflows where stale or failing source data can produce misleading downstream answers.

The proposed gate would be:

optional
additive
provider-neutral
compatible with dbt, Great Expectations, Soda, data catalogs, or custom metadata systems
separate from content scanning
separate from collection access control

Relationship to existing AGT work

This is related to the data-quality-aware governance examples already merged into AGT.

At the agent-action layer:

AGT policy + data quality evidence -> allow/block action

At the RAG retrieval layer, the equivalent pattern would be:

RAG policy + collection freshness evidence -> allow/block retrieval

Same principle, different enforcement point.

Non-goals

This proposal is not asking AGT to:

require dbt, Great Expectations, Soda, or any single data quality tool
define how freshness records must be populated
add mandatory freshness checks to all RAG pipelines
replace collection access control
replace content scanning
change audit semantics broadly
standardize a universal data quality schema immediately

Open questions

Does this belong in agent-rag-governance as an optional package component?
Should this start as an example-only pattern first?
Should the freshness record be a small typed object, a protocol/interface, or just documented example metadata?
Should the audit log include a dedicated collection_freshness decision reason?
Should this be positioned as collection freshness, collection quality, or more generally collection trust?

Suggested first step

If maintainers think this direction is useful, I can start with a low-risk example or documentation PR first.

That would avoid changing core RAG behavior while showing the pattern end-to-end:

collection freshness record -> RAG governance decision -> audit-friendly allow/block reason

I am happy to follow whichever path maintainers prefer: docs/example first, or a small optional package component if that fits the current direction of agent-rag-governance.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: collection-level data freshness gate for RAG retrieval pipelines #2591

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Proposal: collection-level data freshness gate for RAG retrieval pipelines #2591

Uh oh!

SomeshZanwar May 26, 2026

Summary

Problem

Proposed direction

Why this fits the RAG governance model

Relationship to existing AGT work

Non-goals

Open questions

Suggested first step

Replies: 0 comments

SomeshZanwar
May 26, 2026