Skip to content

Investigate a cache_if_supported() helper for backend-portable Ibis caching #479

Description

@murray-ds

Background

CustomerDecisionHierarchy (openretailscience/analysis/customer_decision_hierarchy.py) reduces transactions to a distinct (customer, product) pairs table and then references it three times in a single query: the two sides of a customer-keyed self-join plus a nunique(customer) scalar subquery that broadcasts the customer total onto every row.

An earlier revision wrapped this with pairs = pairs.cache() to materialize the reduction (the DISTINCT, and in the exclusion path an anti-join) once, so the three references read a small temp table instead of potentially recomputing the reduction from the base table each time. That mattered most for the 1B–10B row inputs this module targets.

.cache() was dropped (commit on branch claude/ibis-support-gaps-ZOOSJ) because it is not supported on every Ibis backend — notably Databricks — and would make the module non-portable.

The tradeoff we're leaving on the table

Whether dropping .cache() costs anything is engine-dependent:

  • DuckDBEXPLAIN confirms one SEQ_SCAN of the base table; the compiler factors pairs into a single CTE (WITH t1 AS …) materialized once and reused via three CTE_SCANs. Cache buys essentially nothing here.
  • Spark/Databricks, BigQuery, Snowflake — these engines can inline a CTE that is referenced multiple times, which would recompute the DISTINCT + anti-join subgraph up to 3×. On a 10B-row base, this is exactly where .cache() earned its keep.

So on the backends that can't use .cache(), we may also be the most exposed to CTE-inlining recomputation. We should measure this rather than assume.

Proposed idea

A small backend-aware helper that caches only where the backend supports it, and is a no-op elsewhere:

import ibis.expr.types as ir


def cache_if_supported(table: ir.Table) -> ir.Table:
    """Cache an Ibis table expression if the backend supports temp tables."""
    backend = table.get_backend()

    if getattr(backend, "supports_temporary_tables", False):
        return table.cache()

    return table

Call site would become pairs = cache_if_supported(pairs).

Scope of investigation

  • Confirm the right capability flag. Verify whether backend.supports_temporary_tables (or another documented attribute / a try/except around .cache()) is the correct, stable way to detect support across the backends we target (BigQuery, Snowflake, PySpark/Databricks, DuckDB). Do not ship a hardcoded backend-name denylist.
  • Decide where the helper lives if adopted. A query-engine helper is not validation, so it does not belong in core/validation.py. Consider a small core/ibis_utils.py (new) or similar. Keep it to one obvious home.
  • Benchmark on a realistic (ideally billion-row) dataset on a CTE-inlining backend (Spark/Databricks or BigQuery): compare wall-clock and bytes-scanned for the _get_yules_q_distances query with vs without caching. The helper only earns its place if there is a measurable win on a backend that actually supports it.
  • Verify portability: confirm the helper is a clean no-op (still one execute(), identical result matrix) on Databricks and any other backend lacking temp-table support.
  • Per the repo simplicity criterion, only adopt the helper if the benchmark shows a real win. If the engines that support .cache() also already factor the CTE well (as DuckDB does), the helper may not be worth the abstraction — capture that finding either way.

Acceptance

Either:

  • a benchmarked recommendation to add cache_if_supported() (with the win quantified and the detection mechanism validated), or
  • a written conclusion that it isn't worth it, with the supporting numbers, so we don't revisit this blindly.

References

  • Module: openretailscience/analysis/customer_decision_hierarchy.py (_get_yules_q_distances)
  • Prior art for un-cached multi-reference reuse: ProductAssociation._calc_association in openretailscience/analysis/product_association.py derives unique_transactions once and references it 4× without caching.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions