Background
CustomerDecisionHierarchy (openretailscience/analysis/customer_decision_hierarchy.py) reduces transactions to a distinct (customer, product) pairs table and then references it three times in a single query: the two sides of a customer-keyed self-join plus a nunique(customer) scalar subquery that broadcasts the customer total onto every row.
An earlier revision wrapped this with pairs = pairs.cache() to materialize the reduction (the DISTINCT, and in the exclusion path an anti-join) once, so the three references read a small temp table instead of potentially recomputing the reduction from the base table each time. That mattered most for the 1B–10B row inputs this module targets.
.cache() was dropped (commit on branch claude/ibis-support-gaps-ZOOSJ) because it is not supported on every Ibis backend — notably Databricks — and would make the module non-portable.
The tradeoff we're leaving on the table
Whether dropping .cache() costs anything is engine-dependent:
- DuckDB —
EXPLAIN confirms one SEQ_SCAN of the base table; the compiler factors pairs into a single CTE (WITH t1 AS …) materialized once and reused via three CTE_SCANs. Cache buys essentially nothing here.
- Spark/Databricks, BigQuery, Snowflake — these engines can inline a CTE that is referenced multiple times, which would recompute the
DISTINCT + anti-join subgraph up to 3×. On a 10B-row base, this is exactly where .cache() earned its keep.
So on the backends that can't use .cache(), we may also be the most exposed to CTE-inlining recomputation. We should measure this rather than assume.
Proposed idea
A small backend-aware helper that caches only where the backend supports it, and is a no-op elsewhere:
import ibis.expr.types as ir
def cache_if_supported(table: ir.Table) -> ir.Table:
"""Cache an Ibis table expression if the backend supports temp tables."""
backend = table.get_backend()
if getattr(backend, "supports_temporary_tables", False):
return table.cache()
return table
Call site would become pairs = cache_if_supported(pairs).
Scope of investigation
Acceptance
Either:
- a benchmarked recommendation to add
cache_if_supported() (with the win quantified and the detection mechanism validated), or
- a written conclusion that it isn't worth it, with the supporting numbers, so we don't revisit this blindly.
References
- Module:
openretailscience/analysis/customer_decision_hierarchy.py (_get_yules_q_distances)
- Prior art for un-cached multi-reference reuse:
ProductAssociation._calc_association in openretailscience/analysis/product_association.py derives unique_transactions once and references it 4× without caching.
Background
CustomerDecisionHierarchy(openretailscience/analysis/customer_decision_hierarchy.py) reduces transactions to a distinct(customer, product)pairstable and then references it three times in a single query: the two sides of a customer-keyed self-join plus anunique(customer)scalar subquery that broadcasts the customer total onto every row.An earlier revision wrapped this with
pairs = pairs.cache()to materialize the reduction (theDISTINCT, and in the exclusion path an anti-join) once, so the three references read a small temp table instead of potentially recomputing the reduction from the base table each time. That mattered most for the 1B–10B row inputs this module targets..cache()was dropped (commit on branchclaude/ibis-support-gaps-ZOOSJ) because it is not supported on every Ibis backend — notably Databricks — and would make the module non-portable.The tradeoff we're leaving on the table
Whether dropping
.cache()costs anything is engine-dependent:EXPLAINconfirms oneSEQ_SCANof the base table; the compiler factorspairsinto a single CTE (WITH t1 AS …) materialized once and reused via threeCTE_SCANs. Cache buys essentially nothing here.DISTINCT+ anti-join subgraph up to 3×. On a 10B-row base, this is exactly where.cache()earned its keep.So on the backends that can't use
.cache(), we may also be the most exposed to CTE-inlining recomputation. We should measure this rather than assume.Proposed idea
A small backend-aware helper that caches only where the backend supports it, and is a no-op elsewhere:
Call site would become
pairs = cache_if_supported(pairs).Scope of investigation
backend.supports_temporary_tables(or another documented attribute / atry/exceptaround.cache()) is the correct, stable way to detect support across the backends we target (BigQuery, Snowflake, PySpark/Databricks, DuckDB). Do not ship a hardcoded backend-name denylist.core/validation.py. Consider a smallcore/ibis_utils.py(new) or similar. Keep it to one obvious home._get_yules_q_distancesquery with vs without caching. The helper only earns its place if there is a measurable win on a backend that actually supports it.execute(), identical result matrix) on Databricks and any other backend lacking temp-table support..cache()also already factor the CTE well (as DuckDB does), the helper may not be worth the abstraction — capture that finding either way.Acceptance
Either:
cache_if_supported()(with the win quantified and the detection mechanism validated), orReferences
openretailscience/analysis/customer_decision_hierarchy.py(_get_yules_q_distances)ProductAssociation._calc_associationinopenretailscience/analysis/product_association.pyderivesunique_transactionsonce and references it 4× without caching.