Investigate a `cache_if_supported()` helper for backend-portable Ibis caching

## Background

`CustomerDecisionHierarchy` (`openretailscience/analysis/customer_decision_hierarchy.py`) reduces transactions to a distinct `(customer, product)` `pairs` table and then references it **three times** in a single query: the two sides of a customer-keyed self-join plus a `nunique(customer)` scalar subquery that broadcasts the customer total onto every row.

An earlier revision wrapped this with `pairs = pairs.cache()` to materialize the reduction (the `DISTINCT`, and in the exclusion path an anti-join) once, so the three references read a small temp table instead of potentially recomputing the reduction from the base table each time. That mattered most for the 1B–10B row inputs this module targets.

`.cache()` was dropped (commit on branch `claude/ibis-support-gaps-ZOOSJ`) because **it is not supported on every Ibis backend — notably Databricks** — and would make the module non-portable.

## The tradeoff we're leaving on the table

Whether dropping `.cache()` costs anything is **engine-dependent**:

- **DuckDB** — `EXPLAIN` confirms one `SEQ_SCAN` of the base table; the compiler factors `pairs` into a single CTE (`WITH t1 AS …`) materialized once and reused via three `CTE_SCAN`s. Cache buys essentially nothing here.
- **Spark/Databricks, BigQuery, Snowflake** — these engines can **inline** a CTE that is referenced multiple times, which would recompute the `DISTINCT` + anti-join subgraph up to 3×. On a 10B-row base, this is exactly where `.cache()` earned its keep.

So on the backends that *can't* use `.cache()`, we may also be the most exposed to CTE-inlining recomputation. We should measure this rather than assume.

## Proposed idea

A small backend-aware helper that caches only where the backend supports it, and is a no-op elsewhere:

```python
import ibis.expr.types as ir


def cache_if_supported(table: ir.Table) -> ir.Table:
    """Cache an Ibis table expression if the backend supports temp tables."""
    backend = table.get_backend()

    if getattr(backend, "supports_temporary_tables", False):
        return table.cache()

    return table
```

Call site would become `pairs = cache_if_supported(pairs)`.

## Scope of investigation

- [ ] Confirm the right capability flag. Verify whether `backend.supports_temporary_tables` (or another documented attribute / a `try/except` around `.cache()`) is the correct, stable way to detect support across the backends we target (BigQuery, Snowflake, PySpark/Databricks, DuckDB). Do **not** ship a hardcoded backend-name denylist.
- [ ] Decide where the helper lives if adopted. A query-engine helper is not validation, so it does **not** belong in `core/validation.py`. Consider a small `core/ibis_utils.py` (new) or similar. Keep it to one obvious home.
- [ ] Benchmark on a realistic (ideally billion-row) dataset on a CTE-inlining backend (Spark/Databricks or BigQuery): compare wall-clock and bytes-scanned for the `_get_yules_q_distances` query **with** vs **without** caching. The helper only earns its place if there is a measurable win on a backend that actually supports it.
- [ ] Verify portability: confirm the helper is a clean no-op (still one `execute()`, identical result matrix) on Databricks and any other backend lacking temp-table support.
- [ ] Per the repo simplicity criterion, only adopt the helper if the benchmark shows a real win. If the engines that support `.cache()` also already factor the CTE well (as DuckDB does), the helper may not be worth the abstraction — capture that finding either way.

## Acceptance

Either:
- a benchmarked recommendation to add `cache_if_supported()` (with the win quantified and the detection mechanism validated), **or**
- a written conclusion that it isn't worth it, with the supporting numbers, so we don't revisit this blindly.

## References
- Module: `openretailscience/analysis/customer_decision_hierarchy.py` (`_get_yules_q_distances`)
- Prior art for un-cached multi-reference reuse: `ProductAssociation._calc_association` in `openretailscience/analysis/product_association.py` derives `unique_transactions` once and references it 4× without caching.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Investigate a `cache_if_supported()` helper for backend-portable Ibis caching #479

Background

The tradeoff we're leaving on the table

Proposed idea

Scope of investigation

Acceptance

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Investigate a cache_if_supported() helper for backend-portable Ibis caching #479

Description

Background

The tradeoff we're leaving on the table

Proposed idea

Scope of investigation

Acceptance

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Investigate a `cache_if_supported()` helper for backend-portable Ibis caching #479