Skip to content

Convert the Gain Loss analysis to Ibis #152

Description

@mvanwyk

Migrate GainLoss Class to Use Ibis for Performance

Problem Statement

The current GainLoss class in pyretailscience/analysis/gain_loss.py performs all data processing using pandas DataFrame operations. This creates performance bottlenecks when analyzing large retail datasets, as all data must be loaded into memory and processed row-by-row using pandas operations.

Current Architecture Issues

  1. Memory Intensive: All data is loaded into pandas DataFrames, limiting scalability
  2. Inefficient Processing: Uses pandas groupby and apply operations instead of optimized SQL queries
  3. Inconsistent with Codebase: Other analysis modules like SegTransactionStats use Ibis for efficient query generation
  4. No Query Inspection: Users cannot see or optimize the SQL being executed
  5. Limited Backend Support: Locked into pandas, cannot leverage database engines (DuckDB, PostgreSQL, etc.)

Proposed Solution

Refactor the GainLoss class to use Ibis for data processing, following the same architecture pattern as SegTransactionStats. This will enable:

  1. Efficient SQL query generation for aggregations
  2. Lazy evaluation with .table property exposing Ibis expressions
  3. Database engine compatibility (DuckDB, PostgreSQL, BigQuery, etc.)
  4. Query inspection and optimization capabilities
  5. Consistent API across all analysis modules

Current Implementation Analysis

Key Methods to Migrate

File: pyretailscience/analysis/gain_loss.py

  1. __init__ method (lines 31-94):

    • Currently accepts pd.DataFrame
    • Needs to accept pd.DataFrame | ibis.Table
    • Convert pandas to ibis.memtable if needed
  2. _calc_gain_loss method (lines 140-226):

    • Core computation using pandas groupby
    • Lines 166-167: Convert customer_id to categorical
    • Lines 169-193: Multiple groupby aggregations with boolean indexing
    • Lines 195: Outer merge of period DataFrames
    • Lines 200-202: Diff calculations
    • Lines 204-224: Apply process_customer_group row-by-row
    • Migration Priority: HIGH - This is the main bottleneck
  3. _calc_gains_loss_table method (lines 229-245):

    • Simple aggregation over gain_loss_df
    • Lines 242-243: Sum aggregation (total row)
    • Lines 245: Group-level aggregation
    • Migration Priority: MEDIUM - Straightforward Ibis conversion
  4. process_customer_group static method (lines 97-137):

    • Business logic for classifying customer movements
    • Pure computation logic (no data operations)
    • Migration Priority: LOW - May need to convert to Ibis expression or keep as-is

Data Flow

Input DataFrame
    ↓
Filter by indices (p1_index, p2_index, focus_group_index, comparison_group_index)
    ↓
Group by customer_id (and optionally group_col)
    ↓
Aggregate value_col for each period/group combination
    ↓
Merge period 1 and period 2 results
    ↓
Calculate differences
    ↓
Apply business logic (process_customer_group)
    ↓
Aggregate to final gain/loss table

API Design

Updated Constructor Signature

class GainLoss:
    def __init__(
        self,
        df: pd.DataFrame | ibis.Table,  # UPDATED: Accept Ibis tables
        p1_index: list[bool] | pd.Series,
        p2_index: list[bool] | pd.Series,
        focus_group_index: list[bool] | pd.Series,
        focus_group_name: str,
        comparison_group_index: list[bool] | pd.Series,
        comparison_group_name: str,
        group_col: str | None = None,
        value_col: str = get_option("column.unit_spend"),
        agg_func: str = "sum",
    ) -> None:

New Properties

Following the SegTransactionStats pattern:

class GainLoss:
    _df: pd.DataFrame | None = None  # Lazy-loaded pandas result

    @property
    def table(self) -> ibis.Table:
        """Returns the Ibis table expression for the gain/loss analysis.

        This allows users to:
        - Inspect the generated SQL query
        - Further compose queries
        - Execute against different backends

        Returns:
            ibis.Table: The Ibis table expression with gain/loss columns
        """
        return self._gain_loss_table

    @property
    def df(self) -> pd.DataFrame:
        """Returns the gain/loss analysis as a pandas DataFrame.

        Lazily evaluates the Ibis expression on first access.

        Returns:
            pd.DataFrame: The gain/loss analysis results
        """
        if self._df is None:
            self._df = self.table.execute()
        return self._df

Implementation Approach

Phase 1: Input Handling and Validation

def __init__(self, df: pd.DataFrame | ibis.Table, ...):
    # Convert pandas to Ibis if needed
    if isinstance(df, pd.DataFrame):
        df = ibis.memtable(df)
    elif not isinstance(df, ibis.Table):
        raise TypeError("df must be either a pandas DataFrame or an ibis Table")

    # Validation stays the same
    # ...existing validation code...

Phase 2: Migrate _calc_gain_loss to Ibis

Challenge: The current implementation uses boolean indexing and row-wise apply operations.

Ibis Approach:

  1. Boolean IndexingIbis Filters

    # Current: df[p1_index & focus_group_index]
    # Ibis: df.filter(p1_condition & focus_condition)

    Since indices are boolean arrays from the original DataFrame, we need to:

    • Add index columns to the Ibis table
    • Use Ibis filter expressions instead of boolean indexing
  2. Groupby AggregationsIbis GroupBy

    # Current: df.groupby([customer_id, group_col])[value_col].agg(agg_func)
    # Ibis: df.group_by([customer_id, group_col]).aggregate(value=df[value_col].sum())
  3. Row-wise ApplyIbis Case Expressions
    The process_customer_group logic needs to be converted to Ibis case/when expressions:

    # Current: apply(lambda x: process_customer_group(...), axis=1)
    # Ibis: Use nested case/when expressions to implement the logic
    
    increased_focus = ibis.case()
        .when(focus_diff > 0,
              ibis.case()
                  .when(comparison_diff > 0, focus_diff)
                  .else_(ibis.greatest(0, comparison_diff + focus_diff))
        )
        .when(comparison_diff < 0, focus_diff)
        .else_(ibis.least(0, comparison_diff + focus_diff))
        .end()

Phase 3: Handle Boolean Index Parameters

Challenge: The method receives boolean indices (p1_index, p2_index, etc.) that reference rows in the original DataFrame.

Solutions:

Option A: Add Filter Columns to Ibis Table (Recommended)

# Add boolean columns to the Ibis table
df = df.mutate(
    _p1_filter=ibis.literal(p1_index),
    _p2_filter=ibis.literal(p2_index),
    _focus_filter=ibis.literal(focus_group_index),
    _comparison_filter=ibis.literal(comparison_group_index)
)

# Use in filters
p1_focus = df.filter(df._p1_filter & df._focus_filter)

Option B: Require Ibis Expression Inputs (Breaking Change)

# Future API (breaking change)
def __init__(
    self,
    df: ibis.Table,
    p1_expr: ibis.BooleanColumn,  # e.g., df.date.between('2024-01-01', '2024-01-31')
    p2_expr: ibis.BooleanColumn,
    ...
):

Recommended: Use Option A for backward compatibility, document Option B as future enhancement.

Phase 4: Migrate _calc_gains_loss_table

Straightforward conversion:

@staticmethod
def _calc_gains_loss_table(
    gain_loss_table: ibis.Table,  # Changed from pd.DataFrame
    group_col: str | None = None,
) -> ibis.Table:  # Changed return type
    if group_col is None:
        # Aggregate all rows
        return gain_loss_table.aggregate(
            focus_p1=gain_loss_table.focus_p1.sum(),
            comparison_p1=gain_loss_table.comparison_p1.sum(),
            # ... all columns
        )

    # Group by group_col and aggregate
    return gain_loss_table.group_by(group_col).aggregate(
        focus_p1=gain_loss_table.focus_p1.sum(),
        # ... all columns
    )

Technical Considerations

1. Boolean Index Handling

Challenge: Current API uses boolean pandas Series/lists to filter data
Solution: Convert boolean indices to filter columns in Ibis table
Alternative: Future breaking change to use Ibis boolean expressions directly

2. Row-wise Logic Translation

Challenge: process_customer_group contains complex conditional logic applied row-by-row
Solution: Convert to nested Ibis case().when().else_() expressions
Risk: Complex logic may be harder to maintain in Ibis expressions
Mitigation: Add comprehensive comments and unit tests for each case

3. Categorical Columns

Challenge: Current code converts customer_id to categorical (line 167)
Solution: Not needed in Ibis - grouping is efficient without categoricals
Benefit: Simpler code, better performance

4. Backward Compatibility

Challenge: Existing users pass pandas DataFrames and boolean Series
Solution: Keep API signature the same, auto-convert pandas to Ibis
Benefit: No breaking changes

5. Testing Strategy

  • Unit tests: Verify Ibis query logic matches pandas results
  • Integration tests: Test with DuckDB backend for performance validation
  • Regression tests: Ensure existing functionality unchanged
  • Performance tests: Benchmark pandas vs Ibis on large datasets

Example: Before and After

Current Pandas Implementation

import pandas as pd
from pyretailscience.analysis.gain_loss import GainLoss

# Current usage
gl = GainLoss(
    df=transactions_df,  # pandas DataFrame
    p1_index=transactions_df['date'].between('2024-01-01', '2024-01-31'),
    p2_index=transactions_df['date'].between('2024-02-01', '2024-02-29'),
    focus_group_index=transactions_df['brand'] == 'Brand A',
    focus_group_name='Brand A',
    comparison_group_index=transactions_df['brand'] == 'Brand B',
    comparison_group_name='Brand B',
)

# Only pandas DataFrame available
result = gl.gain_loss_table_df  # Fully computed in memory

Proposed Ibis Implementation

import pandas as pd
import ibis
from pyretailscience.analysis.gain_loss import GainLoss

# Works with pandas (backward compatible)
gl = GainLoss(
    df=transactions_df,  # Auto-converted to ibis.memtable
    p1_index=transactions_df['date'].between('2024-01-01', '2024-01-31'),
    p2_index=transactions_df['date'].between('2024-02-29', '2024-02-29'),
    focus_group_index=transactions_df['brand'] == 'Brand A',
    focus_group_name='Brand A',
    comparison_group_index=transactions_df['brand'] == 'Brand B',
    comparison_group_name='Brand B',
)

# NEW: Inspect the Ibis query
print(gl.table.compile())  # See the SQL

# NEW: Lazy evaluation
result = gl.df  # Computed on first access, cached

# Works with Ibis tables (new capability)
ibis_conn = ibis.duckdb.connect('retail.db')
transactions_table = ibis_conn.table('transactions')

gl_ibis = GainLoss(
    df=transactions_table,  # Direct Ibis table
    p1_index=transactions_table['date'].between('2024-01-01', '2024-01-31').to_pandas(),
    # ... etc
)

# Query executed in DuckDB, not pandas
result = gl_ibis.df

Acceptance Criteria

  • GainLoss.__init__ accepts both pd.DataFrame and ibis.Table
  • Input pandas DataFrames are auto-converted to ibis.memtable
  • _calc_gain_loss method uses Ibis operations instead of pandas
  • process_customer_group logic converted to Ibis case expressions
  • Boolean index parameters work with Ibis tables
  • _calc_gains_loss_table returns ibis.Table instead of pd.DataFrame
  • Add .table property returning the Ibis table expression
  • Add .df property with lazy evaluation (execute on first access)
  • Maintain backward compatibility - existing code works unchanged
  • All existing unit tests pass without modification
  • Add new tests verifying Ibis query correctness
  • Update docstrings to document Ibis support
  • Change example in module docstring showing Ibis usage
  • Results from Ibis implementation match pandas implementation exactly

Testing Requirements

Unit Tests

  1. Input handling:

    • Test with pandas DataFrame (auto-conversion)
    • Test with ibis.Table (direct use)
    • Test validation errors for invalid input types
  2. Boolean index filtering:

    • Verify p1/p2 filtering works correctly
    • Verify focus/comparison filtering works correctly
    • Test combined filters (e.g., p1 & focus)
  3. Aggregation correctness:

    • Compare Ibis results to pandas results on same data
    • Test with different agg_func values ('sum', 'mean', etc.)
    • Test with and without group_col
  4. Business logic:

    • Verify all branches of process_customer_group logic
    • Test edge cases (all zeros, only new customers, only lost customers)
    • Test switching scenarios
  5. Properties:

    • Test .table returns ibis.Table
    • Test .df lazy evaluation (only executes once)
    • Test .df caching behavior

Regression Tests

  1. Existing functionality: All current test cases must pass
  2. API compatibility: Existing code using GainLoss works without changes
  3. Output format: Verify output DataFrame structure matches current implementation

Related Code Patterns

Reference Implementation: SegTransactionStats

The SegTransactionStats class (lines 59-447 in pyretailscience/segmentation/segstats.py) demonstrates the target architecture:

  1. Constructor (lines 73-148):

    • Accepts pd.DataFrame | ibis.Table
    • Converts pandas to ibis.memtable
    • Validates input types
  2. Static calculation method (lines 200-329):

    • Works entirely in Ibis
    • Returns ibis.Table
    • Uses group_by and aggregate
  3. Properties (lines 331-351):

    • self.table: Stores the Ibis table
    • self.df: Lazy evaluation with caching

Key Takeaway: Follow this exact pattern for consistency across the codebase.

Dependencies

None - this is a refactoring of existing functionality using existing dependencies (Ibis is already in the project).

Breaking Changes

None - This is a backward-compatible enhancement. Existing code using pandas DataFrames will continue to work exactly as before.

Implementation Notes

Code Structure

pyretailscience/analysis/gain_loss.py
├── GainLoss class
│   ├── __init__: Accept pd.DataFrame | ibis.Table
│   ├── _df: Private cached DataFrame
│   ├── table property: Return Ibis table
│   ├── df property: Lazy evaluation
│   ├── _calc_gain_loss: Migrate to Ibis (MAIN WORK)
│   ├── _calc_gains_loss_table: Migrate to Ibis
│   ├── process_customer_group: Convert to Ibis expressions or keep as helper
│   └── plot: Uses .df property (no changes needed)

Estimated Effort

  • Small: Input handling and validation (4 hours)
  • Medium: _calc_gains_loss_table migration (8 hours)
  • Large: _calc_gain_loss migration including boolean index handling (16 hours)
  • Large: process_customer_group logic conversion to Ibis expressions (16 hours)
  • Medium: Testing and validation (12 hours)
  • Small: Documentation updates (4 hours)

Labels

  • type:enhancement - Improving existing functionality
  • priority:normal - Important for performance and consistency
  • size:large - Significant refactoring effort
  • status:ready - Fully specified and ready to implement

Metadata

Metadata

Assignees

Labels

Type

No type

Fields

No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions