Convert the Gain Loss analysis to Ibis

# Migrate GainLoss Class to Use Ibis for Performance

## Problem Statement

The current `GainLoss` class in `pyretailscience/analysis/gain_loss.py` performs all data processing using pandas DataFrame operations. This creates performance bottlenecks when analyzing large retail datasets, as all data must be loaded into memory and processed row-by-row using pandas operations.

### Current Architecture Issues

1. **Memory Intensive**: All data is loaded into pandas DataFrames, limiting scalability
2. **Inefficient Processing**: Uses pandas groupby and apply operations instead of optimized SQL queries
3. **Inconsistent with Codebase**: Other analysis modules like `SegTransactionStats` use Ibis for efficient query generation
4. **No Query Inspection**: Users cannot see or optimize the SQL being executed
5. **Limited Backend Support**: Locked into pandas, cannot leverage database engines (DuckDB, PostgreSQL, etc.)

## Proposed Solution

Refactor the `GainLoss` class to use Ibis for data processing, following the same architecture pattern as `SegTransactionStats`. This will enable:

1. Efficient SQL query generation for aggregations
2. Lazy evaluation with `.table` property exposing Ibis expressions
3. Database engine compatibility (DuckDB, PostgreSQL, BigQuery, etc.)
4. Query inspection and optimization capabilities
5. Consistent API across all analysis modules

## Current Implementation Analysis

### Key Methods to Migrate

**File**: `pyretailscience/analysis/gain_loss.py`

1. **`__init__` method** (lines 31-94):
   - Currently accepts `pd.DataFrame`
   - Needs to accept `pd.DataFrame | ibis.Table`
   - Convert pandas to ibis.memtable if needed

2. **`_calc_gain_loss` method** (lines 140-226):
   - Core computation using pandas groupby
   - Lines 166-167: Convert customer_id to categorical
   - Lines 169-193: Multiple groupby aggregations with boolean indexing
   - Lines 195: Outer merge of period DataFrames
   - Lines 200-202: Diff calculations
   - Lines 204-224: Apply `process_customer_group` row-by-row
   - **Migration Priority**: HIGH - This is the main bottleneck

3. **`_calc_gains_loss_table` method** (lines 229-245):
   - Simple aggregation over gain_loss_df
   - Lines 242-243: Sum aggregation (total row)
   - Lines 245: Group-level aggregation
   - **Migration Priority**: MEDIUM - Straightforward Ibis conversion

4. **`process_customer_group` static method** (lines 97-137):
   - Business logic for classifying customer movements
   - Pure computation logic (no data operations)
   - **Migration Priority**: LOW - May need to convert to Ibis expression or keep as-is

### Data Flow

```
Input DataFrame
    ↓
Filter by indices (p1_index, p2_index, focus_group_index, comparison_group_index)
    ↓
Group by customer_id (and optionally group_col)
    ↓
Aggregate value_col for each period/group combination
    ↓
Merge period 1 and period 2 results
    ↓
Calculate differences
    ↓
Apply business logic (process_customer_group)
    ↓
Aggregate to final gain/loss table
```

## API Design

### Updated Constructor Signature

```python
class GainLoss:
    def __init__(
        self,
        df: pd.DataFrame | ibis.Table,  # UPDATED: Accept Ibis tables
        p1_index: list[bool] | pd.Series,
        p2_index: list[bool] | pd.Series,
        focus_group_index: list[bool] | pd.Series,
        focus_group_name: str,
        comparison_group_index: list[bool] | pd.Series,
        comparison_group_name: str,
        group_col: str | None = None,
        value_col: str = get_option("column.unit_spend"),
        agg_func: str = "sum",
    ) -> None:
```

### New Properties

Following the `SegTransactionStats` pattern:

```python
class GainLoss:
    _df: pd.DataFrame | None = None  # Lazy-loaded pandas result

    @property
    def table(self) -> ibis.Table:
        """Returns the Ibis table expression for the gain/loss analysis.

        This allows users to:
        - Inspect the generated SQL query
        - Further compose queries
        - Execute against different backends

        Returns:
            ibis.Table: The Ibis table expression with gain/loss columns
        """
        return self._gain_loss_table

    @property
    def df(self) -> pd.DataFrame:
        """Returns the gain/loss analysis as a pandas DataFrame.

        Lazily evaluates the Ibis expression on first access.

        Returns:
            pd.DataFrame: The gain/loss analysis results
        """
        if self._df is None:
            self._df = self.table.execute()
        return self._df
```

## Implementation Approach

### Phase 1: Input Handling and Validation

```python
def __init__(self, df: pd.DataFrame | ibis.Table, ...):
    # Convert pandas to Ibis if needed
    if isinstance(df, pd.DataFrame):
        df = ibis.memtable(df)
    elif not isinstance(df, ibis.Table):
        raise TypeError("df must be either a pandas DataFrame or an ibis Table")

    # Validation stays the same
    # ...existing validation code...
```

### Phase 2: Migrate `_calc_gain_loss` to Ibis

**Challenge**: The current implementation uses boolean indexing and row-wise apply operations.

**Ibis Approach**:

1. **Boolean Indexing** → **Ibis Filters**
   ```python
   # Current: df[p1_index & focus_group_index]
   # Ibis: df.filter(p1_condition & focus_condition)
   ```

   Since indices are boolean arrays from the original DataFrame, we need to:
   - Add index columns to the Ibis table
   - Use Ibis filter expressions instead of boolean indexing

2. **Groupby Aggregations** → **Ibis GroupBy**
   ```python
   # Current: df.groupby([customer_id, group_col])[value_col].agg(agg_func)
   # Ibis: df.group_by([customer_id, group_col]).aggregate(value=df[value_col].sum())
   ```

3. **Row-wise Apply** → **Ibis Case Expressions**
   The `process_customer_group` logic needs to be converted to Ibis case/when expressions:

   ```python
   # Current: apply(lambda x: process_customer_group(...), axis=1)
   # Ibis: Use nested case/when expressions to implement the logic

   increased_focus = ibis.case()
       .when(focus_diff > 0,
             ibis.case()
                 .when(comparison_diff > 0, focus_diff)
                 .else_(ibis.greatest(0, comparison_diff + focus_diff))
       )
       .when(comparison_diff < 0, focus_diff)
       .else_(ibis.least(0, comparison_diff + focus_diff))
       .end()
   ```

### Phase 3: Handle Boolean Index Parameters

**Challenge**: The method receives boolean indices (`p1_index`, `p2_index`, etc.) that reference rows in the original DataFrame.

**Solutions**:

**Option A: Add Filter Columns to Ibis Table** (Recommended)
```python
# Add boolean columns to the Ibis table
df = df.mutate(
    _p1_filter=ibis.literal(p1_index),
    _p2_filter=ibis.literal(p2_index),
    _focus_filter=ibis.literal(focus_group_index),
    _comparison_filter=ibis.literal(comparison_group_index)
)

# Use in filters
p1_focus = df.filter(df._p1_filter & df._focus_filter)
```

**Option B: Require Ibis Expression Inputs** (Breaking Change)
```python
# Future API (breaking change)
def __init__(
    self,
    df: ibis.Table,
    p1_expr: ibis.BooleanColumn,  # e.g., df.date.between('2024-01-01', '2024-01-31')
    p2_expr: ibis.BooleanColumn,
    ...
):
```

**Recommended**: Use Option A for backward compatibility, document Option B as future enhancement.

### Phase 4: Migrate `_calc_gains_loss_table`

Straightforward conversion:

```python
@staticmethod
def _calc_gains_loss_table(
    gain_loss_table: ibis.Table,  # Changed from pd.DataFrame
    group_col: str | None = None,
) -> ibis.Table:  # Changed return type
    if group_col is None:
        # Aggregate all rows
        return gain_loss_table.aggregate(
            focus_p1=gain_loss_table.focus_p1.sum(),
            comparison_p1=gain_loss_table.comparison_p1.sum(),
            # ... all columns
        )

    # Group by group_col and aggregate
    return gain_loss_table.group_by(group_col).aggregate(
        focus_p1=gain_loss_table.focus_p1.sum(),
        # ... all columns
    )
```

## Technical Considerations

### 1. Boolean Index Handling

**Challenge**: Current API uses boolean pandas Series/lists to filter data
**Solution**: Convert boolean indices to filter columns in Ibis table
**Alternative**: Future breaking change to use Ibis boolean expressions directly

### 2. Row-wise Logic Translation

**Challenge**: `process_customer_group` contains complex conditional logic applied row-by-row
**Solution**: Convert to nested Ibis `case().when().else_()` expressions
**Risk**: Complex logic may be harder to maintain in Ibis expressions
**Mitigation**: Add comprehensive comments and unit tests for each case

### 3. Categorical Columns

**Challenge**: Current code converts customer_id to categorical (line 167)
**Solution**: Not needed in Ibis - grouping is efficient without categoricals
**Benefit**: Simpler code, better performance

### 4. Backward Compatibility

**Challenge**: Existing users pass pandas DataFrames and boolean Series
**Solution**: Keep API signature the same, auto-convert pandas to Ibis
**Benefit**: No breaking changes

### 5. Testing Strategy

- **Unit tests**: Verify Ibis query logic matches pandas results
- **Integration tests**: Test with DuckDB backend for performance validation
- **Regression tests**: Ensure existing functionality unchanged
- **Performance tests**: Benchmark pandas vs Ibis on large datasets

## Example: Before and After

### Current Pandas Implementation

```python
import pandas as pd
from pyretailscience.analysis.gain_loss import GainLoss

# Current usage
gl = GainLoss(
    df=transactions_df,  # pandas DataFrame
    p1_index=transactions_df['date'].between('2024-01-01', '2024-01-31'),
    p2_index=transactions_df['date'].between('2024-02-01', '2024-02-29'),
    focus_group_index=transactions_df['brand'] == 'Brand A',
    focus_group_name='Brand A',
    comparison_group_index=transactions_df['brand'] == 'Brand B',
    comparison_group_name='Brand B',
)

# Only pandas DataFrame available
result = gl.gain_loss_table_df  # Fully computed in memory
```

### Proposed Ibis Implementation

```python
import pandas as pd
import ibis
from pyretailscience.analysis.gain_loss import GainLoss

# Works with pandas (backward compatible)
gl = GainLoss(
    df=transactions_df,  # Auto-converted to ibis.memtable
    p1_index=transactions_df['date'].between('2024-01-01', '2024-01-31'),
    p2_index=transactions_df['date'].between('2024-02-29', '2024-02-29'),
    focus_group_index=transactions_df['brand'] == 'Brand A',
    focus_group_name='Brand A',
    comparison_group_index=transactions_df['brand'] == 'Brand B',
    comparison_group_name='Brand B',
)

# NEW: Inspect the Ibis query
print(gl.table.compile())  # See the SQL

# NEW: Lazy evaluation
result = gl.df  # Computed on first access, cached

# Works with Ibis tables (new capability)
ibis_conn = ibis.duckdb.connect('retail.db')
transactions_table = ibis_conn.table('transactions')

gl_ibis = GainLoss(
    df=transactions_table,  # Direct Ibis table
    p1_index=transactions_table['date'].between('2024-01-01', '2024-01-31').to_pandas(),
    # ... etc
)

# Query executed in DuckDB, not pandas
result = gl_ibis.df
```

## Acceptance Criteria

- [ ] `GainLoss.__init__` accepts both `pd.DataFrame` and `ibis.Table`
- [ ] Input pandas DataFrames are auto-converted to `ibis.memtable`
- [ ] `_calc_gain_loss` method uses Ibis operations instead of pandas
- [ ] `process_customer_group` logic converted to Ibis case expressions
- [ ] Boolean index parameters work with Ibis tables
- [ ] `_calc_gains_loss_table` returns `ibis.Table` instead of `pd.DataFrame`
- [ ] Add `.table` property returning the Ibis table expression
- [ ] Add `.df` property with lazy evaluation (execute on first access)
- [ ] Maintain backward compatibility - existing code works unchanged
- [ ] All existing unit tests pass without modification
- [ ] Add new tests verifying Ibis query correctness
- [ ] Update docstrings to document Ibis support
- [ ] Change example in module docstring showing Ibis usage
- [ ] Results from Ibis implementation match pandas implementation exactly

## Testing Requirements

### Unit Tests

1. **Input handling**:
   - Test with pandas DataFrame (auto-conversion)
   - Test with ibis.Table (direct use)
   - Test validation errors for invalid input types

2. **Boolean index filtering**:
   - Verify p1/p2 filtering works correctly
   - Verify focus/comparison filtering works correctly
   - Test combined filters (e.g., p1 & focus)

3. **Aggregation correctness**:
   - Compare Ibis results to pandas results on same data
   - Test with different agg_func values ('sum', 'mean', etc.)
   - Test with and without group_col

4. **Business logic**:
   - Verify all branches of `process_customer_group` logic
   - Test edge cases (all zeros, only new customers, only lost customers)
   - Test switching scenarios

5. **Properties**:
   - Test `.table` returns ibis.Table
   - Test `.df` lazy evaluation (only executes once)
   - Test `.df` caching behavior

### Regression Tests

1. **Existing functionality**: All current test cases must pass
2. **API compatibility**: Existing code using `GainLoss` works without changes
3. **Output format**: Verify output DataFrame structure matches current implementation

## Related Code Patterns

### Reference Implementation: SegTransactionStats

The `SegTransactionStats` class (lines 59-447 in `pyretailscience/segmentation/segstats.py`) demonstrates the target architecture:

1. **Constructor** (lines 73-148):
   - Accepts `pd.DataFrame | ibis.Table`
   - Converts pandas to ibis.memtable
   - Validates input types

2. **Static calculation method** (lines 200-329):
   - Works entirely in Ibis
   - Returns `ibis.Table`
   - Uses group_by and aggregate

3. **Properties** (lines 331-351):
   - `self.table`: Stores the Ibis table
   - `self.df`: Lazy evaluation with caching

**Key Takeaway**: Follow this exact pattern for consistency across the codebase.

## Dependencies

None - this is a refactoring of existing functionality using existing dependencies (Ibis is already in the project).

## Breaking Changes

**None** - This is a backward-compatible enhancement. Existing code using pandas DataFrames will continue to work exactly as before.

## Implementation Notes

### Code Structure

```
pyretailscience/analysis/gain_loss.py
├── GainLoss class
│   ├── __init__: Accept pd.DataFrame | ibis.Table
│   ├── _df: Private cached DataFrame
│   ├── table property: Return Ibis table
│   ├── df property: Lazy evaluation
│   ├── _calc_gain_loss: Migrate to Ibis (MAIN WORK)
│   ├── _calc_gains_loss_table: Migrate to Ibis
│   ├── process_customer_group: Convert to Ibis expressions or keep as helper
│   └── plot: Uses .df property (no changes needed)
```

### Estimated Effort

- **Small**: Input handling and validation (4 hours)
- **Medium**: `_calc_gains_loss_table` migration (8 hours)
- **Large**: `_calc_gain_loss` migration including boolean index handling (16 hours)
- **Large**: `process_customer_group` logic conversion to Ibis expressions (16 hours)
- **Medium**: Testing and validation (12 hours)
- **Small**: Documentation updates (4 hours)

## Labels

- `type:enhancement` - Improving existing functionality
- `priority:normal` - Important for performance and consistency
- `size:large` - Significant refactoring effort
- `status:ready` - Fully specified and ready to implement

Uh oh!

Convert the Gain Loss analysis to Ibis #152

Description

Migrate GainLoss Class to Use Ibis for Performance

Problem Statement

Current Architecture Issues

Proposed Solution

Current Implementation Analysis

Key Methods to Migrate

Data Flow

API Design

Updated Constructor Signature

New Properties

Implementation Approach

Phase 1: Input Handling and Validation

Phase 2: Migrate _calc_gain_loss to Ibis

Phase 3: Handle Boolean Index Parameters

Phase 4: Migrate _calc_gains_loss_table

Technical Considerations

1. Boolean Index Handling

2. Row-wise Logic Translation

3. Categorical Columns

4. Backward Compatibility

5. Testing Strategy

Example: Before and After

Current Pandas Implementation

Proposed Ibis Implementation

Acceptance Criteria

Testing Requirements

Unit Tests

Regression Tests

Related Code Patterns

Reference Implementation: SegTransactionStats

Dependencies

Breaking Changes

Implementation Notes

Code Structure

Estimated Effort

Labels

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Phase 2: Migrate `_calc_gain_loss` to Ibis

Phase 4: Migrate `_calc_gains_loss_table`