Migrate GainLoss Class to Use Ibis for Performance
Problem Statement
The current GainLoss class in pyretailscience/analysis/gain_loss.py performs all data processing using pandas DataFrame operations. This creates performance bottlenecks when analyzing large retail datasets, as all data must be loaded into memory and processed row-by-row using pandas operations.
Current Architecture Issues
- Memory Intensive: All data is loaded into pandas DataFrames, limiting scalability
- Inefficient Processing: Uses pandas groupby and apply operations instead of optimized SQL queries
- Inconsistent with Codebase: Other analysis modules like
SegTransactionStats use Ibis for efficient query generation
- No Query Inspection: Users cannot see or optimize the SQL being executed
- Limited Backend Support: Locked into pandas, cannot leverage database engines (DuckDB, PostgreSQL, etc.)
Proposed Solution
Refactor the GainLoss class to use Ibis for data processing, following the same architecture pattern as SegTransactionStats. This will enable:
- Efficient SQL query generation for aggregations
- Lazy evaluation with
.table property exposing Ibis expressions
- Database engine compatibility (DuckDB, PostgreSQL, BigQuery, etc.)
- Query inspection and optimization capabilities
- Consistent API across all analysis modules
Current Implementation Analysis
Key Methods to Migrate
File: pyretailscience/analysis/gain_loss.py
-
__init__ method (lines 31-94):
- Currently accepts
pd.DataFrame
- Needs to accept
pd.DataFrame | ibis.Table
- Convert pandas to ibis.memtable if needed
-
_calc_gain_loss method (lines 140-226):
- Core computation using pandas groupby
- Lines 166-167: Convert customer_id to categorical
- Lines 169-193: Multiple groupby aggregations with boolean indexing
- Lines 195: Outer merge of period DataFrames
- Lines 200-202: Diff calculations
- Lines 204-224: Apply
process_customer_group row-by-row
- Migration Priority: HIGH - This is the main bottleneck
-
_calc_gains_loss_table method (lines 229-245):
- Simple aggregation over gain_loss_df
- Lines 242-243: Sum aggregation (total row)
- Lines 245: Group-level aggregation
- Migration Priority: MEDIUM - Straightforward Ibis conversion
-
process_customer_group static method (lines 97-137):
- Business logic for classifying customer movements
- Pure computation logic (no data operations)
- Migration Priority: LOW - May need to convert to Ibis expression or keep as-is
Data Flow
Input DataFrame
↓
Filter by indices (p1_index, p2_index, focus_group_index, comparison_group_index)
↓
Group by customer_id (and optionally group_col)
↓
Aggregate value_col for each period/group combination
↓
Merge period 1 and period 2 results
↓
Calculate differences
↓
Apply business logic (process_customer_group)
↓
Aggregate to final gain/loss table
API Design
Updated Constructor Signature
class GainLoss:
def __init__(
self,
df: pd.DataFrame | ibis.Table, # UPDATED: Accept Ibis tables
p1_index: list[bool] | pd.Series,
p2_index: list[bool] | pd.Series,
focus_group_index: list[bool] | pd.Series,
focus_group_name: str,
comparison_group_index: list[bool] | pd.Series,
comparison_group_name: str,
group_col: str | None = None,
value_col: str = get_option("column.unit_spend"),
agg_func: str = "sum",
) -> None:
New Properties
Following the SegTransactionStats pattern:
class GainLoss:
_df: pd.DataFrame | None = None # Lazy-loaded pandas result
@property
def table(self) -> ibis.Table:
"""Returns the Ibis table expression for the gain/loss analysis.
This allows users to:
- Inspect the generated SQL query
- Further compose queries
- Execute against different backends
Returns:
ibis.Table: The Ibis table expression with gain/loss columns
"""
return self._gain_loss_table
@property
def df(self) -> pd.DataFrame:
"""Returns the gain/loss analysis as a pandas DataFrame.
Lazily evaluates the Ibis expression on first access.
Returns:
pd.DataFrame: The gain/loss analysis results
"""
if self._df is None:
self._df = self.table.execute()
return self._df
Implementation Approach
Phase 1: Input Handling and Validation
def __init__(self, df: pd.DataFrame | ibis.Table, ...):
# Convert pandas to Ibis if needed
if isinstance(df, pd.DataFrame):
df = ibis.memtable(df)
elif not isinstance(df, ibis.Table):
raise TypeError("df must be either a pandas DataFrame or an ibis Table")
# Validation stays the same
# ...existing validation code...
Phase 2: Migrate _calc_gain_loss to Ibis
Challenge: The current implementation uses boolean indexing and row-wise apply operations.
Ibis Approach:
-
Boolean Indexing → Ibis Filters
# Current: df[p1_index & focus_group_index]
# Ibis: df.filter(p1_condition & focus_condition)
Since indices are boolean arrays from the original DataFrame, we need to:
- Add index columns to the Ibis table
- Use Ibis filter expressions instead of boolean indexing
-
Groupby Aggregations → Ibis GroupBy
# Current: df.groupby([customer_id, group_col])[value_col].agg(agg_func)
# Ibis: df.group_by([customer_id, group_col]).aggregate(value=df[value_col].sum())
-
Row-wise Apply → Ibis Case Expressions
The process_customer_group logic needs to be converted to Ibis case/when expressions:
# Current: apply(lambda x: process_customer_group(...), axis=1)
# Ibis: Use nested case/when expressions to implement the logic
increased_focus = ibis.case()
.when(focus_diff > 0,
ibis.case()
.when(comparison_diff > 0, focus_diff)
.else_(ibis.greatest(0, comparison_diff + focus_diff))
)
.when(comparison_diff < 0, focus_diff)
.else_(ibis.least(0, comparison_diff + focus_diff))
.end()
Phase 3: Handle Boolean Index Parameters
Challenge: The method receives boolean indices (p1_index, p2_index, etc.) that reference rows in the original DataFrame.
Solutions:
Option A: Add Filter Columns to Ibis Table (Recommended)
# Add boolean columns to the Ibis table
df = df.mutate(
_p1_filter=ibis.literal(p1_index),
_p2_filter=ibis.literal(p2_index),
_focus_filter=ibis.literal(focus_group_index),
_comparison_filter=ibis.literal(comparison_group_index)
)
# Use in filters
p1_focus = df.filter(df._p1_filter & df._focus_filter)
Option B: Require Ibis Expression Inputs (Breaking Change)
# Future API (breaking change)
def __init__(
self,
df: ibis.Table,
p1_expr: ibis.BooleanColumn, # e.g., df.date.between('2024-01-01', '2024-01-31')
p2_expr: ibis.BooleanColumn,
...
):
Recommended: Use Option A for backward compatibility, document Option B as future enhancement.
Phase 4: Migrate _calc_gains_loss_table
Straightforward conversion:
@staticmethod
def _calc_gains_loss_table(
gain_loss_table: ibis.Table, # Changed from pd.DataFrame
group_col: str | None = None,
) -> ibis.Table: # Changed return type
if group_col is None:
# Aggregate all rows
return gain_loss_table.aggregate(
focus_p1=gain_loss_table.focus_p1.sum(),
comparison_p1=gain_loss_table.comparison_p1.sum(),
# ... all columns
)
# Group by group_col and aggregate
return gain_loss_table.group_by(group_col).aggregate(
focus_p1=gain_loss_table.focus_p1.sum(),
# ... all columns
)
Technical Considerations
1. Boolean Index Handling
Challenge: Current API uses boolean pandas Series/lists to filter data
Solution: Convert boolean indices to filter columns in Ibis table
Alternative: Future breaking change to use Ibis boolean expressions directly
2. Row-wise Logic Translation
Challenge: process_customer_group contains complex conditional logic applied row-by-row
Solution: Convert to nested Ibis case().when().else_() expressions
Risk: Complex logic may be harder to maintain in Ibis expressions
Mitigation: Add comprehensive comments and unit tests for each case
3. Categorical Columns
Challenge: Current code converts customer_id to categorical (line 167)
Solution: Not needed in Ibis - grouping is efficient without categoricals
Benefit: Simpler code, better performance
4. Backward Compatibility
Challenge: Existing users pass pandas DataFrames and boolean Series
Solution: Keep API signature the same, auto-convert pandas to Ibis
Benefit: No breaking changes
5. Testing Strategy
- Unit tests: Verify Ibis query logic matches pandas results
- Integration tests: Test with DuckDB backend for performance validation
- Regression tests: Ensure existing functionality unchanged
- Performance tests: Benchmark pandas vs Ibis on large datasets
Example: Before and After
Current Pandas Implementation
import pandas as pd
from pyretailscience.analysis.gain_loss import GainLoss
# Current usage
gl = GainLoss(
df=transactions_df, # pandas DataFrame
p1_index=transactions_df['date'].between('2024-01-01', '2024-01-31'),
p2_index=transactions_df['date'].between('2024-02-01', '2024-02-29'),
focus_group_index=transactions_df['brand'] == 'Brand A',
focus_group_name='Brand A',
comparison_group_index=transactions_df['brand'] == 'Brand B',
comparison_group_name='Brand B',
)
# Only pandas DataFrame available
result = gl.gain_loss_table_df # Fully computed in memory
Proposed Ibis Implementation
import pandas as pd
import ibis
from pyretailscience.analysis.gain_loss import GainLoss
# Works with pandas (backward compatible)
gl = GainLoss(
df=transactions_df, # Auto-converted to ibis.memtable
p1_index=transactions_df['date'].between('2024-01-01', '2024-01-31'),
p2_index=transactions_df['date'].between('2024-02-29', '2024-02-29'),
focus_group_index=transactions_df['brand'] == 'Brand A',
focus_group_name='Brand A',
comparison_group_index=transactions_df['brand'] == 'Brand B',
comparison_group_name='Brand B',
)
# NEW: Inspect the Ibis query
print(gl.table.compile()) # See the SQL
# NEW: Lazy evaluation
result = gl.df # Computed on first access, cached
# Works with Ibis tables (new capability)
ibis_conn = ibis.duckdb.connect('retail.db')
transactions_table = ibis_conn.table('transactions')
gl_ibis = GainLoss(
df=transactions_table, # Direct Ibis table
p1_index=transactions_table['date'].between('2024-01-01', '2024-01-31').to_pandas(),
# ... etc
)
# Query executed in DuckDB, not pandas
result = gl_ibis.df
Acceptance Criteria
Testing Requirements
Unit Tests
-
Input handling:
- Test with pandas DataFrame (auto-conversion)
- Test with ibis.Table (direct use)
- Test validation errors for invalid input types
-
Boolean index filtering:
- Verify p1/p2 filtering works correctly
- Verify focus/comparison filtering works correctly
- Test combined filters (e.g., p1 & focus)
-
Aggregation correctness:
- Compare Ibis results to pandas results on same data
- Test with different agg_func values ('sum', 'mean', etc.)
- Test with and without group_col
-
Business logic:
- Verify all branches of
process_customer_group logic
- Test edge cases (all zeros, only new customers, only lost customers)
- Test switching scenarios
-
Properties:
- Test
.table returns ibis.Table
- Test
.df lazy evaluation (only executes once)
- Test
.df caching behavior
Regression Tests
- Existing functionality: All current test cases must pass
- API compatibility: Existing code using
GainLoss works without changes
- Output format: Verify output DataFrame structure matches current implementation
Related Code Patterns
Reference Implementation: SegTransactionStats
The SegTransactionStats class (lines 59-447 in pyretailscience/segmentation/segstats.py) demonstrates the target architecture:
-
Constructor (lines 73-148):
- Accepts
pd.DataFrame | ibis.Table
- Converts pandas to ibis.memtable
- Validates input types
-
Static calculation method (lines 200-329):
- Works entirely in Ibis
- Returns
ibis.Table
- Uses group_by and aggregate
-
Properties (lines 331-351):
self.table: Stores the Ibis table
self.df: Lazy evaluation with caching
Key Takeaway: Follow this exact pattern for consistency across the codebase.
Dependencies
None - this is a refactoring of existing functionality using existing dependencies (Ibis is already in the project).
Breaking Changes
None - This is a backward-compatible enhancement. Existing code using pandas DataFrames will continue to work exactly as before.
Implementation Notes
Code Structure
pyretailscience/analysis/gain_loss.py
├── GainLoss class
│ ├── __init__: Accept pd.DataFrame | ibis.Table
│ ├── _df: Private cached DataFrame
│ ├── table property: Return Ibis table
│ ├── df property: Lazy evaluation
│ ├── _calc_gain_loss: Migrate to Ibis (MAIN WORK)
│ ├── _calc_gains_loss_table: Migrate to Ibis
│ ├── process_customer_group: Convert to Ibis expressions or keep as helper
│ └── plot: Uses .df property (no changes needed)
Estimated Effort
- Small: Input handling and validation (4 hours)
- Medium:
_calc_gains_loss_table migration (8 hours)
- Large:
_calc_gain_loss migration including boolean index handling (16 hours)
- Large:
process_customer_group logic conversion to Ibis expressions (16 hours)
- Medium: Testing and validation (12 hours)
- Small: Documentation updates (4 hours)
Labels
type:enhancement - Improving existing functionality
priority:normal - Important for performance and consistency
size:large - Significant refactoring effort
status:ready - Fully specified and ready to implement
Migrate GainLoss Class to Use Ibis for Performance
Problem Statement
The current
GainLossclass inpyretailscience/analysis/gain_loss.pyperforms all data processing using pandas DataFrame operations. This creates performance bottlenecks when analyzing large retail datasets, as all data must be loaded into memory and processed row-by-row using pandas operations.Current Architecture Issues
SegTransactionStatsuse Ibis for efficient query generationProposed Solution
Refactor the
GainLossclass to use Ibis for data processing, following the same architecture pattern asSegTransactionStats. This will enable:.tableproperty exposing Ibis expressionsCurrent Implementation Analysis
Key Methods to Migrate
File:
pyretailscience/analysis/gain_loss.py__init__method (lines 31-94):pd.DataFramepd.DataFrame | ibis.Table_calc_gain_lossmethod (lines 140-226):process_customer_grouprow-by-row_calc_gains_loss_tablemethod (lines 229-245):process_customer_groupstatic method (lines 97-137):Data Flow
API Design
Updated Constructor Signature
New Properties
Following the
SegTransactionStatspattern:Implementation Approach
Phase 1: Input Handling and Validation
Phase 2: Migrate
_calc_gain_lossto IbisChallenge: The current implementation uses boolean indexing and row-wise apply operations.
Ibis Approach:
Boolean Indexing → Ibis Filters
Since indices are boolean arrays from the original DataFrame, we need to:
Groupby Aggregations → Ibis GroupBy
Row-wise Apply → Ibis Case Expressions
The
process_customer_grouplogic needs to be converted to Ibis case/when expressions:Phase 3: Handle Boolean Index Parameters
Challenge: The method receives boolean indices (
p1_index,p2_index, etc.) that reference rows in the original DataFrame.Solutions:
Option A: Add Filter Columns to Ibis Table (Recommended)
Option B: Require Ibis Expression Inputs (Breaking Change)
Recommended: Use Option A for backward compatibility, document Option B as future enhancement.
Phase 4: Migrate
_calc_gains_loss_tableStraightforward conversion:
Technical Considerations
1. Boolean Index Handling
Challenge: Current API uses boolean pandas Series/lists to filter data
Solution: Convert boolean indices to filter columns in Ibis table
Alternative: Future breaking change to use Ibis boolean expressions directly
2. Row-wise Logic Translation
Challenge:
process_customer_groupcontains complex conditional logic applied row-by-rowSolution: Convert to nested Ibis
case().when().else_()expressionsRisk: Complex logic may be harder to maintain in Ibis expressions
Mitigation: Add comprehensive comments and unit tests for each case
3. Categorical Columns
Challenge: Current code converts customer_id to categorical (line 167)
Solution: Not needed in Ibis - grouping is efficient without categoricals
Benefit: Simpler code, better performance
4. Backward Compatibility
Challenge: Existing users pass pandas DataFrames and boolean Series
Solution: Keep API signature the same, auto-convert pandas to Ibis
Benefit: No breaking changes
5. Testing Strategy
Example: Before and After
Current Pandas Implementation
Proposed Ibis Implementation
Acceptance Criteria
GainLoss.__init__accepts bothpd.DataFrameandibis.Tableibis.memtable_calc_gain_lossmethod uses Ibis operations instead of pandasprocess_customer_grouplogic converted to Ibis case expressions_calc_gains_loss_tablereturnsibis.Tableinstead ofpd.DataFrame.tableproperty returning the Ibis table expression.dfproperty with lazy evaluation (execute on first access)Testing Requirements
Unit Tests
Input handling:
Boolean index filtering:
Aggregation correctness:
Business logic:
process_customer_grouplogicProperties:
.tablereturns ibis.Table.dflazy evaluation (only executes once).dfcaching behaviorRegression Tests
GainLossworks without changesRelated Code Patterns
Reference Implementation: SegTransactionStats
The
SegTransactionStatsclass (lines 59-447 inpyretailscience/segmentation/segstats.py) demonstrates the target architecture:Constructor (lines 73-148):
pd.DataFrame | ibis.TableStatic calculation method (lines 200-329):
ibis.TableProperties (lines 331-351):
self.table: Stores the Ibis tableself.df: Lazy evaluation with cachingKey Takeaway: Follow this exact pattern for consistency across the codebase.
Dependencies
None - this is a refactoring of existing functionality using existing dependencies (Ibis is already in the project).
Breaking Changes
None - This is a backward-compatible enhancement. Existing code using pandas DataFrames will continue to work exactly as before.
Implementation Notes
Code Structure
Estimated Effort
_calc_gains_loss_tablemigration (8 hours)_calc_gain_lossmigration including boolean index handling (16 hours)process_customer_grouplogic conversion to Ibis expressions (16 hours)Labels
type:enhancement- Improving existing functionalitypriority:normal- Important for performance and consistencysize:large- Significant refactoring effortstatus:ready- Fully specified and ready to implement