Purchase Path Analysis #290
base: main
Walkthrough
A new "Purchase Path Analysis" feature was introduced, including a new analysis module, comprehensive documentation, and corresponding unit tests. The documentation and navigation were updated to reference the new module and its API. The module enables analysis of customer purchase journeys through product categories, with flexible configuration and robust testing for various scenarios.
Sequence Diagram(s)
sequenceDiagram
participant User
participant purchase_path_analysis
participant HelperFunctions
participant Output
User->>purchase_path_analysis: Provide transactions DataFrame and parameters
purchase_path_analysis->>purchase_path_analysis: Validate input and filter transactions
purchase_path_analysis->>HelperFunctions: Group transactions into baskets/categories
HelperFunctions-->>purchase_path_analysis: Return grouped/pivoted data
purchase_path_analysis->>purchase_path_analysis: Aggregate and filter purchase paths
purchase_path_analysis->>Output: Return DataFrame with paths, customer counts, probabilities
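The flow in the diagram can be sketched in plain pandas. This is a simplified illustration, not the PR's implementation (which, per the review below, is built on ibis and takes many more parameters). The function name `sketch_purchase_paths`, the ` -> ` path separator, the `path` output column, and the sample data are all assumptions made for this example.

```python
import pandas as pd


def sketch_purchase_paths(
    df: pd.DataFrame,
    category_column: str = "product_category",
    min_customers: int = 1,
) -> pd.DataFrame:
    """Simplified pandas sketch of the flow in the diagram (hypothetical helper)."""
    # Step 1: validate input.
    required = ["customer_id", "transaction_id", "transaction_date", category_column]
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # Step 2: one row per basket, holding the sorted, de-duplicated categories.
    baskets = (
        df.groupby(["customer_id", "transaction_id"], as_index=False)
        .agg(
            transaction_date=("transaction_date", "min"),
            categories=(category_column, lambda s: ", ".join(sorted(set(s.astype(str))))),
        )
        .sort_values(["customer_id", "transaction_date"])
    )

    # Step 3: one path per customer, baskets joined in chronological order.
    paths = baskets.groupby("customer_id")["categories"].agg(lambda s: " -> ".join(s))

    # Step 4: count customers per path, filter, and derive each path's share.
    counts = paths.value_counts().rename_axis("path").reset_index(name="customer_count")
    counts = counts[counts["customer_count"] >= min_customers].copy()
    counts["transition_probability"] = (
        counts["customer_count"] / counts["customer_count"].sum()
    ).round(3)
    return counts.sort_values("customer_count", ascending=False).reset_index(drop=True)


# Made-up sample data: customers A and B share a path, customer C has a single basket.
transactions = pd.DataFrame(
    {
        "customer_id": ["A", "A", "B", "B", "C"],
        "transaction_id": [1, 2, 3, 4, 5],
        "transaction_date": pd.to_datetime(
            ["2024-01-01", "2024-01-05", "2024-01-02", "2024-01-09", "2024-01-03"]
        ),
        "product_category": ["Books", "Electronics", "Books", "Electronics", "Clothing"],
    }
)
result = sketch_purchase_paths(transactions)
```

The sketch leaves out the basket-size, basket-value, depth, and aggregation options that the real signature exposes.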
Actionable comments posted: 9
🧹 Nitpick comments (4)
pyretailscience/analysis/purchase_path.py (1)
41-56: Simplify empty check and improve readability.
The conditional expression on line 43 is unnecessarily complex. Additionally, the column renaming logic assumes sequential basket numbering starting from 1.
Consider this cleaner approach:
 def _build_paths_df(category_groups_df: pd.DataFrame) -> pd.DataFrame:
     """Constructs a pivoted DataFrame representing customer purchase paths."""
-    actual_baskets = sorted(category_groups_df["first_basket_number"].unique()) if not category_groups_df.empty else []
+    if category_groups_df.empty:
+        return pd.DataFrame()
+
+    actual_baskets = sorted(category_groups_df["first_basket_number"].unique())
     paths_df = category_groups_df.pivot_table(
         index="customer_id",
         columns="first_basket_number",
         values="categories",
         aggfunc="first",
     ).reset_index()
docs/analysis_modules.md (1)
1369-1376: Consider adding error handling to the example.
The example assumes all operations succeed. For documentation completeness, consider showing how to handle potential errors.
Add a note about error handling:
 result = purchase_path_analysis(
     sample_data,
     category_column='product_category',
     min_customers=1,
     min_transactions=3,
     multi_category_handling='concatenate'
 )
+
+# Note: The function may return an empty DataFrame if no paths meet the criteria
+if result.empty:
+    print("No purchase paths found matching the criteria")
+else:
+    print(result)
tests/analysis/test_purchase_path.py (2)
230-234: Make probability sum assertion more robust.
The hardcoded tolerance of 0.001 might be too strict for floating-point arithmetic.
Use numpy's allclose for more robust comparison:
     # Check transition probabilities
     if len(result) > 0:
         assert all(0 <= prob <= 1 for prob in result["transition_probability"])
         prob_sum = result["transition_probability"].sum()
-        value = 0.001
-        assert abs(prob_sum - 1.0) < value
+        # Use numpy's allclose with appropriate tolerance
+        assert np.allclose(prob_sum, 1.0, rtol=1e-7, atol=1e-9)
299-326: Document performance test parameters.
The comments mention values were "reduced for faster testing" but don't explain the original values or rationale.
Add more context:
 def test_performance_with_larger_dataset(self):
     """Test performance characteristics with a larger synthetic dataset."""
     rng = np.random.default_rng(42)
-    n_customers = 50  # Reduced from 100 for faster testing
-    n_transactions_per_customer = 5  # Reduced from 10
+    # Using smaller dataset for CI/CD performance
+    # Production testing should use n_customers=1000, n_transactions_per_customer=50
+    n_customers = 50
+    n_transactions_per_customer = 5
     categories = ["Electronics", "Books", "Clothing", "Home", "Sports"]
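A caveat on the tolerance suggestion for lines 230-234 above: the module code quoted later in this review rounds `transition_probability` to three decimals with `.round(3)`, so the column may not sum to exactly 1, and a tolerance of `atol=1e-9` is far stricter than the existing 0.001. A minimal sketch with made-up counts, assuming each rounded value can be off by at most 0.0005:

```python
import numpy as np

# Hypothetical counts: three paths with one customer each.
customer_count = np.array([1, 1, 1])
probs = np.round(customer_count / customer_count.sum(), 3)  # each value is 0.333
prob_sum = probs.sum()

# The rounded values no longer sum to exactly 1, so a near-exact comparison fails.
strict_ok = np.allclose(prob_sum, 1.0, rtol=1e-7, atol=1e-9)

# Each rounded value can be off by up to 0.0005, so scale the tolerance with the row count.
atol = 0.0005 * len(probs) + 1e-12
scaled_ok = np.isclose(prob_sum, 1.0, rtol=0.0, atol=atol)
```

Scaling the tolerance with the number of rows keeps the assertion meaningful however many paths the result contains.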
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
docs/analysis_modules.md (1 hunks)
docs/api/analysis/purchase_path.md (1 hunks)
mkdocs.yml (1 hunks)
pyretailscience/analysis/purchase_path.py (1 hunks)
tests/analysis/test_purchase_path.py (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
pyretailscience/analysis/purchase_path.py
[refactor] 58-58: Too many arguments (12/5) (R0913)
[refactor] 58-58: Too many positional arguments (12/5) (R0917)
[refactor] 58-58: Too many local variables (28/15) (R0914)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Pre-Commit
🔇 Additional comments (3)
pyretailscience/analysis/purchase_path.py (1)
157-160: Add index reset for consistency.
Missing reset_index(drop=True) on line 159 might cause index issues for downstream operations.
 total_customers = pattern_counts.customer_count.sum()
 pattern_counts["transition_probability"] = (pattern_counts.customer_count / total_customers).round(3)
- return pattern_counts.sort_values("customer_count", ascending=False).reset_index(drop=True)
+ return pattern_counts.sort_values("customer_count", ascending=False).reset_index(drop=True)
Wait, the code already has reset_index(drop=True). This comment is unnecessary.
mkdocs.yml (1)
31-31: LGTM!
The new Purchase Path module is correctly added to the navigation structure in the appropriate location.
docs/analysis_modules.md (1)
1304-1384: Well-structured documentation with clear examples.
The Purchase Path Analysis documentation is comprehensive and provides good context for users. The example code is clear and demonstrates the key functionality effectively.
    if first_df.empty:
        return pd.DataFrame(columns=["customer_count", "transition_probability"])
Fix inconsistent empty DataFrame returns.
The function returns DataFrames with different column structures when empty (lines 139 vs 161).
Standardize empty DataFrame structure:
-    if first_df.empty:
-        return pd.DataFrame(columns=["customer_count", "transition_probability"])
+    # Define expected columns at function start
+    base_columns = ["customer_count", "transition_probability"]
+
+    if first_df.empty:
+        return pd.DataFrame(columns=base_columns)
     # ... later in the function ...
-    return pd.DataFrame(columns=[*basket_cols, "customer_count", "transition_probability"])
+    if not pattern_counts.empty:
+        # existing logic
+    else:
+        # Return consistent empty structure
+        return pd.DataFrame(columns=base_columns)
Also applies to: 161-161
🤖 Prompt for AI Agents
In pyretailscience/analysis/purchase_path.py around lines 138 to 140 and line
161, the function returns empty DataFrames with inconsistent column structures.
To fix this, standardize the empty DataFrame returns by ensuring both return
statements produce DataFrames with the same columns. Identify the full set of
columns used in the function's normal return and use that consistent column list
for all empty DataFrame returns.
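One way to follow the prompt above is to build every empty result from a single helper, so all early returns share one column list. This is a sketch under stated assumptions: the helper name `empty_paths_frame` and the `basket_<n>` column naming are hypothetical, since the review only shows a `basket_cols` variable and not how its names are generated.

```python
import pandas as pd

# Result columns quoted in the review comment.
BASE_COLUMNS = ["customer_count", "transition_probability"]


def empty_paths_frame(max_depth: int) -> pd.DataFrame:
    """Return an empty result with the full expected column set (hypothetical helper)."""
    # Assumed naming scheme: basket_1 .. basket_<max_depth>.
    basket_cols = [f"basket_{i}" for i in range(1, max_depth + 1)]
    return pd.DataFrame(columns=[*basket_cols, *BASE_COLUMNS])


empty_result = empty_paths_frame(max_depth=3)
```

Every early return can then call the helper, so callers see the same schema whether or not any paths were found.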
def purchase_path_analysis(
    transactions_df: pd.DataFrame,
    category_column: str = "product_category",
    min_transactions: int = 3,
    min_basket_size: int = 2,
    min_basket_value: float = 10.0,
    max_depth: int = 10,
    min_customers: int = 5,
    exclude_negative_revenue: bool = True,
    multi_category_handling: str = "concatenate",
    sort_by: str = "alphabetical",
    aggregation_column: str | None = None,
    aggregation_function: str = "sum",
) -> pd.DataFrame:
🛠️ Refactor suggestion
Consider using a configuration object to reduce parameter count.
The function has 12 parameters, which violates the principle of keeping function signatures simple. Static analysis correctly flagged this.
Consider creating a configuration dataclass to group related parameters:
from dataclasses import dataclass

@dataclass
class PurchasePathConfig:
    category_column: str = "product_category"
    min_transactions: int = 3
    min_basket_size: int = 2
    min_basket_value: float = 10.0
    max_depth: int = 10
    min_customers: int = 5
    exclude_negative_revenue: bool = True
    multi_category_handling: str = "concatenate"
    sort_by: str = "alphabetical"
    aggregation_column: str | None = None
    aggregation_function: str = "sum"

def purchase_path_analysis(
    transactions_df: pd.DataFrame,
    config: PurchasePathConfig | None = None,
    **kwargs
) -> pd.DataFrame:
    """Analyzes customer purchase paths through product categories over time."""
    if config is None:
        config = PurchasePathConfig(**kwargs)
🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 58-58: Too many arguments (12/5) (R0913)
[refactor] 58-58: Too many positional arguments (12/5) (R0917)
[refactor] 58-58: Too many local variables (28/15) (R0914)
🤖 Prompt for AI Agents
In pyretailscience/analysis/purchase_path.py around lines 58 to 71, the
purchase_path_analysis function has too many parameters, making the signature
complex and hard to maintain. Refactor by creating a dataclass named
PurchasePathConfig that groups all these parameters with their default values.
Then modify the function to accept a single config parameter of this dataclass
type (with an option to pass None) and optionally accept overrides via **kwargs,
initializing the config inside the function if None is provided. This will
simplify the function signature and improve code clarity.
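The suggested snippet above only reads `**kwargs` when `config` is None, so keyword overrides passed alongside an explicit config would be silently dropped. A minimal sketch of one way to honor both, using `dataclasses.replace`. The reduced field list and the `resolve_config` helper are assumptions for illustration, not code from the PR.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PurchasePathConfig:
    # Reduced field list for brevity; the full signature has eleven options.
    category_column: str = "product_category"
    min_transactions: int = 3
    max_depth: int = 10
    min_customers: int = 5
    aggregation_column: Optional[str] = None


def resolve_config(
    config: Optional[PurchasePathConfig] = None, **overrides
) -> PurchasePathConfig:
    """Merge keyword overrides into a config (hypothetical helper)."""
    base = config if config is not None else PurchasePathConfig()
    # replace() raises TypeError for unknown field names, so typos surface early.
    return replace(base, **overrides)


defaults_with_override = resolve_config(min_customers=1)
explicit_with_override = resolve_config(PurchasePathConfig(max_depth=3), min_customers=2)
```

Freezing the dataclass keeps a shared config from being mutated between calls, while `replace` returns a new instance for each set of overrides.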
# Purchase Path

::: pyretailscience.analysis.purchase_path
🛠️ Refactor suggestion
Enhance API documentation with examples and parameter descriptions.
The current documentation is too minimal for a module with 12 parameters and complex functionality.
Consider adding comprehensive documentation:
 # Purchase Path

 ::: pyretailscience.analysis.purchase_path
+    options:
+      show_source: true
+      show_root_heading: true
+      members:
+        - purchase_path_analysis
Additionally, ensure the docstring in the main module includes:
- Detailed parameter descriptions with types and defaults
- Return value description
- Example usage
- Notes about performance considerations
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- # Purchase Path
- ::: pyretailscience.analysis.purchase_path
+ # Purchase Path
+ ::: pyretailscience.analysis.purchase_path
+     options:
+       show_source: true
+       show_root_heading: true
+       members:
+         - purchase_path_analysis
🤖 Prompt for AI Agents
In docs/api/analysis/purchase_path.md at the beginning of the file, the
documentation is too minimal for a complex module with many parameters. Expand
the documentation by adding detailed descriptions for each of the 12 parameters
including their types and default values, provide a clear explanation of the
return value, include example usage demonstrating typical scenarios, and add
notes on performance considerations. This will make the API documentation more
comprehensive and user-friendly.
        ],
    },
)
💡 Verification agent
🧩 Analysis chain
Add missing import for revenue column validation.
The test fixture includes a 'revenue' column but the module might also need 'product_id' based on the implementation.
Verify if product_id is actually required:
🏁 Script executed:
#!/bin/bash
# Check the actual column requirements in the implementation
ast-grep --pattern 'required_cols = [$$$]'
Length of output: 2855
🏁 Script executed:
#!/bin/bash
# Display the implementation of purchase_path and its corresponding test
sed -n '1,200p' pyretailscience/analysis/purchase_path.py
echo "----"
sed -n '1,200p' tests/analysis/test_purchase_path.py
Length of output: 14019
Ensure product_id and revenue are validated as required columns
The function later uses ibis._.product_id.nunique() and filters on transactions_table.revenue, but the initial required_cols check only covers customer, transaction and category fields. Please update the list to include both cols.product_id and cols.revenue.
• File: pyretailscience/analysis/purchase_path.py (around line 73)
- required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, category_column]
+ required_cols = [
+     cols.customer_id,
+     cols.transaction_id,
+     cols.transaction_date,
+     category_column,
+     cols.product_id,
+     cols.revenue,
+ ]
This change will raise a clear ValueError when either product_id or revenue is missing, preventing downstream failures.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tests/analysis/test_purchase_path.py around lines 86 to 88, the test fixture
includes a 'revenue' column but the implementation also requires 'product_id'
and 'revenue' to be validated as required columns. Update the import or setup in
the test to include both 'product_id' and 'revenue' columns so that the
required_cols validation in pyretailscience/analysis/purchase_path.py (around
line 73) passes and raises a clear error if these columns are missing.
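The up-front check described in this comment can be sketched as a small standalone function. The helper name `check_required_columns`, the error message wording, and the sample frame are assumptions for this example. The module's actual validation code is not shown in the review.

```python
import pandas as pd


def check_required_columns(df: pd.DataFrame, required_cols: list) -> None:
    """Raise a clear ValueError naming every missing column (hypothetical helper)."""
    missing = sorted(set(required_cols) - set(df.columns))
    if missing:
        raise ValueError(f"The following columns are required but missing: {missing}")


# Made-up frame that lacks product_id and revenue.
sample = pd.DataFrame({"customer_id": [1], "transaction_id": [10]})

try:
    check_required_columns(sample, ["customer_id", "transaction_id", "product_id", "revenue"])
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```

Failing here gives the caller one message listing all missing columns, rather than a later attribute error from inside the query.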
0cc9d79 to 6ebd338
Actionable comments posted: 1
♻️ Duplicate comments (2)
pyretailscience/analysis/purchase_path.py (2)
58-71: Consider using a configuration object to reduce parameter complexity.
The function signature remains complex with 12 parameters, which affects maintainability and usability.
Consider implementing the previously suggested configuration dataclass approach to group related parameters and simplify the function signature.
🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 58-58: Too many arguments (12/5) (R0913)
[refactor] 58-58: Too many positional arguments (12/5) (R0917)
[refactor] 58-58: Too many local variables (30/15) (R0914)
148-149: Fix inconsistent empty DataFrame returns.
The function still returns DataFrames with different column structures when empty. Lines 148-149 and 161 return only base columns, while line 171 includes basket columns. This inconsistency was previously identified but remains unresolved.
Standardize the empty DataFrame structure to always include the full expected column set including basket columns up to max_depth.
Also applies to: 161-161, 171-171
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
docs/analysis_modules.md (1 hunks)
docs/api/analysis/purchase_path.md (1 hunks)
mkdocs.yml (1 hunks)
pyretailscience/analysis/purchase_path.py (1 hunks)
tests/analysis/test_purchase_path.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- mkdocs.yml
🚧 Files skipped from review as they are similar to previous changes (3)
- docs/api/analysis/purchase_path.md
- docs/analysis_modules.md
- tests/analysis/test_purchase_path.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
pyretailscience/analysis/purchase_path.py (1)
pyretailscience/options.py (1)
ColumnHelper (408-560)
🪛 Pylint (3.3.7)
pyretailscience/analysis/purchase_path.py
[refactor] 58-58: Too many arguments (12/5) (R0913)
[refactor] 58-58: Too many positional arguments (12/5) (R0917)
[refactor] 58-58: Too many local variables (30/15) (R0914)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Pre-Commit
🔇 Additional comments (2)
pyretailscience/analysis/purchase_path.py (2)
13-38: LGTM! String conversion issue resolved.
The helper function correctly handles category grouping and the string conversion on line 32 properly addresses the previous concern about non-string values in the category column.
125-137: LGTM! Aggregation function mapping improved.
The aggregation function mapping has been properly refactored with a clean dictionary lookup and appropriate error handling, addressing the previous maintainability concern.
) -> pd.DataFrame:
    """Analyzes customer purchase paths through product categories over time."""
    cols = ColumnHelper()
    required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, "product_id", category_column]
💡 Verification agent
🧩 Analysis chain
Use ColumnHelper for consistent column naming.
The product_id column is hardcoded as a string instead of using the ColumnHelper pattern. If the ColumnHelper class supports a product_id attribute, use it for consistency.
- required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, "product_id", category_column]
+ required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, cols.product_id, category_column]
Only apply this change if cols.product_id is available in the ColumnHelper class.
🏁 Script executed:
#!/bin/bash
# Check if ColumnHelper has a product_id attribute
rg -A 5 -B 5 "product_id.*=" pyretailscience/options.py
Length of output: 57
🏁 Script executed:
#!/bin/bash
# Locate ColumnHelper class definition and its attributes across the repo
rg -n "class ColumnHelper" -A50 .
Length of output: 5094
🏁 Script executed:
#!/bin/bash
# Search for product-related attributes in ColumnHelper
rg -n "product" -A5 -B5 pyretailscience/options.py
Length of output: 4010
Add missing product_id to ColumnHelper and use it consistently
The ColumnHelper class doesn’t currently define a product_id attribute, so switching to cols.product_id will break downstream code. To address this:
• In pyretailscience/options.py (inside class ColumnHelper.__init__), add:
    # Products
    self.product_id = get_option("column.product_id")
• Then update pyretailscience/analysis/purchase_path.py at line 74:
- required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, "product_id", category_column]
+ required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, cols.product_id, category_column]
🤖 Prompt for AI Agents
In pyretailscience/options.py inside the ColumnHelper class __init__ method, add
a new attribute self.product_id initialized with get_option("column.product_id")
to define the product_id column consistently. Then in
pyretailscience/analysis/purchase_path.py at line 74, replace the hardcoded
string "product_id" in the required_cols list with cols.product_id to use the
ColumnHelper pattern for consistent column naming.
feat: Create purchase path analysis module
Summary by CodeRabbit
New Features
Tests