
Purchase Path Analysis #290


Open — mayurkmmt wants to merge 1 commit into main from feature/purchase-path-analysis
Conversation

@mayurkmmt (Collaborator) commented Jun 10, 2025

feat: Create purchase path analysis module

Summary by CodeRabbit

  • New Features

    • Introduced a new "Purchase Path Analysis" module for analyzing customer purchase journeys through product categories over time.
    • Added documentation and examples for the new analysis module, including usage instructions and sample outputs.
    • Added a dedicated documentation page for "Purchase Path" in the reference section.
  • Tests

    • Added comprehensive tests to ensure the accuracy and reliability of the purchase path analysis functionality.


coderabbitai bot commented Jun 10, 2025

Walkthrough

A new "Purchase Path Analysis" feature was introduced, including a new analysis module, comprehensive documentation, and corresponding unit tests. The documentation and navigation were updated to reference the new module and its API. The module enables analysis of customer purchase journeys through product categories, with flexible configuration and robust testing for various scenarios.

Changes

| File(s) | Change Summary |
| --- | --- |
| pyretailscience/analysis/purchase_path.py | Added new purchase_path_analysis function and internal helpers for analyzing purchase paths. |
| tests/analysis/test_purchase_path.py | Introduced extensive unit tests covering functionality, parameters, edge cases, and performance. |
| docs/analysis_modules.md | Added "Purchase Path Analysis" section describing the new module, usage, and example snippet. |
| docs/api/analysis/purchase_path.md | Created new API documentation stub referencing the purchase path analysis module. |
| mkdocs.yml | Updated navigation to include the new "Purchase Path" documentation page under the Analysis section. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant purchase_path_analysis
    participant HelperFunctions
    participant Output

    User->>purchase_path_analysis: Provide transactions DataFrame and parameters
    purchase_path_analysis->>purchase_path_analysis: Validate input and filter transactions
    purchase_path_analysis->>HelperFunctions: Group transactions into baskets/categories
    HelperFunctions-->>purchase_path_analysis: Return grouped/pivoted data
    purchase_path_analysis->>purchase_path_analysis: Aggregate and filter purchase paths
    purchase_path_analysis->>Output: Return DataFrame with paths, customer counts, probabilities
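The flow in the sequence diagram can be sketched with plain pandas. This is a hypothetical simplification, not the actual pyretailscience implementation: the column names (`customer_id`, `transaction_date`, `product_category`) mirror the PR's examples, and treating each transaction as a one-category basket is an assumption made for brevity.

```python
import pandas as pd

def sketch_purchase_paths(transactions: pd.DataFrame, max_depth: int = 10) -> pd.DataFrame:
    """Toy version of the flow: order each customer's transactions by date,
    join their categories into a path string, then count customers per path."""
    ordered = transactions.sort_values(["customer_id", "transaction_date"])
    # One path string per customer, truncated to max_depth baskets.
    paths = (
        ordered.groupby("customer_id")["product_category"]
        .apply(lambda s: " -> ".join(s.head(max_depth)))
        .rename("path")
        .reset_index()
    )
    counts = paths.groupby("path").size().rename("customer_count").reset_index()
    total = counts["customer_count"].sum()
    counts["transition_probability"] = (counts["customer_count"] / total).round(3)
    return counts.sort_values("customer_count", ascending=False).reset_index(drop=True)

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "transaction_date": pd.to_datetime(
        ["2025-01-01", "2025-02-01", "2025-01-05", "2025-03-01", "2025-01-10"]),
    "product_category": ["Books", "Electronics", "Books", "Electronics", "Sports"],
})
print(sketch_purchase_paths(df))
```

The real module adds the filtering, multi-category handling, and aggregation options listed in the signature below; this sketch only shows the group-pivot-aggregate skeleton.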

Poem

In the warren of data, a new path appears,
Hopping through baskets, across shopping years.
With tests and docs, our journey’s complete,
Mapping each burrow where customers meet.
🛒🐇 Now rabbits can see—
The purchase path’s secrets, as clear as can be!



codecov bot commented Jun 10, 2025

Codecov Report

Attention: Patch coverage is 90.32258% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pyretailscience/analysis/purchase_path.py | 90.32% | 3 Missing and 3 partials ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| pyretailscience/analysis/purchase_path.py | 90.32% <90.32%> (ø) |

@coderabbitai bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (4)
pyretailscience/analysis/purchase_path.py (1)

41-56: Simplify empty check and improve readability.

The conditional expression on line 43 is unnecessarily complex. Additionally, the column renaming logic assumes sequential basket numbering starting from 1.

Consider this cleaner approach:

 def _build_paths_df(category_groups_df: pd.DataFrame) -> pd.DataFrame:
     """Constructs a pivoted DataFrame representing customer purchase paths."""
-    actual_baskets = sorted(category_groups_df["first_basket_number"].unique()) if not category_groups_df.empty else []
+    if category_groups_df.empty:
+        return pd.DataFrame()
+    
+    actual_baskets = sorted(category_groups_df["first_basket_number"].unique())
     paths_df = category_groups_df.pivot_table(
         index="customer_id",
         columns="first_basket_number",
         values="categories",
         aggfunc="first",
     ).reset_index()
docs/analysis_modules.md (1)

1369-1376: Consider adding error handling to the example.

The example assumes all operations succeed. For documentation completeness, consider showing how to handle potential errors.

Add a note about error handling:

 result = purchase_path_analysis(
     sample_data,
     category_column='product_category',
     min_customers=1,
     min_transactions=3,
     multi_category_handling='concatenate'
 )
+
+# Note: The function may return an empty DataFrame if no paths meet the criteria
+if result.empty:
+    print("No purchase paths found matching the criteria")
+else:
+    print(result)
tests/analysis/test_purchase_path.py (2)

230-234: Make the probability sum assertion scale with the number of paths.

The hardcoded tolerance of 0.001 does not account for rounding: each transition probability is rounded to 3 decimal places, so the sum can drift from 1.0 by up to 0.0005 per row.

Use numpy's allclose with a tolerance that scales with the row count:

         # Check transition probabilities
         if len(result) > 0:
             assert all(0 <= prob <= 1 for prob in result["transition_probability"])
             prob_sum = result["transition_probability"].sum()
-            value = 0.001
-            assert abs(prob_sum - 1.0) < value
+            # Probabilities are rounded to 3 decimals; allow up to 0.0005 drift per path
+            assert np.allclose(prob_sum, 1.0, atol=0.0005 * len(result))
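One caveat worth noting (an inference from the .round(3) visible in the implementation): probabilities rounded to 3 decimals can miss 1.0 by roughly 0.0005 per path, so a very tight tolerance can reject a perfectly legitimate sum. A standalone illustration, not code from this PR:

```python
import numpy as np

# Three equal paths, probabilities rounded to 3 decimals as the module does.
probs = np.round(np.full(3, 1 / 3), 3)   # array([0.333, 0.333, 0.333])
total = probs.sum()                       # ~0.999, not 1.0

# A near-zero tolerance rejects the rounded sum...
assert not np.isclose(total, 1.0, rtol=1e-7, atol=1e-9)
# ...while a tolerance scaled to the rounding step accepts it.
assert np.allclose(total, 1.0, atol=0.0005 * len(probs))
```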

299-326: Document performance test parameters.

The comments mention values were "reduced for faster testing" but don't explain the original values or rationale.

Add more context:

     def test_performance_with_larger_dataset(self):
         """Test performance characteristics with a larger synthetic dataset."""
         rng = np.random.default_rng(42)
 
-        n_customers = 50  # Reduced from 100 for faster testing
-        n_transactions_per_customer = 5  # Reduced from 10
+        # Using smaller dataset for CI/CD performance
+        # Production testing should use n_customers=1000, n_transactions_per_customer=50
+        n_customers = 50
+        n_transactions_per_customer = 5
         categories = ["Electronics", "Books", "Clothing", "Home", "Sports"]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d8c65ca and 0cc9d79.

📒 Files selected for processing (5)
  • docs/analysis_modules.md (1 hunks)
  • docs/api/analysis/purchase_path.md (1 hunks)
  • mkdocs.yml (1 hunks)
  • pyretailscience/analysis/purchase_path.py (1 hunks)
  • tests/analysis/test_purchase_path.py (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
pyretailscience/analysis/purchase_path.py

[refactor] 58-58: Too many arguments (12/5)

(R0913)


[refactor] 58-58: Too many positional arguments (12/5)

(R0917)


[refactor] 58-58: Too many local variables (28/15)

(R0914)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Pre-Commit
🔇 Additional comments (3)
pyretailscience/analysis/purchase_path.py (1)

157-160: LGTM! Index reset already present.

The return statement on line 159 already applies reset_index(drop=True) after sorting, so downstream operations receive a clean sequential index — no change needed.

        total_customers = pattern_counts.customer_count.sum()
        pattern_counts["transition_probability"] = (pattern_counts.customer_count / total_customers).round(3)
        return pattern_counts.sort_values("customer_count", ascending=False).reset_index(drop=True)

mkdocs.yml (1)

31-31: LGTM!

The new Purchase Path module is correctly added to the navigation structure in the appropriate location.

docs/analysis_modules.md (1)

1304-1384: Well-structured documentation with clear examples.

The Purchase Path Analysis documentation is comprehensive and provides good context for users. The example code is clear and demonstrates the key functionality effectively.

Comment on lines +138 to +150
if first_df.empty:
return pd.DataFrame(columns=["customer_count", "transition_probability"])


⚠️ Potential issue

Fix inconsistent empty DataFrame returns.

The function returns DataFrames with different column structures when empty (lines 139 vs 161).

Standardize empty DataFrame structure:

-    if first_df.empty:
-        return pd.DataFrame(columns=["customer_count", "transition_probability"])
+    # Define expected columns at function start
+    base_columns = ["customer_count", "transition_probability"]
+    
+    if first_df.empty:
+        return pd.DataFrame(columns=base_columns)

     # ... later in the function ...
     
-    return pd.DataFrame(columns=[*basket_cols, "customer_count", "transition_probability"])
+    if not pattern_counts.empty:
+        # existing logic
+    else:
+        # Return consistent empty structure
+        return pd.DataFrame(columns=base_columns)

Also applies to: 161-161

🤖 Prompt for AI Agents
In pyretailscience/analysis/purchase_path.py around lines 138 to 140 and line
161, the function returns empty DataFrames with inconsistent column structures.
To fix this, standardize the empty DataFrame returns by ensuring both return
statements produce DataFrames with the same columns. Identify the full set of
columns used in the function's normal return and use that consistent column list
for all empty DataFrame returns.
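One way to guarantee the two return sites agree is a single factory for empty results. This is a sketch under the assumption that the basket column names are computable up front; the helper name is illustrative, not from the module:

```python
import pandas as pd

def empty_result(basket_cols: list[str]) -> pd.DataFrame:
    """Single factory for empty results so every early return has the same shape."""
    return pd.DataFrame(columns=[*basket_cols, "customer_count", "transition_probability"])

# Both early-exit paths now produce structurally identical DataFrames:
a = empty_result(["basket_1", "basket_2"])
b = empty_result(["basket_1", "basket_2"])
assert list(a.columns) == list(b.columns)
```

Centralizing the column list also means downstream code can rely on the columns existing whether or not any path met the filter criteria.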

Comment on lines +58 to +71
def purchase_path_analysis(
transactions_df: pd.DataFrame,
category_column: str = "product_category",
min_transactions: int = 3,
min_basket_size: int = 2,
min_basket_value: float = 10.0,
max_depth: int = 10,
min_customers: int = 5,
exclude_negative_revenue: bool = True,
multi_category_handling: str = "concatenate",
sort_by: str = "alphabetical",
aggregation_column: str | None = None,
aggregation_function: str = "sum",
) -> pd.DataFrame:

🛠️ Refactor suggestion

Consider using a configuration object to reduce parameter count.

The function has 12 parameters, which violates the principle of keeping function signatures simple. Static analysis correctly flagged this.

Consider creating a configuration dataclass to group related parameters:

from dataclasses import dataclass

@dataclass
class PurchasePathConfig:
    category_column: str = "product_category"
    min_transactions: int = 3
    min_basket_size: int = 2
    min_basket_value: float = 10.0
    max_depth: int = 10
    min_customers: int = 5
    exclude_negative_revenue: bool = True
    multi_category_handling: str = "concatenate"
    sort_by: str = "alphabetical"
    aggregation_column: str | None = None
    aggregation_function: str = "sum"

def purchase_path_analysis(
    transactions_df: pd.DataFrame,
    config: PurchasePathConfig | None = None,
    **kwargs
) -> pd.DataFrame:
    """Analyzes customer purchase paths through product categories over time."""
    if config is None:
        config = PurchasePathConfig(**kwargs)
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 58-58: Too many arguments (12/5)

(R0913)


[refactor] 58-58: Too many positional arguments (12/5)

(R0917)


[refactor] 58-58: Too many local variables (28/15)

(R0914)

🤖 Prompt for AI Agents
In pyretailscience/analysis/purchase_path.py around lines 58 to 71, the
purchase_path_analysis function has too many parameters, making the signature
complex and hard to maintain. Refactor by creating a dataclass named
PurchasePathConfig that groups all these parameters with their default values.
Then modify the function to accept a single config parameter of this dataclass
type (with an option to pass None) and optionally accept overrides via **kwargs,
initializing the config inside the function if None is provided. This will
simplify the function signature and improve code clarity.
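Under the suggested (hypothetical, not yet existing) PurchasePathConfig, call sites would shrink to something like the following; the field subset shown here mirrors the parameters above, and asdict is one way to bridge back to the existing keyword-based signature during a migration:

```python
from dataclasses import dataclass, asdict

@dataclass
class PurchasePathConfig:
    """Groups the analysis knobs so the function takes one object, not 12 args."""
    category_column: str = "product_category"
    min_transactions: int = 3
    min_basket_size: int = 2
    min_basket_value: float = 10.0
    max_depth: int = 10
    min_customers: int = 5

# Defaults stay discoverable and overridable in one place:
config = PurchasePathConfig(min_customers=1, max_depth=5)
assert config.min_customers == 1
assert config.category_column == "product_category"

# asdict() forwards cleanly into a **kwargs-style signature:
kwargs = asdict(config)
assert kwargs["max_depth"] == 5
```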

Comment on lines +1 to +3
# Purchase Path

::: pyretailscience.analysis.purchase_path

🛠️ Refactor suggestion

Enhance API documentation with examples and parameter descriptions.

The current documentation is too minimal for a module with 12 parameters and complex functionality.

Consider adding comprehensive documentation:

 # Purchase Path
 
 ::: pyretailscience.analysis.purchase_path
+    options:
+      show_source: true
+      show_root_heading: true
+      members:
+        - purchase_path_analysis

Additionally, ensure the docstring in the main module includes:

  • Detailed parameter descriptions with types and defaults
  • Return value description
  • Example usage
  • Notes about performance considerations
🤖 Prompt for AI Agents
In docs/api/analysis/purchase_path.md at the beginning of the file, the
documentation is too minimal for a complex module with many parameters. Expand
the documentation by adding detailed descriptions for each of the 12 parameters
including their types and default values, provide a clear explanation of the
return value, include example usage demonstrating typical scenarios, and add
notes on performance considerations. This will make the API documentation more
comprehensive and user-friendly.

Comment on lines +86 to +88
],
},
)

💡 Verification agent

🧩 Analysis chain

Verify required-column validation covers revenue and product_id.

The test fixture includes a 'revenue' column, but the module might also need 'product_id' based on the implementation.

Verify if product_id is actually required:


🏁 Script executed:

#!/bin/bash
# Check the actual column requirements in the implementation
ast-grep --pattern 'required_cols = [$$$]'

Length of output: 2855


🏁 Script executed:

#!/bin/bash
# Display the implementation of purchase_path and its corresponding test
sed -n '1,200p' pyretailscience/analysis/purchase_path.py
echo "----"
sed -n '1,200p' tests/analysis/test_purchase_path.py

Length of output: 14019


Ensure product_id and revenue are validated as required columns

The function later uses ibis._.product_id.nunique() and filters on transactions_table.revenue, but the initial required_cols check only covers customer, transaction and category fields. Please update the list to include both cols.product_id and cols.revenue.

• File: pyretailscience/analysis/purchase_path.py (around line 73)

-    required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, category_column]
+    required_cols = [
+        cols.customer_id,
+        cols.transaction_id,
+        cols.transaction_date,
+        category_column,
+        cols.product_id,
+        cols.revenue,
+    ]

This change will raise a clear ValueError when either product_id or revenue is missing, preventing downstream failures.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tests/analysis/test_purchase_path.py around lines 86 to 88, the test fixture
includes a 'revenue' column, but the implementation also requires 'product_id'
and 'revenue' to be validated as required columns. Update the fixture setup in
the test to include both 'product_id' and 'revenue' columns, so that the
required_cols validation in pyretailscience/analysis/purchase_path.py (around
line 73) raises a clear error whenever these columns are missing.
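The "clear ValueError" behaviour described above can be sketched as a standalone check. The helper name and column names here are illustrative, mirroring the required_cols list rather than the module's actual code:

```python
import pandas as pd

def validate_required_columns(df: pd.DataFrame, required: list[str]) -> None:
    """Raise a single ValueError naming every missing column up front."""
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

# A fixture lacking product_id and revenue fails fast, before any grouping runs:
df = pd.DataFrame(columns=["customer_id", "transaction_id", "transaction_date", "product_category"])
try:
    validate_required_columns(df, [*df.columns, "product_id", "revenue"])
except ValueError as e:
    print(e)  # names both product_id and revenue in one message
```

Collecting all missing names into one exception avoids the fix-one-rerun-find-another loop that per-column checks cause.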

@mayurkmmt mayurkmmt force-pushed the feature/purchase-path-analysis branch from 0cc9d79 to 6ebd338 Compare June 10, 2025 09:27
@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
pyretailscience/analysis/purchase_path.py (2)

58-71: Consider using a configuration object to reduce parameter complexity.

The function signature remains complex with 12 parameters, which affects maintainability and usability.

Consider implementing the previously suggested configuration dataclass approach to group related parameters and simplify the function signature.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 58-58: Too many arguments (12/5)

(R0913)


[refactor] 58-58: Too many positional arguments (12/5)

(R0917)


[refactor] 58-58: Too many local variables (30/15)

(R0914)


148-149: Fix inconsistent empty DataFrame returns.

The function still returns DataFrames with different column structures when empty. Lines 148-149 and 161 return only base columns, while line 171 includes basket columns. This inconsistency was previously identified but remains unresolved.

Standardize the empty DataFrame structure to always include the full expected column set including basket columns up to max_depth.

Also applies to: 161-161, 171-171

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0cc9d79 and 6ebd338.

📒 Files selected for processing (5)
  • docs/analysis_modules.md (1 hunks)
  • docs/api/analysis/purchase_path.md (1 hunks)
  • mkdocs.yml (1 hunks)
  • pyretailscience/analysis/purchase_path.py (1 hunks)
  • tests/analysis/test_purchase_path.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • mkdocs.yml
🚧 Files skipped from review as they are similar to previous changes (3)
  • docs/api/analysis/purchase_path.md
  • docs/analysis_modules.md
  • tests/analysis/test_purchase_path.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
pyretailscience/analysis/purchase_path.py (1)
pyretailscience/options.py (1)
  • ColumnHelper (408-560)
🪛 Pylint (3.3.7)
pyretailscience/analysis/purchase_path.py

[refactor] 58-58: Too many arguments (12/5)

(R0913)


[refactor] 58-58: Too many positional arguments (12/5)

(R0917)


[refactor] 58-58: Too many local variables (30/15)

(R0914)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Pre-Commit
🔇 Additional comments (2)
pyretailscience/analysis/purchase_path.py (2)

13-38: LGTM! String conversion issue resolved.

The helper function correctly handles category grouping and the string conversion on line 32 properly addresses the previous concern about non-string values in the category column.


125-137: LGTM! Aggregation function mapping improved.

The aggregation function mapping has been properly refactored with a clean dictionary lookup and appropriate error handling, addressing the previous maintainability concern.

) -> pd.DataFrame:
"""Analyzes customer purchase paths through product categories over time."""
cols = ColumnHelper()
required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, "product_id", category_column]

💡 Verification agent

🧩 Analysis chain

Use ColumnHelper for consistent column naming.

The product_id column is hardcoded as a string instead of using the ColumnHelper pattern. If the ColumnHelper class supports a product_id attribute, use it for consistency.

-    required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, "product_id", category_column]
+    required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, cols.product_id, category_column]

Only apply this change if cols.product_id is available in the ColumnHelper class.


🏁 Script executed:

#!/bin/bash
# Check if ColumnHelper has a product_id attribute
rg -A 5 -B 5 "product_id.*=" pyretailscience/options.py

Length of output: 57


🏁 Script executed:

#!/bin/bash
# Locate ColumnHelper class definition and its attributes across the repo
rg -n "class ColumnHelper" -A50 .

Length of output: 5094


🏁 Script executed:

#!/bin/bash
# Search for product-related attributes in ColumnHelper
rg -n "product" -A5 -B5 pyretailscience/options.py

Length of output: 4010


Add missing product_id to ColumnHelper and use it consistently

The ColumnHelper class doesn’t currently define a product_id attribute, so switching to cols.product_id will break downstream code. To address this:

• In pyretailscience/options.py (inside class ColumnHelper.__init__), add:

    # Products
    self.product_id = get_option("column.product_id")

• Then update pyretailscience/analysis/purchase_path.py at line 74:

-    required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, "product_id", category_column]
+    required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, cols.product_id, category_column]
🤖 Prompt for AI Agents
In pyretailscience/options.py inside the ColumnHelper class __init__ method, add
a new attribute self.product_id initialized with get_option("column.product_id")
to define the product_id column consistently. Then in
pyretailscience/analysis/purchase_path.py at line 74, replace the hardcoded
string "product_id" in the required_cols list with cols.product_id to use the
ColumnHelper pattern for consistent column naming.
