Skip to content

#163 Purchase Path Analysis#290

Open
mayurkmmt wants to merge 2 commits into
mainfrom
feature/purchase-path-analysis
Open

#163 Purchase Path Analysis#290
mayurkmmt wants to merge 2 commits into
mainfrom
feature/purchase-path-analysis

Conversation

@mayurkmmt

@mayurkmmt mayurkmmt commented Jun 10, 2025

Copy link
Copy Markdown
Collaborator

feat: Create purchase path analysis module

Summary by CodeRabbit

  • New Features

    • Introduced a new "Purchase Path Analysis" module for analyzing customer purchase journeys through product categories over time.
    • Added documentation and examples for the new analysis module, including usage instructions and sample outputs.
    • Added a dedicated documentation page for "Purchase Path" in the reference section.
  • Tests

    • Added comprehensive tests to ensure the accuracy and reliability of the purchase path analysis functionality.

@coderabbitai

coderabbitai Bot commented Jun 10, 2025

Copy link
Copy Markdown

Walkthrough

A new "Purchase Path Analysis" feature was introduced, including a new analysis module, comprehensive documentation, and corresponding unit tests. The documentation and navigation were updated to reference the new module and its API. The module enables analysis of customer purchase journeys through product categories, with flexible configuration and robust testing for various scenarios.

Changes

File(s) Change Summary
pyretailscience/analysis/purchase_path.py Added new purchase_path_analysis function and internal helpers for analyzing purchase paths.
tests/analysis/test_purchase_path.py Introduced extensive unit tests covering functionality, parameters, edge cases, and performance.
docs/analysis_modules.md Added "Purchase Path Analysis" section describing the new module, usage, and example snippet.
docs/api/analysis/purchase_path.md Created new API documentation stub referencing the purchase path analysis module.
mkdocs.yml Updated navigation to include the new "Purchase Path" documentation page under Analysis section.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant purchase_path_analysis
    participant HelperFunctions
    participant Output

    User->>purchase_path_analysis: Provide transactions DataFrame and parameters
    purchase_path_analysis->>purchase_path_analysis: Validate input and filter transactions
    purchase_path_analysis->>HelperFunctions: Group transactions into baskets/categories
    HelperFunctions-->>purchase_path_analysis: Return grouped/pivoted data
    purchase_path_analysis->>purchase_path_analysis: Aggregate and filter purchase paths
    purchase_path_analysis->>Output: Return DataFrame with paths, customer counts, probabilities
Loading

Poem

In the warren of data, a new path appears,
Hopping through baskets, across shopping years.
With tests and docs, our journey’s complete,
Mapping each burrow where customers meet.
🛒🐇 Now rabbits can see—
The purchase path’s secrets, as clear as can be!

✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@codecov

codecov Bot commented Jun 10, 2025

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.32258% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pyretailscience/analysis/purchase_path.py 90.32% 3 Missing and 3 partials ⚠️
Files with missing lines Coverage Δ
pyretailscience/analysis/purchase_path.py 90.32% <90.32%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Nitpick comments (4)
pyretailscience/analysis/purchase_path.py (1)

41-56: Simplify empty check and improve readability.

The conditional expression on line 43 is unnecessarily complex. Additionally, the column renaming logic assumes sequential basket numbering starting from 1.

Consider this cleaner approach:

 def _build_paths_df(category_groups_df: pd.DataFrame) -> pd.DataFrame:
     """Constructs a pivoted DataFrame representing customer purchase paths."""
-    actual_baskets = sorted(category_groups_df["first_basket_number"].unique()) if not category_groups_df.empty else []
+    if category_groups_df.empty:
+        return pd.DataFrame()
+    
+    actual_baskets = sorted(category_groups_df["first_basket_number"].unique())
     paths_df = category_groups_df.pivot_table(
         index="customer_id",
         columns="first_basket_number",
         values="categories",
         aggfunc="first",
     ).reset_index()
docs/analysis_modules.md (1)

1369-1376: Consider adding error handling to the example.

The example assumes all operations succeed. For documentation completeness, consider showing how to handle potential errors.

Add a note about error handling:

 result = purchase_path_analysis(
     sample_data,
     category_column='product_category',
     min_customers=1,
     min_transactions=3,
     multi_category_handling='concatenate'
 )
+
+# Note: The function may return an empty DataFrame if no paths meet the criteria
+if result.empty:
+    print("No purchase paths found matching the criteria")
+else:
+    print(result)
tests/analysis/test_purchase_path.py (2)

230-234: Make probability sum assertion more robust.

The hardcoded tolerance of 0.001 might be too strict for floating-point arithmetic.

Use numpy's allclose for more robust comparison:

         # Check transition probabilities
         if len(result) > 0:
             assert all(0 <= prob <= 1 for prob in result["transition_probability"])
             prob_sum = result["transition_probability"].sum()
-            value = 0.001
-            assert abs(prob_sum - 1.0) < value
+            # Use numpy's allclose with appropriate tolerance
+            assert np.allclose(prob_sum, 1.0, rtol=1e-7, atol=1e-9)

299-326: Document performance test parameters.

The comments mention values were "reduced for faster testing" but don't explain the original values or rationale.

Add more context:

     def test_performance_with_larger_dataset(self):
         """Test performance characteristics with a larger synthetic dataset."""
         rng = np.random.default_rng(42)
 
-        n_customers = 50  # Reduced from 100 for faster testing
-        n_transactions_per_customer = 5  # Reduced from 10
+        # Using smaller dataset for CI/CD performance
+        # Production testing should use n_customers=1000, n_transactions_per_customer=50
+        n_customers = 50
+        n_transactions_per_customer = 5
         categories = ["Electronics", "Books", "Clothing", "Home", "Sports"]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d8c65ca and 0cc9d79.

📒 Files selected for processing (5)
  • docs/analysis_modules.md (1 hunks)
  • docs/api/analysis/purchase_path.md (1 hunks)
  • mkdocs.yml (1 hunks)
  • pyretailscience/analysis/purchase_path.py (1 hunks)
  • tests/analysis/test_purchase_path.py (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
pyretailscience/analysis/purchase_path.py

[refactor] 58-58: Too many arguments (12/5)

(R0913)


[refactor] 58-58: Too many positional arguments (12/5)

(R0917)


[refactor] 58-58: Too many local variables (28/15)

(R0914)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Pre-Commit
🔇 Additional comments (3)
pyretailscience/analysis/purchase_path.py (1)

157-160: Add index reset for consistency.

Missing reset_index(drop=True) on line 159 might cause index issues for downstream operations.

         total_customers = pattern_counts.customer_count.sum()
         pattern_counts["transition_probability"] = (pattern_counts.customer_count / total_customers).round(3)
-        return pattern_counts.sort_values("customer_count", ascending=False).reset_index(drop=True)
+        return pattern_counts.sort_values("customer_count", ascending=False).reset_index(drop=True)

Wait, the code already has reset_index(drop=True). This comment is unnecessary.

mkdocs.yml (1)

31-31: LGTM!

The new Purchase Path module is correctly added to the navigation structure in the appropriate location.

docs/analysis_modules.md (1)

1304-1384: Well-structured documentation with clear examples.

The Purchase Path Analysis documentation is comprehensive and provides good context for users. The example code is clear and demonstrates the key functionality effectively.

Comment on lines +138 to +140
if first_df.empty:
return pd.DataFrame(columns=["customer_count", "transition_probability"])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix inconsistent empty DataFrame returns.

The function returns DataFrames with different column structures when empty (lines 139 vs 161).

Standardize empty DataFrame structure:

-    if first_df.empty:
-        return pd.DataFrame(columns=["customer_count", "transition_probability"])
+    # Define expected columns at function start
+    base_columns = ["customer_count", "transition_probability"]
+    
+    if first_df.empty:
+        return pd.DataFrame(columns=base_columns)

     # ... later in the function ...
     
-    return pd.DataFrame(columns=[*basket_cols, "customer_count", "transition_probability"])
+    if not pattern_counts.empty:
+        # existing logic
+    else:
+        # Return consistent empty structure
+        return pd.DataFrame(columns=base_columns)

Also applies to: 161-161

🤖 Prompt for AI Agents
In pyretailscience/analysis/purchase_path.py around lines 138 to 140 and line
161, the function returns empty DataFrames with inconsistent column structures.
To fix this, standardize the empty DataFrame returns by ensuring both return
statements produce DataFrames with the same columns. Identify the full set of
columns used in the function's normal return and use that consistent column list
for all empty DataFrame returns.

Comment thread pyretailscience/analysis/purchase_path.py
Comment thread pyretailscience/analysis/purchase_path.py
Comment thread pyretailscience/analysis/purchase_path.py
Comment thread pyretailscience/analysis/purchase_path.py
Comment thread pyretailscience/analysis/purchase_path.py
Comment thread docs/api/analysis/purchase_path.md
Comment thread tests/analysis/test_purchase_path.py
Comment on lines +86 to +88
],
},
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Add missing import for revenue column validation.

The test fixture includes a 'revenue' column but the module might also need 'product_id' based on the implementation.

Verify if product_id is actually required:


🏁 Script executed:

#!/bin/bash
# Check the actual column requirements in the implementation
ast-grep --pattern 'required_cols = [$$$]'

Length of output: 2855


🏁 Script executed:

#!/bin/bash
# Display the implementation of purchase_path and its corresponding test
sed -n '1,200p' pyretailscience/analysis/purchase_path.py
echo "----"
sed -n '1,200p' tests/analysis/test_purchase_path.py

Length of output: 14019


Ensure product_id and revenue are validated as required columns

The function later uses ibis._.product_id.nunique() and filters on transactions_table.revenue, but the initial required_cols check only covers customer, transaction and category fields. Please update the list to include both cols.product_id and cols.revenue.

• File: pyretailscience/analysis/purchase_path.py (around line 73)

-    required_cols = [cols.customer_id, cols.transaction_id, cols.transaction_date, category_column]
+    required_cols = [
+        cols.customer_id,
+        cols.transaction_id,
+        cols.transaction_date,
+        category_column,
+        cols.product_id,
+        cols.revenue,
+    ]

This change will raise a clear ValueError when either product_id or revenue is missing, preventing downstream failures.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tests/analysis/test_purchase_path.py around lines 86 to 88, the test fixture
includes a 'revenue' column but the implementation also requires 'product_id'
and 'revenue' to be validated as required columns. Update the import or setup in
the test to include both 'product_id' and 'revenue' columns so that the
required_cols validation in pyretailscience/analysis/purchase_path.py (around
line 73) passes and raises a clear error if these columns are missing.

@mayurkmmt mayurkmmt force-pushed the feature/purchase-path-analysis branch from 0cc9d79 to 6ebd338 Compare June 10, 2025 09:27

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
pyretailscience/analysis/purchase_path.py (2)

58-71: Consider using a configuration object to reduce parameter complexity.

The function signature remains complex with 12 parameters, which affects maintainability and usability.

Consider implementing the previously suggested configuration dataclass approach to group related parameters and simplify the function signature.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 58-58: Too many arguments (12/5)

(R0913)


[refactor] 58-58: Too many positional arguments (12/5)

(R0917)


[refactor] 58-58: Too many local variables (30/15)

(R0914)


148-149: Fix inconsistent empty DataFrame returns.

The function still returns DataFrames with different column structures when empty. Lines 148-149 and 161 return only base columns, while line 171 includes basket columns. This inconsistency was previously identified but remains unresolved.

Standardize the empty DataFrame structure to always include the full expected column set including basket columns up to max_depth.

Also applies to: 161-161, 171-171

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0cc9d79 and 6ebd338.

📒 Files selected for processing (5)
  • docs/analysis_modules.md (1 hunks)
  • docs/api/analysis/purchase_path.md (1 hunks)
  • mkdocs.yml (1 hunks)
  • pyretailscience/analysis/purchase_path.py (1 hunks)
  • tests/analysis/test_purchase_path.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • mkdocs.yml
🚧 Files skipped from review as they are similar to previous changes (3)
  • docs/api/analysis/purchase_path.md
  • docs/analysis_modules.md
  • tests/analysis/test_purchase_path.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
pyretailscience/analysis/purchase_path.py (1)
pyretailscience/options.py (1)
  • ColumnHelper (408-560)
🪛 Pylint (3.3.7)
pyretailscience/analysis/purchase_path.py

[refactor] 58-58: Too many arguments (12/5)

(R0913)


[refactor] 58-58: Too many positional arguments (12/5)

(R0917)


[refactor] 58-58: Too many local variables (30/15)

(R0914)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Pre-Commit
🔇 Additional comments (2)
pyretailscience/analysis/purchase_path.py (2)

13-38: LGTM! String conversion issue resolved.

The helper function correctly handles category grouping and the string conversion on line 32 properly addresses the previous concern about non-string values in the category column.


125-137: LGTM! Aggregation function mapping improved.

The aggregation function mapping has been properly refactored with a clean dictionary lookup and appropriate error handling, addressing the previous maintainability concern.

Comment thread pyretailscience/analysis/purchase_path.py
@murray-ds

Copy link
Copy Markdown
Contributor

@claude Please review this PR

@claude

claude Bot commented Sep 11, 2025

Copy link
Copy Markdown

Claude finished @mayurkmmt's task in 2m 0s —— View job


PR Review: Purchase Path Analysis

Todo List:

  • Read implementation and test files
  • Analyze code quality and adherence to CLAUDE.md guidelines
  • Review test coverage and quality
  • Check for bugs and security issues
  • Review documentation
  • Provide comprehensive feedback

Summary

This PR introduces a new Purchase Path Analysis module that tracks customer journeys through product categories over time. The implementation is well-structured with strong test coverage (90.32%), but there are several issues that need to be addressed before merging, particularly around column validation, code style adherence, and test quality.

🔴 Critical Issues

1. Missing product_id in ColumnHelper (pyretailscience/analysis/purchase_path.py:74)

Issue: The code hardcodes "product_id" as a string instead of using the ColumnHelper pattern. While product_id is defined in the options configuration, it's not exposed as an attribute on the ColumnHelper class.

Impact: Inconsistent with codebase patterns and breaks if users customize the product_id column name via options.

Fix needed:

  1. Add self.product_id = get_option("column.product_id") to ColumnHelper.init in pyretailscience/options.py
  2. Replace "product_id" with cols.product_id on line 74

Fix this →

2. Inconsistent Empty DataFrame Returns (pyretailscience/analysis/purchase_path.py:149, 161, 171)

Issue: The function returns empty DataFrames with different column structures:

  • Lines 149, 161: Only ["customer_count", "transition_probability"]
  • Line 171: Includes basket columns [*basket_cols, "customer_count", "transition_probability"]

Impact: Inconsistent API contract makes it difficult for users to handle empty results reliably.

Fix needed: Standardize to always return the same column structure, preferably with basket columns up to max_depth.

Fix this →

3. Incomplete Column Validation (pyretailscience/analysis/purchase_path.py:74-76)

Issue: The function uses revenue column unconditionally (line 92) but only validates it when exclude_negative_revenue=True (line 76).

Impact: Runtime errors when exclude_negative_revenue=False but revenue column is missing.

Fix needed: Add "revenue" to required_cols unconditionally since it's used in the basket aggregation regardless of the filter setting.

Fix this →

🟡 Code Quality Issues

4. Function Signature Complexity (pyretailscience/analysis/purchase_path.py:58-71)

Issue: Function has 12 parameters, violating the guideline of keeping function signatures simple. Pylint correctly flags this with R0913, R0917, and R0914.

Per CLAUDE.md: Functions should be maintainable and readable.

Recommendation: Consider introducing a configuration dataclass to group related parameters:

@dataclass
class PurchasePathConfig:
    category_column: str = "product_category"
    min_transactions: int = 3
    min_basket_size: int = 2
    # ... other parameters

def purchase_path_analysis(
    transactions_df: pd.DataFrame,
    config: PurchasePathConfig | None = None,
    **kwargs
) -> pd.DataFrame:
    if config is None:
        config = PurchasePathConfig(**kwargs)

5. Magic Number in Test (tests/analysis/test_purchase_path.py:235-236)

Per CLAUDE.md: "Extract magic numbers as named constants with descriptive names"

Issue: Hardcoded tolerance 0.001 for probability sum assertion.

Fix needed: Use numpy's allclose for more robust floating-point comparison:

assert np.allclose(prob_sum, 1.0, rtol=1e-7, atol=1e-9)

6. Minimal API Documentation (docs/api/analysis/purchase_path.md)

Issue: The API documentation file only contains a basic stub with no parameter descriptions, examples, or return value documentation.

Per CLAUDE.md: Functions need "Google-style docstrings with Args, Returns, and Raises sections"

Recommendation: Enhance the docstring in purchase_path.py with comprehensive Args/Returns/Raises sections, and consider adding more detailed examples in the API docs.

🟢 Test Coverage Issues

7. Missing Error Path Tests (6 lines uncovered per Codecov)

Per CLAUDE.md Test Guidelines: "Include boundary/edge case tests for threshold values, limits, and special cases"

Missing tests:

a) Invalid aggregation function (lines 134-136):

def test_invalid_aggregation_function(self, sample_transactions_df):
    """Test that invalid aggregation function raises ValueError."""
    with pytest.raises(ValueError, match="Unsupported aggregation function"):
        purchase_path_analysis(
            sample_transactions_df,
            sort_by="aggregation",
            aggregation_column="revenue",
            aggregation_function="invalid_func"
        )

b) Missing required columns:

def test_missing_product_id_column(self):
    """Test that missing product_id column raises ValueError."""
    df = pd.DataFrame({
        "customer_id": [1, 2],
        "transaction_id": [101, 201],
        "transaction_date": ["2023-01-01", "2023-01-02"],
        "product_category": ["Electronics", "Books"],
        # Missing product_id
        "revenue": [50.0, 60.0]
    })
    with pytest.raises(ValueError, match="product_id"):
        purchase_path_analysis(df)

c) Non-string category values (line 32 converts to string):

def test_non_string_category_values(self):
    """Test handling of non-string category values."""
    df = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "transaction_id": [101, 102, 201],
        "transaction_date": ["2023-01-01", "2023-01-02", "2023-01-03"],
        "product_id": ["A1", "A2", "B1"],
        "revenue": [50.0, 60.0, 70.0],
        "product_category": [1, 2, 1]  # Integer categories
    })
    result = purchase_path_analysis(df, min_customers=1, min_transactions=1)
    assert isinstance(result, pd.DataFrame)

8. Weak Assertion in Negative Revenue Test (tests/analysis/test_purchase_path.py:158-161)

Per CLAUDE.md Test Guidelines: "Avoid vague comparison assertions that only check 'something changed'"

Issue: The assertion allows both empty results to pass, which doesn't actually verify the behavior difference.

Current code:

if len(result_exclude) > 0 or len(result_include) > 0:
    assert not result_exclude.equals(result_include)

Better approach: Verify actual filtering behavior with specific data that guarantees different results.

🟢 Positive Observations

  1. Excellent test organization: Good use of pytest fixtures and parametrization
  2. Comprehensive scenario coverage: Tests cover various filtering parameters, edge cases, and configurations
  3. Good vectorization: Uses pandas/ibis efficiently without unnecessary loops
  4. Clear helper functions: _build_category_group_df and _build_paths_df are well-separated concerns
  5. Proper string handling: Line 32 correctly converts category values to strings to prevent TypeErrors
  6. Input format flexibility: Handles both pandas DataFrames and Ibis tables

📋 Additional Recommendations

Documentation (docs/analysis_modules.md:1369-1376)

Suggestion: Add error handling example to show users what happens when no paths meet criteria:

result = purchase_path_analysis(...)
if result.empty:
    print("No purchase paths found matching the criteria")

Performance Test Documentation (tests/analysis/test_purchase_path.py:306-307)

Issue: Comments mention "reduced for faster testing" but don't document original values or reasoning.

Recommendation: Add context:

# Using smaller dataset for CI/CD performance
# Production testing should use n_customers=1000, n_transactions_per_customer=50
n_customers = 50
n_transactions_per_customer = 5

Summary of Required Changes

Before merging, you must:

  1. ✅ Add product_id to ColumnHelper and use it consistently
  2. ✅ Fix inconsistent empty DataFrame column structures
  3. ✅ Fix revenue column validation logic

Strongly recommended:
4. ⚠️ Add tests for missing coverage (invalid aggregation, missing columns, non-string categories)
5. ⚠️ Improve probability sum assertion using numpy
6. ⚠️ Enhance API documentation with comprehensive docstrings

Consider for future improvements:
7. 💡 Refactor function signature using a config dataclass
8. 💡 Strengthen negative revenue test assertions


Overall Assessment: This is a solid implementation with good test coverage and clear code structure. The main issues are around consistency with existing codebase patterns (ColumnHelper usage) and a few edge cases in validation logic. Once the critical issues are addressed, this will be a valuable addition to the analysis module.


• Branch: feature/purchase-path-analysis

@murray-ds

Copy link
Copy Markdown
Contributor

@claude /review-pr-tests

@claude

claude Bot commented Sep 21, 2025

Copy link
Copy Markdown

Claude finished @murray-ds's task —— View job


Test Review for Purchase Path Analysis

Todo List:

  • Read and analyze the test file structure
  • Review test coverage and identify gaps
  • Examine test fixtures and data setup
  • Check parametrized tests and edge cases
  • Review test assertions and validation
  • Provide comprehensive feedback

Test Coverage Overview

The test suite for purchase_path_analysis is comprehensive and well-structured, covering 90.32% of the code according to Codecov. The tests are organized in a single test class with clear, descriptive test methods.

🟢 Strengths

1. Excellent Test Organization

  • Proper use of pytest fixtures for test data setup (sample_transactions_df, multi_category_transactions_df)
  • Clear test method naming that describes what's being tested
  • Logical grouping of related tests within a single test class

2. Comprehensive Parametrization

The test suite makes excellent use of @pytest.mark.parametrize to cover multiple scenarios efficiently:

  • Filtering parameters (test_filtering_parameters): Tests all filter types with high/low values
  • Multi-category configurations (test_multi_category_configurations): Tests 6 different combinations of handling options
  • Max depth testing (test_max_depth_parameter): Tests various depth limits
  • Edge cases (test_edge_cases_and_empty_results): Tests both restrictive and normal filters

3. Good Edge Case Coverage

  • Single customer/transaction scenarios (test_minimal_data_single_customer)
  • Same category across all transactions (test_same_category_all_transactions)
  • Empty results with restrictive filters
  • Negative revenue handling

4. Input Format Testing

  • Tests both pandas DataFrame and Ibis table inputs (test_ibis_table_input)
  • Custom category column names (test_custom_category_column)

5. Result Validation

  • Verifies DataFrame structure and required columns
  • Checks probability constraints (sum to 1, range 0-1)
  • Validates sorting order (descending by customer count)
  • Confirms basket column naming convention

🟡 Areas for Improvement

1. Missing Test Coverage (6 lines uncovered)

Based on the implementation, the missing coverage likely includes:

  • Error path for invalid aggregation function (lines ~134-136): No test for unsupported aggregation functions
  • Partial coverage in conditional branches: Some edge conditions in the empty DataFrame return paths

Suggested additional test:

def test_invalid_aggregation_function(self, sample_transactions_df):
    """Test that invalid aggregation function raises ValueError."""
    with pytest.raises(ValueError, match="Unsupported aggregation function"):
        purchase_path_analysis(
            sample_transactions_df,
            sort_by="aggregation",
            aggregation_column="revenue",
            aggregation_function="invalid_func"
        )

2. Missing Column Validation Tests

The function now requires product_id and conditionally revenue, but tests don't verify:

  • Missing product_id column raises appropriate error
  • Missing revenue column when exclude_negative_revenue=True

Suggested test:

def test_missing_product_id_column(self):
    """Test that missing product_id column raises ValueError."""
    df = pd.DataFrame({
        "customer_id": [1, 2],
        "transaction_id": [101, 201],
        "transaction_date": ["2023-01-01", "2023-01-02"],
        "product_category": ["Electronics", "Books"],
        # Missing product_id
        "revenue": [50.0, 60.0]
    })
    with pytest.raises(ValueError, match="product_id"):
        purchase_path_analysis(df)

3. Probability Sum Assertion Could Be More Robust

Line 235-236 uses a hardcoded tolerance which could be fragile:

assert abs(prob_sum - 1.0) < 0.001

Better approach:

import numpy as np
assert np.allclose(prob_sum, 1.0, rtol=1e-7, atol=1e-9)

4. Performance Test Parameters Lack Context

The performance test mentions values were "reduced for faster testing" but doesn't document the original values or reasoning (lines 306-307).

5. Missing Tests for Helper Functions

While the main function is well-tested, the helper functions (_build_category_group_df, _build_paths_df) aren't directly tested. Consider adding unit tests for these.

6. No Tests for Type Hints

The function uses type hints (str | None), but there's no test verifying these work correctly with type checkers.

🔴 Critical Gaps

1. Inconsistent Empty DataFrame Structure

The tests don't verify that empty DataFrames have consistent column structures. The implementation returns different column sets in different empty scenarios (lines 149, 161, 171 in the main module).

2. String Conversion in Category Handling

While the implementation now handles non-string categories (line 32: str(v) for v in x), there's no test specifically validating this with integer or mixed-type categories.

Suggested test:

def test_non_string_category_values(self):
    """Test handling of non-string category values."""
    df = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "transaction_id": [101, 102, 201],
        "transaction_date": ["2023-01-01", "2023-01-02", "2023-01-03"],
        "product_id": ["A1", "A2", "B1"],
        "revenue": [50.0, 60.0, 70.0],
        "product_category": [1, 2, 1]  # Integer categories
    })
    result = purchase_path_analysis(df, min_customers=1, min_transactions=1)
    assert isinstance(result, pd.DataFrame)

Summary

The test suite is very good overall with excellent use of pytest features and comprehensive scenario coverage. The 90.32% coverage is strong, but the missing 6 lines likely represent important error paths that should be tested. The main improvements needed are:

  1. Add tests for error conditions (invalid aggregation function)
  2. Test the newly required product_id column validation
  3. Add specific tests for non-string category handling
  4. Verify consistent empty DataFrame structures
  5. Improve probability sum assertions using numpy methods

The test structure and approach are sound - these are refinements to achieve complete coverage and handle edge cases identified in the code review.


Job run • Branch: feature/purchase-path-analysis

@mayurkmmt mayurkmmt changed the title Purchase Path Analysis #163 Purchase Path Analysis Nov 14, 2025
@github-actions

Copy link
Copy Markdown
Contributor

🚀 Cloudflare Pages Deployment

✅ Preview deployed successfully!

Preview URL: https://a5a3d65c.pyretailscience-docs.pages.dev
Deployment ID: a5a3d65c-de5c-4aa0-8751-d2d00cdf0e4c


This preview will be updated automatically when you push new changes to this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants