Skip to content

Split Segmentation #154

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 26, 2025
Merged

Split Segmentation #154

merged 1 commit into from
Mar 26, 2025

Conversation

mayurkmmt
Copy link
Collaborator

@mayurkmmt mayurkmmt commented Mar 25, 2025

User description

feat: split segmentations.py file


PR Type

Enhancement, Documentation, Tests


Description

  • Refactored segmentation logic into modular files for better organization.

    • Created base.py for base segmentation functionality.
    • Added hml.py for HML segmentation logic.
    • Added rfm.py for RFM segmentation logic.
    • Added threshold.py for threshold-based segmentation.
    • Added segstats.py for transaction statistics by segment.
  • Updated documentation to reflect new segmentation module structure.

    • Added API documentation for HMLSegmentation, RFMSegmentation, ThresholdSegmentation, and SegTransactionStats.
    • Updated examples in analysis_modules.md to use new imports.
  • Added comprehensive unit tests for all segmentation modules.

    • Created new test files for HMLSegmentation, RFMSegmentation, ThresholdSegmentation, and SegTransactionStats.
  • Updated mkdocs.yml to include new segmentation API documentation.


Changes walkthrough 📝

Relevant files
Enhancement
5 files
base.py
Added base class for segmentation logic                                   
+34/-0   
hml.py
Added HML segmentation logic                                                         
+49/-0   
rfm.py
Added RFM segmentation logic                                                         
+143/-0 
threshold.py
Added threshold-based segmentation logic                                 
+105/-0 
segstats.py
Added transaction statistics by segment logic                       
+11/-288
Tests
4 files
test_hml.py
Added unit tests for HML segmentation                                       
+88/-0   
test_rfm.py
Added unit tests for RFM segmentation                                       
+160/-0 
test_threshold.py
Added unit tests for threshold segmentation                           
+187/-0 
test_segstats.py
Added unit tests for transaction statistics by segment     
+320/-0 
Documentation
6 files
analysis_modules.md
Updated examples to reflect new segmentation module structure
+5/-4     
hml.md
Added API documentation for HML segmentation                         
+3/-0     
rfm.md
Added API documentation for RFM segmentation                         
+3/-0     
segstats.md
Added API documentation for SegTransactionStats                   
+3/-0     
threshold.md
Added API documentation for threshold segmentation             
+3/-0     
mkdocs.yml
Updated navigation to include new segmentation API documentation
+5/-1     
Additional files
2 files
__init__.py [link]   
test_segmentation.py +0/-733 

Need help?
  • Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
  • Check out the documentation for more information.
  • Summary by CodeRabbit

    • New Features

      • Introduced enhanced customer segmentation capabilities with flexible approaches based on spending behavior and recency-frequency-monetary analysis.
      • Added new segmentation classes for HML, RFM, and Threshold analysis.
      • Introduced a new BaseSegmentation class for foundational segmentation functionality.
    • Documentation

      • Expanded segmentation guides with dedicated sections, offering improved navigation and clearer insights.
      • Created new documentation files for HML, RFM, SegTransactionStats, and Threshold segmentation.
      • Added a new section for Base Segmentation.
    • Refactor

      • Streamlined the segmentation module structure and updated integration points to support a more maintainable and scalable API.
    • Tests

      • Implemented comprehensive test suites for HML, RFM, Threshold, and SegTransactionStats classes to validate segmentation logic and ensure robust error handling across scenarios.
      • Added new test classes for calculating segment statistics and validating segmentation outputs.

    Copy link

    coderabbitai bot commented Mar 25, 2025

    Walkthrough

    The changes reorganize the segmentation modules in the package. Import paths for segmentation classes (HML, Threshold, SegTransactionStats, RFMSegmentation) are updated, and several new modules are introduced. Documentation is enhanced with new API files and navigation restructuring in the mkdocs configuration. Additionally, new test suites replace the deprecated segmentation tests, and unused segmentation classes are removed from the segmentation statistics module.

    Changes

    File(s) Change Summary
    docs/analysis_modules.md Updated import paths for segmentation classes (HML, Threshold, SegTransactionStats, RFMSegmentation).
    docs/api/segmentation/{hml.md, rfm.md, segstats.md, threshold.md} New documentation files for HML, RFM, SegTransactionStats, and Threshold segmentation.
    mkdocs.yml Restructured navigation by replacing a single Segmentation entry with a hierarchical structure linking to new segmentation docs.
    pyretailscience/segmentation/{base.py, hml.py, rfm.py, threshold.py} New segmentation modules added: BaseSegmentation, HMLSegmentation, RFMSegmentation, and ThresholdSegmentation.
    pyretailscience/segmentation/segstats.py Refactored to remove obsolete segmentation classes, retaining only SegTransactionStats with updated documentation.
    tests/analysis/test_segmentation.py Removed outdated segmentation tests.
    tests/segmentation/{test_hml.py, test_rfm.py, test_segstats.py, test_threshold.py} Added new test suites to validate the functionality of segmentation classes (HML, RFM, SegTransactionStats, Threshold).

    Sequence Diagram(s)

    sequenceDiagram
        participant Client
        participant ThresholdSegmentation
        participant DataFrame
        Client->>ThresholdSegmentation: __init__(df, thresholds, segments, ...)
        ThresholdSegmentation->>DataFrame: Validate required columns
        ThresholdSegmentation->>DataFrame: Aggregate data and calculate percentiles
        ThresholdSegmentation->>Client: Return segmented DataFrame
    
    Loading
    sequenceDiagram
        participant Client
        participant RFMSegmentation
        participant DataFrame
        Client->>RFMSegmentation: __init__(df, current_date)
        RFMSegmentation->>DataFrame: Validate input and compute RFM metrics
        RFMSegmentation->>RFMSegmentation: Calculate RFM scores and segments
        RFMSegmentation->>Client: Provide segmented results via df and ibis_table properties
    
    Loading

    Possibly related issues

    • Split segmentation.py into a module #151: The changes in the main issue directly relate to the restructuring of import paths for segmentation classes, which aligns with the proposed refactoring in the retrieved issue that aims to move these classes into a dedicated segmentation module.

    Possibly related PRs

    • Added threshold segmentation analysis modules docs #111: The changes in the main PR involve reorganization of import paths for segmentation classes, while the retrieved PR focuses on enhancing documentation for the HML and Threshold Segmentation classes, including method definitions; thus, they are related at the code level.
    • docs: added hml seg example to analysis modules page #110: The changes in the main PR involve updates to the import paths for the HMLSegmentation class, while the retrieved PR adds an example usage of HMLSegmentation in the documentation, indicating a direct relationship at the code level.
    • Added ThresholdSegmentation class #58: The changes in the main PR involve updates to the import paths and method signatures of segmentation classes, while the retrieved PR introduces a new ThresholdSegmentation class and refactors the HMLSegmentation class, indicating a direct relationship in terms of class structure and functionality.

    Suggested labels

    enhancement, documentation, Tests, Review effort [1-5]: 3

    Suggested reviewers

    • mvanwyk

    Poem

    I'm a clever rabbit with a coding flair,
    Hoping through modules with meticulous care,
    New paths for segmentation now brightly shine,
    With docs and tests all perfectly in line,
    A playful hop and a joyful cheer—code is fine!
    🐰✨

    ✨ Finishing Touches
    • 📝 Generate Docstrings

    🪧 Tips

    Chat

    There are 3 ways to chat with CodeRabbit:

    • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
      • I pushed a fix in commit <commit_id>, please review it.
      • Generate unit testing code for this file.
      • Open a follow-up GitHub issue for this discussion.
    • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
      • @coderabbitai generate unit testing code for this file.
      • @coderabbitai modularize this function.
    • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
      • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
      • @coderabbitai read src/utils.ts and generate unit testing code.
      • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
      • @coderabbitai help me debug CodeRabbit configuration file.

    Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

    CodeRabbit Commands (Invoked using PR comments)

    • @coderabbitai pause to pause the reviews on a PR.
    • @coderabbitai resume to resume the paused reviews.
    • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
    • @coderabbitai full review to do a full review from scratch and review all the files again.
    • @coderabbitai summary to regenerate the summary of the PR.
    • @coderabbitai generate docstrings to generate docstrings for this PR.
    • @coderabbitai resolve resolve all the CodeRabbit review comments.
    • @coderabbitai plan to trigger planning for file edits and PR creation.
    • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
    • @coderabbitai help to get help.

    Other keywords and placeholders

    • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
    • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
    • Add @coderabbitai anywhere in the PR title to generate the title automatically.

    CodeRabbit Configuration File (.coderabbit.yaml)

    • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
    • Please see the configuration documentation for more information.
    • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

    Documentation and Community

    • Visit our Documentation for detailed information on how to use CodeRabbit.
    • Join our Discord Community to get help, request features, and share feedback.
    • Follow us on X/Twitter for updates and announcements.

    Copy link
    Contributor

    Qodo Merge was enabled for this repository. To continue using it, please link your Git account with your Qodo account here.

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Documentation Path Error

    The navigation path in mkdocs.yml references 'api/segmentation/html.md' but the actual file is named 'api/segmentation/hml.md'. This will cause a broken link in the documentation.

    - HML Segmentation: api/segmentation/html.md
    Import Error

    There's an incorrect import statement referencing 'pyretailscience.segmentation.html' instead of 'pyretailscience.segmentation.hml' which will cause runtime errors.

    from pyretailscience.segmentation.html import HMLSegmentation
    Test Logic Issue

    The test_thresholds_too_too_few_thresholds method has a redundant name and tests both too few and too many thresholds in the same test method, which violates the single responsibility principle for tests.

    def test_thresholds_too_too_few_thresholds(self):
        """Test that the method raises an error when there are too few/many thresholds for the number of segments."""
        df = pd.DataFrame(
            {
                get_option("column.customer_id"): [1, 2, 3, 4, 5],
                cols.unit_spend: [100, 200, 300, 400, 500],
            },
        )
        thresholds = [0.4, 1]
        segments = ["Low", "Medium", "High"]
    
        with pytest.raises(ValueError):
            ThresholdSegmentation(df, thresholds, segments)
    
        thresholds = [0.2, 0.5, 0.6, 0.8, 1]
    
        with pytest.raises(ValueError):
            ThresholdSegmentation(df, thresholds, segments)

    Copy link
    Contributor

    qodo-merge-pro bot commented Mar 25, 2025

    Qodo Merge was enabled for this repository. To continue using it, please link your Git account with your Qodo account here.

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Fix import path typo

    There's a typo in the import statement. The module name should be 'hml' not
    'html'.

    docs/analysis_modules.md [770]

    -from pyretailscience.segmentation.html import HMLSegmentation
    +from pyretailscience.segmentation.hml import HMLSegmentation
    • Apply this suggestion
    Suggestion importance[1-10]: 10

    __

    Why: The import statement contains a critical typo using "html" instead of "hml", which would cause import errors when users try to run the code. This is a high-impact fix as it corrects code that would definitely fail.

    High
    Fix documentation file path

    The file path for HML Segmentation documentation contains a typo. It should be
    'hml.md' instead of 'html.md' to match the actual module name.

    mkdocs.yml [30]

    -- HML Segmentation: api/segmentation/html.md
    +- HML Segmentation: api/segmentation/hml.md
    • Apply this suggestion
    Suggestion importance[1-10]: 9

    __

    Why: The documentation path contains the same "html" vs "hml" typo, which would result in broken documentation links. This is a high-impact fix as it ensures users can access the correct documentation pages.

    High
    General
    Use timezone.utc for compatibility

    The code is using datetime.UTC which was introduced in Python 3.11. For better
    compatibility with older Python versions, consider using datetime.timezone.utc
    instead.

    pyretailscience/segmentation/rfm.py [87]

    -current_date = datetime.datetime.now(datetime.UTC).date()
    +current_date = datetime.datetime.now(datetime.timezone.utc).date()
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion addresses a potential compatibility issue with older Python versions by replacing datetime.UTC (introduced in Python 3.11) with datetime.timezone.utc. This is a valid concern that could prevent the code from running in environments using Python 3.10 or earlier.

    Medium
    Improve test case robustness

    The test is checking for a specific error message, but it's not verifying that
    the error is raised for the correct reason. Consider adding a test case that
    checks the error is raised specifically because of the invalid date type, not
    for some other reason.

    tests/segmentation/test_rfm.py [139-145]

     def test_invalid_current_date_type(self, base_df):
         """Test that RFMSegmentation raises a TypeError when an invalid current_date is provided."""
         with pytest.raises(
             TypeError,
             match="current_date must be a string in 'YYYY-MM-DD' format, a datetime.date object, or None",
         ):
             RFMSegmentation(base_df, current_date=12345)
    +        
    +    # Verify valid date types work
    +    RFMSegmentation(base_df, current_date="2025-03-17")  # String format
    +    RFMSegmentation(base_df, current_date=datetime.date(2025, 3, 17))  # date object
    +    RFMSegmentation(base_df, current_date=None)  # None defaults to current date
    • Apply this suggestion
    Suggestion importance[1-10]: 5

    __

    Why: The suggestion enhances test coverage by adding verification for valid date types alongside the existing invalid type test. While the original test correctly checks for error raising, the additional tests provide more comprehensive validation of the date handling functionality.

    Low
    • Update

    Copy link

    codecov bot commented Mar 25, 2025

    Codecov Report

    Attention: Patch coverage is 93.00000% with 7 lines in your changes missing coverage. Please review.

    Files with missing lines Patch % Lines
    pyretailscience/segmentation/rfm.py 92.10% 0 Missing and 3 partials ⚠️
    pyretailscience/segmentation/base.py 80.00% 1 Missing and 1 partial ⚠️
    pyretailscience/segmentation/threshold.py 95.23% 0 Missing and 2 partials ⚠️
    Files with missing lines Coverage Δ
    pyretailscience/segmentation/hml.py 100.00% <100.00%> (ø)
    pyretailscience/segmentation/segstats.py 68.22% <100.00%> (ø)
    pyretailscience/segmentation/base.py 80.00% <80.00%> (ø)
    pyretailscience/segmentation/threshold.py 95.23% <95.23%> (ø)
    pyretailscience/segmentation/rfm.py 92.10% <92.10%> (ø)
    🚀 New features to boost your workflow:
    • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

    Copy link

    @coderabbitai coderabbitai bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Actionable comments posted: 3

    🧹 Nitpick comments (12)
    pyretailscience/segmentation/threshold.py (3)

    31-31: Change segment keys to numeric types for clarity
    The parameter segments is typed as dict[any, str]. If your threshold keys are numeric quantiles, using dict[float, str] can improve clarity and readability.


    50-52: Mismatch in docstring regarding null value checks
    The docstring states a ValueError is raised if columns contain null values, but the code only checks for missing columns. Consider adding explicit checks for null values to match the documented behavior.


    100-106: Potential repeated execution of Ibis query
    Each time the df property is accessed, table.execute() is called, re-running the query. For large datasets, consider caching results or separating data retrieval from downstream operations to optimize performance.

    pyretailscience/segmentation/base.py (2)

    8-10: Consider expanding the class docstring.

    The class docstring could provide more information about how the class should be used, particularly explaining the expected structure of self.df that will be referenced in the add_segment method.

    class BaseSegmentation:
    -    """A base class for customer segmentation."""
    +    """A base class for customer segmentation.
    +
    +    This class expects subclasses to initialize self.df with customer data,
    +    including a 'segment_name' column that maps customer IDs to their segments.
    +    """

    11-33:

    ❓ Verification inconclusive

    Add initialization validation and test coverage for error path.

    The add_segment method is well-documented, but there are a few improvements that could be made:

    1. There's no validation that self.df exists and has the required "segment_name" column.
    2. The error case (row count mismatch) is not covered by tests.
    def add_segment(self, df: pd.DataFrame) -> pd.DataFrame:
        """Adds the segment to the dataframe based on the customer_id column.
    
        Args:
            df (pd.DataFrame): The dataframe to add the segment to. The dataframe must have a customer_id column.
    
        Returns:
            pd.DataFrame: The dataframe with the segment added.
    
        Raises:
            ValueError: If the number of rows before and after the merge do not match.
    +       AttributeError: If self.df is not initialized or missing the segment_name column.
        """
    +    if not hasattr(self, 'df') or 'segment_name' not in self.df.columns:
    +        raise AttributeError("self.df must be initialized with a 'segment_name' column.")
    +        
    +    customer_id_col = get_option("column.customer_id")
    +    if customer_id_col not in df.columns:
    +        raise ValueError(f"DataFrame must contain the customer ID column: {customer_id_col}")
    +        
        rows_before = len(df)
        df = df.merge(
            self.df["segment_name"],
            how="left",
            left_on=get_option("column.customer_id"),
            right_index=True,
        )
        rows_after = len(df)
        if rows_before != rows_after:
            raise ValueError("The number of rows before and after the merge do not match. This should not happen.")
    
        return df

    To verify test coverage for the error path, let's check if there are tests for this class:


    🏁 Script executed:

    #!/bin/bash
    # Look for tests that cover the BaseSegmentation class
    rg "BaseSegmentation" -g "test_*.py" --files-with-matches

    Length of output: 57


    Initialization Validation Added – Please Add Tests for Error Paths

    The updated changes correctly include validation to ensure that both self.df and its "segment_name" column, as well as the customer ID column in df, are present before proceeding with the merge. However, our search for tests covering the error cases (such as when these validations fail or the row counts don’t match) didn’t yield any results.

    • Ensure that tests are added for the following scenarios:
      • When self.df is not initialized or is missing the "segment_name" column.
      • When the input DataFrame does not include the expected customer ID column.
      • When the row count mismatch triggers the ValueError during the merge.

    Please add the necessary tests to cover these error paths to maintain robust error handling in the code.

    🧰 Tools
    🪛 GitHub Check: codecov/patch

    [warning] 32-32: pyretailscience/segmentation/base.py#L32
    Added line #L32 was not covered by tests

    tests/segmentation/test_threshold.py (1)

    170-188: Additional threshold validation tests.

    These tests complement the previous validation tests by checking the case where there are too few thresholds for the number of segments.

    I notice a minor typo in the test name test_thresholds_too_too_few_thresholds (duplicated "too"), but this doesn't affect functionality.

    -    def test_thresholds_too_too_few_thresholds(self):
    +    def test_thresholds_too_few_thresholds(self):
    tests/segmentation/test_segstats.py (1)

    51-77: Consider adding edge case coverage with no transactions.

    While these tests correctly confirm transaction-level behavior, it could be valuable to add a scenario where one or more segments have zero rows (e.g., empty segment or no matching transactions) to ensure the code gracefully handles entirely missing segment data.

    pyretailscience/segmentation/hml.py (1)

    3-6: Fix minor grammatical issue in docstring.

    “commonly know as the 80/20 rule” should be “commonly known as the 80/20 rule.”

    -commonly know as the 80/20 rule
    +commonly known as the 80/20 rule
    pyretailscience/segmentation/segstats.py (2)

    137-140: Be mindful of memory usage for large DataFrames.

    Converting a large pd.DataFrame to an Ibis memtable can increase memory consumption. If the data is substantial, you might consider an alternative Ibis backend or a partitioned approach to avoid potential out-of-memory issues.


    250-251: Support for multi-column plots.

    Currently, the plot method raises an error if multiple segment columns exist. If multi-level plotting becomes a requirement, refactoring to handle multiple segment levels (e.g., stacked bar charts) may be beneficial.

    pyretailscience/segmentation/rfm.py (2)

    91-132: Consider making the bin size configurable.

    The current implementation of ibis.ntile(10) is hardcoded to 10 bins. Some users may want more or fewer bins for their RFM segmentation use cases. Making this a parameter could increase flexibility and reusability.

    Below is a possible change illustrating how to introduce a bins parameter:

    -def _compute_rfm(self, df: ibis.Table, current_date: datetime.date) -> ibis.Table:
    +def _compute_rfm(self, df: ibis.Table, current_date: datetime.date, bins: int = 10) -> ibis.Table:
         ...
         rfm_scores = customer_metrics.mutate(
    -        r_score=(ibis.ntile(10).over(window_recency)),
    -        f_score=(ibis.ntile(10).over(window_frequency)),
    -        m_score=(ibis.ntile(10).over(window_monetary)),
    +        r_score=(ibis.ntile(bins).over(window_recency)),
    +        f_score=(ibis.ntile(bins).over(window_frequency)),
    +        m_score=(ibis.ntile(bins).over(window_monetary)),
         )
         ...

    133-144: Minor doc alignment suggestion.

    The docstring for the df property mentions “segment names,” but the method actually returns the entire table with RFM metrics and indexes on customer_id. Consider clarifying that the DataFrame includes R, F, M, and segment columns. Similarly, you may wish to document the auxiliary fm_segment column or remove it if it’s not needed.

    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 91edbf4 and a0179ca.

    📒 Files selected for processing (16)
    • docs/analysis_modules.md (4 hunks)
    • docs/api/segmentation/hml.md (1 hunks)
    • docs/api/segmentation/rfm.md (1 hunks)
    • docs/api/segmentation/segstats.md (1 hunks)
    • docs/api/segmentation/threshold.md (1 hunks)
    • mkdocs.yml (1 hunks)
    • pyretailscience/segmentation/base.py (1 hunks)
    • pyretailscience/segmentation/hml.py (1 hunks)
    • pyretailscience/segmentation/rfm.py (1 hunks)
    • pyretailscience/segmentation/segstats.py (1 hunks)
    • pyretailscience/segmentation/threshold.py (1 hunks)
    • tests/analysis/test_segmentation.py (0 hunks)
    • tests/segmentation/test_hml.py (1 hunks)
    • tests/segmentation/test_rfm.py (1 hunks)
    • tests/segmentation/test_segstats.py (1 hunks)
    • tests/segmentation/test_threshold.py (1 hunks)
    💤 Files with no reviewable changes (1)
    • tests/analysis/test_segmentation.py
    🧰 Additional context used
    🧬 Code Definitions (6)
    tests/segmentation/test_rfm.py (1)
    pyretailscience/segmentation/rfm.py (3)
    • RFMSegmentation (34-143)
    • df (134-138)
    • ibis_table (141-143)
    tests/segmentation/test_hml.py (6)
    pyretailscience/segmentation/hml.py (1)
    • HMLSegmentation (16-49)
    tests/segmentation/test_segstats.py (1)
    • base_df (17-27)
    tests/segmentation/test_rfm.py (1)
    • base_df (18-33)
    pyretailscience/segmentation/rfm.py (1)
    • df (134-138)
    pyretailscience/segmentation/threshold.py (1)
    • df (101-105)
    pyretailscience/segmentation/segstats.py (1)
    • df (190-204)
    tests/segmentation/test_segstats.py (1)
    pyretailscience/segmentation/segstats.py (3)
    • SegTransactionStats (23-299)
    • df (190-204)
    • plot (206-299)
    pyretailscience/segmentation/hml.py (3)
    pyretailscience/segmentation/threshold.py (2)
    • ThresholdSegmentation (22-105)
    • df (101-105)
    pyretailscience/segmentation/rfm.py (1)
    • df (134-138)
    pyretailscience/segmentation/segstats.py (1)
    • df (190-204)
    pyretailscience/segmentation/threshold.py (3)
    pyretailscience/segmentation/base.py (1)
    • BaseSegmentation (8-34)
    pyretailscience/segmentation/rfm.py (1)
    • df (134-138)
    pyretailscience/segmentation/segstats.py (1)
    • df (190-204)
    pyretailscience/segmentation/rfm.py (2)
    pyretailscience/segmentation/threshold.py (1)
    • df (101-105)
    pyretailscience/segmentation/segstats.py (1)
    • df (190-204)
    🪛 GitHub Check: codecov/patch
    pyretailscience/segmentation/base.py

    [warning] 32-32: pyretailscience/segmentation/base.py#L32
    Added line #L32 was not covered by tests

    🔇 Additional comments (38)
    pyretailscience/segmentation/threshold.py (1)

    75-82: Clarify behavior for "include_with_light" zero-spend customers
    Currently there is no specific branch handling zero-spend customers when zero_value_customers="include_with_light". Verify if allowing them to remain in the main dataset and receive a normal percentile rank is the intended behavior for your business logic.

    docs/api/segmentation/hml.md (1)

    1-3: Documentation references updated successfully
    The file properly references the pyretailscience.segmentation.hml module, which aligns with the new segmentation structure.

    docs/api/segmentation/rfm.md (1)

    1-3: Documentation looks good.

    The documentation for the RFM Segmentation module follows the project's documentation pattern with a concise header and proper module reference.

    docs/api/segmentation/threshold.md (1)

    1-3: Documentation looks good.

    The documentation for the Threshold Segmentation module is consistent with other documentation patterns in the project.

    docs/api/segmentation/segstats.md (1)

    1-3: Documentation looks good.

    The documentation for the SegTransactionStats Segmentation module maintains consistency with other segmentation documentation patterns.

    pyretailscience/segmentation/base.py (1)

    1-7: Well-structured imports and module docstring.

    The module docstring clearly explains the purpose of this file, and the imports are organized appropriately.

    tests/segmentation/test_hml.py (7)

    1-24: LGTM: Well-structured test fixture and setup.

    The imports, class declaration, and fixture setup look clean and follow good testing practices.


    25-38: Comprehensive test for "exclude" option.

    This test effectively verifies that zero-spend customers are excluded from the result DataFrame when the zero_value_customers parameter is set to "exclude".


    39-50: Comprehensive test for "include_with_light" option.

    The test appropriately verifies that zero-spend customers are included in the "Light" segment when the zero_value_customers parameter is set to "include_with_light".


    51-62: Comprehensive test for "separate_segment" option.

    The test correctly verifies that zero-spend customers are assigned to a "Zero" segment when the zero_value_customers parameter is set to "separate_segment".


    63-68: Good validation test for missing columns.

    This test ensures the class properly raises a ValueError when required columns are missing, which is essential for early error detection.


    69-78: Input data integrity check.

    Good practice to verify that the original DataFrame is not modified during processing, ensuring the method is non-destructive.


    79-89: Effective test for alternate value column.

    This test demonstrates the flexibility of the class by showing it can work with different column names for the value parameter.

    tests/segmentation/test_threshold.py (6)

    1-37: LGTM: Well-structured basic segmentation test.

    The initial test case effectively verifies that customers are correctly segmented based on provided thresholds. The structure follows good testing practices.


    38-49: Error handling test for single customer scenario.

    Good test to verify that the class raises an appropriate error when there's only one customer in the DataFrame. This prevents segmentation issues with too few data points.


    50-78: Thorough testing of custom aggregation function.

    This test properly verifies that the aggregation function works correctly with a non-standard value column. The use of pandas testing utilities for DataFrame comparison is a good practice.


    79-115: Good validation of segment data merging.

    Comprehensive test to verify that segment information is correctly merged back into the original DataFrame using the add_segment method.


    116-136: Important test for duplicate customer IDs.

    This test verifies that the segmentation handles DataFrames with duplicate customer_id entries correctly, which is a common real-world scenario.


    137-169: Thorough error handling for threshold and segment validation.

    These tests ensure that the class properly validates the thresholds and segments parameters, raising appropriate errors when they don't meet requirements.

    tests/segmentation/test_rfm.py (8)

    1-50: LGTM: Well-structured test fixtures for RFM segmentation.

    The test fixtures provide comprehensive test data and expected results for RFM segmentation. The use of explicit expected values for RFM scores is good practice.


    51-64: Comprehensive RFM segmentation test with current date.

    This test correctly verifies that RFM scores and segments are calculated properly based on a specified current date.


    65-77: Good error handling test for missing columns.

    This test ensures the class properly validates required columns and raises an appropriate error when they're missing.


    78-92: Testing single customer scenario.

    Important edge case test to verify that RFM segmentation works correctly with a single customer DataFrame.


    93-114: Multiple transactions handling test.

    This test verifies that the class correctly aggregates multiple transactions per customer when calculating RFM scores.


    115-124: Verification of complete customer processing.

    Good test to ensure that all customers are processed and receive RFM segments.


    125-138: Clever use of freezegun for date testing.

    This test effectively uses freezegun to mock the current date, allowing for predictable testing of the automatic date handling feature.


    139-161: Type validation tests.

    Thorough tests to verify that the class properly validates input types and raises appropriate errors for invalid inputs.

    tests/segmentation/test_segstats.py (2)

    29-49: Comprehensive scenario validation.

    These tests effectively validate various aggregation outputs (revenue, transactions, and customers) for multi-segment data, including the “Total” row. Great work ensuring robust coverage of common scenarios.


    153-321: Impressive multiple-parameter coverage.

    These tests thoroughly exercise segment columns, plotting constraints, and extra aggregations. They validate correct error handling and proper data shape for advanced usage. No significant issues noted.

    pyretailscience/segmentation/hml.py (1)

    19-49: Enhance test coverage for HMLSegmentation.

    Although this class extends ThresholdSegmentation and appears logically sound, consider adding dedicated tests to verify that Light, Medium, Heavy, and Zero spend segments are computed correctly in line with the 50/30/20 splits.

    pyretailscience/segmentation/segstats.py (1)

    1-10: Well-structured documentation updates.

    The updated docstring clearly communicates the purpose and capabilities of SegTransactionStats. The emphasis on both Pandas DataFrames and Ibis Tables is concise and aligns with the underlying logic.

    pyretailscience/segmentation/rfm.py (4)

    1-24: Great descriptive module docstring.

    The in-depth explanation of RFM segmentation logic, scoring tier details, and business implications is well-written and thorough. This provides valuable clarity to users and maintainers.


    26-33: Verify Python version compatibility.

    The line datetime.datetime.now(datetime.UTC).date() (line 87) relies on datetime.UTC, which is only available in Python 3.11+. If this library must support older Python versions, consider switching to datetime.timezone.utc.

    Do you need to support Python versions earlier than 3.11? If so, please confirm and update accordingly.


    34-46: Well-structured class docstring.

    The class docstring neatly outlines the RFM approach and clearly communicates the scoring system. The explanation aligns well with the module docstring. No issues found here.


    48-90: Robust initialization checks.

    The constructor correctly enforces required columns and handles both pandas DataFrame and Ibis Table inputs. The early validation steps (lines 79-82) and type checks (lines 77, 89) help prevent runtime errors. Nicely done!

    docs/analysis_modules.md (3)

    684-684: Approve HMLSegmentation Import Update
    The import for HMLSegmentation has been correctly updated to use the new module path (pyretailscience.segmentation.hml). This aligns with the intended reorganization of the segmentation modules.


    727-727: Approve ThresholdSegmentation Import Update
    The import for ThresholdSegmentation is now correctly pointing to pyretailscience.segmentation.threshold, which reflects the updated module structure.


    822-822: Approve RFMSegmentation Import Update
    The import for RFMSegmentation has been updated to pyretailscience.segmentation.rfm, in line with the new module organization.

    Comment on lines +88 to +93
    for quantile, segment in zip(thresholds, segments, strict=True):
    case = case.when(df["ptile"] <= quantile, segment)

    case = case.end()

    df = df.mutate(segment_name=case).drop(["ptile"])
    Copy link

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    🛠️ Refactor suggestion

    No fallback for percentile values above highest threshold
    This block only handles cases where the percentile is <= each threshold in turn. If the highest threshold is below 1.0, some customers may not be assigned to any segment. Consider adding a fallback or requiring 1.0 as the final threshold.

    @mayurkmmt mayurkmmt force-pushed the feature/split-segmentation branch from a0179ca to 1fc2b6e Compare March 25, 2025 13:39
    Copy link

    @coderabbitai coderabbitai bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Actionable comments posted: 3

    🧹 Nitpick comments (1)
    tests/segmentation/test_threshold.py (1)

    170-170: Fix typo in test method name

    There's a duplicate "too" in the method name.

    -def test_thresholds_too_too_few_thresholds(self):
    +def test_thresholds_too_few_thresholds(self):
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between a0179ca and 1fc2b6e.

    📒 Files selected for processing (16)
    • docs/analysis_modules.md (4 hunks)
    • docs/api/segmentation/hml.md (1 hunks)
    • docs/api/segmentation/rfm.md (1 hunks)
    • docs/api/segmentation/segstats.md (1 hunks)
    • docs/api/segmentation/threshold.md (1 hunks)
    • mkdocs.yml (1 hunks)
    • pyretailscience/segmentation/base.py (1 hunks)
    • pyretailscience/segmentation/hml.py (1 hunks)
    • pyretailscience/segmentation/rfm.py (1 hunks)
    • pyretailscience/segmentation/segstats.py (1 hunks)
    • pyretailscience/segmentation/threshold.py (1 hunks)
    • tests/analysis/test_segmentation.py (0 hunks)
    • tests/segmentation/test_hml.py (1 hunks)
    • tests/segmentation/test_rfm.py (1 hunks)
    • tests/segmentation/test_segstats.py (1 hunks)
    • tests/segmentation/test_threshold.py (1 hunks)
    💤 Files with no reviewable changes (1)
    • tests/analysis/test_segmentation.py
    🚧 Files skipped from review as they are similar to previous changes (6)
    • docs/api/segmentation/hml.md
    • docs/api/segmentation/rfm.md
    • docs/api/segmentation/segstats.md
    • pyretailscience/segmentation/hml.py
    • docs/analysis_modules.md
    • docs/api/segmentation/threshold.md
    🧰 Additional context used
    🧬 Code Definitions (3)
    tests/segmentation/test_hml.py (6)
    pyretailscience/segmentation/hml.py (1)
    • HMLSegmentation (16-49)
    tests/segmentation/test_rfm.py (1)
    • base_df (18-33)
    tests/segmentation/test_segstats.py (1)
    • base_df (17-27)
    pyretailscience/segmentation/rfm.py (1)
    • df (134-138)
    pyretailscience/segmentation/threshold.py (1)
    • df (101-105)
    pyretailscience/segmentation/segstats.py (1)
    • df (190-204)
    tests/segmentation/test_segstats.py (1)
    pyretailscience/segmentation/segstats.py (3)
    • SegTransactionStats (23-299)
    • df (190-204)
    • plot (206-299)
    pyretailscience/segmentation/threshold.py (3)
    pyretailscience/segmentation/base.py (1)
    • BaseSegmentation (8-34)
    pyretailscience/segmentation/rfm.py (1)
    • df (134-138)
    pyretailscience/segmentation/segstats.py (1)
    • df (190-204)
    🪛 GitHub Check: codecov/patch
    pyretailscience/segmentation/base.py

    [warning] 32-32: pyretailscience/segmentation/base.py#L32
    Added line #L32 was not covered by tests

    🔇 Additional comments (32)
    mkdocs.yml (1)

    29-33:

    ✅ Verification successful

    Consider verifying the new doc file references.

    The newly introduced documentation paths for HML, RFM, SegTransactionStats, and Threshold Segmentation look consistent. However, please confirm that these files (hml.md, rfm.md, segstats.md, and threshold.md) exist and are properly recognized by MkDocs.

    Here is a script to check for these files:


    🏁 Script executed:

    #!/bin/bash
    # Verify the existence of new doc files under docs/api/segmentation
    fd hml.md docs/api/segmentation
    fd rfm.md docs/api/segmentation
    fd segstats.md docs/api/segmentation
    fd threshold.md docs/api/segmentation

    Length of output: 273


    Doc file references have been verified.
    The new documentation paths for HML, RFM, SegTransactionStats, and Threshold Segmentation have all been confirmed to exist under the docs/api/segmentation directory:

    • docs/api/segmentation/hml.md
    • docs/api/segmentation/rfm.md
    • docs/api/segmentation/segstats.md
    • docs/api/segmentation/threshold.md

    No further changes are required.

    pyretailscience/segmentation/threshold.py (1)

    88-91: No fallback for customers above the highest threshold.

    If the highest threshold is below 1.0, customers with a percentile rank above that threshold have no assigned segment. A fallback or enforced final threshold of 1.0 is recommended to handle these customers correctly.

    pyretailscience/segmentation/base.py (1)

    8-34: Add unit test coverage for line 32.

    Static analysis indicates that this error-handling line (raise ValueError) is untested. Adding a negative test scenario that triggers a row mismatch after merging will help verify this logic and improve overall coverage.

    Would you like me to provide a sample test case that forces a mismatch scenario for demonstration?

    🧰 Tools
    🪛 GitHub Check: codecov/patch

    [warning] 32-32: pyretailscience/segmentation/base.py#L32
    Added line #L32 was not covered by tests

    tests/segmentation/test_hml.py (5)

    1-89: Well-structured test suite for HMLSegmentation

    The test coverage for the HMLSegmentation class is comprehensive, effectively testing all three options for handling zero-value customers: "exclude", "include_with_light", and "separate_segment". The tests properly verify the segmentation logic and edge cases.


    25-37: Test case properly verifies zero customer exclusion

    Good test for ensuring zero-spend customers are completely excluded from the result DataFrame when the "exclude" option is used.


    64-67: Thorough validation of error handling

    The test properly verifies that a ValueError is raised when required columns are missing. This helps ensure the class fails fast with clear error messages rather than producing unexpected results.


    70-77: Important verification of non-mutation

    This test is critical for ensuring that the segmentation process doesn't modify the original DataFrame, which is essential for maintaining data integrity when the same DataFrame is used in multiple operations.


    79-88: Good test for alternate column functionality

    Testing with an alternate value column ensures the flexibility of the implementation to work with different column names, which is important for adapting to various data structures.

    tests/segmentation/test_threshold.py (3)

    1-188: Comprehensive test suite for ThresholdSegmentation

    The test suite covers the core functionality of ThresholdSegmentation, including proper segmentation, aggregation functions, error handling, and edge cases. The tests are well-structured and verify both expected behavior and proper error handling.


    38-48: Correctly tests single customer limitation

    This test verifies that attempting to use ThresholdSegmentation with a DataFrame containing only one customer raises a ValueError. This is an important limitation to test for, as segmentation generally requires multiple data points to be meaningful.


    50-77: Thorough test of custom aggregation functions

    This test effectively validates that the ThresholdSegmentation class correctly applies custom aggregation functions (in this case 'nunique') for segmenting on product_id. The test covers a complex case with multiple products per customer and verifies the expected segmentation outcome.

    tests/segmentation/test_rfm.py (4)

    1-161: Comprehensive test suite for RFMSegmentation

    The test suite thoroughly covers various aspects of RFM segmentation, including correct calculation of scores, handling of edge cases, date handling, and validation of input types. The use of pytest fixtures and freezegun for consistent testing is particularly effective.


    35-49: Well-structured fixture for expected values

    The expected_df fixture provides a clear reference point for RFM segmentation tests, making the assertions more readable and maintainable. This is a good practice for tests that need to compare complex DataFrame structures.


    125-137: Good use of freezegun for date testing

    Using freezegun to control the current date ensures consistent test results, regardless of when the tests are run. This is crucial for date-dependent functionality like RFM segmentation.


    154-160: Important test for ibis.Table compatibility

    This test verifies that the RFMSegmentation class properly handles ibis.Table objects, which is an important feature for users working with large datasets where Ibis can provide performance benefits.

    tests/segmentation/test_segstats.py (5)

    13-151: Well-structured tests for segment statistics calculation

    The TestCalcSegStats class thoroughly tests the segment statistics calculation logic, including revenue, transactions, and customers per segment. The tests cover various scenarios like single segments, zero net units, and excluding total rows.


    153-320: Comprehensive testing of SegTransactionStats functionality

    The TestSegTransactionStats class provides good coverage of the class's functionality, including error handling, multiple segment columns, plotting limitations, and custom aggregations. The tests for the extra_aggs parameter are particularly thorough.


    210-227: Important validation of plotting limitations

    This test verifies that attempting to plot with multiple segment columns raises an appropriate ValueError, which is essential for providing clear error messages to users who might attempt this unsupported operation.


    229-288: Thorough testing of extra aggregations feature

    The tests for the extra_aggs parameter cover both single and multiple custom aggregations, verifying that the results match expected values. This is crucial for ensuring that this powerful feature works correctly.


    290-320: Good error handling tests for invalid aggregations

    These tests verify that appropriate errors are raised for invalid columns and functions in extra_aggs, ensuring that users receive clear error messages rather than unexpected results.

    pyretailscience/segmentation/segstats.py (7)

    1-10: Clear and focused module documentation.

    The updated docstring effectively communicates the module's purpose, focusing solely on transaction statistics by segment. This aligns well with the modular restructuring objectives of the PR.


    19-19: Import statements have been appropriately streamlined.

    The import statements have been simplified to focus only on dependencies needed for SegTransactionStats, removing references to the segmentation classes that have been moved to separate modules.


    23-85: The SegTransactionStats class implementation is well-structured.

    The class initialization and validation logic are robust, with proper error handling for missing columns and invalid input formats. The class effectively supports both Pandas DataFrames and Ibis Tables.


    86-114: Column order management is well-implemented.

    The static method for determining column order based on data availability is clean and maintainable, with good documentation explaining its purpose.


    115-187: Segment statistics calculation is robust and well-implemented.

    The implementation handles various edge cases appropriately:

    • Converting DataFrame to Ibis table when needed
    • Validating input types
    • Handling optional quantity columns
    • Supporting custom aggregations
    • Including total metrics when requested
    • Computing derived metrics correctly

    189-204: The df property implementation is efficient and flexible.

    The property correctly handles column ordering, including both standard metrics and any custom aggregations specified during initialization.


    206-299: The plot method is comprehensive with good parameter validation.

    The plotting functionality provides flexible options for visualization while maintaining appropriate parameter validation. The method effectively handles orientation, sorting, axis labels, and formatting.

    pyretailscience/segmentation/rfm.py (5)

    1-24: Excellent comprehensive module documentation.

    The docstring provides a thorough explanation of RFM segmentation, including its purpose, benefits, and the scoring methodology. This helps users understand the module's functionality and how to interpret the results.


    26-32: Import statements are appropriately minimal.

    The imports include only the necessary modules and libraries needed for the RFM segmentation functionality.


    34-47: Clear class-level documentation for RFMSegmentation.

    The class docstring effectively explains how customers are scored on the three dimensions (recency, frequency, monetary) and how the RFM segment is calculated.


    93-132: RFM computation logic is well-implemented.

    The _compute_rfm method effectively:

    1. Calculates the base metrics (recency, frequency, monetary)
    2. Creates appropriate window functions for ranking
    3. Computes percentile scores using ntile
    4. Constructs the final RFM and FM segments

    Using secondary sorting by customer_id in window functions is a good practice to ensure deterministic results when there are ties in the primary metric.


    133-144: Accessor properties are appropriately implemented.

    The class provides convenient properties to access the segmentation results in both DataFrame and Ibis table formats. The DataFrame is lazily executed only when needed, which is efficient.

    Comment on lines 31 to 32
    segments: dict[any, str],
    value_col: str | None = None,
    Copy link

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    🛠️ Refactor suggestion

    Clarify the usage of the segments dictionary.

    The docstring states that “keys are segment IDs and values are segment names,” but the loop iterates over the dictionary keys as if they are the displayed segment names. This may cause confusion or unintended labeling.

    Consider either:

    • Changing the loop to use the dictionary values as segment labels, or
    • Updating the docstring to reflect that the dictionary keys will be used as the segment labels.

    Comment on lines +50 to +92
    def __init__(self, df: pd.DataFrame | ibis.Table, current_date: str | datetime.date | None = None) -> None:
    """Initializes the RFM segmentation process.

    Args:
    df (pd.DataFrame | ibis.Table): A DataFrame or Ibis table containing transaction data.
    Must include the following columns:
    - customer_id
    - transaction_date
    - unit_spend
    - transaction_id
    current_date (Optional[Union[str, datetime.date]]): The reference date for calculating recency.
    Can be a string (format: "YYYY-MM-DD"), a date object, or None (defaults to the current system date).

    Raises:
    ValueError: If the dataframe is missing required columns.
    TypeError: If the input data is not a pandas DataFrame or an Ibis Table.
    """
    cols = ColumnHelper()
    required_cols = [
    cols.customer_id,
    cols.transaction_date,
    cols.unit_spend,
    cols.transaction_id,
    ]
    if isinstance(df, pd.DataFrame):
    df = ibis.memtable(df)
    elif not isinstance(df, ibis.Table):
    raise TypeError("df must be either a pandas DataFrame or an Ibis Table")

    missing_cols = set(required_cols) - set(df.columns)
    if missing_cols:
    error_message = f"Missing required columns: {missing_cols}"
    raise ValueError(error_message)

    if isinstance(current_date, str):
    current_date = datetime.date.fromisoformat(current_date)
    elif current_date is None:
    current_date = datetime.datetime.now(datetime.UTC).date()
    elif not isinstance(current_date, datetime.date):
    raise TypeError("current_date must be a string in 'YYYY-MM-DD' format, a datetime.date object, or None")

    self.table = self._compute_rfm(df, current_date)

    Copy link

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    💡 Verification agent

    🧩 Analysis chain

    Initialization method has thorough validation.

    The __init__ method includes comprehensive validation of input parameters and handles different input types appropriately.

    However, there's a potential Python version compatibility issue:


    🏁 Script executed:

    #!/bin/bash
    # Description: Check Python version requirement for datetime.UTC usage
    
    # Check if the repository specifies a minimum Python version
    grep -r "python_requires\|python_version" setup.py pyproject.toml requirements*.txt | head -5
    
    # Check if the code uses datetime.UTC elsewhere
    rg "datetime\.UTC"

    Length of output: 317


    Attention: Python Version Compatibility for datetime.UTC

    • The initialization method’s validations are thorough and correctly handle alternative input types.
    • However, the use of datetime.UTC in pyretailscience/segmentation/rfm.py (line 58) may lead to compatibility issues on Python versions earlier than 3.11.
    • The repository configuration (e.g., in pyproject.toml) does not enforce a minimum Python version, so consider either:
      • Updating the project’s Python version requirement to 3.11 or higher, or
      • Replacing datetime.UTC with the more broadly supported datetime.timezone.utc for compatibility with older Python versions.

    @mayurkmmt mayurkmmt force-pushed the feature/split-segmentation branch from 1fc2b6e to 835a8e1 Compare March 26, 2025 05:50
    Copy link

    @coderabbitai coderabbitai bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Actionable comments posted: 1

    🧹 Nitpick comments (1)
    pyretailscience/segmentation/segstats.py (1)

    23-24: Consider expanding the class-level docstring.

    While the implementation and __init__ docstring are detailed, the class-level docstring is quite brief. Consider expanding it to match the comprehensiveness of the module-level docstring.

    -    """Calculates transaction statistics by segment."""
    +    """Calculates and visualizes transaction statistics by segment.
    +    
    +    This class computes transaction-based statistics grouped by segment columns,
    +    including total spend, unique customers, transactions per customer, and more.
    +    It supports both Pandas DataFrames and Ibis Tables as input data formats and
    +    provides visualization capabilities for segment-based statistics.
    +    """
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 1fc2b6e and 835a8e1.

    📒 Files selected for processing (16)
    • docs/analysis_modules.md (4 hunks)
    • docs/api/segmentation/hml.md (1 hunks)
    • docs/api/segmentation/rfm.md (1 hunks)
    • docs/api/segmentation/segstats.md (1 hunks)
    • docs/api/segmentation/threshold.md (1 hunks)
    • mkdocs.yml (1 hunks)
    • pyretailscience/segmentation/base.py (1 hunks)
    • pyretailscience/segmentation/hml.py (1 hunks)
    • pyretailscience/segmentation/rfm.py (1 hunks)
    • pyretailscience/segmentation/segstats.py (1 hunks)
    • pyretailscience/segmentation/threshold.py (1 hunks)
    • tests/analysis/test_segmentation.py (0 hunks)
    • tests/segmentation/test_hml.py (1 hunks)
    • tests/segmentation/test_rfm.py (1 hunks)
    • tests/segmentation/test_segstats.py (1 hunks)
    • tests/segmentation/test_threshold.py (1 hunks)
    💤 Files with no reviewable changes (1)
    • tests/analysis/test_segmentation.py
    🚧 Files skipped from review as they are similar to previous changes (10)
    • docs/api/segmentation/rfm.md
    • docs/api/segmentation/threshold.md
    • docs/api/segmentation/hml.md
    • docs/api/segmentation/segstats.md
    • tests/segmentation/test_threshold.py
    • tests/segmentation/test_hml.py
    • pyretailscience/segmentation/hml.py
    • tests/segmentation/test_rfm.py
    • docs/analysis_modules.md
    • mkdocs.yml
    🧰 Additional context used
    🧬 Code Definitions (2)
    tests/segmentation/test_segstats.py (1)
    pyretailscience/segmentation/segstats.py (3)
    • SegTransactionStats (23-299)
    • df (190-204)
    • plot (206-299)
    pyretailscience/segmentation/threshold.py (3)
    pyretailscience/segmentation/base.py (1)
    • BaseSegmentation (8-34)
    pyretailscience/segmentation/rfm.py (1)
    • df (134-138)
    pyretailscience/segmentation/segstats.py (1)
    • df (190-204)
    🪛 GitHub Check: codecov/patch
    pyretailscience/segmentation/base.py

    [warning] 32-32: pyretailscience/segmentation/base.py#L32
    Added line #L32 was not covered by tests

    🔇 Additional comments (12)
    pyretailscience/segmentation/segstats.py (2)

    1-10: Excellent docstring improvement!

    The updated module docstring clearly explains the purpose of the module and the functionality provided by the SegTransactionStats class. It effectively communicates the key capabilities, such as computing transaction-based statistics by segment and supporting both Pandas DataFrames and Ibis Tables.


    19-19: Import path update looks good.

    The updated import for ColumnHelper from pyretailscience.options is correctly implemented.

    tests/segmentation/test_segstats.py (5)

    1-321: Well-structured comprehensive test suite!

    The test file provides excellent coverage of the SegTransactionStats class functionality. The tests are well-organized into two classes that separately test the internal _calc_seg_stats method and the public-facing class functionality.


    13-151: Thorough testing of the calculation logic.

    The TestCalcSegStats class thoroughly tests the calculation functionality with various scenarios:

    • Transaction-item level calculations
    • Transaction level calculations
    • Single segment handling
    • Zero net units edge case
    • Total row exclusion option

    This comprehensive approach ensures the core calculation logic is robust.


    153-321: Effective validation of class behavior and error handling.

    The TestSegTransactionStats class effectively tests:

    • Input validation by checking errors on missing required columns
    • Multi-segment column functionality
    • Plot method restrictions
    • Extra aggregations functionality and error handling

    These tests provide confidence in the class's robustness when handling various inputs and edge cases.


    102-125: Excellent edge case handling for zero net units.

    The test for handling zero net units is particularly valuable as it ensures the code correctly handles division-by-zero scenarios, which could otherwise lead to runtime errors.


    229-289: Well-structured test for extra aggregations.

    The test_extra_aggs_functionality method effectively validates both single and multiple extra aggregations. The use of descriptive constant variables for expected values enhances readability and makes the test purpose clear.

    pyretailscience/segmentation/base.py (1)

    11-22: Consider verifying the existence of "segment_name"
    This method assumes self.df["segment_name"] always exists. If self.df is uninitialized or missing the "segment_name" column, the merge will fail. Ensure that self.df is properly populated with a "segment_name" column before calling add_segment.

    Would you like to verify the presence of "segment_name" in self.df before merging and add a safeguard for potential KeyError?

    pyretailscience/segmentation/rfm.py (2)

    87-87: Compatibility warning for Python versions < 3.11
    Using datetime.UTC can cause compatibility issues for those on older Python versions. To maintain broader support, consider using datetime.timezone.utc or specifying a minimum Python version of 3.11.


    112-126:

    ❓ Verification inconclusive

    Verify ranking logic for recency
    The docstring says “9 represents the best customer” for Recency, yet the code orders by ascending recency (lowest recency first). This might place the most recent purchasers in the lowest ntile value. Confirm the intended behavior and adjust if necessary (e.g., by reversing the ordering or reinterpreting the ntile scale).


    🏁 Script executed:

    #!/bin/bash
    # We can verify how ibis.ntile(10) assigns ranks with ascending vs descending order.
    # This script searches for usage of ntile in the codebase to confirm no contradictory logic.
    rg 'ibis\.ntile\(10\)'

    Length of output: 312


    Clarify Recency Ranking Intent

    After verifying the ntile usage in the code, it’s clear that the three metrics (recency, frequency, monetary) are partitioned using ascending sort orders. Given that a lower recency (i.e., fewer days since the last purchase) should intuitively indicate a better customer, this implementation would assign lower ntile values—contrary to the docstring which states “9 represents the best customer.” Please confirm whether the intended behavior is to have customers with the smallest recency values receive a higher ntile score (perhaps by reversing the ordering for recency or by adjusting the ntile scale) or if the docstring should be modified to reflect the current implementation.

    • File: pyretailscience/segmentation/rfm.py
    • Lines: 112-126
    • Action Needed: Verify and reconcile the mismatch between the documented best-customer score for recency and the current ascending order logic.
    pyretailscience/segmentation/threshold.py (2)

    35-47: Implement “include_with_light” zero-value option
    The code handles the "exclude" and "separate_segment" cases but not "include_with_light". This may lead to unexpected behavior if a user passes "include_with_light".


    88-90: No fallback segment for percentile values above the highest threshold
    If the highest threshold is below 1.0, customers above that threshold remain unassigned. Consider adding a final .else_() branch or requiring 1.0 as the last threshold.

    @@ -26,7 +26,12 @@ nav:
    - Haversine Distance: api/analysis/haversine.md
    - Product Association: api/analysis/product_association.md
    - Revenue Tree: api/analysis/revenue_tree.md
    - Segmentation: api/analysis/segmentation.md
    - Segmentation:
    - Base Segmentation: api/segmentation/base.md
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Does this file exist? I don't see it in the PR

    @mayurkmmt mayurkmmt force-pushed the feature/split-segmentation branch from 835a8e1 to 6a4058b Compare March 26, 2025 09:03
    Copy link

    @coderabbitai coderabbitai bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Actionable comments posted: 1

    🧹 Nitpick comments (1)
    pyretailscience/segmentation/threshold.py (1)

    53-54: No fallback for percentile values above the highest threshold
    If the largest threshold is below 1.0, some customers may not be assigned to any segment. Consider ensuring 1.0 is the final threshold or adding a final catch-all.

    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 835a8e1 and 6a4058b.

    📒 Files selected for processing (17)
    • docs/analysis_modules.md (4 hunks)
    • docs/api/segmentation/base.md (1 hunks)
    • docs/api/segmentation/hml.md (1 hunks)
    • docs/api/segmentation/rfm.md (1 hunks)
    • docs/api/segmentation/segstats.md (1 hunks)
    • docs/api/segmentation/threshold.md (1 hunks)
    • mkdocs.yml (1 hunks)
    • pyretailscience/segmentation/base.py (1 hunks)
    • pyretailscience/segmentation/hml.py (1 hunks)
    • pyretailscience/segmentation/rfm.py (1 hunks)
    • pyretailscience/segmentation/segstats.py (1 hunks)
    • pyretailscience/segmentation/threshold.py (1 hunks)
    • tests/analysis/test_segmentation.py (0 hunks)
    • tests/segmentation/test_hml.py (1 hunks)
    • tests/segmentation/test_rfm.py (1 hunks)
    • tests/segmentation/test_segstats.py (1 hunks)
    • tests/segmentation/test_threshold.py (1 hunks)
    💤 Files with no reviewable changes (1)
    • tests/analysis/test_segmentation.py
    ✅ Files skipped from review due to trivial changes (1)
    • docs/api/segmentation/base.md
    🚧 Files skipped from review as they are similar to previous changes (10)
    • docs/api/segmentation/threshold.md
    • docs/api/segmentation/segstats.md
    • docs/api/segmentation/rfm.md
    • docs/api/segmentation/hml.md
    • tests/segmentation/test_threshold.py
    • tests/segmentation/test_hml.py
    • pyretailscience/segmentation/hml.py
    • mkdocs.yml
    • tests/segmentation/test_rfm.py
    • docs/analysis_modules.md
    🧰 Additional context used
    🧬 Code Definitions (3)
    tests/segmentation/test_segstats.py (1)
    pyretailscience/segmentation/segstats.py (3)
    • SegTransactionStats (23-299)
    • df (190-204)
    • plot (206-299)
    pyretailscience/segmentation/threshold.py (3)
    pyretailscience/segmentation/base.py (1)
    • BaseSegmentation (8-34)
    pyretailscience/segmentation/rfm.py (1)
    • df (134-138)
    pyretailscience/segmentation/segstats.py (1)
    • df (190-204)
    pyretailscience/segmentation/rfm.py (2)
    pyretailscience/segmentation/threshold.py (1)
    • df (101-105)
    pyretailscience/segmentation/segstats.py (1)
    • df (190-204)
    🪛 GitHub Check: codecov/patch
    pyretailscience/segmentation/base.py

    [warning] 32-32: pyretailscience/segmentation/base.py#L32
    Added line #L32 was not covered by tests

    🔇 Additional comments (22)
    pyretailscience/segmentation/base.py (3)

    8-9: Well-structured base class
    Your docstring and class structure provide a concise and user-friendly foundation for specialized segmentation classes.


    11-22: Merge-based logic verification
    The logic for merging segment names appears clear. However, ensure that the caller sets self.df with a valid segment_name column before invoking add_segment.

    Would you confirm that the codebase has tests ensuring a ValueError is raised if self.df lacks segment_name?


    32-32: Unit test coverage recommendation
    The line raising ValueError is not covered by tests.

    🧰 Tools
    🪛 GitHub Check: codecov/patch

    [warning] 32-32: pyretailscience/segmentation/base.py#L32
    Added line #L32 was not covered by tests

    pyretailscience/segmentation/rfm.py (3)

    1-24: Comprehensive documentation
    The docstring provides a thorough overview of RFM segmentation, covering methodology and benefits clearly.


    87-87: Python version compatibility concern
    datetime.UTC is only available in Python 3.11 and later. Consider using datetime.timezone.utc for broader compatibility.

    If you intend to support Python versions below 3.11, please confirm that this usage won't break in those environments.


    133-139: Lazy-loading the DataFrame
    Your property-based approach to executing the data and caching the result is efficient. Just be mindful of memory usage for large datasets.

    pyretailscience/segmentation/threshold.py (1)

    95-96: Ensure union schema alignment
    When uniting df and zero_df, confirm both tables share the same schema. A mismatch in columns or data types might cause errors in future changes.

    pyretailscience/segmentation/segstats.py (3)

    1-10: Improved module documentation

    The module now has a clear, comprehensive docstring that accurately describes its purpose and capabilities. The documentation clearly explains that this module focuses on transaction statistics by segment, supporting both Pandas DataFrames and Ibis Tables.


    19-19: Import path updated correctly

    The import path for ColumnHelper has been updated as part of the segmentation module restructuring.


    23-24: SegTransactionStats class is properly maintained

    This class has been correctly maintained as the sole focus of this module, aligning with the PR objective of splitting segmentation logic into modular files.

    tests/segmentation/test_segstats.py (12)

    1-11: Well-structured test module

    The test module has clear imports and appropriate organization. Using ColumnHelper for consistent column access is a good practice for maintainability.


    13-28: Good test fixture setup

    The TestCalcSegStats class with the base_df fixture provides a solid foundation for testing with sample data that includes all required columns. This approach promotes code reuse across multiple tests.


    29-50: Comprehensive basic functionality test

    This test thoroughly verifies the core calculation logic, ensuring that revenue, transactions, and customers are correctly calculated per segment. The expected values are well-defined and the verification uses proper assertion methods.


    51-77: Additional test for dataframes without quantity column

    This test appropriately verifies that the class works correctly when the input dataframe doesn't include the unit quantity column, which is an important edge case to cover.


    78-101: Single segment handling test

    Good test for verifying behavior when all rows belong to the same segment, an important edge case that could otherwise lead to unexpected behavior.


    102-125: Zero net units handling test

    This test verifies correct behavior when a segment has zero net units, properly checking that:

    1. The price per unit is set to NaN (as division by zero would be undefined)
    2. The units per transaction is calculated as zero
    3. Other metrics remain unaffected

    This ensures robust handling of edge cases in real-world data.


    126-151: Total row exclusion test

    Good verification of the calc_total parameter functionality, ensuring the total row is excluded when specified and that customer percentages are adjusted correctly.


    153-164: Error handling test for missing required columns

    This test correctly verifies that the class raises an appropriate error when required columns are missing, which is crucial for providing clear feedback to users.


    165-209: Multiple segment columns test

    This test thoroughly verifies the class's ability to handle multiple segment columns, which is an important feature for complex segmentation analysis. The test correctly:

    1. Verifies that both segment columns are present in the result
    2. Checks that only actual combinations (not all possible combinations) are included
    3. Validates the aggregation calculations across the segment combinations

    210-228: Plot limitations test

    Good test to validate that the plotting functionality correctly raises an error when multiple segment columns are provided, as this is a documented limitation of the implementation.


    229-289: Extra aggregations functionality test

    Comprehensive test for the extra_aggs parameter, verifying both single and multiple additional aggregations. The test uses well-defined expected values and thorough assertions to ensure the feature works correctly.


    290-320: Error handling tests for invalid extra aggregations

    These tests verify appropriate error handling for:

    1. Invalid column names in extra aggregations
    2. Invalid aggregation functions

    Both tests ensure users receive clear error messages when configuration issues occur, improving the developer experience.

    Comment on lines +35 to +47
    zero_value_customers: Literal["separate_segment", "exclude", "include_with_light"] = "separate_segment",
    ) -> None:
    """Segments customers based on user-defined thresholds and segments.

    Args:
    df (pd.DataFrame | ibis.Table): A dataframe with the transaction data. The dataframe must contain a customer_id column.
    thresholds (List[float]): The percentile thresholds for segmentation.
    segments (List[str]): A list of segment names for each threshold.
    value_col (str, optional): The column to use for the segmentation. Defaults to get_option("column.unit_spend").
    agg_func (str, optional): The aggregation function to use when grouping by customer_id. Defaults to "sum".
    zero_segment_name (str, optional): The name of the segment for customers with zero spend. Defaults to "Zero".
    zero_value_customers (Literal["separate_segment", "exclude", "include_with_light"], optional): How to handle
    customers with zero spend. Defaults to "separate_segment".
    Copy link

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    🛠️ Refactor suggestion

    Missing handling for "include_with_light"
    The code does not handle the include_with_light option for zero spend customers. This could lead to inconsistent outcomes or errors.

    @mayurkmmt mayurkmmt merged commit c4be7f1 into main Mar 26, 2025
    3 checks passed
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    3 participants