feat: changed threshold seg to use ibis #89

Merged
merged 1 commit into from
Feb 5, 2025

Contributor

@mvanwyk mvanwyk commented Feb 5, 2025

PR Type

Enhancement, Tests


Description

  • Refactored segmentation logic to use ibis for data processing.

  • Replaced segment_id with segment_name for simplicity.

  • Updated tests to align with new segmentation logic.

  • Added ibis-framework dependency to project configuration.


Changes walkthrough 📝

Relevant files

Tests
  • tests/test_segmentation.py (+68/-138): Updated tests for `ibis`-based segmentation logic
    • Replaced segment_id with segment_name in tests.
    • Updated tests to use ColumnHelper for column references.
    • Adjusted test logic to align with ibis-based segmentation.
    • Removed redundant tests for segment_id.

Enhancement
  • pyretailscience/segmentation.py (+52/-52): Refactored segmentation logic with `ibis` integration
    • Refactored segmentation logic to use ibis for data processing.
    • Removed segment_id and simplified segment handling.
    • Added validation for unique thresholds and matching segments.
    • Introduced ibis-specific data transformations for segmentation.

Dependencies
  • pyproject.toml (+1/-0): Added `ibis-framework` dependency
    • Added ibis-framework dependency with duckdb extras.

Additional files
  • segmentation.ipynb (+116/-118)

Summary by CodeRabbit

  • New Features
    • Expanded analytical capabilities by integrating the ibis-framework for enhanced data analysis.
  • Refactor
    • Streamlined segmentation processes by simplifying data merging and adopting clearer segment labels.
  • Tests
    • Updated test cases to align with improved segmentation naming and enhanced column reference handling.

    @mvanwyk mvanwyk requested a review from Copilot February 5, 2025 17:07
    @mvanwyk mvanwyk self-assigned this Feb 5, 2025
    coderabbitai bot commented Feb 5, 2025

    Walkthrough

    The changes add a new dependency (ibis-framework with the duckdb extra) to the project’s configuration file. Updates in the segmentation module simplify merge operations, adjust class constructors to work with different data types (including ibis.Table), and revise column naming from segment_id to segment_name. Additionally, tests have been updated to use a centralized ColumnHelper for column name references, and some redundant tests have been removed. Overall, the modifications streamline segmentation logic and improve type flexibility and error handling within the project.

    Changes

    | File | Change Summary |
    | --- | --- |
    | pyproject.toml | Added dependency: ibis-framework = {extras = ["duckdb"], version = "^9.5.0"}. |
    | pyretailscience/segmentation.py | Updated segmentation classes: In BaseSegmentation, merged only segment_name; in ExistingSegmentation adjusted required columns; in ThresholdSegmentation, expanded type support, added validations, and modified zero-value handling via ibis; in HMLSegmentation, segments changed from dict to list; in SegTransactionStats, renamed column from segment_id to segment_name. |
    | pyretailscience/options.py, tests/test_segmentation.py | Introduced ColumnHelper to centralize column names; updated tests to use the helper instead of get_option; changed segment identifier from segment_id to segment_name; removed tests that were deemed unnecessary. |

    Sequence Diagram(s)

    sequenceDiagram
        participant Client
        participant TS as ThresholdSegmentation
        participant IbisOps as Ibis Operations
        Client->>TS: Submit df, thresholds, segments
        TS->>TS: Validate lengths & uniqueness of thresholds and segments
        TS->>IbisOps: Filter zero-value customers using ibis operations
        IbisOps-->>TS: Return filtered data
        TS->>TS: Construct case statement for segment assignment
        TS->>IbisOps: Combine segments via ibis.union
        TS-->>Client: Return segmented DataFrame
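
The flow in the diagram can be sketched in plain Python. This is a minimal stand-in, not the ibis implementation from the PR: the function name threshold_segment, the (customer_id, value) row format, and the SQL-style percent rank are assumptions made for illustration, and only the "separate_segment" handling of zero-value customers is modelled.

```python
from bisect import bisect_left
from collections import defaultdict


def threshold_segment(rows, thresholds, segments, zero_segment_name="Zero"):
    """Map each customer to a segment name (hypothetical helper, not the PR's API).

    rows: iterable of (customer_id, value) pairs, one per transaction.
    """
    # Validate uniqueness and matching lengths, as in the diagram.
    if len(set(thresholds)) != len(thresholds):
        raise ValueError("The thresholds must be unique.")
    if len(thresholds) != len(segments):
        raise ValueError("The number of thresholds must match the number of segments.")

    # Aggregate the value per customer (the "sum" aggregation).
    totals = defaultdict(float)
    for customer_id, value in rows:
        totals[customer_id] += value

    # Rank only the non-zero customers; zero-value customers are handled separately.
    non_zero = sorted(v for v in totals.values() if v != 0)
    n = len(non_zero)

    result = {}
    for customer_id, total in totals.items():
        if total == 0:
            result[customer_id] = zero_segment_name
            continue
        # Percent rank as (rank - 1) / (n - 1); a single customer gets 0.0.
        ptile = bisect_left(non_zero, total) / (n - 1) if n > 1 else 0.0
        # The first threshold that covers the rank wins, like a CASE statement.
        result[customer_id] = next(
            (name for threshold, name in zip(thresholds, segments) if ptile <= threshold),
            None,
        )
    return result


rows = [(1, 100), (2, 200), (3, 0), (4, 300), (1, 50)]
print(threshold_segment(rows, [0.5, 0.8, 1.0], ["Light", "Medium", "Heavy"]))
```

A customer whose rank exceeds every threshold is mapped to None in this sketch; how the PR treats that case is not shown in the thread.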
    

    Possibly related PRs

    • Added pandas style option system and updated segmentations module to use it. #66: The changes in the main PR are related to the retrieved PR as both involve modifications to the segmentation.py file, specifically integrating dynamic column name handling through the get_option function, which directly affects the same classes and methods.
    • feat: segment stats calc now uses duckdb to improve performance #74: The changes in the main PR, which add the ibis-framework dependency, are related to the modifications in the retrieved PR that enhance the SegTransactionStats class to accept DuckDBPyRelation, as both involve integrating data processing capabilities with DuckDB for improved analytics.
    • feat: revenue tree #88: The changes in the main PR, which involve adding a new dependency for the ibis-framework, are related to the modifications in the retrieved PR that enhance the ThresholdSegmentation class to accept ibis.Table, indicating a direct connection in terms of utilizing the same library for data manipulation.

    Suggested labels

    enhancement, Tests, Review effort [1-5]: 4

    Poem

    I'm a happy rabbit, hopping with glee,
    Coding through segmentation with simplicity.
    I added new tricks with ibis in sight,
    Column names now shine so bright.
    With tests refined and logic so clear,
    I bound my code with a joyful cheer! 🐇✨

    Contributor

    qodo-merge-pro bot commented Feb 5, 2025

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Error Handling

    The empty DataFrame check was removed from ThresholdSegmentation initialization. The constructor should validate that the input data is not empty to avoid potential issues.

    def __init__(
        self,
        df: pd.DataFrame | ibis.Table,
        thresholds: list[float],
        segments: dict[any, str],
        value_col: str | None = None,
        agg_func: str = "sum",
        zero_segment_name: str = "Zero",
        zero_value_customers: Literal["separate_segment", "exclude", "include_with_light"] = "separate_segment",
    ) -> None:
        """Segments customers based on user-defined thresholds and segments.
    
        Args:
            df (pd.DataFrame | ibis.Table): A dataframe with the transaction data. The dataframe must contain a customer_id column.
            thresholds (List[float]): The percentile thresholds for segmentation.
            segments (Dict[str, str]): A dictionary where keys are segment IDs and values are segment names.
            value_col (str, optional): The column to use for the segmentation. Defaults to get_option("column.unit_spend").
            agg_func (str, optional): The aggregation function to use when grouping by customer_id. Defaults to "sum".
            zero_segment_name (str, optional): The name of the segment for customers with zero spend. Defaults to "Zero".
            zero_value_customers (Literal["separate_segment", "exclude", "include_with_light"], optional): How to handle
                customers with zero spend. Defaults to "separate_segment".
    
        Raises:
            ValueError: If the dataframe is missing the columns "customer_id" or `value_col`, or these columns contain
                null values.
        """
        if set(thresholds) != set(thresholds):
            raise ValueError("The thresholds must be unique.")
    
        if len(thresholds) != len(segments):
            raise ValueError("The number of thresholds must match the number of segments.")
    Data Validation

    The sorting and range validation (0 to 1) for thresholds was removed. The constructor should validate that thresholds are properly ordered and within the valid range.

    if set(thresholds) != set(thresholds):
        raise ValueError("The thresholds must be unique.")
    
    if len(thresholds) != len(segments):
        raise ValueError("The number of thresholds must match the number of segments.")

    Contributor

    qodo-merge-pro bot commented Feb 5, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    Category: Possible issue

    Validate thresholds order and range

    Add validation to ensure thresholds are sorted in ascending order and start from
    0. Currently thresholds could be provided in any order which may lead to
    incorrect segmentation.

    pyretailscience/segmentation.py [98-99]

     if set(thresholds) != set(thresholds):
         raise ValueError("The thresholds must be unique.")
    +if not all(x <= y for x, y in zip(thresholds, thresholds[1:])):
    +    raise ValueError("The thresholds must be in ascending order.")
    +if thresholds[0] != 0:
    +    raise ValueError("The first threshold must be 0.")
    Suggestion importance[1-10]: 8

    Why: Important validation to prevent incorrect segmentation results. Unordered or improperly ranged thresholds could lead to wrong customer segments being assigned.

    Impact: Medium

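
A minimal sketch of the ordering and range checks discussed above. The function name validate_threshold_order_and_range is hypothetical, and the 0 to 1 range comes from the reviewer guide rather than from code shown in the PR. It leaves out the "first threshold must be 0" check from the suggestion, since the thread does not show the maintainers adopting it.

```python
def validate_threshold_order_and_range(thresholds):
    """Raise ValueError unless thresholds ascend and stay within [0, 1]."""
    # Compare each threshold with its successor to detect out-of-order values.
    if any(x > y for x, y in zip(thresholds, thresholds[1:])):
        raise ValueError("The thresholds must be in ascending order.")
    # Percentile thresholds are assumed to be fractions between 0 and 1.
    if any(not 0 <= t <= 1 for t in thresholds):
        raise ValueError("The thresholds must be between 0 and 1.")


validate_threshold_order_and_range([0.5, 0.8, 1.0])  # passes silently
```
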
    Validate metric calculation logic

    The spend_per_customer metric appears to be too high for the Heavy segment
    (>50k). Verify the calculation logic and consider adding validation to catch
    unreasonable values.

    docs/examples/segmentation.ipynb [570-576]

     "      <th>Heavy</th>\n",
     "      <td>4.518138e+07</td>\n",
     "      <td>8572</td>\n",
     "      <td>850</td>\n",
     "      <td>80558</td>\n",
    -"      <td>53154.567635</td>\n"
    +"      <td>5315.456763</td>\n"  # Corrected value (example)
    Suggestion importance[1-10]: 8

    Why: The suggestion identifies a potentially critical data quality issue with the spend_per_customer metric, which shows an unusually high value that could indicate a calculation error affecting business insights.

    Impact: Medium

    Validate non-empty input dataframe

    Add validation for empty dataframes to prevent processing empty data which could
    lead to errors in downstream operations.

    pyretailscience/segmentation.py [104-105]

     if isinstance(df, pd.DataFrame):
    +    if df.empty:
    +        raise ValueError("Input DataFrame is empty")
         df: ibis.Table = ibis.memtable(df)
    Suggestion importance[1-10]: 7

    Why: Critical validation to prevent processing empty dataframes which could cause errors or invalid results in downstream segmentation logic.

    Impact: Medium

    Contributor

    @Copilot Copilot AI left a comment

    Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

    Comments suppressed due to low confidence (2)

    pyretailscience/segmentation.py:76

    • Changing the segments parameter from a dictionary to a list could cause issues if the rest of the codebase expects segments to be a dictionary. Please ensure that this change does not break any existing functionality.
    segments: dict[any, str],
    

    pyretailscience/segmentation.py:80

    • The zero_segment_id parameter was removed. Please ensure that this change does not break any existing functionality.
    zero_segment_id: str = "Z",
    

    codecov bot commented Feb 5, 2025

    Codecov Report

    Attention: Patch coverage is 86.11111% with 5 lines in your changes missing coverage. Please review.

    | Files with missing lines | Patch % | Lines |
    | --- | --- | --- |
    | pyretailscience/segmentation.py | 86.11% | 3 Missing and 2 partials ⚠️ |

    | Files with missing lines | Coverage Δ |
    | --- | --- |
    | pyretailscience/segmentation.py | 62.20% <86.11%> (+2.04%) ⬆️ |

    @coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 0

    🧹 Nitpick comments (3)
    tests/test_segmentation.py (2)

    6-9: Consider completing the migration to ColumnHelper.

    The code imports both get_option and ColumnHelper, but get_option is still used in some places. For consistency, consider migrating all column references to use ColumnHelper.


    32-42: Maintain consistency in column references.

    Line 40 still uses get_option while surrounding code uses ColumnHelper. For consistency, consider updating this to use ColumnHelper.

    -                f"customers_{get_option('column.suffix.percent')}": [0.6, 0.4, 1.0],
    +                f"customers_{cols.suffix_percent}": [0.6, 0.4, 1.0],
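
As a rough illustration of what centralizing column names buys, here is a self-contained sketch. It is not the project's ColumnHelper: the _OPTIONS dictionary, the stand-in get_option, the option values, and the "column.customer_id" key are assumptions; only the cols.suffix_percent attribute and the 'column.suffix.percent' and "column.unit_spend" option names appear in this thread.

```python
# Hypothetical option store; the real project resolves these through its own get_option.
_OPTIONS = {
    "column.customer_id": "customer_id",
    "column.unit_spend": "unit_spend",
    "column.suffix.percent": "pct",
}


def get_option(name):
    """Stand-in for the project's option lookup (assumed behaviour)."""
    return _OPTIONS[name]


class ColumnHelper:
    """Resolve option-backed column names once and expose them as attributes."""

    def __init__(self):
        self.customer_id = get_option("column.customer_id")
        self.unit_spend = get_option("column.unit_spend")
        self.suffix_percent = get_option("column.suffix.percent")


cols = ColumnHelper()
# Both spellings produce the same column name; the attribute form is shorter to read.
print(f"customers_{cols.suffix_percent}")
print(f"customers_{get_option('column.suffix.percent')}")
```
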
    pyretailscience/segmentation.py (1)

    70-150: Well-structured ibis integration with room for optimization.

    The changes effectively integrate ibis for data processing with proper validation. However, the threshold uniqueness check never fires and should be corrected.

    -        if set(thresholds) != set(thresholds):
    +        if len(set(thresholds)) != len(thresholds):

    The current check set(thresholds) != set(thresholds) will always evaluate to False. The suggested change correctly identifies duplicate thresholds by comparing the length of the unique values to the original length.
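
The difference between the two checks is easy to confirm in isolation. The helper names below are made up for the comparison and are not part of the PR.

```python
def has_duplicates_buggy(thresholds):
    # A set always equals a set built from the same list, so this is never True.
    return set(thresholds) != set(thresholds)


def has_duplicates(thresholds):
    # A set drops repeated values, so a shorter set means duplicates exist.
    return len(set(thresholds)) != len(thresholds)


duplicated = [0.5, 0.5, 1.0]
print(has_duplicates_buggy(duplicated))  # False: the duplicate goes unnoticed
print(has_duplicates(duplicated))  # True
```
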

    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 4f8346b and dcec1fb.

    ⛔ Files ignored due to path filters (1)
    • poetry.lock is excluded by !**/*.lock
    📒 Files selected for processing (3)
    • pyproject.toml (1 hunks)
    • pyretailscience/segmentation.py (5 hunks)
    • tests/test_segmentation.py (14 hunks)
    ⏰ Context from checks skipped due to timeout of 90000ms (1)
    • GitHub Check: Pre-Commit
    🔇 Additional comments (5)
    tests/test_segmentation.py (2)

    20-24: LGTM!

    The test data setup correctly uses ColumnHelper for column references and aligns with the segment_name changes.


    111-283: LGTM!

    The test cases thoroughly cover the ThresholdSegmentation functionality including:

    • Basic segmentation
    • Edge cases (single customer, zero values)
    • Error cases (invalid thresholds, segments)
    • The changes align well with the ibis integration.
    pyretailscience/segmentation.py (2)

    177-179: LGTM!

    The thresholds and segments are correctly defined as lists and align with the industry standard definition (80/20 rule) mentioned in the docstring.


    192-192: LGTM!

    The default segment_col change to "segment_name" maintains consistency with the rest of the codebase.

    pyproject.toml (1)

    23-23: Verify ibis-framework version compatibility.

    The addition of ibis-framework with duckdb extra is appropriate for the changes. Let's verify the version compatibility.

    ✅ Verification successful

    🌐 Web query:

    What is the latest version of ibis-framework and are there any known compatibility issues with version 9.5.0?
    

    Length of output: 3764


    Ibis-framework version compatibility verified

    The web query confirms that version 9.5.0 is the latest release (September 11, 2024) and that it is appropriate for the recent changes. The only noted issue is a Windows installation problem via Poetry, which has a known workaround (enabling long path support) and is not a compatibility blocker.

    No modifications are needed in the pyproject.toml.

    @coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 0

    🧹 Nitpick comments (3)
    tests/test_segmentation.py (1)

    289-297: Consider adding more test cases for error handling.

    The test for empty DataFrame with errors is good, but consider adding test cases for:

    • Invalid segment names
    • Duplicate segment names
    • Edge cases with zero-value customers
    pyretailscience/segmentation.py (2)

    70-151: Excellent implementation of ibis integration.

    The refactoring to use ibis operations improves performance and maintainability. The use of window functions and case statements is particularly well done.

    Consider adding caching for the computed DataFrame to improve performance for repeated access.
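
One way to read the caching suggestion is a lazily computed, memoized property. The sketch below uses functools.cached_property from the standard library; the class name SegmentationSketch and the compute callable are hypothetical, and the thread does not show how the project materializes its DataFrame.

```python
from functools import cached_property


class SegmentationSketch:
    """Hypothetical class illustrating a cached, lazily computed result."""

    def __init__(self, compute):
        # compute is an assumed zero-argument callable that builds the result,
        # for example by executing an ibis expression.
        self._compute = compute

    @cached_property
    def df(self):
        # Runs once on first access; later accesses reuse the stored value.
        return self._compute()


calls = []
seg = SegmentationSketch(lambda: calls.append("run") or [1, 2, 3])
seg.df
seg.df
print(len(calls))  # 1: the computation ran only once
```
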


    192-201: Consider adding type hints for return values.

    Add return type hints to improve code clarity:

    -    def __init__(self, data: pd.DataFrame | DuckDBPyRelation, segment_col: str = "segment_name") -> None:
    +    def __init__(self, data: pd.DataFrame | DuckDBPyRelation, segment_col: str = "segment_name") -> None:
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between dcec1fb and dfe7dea.

    ⛔ Files ignored due to path filters (1)
    • poetry.lock is excluded by !**/*.lock
    📒 Files selected for processing (3)
    • pyproject.toml (1 hunks)
    • pyretailscience/segmentation.py (5 hunks)
    • tests/test_segmentation.py (14 hunks)
    🚧 Files skipped from review as they are similar to previous changes (1)
    • pyproject.toml
    🧰 Additional context used
    🪛 GitHub Check: codecov/patch
    pyretailscience/segmentation.py

    [warning] 57-58: pyretailscience/segmentation.py#L57-L58
    Added lines #L57 - L58 were not covered by tests


    [warning] 64-64: pyretailscience/segmentation.py#L64
    Added line #L64 was not covered by tests

    🔇 Additional comments (5)
    tests/test_segmentation.py (2)

    6-10: LGTM! Good refactoring to use ColumnHelper.

    The transition from using get_option to ColumnHelper for column references improves code readability and maintainability by centralizing column name definitions.


    20-24: Verify test coverage for segment name changes.

    The test case correctly validates the segment name calculations, but we should ensure all edge cases are covered.

    Run the following script to check test coverage for segment name handling:

    Also applies to: 32-37

    ✅ Verification successful

    🏁 Script executed:

    #!/bin/bash
    # Description: Check test coverage for segment name handling
    
    # Search for test cases involving segment name
    rg -A 5 'segment_name.*=.*' tests/

    Length of output: 4220


    Test coverage for segment name edge cases is comprehensive.
    The test cases in tests/test_segmentation.py validate multiple scenarios including:

    • Standard segmentation with labels like "A", "Total", "High", and "Low".
    • Single-customer segmentation, ensuring labels like "Heavy" and "Light" are correctly applied.
    • Handling of zero spend customers with both "include_with_light" and "separate_segment" settings, covering outcomes such as "Heavy", "Light", "Medium", and "Zero".
    • Error handling when required columns are missing.

    Given these checks across different configurations, the edge cases for segment name changes appear to be well covered.

    pyretailscience/segmentation.py (3)

    5-14: LGTM! Good dependency management.

    The addition of ibis framework and reorganization of imports improves code structure.


    98-103: Good addition of validation checks.

    The validation for unique thresholds and matching segments improves robustness.


    57-64: Add test coverage for ExistingSegmentation.

    Static analysis indicates these lines are not covered by tests.

    Would you like me to generate test cases for the ExistingSegmentation class to improve coverage?

    @mvanwyk mvanwyk merged commit f4b4825 into main Feb 5, 2025
    4 checks passed
    This was referenced Mar 20, 2025
    @coderabbitai coderabbitai bot mentioned this pull request Mar 28, 2025
    @mvanwyk mvanwyk deleted the ibis_seg_agg branch April 23, 2025 12:57