feat: changed threshold seg to use ibis #89

mvanwyk · 2025-02-05T17:07:13Z

PR Type

Enhancement, Tests

Description

Refactored segmentation logic to use ibis for data processing.
Replaced segment_id with segment_name for simplicity.
Updated tests to align with new segmentation logic.
Added ibis-framework dependency to project configuration.

Changes walkthrough 📝

Relevant files

Tests

test_segmentation.py	Updated tests for `ibis`-based segmentation logic	+68/-138
tests/test_segmentation.py
- Replaced `segment_id` with `segment_name` in tests.
- Updated tests to use `ColumnHelper` for column references.
- Adjusted test logic to align with `ibis`-based segmentation.
- Removed redundant tests for `segment_id`.

Enhancement

segmentation.py	Refactored segmentation logic with `ibis` integration	+52/-52
pyretailscience/segmentation.py
- Refactored segmentation logic to use `ibis` for data processing.
- Removed `segment_id` and simplified segment handling.
- Added validation for unique thresholds and matching segments.
- Introduced `ibis`-specific data transformations for segmentation.

Dependencies

pyproject.toml	Added `ibis-framework` dependency	+1/-0
pyproject.toml
- Added `ibis-framework` dependency with `duckdb` extras.

Additional files

segmentation.ipynb	+116/-118

Summary by CodeRabbit

New Features
- Expanded analytical capabilities by integrating the ibis-framework for enhanced data analysis.
Refactor
- Streamlined segmentation processes by simplifying data merging and adopting clearer segment labels.
Tests
- Updated test cases to align with improved segmentation naming and enhanced column reference handling.

coderabbitai · 2025-02-05T17:07:24Z

Walkthrough

The changes add a new dependency (ibis-framework with the duckdb extra) to the project’s configuration file. Updates in the segmentation module simplify merge operations, adjust class constructors to work with different data types (including ibis.Table), and revise column naming from segment_id to segment_name. Additionally, tests have been updated to use a centralized ColumnHelper for column name references, and some redundant tests have been removed. Overall, the modifications streamline segmentation logic and improve type flexibility and error handling within the project.

Changes

File	Change Summary
`pyproject.toml`	Added dependency: `ibis-framework = {extras = ["duckdb"], version = "^9.5.0"}`.
`pyretailscience/segmentation.py`	Updated segmentation classes: In `BaseSegmentation`, merged only `segment_name`; in `ExistingSegmentation` adjusted required columns; in `ThresholdSegmentation`, expanded type support, added validations, and modified zero-value handling via ibis; in `HMLSegmentation`, segments changed from dict to list; in `SegTransactionStats`, renamed column from `segment_id` to `segment_name`.
`pyretailscience/options.py`, `tests/test_segmentation.py`	Introduced `ColumnHelper` to centralize column names; updated tests to use the helper instead of `get_option`; changed segment identifier from `segment_id` to `segment_name`; removed tests that were deemed unnecessary.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant TS as ThresholdSegmentation
    participant IbisOps as Ibis Operations
    Client->>TS: Submit df, thresholds, segments
    TS->>TS: Validate lengths & uniqueness of thresholds and segments
    TS->>IbisOps: Filter zero-value customers using ibis operations
    IbisOps-->>TS: Return filtered data
    TS->>TS: Construct case statement for segment assignment
    TS->>IbisOps: Combine segments via ibis.union
    TS-->>Client: Return segmented DataFrame
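
The flow in the diagram can be sketched in plain Python. This is only an illustration under stated assumptions, not the ibis-based code in the PR: the function name `threshold_segments`, the `(customer_id, value)` input shape, and the percentile definition (position divided by the number of non-zero customers) are all hypothetical, and the ascending-order check is an extra requirement of this sketch.

```python
from bisect import bisect_left
from collections import defaultdict


def threshold_segments(rows, thresholds, segments, zero_segment_name="Zero"):
    """Map each customer to a segment name (illustrative only).

    rows: iterable of (customer_id, value) pairs, a hypothetical input shape.
    thresholds: ascending percentile upper bounds ending at 1, e.g. [0.5, 0.8, 1].
    segments: one segment name per threshold.
    """
    # The two checks described in the walkthrough.
    if len(set(thresholds)) != len(thresholds):
        raise ValueError("The thresholds must be unique.")
    if len(thresholds) != len(segments):
        raise ValueError("The number of thresholds must match the number of segments.")
    # Extra assumption of this sketch: bisect needs ascending bounds that end at 1.
    if not thresholds or list(thresholds) != sorted(thresholds) or thresholds[-1] != 1:
        raise ValueError("This sketch expects ascending thresholds ending at 1.")

    # Aggregate the value per customer (the "sum" aggregation).
    totals = defaultdict(float)
    for customer_id, value in rows:
        totals[customer_id] += value

    # Zero-value customers get their own segment ("separate_segment" behaviour).
    assigned = {cid: zero_segment_name for cid, total in totals.items() if total == 0}
    ranked = sorted((total, cid) for cid, total in totals.items() if total != 0)

    # Pick the first threshold that covers each customer's percentile.
    for position, (_, cid) in enumerate(ranked, start=1):
        percentile = position / len(ranked)
        assigned[cid] = segments[bisect_left(thresholds, percentile)]
    return assigned


rows = [("a", 10), ("a", 5), ("b", 20), ("c", 30), ("d", 40), ("e", 0)]
result = threshold_segments(rows, [0.5, 0.8, 1], ["Light", "Medium", "Heavy"])
```

Here `bisect_left` stands in for the case statement in the diagram: it finds the first threshold that is greater than or equal to the customer's percentile.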

Possibly related PRs

Added pandas style option system and updated segmentations module to use it. #66: Both PRs modify segmentation.py. #66 introduced dynamic column-name handling through get_option, which affects the same classes and methods changed here.
feat: segment stats calc now uses duckdb to improve performance #74: #74 extended SegTransactionStats to accept a DuckDBPyRelation, and this PR adds the ibis-framework dependency with the duckdb extra, so both integrate DuckDB-backed data processing.
feat: revenue tree #88: #88 modifies ThresholdSegmentation to accept ibis.Table, and this PR adds the ibis-framework dependency, so both rely on the same library for data manipulation.

Suggested labels

enhancement, Tests, Review effort [1-5]: 4

Poem

I'm a happy rabbit, hopping with glee,
Coding through segmentation with simplicity.
I added new tricks with ibis in sight,
Column names now shine so bright.
With tests refined and logic so clear,
I bound my code with a joyful cheer! 🐇✨

qodo-merge-pro · 2025-02-05T17:07:42Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Error Handling
The empty DataFrame check was removed from ThresholdSegmentation initialization. Should validate that input data is not empty to avoid potential issues.

def __init__(
    self,
    df: pd.DataFrame | ibis.Table,
    thresholds: list[float],
    segments: dict[any, str],
    value_col: str | None = None,
    agg_func: str = "sum",
    zero_segment_name: str = "Zero",
    zero_value_customers: Literal["separate_segment", "exclude", "include_with_light"] = "separate_segment",
) -> None:
    """Segments customers based on user-defined thresholds and segments.

    Args:
        df (pd.DataFrame | ibis.Table): A dataframe with the transaction data. The dataframe must contain a customer_id column.
        thresholds (List[float]): The percentile thresholds for segmentation.
        segments (Dict[str, str]): A dictionary where keys are segment IDs and values are segment names.
        value_col (str, optional): The column to use for the segmentation. Defaults to get_option("column.unit_spend").
        agg_func (str, optional): The aggregation function to use when grouping by customer_id. Defaults to "sum".
        zero_segment_name (str, optional): The name of the segment for customers with zero spend. Defaults to "Zero".
        zero_value_customers (Literal["separate_segment", "exclude", "include_with_light"], optional): How to handle customers with zero spend. Defaults to "separate_segment".

    Raises:
        ValueError: If the dataframe is missing the columns "customer_id" or `value_col`, or these columns contain null values.
    """
    if set(thresholds) != set(thresholds):
        raise ValueError("The thresholds must be unique.")

    if len(thresholds) != len(segments):
        raise ValueError("The number of thresholds must match the number of segments.")

Data Validation
The thresholds sorting and range validation (0 to 1) was removed. Should validate thresholds are properly ordered and within valid range.

if set(thresholds) != set(thresholds):
    raise ValueError("The thresholds must be unique.")

if len(thresholds) != len(segments):
    raise ValueError("The number of thresholds must match the number of segments.")

qodo-merge-pro · 2025-02-05T17:08:04Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Validate thresholds order and range	Medium
Add validation to ensure thresholds are sorted in ascending order and start from 0. Currently thresholds could be provided in any order which may lead to incorrect segmentation.
pyretailscience/segmentation.py [98-99]
 if set(thresholds) != set(thresholds):
     raise ValueError("The thresholds must be unique.")
+if not all(x <= y for x, y in zip(thresholds, thresholds[1:])):
+    raise ValueError("The thresholds must be in ascending order.")
+if thresholds[0] != 0:
+    raise ValueError("The first threshold must be 0.")
Suggestion importance[1-10]: 8
Why: Important validation to prevent incorrect segmentation results. Unordered or improperly ranged thresholds could lead to wrong customer segments being assigned.

	Validate metric calculation logic	Medium
The spend_per_customer metric appears to be too high for the Heavy segment (>50k). Verify the calculation logic and consider adding validation to catch unreasonable values.
docs/examples/segmentation.ipynb [570-576]
 " <th>Heavy</th>\n",
 " <td>4.518138e+07</td>\n",
 " <td>8572</td>\n",
 " <td>850</td>\n",
 " <td>80558</td>\n",
-" <td>53154.567635</td>\n"
+" <td>5315.456763</td>\n" # Corrected value (example)
Suggestion importance[1-10]: 8
Why: The suggestion identifies a potentially critical data quality issue with the spend_per_customer metric, which shows an unusually high value that could indicate a calculation error affecting business insights.

	Validate non-empty input dataframe	Medium
Add validation for empty dataframes to prevent processing empty data which could lead to errors in downstream operations.
pyretailscience/segmentation.py [104-105]
 if isinstance(df, pd.DataFrame):
+    if df.empty:
+        raise ValueError("Input DataFrame is empty")
     df: ibis.Table = ibis.memtable(df)
Suggestion importance[1-10]: 7
Why: Critical validation to prevent processing empty dataframes which could cause errors or invalid results in downstream segmentation logic.
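
The ordering and range checks raised above could be collected into one helper. The following is a minimal sketch, assuming thresholds are percentile bounds in the 0 to 1 range mentioned in the reviewer guide; `validate_thresholds` is a hypothetical name and is not part of pyretailscience. It does not require the first threshold to be 0, since the PR itself does not state that convention.

```python
def validate_thresholds(thresholds, segments):
    """Raise ValueError when thresholds and segments are inconsistent.

    Hypothetical helper for illustration only.
    """
    if not thresholds:
        raise ValueError("At least one threshold is required.")
    if len(set(thresholds)) != len(thresholds):
        raise ValueError("The thresholds must be unique.")
    if len(thresholds) != len(segments):
        raise ValueError("The number of thresholds must match the number of segments.")
    # Assumed range: percentile bounds between 0 and 1 inclusive.
    if any(not 0 <= t <= 1 for t in thresholds):
        raise ValueError("The thresholds must be between 0 and 1.")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("The thresholds must be in ascending order.")


def is_rejected(thresholds, segments):
    """Return True when validate_thresholds raises ValueError."""
    try:
        validate_thresholds(thresholds, segments)
    except ValueError:
        return True
    return False
```

Raising early keeps invalid input out of the query that assigns segments, where an out-of-order threshold would otherwise produce wrong labels without any error.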

Copilot

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

pyretailscience/segmentation.py:76

Changing the segments parameter from a dictionary to a list could cause issues if the rest of the codebase expects segments to be a dictionary. Please ensure that this change does not break any existing functionality.

segments: dict[any, str],

pyretailscience/segmentation.py:80

The zero_segment_id parameter was removed. Please ensure that this change does not break any existing functionality.

zero_segment_id: str = "Z",

codecov · 2025-02-05T17:10:14Z

Codecov Report

Attention: Patch coverage is 86.11111% with 5 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
pyretailscience/segmentation.py	86.11%	3 Missing and 2 partials ⚠️

Files with missing lines	Coverage Δ
pyretailscience/segmentation.py	`62.20% <86.11%> (+2.04%)`	⬆️

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

tests/test_segmentation.py (2)
6-9: Consider completing the migration to ColumnHelper.

The code imports both get_option and ColumnHelper, but get_option is still used in some places. For consistency, consider migrating all column references to use ColumnHelper.

32-42: Maintain consistency in column references.

Line 40 still uses get_option while surrounding code uses ColumnHelper. For consistency, consider updating this to use ColumnHelper.
-                f"customers_{get_option('column.suffix.percent')}": [0.6, 0.4, 1.0],
+                f"customers_{cols.suffix_percent}": [0.6, 0.4, 1.0],
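
To show why the two spellings in the diff are interchangeable, here is a rough sketch of how a helper like `ColumnHelper` might wrap `get_option`. The option store, the placeholder values `"unit_spend"` and `"pct"`, and the class body are assumptions made for illustration; only the option keys and the `suffix_percent` attribute name come from this PR.

```python
# Hypothetical stand-in for the library's option store; the values are placeholders.
_OPTIONS = {
    "column.unit_spend": "unit_spend",
    "column.suffix.percent": "pct",
}


def get_option(key):
    """Look up a configured column name by its option key."""
    return _OPTIONS[key]


class ColumnHelper:
    """Resolve option keys once and expose them as plain attributes."""

    def __init__(self):
        self.unit_spend = get_option("column.unit_spend")
        self.suffix_percent = get_option("column.suffix.percent")


cols = ColumnHelper()
old_style = f"customers_{get_option('column.suffix.percent')}"
new_style = f"customers_{cols.suffix_percent}"
```

Both expressions build the same column name; the helper form avoids repeating string keys throughout the tests.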
pyretailscience/segmentation.py (1)
70-150: Well-structured ibis integration with room for optimization.

The changes effectively integrate ibis for data processing with proper validation. However, the threshold uniqueness check can be simplified.
-        if set(thresholds) != set(thresholds):
+        if len(set(thresholds)) != len(thresholds):
The current check set(thresholds) != set(thresholds) will always evaluate to False. The suggested change correctly identifies duplicate thresholds by comparing the length of the unique values to the original length.
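
A quick way to see the difference between the two checks, using a hypothetical list of thresholds that contains a duplicate:

```python
def original_check(thresholds):
    """The check as written in the PR: compares a set with itself."""
    return set(thresholds) != set(thresholds)


def suggested_check(thresholds):
    """The suggested check: duplicates shrink the set below the list length."""
    return len(set(thresholds)) != len(thresholds)


with_duplicate = [0.5, 0.5, 1.0]
missed = original_check(with_duplicate)
caught = suggested_check(with_duplicate)
```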

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f8346b and dcec1fb.

⛔ Files ignored due to path filters (1)

poetry.lock is excluded by !**/*.lock

📒 Files selected for processing (3)

pyproject.toml (1 hunks)
pyretailscience/segmentation.py (5 hunks)
tests/test_segmentation.py (14 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Pre-Commit

🔇 Additional comments (5)

tests/test_segmentation.py (2)

20-24: LGTM!

The test data setup correctly uses ColumnHelper for column references and aligns with the segment_name changes.

111-283: LGTM!

The test cases thoroughly cover the ThresholdSegmentation functionality including:

Basic segmentation

Edge cases (single customer, zero values)

Error cases (invalid thresholds, segments)

The changes align well with the ibis integration.

pyretailscience/segmentation.py (2)

177-179: LGTM!

The thresholds and segments are correctly defined as lists and align with the industry standard definition (80/20 rule) mentioned in the docstring.

192-192: LGTM!

The default segment_col change to "segment_name" maintains consistency with the rest of the codebase.
pyproject.toml (1)

23-23: Verify ibis-framework version compatibility.

The addition of ibis-framework with duckdb extra is appropriate for the changes. Let's verify the version compatibility.
✅ Verification successful

🌐 Web query:
What is the latest version of ibis-framework and are there any known compatibility issues with version 9.5.0?
Length of output: 3764

Ibis-framework version compatibility verified

The web query confirms that version 9.5.0 is the latest release (September 11, 2024) and that it is appropriate for the recent changes. The only noted issue is a Windows installation problem via Poetry (requiring long path support), which is a known workaround rather than a compatibility blocker.

No modifications are needed in the pyproject.toml.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

tests/test_segmentation.py (1)

289-297: Consider adding more test cases for error handling.

The test for empty DataFrame with errors is good, but consider adding test cases for:

Invalid segment names

Duplicate segment names

Edge cases with zero-value customers
pyretailscience/segmentation.py (2)
70-151: Excellent implementation of ibis integration.

The refactoring to use ibis operations improves performance and maintainability. The use of window functions and case statements is particularly well done.

Consider adding caching for the computed DataFrame to improve performance for repeated access.
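
One way the caching suggestion could look is with `functools.cached_property` from the standard library. This is a sketch only: `LazySegmentation` and its `compute_calls` counter are hypothetical, and sorting a list stands in for the expensive query that would materialise the DataFrame.

```python
from functools import cached_property


class LazySegmentation:
    """Hypothetical class that computes its result once, on first access."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.compute_calls = 0  # counts how often the expensive step runs

    @cached_property
    def df(self):
        # Stand-in for executing the segmentation query.
        self.compute_calls += 1
        return sorted(self._rows)


seg = LazySegmentation([3, 1, 2])
first = seg.df
second = seg.df  # served from the cache, no recomputation
```

`cached_property` stores the result on the instance, so the cached value lives as long as the object and is not refreshed if the underlying data changes.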

192-201: Consider adding type hints for return values.

Add return type hints to improve code clarity:
-    def __init__(self, data: pd.DataFrame | DuckDBPyRelation, segment_col: str = "segment_name") -> None:
+    def __init__(self, data: pd.DataFrame | DuckDBPyRelation, segment_col: str = "segment_name") -> None:

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dcec1fb and dfe7dea.

⛔ Files ignored due to path filters (1)

poetry.lock is excluded by !**/*.lock

📒 Files selected for processing (3)

pyproject.toml (1 hunks)
pyretailscience/segmentation.py (5 hunks)
tests/test_segmentation.py (14 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

pyproject.toml

🧰 Additional context used

🪛 GitHub Check: codecov/patch

pyretailscience/segmentation.py

[warning] 57-58: pyretailscience/segmentation.py#L57-L58
Added lines #L57 - L58 were not covered by tests

[warning] 64-64: pyretailscience/segmentation.py#L64
Added line #L64 was not covered by tests

🔇 Additional comments (5)

tests/test_segmentation.py (2)

6-10: LGTM! Good refactoring to use ColumnHelper.

The transition from using get_option to ColumnHelper for column references improves code readability and maintainability by centralizing column name definitions.

20-24: Verify test coverage for segment name changes.

The test case correctly validates the segment name calculations, but we should ensure all edge cases are covered.

Run the following script to check test coverage for segment name handling:

Also applies to: 32-37
✅ Verification successful

🏁 Script executed:
#!/bin/bash
# Description: Check test coverage for segment name handling

# Search for test cases involving segment name
rg -A 5 'segment_name.*=.*' tests/
Length of output: 4220

Test coverage for segment name edge cases is comprehensive.
The test cases in tests/test_segmentation.py validate multiple scenarios including:

Standard segmentation with labels like "A", "Total", "High", and "Low".

Single-customer segmentation, ensuring labels like "Heavy" and "Light" are correctly applied.

Handling of zero spend customers with both "include_with_light" and "separate_segment" settings, covering outcomes such as "Heavy", "Light", "Medium", and "Zero".

Error handling when required columns are missing.

Given these checks across different configurations, the edge cases for segment name changes appear to be well covered.
pyretailscience/segmentation.py (3)

5-14: LGTM! Good dependency management.

The addition of ibis framework and reorganization of imports improves code structure.

98-103: Good addition of validation checks.

The validation for unique thresholds and matching segments improves robustness.

57-64: Add test coverage for ExistingSegmentation.

Static analysis indicates these lines are not covered by tests.

Would you like me to generate test cases for the ExistingSegmentation class to improve coverage?

mvanwyk requested a review from Copilot February 5, 2025 17:07

mvanwyk self-assigned this Feb 5, 2025

qodo-merge-pro bot added the Review effort [1-5]: 3 label Feb 5, 2025

Copilot AI reviewed Feb 5, 2025

coderabbitai bot reviewed Feb 5, 2025

feat: changed threshold seg to use ibis

dfe7dea

mvanwyk force-pushed the ibis_seg_agg branch from dcec1fb to dfe7dea Compare February 5, 2025 17:13

coderabbitai bot reviewed Feb 5, 2025

mvanwyk merged commit f4b4825 into main Feb 5, 2025
4 checks passed

This was referenced Feb 9, 2025

feat: convert column names to use the options class #91

Merged

refactor with ibis #95

Merged

This was referenced Mar 13, 2025

Make the threshold segmentation compatible with MS SQL Server #136

Merged

feat: add extra_aggs parameter to SegTransactionStats #138

Merged

RFM Segmentation #140

Merged

This was referenced Mar 20, 2025

Ibis upgrade #146

Merged

Split Segmentation #154

Merged

coderabbitai bot mentioned this pull request Mar 28, 2025

Segmentation #157

Merged

mvanwyk deleted the ibis_seg_agg branch April 23, 2025 12:57
