Split Segmentation #154

mayurkmmt · 2025-03-25T13:29:42Z

User description

feat: split segmentations.py file

PR Type

Enhancement, Documentation, Tests

Description

Refactored segmentation logic into modular files for better organization.
- Created base.py for base segmentation functionality.
- Added hml.py for HML segmentation logic.
- Added rfm.py for RFM segmentation logic.
- Added threshold.py for threshold-based segmentation.
- Added segstats.py for transaction statistics by segment.
Updated documentation to reflect new segmentation module structure.
- Added API documentation for HMLSegmentation, RFMSegmentation, ThresholdSegmentation, and SegTransactionStats.
- Updated examples in analysis_modules.md to use new imports.
Added comprehensive unit tests for all segmentation modules.
- Created new test files for HMLSegmentation, RFMSegmentation, ThresholdSegmentation, and SegTransactionStats.
Updated mkdocs.yml to include new segmentation API documentation.

Changes walkthrough 📝

Relevant files

Enhancement

5 files

base.py `Added base class for segmentation logic`	+34/-0
hml.py `Added HML segmentation logic`	+49/-0
rfm.py `Added RFM segmentation logic`	+143/-0
threshold.py `Added threshold-based segmentation logic`	+105/-0
segstats.py `Added transaction statistics by segment logic`	+11/-288

Tests

4 files

test_hml.py `Added unit tests for HML segmentation`	+88/-0
test_rfm.py `Added unit tests for RFM segmentation`	+160/-0
test_threshold.py `Added unit tests for threshold segmentation`	+187/-0
test_segstats.py `Added unit tests for transaction statistics by segment`	+320/-0

Documentation

6 files

analysis_modules.md `Updated examples to reflect new segmentation module structure`	+5/-4
hml.md `Added API documentation for HML segmentation`	+3/-0
rfm.md `Added API documentation for RFM segmentation`	+3/-0
segstats.md `Added API documentation for SegTransactionStats`	+3/-0
threshold.md `Added API documentation for threshold segmentation`	+3/-0
mkdocs.yml `Updated navigation to include new segmentation API documentation`	+5/-1

Additional files

2 files

__init__.py	[link]
test_segmentation.py	+0/-733

Need help?
Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
Check out the documentation for more information.

Summary by CodeRabbit

New Features
- Introduced enhanced customer segmentation capabilities with flexible approaches based on spending behavior and recency-frequency-monetary analysis.
- Added new segmentation classes for HML, RFM, and Threshold analysis.
- Introduced a new BaseSegmentation class for foundational segmentation functionality.
Documentation
- Expanded segmentation guides with dedicated sections, offering improved navigation and clearer insights.
- Created new documentation files for HML, RFM, SegTransactionStats, and Threshold segmentation.
- Added a new section for Base Segmentation.
Refactor
- Streamlined the segmentation module structure and updated integration points to support a more maintainable and scalable API.
Tests
- Implemented comprehensive test suites for HML, RFM, Threshold, and SegTransactionStats classes to validate segmentation logic and ensure robust error handling across scenarios.
- Added new test classes for calculating segment statistics and validating segmentation outputs.

coderabbitai · 2025-03-25T13:29:53Z

Walkthrough

The changes reorganize the segmentation modules in the package. Import paths for segmentation classes (HML, Threshold, SegTransactionStats, RFMSegmentation) are updated, and several new modules are introduced. Documentation is enhanced with new API files and navigation restructuring in the mkdocs configuration. Additionally, new test suites replace the deprecated segmentation tests, and unused segmentation classes are removed from the segmentation statistics module.

Changes

File(s)	Change Summary
`docs/analysis_modules.md`	Updated import paths for segmentation classes (HML, Threshold, SegTransactionStats, RFMSegmentation).
`docs/api/segmentation/{hml.md, rfm.md, segstats.md, threshold.md}`	New documentation files for HML, RFM, SegTransactionStats, and Threshold segmentation.
`mkdocs.yml`	Restructured navigation by replacing a single Segmentation entry with a hierarchical structure linking to new segmentation docs.
`pyretailscience/segmentation/{base.py, hml.py, rfm.py, threshold.py}`	New segmentation modules added: BaseSegmentation, HMLSegmentation, RFMSegmentation, and ThresholdSegmentation.
`pyretailscience/segmentation/segstats.py`	Refactored to remove obsolete segmentation classes, retaining only SegTransactionStats with updated documentation.
`tests/analysis/test_segmentation.py`	Removed outdated segmentation tests.
`tests/segmentation/{test_hml.py, test_rfm.py, test_segstats.py, test_threshold.py}`	Added new test suites to validate the functionality of segmentation classes (HML, RFM, SegTransactionStats, Threshold).

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ThresholdSegmentation
    participant DataFrame
    Client->>ThresholdSegmentation: __init__(df, thresholds, segments, ...)
    ThresholdSegmentation->>DataFrame: Validate required columns
    ThresholdSegmentation->>DataFrame: Aggregate data and calculate percentiles
    ThresholdSegmentation->>Client: Return segmented DataFrame

sequenceDiagram
    participant Client
    participant RFMSegmentation
    participant DataFrame
    Client->>RFMSegmentation: __init__(df, current_date)
    RFMSegmentation->>DataFrame: Validate input and compute RFM metrics
    RFMSegmentation->>RFMSegmentation: Calculate RFM scores and segments
    RFMSegmentation->>Client: Provide segmented results via df and ibis_table properties

Possibly related issues

Split segmentation.py into a module #151: The changes in the main issue directly relate to the restructuring of import paths for segmentation classes, which aligns with the proposed refactoring in the retrieved issue that aims to move these classes into a dedicated segmentation module.

Possibly related PRs

Added threshold segmentation analysis modules docs #111: The changes in the main PR involve reorganization of import paths for segmentation classes, while the retrieved PR focuses on enhancing documentation for the HML and Threshold Segmentation classes, including method definitions; thus, they are related at the code level.
docs: added hml seg example to analysis modules page #110: The changes in the main PR involve updates to the import paths for the HMLSegmentation class, while the retrieved PR adds an example usage of HMLSegmentation in the documentation, indicating a direct relationship at the code level.
Added ThresholdSegmentation class #58: The changes in the main PR involve updates to the import paths and method signatures of segmentation classes, while the retrieved PR introduces a new ThresholdSegmentation class and refactors the HMLSegmentation class, indicating a direct relationship in terms of class structure and functionality.

Suggested labels

enhancement, documentation, Tests, Review effort [1-5]: 3

Suggested reviewers

mvanwyk

Poem

I'm a clever rabbit with a coding flair,
Hoping through modules with meticulous care,
New paths for segmentation now brightly shine,
With docs and tests all perfectly in line,
A playful hop and a joyful cheer—code is fine!
🐰✨

✨ Finishing Touches

📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai plan to trigger planning for file edits and PR creation.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

qodo-merge-pro · 2025-03-25T13:30:13Z

Qodo Merge was enabled for this repository. To continue using it, please link your Git account with your Qodo account here.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review Documentation Path Error The navigation path in mkdocs.yml references 'api/segmentation/html.md' but the actual file is named 'api/segmentation/hml.md'. This will cause a broken link in the documentation. - HML Segmentation: api/segmentation/html.md Import Error There's an incorrect import statement referencing 'pyretailscience.segmentation.html' instead of 'pyretailscience.segmentation.hml' which will cause runtime errors. from pyretailscience.segmentation.html import HMLSegmentation Test Logic Issue The test_thresholds_too_too_few_thresholds method has a redundant name and tests both too few and too many thresholds in the same test method, which violates the single responsibility principle for tests. def test_thresholds_too_too_few_thresholds(self): """Test that the method raises an error when there are too few/many thresholds for the number of segments.""" df = pd.DataFrame( { get_option("column.customer_id"): [1, 2, 3, 4, 5], cols.unit_spend: [100, 200, 300, 400, 500], }, ) thresholds = [0.4, 1] segments = ["Low", "Medium", "High"] with pytest.raises(ValueError): ThresholdSegmentation(df, thresholds, segments) thresholds = [0.2, 0.5, 0.6, 0.8, 1] with pytest.raises(ValueError): ThresholdSegmentation(df, thresholds, segments)

qodo-merge-pro · 2025-03-25T13:30:35Z

Qodo Merge was enabled for this repository. To continue using it, please link your Git account with your Qodo account here.

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Fix import path typo There's a typo in the import statement. The module name should be 'hml' not 'html'. docs/analysis_modules.md [770] -from pyretailscience.segmentation.html import HMLSegmentation +from pyretailscience.segmentation.hml import HMLSegmentation Apply this suggestion Suggestion importance[1-10]: 10 __ Why: The import statement contains a critical typo using "html" instead of "hml", which would cause import errors when users try to run the code. This is a high-impact fix as it corrects code that would definitely fail.	High
Possible issue	Fix documentation file path The file path for HML Segmentation documentation contains a typo. It should be 'hml.md' instead of 'html.md' to match the actual module name. mkdocs.yml [30] -- HML Segmentation: api/segmentation/html.md +- HML Segmentation: api/segmentation/hml.md Apply this suggestion Suggestion importance[1-10]: 9 __ Why: The documentation path contains the same "html" vs "hml" typo, which would result in broken documentation links. This is a high-impact fix as it ensures users can access the correct documentation pages.	High
General	Use timezone.utc for compatibility The code is using `datetime.UTC` which was introduced in Python 3.11. For better compatibility with older Python versions, consider using `datetime.timezone.utc` instead. pyretailscience/segmentation/rfm.py [87] -current_date = datetime.datetime.now(datetime.UTC).date() +current_date = datetime.datetime.now(datetime.timezone.utc).date() Apply this suggestion Suggestion importance[1-10]: 8 __ Why: The suggestion addresses a potential compatibility issue with older Python versions by replacing datetime.UTC (introduced in Python 3.11) with datetime.timezone.utc. This is a valid concern that could prevent the code from running in environments using Python 3.10 or earlier.	Medium
General	Improve test case robustness The test is checking for a specific error message, but it's not verifying that the error is raised for the correct reason. Consider adding a test case that checks the error is raised specifically because of the invalid date type, not for some other reason. tests/segmentation/test_rfm.py [139-145] def test_invalid_current_date_type(self, base_df): """Test that RFMSegmentation raises a TypeError when an invalid current_date is provided.""" with pytest.raises( TypeError, match="current_date must be a string in 'YYYY-MM-DD' format, a datetime.date object, or None", ): RFMSegmentation(base_df, current_date=12345) + + # Verify valid date types work + RFMSegmentation(base_df, current_date="2025-03-17") # String format + RFMSegmentation(base_df, current_date=datetime.date(2025, 3, 17)) # date object + RFMSegmentation(base_df, current_date=None) # None defaults to current date Apply this suggestion Suggestion importance[1-10]: 5 __ Why: The suggestion enhances test coverage by adding verification for valid date types alongside the existing invalid type test. While the original test correctly checks for error raising, the additional tests provide more comprehensive validation of the date handling functionality.	Low
Update

codecov · 2025-03-25T13:31:33Z

Codecov Report

Attention: Patch coverage is 93.00000% with 7 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
pyretailscience/segmentation/rfm.py	92.10%	0 Missing and 3 partials ⚠️
pyretailscience/segmentation/base.py	80.00%	1 Missing and 1 partial ⚠️
pyretailscience/segmentation/threshold.py	95.23%	0 Missing and 2 partials ⚠️

Files with missing lines	Coverage Δ
pyretailscience/segmentation/hml.py	`100.00% <100.00%> (ø)`
pyretailscience/segmentation/segstats.py	`68.22% <100.00%> (ø)`
pyretailscience/segmentation/base.py	`80.00% <80.00%> (ø)`
pyretailscience/segmentation/threshold.py	`95.23% <95.23%> (ø)`
pyretailscience/segmentation/rfm.py	`92.10% <92.10%> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (12)

pyretailscience/segmentation/threshold.py (3)

31-31: Change segment keys to numeric types for clarity
The parameter segments is typed as dict[any, str]. If your threshold keys are numeric quantiles, using dict[float, str] can improve clarity and readability.

50-52: Mismatch in docstring regarding null value checks
The docstring states a ValueError is raised if columns contain null values, but the code only checks for missing columns. Consider adding explicit checks for null values to match the documented behavior.

100-106: Potential repeated execution of Ibis query
Each time the df property is accessed, table.execute() is called, re-running the query. For large datasets, consider caching results or separating data retrieval from downstream operations to optimize performance.
pyretailscience/segmentation/base.py (2)
8-10: Consider expanding the class docstring.

The class docstring could provide more information about how the class should be used, particularly explaining the expected structure of self.df that will be referenced in the add_segment method.
class BaseSegmentation:
-    """A base class for customer segmentation."""
+    """A base class for customer segmentation.
+
+    This class expects subclasses to initialize self.df with customer data,
+    including a 'segment_name' column that maps customer IDs to their segments.
+    """
11-33:
❓ Verification inconclusive

Add initialization validation and test coverage for error path.

The add_segment method is well-documented, but there are a few improvements that could be made:

There's no validation that self.df exists and has the required "segment_name" column.

The error case (row count mismatch) is not covered by tests.
def add_segment(self, df: pd.DataFrame) -> pd.DataFrame:
    """Adds the segment to the dataframe based on the customer_id column.

    Args:
        df (pd.DataFrame): The dataframe to add the segment to. The dataframe must have a customer_id column.

    Returns:
        pd.DataFrame: The dataframe with the segment added.

    Raises:
        ValueError: If the number of rows before and after the merge do not match.
+       AttributeError: If self.df is not initialized or missing the segment_name column.
    """
+    if not hasattr(self, 'df') or 'segment_name' not in self.df.columns:
+        raise AttributeError("self.df must be initialized with a 'segment_name' column.")
+        
+    customer_id_col = get_option("column.customer_id")
+    if customer_id_col not in df.columns:
+        raise ValueError(f"DataFrame must contain the customer ID column: {customer_id_col}")
+        
    rows_before = len(df)
    df = df.merge(
        self.df["segment_name"],
        how="left",
        left_on=get_option("column.customer_id"),
        right_index=True,
    )
    rows_after = len(df)
    if rows_before != rows_after:
        raise ValueError("The number of rows before and after the merge do not match. This should not happen.")

    return df
To verify test coverage for the error path, let's check if there are tests for this class:

🏁 Script executed:
#!/bin/bash
# Look for tests that cover the BaseSegmentation class
rg "BaseSegmentation" -g "test_*.py" --files-with-matches
Length of output: 57
Initialization Validation Added – Please Add Tests for Error Paths

The updated changes correctly include validation to ensure that both self.df and its "segment_name" column, as well as the customer ID column in df, are present before proceeding with the merge. However, our search for tests covering the error cases (such as when these validations fail or the row counts don’t match) didn’t yield any results.

Ensure that tests are added for the following scenarios:

When self.df is not initialized or is missing the "segment_name" column.

When the input DataFrame does not include the expected customer ID column.

When the row count mismatch triggers the ValueError during the merge.

Please add the necessary tests to cover these error paths to maintain robust error handling in the code.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 32-32: pyretailscience/segmentation/base.py#L32
Added line #L32 was not covered by tests
tests/segmentation/test_threshold.py (1)
170-188: Additional threshold validation tests.

These tests complement the previous validation tests by checking the case where there are too few thresholds for the number of segments.

I notice a minor typo in the test name test_thresholds_too_too_few_thresholds (duplicated "too"), but this doesn't affect functionality.
-    def test_thresholds_too_too_few_thresholds(self):
+    def test_thresholds_too_few_thresholds(self):
tests/segmentation/test_segstats.py (1)

51-77: Consider adding edge case coverage with no transactions.

While these tests correctly confirm transaction-level behavior, it could be valuable to add a scenario where one or more segments have zero rows (e.g., empty segment or no matching transactions) to ensure the code gracefully handles entirely missing segment data.
pyretailscience/segmentation/hml.py (1)
3-6: Fix minor grammatical issue in docstring.

“commonly know as the 80/20 rule” should be “commonly known as the 80/20 rule.”
-commonly know as the 80/20 rule
+commonly known as the 80/20 rule
pyretailscience/segmentation/segstats.py (2)

137-140: Be mindful of memory usage for large DataFrames.

Converting a large pd.DataFrame to an Ibis memtable can increase memory consumption. If the data is substantial, you might consider an alternative Ibis backend or a partitioned approach to avoid potential out-of-memory issues.

250-251: Support for multi-column plots.

Currently, the plot method raises an error if multiple segment columns exist. If multi-level plotting becomes a requirement, refactoring to handle multiple segment levels (e.g., stacked bar charts) may be beneficial.
pyretailscience/segmentation/rfm.py (2)
91-132: Consider making the bin size configurable.

The current implementation of ibis.ntile(10) is hardcoded to 10 bins. Some users may want more or fewer bins for their RFM segmentation use cases. Making this a parameter could increase flexibility and reusability.

Below is a possible change illustrating how to introduce a bins parameter:
-def _compute_rfm(self, df: ibis.Table, current_date: datetime.date) -> ibis.Table:
+def _compute_rfm(self, df: ibis.Table, current_date: datetime.date, bins: int = 10) -> ibis.Table:
     ...
     rfm_scores = customer_metrics.mutate(
-        r_score=(ibis.ntile(10).over(window_recency)),
-        f_score=(ibis.ntile(10).over(window_frequency)),
-        m_score=(ibis.ntile(10).over(window_monetary)),
+        r_score=(ibis.ntile(bins).over(window_recency)),
+        f_score=(ibis.ntile(bins).over(window_frequency)),
+        m_score=(ibis.ntile(bins).over(window_monetary)),
     )
     ...
133-144: Minor doc alignment suggestion.

The docstring for the df property mentions “segment names,” but the method actually returns the entire table with RFM metrics and indexes on customer_id. Consider clarifying that the DataFrame includes R, F, M, and segment columns. Similarly, you may wish to document the auxiliary fm_segment column or remove it if it’s not needed.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 91edbf4 and a0179ca.

📒 Files selected for processing (16)

docs/analysis_modules.md (4 hunks)
docs/api/segmentation/hml.md (1 hunks)
docs/api/segmentation/rfm.md (1 hunks)
docs/api/segmentation/segstats.md (1 hunks)
docs/api/segmentation/threshold.md (1 hunks)
mkdocs.yml (1 hunks)
pyretailscience/segmentation/base.py (1 hunks)
pyretailscience/segmentation/hml.py (1 hunks)
pyretailscience/segmentation/rfm.py (1 hunks)
pyretailscience/segmentation/segstats.py (1 hunks)
pyretailscience/segmentation/threshold.py (1 hunks)
tests/analysis/test_segmentation.py (0 hunks)
tests/segmentation/test_hml.py (1 hunks)
tests/segmentation/test_rfm.py (1 hunks)
tests/segmentation/test_segstats.py (1 hunks)
tests/segmentation/test_threshold.py (1 hunks)

💤 Files with no reviewable changes (1)

tests/analysis/test_segmentation.py

🧰 Additional context used

🧬 Code Definitions (6)

tests/segmentation/test_rfm.py (1)

pyretailscience/segmentation/rfm.py (3)

RFMSegmentation (34-143)

df (134-138)

ibis_table (141-143)

tests/segmentation/test_hml.py (6)

pyretailscience/segmentation/hml.py (1)

HMLSegmentation (16-49)

tests/segmentation/test_segstats.py (1)

base_df (17-27)

tests/segmentation/test_rfm.py (1)

base_df (18-33)

pyretailscience/segmentation/rfm.py (1)

df (134-138)

pyretailscience/segmentation/threshold.py (1)

df (101-105)

pyretailscience/segmentation/segstats.py (1)

df (190-204)

tests/segmentation/test_segstats.py (1)

pyretailscience/segmentation/segstats.py (3)

SegTransactionStats (23-299)

df (190-204)

plot (206-299)

pyretailscience/segmentation/hml.py (3)

pyretailscience/segmentation/threshold.py (2)

ThresholdSegmentation (22-105)

df (101-105)

pyretailscience/segmentation/rfm.py (1)

df (134-138)

pyretailscience/segmentation/segstats.py (1)

df (190-204)

pyretailscience/segmentation/threshold.py (3)

pyretailscience/segmentation/base.py (1)

BaseSegmentation (8-34)

pyretailscience/segmentation/rfm.py (1)

df (134-138)

pyretailscience/segmentation/segstats.py (1)

df (190-204)

pyretailscience/segmentation/rfm.py (2)

pyretailscience/segmentation/threshold.py (1)

df (101-105)

pyretailscience/segmentation/segstats.py (1)

df (190-204)

🪛 GitHub Check: codecov/patch

pyretailscience/segmentation/base.py

[warning] 32-32: pyretailscience/segmentation/base.py#L32
Added line #L32 was not covered by tests

🔇 Additional comments (38)

pyretailscience/segmentation/threshold.py (1)

75-82: Clarify behavior for "include_with_light" zero-spend customers
Currently there is no specific branch handling zero-spend customers when zero_value_customers="include_with_light". Verify if allowing them to remain in the main dataset and receive a normal percentile rank is the intended behavior for your business logic.

docs/api/segmentation/hml.md (1)

1-3: Documentation references updated successfully
The file properly references the pyretailscience.segmentation.hml module, which aligns with the new segmentation structure.

docs/api/segmentation/rfm.md (1)

1-3: Documentation looks good.

The documentation for the RFM Segmentation module follows the project's documentation pattern with a concise header and proper module reference.

docs/api/segmentation/threshold.md (1)

1-3: Documentation looks good.

The documentation for the Threshold Segmentation module is consistent with other documentation patterns in the project.

docs/api/segmentation/segstats.md (1)

1-3: Documentation looks good.

The documentation for the SegTransactionStats Segmentation module maintains consistency with other segmentation documentation patterns.

pyretailscience/segmentation/base.py (1)

1-7: Well-structured imports and module docstring.

The module docstring clearly explains the purpose of this file, and the imports are organized appropriately.

tests/segmentation/test_hml.py (7)

1-24: LGTM: Well-structured test fixture and setup.

The imports, class declaration, and fixture setup look clean and follow good testing practices.

25-38: Comprehensive test for "exclude" option.

This test effectively verifies that zero-spend customers are excluded from the result DataFrame when the zero_value_customers parameter is set to "exclude".

39-50: Comprehensive test for "include_with_light" option.

The test appropriately verifies that zero-spend customers are included in the "Light" segment when the zero_value_customers parameter is set to "include_with_light".

51-62: Comprehensive test for "separate_segment" option.

The test correctly verifies that zero-spend customers are assigned to a "Zero" segment when the zero_value_customers parameter is set to "separate_segment".

63-68: Good validation test for missing columns.

This test ensures the class properly raises a ValueError when required columns are missing, which is essential for early error detection.

69-78: Input data integrity check.

Good practice to verify that the original DataFrame is not modified during processing, ensuring the method is non-destructive.

79-89: Effective test for alternate value column.

This test demonstrates the flexibility of the class by showing it can work with different column names for the value parameter.

tests/segmentation/test_threshold.py (6)

1-37: LGTM: Well-structured basic segmentation test.

The initial test case effectively verifies that customers are correctly segmented based on provided thresholds. The structure follows good testing practices.

38-49: Error handling test for single customer scenario.

Good test to verify that the class raises an appropriate error when there's only one customer in the DataFrame. This prevents segmentation issues with too few data points.

50-78: Thorough testing of custom aggregation function.

This test properly verifies that the aggregation function works correctly with a non-standard value column. The use of pandas testing utilities for DataFrame comparison is a good practice.

79-115: Good validation of segment data merging.

Comprehensive test to verify that segment information is correctly merged back into the original DataFrame using the add_segment method.

116-136: Important test for duplicate customer IDs.

This test verifies that the segmentation handles DataFrames with duplicate customer_id entries correctly, which is a common real-world scenario.

137-169: Thorough error handling for threshold and segment validation.

These tests ensure that the class properly validates the thresholds and segments parameters, raising appropriate errors when they don't meet requirements.

tests/segmentation/test_rfm.py (8)

1-50: LGTM: Well-structured test fixtures for RFM segmentation.

The test fixtures provide comprehensive test data and expected results for RFM segmentation. The use of explicit expected values for RFM scores is good practice.

51-64: Comprehensive RFM segmentation test with current date.

This test correctly verifies that RFM scores and segments are calculated properly based on a specified current date.

65-77: Good error handling test for missing columns.

This test ensures the class properly validates required columns and raises an appropriate error when they're missing.

78-92: Testing single customer scenario.

Important edge case test to verify that RFM segmentation works correctly with a single customer DataFrame.

93-114: Multiple transactions handling test.

This test verifies that the class correctly aggregates multiple transactions per customer when calculating RFM scores.

115-124: Verification of complete customer processing.

Good test to ensure that all customers are processed and receive RFM segments.

125-138: Clever use of freezegun for date testing.

This test effectively uses freezegun to mock the current date, allowing for predictable testing of the automatic date handling feature.

139-161: Type validation tests.

Thorough tests to verify that the class properly validates input types and raises appropriate errors for invalid inputs.

tests/segmentation/test_segstats.py (2)

29-49: Comprehensive scenario validation.

These tests effectively validate various aggregation outputs (revenue, transactions, and customers) for multi-segment data, including the “Total” row. Great work ensuring robust coverage of common scenarios.

153-321: Impressive multiple-parameter coverage.

These tests thoroughly exercise segment columns, plotting constraints, and extra aggregations. They validate correct error handling and proper data shape for advanced usage. No significant issues noted.

pyretailscience/segmentation/hml.py (1)

19-49: Enhance test coverage for HMLSegmentation.

Although this class extends ThresholdSegmentation and appears logically sound, consider adding dedicated tests to verify that Light, Medium, Heavy, and Zero spend segments are computed correctly in line with the 50/30/20 splits.

pyretailscience/segmentation/segstats.py (1)

1-10: Well-structured documentation updates.

The updated docstring clearly communicates the purpose and capabilities of SegTransactionStats. The emphasis on both Pandas DataFrames and Ibis Tables is concise and aligns with the underlying logic.

pyretailscience/segmentation/rfm.py (4)

1-24: Great descriptive module docstring.

The in-depth explanation of RFM segmentation logic, scoring tier details, and business implications is well-written and thorough. This provides valuable clarity to users and maintainers.

26-33: Verify Python version compatibility.

The line datetime.datetime.now(datetime.UTC).date() (line 87) relies on datetime.UTC, which is only available in Python 3.11+. If this library must support older Python versions, consider switching to datetime.timezone.utc.

Do you need to support Python versions earlier than 3.11? If so, please confirm and update accordingly.

34-46: Well-structured class docstring.

The class docstring neatly outlines the RFM approach and clearly communicates the scoring system. The explanation aligns well with the module docstring. No issues found here.

48-90: Robust initialization checks.

The constructor correctly enforces required columns and handles both pandas DataFrame and Ibis Table inputs. The early validation steps (lines 79-82) and type checks (lines 77, 89) help prevent runtime errors. Nicely done!

docs/analysis_modules.md (3)

684-684: Approve HMLSegmentation Import Update
The import for HMLSegmentation has been correctly updated to use the new module path (pyretailscience.segmentation.hml). This aligns with the intended reorganization of the segmentation modules.

727-727: Approve ThresholdSegmentation Import Update
The import for ThresholdSegmentation is now correctly pointing to pyretailscience.segmentation.threshold, which reflects the updated module structure.

822-822: Approve RFMSegmentation Import Update
The import for RFMSegmentation has been updated to pyretailscience.segmentation.rfm, in line with the new module organization.

coderabbitai · 2025-03-25T13:35:08Z

pyretailscience/segmentation/threshold.py

+        for quantile, segment in zip(thresholds, segments, strict=True):
+            case = case.when(df["ptile"] <= quantile, segment)
+
+        case = case.end()
+
+        df = df.mutate(segment_name=case).drop(["ptile"])


🛠️ Refactor suggestion

No fallback for percentile values above highest threshold
This block only handles cases where the percentile is <= each threshold in turn. If the highest threshold is below 1.0, some customers may not be assigned to any segment. Consider adding a fallback or requiring 1.0 as the final threshold.

mkdocs.yml

docs/analysis_modules.md

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

tests/segmentation/test_threshold.py (1)
170-170: Fix typo in test method name

There's a duplicate "too" in the method name.
-def test_thresholds_too_too_few_thresholds(self):
+def test_thresholds_too_few_thresholds(self):

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0179ca and 1fc2b6e.

📒 Files selected for processing (16)

docs/analysis_modules.md (4 hunks)
docs/api/segmentation/hml.md (1 hunks)
docs/api/segmentation/rfm.md (1 hunks)
docs/api/segmentation/segstats.md (1 hunks)
docs/api/segmentation/threshold.md (1 hunks)
mkdocs.yml (1 hunks)
pyretailscience/segmentation/base.py (1 hunks)
pyretailscience/segmentation/hml.py (1 hunks)
pyretailscience/segmentation/rfm.py (1 hunks)
pyretailscience/segmentation/segstats.py (1 hunks)
pyretailscience/segmentation/threshold.py (1 hunks)
tests/analysis/test_segmentation.py (0 hunks)
tests/segmentation/test_hml.py (1 hunks)
tests/segmentation/test_rfm.py (1 hunks)
tests/segmentation/test_segstats.py (1 hunks)
tests/segmentation/test_threshold.py (1 hunks)

💤 Files with no reviewable changes (1)

tests/analysis/test_segmentation.py

🚧 Files skipped from review as they are similar to previous changes (6)

docs/api/segmentation/hml.md
docs/api/segmentation/rfm.md
docs/api/segmentation/segstats.md
pyretailscience/segmentation/hml.py
docs/analysis_modules.md
docs/api/segmentation/threshold.md

🧰 Additional context used

🧬 Code Definitions (3)

tests/segmentation/test_hml.py (6)

pyretailscience/segmentation/hml.py (1)

HMLSegmentation (16-49)

tests/segmentation/test_rfm.py (1)

base_df (18-33)

tests/segmentation/test_segstats.py (1)

base_df (17-27)

pyretailscience/segmentation/rfm.py (1)

df (134-138)

pyretailscience/segmentation/threshold.py (1)

df (101-105)

pyretailscience/segmentation/segstats.py (1)

df (190-204)

tests/segmentation/test_segstats.py (1)

pyretailscience/segmentation/segstats.py (3)

SegTransactionStats (23-299)

df (190-204)

plot (206-299)

pyretailscience/segmentation/threshold.py (3)

pyretailscience/segmentation/base.py (1)

BaseSegmentation (8-34)

pyretailscience/segmentation/rfm.py (1)

df (134-138)

pyretailscience/segmentation/segstats.py (1)

df (190-204)

🪛 GitHub Check: codecov/patch

pyretailscience/segmentation/base.py

[warning] 32-32: pyretailscience/segmentation/base.py#L32
Added line #L32 was not covered by tests

🔇 Additional comments (32)

mkdocs.yml (1)

29-33:
✅ Verification successful

Consider verifying the new doc file references.

The newly introduced documentation paths for HML, RFM, SegTransactionStats, and Threshold Segmentation look consistent. However, please confirm that these files (hml.md, rfm.md, segstats.md, and threshold.md) exist and are properly recognized by MkDocs.

Here is a script to check for these files:

🏁 Script executed:
#!/bin/bash
# Verify the existence of new doc files under docs/api/segmentation
fd hml.md docs/api/segmentation
fd rfm.md docs/api/segmentation
fd segstats.md docs/api/segmentation
fd threshold.md docs/api/segmentation
Length of output: 273
Doc file references have been verified.
The new documentation paths for HML, RFM, SegTransactionStats, and Threshold Segmentation have all been confirmed to exist under the docs/api/segmentation directory:

docs/api/segmentation/hml.md

docs/api/segmentation/rfm.md

docs/api/segmentation/segstats.md

docs/api/segmentation/threshold.md

No further changes are required.
pyretailscience/segmentation/threshold.py (1)

88-91: No fallback for customers above the highest threshold.

If the highest threshold is below 1.0, customers with a percentile rank above that threshold have no assigned segment. A fallback or enforced final threshold of 1.0 is recommended to handle these customers correctly.

pyretailscience/segmentation/base.py (1)

8-34: Add unit test coverage for line 32.

Static analysis indicates that this error-handling line (raise ValueError) is untested. Adding a negative test scenario that triggers a row mismatch after merging will help verify this logic and improve overall coverage.

Would you like me to provide a sample test case that forces a mismatch scenario for demonstration?

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 32-32: pyretailscience/segmentation/base.py#L32
Added line #L32 was not covered by tests

tests/segmentation/test_hml.py (5)

1-89: Well-structured test suite for HMLSegmentation

The test coverage for the HMLSegmentation class is comprehensive, effectively testing all three options for handling zero-value customers: "exclude", "include_with_light", and "separate_segment". The tests properly verify the segmentation logic and edge cases.

25-37: Test case properly verifies zero customer exclusion

Good test for ensuring zero-spend customers are completely excluded from the result DataFrame when the "exclude" option is used.

64-67: Thorough validation of error handling

The test properly verifies that a ValueError is raised when required columns are missing. This helps ensure the class fails fast with clear error messages rather than producing unexpected results.

70-77: Important verification of non-mutation

This test is critical for ensuring that the segmentation process doesn't modify the original DataFrame, which is essential for maintaining data integrity when the same DataFrame is used in multiple operations.

79-88: Good test for alternate column functionality

Testing with an alternate value column ensures the flexibility of the implementation to work with different column names, which is important for adapting to various data structures.

tests/segmentation/test_threshold.py (3)

1-188: Comprehensive test suite for ThresholdSegmentation

The test suite covers the core functionality of ThresholdSegmentation, including proper segmentation, aggregation functions, error handling, and edge cases. The tests are well-structured and verify both expected behavior and proper error handling.

38-48: Correctly tests single customer limitation

This test verifies that attempting to use ThresholdSegmentation with a DataFrame containing only one customer raises a ValueError. This is an important limitation to test for, as segmentation generally requires multiple data points to be meaningful.

50-77: Thorough test of custom aggregation functions

This test effectively validates that the ThresholdSegmentation class correctly applies custom aggregation functions (in this case 'nunique') for segmenting on product_id. The test covers a complex case with multiple products per customer and verifies the expected segmentation outcome.

tests/segmentation/test_rfm.py (4)

1-161: Comprehensive test suite for RFMSegmentation

The test suite thoroughly covers various aspects of RFM segmentation, including correct calculation of scores, handling of edge cases, date handling, and validation of input types. The use of pytest fixtures and freezegun for consistent testing is particularly effective.

35-49: Well-structured fixture for expected values

The expected_df fixture provides a clear reference point for RFM segmentation tests, making the assertions more readable and maintainable. This is a good practice for tests that need to compare complex DataFrame structures.

125-137: Good use of freezegun for date testing

Using freezegun to control the current date ensures consistent test results, regardless of when the tests are run. This is crucial for date-dependent functionality like RFM segmentation.

154-160: Important test for ibis.Table compatibility

This test verifies that the RFMSegmentation class properly handles ibis.Table objects, which is an important feature for users working with large datasets where Ibis can provide performance benefits.

tests/segmentation/test_segstats.py (5)

13-151: Well-structured tests for segment statistics calculation

The TestCalcSegStats class thoroughly tests the segment statistics calculation logic, including revenue, transactions, and customers per segment. The tests cover various scenarios like single segments, zero net units, and excluding total rows.

153-320: Comprehensive testing of SegTransactionStats functionality

The TestSegTransactionStats class provides good coverage of the class's functionality, including error handling, multiple segment columns, plotting limitations, and custom aggregations. The tests for the extra_aggs parameter are particularly thorough.

210-227: Important validation of plotting limitations

This test verifies that attempting to plot with multiple segment columns raises an appropriate ValueError, which is essential for providing clear error messages to users who might attempt this unsupported operation.

229-288: Thorough testing of extra aggregations feature

The tests for the extra_aggs parameter cover both single and multiple custom aggregations, verifying that the results match expected values. This is crucial for ensuring that this powerful feature works correctly.

290-320: Good error handling tests for invalid aggregations

These tests verify that appropriate errors are raised for invalid columns and functions in extra_aggs, ensuring that users receive clear error messages rather than unexpected results.

pyretailscience/segmentation/segstats.py (7)

1-10: Clear and focused module documentation.

The updated docstring effectively communicates the module's purpose, focusing solely on transaction statistics by segment. This aligns well with the modular restructuring objectives of the PR.

19-19: Import statements have been appropriately streamlined.

The import statements have been simplified to focus only on dependencies needed for SegTransactionStats, removing references to the segmentation classes that have been moved to separate modules.

23-85: The SegTransactionStats class implementation is well-structured.

The class initialization and validation logic are robust, with proper error handling for missing columns and invalid input formats. The class effectively supports both Pandas DataFrames and Ibis Tables.

86-114: Column order management is well-implemented.

The static method for determining column order based on data availability is clean and maintainable, with good documentation explaining its purpose.

115-187: Segment statistics calculation is robust and well-implemented.

The implementation handles various edge cases appropriately:

Converting DataFrame to Ibis table when needed

Validating input types

Handling optional quantity columns

Supporting custom aggregations

Including total metrics when requested

Computing derived metrics correctly

189-204: The df property implementation is efficient and flexible.

The property correctly handles column ordering, including both standard metrics and any custom aggregations specified during initialization.

206-299: The plot method is comprehensive with good parameter validation.

The plotting functionality provides flexible options for visualization while maintaining appropriate parameter validation. The method effectively handles orientation, sorting, axis labels, and formatting.

pyretailscience/segmentation/rfm.py (5)

1-24: Excellent comprehensive module documentation.

The docstring provides a thorough explanation of RFM segmentation, including its purpose, benefits, and the scoring methodology. This helps users understand the module's functionality and how to interpret the results.

26-32: Import statements are appropriately minimal.

The imports include only the necessary modules and libraries needed for the RFM segmentation functionality.

34-47: Clear class-level documentation for RFMSegmentation.

The class docstring effectively explains how customers are scored on the three dimensions (recency, frequency, monetary) and how the RFM segment is calculated.

93-132: RFM computation logic is well-implemented.

The _compute_rfm method effectively:

Calculates the base metrics (recency, frequency, monetary)

Creates appropriate window functions for ranking

Computes percentile scores using ntile

Constructs the final RFM and FM segments

Using secondary sorting by customer_id in window functions is a good practice to ensure deterministic results when there are ties in the primary metric.

133-144: Accessor properties are appropriately implemented.

The class provides convenient properties to access the segmentation results in both DataFrame and Ibis table formats. The DataFrame is lazily executed only when needed, which is efficient.

coderabbitai · 2025-03-25T13:45:39Z

pyretailscience/segmentation/threshold.py

+        segments: dict[any, str],
+        value_col: str | None = None,


🛠️ Refactor suggestion

Clarify the usage of the segments dictionary.

The docstring states that “keys are segment IDs and values are segment names,” but the loop iterates over the dictionary keys as if they are the displayed segment names. This may cause confusion or unintended labeling.

Consider either:

Changing the loop to use the dictionary values as segment labels, or

Updating the docstring to reflect that the dictionary keys will be used as the segment labels.

pyretailscience/segmentation/threshold.py

coderabbitai · 2025-03-25T13:45:39Z

pyretailscience/segmentation/rfm.py

+    def __init__(self, df: pd.DataFrame | ibis.Table, current_date: str | datetime.date | None = None) -> None:
+        """Initializes the RFM segmentation process.
+
+        Args:
+            df (pd.DataFrame | ibis.Table): A DataFrame or Ibis table containing transaction data.
+                Must include the following columns:
+                - customer_id
+                - transaction_date
+                - unit_spend
+                - transaction_id
+            current_date (Optional[Union[str, datetime.date]]): The reference date for calculating recency.
+                Can be a string (format: "YYYY-MM-DD"), a date object, or None (defaults to the current system date).
+
+        Raises:
+            ValueError: If the dataframe is missing required columns.
+            TypeError: If the input data is not a pandas DataFrame or an Ibis Table.
+        """
+        cols = ColumnHelper()
+        required_cols = [
+            cols.customer_id,
+            cols.transaction_date,
+            cols.unit_spend,
+            cols.transaction_id,
+        ]
+        if isinstance(df, pd.DataFrame):
+            df = ibis.memtable(df)
+        elif not isinstance(df, ibis.Table):
+            raise TypeError("df must be either a pandas DataFrame or an Ibis Table")
+
+        missing_cols = set(required_cols) - set(df.columns)
+        if missing_cols:
+            error_message = f"Missing required columns: {missing_cols}"
+            raise ValueError(error_message)
+
+        if isinstance(current_date, str):
+            current_date = datetime.date.fromisoformat(current_date)
+        elif current_date is None:
+            current_date = datetime.datetime.now(datetime.UTC).date()
+        elif not isinstance(current_date, datetime.date):
+            raise TypeError("current_date must be a string in 'YYYY-MM-DD' format, a datetime.date object, or None")
+
+        self.table = self._compute_rfm(df, current_date)
+


💡 Verification agent

🧩 Analysis chain

Initialization method has thorough validation.

The __init__ method includes comprehensive validation of input parameters and handles different input types appropriately.

However, there's a potential Python version compatibility issue:

🏁 Script executed:

#!/bin/bash # Description: Check Python version requirement for datetime.UTC usage # Check if the repository specifies a minimum Python version grep -r "python_requires\|python_version" setup.py pyproject.toml requirements*.txt | head -5 # Check if the code uses datetime.UTC elsewhere rg "datetime\.UTC"

Length of output: 317

Attention: Python Version Compatibility for datetime.UTC

The initialization method’s validations are thorough and correctly handle alternative input types.

However, the use of datetime.UTC in pyretailscience/segmentation/rfm.py (line 58) may lead to compatibility issues on Python versions earlier than 3.11.

The repository configuration (e.g., in pyproject.toml) does not enforce a minimum Python version, so consider either:

Updating the project’s Python version requirement to 3.11 or higher, or

Replacing datetime.UTC with the more broadly supported datetime.timezone.utc for compatibility with older Python versions.

mkdocs.yml

pyretailscience/segmentation/rfm.py

pyretailscience/segmentation/threshold.py

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

pyretailscience/segmentation/segstats.py (1)
23-24: Consider expanding the class-level docstring.

While the implementation and __init__ docstring are detailed, the class-level docstring is quite brief. Consider expanding it to match the comprehensiveness of the module-level docstring.
-    """Calculates transaction statistics by segment."""
+    """Calculates and visualizes transaction statistics by segment.
+    
+    This class computes transaction-based statistics grouped by segment columns,
+    including total spend, unique customers, transactions per customer, and more.
+    It supports both Pandas DataFrames and Ibis Tables as input data formats and
+    provides visualization capabilities for segment-based statistics.
+    """

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1fc2b6e and 835a8e1.

📒 Files selected for processing (16)

docs/analysis_modules.md (4 hunks)
docs/api/segmentation/hml.md (1 hunks)
docs/api/segmentation/rfm.md (1 hunks)
docs/api/segmentation/segstats.md (1 hunks)
docs/api/segmentation/threshold.md (1 hunks)
mkdocs.yml (1 hunks)
pyretailscience/segmentation/base.py (1 hunks)
pyretailscience/segmentation/hml.py (1 hunks)
pyretailscience/segmentation/rfm.py (1 hunks)
pyretailscience/segmentation/segstats.py (1 hunks)
pyretailscience/segmentation/threshold.py (1 hunks)
tests/analysis/test_segmentation.py (0 hunks)
tests/segmentation/test_hml.py (1 hunks)
tests/segmentation/test_rfm.py (1 hunks)
tests/segmentation/test_segstats.py (1 hunks)
tests/segmentation/test_threshold.py (1 hunks)

💤 Files with no reviewable changes (1)

tests/analysis/test_segmentation.py

🚧 Files skipped from review as they are similar to previous changes (10)

docs/api/segmentation/rfm.md
docs/api/segmentation/threshold.md
docs/api/segmentation/hml.md
docs/api/segmentation/segstats.md
tests/segmentation/test_threshold.py
tests/segmentation/test_hml.py
pyretailscience/segmentation/hml.py
tests/segmentation/test_rfm.py
docs/analysis_modules.md
mkdocs.yml

🧰 Additional context used

🧬 Code Definitions (2)

tests/segmentation/test_segstats.py (1)

pyretailscience/segmentation/segstats.py (3)

SegTransactionStats (23-299)

df (190-204)

plot (206-299)

pyretailscience/segmentation/threshold.py (3)

pyretailscience/segmentation/base.py (1)

BaseSegmentation (8-34)

pyretailscience/segmentation/rfm.py (1)

df (134-138)

pyretailscience/segmentation/segstats.py (1)

df (190-204)

🪛 GitHub Check: codecov/patch

pyretailscience/segmentation/base.py

[warning] 32-32: pyretailscience/segmentation/base.py#L32
Added line #L32 was not covered by tests

🔇 Additional comments (12)

pyretailscience/segmentation/segstats.py (2)

1-10: Excellent docstring improvement!

The updated module docstring clearly explains the purpose of the module and the functionality provided by the SegTransactionStats class. It effectively communicates the key capabilities, such as computing transaction-based statistics by segment and supporting both Pandas DataFrames and Ibis Tables.

19-19: Import path update looks good.

The updated import for ColumnHelper from pyretailscience.options is correctly implemented.

tests/segmentation/test_segstats.py (5)

1-321: Well-structured comprehensive test suite!

The test file provides excellent coverage of the SegTransactionStats class functionality. The tests are well-organized into two classes that separately test the internal _calc_seg_stats method and the public-facing class functionality.

13-151: Thorough testing of the calculation logic.

The TestCalcSegStats class thoroughly tests the calculation functionality with various scenarios:

Transaction-item level calculations

Transaction level calculations

Single segment handling

Zero net units edge case

Total row exclusion option

This comprehensive approach ensures the core calculation logic is robust.

153-321: Effective validation of class behavior and error handling.

The TestSegTransactionStats class effectively tests:

Input validation by checking errors on missing required columns

Multi-segment column functionality

Plot method restrictions

Extra aggregations functionality and error handling

These tests provide confidence in the class's robustness when handling various inputs and edge cases.

102-125: Excellent edge case handling for zero net units.

The test for handling zero net units is particularly valuable as it ensures the code correctly handles division-by-zero scenarios, which could otherwise lead to runtime errors.

229-289: Well-structured test for extra aggregations.

The test_extra_aggs_functionality method effectively validates both single and multiple extra aggregations. The use of descriptive constant variables for expected values enhances readability and makes the test purpose clear.

pyretailscience/segmentation/base.py (1)

11-22: Consider verifying the existence of "segment_name"
This method assumes self.df["segment_name"] always exists. If self.df is uninitialized or missing the "segment_name" column, the merge will fail. Ensure that self.df is properly populated with a "segment_name" column before calling add_segment.

Would you like to verify the presence of "segment_name" in self.df before merging and add a safeguard for potential KeyError?
pyretailscience/segmentation/rfm.py (2)

87-87: Compatibility warning for Python versions < 3.11
Using datetime.UTC can cause compatibility issues for those on older Python versions. To maintain broader support, consider using datetime.timezone.utc or specifying a minimum Python version of 3.11.

112-126:
❓ Verification inconclusive

Verify ranking logic for recency
The docstring says “9 represents the best customer” for Recency, yet the code orders by ascending recency (lowest recency first). This might place the most recent purchasers in the lowest ntile value. Confirm the intended behavior and adjust if necessary (e.g., by reversing the ordering or reinterpreting the ntile scale).

🏁 Script executed:
#!/bin/bash
# We can verify how ibis.ntile(10) assigns ranks with ascending vs descending order.
# This script searches for usage of ntile in the codebase to confirm no contradictory logic.
rg 'ibis\.ntile$10$'
Length of output: 312
Clarify Recency Ranking Intent

After verifying the ntile usage in the code, it’s clear that the three metrics (recency, frequency, monetary) are partitioned using ascending sort orders. Given that a lower recency (i.e., fewer days since the last purchase) should intuitively indicate a better customer, this implementation would assign lower ntile values—contrary to the docstring which states “9 represents the best customer.” Please confirm whether the intended behavior is to have customers with the smallest recency values receive a higher ntile score (perhaps by reversing the ordering for recency or by adjusting the ntile scale) or if the docstring should be modified to reflect the current implementation.

File: pyretailscience/segmentation/rfm.py

Lines: 112-126

Action Needed: Verify and reconcile the mismatch between the documented best-customer score for recency and the current ascending order logic.
pyretailscience/segmentation/threshold.py (2)

35-47: Implement “include_with_light” zero-value option
The code handles the "exclude" and "separate_segment" cases but not "include_with_light". This may lead to unexpected behavior if a user passes "include_with_light".

88-90: No fallback segment for percentile values above the highest threshold
If the highest threshold is below 1.0, customers above that threshold remain unassigned. Consider adding a final .else_() branch or requiring 1.0 as the last threshold.

pyretailscience/segmentation/base.py

murray-ds · 2025-03-26T08:17:18Z

mkdocs.yml

@@ -26,7 +26,12 @@ nav:
          - Haversine Distance: api/analysis/haversine.md
          - Product Association: api/analysis/product_association.md
          - Revenue Tree: api/analysis/revenue_tree.md
-          - Segmentation: api/analysis/segmentation.md
+      - Segmentation:
+          - Base Segmentation: api/segmentation/base.md


Does this file exist? I don't see it in the PR

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

pyretailscience/segmentation/threshold.py (1)

53-54: No fallback for percentile values above the highest threshold
If the largest threshold is below 1.0, some customers may not be assigned to any segment. Consider ensuring 1.0 is the final threshold or adding a final catch-all.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 835a8e1 and 6a4058b.

📒 Files selected for processing (17)

docs/analysis_modules.md (4 hunks)
docs/api/segmentation/base.md (1 hunks)
docs/api/segmentation/hml.md (1 hunks)
docs/api/segmentation/rfm.md (1 hunks)
docs/api/segmentation/segstats.md (1 hunks)
docs/api/segmentation/threshold.md (1 hunks)
mkdocs.yml (1 hunks)
pyretailscience/segmentation/base.py (1 hunks)
pyretailscience/segmentation/hml.py (1 hunks)
pyretailscience/segmentation/rfm.py (1 hunks)
pyretailscience/segmentation/segstats.py (1 hunks)
pyretailscience/segmentation/threshold.py (1 hunks)
tests/analysis/test_segmentation.py (0 hunks)
tests/segmentation/test_hml.py (1 hunks)
tests/segmentation/test_rfm.py (1 hunks)
tests/segmentation/test_segstats.py (1 hunks)
tests/segmentation/test_threshold.py (1 hunks)

💤 Files with no reviewable changes (1)

tests/analysis/test_segmentation.py

✅ Files skipped from review due to trivial changes (1)

docs/api/segmentation/base.md

🚧 Files skipped from review as they are similar to previous changes (10)

docs/api/segmentation/threshold.md
docs/api/segmentation/segstats.md
docs/api/segmentation/rfm.md
docs/api/segmentation/hml.md
tests/segmentation/test_threshold.py
tests/segmentation/test_hml.py
pyretailscience/segmentation/hml.py
mkdocs.yml
tests/segmentation/test_rfm.py
docs/analysis_modules.md

🧰 Additional context used

🧬 Code Definitions (3)

tests/segmentation/test_segstats.py (1)

pyretailscience/segmentation/segstats.py (3)

SegTransactionStats (23-299)

df (190-204)

plot (206-299)

pyretailscience/segmentation/threshold.py (3)

pyretailscience/segmentation/base.py (1)

BaseSegmentation (8-34)

pyretailscience/segmentation/rfm.py (1)

df (134-138)

pyretailscience/segmentation/segstats.py (1)

df (190-204)

pyretailscience/segmentation/rfm.py (2)

pyretailscience/segmentation/threshold.py (1)

df (101-105)

pyretailscience/segmentation/segstats.py (1)

df (190-204)

🪛 GitHub Check: codecov/patch

pyretailscience/segmentation/base.py

[warning] 32-32: pyretailscience/segmentation/base.py#L32
Added line #L32 was not covered by tests

🔇 Additional comments (22)

pyretailscience/segmentation/base.py (3)

8-9: Well-structured base class
Your docstring and class structure provide a concise and user-friendly foundation for specialized segmentation classes.

11-22: Merge-based logic verification
The logic for merging segment names appears clear. However, ensure that the caller sets self.df with a valid segment_name column before invoking add_segment.

Would you confirm that the codebase has tests ensuring a ValueError is raised if self.df lacks segment_name?

32-32: Unit test coverage recommendation
The line raising ValueError is not covered by tests.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 32-32: pyretailscience/segmentation/base.py#L32
Added line #L32 was not covered by tests

pyretailscience/segmentation/rfm.py (3)

1-24: Comprehensive documentation
The docstring provides a thorough overview of RFM segmentation, covering methodology and benefits clearly.

87-87: Python version compatibility concern
datetime.UTC is only available in Python 3.11 and later. Consider using datetime.timezone.utc for broader compatibility.

If you intend to support Python versions below 3.11, please confirm that this usage won't break in those environments.

133-139: Lazy-loading the DataFrame
Your property-based approach to executing the data and caching the result is efficient. Just be mindful of memory usage for large datasets.

pyretailscience/segmentation/threshold.py (1)

95-96: Ensure union schema alignment
When uniting df and zero_df, confirm both tables share the same schema. A mismatch in columns or data types might cause errors in future changes.

pyretailscience/segmentation/segstats.py (3)

1-10: Improved module documentation

The module now has a clear, comprehensive docstring that accurately describes its purpose and capabilities. The documentation clearly explains that this module focuses on transaction statistics by segment, supporting both Pandas DataFrames and Ibis Tables.

19-19: Import path updated correctly

The import path for ColumnHelper has been updated as part of the segmentation module restructuring.

23-24: SegTransactionStats class is properly maintained

This class has been correctly maintained as the sole focus of this module, aligning with the PR objective of splitting segmentation logic into modular files.

tests/segmentation/test_segstats.py (12)

1-11: Well-structured test module

The test module has clear imports and appropriate organization. Using ColumnHelper for consistent column access is a good practice for maintainability.

13-28: Good test fixture setup

The TestCalcSegStats class with the base_df fixture provides a solid foundation for testing with sample data that includes all required columns. This approach promotes code reuse across multiple tests.

29-50: Comprehensive basic functionality test

This test thoroughly verifies the core calculation logic, ensuring that revenue, transactions, and customers are correctly calculated per segment. The expected values are well-defined and the verification uses proper assertion methods.

51-77: Additional test for dataframes without quantity column

This test appropriately verifies that the class works correctly when the input dataframe doesn't include the unit quantity column, which is an important edge case to cover.

78-101: Single segment handling test

Good test for verifying behavior when all rows belong to the same segment, an important edge case that could otherwise lead to unexpected behavior.

102-125: Zero net units handling test

This test verifies correct behavior when a segment has zero net units, properly checking that:

The price per unit is set to NaN (as division by zero would be undefined)

The units per transaction is calculated as zero

Other metrics remain unaffected

This ensures robust handling of edge cases in real-world data.

126-151: Total row exclusion test

Good verification of the calc_total parameter functionality, ensuring the total row is excluded when specified and that customer percentages are adjusted correctly.

153-164: Error handling test for missing required columns

This test correctly verifies that the class raises an appropriate error when required columns are missing, which is crucial for providing clear feedback to users.

165-209: Multiple segment columns test

This test thoroughly verifies the class's ability to handle multiple segment columns, which is an important feature for complex segmentation analysis. The test correctly:

Verifies that both segment columns are present in the result

Checks that only actual combinations (not all possible combinations) are included

Validates the aggregation calculations across the segment combinations

210-228: Plot limitations test

Good test to validate that the plotting functionality correctly raises an error when multiple segment columns are provided, as this is a documented limitation of the implementation.

229-289: Extra aggregations functionality test

Comprehensive test for the extra_aggs parameter, verifying both single and multiple additional aggregations. The test uses well-defined expected values and thorough assertions to ensure the feature works correctly.

290-320: Error handling tests for invalid extra aggregations

These tests verify appropriate error handling for:

Invalid column names in extra aggregations

Invalid aggregation functions

Both tests ensure users receive clear error messages when configuration issues occur, improving the developer experience.

coderabbitai · 2025-03-26T09:07:23Z

pyretailscience/segmentation/threshold.py

+        zero_value_customers: Literal["separate_segment", "exclude", "include_with_light"] = "separate_segment",
+    ) -> None:
+        """Segments customers based on user-defined thresholds and segments.
+
+        Args:
+            df (pd.DataFrame | ibis.Table): A dataframe with the transaction data. The dataframe must contain a customer_id column.
+            thresholds (List[float]): The percentile thresholds for segmentation.
+            segments (List[str]): A list of segment names for each threshold.
+            value_col (str, optional): The column to use for the segmentation. Defaults to get_option("column.unit_spend").
+            agg_func (str, optional): The aggregation function to use when grouping by customer_id. Defaults to "sum".
+            zero_segment_name (str, optional): The name of the segment for customers with zero spend. Defaults to "Zero".
+            zero_value_customers (Literal["separate_segment", "exclude", "include_with_light"], optional): How to handle
+                customers with zero spend. Defaults to "separate_segment".


🛠️ Refactor suggestion

Missing handling for "include_with_light"
The code does not handle the include_with_light option for zero spend customers. This could lead to inconsistent outcomes or errors.

qodo-merge-pro bot added the Review effort 3/5 label Mar 25, 2025

coderabbitai bot reviewed Mar 25, 2025

View reviewed changes

mayurkmmt force-pushed the feature/split-segmentation branch from a0179ca to 1fc2b6e Compare March 25, 2025 13:39

coderabbitai bot reviewed Mar 25, 2025

View reviewed changes

mvanwyk reviewed Mar 25, 2025

View reviewed changes

mkdocs.yml Show resolved Hide resolved

pyretailscience/segmentation/rfm.py Outdated Show resolved Hide resolved

pyretailscience/segmentation/threshold.py Outdated Show resolved Hide resolved

pyretailscience/segmentation/threshold.py Outdated Show resolved Hide resolved

mayurkmmt force-pushed the feature/split-segmentation branch from 1fc2b6e to 835a8e1 Compare March 26, 2025 05:50

coderabbitai bot reviewed Mar 26, 2025

View reviewed changes

pyretailscience/segmentation/base.py Show resolved Hide resolved

murray-ds reviewed Mar 26, 2025

View reviewed changes

feat: split segmentations.py file

6a4058b

mayurkmmt force-pushed the feature/split-segmentation branch from 835a8e1 to 6a4058b Compare March 26, 2025 09:03

coderabbitai bot reviewed Mar 26, 2025

View reviewed changes

mayurkmmt merged commit c4be7f1 into main Mar 26, 2025
3 checks passed

Split Segmentation #154

Split Segmentation #154

Uh oh!

Conversation

mayurkmmt commented Mar 25, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

PR Type

Description

Changes walkthrough 📝

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Possibly related issues

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

qodo-merge-pro bot commented Mar 25, 2025

PR Reviewer Guide 🔍

Uh oh!

qodo-merge-pro bot commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Uh oh!

codecov bot commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 25, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 25, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Mar 25, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

murray-ds Mar 26, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 26, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mayurkmmt commented Mar 25, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 25, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

qodo-merge-pro bot commented Mar 25, 2025 •

edited

Loading

codecov bot commented Mar 25, 2025 •

edited

Loading