feat: convert seg stats to use Ibis #90

mvanwyk · 2025-02-08T10:29:39Z

PR Type

Enhancement, Tests

Description

Migrated segmentation statistics computation from DuckDB to Ibis framework.
Enhanced SegTransactionStats class with Ibis-based aggregation and metrics.
Updated tests to validate Ibis-based implementation and ensure correctness.
Added new column options and adjusted TOML file handling logic.

Changes walkthrough 📝

Relevant files

Enhancement

segmentation.py	+93/-55
`Transition segmentation statistics to Ibis framework`
pyretailscience/segmentation.py
- Replaced DuckDB with Ibis for data handling and aggregation.
- Updated `SegTransactionStats` to use Ibis for metrics calculation.
- Added `_get_col_order` method for consistent column ordering.
- Refactored `_calc_seg_stats` to support Ibis-based aggregation.

options.py	+2/-1
`Add new column option and refine TOML handling`
pyretailscience/options.py
- Added `customers_pct` column option for percentage calculations.
- Adjusted TOML file handling to remove unnecessary noqa directive.

Tests

test_segmentation.py	+12/-18
`Update tests for Ibis-based segmentation stats`
tests/test_segmentation.py
- Updated tests to validate Ibis-based `SegTransactionStats` implementation.
- Adjusted test data and expected outputs for Ibis compatibility.
- Removed redundant test for unaltered DataFrame.

Configuration changes

pyproject.toml	+1/-0
`Update linting rules for Ibis compatibility`
pyproject.toml
- Added `SLF001` to ignored linting rules for Ibis-specific syntax.

Additional files

segmentation.ipynb	+43/-61

Summary by CodeRabbit

New Features:
- Added a customer percentage metric that enriches data analysis and provides deeper insights into customer trends.
Refactor:
- Enhanced segmentation processing by supporting varied data inputs and implementing improved aggregation methods for more reliable results.
Chores:
- Refined internal configurations and streamlined code formatting to uphold high development quality and maintain a consistent codebase.

coderabbitai · 2025-02-08T10:29:46Z

Walkthrough

This pull request makes several modifications across multiple files. In pyproject.toml, it adds the "SLF001" rule to the ignore list under [tool.ruff.lint]. The options module now includes a new attribute (customers_pct) in its ColumnHelper class and removes an obsolete comment. The segmentation module has been updated to transition from DuckDB to ibis with revised type hints, aggregation logic, and error handling using ColumnHelper, and the corresponding tests have been updated for these changes.

Changes

File(s)	Change Summary
pyproject.toml	Added "SLF001" to the ignore list under `[tool.ruff.lint]` to suppress private member access warnings.
pyretailscience/options.py	Introduced new attribute `self.customers_pct` in `ColumnHelper` and removed an unnecessary `# noqa: SLF001` comment from `load_from_toml`.
pyretailscience/segmentation.py, tests/test_segmentation.py	Transitioned from `DuckDBPyRelation` to `ibis.Table`; updated method signatures, error handling via `ColumnHelper`, and aggregation logic; modified tests for instance creation and datatype consistency.

Possibly related PRs

Linting improvements #54: Adjusted linting configurations to ignore specific rules, closely related to the current update in the ignore list.

Suggested labels

enhancement, Tests

Poem

Hoppity hops, changes abound in my code,
New attributes and flows lighten my humble abode.
With ibis I dance and segmentation sings,
Lint rules now whisper softer, like gentle spring flings.
I nibble on updates with a futuristic glow –
A rabbit’s delight as our projects grow!

qodo-merge-pro · 2025-02-08T10:30:04Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Data Type Handling
The unit_spend column is now explicitly handled as float in tests, but the conversion from other numeric types is not validated in the main code. This could cause precision/rounding issues.

    cols.agg_unit_spend: data[cols.unit_spend].sum(),
    cols.agg_transaction_id: data[cols.transaction_id].nunique(),
    cols.agg_customer_id: data[cols.customer_id].nunique(),

Error Handling
The code assumes all required columns exist after initial validation but doesn't handle cases where columns might be missing in subsequent operations, particularly with the optional unit_qty column.

    if cols.unit_qty in data.columns:
        aggs[cols.agg_unit_qty] = data[cols.unit_qty].sum()

qodo-merge-pro · 2025-02-08T10:30:30Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category Suggestion Impact

Possible issue

Handle zero division cases

Add validation for zero division cases in the metrics calculations. Currently,
division operations could raise ZeroDivisionError if any of the aggregated
values are zero.

pyretailscience/segmentation.py [285-293]

 final_metrics = ibis.union(segment_metrics, total_metrics).mutate(
     **{
-        cols.calc_spend_per_cust: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_customer_id],
-        cols.calc_spend_per_trans: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_transaction_id],
-        cols.calc_trans_per_cust: ibis._[cols.agg_transaction_id] / ibis._[cols.agg_customer_id],
-        cols.customers_pct: ibis._[cols.agg_customer_id] / total_customers,
+        cols.calc_spend_per_cust: ibis.case().when(ibis._[cols.agg_customer_id] > 0, ibis._[cols.agg_unit_spend] / ibis._[cols.agg_customer_id]).else_(0),
+        cols.calc_spend_per_trans: ibis.case().when(ibis._[cols.agg_transaction_id] > 0, ibis._[cols.agg_unit_spend] / ibis._[cols.agg_transaction_id]).else_(0),
+        cols.calc_trans_per_cust: ibis.case().when(ibis._[cols.agg_customer_id] > 0, ibis._[cols.agg_transaction_id] / ibis._[cols.agg_customer_id]).else_(0),
+        cols.customers_pct: ibis.case().when(total_customers > 0, ibis._[cols.agg_customer_id] / total_customers).else_(0),
     },
 )

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8

Why: This is a critical improvement that prevents potential runtime errors by handling division by zero cases using Ibis's case expressions. This enhances the robustness of the metrics calculations.

Medium
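
The guard described in this suggestion can be illustrated without any backend. The following is a minimal plain-Python sketch, not the Ibis expression from the PR: the function names `safe_ratio` and `segment_metrics` and the metric keys are hypothetical. It also makes a different choice from the suggestion above, returning `None` rather than 0 for an undefined ratio, so that "no customers" is not confused with "zero spend per customer".

```python
from __future__ import annotations


def safe_ratio(numerator: float, denominator: float) -> float | None:
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


def segment_metrics(
    spend: float,
    transactions: int,
    customers: int,
    total_customers: int,
) -> dict[str, float | None]:
    """Compute the four ratio metrics for one segment from its aggregates."""
    return {
        "spend_per_customer": safe_ratio(spend, customers),
        "spend_per_transaction": safe_ratio(spend, transactions),
        "transactions_per_customer": safe_ratio(transactions, customers),
        "customers_pct": safe_ratio(customers, total_customers),
    }
```

Whether an undefined ratio should surface as 0, as a null, or be filtered out before the division is a design decision for the library; the later commit titles in this thread suggest the author addressed it with a filter.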

codecov · 2025-02-08T10:31:35Z

Codecov Report

Attention: Patch coverage is 87.80488% with 5 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
pyretailscience/segmentation.py	87.17%	4 Missing and 1 partial ⚠️

Files with missing lines	Coverage Δ
pyretailscience/options.py	`97.50% <100.00%> (+1.54%)`	⬆️
pyretailscience/segmentation.py	`69.01% <87.17%> (+8.85%)`	⬆️

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

pyretailscience/segmentation.py (3)

222-247: Column ordering approach works, but consider consistent naming.

Inserting a literal "units" can occasionally conflict with aggregated column naming. If you prefer uniform naming conventions, consider using a more descriptive or dynamic reference akin to self.agg_unit_qty.

250-302: ibis-based aggregation logic is sound, but check performance for large data.

Switching to ibis for grouping and union is correct. For very large datasets, confirm that combining segment_metrics and total_metrics with ibis.union does not cause excessive overhead. You might want to test with scaled data to ensure acceptable performance.

366-368: Filtering out “Total” is fine, but maintain alignment with aggregator.

If the label for the aggregated row changes later, this filtering might break. For future-proofing, consider referencing a configurable label or a known constant rather than a literal string.
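
One way to read the last nitpick is to define the label once and have both the aggregation and the filter refer to it. The sketch below uses plain dictionaries in place of the real tables, and `TOTAL_LABEL`, `add_total_row`, and `hide_total` are hypothetical names, not identifiers from pyretailscience.

```python
from __future__ import annotations

# Hypothetical module-level constant shared by the aggregator and the filter.
TOTAL_LABEL = "Total"


def add_total_row(
    segment_values: dict[str, float],
    total_label: str = TOTAL_LABEL,
) -> dict[str, float]:
    """Return a copy of the per-segment values with an extra total entry."""
    result = dict(segment_values)
    result[total_label] = sum(segment_values.values())
    return result


def hide_total(
    values: dict[str, float],
    total_label: str = TOTAL_LABEL,
) -> dict[str, float]:
    """Drop the total entry, using the same label the aggregator wrote."""
    return {name: value for name, value in values.items() if name != total_label}
```

Because both functions default to the same constant, renaming the label in one place keeps the filter aligned with the aggregated row.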

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4b4825 and de3a12b.

📒 Files selected for processing (4)

pyproject.toml (1 hunks)
pyretailscience/options.py (2 hunks)
pyretailscience/segmentation.py (4 hunks)
tests/test_segmentation.py (4 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Pre-Commit

🔇 Additional comments (18)

pyretailscience/segmentation.py (9)

156-156: Allowing both DataFrame and ibis.Table is good.

Updating the constructor to accept either a pandas DataFrame or an ibis Table is appropriate for broader flexibility. Just ensure the docstring and usage references this extended input type clearly throughout.

190-191: Typed attribute annotation looks solid.

The explicit type hint (_df: pd.DataFrame | None) helps with clarity and IDE assistance.

192-192: Constructor aligns with ibis-based approach.

Replacing DuckDB references with (pd.DataFrame | ibis.Table) in the constructor is consistent with the PR objective to use ibis.

196-196: Docstring updated accordingly.

The docstring now reflects the new accepted data types, which improves clarity.

202-202: No issues noted here.

Calling ColumnHelper at initialization remains a clean approach to manage column naming.

204-206: Good check for required columns.

Including the segment_col in this list ensures consistent coverage of necessary columns in the data source.

209-210: Optional column handling is clear.

Appending unit_qty only if present in the data’s columns promotes flexible usage.

219-219: Storing the table at initialization is appropriate.

Assigning self.table in the constructor centralizes the aggregated structure for later property access.

305-315: Column reordering is straightforward and clear.

Executing the ibis table and selecting columns in a defined order is well organized.

tests/test_segmentation.py (6)

21-21: Floating spend values improve accuracy.

Replacing integer spend values with floats is a valid step for monetary amounts.

42-42: Verifying percentage column for segments.

The new "customers_pct" usage aligns with the updated code. Values (0.6, 0.4, 1.0) seem correct.

44-47: Minor structural changes.

Refactoring these lines to set up and instantiate SegTransactionStats is neat. No functional concerns.

55-55: Float-based spend values remain consistent.

Ensures test data aligns with the new float usage throughout the module.

72-74: Creating SegTransactionStats instance and verifying output.

Switching from direct static calls to instance-based usage is correct. Sorting by segment_name before comparing is good practice.

94-98: Validating total coverage for single-segment scenarios.

Testing the scenario with 1.0 for the entire population across two rows is consistent with the logic for “segment” plus “Total.”
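
The expected values mentioned in these comments (0.6, 0.4, 1.0, and 1.0 on both rows for a single segment) follow from dividing each segment's distinct customer count by the overall distinct customer count. A small plain-Python sketch of that arithmetic is shown here; the input rows are made up for illustration and are not the PR's test data, and the function name is hypothetical.

```python
from __future__ import annotations


def customers_pct(rows: list[tuple[int, str]]) -> dict[str, float]:
    """Share of distinct customers per segment, plus a "Total" entry.

    Each row is a (customer_id, segment_name) pair; a customer may appear
    on several rows (one per transaction) but is counted once.
    """
    by_segment: dict[str, set[int]] = {}
    all_customers: set[int] = set()
    for customer_id, segment in rows:
        by_segment.setdefault(segment, set()).add(customer_id)
        all_customers.add(customer_id)

    total = len(all_customers)
    if total == 0:
        return {}  # nothing to divide by

    pct = {seg: len(ids) / total for seg, ids in by_segment.items()}
    pct["Total"] = 1.0
    return pct
```

With three customers in segment A and two in segment B, the shares are 3/5 and 2/5; with a single segment, the segment row and the total row both come out at 1.0.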

pyretailscience/options.py (2)

240-240: Comment removal is inconsequential.

No functional impact from dropping the “# noqa: SLF001” comment; code remains clear.

395-395: New attribute customers_pct is coherent.

This addition helps standardize referencing the customer percentage column. Nicely integrates with existing naming patterns.

pyproject.toml (1)

71-71: SLF001 Ignore Rule Addition Approved
The addition of "SLF001" to the ignore list is appropriate given Ibis’s frequent use of internal/private attributes (e.g., ibis._[column]) that would otherwise trigger this lint warning. The inline comment clearly explains the rationale.

Copilot

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

pyretailscience/segmentation.py:314

The column ordering in the df property method should ensure that all required columns are present and in the correct order.

self._df = self.table.execute()[col_order]

pyretailscience/segmentation.py:368

The hide_total check is case-sensitive. Ensure consistency in how 'Total' is handled throughout the code.

val_s = val_s[val_s.index != "Total"]

pyproject.toml:71

The comment has a minor grammatical issue. It should be: 'Ibis makes extensive use of ibis._[column], which triggers this.'

"SLF001", # Ibis makes a lot of use of the ibis._[column] which triggers this

Copilot · 2025-02-08T10:34:59Z

pyretailscience/segmentation.py

+                cols.calc_spend_per_cust: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_customer_id],
+                cols.calc_spend_per_trans: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_transaction_id],
+                cols.calc_trans_per_cust: ibis._[cols.agg_transaction_id] / ibis._[cols.agg_customer_id],
+                cols.customers_pct: ibis._[cols.agg_customer_id] / total_customers,


The usage of ibis._ seems incorrect. It should be replaced with the proper syntax for accessing columns in Ibis expressions.

Suggested change

-                cols.calc_spend_per_cust: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_customer_id],
-                cols.calc_spend_per_trans: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_transaction_id],
-                cols.calc_trans_per_cust: ibis._[cols.agg_transaction_id] / ibis._[cols.agg_customer_id],
-                cols.customers_pct: ibis._[cols.agg_customer_id] / total_customers,
+                cols.calc_spend_per_cust: final_metrics[cols.agg_unit_spend] / final_metrics[cols.agg_customer_id],
+                cols.calc_spend_per_trans: final_metrics[cols.agg_unit_spend] / final_metrics[cols.agg_transaction_id],
+                cols.calc_trans_per_cust: final_metrics[cols.agg_transaction_id] / final_metrics[cols.agg_customer_id],
+                cols.customers_pct: final_metrics[cols.agg_customer_id] / total_customers,

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

pyretailscience/segmentation.py (1)

287-290: ⚠️ Potential issue

Fix incorrect Ibis column access syntax.

The usage of ibis._ is incorrect. Use direct column access instead.

Apply this diff:

-                cols.calc_spend_per_cust: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_customer_id],
-                cols.calc_spend_per_trans: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_transaction_id],
-                cols.calc_trans_per_cust: ibis._[cols.agg_transaction_id] / ibis._[cols.agg_customer_id],
-                cols.customers_pct: ibis._[cols.agg_customer_id] / total_customers,
+                cols.calc_spend_per_cust: final_metrics[cols.agg_unit_spend] / final_metrics[cols.agg_customer_id],
+                cols.calc_spend_per_trans: final_metrics[cols.agg_unit_spend] / final_metrics[cols.agg_transaction_id],
+                cols.calc_trans_per_cust: final_metrics[cols.agg_transaction_id] / final_metrics[cols.agg_customer_id],
+                cols.customers_pct: final_metrics[cols.agg_customer_id] / total_customers,

🧹 Nitpick comments (2)

pyretailscience/segmentation.py (2)
156-156: Update docstring to reflect new type.

The docstring should be updated to match the new type annotation:
-            df (pd.DataFrame): A dataframe with the transaction data. The dataframe must contain a customer_id column.
+            df (pd.DataFrame | ibis.Table): A dataframe or Ibis table with the transaction data. The data must contain a customer_id column.
Also applies to: 169-169

190-219: Update docstring and improve type handling.

Update the docstring to reflect the new type:
-            data (pd.DataFrame | ibis.Table): The transaction data. The dataframe must contain the columns
+            data (pd.DataFrame | ibis.Table): The transaction data. The data must contain the columns
Consider adding type hints for the class variable:
-    _df: pd.DataFrame | None = None
+    _df: pd.DataFrame | None

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between de3a12b and d30da1a.

📒 Files selected for processing (2)

pyretailscience/segmentation.py (4 hunks)
tests/test_segmentation.py (5 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Pre-Commit

🔇 Additional comments (7)

tests/test_segmentation.py (4)

22-22: LGTM! The changes improve data type precision and code structure.

The changes enhance the test by:

Using float values for monetary amounts, which is more appropriate for financial calculations.

Adopting an object-oriented approach that aligns with the new implementation.

Also applies to: 46-48

56-56: LGTM! The changes improve data type precision and code structure.

The changes enhance the test by:

Using float values for monetary amounts, which is more appropriate for financial calculations.

Adopting an object-oriented approach that aligns with the new implementation.

Also applies to: 75-75

99-99: LGTM! The code structure is improved.

The change to object-oriented instantiation aligns with the new implementation.

102-124: LGTM! Good addition of edge case testing.

The new test function effectively verifies the handling of zero net units, ensuring that:

Price per unit calculations correctly handle division by zero.

Other metrics remain accurate even when some segments have zero units.

pyretailscience/segmentation.py (3)

221-249: LGTM! Good addition of column order management.

The new static method effectively:

Centralizes column order management

Handles both cases with and without quantity columns

Improves maintainability by providing a single source of truth
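
The idea of a single function that owns the column order, with quantity columns included only when they exist, might look roughly like the sketch below. This is not the PR's `_get_col_order`: the column names and the positions of the quantity columns are assumptions chosen for illustration.

```python
from __future__ import annotations

# Hypothetical column names; the real ones come from ColumnHelper.
BASE_COLS = [
    "spend",
    "transactions",
    "customers",
    "spend_per_customer",
    "spend_per_transaction",
    "transactions_per_customer",
    "customers_pct",
]


def get_col_order(include_quantity: bool) -> list[str]:
    """Return the output column order, adding quantity columns on request."""
    order = list(BASE_COLS)  # copy so the module-level list is never mutated
    if include_quantity:
        # Assumed placement: raw units next to the other raw aggregates,
        # derived unit metrics at the end.
        order.insert(order.index("customers") + 1, "units")
        order.extend(["price_per_unit", "units_per_transaction"])
    return order
```

Callers can then select columns with the returned list regardless of whether the input data carried a quantity column.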

250-302: LGTM! Good migration to Ibis.

The changes effectively:

Handle both DataFrame and Ibis Table inputs

Use Ibis for aggregations and calculations

Maintain the same functionality while leveraging Ibis features

303-313: LGTM! Good use of column order management.

The property effectively:

Uses the new _get_col_order method for consistent column ordering

Handles the conversion from Ibis Table to DataFrame

Caches the result for better performance
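
The caching behaviour described for the property can be sketched in isolation. In this hypothetical example, `LazyStats` and its `compute` callable stand in for the real class and for executing the Ibis table; a plain list stands in for the DataFrame.

```python
from __future__ import annotations

from typing import Callable


class LazyStats:
    """Compute a result on first access and reuse it afterwards."""

    def __init__(self, compute: Callable[[], list]) -> None:
        self._compute = compute
        self._df: list | None = None  # filled in on first access

    @property
    def df(self) -> list:
        if self._df is None:
            # The expensive step (executing the query) runs only once.
            self._df = self._compute()
        return self._df
```

Repeated reads of `df` return the same object, so the underlying query is not re-executed on every access.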

feat: convert seg stats to use Ibis

de3a12b

mvanwyk requested a review from Copilot February 8, 2025 10:29

mvanwyk self-assigned this Feb 8, 2025

qodo-merge-pro bot added the Review effort [1-5]: 3 label Feb 8, 2025

coderabbitai bot reviewed Feb 8, 2025

Copilot AI reviewed Feb 8, 2025

mvanwyk added 2 commits February 8, 2025 12:15

feat: add having filer to avoid div by zero

bbea343

feat: seg stats handle 0 unit segments

d30da1a

coderabbitai bot reviewed Feb 8, 2025

mvanwyk merged commit 68de169 into main Feb 8, 2025
4 checks passed

mvanwyk deleted the ibis_segstats branch February 8, 2025 14:30

coderabbitai bot mentioned this pull request Feb 14, 2025

refactor with ibis #95

Merged

This was referenced Mar 15, 2025

feat: add extra_aggs parameter to SegTransactionStats #138

Merged

feat: update SegTransactionStats to support multiple segment columns… #139

Merged
