feat: convert seg stats to use Ibis #90

Merged: 3 commits, Feb 8, 2025
Conversation

mvanwyk (Contributor) commented Feb 8, 2025

PR Type

Enhancement, Tests


Description

  • Migrated segmentation statistics computation from DuckDB to Ibis framework.

  • Enhanced SegTransactionStats class with Ibis-based aggregation and metrics.

  • Updated tests to validate Ibis-based implementation and ensure correctness.

  • Added new column options and adjusted TOML file handling logic.


Changes walkthrough 📝

Relevant files

Enhancement

pyretailscience/segmentation.py (+93/-55): Transition segmentation statistics to Ibis framework

  • Replaced DuckDB with Ibis for data handling and aggregation.
  • Updated SegTransactionStats to use Ibis for metrics calculation.
  • Added _get_col_order method for consistent column ordering.
  • Refactored _calc_seg_stats to support Ibis-based aggregation.

pyretailscience/options.py (+2/-1): Add new column option and refine TOML handling

  • Added customers_pct column option for percentage calculations.
  • Adjusted TOML file handling to remove unnecessary noqa directive.

Tests

tests/test_segmentation.py (+12/-18): Update tests for Ibis-based segmentation stats

  • Updated tests to validate Ibis-based SegTransactionStats implementation.
  • Adjusted test data and expected outputs for Ibis compatibility.
  • Removed redundant test for unaltered DataFrame.

Configuration changes

pyproject.toml (+1/-0): Update linting rules for Ibis compatibility

  • Added SLF001 to ignored linting rules for Ibis-specific syntax.

Additional files

segmentation.ipynb (+43/-61)

Summary by CodeRabbit

  • New Features:
    • Added a customer percentage metric that enriches data analysis and provides deeper insights into customer trends.
  • Refactor:
    • Enhanced segmentation processing by supporting varied data inputs and implementing improved aggregation methods for more reliable results.
  • Chores:
    • Refined internal configurations and streamlined code formatting to uphold high development quality and maintain a consistent codebase.

    mvanwyk requested a review from Copilot on February 8, 2025 10:29
    mvanwyk self-assigned this on Feb 8, 2025

    coderabbitai bot commented Feb 8, 2025

    Walkthrough

    This pull request makes several modifications across multiple files. In pyproject.toml, it adds the "SLF001" rule to the ignore list under [tool.ruff.lint]. The options module now includes a new attribute (customers_pct) in its ColumnHelper class and removes an obsolete comment. The segmentation module has been updated to transition from DuckDB to ibis with revised type hints, aggregation logic, and error handling using ColumnHelper, and the corresponding tests have been updated for these changes.

    Changes

    File(s) | Change Summary
    pyproject.toml | Added "SLF001" to the ignore list under [tool.ruff.lint] to suppress private member access warnings.
    pyretailscience/options.py | Introduced new attribute self.customers_pct in ColumnHelper and removed an unnecessary # noqa: SLF001 comment from load_from_toml.
    pyretailscience/segmentation.py, tests/test_segmentation.py | Transitioned from DuckDBPyRelation to ibis.Table; updated method signatures, error handling via ColumnHelper, and aggregation logic; modified tests for instance creation and datatype consistency.

    Possibly related PRs

    • Linting improvements #54: Adjusted linting configurations to ignore specific rules, closely related to the current update in the ignore list.

    Suggested labels

    enhancement, Tests

    Poem

    Hoppity hops, changes abound in my code,
    New attributes and flows lighten my humble abode.
    With ibis I dance and segmentation sings,
    Lint rules now whisper softer, like gentle spring flings.
    I nibble on updates with a futuristic glow –
    A rabbit’s delight as our projects grow!

    qodo-merge-pro bot (Contributor) commented Feb 8, 2025

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Data Type Handling

    The unit_spend column is now explicitly handled as float in tests but the conversion from other numeric types is not validated in the main code. This could cause precision/rounding issues.

    cols.agg_unit_spend: data[cols.unit_spend].sum(),
    cols.agg_transaction_id: data[cols.transaction_id].nunique(),
    cols.agg_customer_id: data[cols.customer_id].nunique(),

    Error Handling

    The code assumes all required columns exist after initial validation but doesn't handle cases where columns might be missing in subsequent operations, particularly with the unit_qty optional column.

    if cols.unit_qty in data.columns:
        aggs[cols.agg_unit_qty] = data[cols.unit_qty].sum()
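
The optional-column pattern in the snippet above can be sketched in plain Python. This illustrates the logic only and is not the Ibis code from the PR; the `aggregate` helper, the row dictionaries, and the output key names are assumptions made for the example.

```python
def aggregate(rows: list) -> dict:
    """Aggregate transaction rows; include units only when the column exists."""
    columns = set(rows[0]) if rows else set()
    aggs = {
        "spend": sum(r["unit_spend"] for r in rows),
        # Distinct counts, mirroring nunique() in the review snippet.
        "transactions": len({r["transaction_id"] for r in rows}),
        "customers": len({r["customer_id"] for r in rows}),
    }
    # Optional column: only aggregate it when the input actually has it.
    if "unit_quantity" in columns:
        aggs["units"] = sum(r["unit_quantity"] for r in rows)
    return aggs


rows_without_qty = [
    {"customer_id": 1, "transaction_id": 10, "unit_spend": 5.0},
    {"customer_id": 1, "transaction_id": 11, "unit_spend": 7.5},
    {"customer_id": 2, "transaction_id": 12, "unit_spend": 2.5},
]
# Hypothetical variant of the same data with a quantity column added.
rows_with_qty = [dict(r, unit_quantity=2) for r in rows_without_qty]

result_without = aggregate(rows_without_qty)
result_with = aggregate(rows_with_qty)
```

Because the presence check runs once against the input's columns, later steps only need to test whether the `units` key exists in the aggregated result.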

    qodo-merge-pro bot (Contributor) commented Feb 8, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    Category: Possible issue
    Suggestion: Handle zero division cases

    Add validation for zero division cases in the metrics calculations. Currently,
    division operations could raise ZeroDivisionError if any of the aggregated
    values are zero.

    pyretailscience/segmentation.py [285-293]

     final_metrics = ibis.union(segment_metrics, total_metrics).mutate(
         **{
    -        cols.calc_spend_per_cust: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_customer_id],
    -        cols.calc_spend_per_trans: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_transaction_id],
    -        cols.calc_trans_per_cust: ibis._[cols.agg_transaction_id] / ibis._[cols.agg_customer_id],
    -        cols.customers_pct: ibis._[cols.agg_customer_id] / total_customers,
    +        cols.calc_spend_per_cust: ibis.case().when(ibis._[cols.agg_customer_id] > 0, ibis._[cols.agg_unit_spend] / ibis._[cols.agg_customer_id]).else_(0),
    +        cols.calc_spend_per_trans: ibis.case().when(ibis._[cols.agg_transaction_id] > 0, ibis._[cols.agg_unit_spend] / ibis._[cols.agg_transaction_id]).else_(0),
    +        cols.calc_trans_per_cust: ibis.case().when(ibis._[cols.agg_customer_id] > 0, ibis._[cols.agg_transaction_id] / ibis._[cols.agg_customer_id]).else_(0),
    +        cols.customers_pct: ibis.case().when(total_customers > 0, ibis._[cols.agg_customer_id] / total_customers).else_(0),
         },
     )

    [To ensure code accuracy, apply this suggestion manually]

    Suggestion importance[1-10]: 8


    Why: This is a critical improvement that prevents potential runtime errors by handling division by zero cases using Ibis's case expressions. This enhances the robustness of the metrics calculations.

    Impact: Medium
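
The guard proposed in this suggestion amounts to "divide only when the denominator is non-zero, otherwise fall back to a default". A minimal plain-Python sketch of that rule follows; `safe_ratio` and the sample dictionaries are hypothetical and not part of the PR. Note that Ibis expressions compile to backend queries, so the outcome of an unguarded zero denominator may depend on the backend rather than surfacing as Python's ZeroDivisionError.

```python
def safe_ratio(numerator: float, denominator: float, default: float = 0.0) -> float:
    """Return numerator / denominator, or `default` when the denominator is zero."""
    if denominator == 0:
        return default
    return numerator / denominator


# Hypothetical aggregated values for one populated and one empty segment.
segment = {"spend": 120.0, "transactions": 8, "customers": 4}
empty_segment = {"spend": 0.0, "transactions": 0, "customers": 0}

spend_per_customer = safe_ratio(segment["spend"], segment["customers"])
spend_per_transaction = safe_ratio(segment["spend"], segment["transactions"])
empty_spend_per_customer = safe_ratio(empty_segment["spend"], empty_segment["customers"])
```

Whether 0 is the right fallback is a design choice: it keeps the column numeric, but a missing value may describe an empty segment more honestly than a zero average.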

    codecov bot commented Feb 8, 2025

    Codecov Report

    Attention: Patch coverage is 87.80488% with 5 lines in your changes missing coverage. Please review.

    Files with missing lines | Patch % | Lines
    pyretailscience/segmentation.py | 87.17% | 4 Missing and 1 partial ⚠️

    Files with missing lines | Coverage Δ
    pyretailscience/options.py | 97.50% <100.00%> (+1.54%) ⬆️
    pyretailscience/segmentation.py | 69.01% <87.17%> (+8.85%) ⬆️

    coderabbitai bot left a comment

    Actionable comments posted: 0

    🧹 Nitpick comments (3)
    pyretailscience/segmentation.py (3)

    222-247: Column ordering approach works, but consider consistent naming.

    Inserting a literal "units" can occasionally conflict with aggregated column naming. If you prefer uniform naming conventions, consider using a more descriptive or dynamic reference akin to self.agg_unit_qty.


    250-302: ibis-based aggregation logic is sound, but check performance for large data.

    Switching to ibis for grouping and union is correct. For very large datasets, confirm that combining segment_metrics and total_metrics with ibis.union does not cause excessive overhead. You might want to test with scaled data to ensure acceptable performance.


    366-368: Filtering out “Total” is fine, but maintain alignment with aggregator.

    If the label for the aggregated row changes later, this filtering might break. For future-proofing, consider referencing a configurable label or a known constant rather than a literal string.
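
One way to follow this advice is to keep the label in a single named constant that both the aggregation and the filter read. The sketch below is plain Python with hypothetical names (`TOTAL_LABEL`, `drop_total`) and is not taken from the PR.

```python
# Single source of truth for the label of the aggregated row (assumed name).
TOTAL_LABEL = "Total"


def drop_total(values: dict, hide_total: bool) -> dict:
    """Return a copy of `values`, without the total row when hide_total is set."""
    if not hide_total:
        return dict(values)
    return {name: v for name, v in values.items() if name != TOTAL_LABEL}


spend_by_segment = {"A": 100.0, "B": 50.0, TOTAL_LABEL: 150.0}
visible = drop_total(spend_by_segment, hide_total=True)
unfiltered = drop_total(spend_by_segment, hide_total=False)
```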

    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between f4b4825 and de3a12b.

    📒 Files selected for processing (4)
    • pyproject.toml (1 hunks)
    • pyretailscience/options.py (2 hunks)
    • pyretailscience/segmentation.py (4 hunks)
    • tests/test_segmentation.py (4 hunks)
    ⏰ Context from checks skipped due to timeout of 90000ms (1)
    • GitHub Check: Pre-Commit
    🔇 Additional comments (18)
    pyretailscience/segmentation.py (9)

    156-156: Allowing both DataFrame and ibis.Table is good.

    Updating the constructor to accept either a pandas DataFrame or an ibis Table is appropriate for broader flexibility. Just ensure the docstring and usage references this extended input type clearly throughout.


    190-191: Typed attribute annotation looks solid.

    The explicit type hint (_df: pd.DataFrame | None) helps with clarity and IDE assistance.


    192-192: Constructor aligns with ibis-based approach.

    Replacing DuckDB references with (pd.DataFrame | ibis.Table) in the constructor is consistent with the PR objective to use ibis.


    196-196: Docstring updated accordingly.

    The docstring now reflects the new accepted data types, which improves clarity.


    202-202: No issues noted here.

    Calling ColumnHelper at initialization remains a clean approach to manage column naming.


    204-206: Good check for required columns.

    Including the segment_col in this list ensures consistent coverage of necessary columns in the data source.


    209-210: Optional column handling is clear.

    Appending unit_qty only if present in the data’s columns promotes flexible usage.


    219-219: Storing the table at initialization is appropriate.

    Assigning self.table in the constructor centralizes the aggregated structure for later property access.


    305-315: Column reordering is straightforward and clear.

    Executing the ibis table and selecting columns in a defined order is well organized.

    tests/test_segmentation.py (6)

    21-21: Floating spend values improve accuracy.

    Replacing integer spend values with floats is a valid step for monetary amounts.


    42-42: Verifying percentage column for segments.

    The new "customers_pct" usage aligns with the updated code. Values (0.6, 0.4, 1.0) seem correct.
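
As a worked example of how values like these arise, assume five customers split three and two across two segments. These counts are an assumption for illustration; the PR's test data is not shown in this thread.

```python
# Hypothetical distinct-customer counts per segment.
customers_by_segment = {"A": 3, "B": 2}

# Assumes every customer belongs to exactly one segment, so the segment
# counts add up to the overall number of distinct customers.
total_customers = sum(customers_by_segment.values())

customers_pct = {
    segment: count / total_customers
    for segment, count in customers_by_segment.items()
}
# The total row is the whole population divided by itself.
customers_pct["Total"] = total_customers / total_customers
```

If a customer could appear in more than one segment, the denominator would need to be a distinct count over the whole dataset rather than a sum of segment counts.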


    44-47: Minor structural changes.

    Refactoring these lines to set up and instantiate SegTransactionStats is neat. No functional concerns.


    55-55: Float-based spend values remain consistent.

    Ensures test data aligns with the new float usage throughout the module.


    72-74: Creating SegTransactionStats instance and verifying output.

    Switching from direct static calls to instance-based usage is correct. Sorting by segment_name before comparing is good practice.


    94-98: Validating total coverage for single-segment scenarios.

    Testing the scenario with 1.0 for the entire population across two rows is consistent with the logic for “segment” plus “Total.”

    pyretailscience/options.py (2)

    240-240: Comment removal is inconsequential.

    No functional impact from dropping the “# noqa: SLF001” comment; code remains clear.


    395-395: New attribute customers_pct is coherent.

    This addition helps standardize referencing the customer percentage column. Nicely integrates with existing naming patterns.

    pyproject.toml (1)

    71-71: SLF001 Ignore Rule Addition Approved
    The addition of "SLF001" to the ignore list is appropriate given Ibis’s frequent use of internal/private attributes (e.g., ibis._[column]) that would otherwise trigger this lint warning. The inline comment clearly explains the rationale.

    Copilot AI (Contributor) left a comment

    Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

    Comments suppressed due to low confidence (3)

    pyretailscience/segmentation.py:314

    • The column ordering in the df property method should ensure that all required columns are present and in the correct order.
    self._df = self.table.execute()[col_order]
    

    pyretailscience/segmentation.py:368

    • The hide_total check is case-sensitive. Ensure consistency in how 'Total' is handled throughout the code.
    val_s = val_s[val_s.index != "Total"]
    

    pyproject.toml:71

    • The comment has a minor grammatical issue. It should be: 'Ibis makes extensive use of ibis._[column], which triggers this.'
    "SLF001", # Ibis makes a lot of use of the ibis._[column] which triggers this
    

    Comment on lines +288 to +291
    cols.calc_spend_per_cust: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_customer_id],
    cols.calc_spend_per_trans: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_transaction_id],
    cols.calc_trans_per_cust: ibis._[cols.agg_transaction_id] / ibis._[cols.agg_customer_id],
    cols.customers_pct: ibis._[cols.agg_customer_id] / total_customers,

    Copilot AI commented Feb 8, 2025

    The usage of ibis._ seems incorrect. It should be replaced with the proper syntax for accessing columns in Ibis expressions.

    Suggested change
    - cols.calc_spend_per_cust: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_customer_id],
    - cols.calc_spend_per_trans: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_transaction_id],
    - cols.calc_trans_per_cust: ibis._[cols.agg_transaction_id] / ibis._[cols.agg_customer_id],
    - cols.customers_pct: ibis._[cols.agg_customer_id] / total_customers,
    + cols.calc_spend_per_cust: final_metrics[cols.agg_unit_spend] / final_metrics[cols.agg_customer_id],
    + cols.calc_spend_per_trans: final_metrics[cols.agg_unit_spend] / final_metrics[cols.agg_transaction_id],
    + cols.calc_trans_per_cust: final_metrics[cols.agg_transaction_id] / final_metrics[cols.agg_customer_id],
    + cols.customers_pct: final_metrics[cols.agg_customer_id] / total_customers,
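
For context when weighing this suggestion: `ibis._` is Ibis's documented deferred-expression object (often imported as `from ibis import _`), and inside `mutate` it resolves against the table the method is called on. The replacement, by contrast, reads `final_metrics` inside the statement that assigns it, which fails if that statement is the first assignment, as the diff earlier in the thread suggests. The toy below is a plain-Python sketch of the deferred idea, not Ibis's implementation; `Deferred`, `Table`, and the column names are hypothetical.

```python
class Deferred:
    """Records a column lookup (and optional division) to evaluate later."""

    def __init__(self, resolver=None):
        # `resolver` maps a row dict to a value; None marks the root object.
        self._resolver = resolver

    def __getitem__(self, name):
        return Deferred(lambda row: row[name])

    def __truediv__(self, other):
        return Deferred(lambda row: self._resolver(row) / other._resolver(row))

    def resolve(self, row):
        return self._resolver(row)


_ = Deferred()


class Table:
    def __init__(self, rows):
        self.rows = rows

    def mutate(self, **exprs):
        # Each deferred expression is resolved against this table's own rows.
        return Table(
            [{**row, **{k: e.resolve(row) for k, e in exprs.items()}} for row in self.rows]
        )


metrics = Table([{"spend": 120.0, "customers": 4}]).mutate(
    spend_per_customer=_["spend"] / _["customers"]
)


def build_with_self_reference():
    # Reads `final` while evaluating the expression that first assigns it.
    final = Table([{"spend": 1.0}]).mutate(copy=final["spend"])
    return final


try:
    build_with_self_reference()
    self_reference_error = None
except NameError as exc:  # UnboundLocalError is a subclass of NameError
    self_reference_error = type(exc).__name__
```

The deferred form needs no name for the intermediate table, which is why it suits a chained `union(...).mutate(...)` call.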

    coderabbitai bot left a comment

    Actionable comments posted: 0

    ♻️ Duplicate comments (1)
    pyretailscience/segmentation.py (1)

    287-290: ⚠️ Potential issue

    Fix incorrect Ibis column access syntax.

    The usage of ibis._ is incorrect. Use direct column access instead.

    Apply this diff:

    -                cols.calc_spend_per_cust: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_customer_id],
    -                cols.calc_spend_per_trans: ibis._[cols.agg_unit_spend] / ibis._[cols.agg_transaction_id],
    -                cols.calc_trans_per_cust: ibis._[cols.agg_transaction_id] / ibis._[cols.agg_customer_id],
    -                cols.customers_pct: ibis._[cols.agg_customer_id] / total_customers,
    +                cols.calc_spend_per_cust: final_metrics[cols.agg_unit_spend] / final_metrics[cols.agg_customer_id],
    +                cols.calc_spend_per_trans: final_metrics[cols.agg_unit_spend] / final_metrics[cols.agg_transaction_id],
    +                cols.calc_trans_per_cust: final_metrics[cols.agg_transaction_id] / final_metrics[cols.agg_customer_id],
    +                cols.customers_pct: final_metrics[cols.agg_customer_id] / total_customers,
    🧹 Nitpick comments (2)
    pyretailscience/segmentation.py (2)

    156-156: Update docstring to reflect new type.

    The docstring should be updated to match the new type annotation:

    -            df (pd.DataFrame): A dataframe with the transaction data. The dataframe must contain a customer_id column.
    +            df (pd.DataFrame | ibis.Table): A dataframe or Ibis table with the transaction data. The data must contain a customer_id column.

    Also applies to: 169-169


    190-219: Update docstring and improve type handling.

    1. Update the docstring to reflect the new type:
    -            data (pd.DataFrame | ibis.Table): The transaction data. The dataframe must contain the columns
    +            data (pd.DataFrame | ibis.Table): The transaction data. The data must contain the columns
    2. Consider adding type hints for the class variable:
    -    _df: pd.DataFrame | None = None
    +    _df: pd.DataFrame | None
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between de3a12b and d30da1a.

    📒 Files selected for processing (2)
    • pyretailscience/segmentation.py (4 hunks)
    • tests/test_segmentation.py (5 hunks)
    ⏰ Context from checks skipped due to timeout of 90000ms (1)
    • GitHub Check: Pre-Commit
    🔇 Additional comments (7)
    tests/test_segmentation.py (4)

    22-22: LGTM! The changes improve data type precision and code structure.

    The changes enhance the test by:

    1. Using float values for monetary amounts, which is more appropriate for financial calculations.
    2. Adopting an object-oriented approach that aligns with the new implementation.

    Also applies to: 46-48


    56-56: LGTM! The changes improve data type precision and code structure.

    The changes enhance the test by:

    1. Using float values for monetary amounts, which is more appropriate for financial calculations.
    2. Adopting an object-oriented approach that aligns with the new implementation.

    Also applies to: 75-75


    99-99: LGTM! The code structure is improved.

    The change to object-oriented instantiation aligns with the new implementation.


    102-124: LGTM! Good addition of edge case testing.

    The new test function effectively verifies the handling of zero net units, ensuring that:

    1. Price per unit calculations correctly handle division by zero.
    2. Other metrics remain accurate even when some segments have zero units.
    pyretailscience/segmentation.py (3)

    221-249: LGTM! Good addition of column order management.

    The new static method effectively:

    1. Centralizes column order management
    2. Handles both cases with and without quantity columns
    3. Improves maintainability by providing a single source of truth
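
The single-source-of-truth idea described here can be sketched as follows. The column names and the `get_col_order` function are assumptions for illustration; the PR's `_get_col_order` and its actual column list are not reproduced in this thread.

```python
# Hypothetical column names; the real ones come from the project's options.
BASE_COLUMNS = [
    "spend",
    "transactions",
    "customers",
    "spend_per_customer",
    "spend_per_transaction",
    "transactions_per_customer",
    "customers_pct",
]
QUANTITY_COLUMNS = ["units", "units_per_transaction", "price_per_unit"]


def get_col_order(include_quantity: bool) -> list:
    """Return the output column order, with quantity columns only if requested."""
    order = list(BASE_COLUMNS)  # copy so callers cannot mutate the constant
    if include_quantity:
        order.extend(QUANTITY_COLUMNS)
    return order


order_without_qty = get_col_order(include_quantity=False)
order_with_qty = get_col_order(include_quantity=True)
```

Keeping the names in module-level constants also addresses the earlier nitpick about a bare "units" literal.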

    250-302: LGTM! Good migration to Ibis.

    The changes effectively:

    1. Handle both DataFrame and Ibis Table inputs
    2. Use Ibis for aggregations and calculations
    3. Maintain the same functionality while leveraging Ibis features

    303-313: LGTM! Good use of column order management.

    The property effectively:

    1. Uses the new _get_col_order method for consistent column ordering
    2. Handles the conversion from Ibis Table to DataFrame
    3. Caches the result for better performance
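
The caching behaviour in point 3 follows a common compute-once pattern. A minimal sketch, assuming a hypothetical `CachedStats` class and a stand-in `expensive` function in place of executing the Ibis table:

```python
class CachedStats:
    """Computes a result on first access and reuses it afterwards."""

    def __init__(self, compute):
        self._compute = compute  # callable standing in for table execution
        self._df = None

    @property
    def df(self):
        if self._df is None:
            # First access: run the expensive step and remember the result.
            self._df = self._compute()
        return self._df


calls = []


def expensive():
    calls.append(1)  # track how many times the computation runs
    return [{"segment_name": "A", "spend": 10.0}]


stats = CachedStats(expensive)
first = stats.df
second = stats.df
```

The trade-off is that the cached result will not reflect later changes to the underlying data unless the cache is reset.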

    mvanwyk merged commit 68de169 into main on Feb 8, 2025 (4 checks passed)
    mvanwyk deleted the ibis_segstats branch on February 8, 2025 14:30
    coderabbitai bot mentioned this pull request on Feb 14, 2025