feat: segment stats calc now uses duckdb to improve performance #74


Merged
mvanwyk merged 3 commits into main from duckdb_seg on Aug 9, 2024

Conversation

mvanwyk (Contributor) commented Aug 9, 2024

PR Type

Enhancement, Tests


Description

  • Integrated DuckDB to improve performance in segment statistics calculation (see the sketch after this list).
  • Added new calculated columns: spend_per_customer, spend_per_transaction, and transactions_per_customer.
  • Updated tests to cover new calculated columns and DuckDB integration.
  • Added duckdb as a new dependency in pyproject.toml.
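
As a rough sketch of the approach (not the actual pyretailscience implementation; the sample data, column names, and SQL below are illustrative assumptions), a pandas DataFrame can be wrapped as a DuckDB relation and the per-segment aggregates plus the new derived metrics computed in a single aggregation:

import duckdb
import pandas as pd

# Hypothetical transaction-level data; the column names are assumptions for this sketch.
df = pd.DataFrame(
    {
        "segment_name": ["heavy", "heavy", "light", "light"],
        "customer_id": [1, 1, 2, 3],
        "transaction_id": [10, 11, 12, 13],
        "unit_spend": [100.0, 50.0, 20.0, 30.0],
    }
)

rel = duckdb.from_df(df)  # wrap the DataFrame as a DuckDBPyRelation

seg_stats = rel.aggregate(
    """
    segment_name,
    SUM(unit_spend) AS spend,
    COUNT(DISTINCT transaction_id) AS transactions,
    COUNT(DISTINCT customer_id) AS customers,
    SUM(unit_spend) / COUNT(DISTINCT customer_id) AS spend_per_customer,
    SUM(unit_spend) / COUNT(DISTINCT transaction_id) AS spend_per_transaction,
    COUNT(DISTINCT transaction_id) / COUNT(DISTINCT customer_id) AS transactions_per_customer
    """,
    "segment_name",
).df()  # materialize the aggregated relation back into a pandas DataFrame

print(seg_stats)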

Changes walkthrough 📝

Relevant files:

Enhancement

• pyretailscience/options.py: Add new calculated columns for spend and transactions (+6/-0)
  • Added new calculated columns: spend_per_customer, spend_per_transaction, and transactions_per_customer.
  • Updated descriptions for the new calculated columns.

• pyretailscience/segmentation.py: Use DuckDB for segment statistics calculation (+64/-39)
  • Integrated DuckDB for performance improvement in segment statistics calculation.
  • Modified _calc_seg_stats to handle both pd.DataFrame and DuckDBPyRelation.
  • Added new calculations for spend_per_customer, spend_per_transaction, and transactions_per_customer.

Tests

• tests/test_segmentation.py: Update tests for new calculated columns and DuckDB integration (+21/-9)
  • Updated tests to include new calculated columns.
  • Adjusted expected outputs to match new calculations.

Dependencies

• pyproject.toml: Add DuckDB dependency (+1/-0)
  • Added duckdb as a new dependency.

💡 PR-Agent usage:
Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    Summary by CodeRabbit

    • New Features

      • Introduced new financial metrics: spend per customer, spend per transaction, and transactions per customer.
      • Added support for using DuckDB as a data source, enhancing data processing capabilities.
    • Bug Fixes

      • Improved clarity and structure of output data in segmentation tests.
    • Tests

      • Updated tests to include new metrics and ensure output accuracy with segment categorization.


    coderabbitai bot commented Aug 9, 2024

    Warning

    Rate limit exceeded

    @mvanwyk has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 22 minutes and 5 seconds before requesting another review.

    How to resolve this issue?

    After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

    We recommend that you space out your commits to avoid hitting the rate limit.

    How do rate limits work?

    CodeRabbit enforces hourly rate limits for each developer per organization.

    Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

    Please see our FAQ for further information.

    Commits

    Files that changed from the base of the PR and between 9ff1e22 and 697f761.

    Walkthrough

    This update enhances a Python project by introducing the duckdb library for improved data handling and analytics. It also expands functionality within the SegTransactionStats class to accommodate both DataFrames and DuckDB relations, optimizing statistical calculations. New calculated metrics enrich analytical capabilities, and test cases have been updated for clarity and precision, ensuring better alignment with the revised functionalities.
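
    To illustrate the dual-input handling mentioned above, here is a minimal sketch (the helper name and error message are assumptions, not the actual SegTransactionStats code; the type check mirrors the suggestion applied later in this PR):

    import duckdb
    import pandas as pd
    from duckdb import DuckDBPyRelation

    def to_relation(data: pd.DataFrame | DuckDBPyRelation) -> DuckDBPyRelation:
        """Normalize input so the aggregation step always works on a DuckDB relation."""
        if isinstance(data, pd.DataFrame):
            return duckdb.from_df(data)
        if isinstance(data, DuckDBPyRelation):
            return data
        raise TypeError("data must be either a pandas DataFrame or a DuckDBPyRelation")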

    Changes

    • pyproject.toml: Added duckdb library dependency with version constraint ^1.0.0.
    • pyretailscience/options.py: Introduced new calculated columns spend_per_customer, spend_per_transaction, and transactions_per_customer in class initialization.
    • pyretailscience/segmentation.py: Modified SegTransactionStats to accept DuckDBPyRelation alongside pd.DataFrame, enhancing data flexibility. Updated _calc_seg_stats for performance improvements.
    • tests/test_segmentation.py: Updated test cases with a new segment_name column, refined DataFrame values, and improved indexing for clarity and precision in expected outputs.

    Sequence Diagram(s)

    sequenceDiagram
        participant User
        participant Application
        participant DuckDB
        participant DataFrame
    
        User->>Application: Request transaction stats
        Application->>DataFrame: Retrieve data
        alt DataFrame detected
            Application->>DuckDB: Convert to DuckDBPyRelation
        end
        Application->>DuckDB: Perform aggregation
        DuckDB-->>Application: Return aggregated results
        Application-->>User: Display transaction stats
    

    Poem

    🐰 In the meadow of code, new features bloom,
    DuckDB's magic dispels all gloom.
    Metrics now dance, in splendid array,
    Data flows swiftly, come join the play!
    With tests shining bright, all's clear and right,
    Let's hop into progress, a joyful delight! 🌟



    qodo-merge-pro bot (Contributor) commented Aug 9, 2024

    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Key issues to review

    Possible Bug
    The method _calc_seg_stats in the SegTransactionStats class has been significantly refactored to use DuckDB. Ensure that the DuckDB operations are correctly handling the data types and aggregations, especially when converting from pd.DataFrame to DuckDBPyRelation. The use of aggregate functions and conditional checks on DataFrame columns should be carefully reviewed to avoid runtime errors or incorrect calculations.

    Error Handling
    The new implementation in SegTransactionStats class does not seem to handle potential errors or exceptions that might occur during the database operations with DuckDB, such as connection issues or SQL execution errors. Robust error handling mechanisms should be considered to ensure the stability of the application.
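
    As a hedged illustration of this point (not code from the PR; the helper name, wrapper, and error message are assumptions), DuckDB failures could be surfaced with clearer context:

    import duckdb
    import pandas as pd

    def safe_aggregate(rel: duckdb.DuckDBPyRelation, aggr_expr: str, group_expr: str) -> pd.DataFrame:
        """Run a DuckDB aggregation and re-raise failures with added context."""
        try:
            return rel.aggregate(aggr_expr, group_expr).df()
        except duckdb.Error as exc:  # duckdb.Error is DuckDB's base exception class
            raise RuntimeError(f"Segment statistics aggregation failed: {exc}") from exc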

    qodo-merge-pro bot (Contributor) commented Aug 9, 2024

    PR Code Suggestions ✨

    Robustness
    Prevent division by zero errors in calculations

    Ensure that the division operations handle potential division by zero errors, which
    could occur if any of the aggregate counts are zero.

    pyretailscience/segmentation.py [267-271]

    -f"{get_option('column.agg.unit_spend')} / {get_option('column.agg.customer_id')} as {get_option('column.calc.spend_per_customer')},",
    -f"{get_option('column.agg.unit_spend')} / {get_option('column.agg.transaction_id')} as {get_option('column.calc.spend_per_transaction')},",
    -f"{get_option('column.agg.transaction_id')} / {get_option('column.agg.customer_id')} as {get_option('column.calc.transactions_per_customer')},"
    +f"CASE WHEN {get_option('column.agg.customer_id')} = 0 THEN NULL ELSE {get_option('column.agg.unit_spend')} / {get_option('column.agg.customer_id')} END as {get_option('column.calc.spend_per_customer')},",
    +f"CASE WHEN {get_option('column.agg.transaction_id')} = 0 THEN NULL ELSE {get_option('column.agg.unit_spend')} / {get_option('column.agg.transaction_id')} END as {get_option('column.calc.spend_per_transaction')},",
    +f"CASE WHEN {get_option('column.agg.customer_id')} = 0 THEN NULL ELSE {get_option('column.agg.transaction_id')} / {get_option('column.agg.customer_id')} END as {get_option('column.calc.transactions_per_customer')},"
     
    Suggestion importance[1-10]: 10

    Why: Ensuring that division operations handle potential division by zero errors is crucial for robustness and correctness of the calculations. This suggestion addresses a significant potential bug.

    ✅ Add error handling for unsupported data types

    Add error handling for the case where the data parameter is neither a pd.DataFrame
    nor a DuckDBPyRelation to ensure robustness and clear error reporting.

    pyretailscience/segmentation.py [255-256]

     if isinstance(data, pd.DataFrame):
         data = duckdb.from_df(data)
    +elif not isinstance(data, DuckDBPyRelation):
    +    raise TypeError("data must be either a pandas DataFrame or a DuckDBPyRelation")
     

    [Suggestion has been applied]

    Suggestion importance[1-10]: 9

    Why: Adding error handling for unsupported data types increases the robustness of the code and ensures clear error reporting, which is crucial for debugging and maintaining the code.

    Performance
    ✅ Optimize the checking of missing columns for efficiency

    Use a more efficient method for checking missing columns by using set operations
    instead of list comprehensions, which can be slower for large column lists (a short
    usage sketch follows this suggestion).

    pyretailscience/segmentation.py [233]

    -missing_cols = [col for col in required_cols if col not in data.columns]
    +missing_cols = set(required_cols) - set(data.columns)
     

    [Suggestion has been applied]

    Suggestion importance[1-10]: 8

    Why: Using set operations to check for missing columns is more efficient, especially for large datasets. This optimization improves performance and is a good practice for handling such checks.
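
    For context, a short usage sketch of the set-based check feeding a validation error (the sample data, required column names, and message are illustrative, not the library's actual values):

    import pandas as pd

    data = pd.DataFrame({"customer_id": [1], "unit_spend": [9.99]})  # transaction_id intentionally missing
    required_cols = ["customer_id", "transaction_id", "unit_spend"]  # illustrative required columns
    missing_cols = set(required_cols) - set(data.columns)            # the suggested set-difference check
    if missing_cols:
        raise ValueError(f"Missing required columns: {sorted(missing_cols)}")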

    Maintainability
    Use constants for column names to improve maintainability

    Replace the direct use of column names as strings with a variable or constant that
    can be easily maintained and changed. This will make the code more maintainable and
    less error-prone to changes in column names.

    pyretailscience/segmentation.py [259-261]

    -f"SUM({get_option('column.unit_spend')}) as {get_option('column.agg.unit_spend')},",
    -f"COUNT(DISTINCT {get_option('column.transaction_id')}) as {get_option('column.agg.transaction_id')},",
    -f"COUNT(DISTINCT {get_option('column.customer_id')}) as {get_option('column.agg.customer_id')},"
    +f"SUM({UNIT_SPEND_COLUMN}) as {UNIT_SPEND_AGG},",
    +f"COUNT(DISTINCT {TRANSACTION_ID_COLUMN}) as {TRANSACTION_ID_AGG},",
    +f"COUNT(DISTINCT {CUSTOMER_ID_COLUMN}) as {CUSTOMER_ID_AGG},"
     
    Suggestion importance[1-10]: 7

    Why: Using constants for column names improves maintainability and reduces the risk of errors if column names change. However, it is not a critical issue and mainly enhances code readability and maintainability.


    coderabbitai bot left a comment

    Actionable comments posted: 0

    Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL

    Commits

    Files that changed from the base of the PR and between f62b648 and 9ff1e22.

    Files ignored due to path filters (1)
    • poetry.lock is excluded by !**/*.lock
    Files selected for processing (4)
    • pyproject.toml (1 hunks)
    • pyretailscience/options.py (2 hunks)
    • pyretailscience/segmentation.py (3 hunks)
    • tests/test_segmentation.py (3 hunks)
    Additional comments not posted (7)
    pyproject.toml (1)

    22-22: Dependency Addition Approved.

    The addition of duckdb = "^1.0.0" is appropriate for enhancing data processing capabilities.

    pyretailscience/options.py (1)

    56-58: Enhancements Approved.

    The addition of new calculated columns and their descriptions enhances the functionality and is well-integrated.

    Also applies to: 92-94

    pyretailscience/segmentation.py (2)

    209-216: Constructor Update Approved.

    The constructor now accommodates both pd.DataFrame and DuckDBPyRelation, enhancing data source flexibility.


    244-298: Statistical Calculation Enhancements Approved.

    The _calc_seg_stats method effectively uses DuckDB for efficient aggregation, improving performance for large datasets.

    tests/test_segmentation.py (3)

    30-42: Test updates approved.

    The changes to the expected DataFrame in test_correctly_calculates_revenue_transactions_customers_per_segment correctly reflect the updated functionality. The inclusion of the segment_name column and additional calculated metrics aligns with the enhancements in the SegTransactionStats class.


    60-69: Test updates approved.

    The modifications in test_correctly_calculates_revenue_transactions_customers are consistent with the new structure and calculations in the SegTransactionStats class. The updates ensure accurate testing of the revised functionality.


    88-100: Test updates approved.

    The changes in test_handles_dataframe_with_one_segment correctly reflect the handling of single-segment data with the updated SegTransactionStats class. The expected output is appropriately structured.

    Co-authored-by: codiumai-pr-agent-pro[bot] <151058649+codiumai-pr-agent-pro[bot]@users.noreply.github.com>
    mvanwyk merged commit 9df9fe1 into main on Aug 9, 2024 (1 check passed)
    mvanwyk deleted the duckdb_seg branch on August 9, 2024 at 19:25
    coderabbitai bot mentioned this pull request on Mar 25, 2025