refactor with ibis #95


Merged · 5 commits merged into main on Feb 19, 2025
Conversation

Collaborator

@mayurkmmt mayurkmmt commented Feb 13, 2025

User description

feat: refactor get_index function with ibis


PR Type

Enhancement, Tests


Description

  • Refactored get_indexes function to use Ibis for scalable computation.

  • Enhanced get_indexes to support both pandas DataFrame and Ibis Table.

  • Added robust error handling for invalid filters and aggregation functions.

  • Updated and expanded test cases for get_indexes with various scenarios.
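
The index metric the refactor computes can be sketched in plain pandas (toy data and mask invented for illustration; the PR's actual implementation uses Ibis):

```python
import pandas as pd

# index = (proportion within the filtered subset / overall proportion) * 100
df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "value": [10, 30, 20, 40],
})
mask = pd.Series([True, False, True, False])  # stands in for df_index_filter

overall_props = df.groupby("category")["value"].sum()
overall_props = overall_props / overall_props.sum()    # A: 0.4, B: 0.6

subset_props = df[mask].groupby("category")["value"].sum()
subset_props = subset_props / subset_props.sum()       # A: 1/3, B: 2/3

index = subset_props / overall_props * 100             # A ~ 83.3, B ~ 111.1
```

An index above 100 means the category is over-represented in the filtered subset relative to the whole dataset.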


Changes walkthrough 📝

Relevant files

Enhancement: pyretailscience/plots/index.py (+45/-14)
Refactor `get_indexes` to use Ibis and enhance functionality

  • Refactored get_indexes to use Ibis for efficient computation.
  • Added support for both pandas DataFrame and Ibis Table.
  • Improved error handling for invalid filters and unsupported aggregation functions.
  • Updated logic for calculating proportions and indices.

Tests: tests/plots/test_index.py (+67/-85)
Update and expand test cases for `get_indexes`

  • Replaced outdated test cases with new ones for get_indexes.
  • Added tests for invalid filters and unsupported aggregation functions.
  • Included tests for various aggregation functions and offset handling.
  • Simplified and clarified test data and assertions.

Summary by CodeRabbit

• New Features
  • Updated plotting functionality with new parameters for enhanced index calculation.
  • Improved versatility of index retrieval by supporting both DataFrame and Ibis Table inputs.
• Refactor
  • Enhanced data processing for faster and more reliable visualizations, improving performance on larger datasets.
  • Updated aggregation logic now delivers a more comprehensive view of grouped data.
• Tests
  • Expanded test coverage to ensure robust handling of various input conditions and edge cases, including validation for invalid aggregation functions and improved clarity in test cases.

    coderabbitai bot commented Feb 13, 2025

    Walkthrough

    This pull request updates the data processing logic by modifying the get_indexes function in the plotting module. The function now accepts either a pd.DataFrame or an ibis.Table and leverages Ibis for grouping and aggregation, returning a pd.DataFrame instead of a pd.Series. Corresponding adjustments have been made in the plot function. The test suite has been extended and refactored to cover new input formats, error handling for invalid filter configurations, and unsupported aggregation functions.

    Changes

    File(s): Summary of Changes
    pyretailscience/…/index.py: Updated get_indexes to accept `pd.DataFrame | ibis.Table`, perform grouping and aggregation via Ibis, and return a `pd.DataFrame` instead of a `pd.Series`.
    tests/…/test_index.py Renamed and restructured test functions to reflect new column names and structures; added tests for invalid filters, unsupported aggregation functions, and offset functionality.

    Sequence Diagram(s)

    sequenceDiagram
        participant C as Client
        participant P as Plot Function
        participant GI as get_indexes
        participant I as Ibis Engine
    
        C->>P: Call plot(data)
        P->>GI: Send data (DataFrame/ibis.Table) and parameters
        GI->>I: Execute group-by and aggregation via Ibis
        I-->>GI: Return aggregated results
        GI-->>P: Return pd.DataFrame with indexes
        P->>C: Render updated plot
    

    Suggested labels

    documentation, Review effort [1-5]: 3

    Poem

    I hop through data fields with delight,
    Dancing with Ibis in the coding light.
    Old Pandas steps now gently fade,
    New indexes and plots beautifully made.
    A rabbit cheers for code well-played! 🐰


    Contributor


    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Performance Impact

    The conversion from pandas DataFrame to Ibis table using memtable could have performance implications for large datasets. Need to validate performance with production-scale data.

    if isinstance(df, pd.DataFrame):
        df = df.copy()
        df["_filter"] = df_index_filter
        table = ibis.memtable(df)
    else:
        table = df.mutate(_filter=ibis.literal(df_index_filter))
    Error Handling

    The lambda functions used for aggregation could potentially fail silently if the column types are incompatible with the requested operation. Additional type checking may be needed.

    agg_fn = lambda x: getattr(x, agg_func)()
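
One way to address this, sketched in plain pandas rather than Ibis: validate the aggregator name against an explicit whitelist before building the callable, so an unsupported name fails fast with a clear message. The supported set below is an assumption for illustration, not the PR's actual list.

```python
import pandas as pd

SUPPORTED_AGG_FUNCS = {"sum", "mean", "max", "min", "nunique"}  # assumed whitelist

def make_agg_fn(agg_func: str):
    """Build the aggregation callable, rejecting unknown names up front."""
    if agg_func not in SUPPORTED_AGG_FUNCS:
        raise ValueError(f"Unsupported aggregation function: {agg_func!r}")
    return lambda col: getattr(col, agg_func)()

total = make_agg_fn("sum")(pd.Series([1, 2, 2]))  # 5
```

Any name outside the whitelist (even a valid pandas method such as "median") raises ValueError instead of silently dispatching via getattr.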
    Memory Usage

    Multiple intermediate tables are created during the index calculation process which could lead to high memory usage with large datasets. Consider optimizing the number of intermediate operations.

    overall_agg = table.group_by(group_cols).aggregate(value=agg_fn(table[value_col]))
    
    if index_subgroup_col is None:
        overall_total = overall_agg.value.sum().execute()
        overall_props = overall_agg.mutate(proportion=overall_agg.value / overall_total)
    else:
        overall_total = overall_agg.group_by(index_subgroup_col).aggregate(total=lambda t: t.value.sum())
        overall_props = (
            overall_agg.join(overall_total, index_subgroup_col)
            .mutate(proportion=lambda t: t.value / t.total)
            .drop("total")
        )
    
    overall_props = overall_props.mutate(proportion_overall=overall_props.proportion).drop("proportion")
    
    subset_agg = table.filter(table._filter).group_by(group_cols).aggregate(value=agg_fn(table[value_col]))
    
    if index_subgroup_col is None:
        subset_total = subset_agg.value.sum().name("total")
        subset_props = subset_agg.mutate(proportion=subset_agg.value / subset_total)
    else:
        subset_total = subset_agg.group_by(index_subgroup_col).aggregate(total=lambda t: t.value.sum())
        subset_props = (
            subset_agg.join(subset_total, index_subgroup_col)
            .mutate(proportion=lambda t: t.value / t.total)
            .drop("total")
        )
    
    result = subset_props.join(overall_props, group_cols).mutate(
        index=lambda t: (t.proportion / t.proportion_overall * 100) - offset,
    )
    return result.execute()

    Contributor

    qodo-merge-pro bot commented Feb 13, 2025


    PR Code Suggestions ✨

    Explore these optional code suggestions:

    Possible issue
    Validate filter length matches data

    Add input validation to ensure the length of df_index_filter matches the number
    of rows in the input DataFrame/Table to prevent index misalignment issues.

    pyretailscience/plots/index.py [268-273]

     if isinstance(df, pd.DataFrame):
    +    if len(df_index_filter) != len(df):
    +        raise ValueError("Length of df_index_filter must match number of rows in DataFrame")
         df = df.copy()
         df["_filter"] = df_index_filter
         table = ibis.memtable(df)
     else:
    +    if len(df_index_filter) != df.count().execute():
    +        raise ValueError("Length of df_index_filter must match number of rows in Table")
         table = df.mutate(_filter=ibis.literal(df_index_filter))
    Suggestion importance [1-10]: 9

    Why: This is a critical validation that prevents silent errors and data corruption by ensuring the filter array matches the data size. Missing this check could lead to incorrect index calculations or runtime errors.

    High
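
Worth noting (a small pandas-only demonstration, not code from the PR): the DataFrame branch already gets an implicit guard, because pandas raises when a column of the wrong length is assigned. The explicit check therefore matters mainly for the ibis.Table path, where `ibis.literal(...)` performs no row-count comparison.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]})
try:
    df["_filter"] = [True, False]  # 2 values for 3 rows
    raised = False
except ValueError:
    raised = True  # pandas rejects the misaligned assignment
```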
    Validate column existence before processing

    Add validation to ensure value_col and index_col exist in the DataFrame/Table
    before processing to prevent runtime errors.

    pyretailscience/plots/index.py [242-250]

     def get_indexes(
         df: pd.DataFrame | ibis.Table,
         df_index_filter: list[bool],
         index_col: str,
         value_col: str,
         index_subgroup_col: str | None = None,
         agg_func: str = "sum",
         offset: int = 0,
     ) -> pd.DataFrame:
+    columns = df.columns  # both pd.DataFrame and ibis.Table expose .columns
    +    if index_col not in columns or value_col not in columns:
    +        raise ValueError(f"Columns {index_col} and {value_col} must exist in the data")
    +    if index_subgroup_col and index_subgroup_col not in columns:
    +        raise ValueError(f"Subgroup column {index_subgroup_col} not found in data")
    Suggestion importance [1-10]: 8

    Why: This validation is essential for preventing runtime errors and providing clear error messages when required columns are missing. It helps catch configuration errors early in the process.

    Medium


    codecov bot commented Feb 13, 2025

    Codecov Report

    All modified and coverable lines are covered by tests ✅

    Files with missing lines Coverage Δ
    pyretailscience/plots/index.py 81.08% <100.00%> (-4.64%) ⬇️


    @coderabbitai coderabbitai bot left a comment


    Actionable comments posted: 0

    🧹 Nitpick comments (5)
    pyretailscience/plots/index.py (1)

    275-277: Good aggregator validation.

    The aggregator check avoids invalid strings. As a minor extension, consider implementing a dictionary-based mapping instead of relying on getattr for each aggregator to improve explicitness and reduce potential runtime edge cases.

    - agg_fn = lambda x: getattr(x, agg_func)()
    + aggregations = {
    +     "sum": lambda x: x.sum(),
    +     "mean": lambda x: x.mean(),
    +     "max": lambda x: x.max(),
    +     "min": lambda x: x.min(),
    +     "nunique": lambda x: x.nunique(),
    + }
    + agg_fn = aggregations[agg_func]
    tests/plots/test_index.py (4)

    11-11: Clarify purpose of OFFSET_THRESHOLD.

    In test code, OFFSET_THRESHOLD = -5 is used to check that values exceed -5. Consider adding a brief comment or docstring to explain how or why -5 is chosen for your offset validation.


    14-28: Basic test scenario covers essential columns and index computation.

    The test confirms output is a non-empty DataFrame with expected columns. Consider adding an assertion verifying correctness of the index values to ensure the logic (including offset) is accurate.


    77-93: Comprehensive aggregator coverage.

    Testing multiple aggregation functions strengthens robustness. Consider verifying the numeric correctness for each aggregator’s output as well (e.g., verifying sums or means are correct).


    95-110: Offset test ensures final index is above the threshold.

    This ensures offset logic is partially tested. You might expand coverage by explicitly comparing a known input and checking the final exact index result.

    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 111401b and 17f178b.

    📒 Files selected for processing (2)
    • pyretailscience/plots/index.py (3 hunks)
    • tests/plots/test_index.py (1 hunks)
    🔇 Additional comments (8)
    pyretailscience/plots/index.py (5)

    41-41: Consider checking Ibis availability or version constraints.

    While importing Ibis is crucial for this solution, ensure that your setup and environment have a compatible Ibis version. If your library users cannot install Ibis (e.g., due to environment constraints), you might provide a fallback or clear error messages explaining how to enable Ibis features.

    Would you like me to generate a shell script that checks for Ibis installation and prints the installed version?


    243-250: Nice extension to support both DataFrame and ibis.Table.

    Accepting either a Pandas DataFrame or an Ibis table improves flexibility. Confirm all downstream code can handle both input types, especially if any type-specific assumptions remain (e.g., .copy() usage, indexing, etc.).


    268-274: Validate _filter alignment between DataFrame and filter list lengths.

    When creating _filter with df_index_filter, ensure the list matches the length of the DataFrame. Otherwise, you may encounter misalignment or exceptions. Consider adding a proactive check to raise an error if lengths differ.


    281-282: Group columns ordering may be confusing.

    When a subgroup is provided, the group order is [index_subgroup_col, index_col]. Ensure this reversed ordering is intentional and that the subsequent logic (sorting, final indexing) aligns well with the reversed grouping.


    311-313: Potential divide-by-zero scenario.

    If totals in either the overall props or subset props are zero, (t.proportion / t.proportion_overall) might raise or produce NaNs or infinities. Consider adding a quick check to avoid unexpected results.
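
A minimal sketch of such a guard, using pandas on the executed result (column names mirror the PR; the values are made up): replace a zero overall proportion with NaN so the ratio yields NaN rather than infinity or an exception.

```python
import numpy as np
import pandas as pd

result = pd.DataFrame({
    "proportion": [0.2, 0.5],
    "proportion_overall": [0.4, 0.0],  # second row would divide by zero
})
offset = 0

safe_denom = result["proportion_overall"].replace(0, np.nan)
result["index"] = result["proportion"] / safe_denom * 100 - offset
# first row: 50.0; second row: NaN instead of inf
```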

    tests/plots/test_index.py (3)

    31-47: Subgroup test properly verifies multi-level grouping.

    This test confirms that subgroup columns are included in the result. If feasible, add a check for correctness of the numeric results beyond just non-emptiness.


    49-60: Valid test for all-True filter scenario.

    Ensuring a ValueError is raised if df_index_filter is all True or all False is a critical correctness check.


    63-74: Validates invalid aggregator gracefully.

    This test effectively ensures the function raises a ValueError for unsupported aggregators. Coverage is good; no further concerns.


    @coderabbitai coderabbitai bot left a comment


    Actionable comments posted: 0

    🧹 Nitpick comments (3)
    pyretailscience/plots/index.py (2)

    54-74: Consider refactoring to reduce complexity.

    The function is marked with noqa to ignore complexity warnings. Consider breaking it down into smaller, more focused functions to improve maintainability.

    Here's a suggested approach:

    1. Extract the data preparation logic (lines 141-156) into a separate function.
    2. Extract the series handling logic (lines 157-187) into a separate function.
    3. Extract the plotting logic (lines 189-240) into a separate function.

    This would make the code more modular and easier to test.


    280-280: Consider using a dictionary for aggregation functions.

    The lambda function could be replaced with a dictionary mapping for better readability and maintainability.

    -    agg_fn = lambda x: getattr(x, agg_func)()
    +    AGG_FUNCTIONS = {
    +        "sum": lambda x: x.sum(),
    +        "mean": lambda x: x.mean(),
    +        "max": lambda x: x.max(),
    +        "min": lambda x: x.min(),
    +        "nunique": lambda x: x.nunique(),
    +    }
    +    agg_fn = AGG_FUNCTIONS[agg_func]
    tests/plots/test_index.py (1)

    14-28: Consider adding more specific assertions.

    While the test verifies the basic structure of the result, it could benefit from more specific assertions about the calculated index values.

         assert isinstance(result, pd.DataFrame)
         assert "category" in result.columns
         assert "index" in result.columns
         assert not result.empty
    +    # Add specific assertions
    +    assert len(result) == 3  # Expected number of categories
    +    assert all(result["index"] >= 0)  # Index values should be non-negative
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 17f178b and 2b173c9.

    📒 Files selected for processing (2)
    • pyretailscience/plots/index.py (5 hunks)
    • tests/plots/test_index.py (2 hunks)
    🔇 Additional comments (6)
    pyretailscience/plots/index.py (4)

    141-150: LGTM! Parameter changes align with requirements.

    The changes to use index_col and value_to_index instead of df_index_filter align with the request to split the index concept into parameters for better scalability.


    244-253: LGTM! Type hints and parameters are well-defined.

    The function now accepts both pd.DataFrame and ibis.Table, making it more versatile. The parameter names are clear and descriptive.


    269-275: LGTM! Ibis table conversion is handled correctly.

    The code properly handles both pandas DataFrame and Ibis Table inputs, with appropriate conversion to Ibis for consistent processing.


    284-316: LGTM! Ibis operations are well-structured.

    The Ibis operations for grouping, aggregation, and joining are correctly implemented. The code handles both single-group and subgroup cases appropriately.

    tests/plots/test_index.py (2)

    74-95: LGTM! Comprehensive test coverage for aggregation functions.

    The test case thoroughly verifies all supported aggregation functions.


    54-71: LGTM! Error handling is well-tested.

    The test case properly verifies that an invalid aggregation function raises a ValueError with the correct error message.

        df["_filter"] = value_to_index
        table = ibis.memtable(df)
    else:
        table = df.mutate(_filter=ibis.literal(value_to_index))
    Contributor


    Where is the _filter column used? I don't see it referenced below and if df is a 3+ billion row table that could add a lot of data to the query.

    .drop("total")
    )

    overall_props = overall_props.mutate(proportion_overall=overall_props.proportion).drop("proportion")
    Contributor


    Is this just renaming proportion to proportion_overall?

    Collaborator Author


    Yes it is

    Contributor


    Can we just name it proportion_overall in the first place then and remove this line.
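
The reviewer's point, shown as a pandas analogue (the PR itself uses Ibis `mutate`/`drop`; data below is invented): create the column under its final name in one step instead of creating "proportion" and then renaming it.

```python
import pandas as pd

overall_agg = pd.DataFrame({"value": [40, 60]})
total = overall_agg["value"].sum()

# one step, no rename/drop round-trip afterwards:
overall_props = overall_agg.assign(
    proportion_overall=overall_agg["value"] / total,
)
```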

    assert "category" in result.columns
    assert "index" in result.columns
    assert not result.empty
    assert all(result["index"] >= OFFSET_THRESHOLD)
    Contributor


    Since OFFSET_THRESHOLD is only used in this function, can you move its instantiation here please.

    index_col="category",
    value_col="value",
    group_col="category",
    offset=5,
    Contributor


    Since this is tied to the value of OFFSET_THRESHOLD, I think we should change this line.

    Suggested change:
    -    offset=5,
    +    offset=-OFFSET_THRESHOLD,

    Comment on lines 131 to 141
    expected_output = pd.DataFrame(
    {
    "group_col": ["A"],
    "value": [3],
    "proportion": [1.0],
    "value_right": [3],
    "proportion_overall": [0.142857],
    "index": [700.0],
    },
    )

    Contributor


    This isn't right. See the previous tests. There should be a value created for each of the values of group_col, i.e. A, B, and C. Also, the other columns should be removed; only group_col and index should be there.

    https://github.com/Data-Simply/pyretailscience/blob/main/tests/plots/test_index.py#L22

    Comment on lines 162 to 172
    expected_output = pd.DataFrame(
    {
    "group_col1": ["A", "B", "C"],
    "index": [140, 140, 46.6666667],
    "group_col1": ["A", "A"],
    "group_col2": ["D", "E"],
    "value": [3, 15],
    "proportion": [0.166667, 0.833333],
    "value_right": [3, 15],
    "proportion_overall": [0.166667, 0.833333],
    "index": [100.0, 100.0],
    },
    )
    Contributor


    This also isn't right. See the previous tests. There should be a value created for each of the values of group_col1 and group_col2. Also, the other columns should be removed; only group_col1, group_col2, and index should be there.

    https://github.com/Data-Simply/pyretailscience/blob/main/tests/plots/test_index.py#L42


    @coderabbitai coderabbitai bot left a comment


    Actionable comments posted: 0

    ♻️ Duplicate comments (1)
    tests/plots/test_index.py (1)

    14-28: ⚠️ Potential issue

    Add tests to verify index value calculations.

    The current test only verifies the structure of the output DataFrame but doesn't validate that the index values are calculated correctly. This was present in the previous tests and should be added back.

    Add assertions to verify the actual index values:

         result = get_indexes(df, value_to_index="A", index_col="category", value_col="value", group_col="category")
         assert isinstance(result, pd.DataFrame)
         assert "category" in result.columns
         assert "index" in result.columns
         assert not result.empty
    +    # Verify index values
    +    expected_values = {
    +        "A": 100.0,  # Base category
    +        "B": 200.0,  # (30+40)/(10+20) * 100
    +        "C": 300.0,  # (50+60)/(10+20) * 100
    +    }
    +    for category, expected_index in expected_values.items():
    +        actual_index = result[result["category"] == category]["index"].iloc[0]
    +        assert np.isclose(actual_index, expected_index, rtol=1e-5)
    🧹 Nitpick comments (2)
    pyretailscience/plots/index.py (1)

    279-279: Consider these improvements for better code clarity and safety.

    1. Simplify the lambda function for aggregation:
    -    agg_fn = lambda x: getattr(x, agg_func)()
    +    agg_fn = getattr(ibis.expr.types.generic.Value, agg_func)
    1. Move the division by zero check earlier to prevent unnecessary computation:
         subset_total = subset_agg.group_by(index_subgroup_col).aggregate(total=lambda t: t.value.sum())
    +    subset_total = subset_total.filter(lambda t: t.total != 0)
         subset_props = (
             subset_agg.join(subset_total, index_subgroup_col)
    -        .filter(lambda t: t.total != 0)
             .mutate(proportion=lambda t: t.value / t.total)
             .drop("total")
         )

    Also applies to: 307-307

    tests/plots/test_index.py (1)

    76-81: Use more realistic test data.

    The current test data uses a simple arithmetic progression. Consider using more realistic data that better represents actual retail scenarios:

         df = pd.DataFrame(
             {
    -            "category": ["A", "A", "B", "B", "C", "C"],
    -            "value": [10, 20, 30, 40, 50, 60],
    +            "category": ["Electronics", "Electronics", "Clothing", "Clothing", "Food", "Food"],
    +            "value": [1200, 1500, 800, 950, 250, 300],
             },
         )
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 8d6e353 and 26459e9.

    📒 Files selected for processing (2)
    • pyretailscience/plots/index.py (5 hunks)
    • tests/plots/test_index.py (2 hunks)
    🔇 Additional comments (2)
    pyretailscience/plots/index.py (2)

    54-241: LGTM! The plot function has been successfully refactored.

    The changes align with the PR objectives, removing the df_index_filter parameter and introducing index_col and value_to_index parameters for better scalability. The implementation maintains all the existing functionality while improving the interface.


    244-321: LGTM! Successfully refactored to use Ibis for scalable computation.

    The implementation effectively uses Ibis for efficient computation at scale, with proper handling of both pandas DataFrame and Ibis Table inputs. The error handling for unsupported aggregation functions is comprehensive.


    @coderabbitai coderabbitai bot left a comment


    Actionable comments posted: 0

    🧹 Nitpick comments (1)
    pyretailscience/plots/index.py (1)

    244-320: Optimize zero total filtering.

    Consider moving the zero total filter to the overall aggregation step to prevent unnecessary computations on rows that will be filtered out later.

    Apply this diff to optimize the filtering:

         if index_subgroup_col is None:
             overall_total = overall_agg.value.sum().execute()
             overall_props = overall_agg.mutate(proportion_overall=overall_agg.value / overall_total)
         else:
             overall_total = overall_agg.group_by(index_subgroup_col).aggregate(total=lambda t: t.value.sum())
    +        overall_total = overall_total.filter(lambda t: t.total != 0)
             overall_props = (
                 overall_agg.join(overall_total, index_subgroup_col)
                 .mutate(proportion_overall=lambda t: t.value / t.total)
                 .drop("total")
             )
    
         table = table.filter(table[index_col] == value_to_index)
         subset_agg = table.group_by(group_cols).aggregate(value=agg_fn(table[value_col]))
    
         if index_subgroup_col is None:
             subset_total = subset_agg.value.sum().name("total")
             subset_props = subset_agg.mutate(proportion=subset_agg.value / subset_total)
         else:
             subset_total = subset_agg.group_by(index_subgroup_col).aggregate(total=lambda t: t.value.sum())
             subset_props = (
                 subset_agg.join(subset_total, index_subgroup_col)
    -            .filter(lambda t: t.total != 0)
                 .mutate(proportion=lambda t: t.value / t.total)
                 .drop("total")
             )
    📜 Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL
    Plan: Pro

    📥 Commits

    Reviewing files that changed from the base of the PR and between 26459e9 and 4505b7a.

    📒 Files selected for processing (2)
    • pyretailscience/plots/index.py (5 hunks)
    • tests/plots/test_index.py (3 hunks)
    🔇 Additional comments (10)
    tests/plots/test_index.py (9)

    15-29: Add assertions for index value calculations.

    The test verifies the DataFrame structure but doesn't validate that the index values are calculated correctly. Based on past review feedback, please add assertions to verify the actual index values.


    31-53: Add assertions for index value calculations with subgroups.

    The test verifies the DataFrame structure but doesn't validate that the index values are calculated correctly for each subgroup. Please add assertions to verify the actual index values.


    55-73: LGTM!

    The test effectively validates error handling for invalid aggregation functions.


    75-97: Add assertions for index values with different aggregations.

    The test verifies the DataFrame structure but doesn't validate that the index values are calculated correctly for each aggregation function. Please add assertions to verify the actual index values.


    99-121: LGTM!

    The test effectively validates that index values respect the offset threshold.


    123-141: LGTM!

    The test effectively validates index calculations by comparing with expected output values.


    143-170: LGTM!

    The test effectively validates index calculations for two columns by comparing with expected output values.


    172-187: Add assertions for index values with Ibis table input.

    The test verifies the DataFrame structure but doesn't validate that the index values are calculated correctly when using an Ibis table input. Please add assertions to verify the actual index values.


    189-328: LGTM!

    The plot tests effectively validate:

    • Plot generation with default parameters
    • Custom title handling
    • Highlight range functionality
    • Group filtering
    • Error handling for invalid parameters
    • Source text and custom label handling
    🧰 Tools
    🪛 Ruff (0.8.2)

    192-192: Use @pytest.fixture over @pytest.fixture()

    Remove parentheses

    (PT001)

    pyretailscience/plots/index.py (1)

    54-241: LGTM!

    The plot function has been effectively updated to:

    • Use new parameters index_col and value_to_index
    • Maintain comprehensive documentation
    • Handle all plotting scenarios correctly

    Contributor

    @mvanwyk mvanwyk left a comment


    LGTM!

    @mayurkmmt mayurkmmt merged commit 61591f9 into main Feb 19, 2025
    3 checks passed