-
Notifications
You must be signed in to change notification settings - Fork 1
refactor with ibis #95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis pull request updates the data processing logic by modifying the Changes
Sequence Diagram(s)sequenceDiagram
participant C as Client
participant P as Plot Function
participant GI as get_indexes
participant I as Ibis Engine
C->>P: Call plot(data)
P->>GI: Send data (DataFrame/ibis.Table) and parameters
GI->>I: Execute group-by and aggregation via Ibis
I-->>GI: Return aggregated results
GI-->>P: Return pd.DataFrame with indexes
P->>C: Render updated plot
Possibly related PRs
Suggested labels
Poem
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Qodo Merge was enabled for this repository. To continue using it, please link your Git account with your Qodo account here. PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
Qodo Merge was enabled for this repository. To continue using it, please link your Git account with your Qodo account here. PR Code Suggestions ✨Explore these optional code suggestions:
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
pyretailscience/plots/index.py (1)
275-277
: Good aggregator validation.The aggregator check avoids invalid strings. As a minor extension, consider implementing a dictionary-based mapping instead of relying on
getattr
for each aggregator to improve explicitness and reduce potential runtime edge cases.- agg_fn = lambda x: getattr(x, agg_func)() + aggregations = { + "sum": lambda x: x.sum(), + "mean": lambda x: x.mean(), + "max": lambda x: x.max(), + "min": lambda x: x.min(), + "nunique": lambda x: x.nunique(), + } + agg_fn = aggregations[agg_func]tests/plots/test_index.py (4)
11-11
: Clarify purpose of OFFSET_THRESHOLD.In test code,
OFFSET_THRESHOLD = -5
is used to check that values exceed -5. Consider adding a brief comment or docstring to explain how or why-5
is chosen for your offset validation.
14-28
: Basic test scenario covers essential columns and index computation.The test confirms output is a non-empty DataFrame with expected columns. Consider adding an assertion verifying correctness of the index values to ensure the logic (including offset) is accurate.
77-93
: Comprehensive aggregator coverage.Testing multiple aggregation functions strengthens robustness. Consider verifying the numeric correctness for each aggregator’s output as well (e.g., verifying sums or means are correct).
95-110
: Offset test ensures final index is above the threshold.This ensures offset logic is partially tested. You might expand coverage by explicitly comparing a known input and checking the final exact index result.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pyretailscience/plots/index.py
(3 hunks)tests/plots/test_index.py
(1 hunks)
🔇 Additional comments (8)
pyretailscience/plots/index.py (5)
41-41
: Consider checking Ibis availability or version constraints.While importing Ibis is crucial for this solution, ensure that your setup and environment have a compatible Ibis version. If your library users cannot install Ibis (e.g., due to environment constraints), you might provide a fallback or clear error messages explaining how to enable Ibis features.
Would you like me to generate a shell script that checks for Ibis installation and prints the installed version?
243-250
: Nice extension to support both DataFrame and ibis.Table.Accepting either a Pandas DataFrame or an Ibis table improves flexibility. Confirm all downstream code can handle both input types, especially if any type-specific assumptions remain (e.g.,
.copy()
usage, indexing, etc.).
268-274
: Validate_filter
alignment between DataFrame and filter list lengths.When creating
_filter
withdf_index_filter
, ensure the list matches the length of the DataFrame. Otherwise, you may encounter misalignment or exceptions. Consider adding a proactive check to raise an error if lengths differ.
281-282
: Group columns ordering may be confusing.When a subgroup is provided, the group order is
[index_subgroup_col, index_col]
. Ensure this reversed ordering is intentional and that the subsequent logic (sorting, final indexing) aligns well with the reversed grouping.
311-313
: Potential divide-by-zero scenario.If totals in either the overall props or subset props are zero,
(t.proportion / t.proportion_overall)
might raise or produce NaNs or infinities. Consider adding a quick check to avoid unexpected results.tests/plots/test_index.py (3)
31-47
: Subgroup test properly verifies multi-level grouping.This test confirms that subgroup columns are included in the result. If feasible, add a check for correctness of the numeric results beyond just non-emptiness.
49-60
: Valid test for all-True filter scenario.Ensuring a ValueError is raised if
df_index_filter
is all True or all False is a critical correctness check.
63-74
: Validates invalid aggregator gracefully.This test effectively ensures the function raises a ValueError for unsupported aggregators. Coverage is good; no further concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
pyretailscience/plots/index.py (2)
54-74
: Consider refactoring to reduce complexity.The function is marked with
noqa
to ignore complexity warnings. Consider breaking it down into smaller, more focused functions to improve maintainability.Here's a suggested approach:
- Extract the data preparation logic (lines 141-156) into a separate function.
- Extract the series handling logic (lines 157-187) into a separate function.
- Extract the plotting logic (lines 189-240) into a separate function.
This would make the code more modular and easier to test.
280-280
: Consider using a dictionary for aggregation functions.The lambda function could be replaced with a dictionary mapping for better readability and maintainability.
- agg_fn = lambda x: getattr(x, agg_func)() + AGG_FUNCTIONS = { + "sum": lambda x: x.sum(), + "mean": lambda x: x.mean(), + "max": lambda x: x.max(), + "min": lambda x: x.min(), + "nunique": lambda x: x.nunique(), + } + agg_fn = AGG_FUNCTIONS[agg_func]tests/plots/test_index.py (1)
14-28
: Consider adding more specific assertions.While the test verifies the basic structure of the result, it could benefit from more specific assertions about the calculated index values.
assert isinstance(result, pd.DataFrame) assert "category" in result.columns assert "index" in result.columns assert not result.empty + # Add specific assertions + assert len(result) == 3 # Expected number of categories + assert all(result["index"] >= 0) # Index values should be non-negative
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pyretailscience/plots/index.py
(5 hunks)tests/plots/test_index.py
(2 hunks)
🔇 Additional comments (6)
pyretailscience/plots/index.py (4)
141-150
: LGTM! Parameter changes align with requirements.The changes to use
index_col
andvalue_to_index
instead ofdf_index_filter
align with the request to split the index concept into parameters for better scalability.
244-253
: LGTM! Type hints and parameters are well-defined.The function now accepts both
pd.DataFrame
andibis.Table
, making it more versatile. The parameter names are clear and descriptive.
269-275
: LGTM! Ibis table conversion is handled correctly.The code properly handles both pandas DataFrame and Ibis Table inputs, with appropriate conversion to Ibis for consistent processing.
284-316
: LGTM! Ibis operations are well-structured.The Ibis operations for grouping, aggregation, and joining are correctly implemented. The code handles both single-group and subgroup cases appropriately.
tests/plots/test_index.py (2)
74-95
: LGTM! Comprehensive test coverage for aggregation functions.The test case thoroughly verifies all supported aggregation functions.
54-71
: LGTM! Error handling is well-tested.The test case properly verifies that an invalid aggregation function raises a ValueError with the correct error message.
pyretailscience/plots/index.py
Outdated
df["_filter"] = value_to_index | ||
table = ibis.memtable(df) | ||
else: | ||
table = df.mutate(_filter=ibis.literal(value_to_index)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the _filter
column used? I don't see it referenced below and if df is a 3+ billion row table that could add a lot of data to the query.
pyretailscience/plots/index.py
Outdated
.drop("total") | ||
) | ||
|
||
overall_props = overall_props.mutate(proportion_overall=overall_props.proportion).drop("proportion") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just renaming proportion
to proportion_overall
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just name it proportion_overall in the first place then and remove this line.
pyretailscience/plots/index.py
Outdated
.drop("total") | ||
) | ||
|
||
overall_props = overall_props.mutate(proportion_overall=overall_props.proportion).drop("proportion") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just name it proportion_overall in the first place then and remove this line.
tests/plots/test_index.py
Outdated
assert "category" in result.columns | ||
assert "index" in result.columns | ||
assert not result.empty | ||
assert all(result["index"] >= OFFSET_THRESHOLD) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since OFFSET_THRESHOLD is only used in this function, can you move its instantiation here please.
tests/plots/test_index.py
Outdated
index_col="category", | ||
value_col="value", | ||
group_col="category", | ||
offset=5, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SInce this is tied to the value of OFFSET_THRESHOLD I think we should change this line.
offset=5, | |
offset=-OFFSET_THRESHOLD, |
tests/plots/test_index.py
Outdated
expected_output = pd.DataFrame( | ||
{ | ||
"group_col": ["A"], | ||
"value": [3], | ||
"proportion": [1.0], | ||
"value_right": [3], | ||
"proportion_overall": [0.142857], | ||
"index": [700.0], | ||
}, | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't right. See the previous tests. There should be a value create for each of the values of group_col. ie A,B
and C. Also, the other columns should be removed. Only group_col and index should be there.
tests/plots/test_index.py
Outdated
expected_output = pd.DataFrame( | ||
{ | ||
"group_col1": ["A", "B", "C"], | ||
"index": [140, 140, 46.6666667], | ||
"group_col1": ["A", "A"], | ||
"group_col2": ["D", "E"], | ||
"value": [3, 15], | ||
"proportion": [0.166667, 0.833333], | ||
"value_right": [3, 15], | ||
"proportion_overall": [0.166667, 0.833333], | ||
"index": [100.0, 100.0], | ||
}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't also right. See the previous tests. There should be a value created for each of the values of group_col2 and group_col2. Also, the other columns should be removed. Only group_col1, group_col2 and index should be there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/plots/test_index.py (1)
14-28
:⚠️ Potential issueAdd tests to verify index value calculations.
The current test only verifies the structure of the output DataFrame but doesn't validate that the index values are calculated correctly. This was present in the previous tests and should be added back.
Add assertions to verify the actual index values:
result = get_indexes(df, value_to_index="A", index_col="category", value_col="value", group_col="category") assert isinstance(result, pd.DataFrame) assert "category" in result.columns assert "index" in result.columns assert not result.empty + # Verify index values + expected_values = { + "A": 100.0, # Base category + "B": 200.0, # (30+40)/(10+20) * 100 + "C": 300.0, # (50+60)/(10+20) * 100 + } + for category, expected_index in expected_values.items(): + actual_index = result[result["category"] == category]["index"].iloc[0] + assert np.isclose(actual_index, expected_index, rtol=1e-5)
🧹 Nitpick comments (2)
pyretailscience/plots/index.py (1)
279-279
: Consider these improvements for better code clarity and safety.
- Simplify the lambda function for aggregation:
- agg_fn = lambda x: getattr(x, agg_func)() + agg_fn = getattr(ibis.expr.types.generic.Value, agg_func)
- Move the division by zero check earlier to prevent unnecessary computation:
subset_total = subset_agg.group_by(index_subgroup_col).aggregate(total=lambda t: t.value.sum()) + subset_total = subset_total.filter(lambda t: t.total != 0) subset_props = ( subset_agg.join(subset_total, index_subgroup_col) - .filter(lambda t: t.total != 0) .mutate(proportion=lambda t: t.value / t.total) .drop("total") )Also applies to: 307-307
tests/plots/test_index.py (1)
76-81
: Use more realistic test data.The current test data uses a simple arithmetic progression. Consider using more realistic data that better represents actual retail scenarios:
df = pd.DataFrame( { - "category": ["A", "A", "B", "B", "C", "C"], - "value": [10, 20, 30, 40, 50, 60], + "category": ["Electronics", "Electronics", "Clothing", "Clothing", "Food", "Food"], + "value": [1200, 1500, 800, 950, 250, 300], }, )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pyretailscience/plots/index.py
(5 hunks)tests/plots/test_index.py
(2 hunks)
🔇 Additional comments (2)
pyretailscience/plots/index.py (2)
54-241
: LGTM! The plot function has been successfully refactored.The changes align with the PR objectives, removing the
df_index_filter
parameter and introducingindex_col
andvalue_to_index
parameters for better scalability. The implementation maintains all the existing functionality while improving the interface.
244-321
: LGTM! Successfully refactored to use Ibis for scalable computation.The implementation effectively uses Ibis for efficient computation at scale, with proper handling of both pandas DataFrame and Ibis Table inputs. The error handling for unsupported aggregation functions is comprehensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
pyretailscience/plots/index.py (1)
244-320
: Optimize zero total filtering.Consider moving the zero total filter to the overall aggregation step to prevent unnecessary computations on rows that will be filtered out later.
Apply this diff to optimize the filtering:
if index_subgroup_col is None: overall_total = overall_agg.value.sum().execute() overall_props = overall_agg.mutate(proportion_overall=overall_agg.value / overall_total) else: overall_total = overall_agg.group_by(index_subgroup_col).aggregate(total=lambda t: t.value.sum()) + overall_total = overall_total.filter(lambda t: t.total != 0) overall_props = ( overall_agg.join(overall_total, index_subgroup_col) .mutate(proportion_overall=lambda t: t.value / t.total) .drop("total") ) table = table.filter(table[index_col] == value_to_index) subset_agg = table.group_by(group_cols).aggregate(value=agg_fn(table[value_col])) if index_subgroup_col is None: subset_total = subset_agg.value.sum().name("total") subset_props = subset_agg.mutate(proportion=subset_agg.value / subset_total) else: subset_total = subset_agg.group_by(index_subgroup_col).aggregate(total=lambda t: t.value.sum()) subset_props = ( subset_agg.join(subset_total, index_subgroup_col) - .filter(lambda t: t.total != 0) .mutate(proportion=lambda t: t.value / t.total) .drop("total") )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pyretailscience/plots/index.py
(5 hunks)tests/plots/test_index.py
(3 hunks)
🔇 Additional comments (10)
tests/plots/test_index.py (9)
15-29
: Add assertions for index value calculations.The test verifies the DataFrame structure but doesn't validate that the index values are calculated correctly. Based on past review feedback, please add assertions to verify the actual index values.
31-53
: Add assertions for index value calculations with subgroups.The test verifies the DataFrame structure but doesn't validate that the index values are calculated correctly for each subgroup. Please add assertions to verify the actual index values.
55-73
: LGTM!The test effectively validates error handling for invalid aggregation functions.
75-97
: Add assertions for index values with different aggregations.The test verifies the DataFrame structure but doesn't validate that the index values are calculated correctly for each aggregation function. Please add assertions to verify the actual index values.
99-121
: LGTM!The test effectively validates that index values respect the offset threshold.
123-141
: LGTM!The test effectively validates index calculations by comparing with expected output values.
143-170
: LGTM!The test effectively validates index calculations for two columns by comparing with expected output values.
172-187
: Add assertions for index values with Ibis table input.The test verifies the DataFrame structure but doesn't validate that the index values are calculated correctly when using an Ibis table input. Please add assertions to verify the actual index values.
189-328
: LGTM!The plot tests effectively validate:
- Plot generation with default parameters
- Custom title handling
- Highlight range functionality
- Group filtering
- Error handling for invalid parameters
- Source text and custom label handling
🧰 Tools
🪛 Ruff (0.8.2)
192-192: Use
@pytest.fixture
over@pytest.fixture()
Remove parentheses
(PT001)
pyretailscience/plots/index.py (1)
54-241
: LGTM!The plot function has been effectively updated to:
- Use new parameters
index_col
andvalue_to_index
- Maintain comprehensive documentation
- Handle all plotting scenarios correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
User description
feat: refactor get_index function with ibis
PR Type
Enhancement, Tests
Description
Refactored
get_indexes
function to use Ibis for scalable computation.Enhanced
get_indexes
to support both pandas DataFrame and Ibis Table.Added robust error handling for invalid filters and aggregation functions.
Updated and expanded test cases for
get_indexes
with various scenarios.Changes walkthrough 📝
index.py
Refactor `get_indexes` to use Ibis and enhance functionality
pyretailscience/plots/index.py
get_indexes
to use Ibis for efficient computation.aggregation functions.
test_index.py
Update and expand test cases for `get_indexes`
tests/plots/test_index.py
get_indexes
.Summary by CodeRabbit
New Features
Refactor
Tests