Product Association #131
Conversation
Actionable comments posted: 3
🧹 Nitpick comments (3)
pyretailscience/analysis/product_association.py (3)
44-44: Constant for supported combinations.

Introducing `SUPPORTED_COMBINATIONS` as a constant clarifies the current functionality. If future support is planned for more combinations, consider structuring the code so it is easy to extend.
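A minimal sketch of the pattern this comment describes, module-level constant plus a validation helper. The function name and error message here are illustrative, not the PR's actual code:

```python
SUPPORTED_COMBINATIONS = 2  # the only pair size the refactor currently supports


def validate_number_of_combinations(number_of_combinations: int) -> int:
    """Reject any combination size other than the supported one."""
    if number_of_combinations != SUPPORTED_COMBINATIONS:
        raise ValueError("Only number_of_combinations=2 is currently supported.")
    return number_of_combinations
```

Extending support later then only requires changing the constant (or turning it into a set of allowed sizes) rather than hunting down hard-coded literals.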
213-214: Validation restricting combinations to 2 only.

Raising an error for unsupported combination values enforces the new constraint. This also aligns with the updated tests. Ensure the docstrings are consistent with only supporting pairs:

```diff
- ValueError: If the number of combinations is not 2 or 3,
+ ValueError: If the number of combinations is not 2,
```
380-382: Filtering by target item.

Filtering the final result to only rows where `value_col_1 == target_item` is a clean way to handle optional targeting. For greater flexibility, consider allowing a list of target items in the future.
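To make the suggested extension concrete, here is a pandas sketch of accepting a list of target items instead of a single one (ibis columns expose an analogous `isin`). The frame and values below are made up for illustration:

```python
import pandas as pd

# hypothetical final result frame with the columns discussed in this review
result = pd.DataFrame(
    {
        "item_1": ["bread", "milk", "eggs"],
        "item_2": ["butter", "cereal", "bacon"],
        "uplift": [2.1, 1.4, 3.0],
    }
)

target_items = ["bread", "eggs"]
# isin() generalizes the single-item equality filter to a list of targets
filtered = result[result["item_1"].isin(target_items)]
```

A single target then becomes the one-element-list case, so no separate code path is needed.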
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- pyretailscience/analysis/product_association.py (6 hunks)
- tests/analysis/test_product_association.py (1 hunks)
🔇 Additional comments (5)
tests/analysis/test_product_association.py (2)
228-228: Updated error message aligns with new logic.

The updated error message is consistent with restricting `number_of_combinations` to only 2. This test correctly verifies that invalid values trigger the new error.

235-235: Duplicate check for invalid combination values.

Having a separate check for other invalid values is good for completeness. The updated message matches the refactor limiting supported combinations to 2.
pyretailscience/analysis/product_association.py (3)
39-39: New ibis import.

Importing `ibis` is necessary for the memtable conversion and subsequent ibis-based operations. Make sure all dependencies are properly declared within your environment or setup files.
309-316: Confidence and uplift calculations.

This step is logically correct (confidence = cooccurrences / occurrences_1, uplift = support / (prob_1 * prob_2)) once the data is joined properly. However, ensure the preceding rename operations produce the correct column references.
92-96: ❓ Verification inconclusive

Parameter type change to accept ibis tables.

Allowing `df: pd.DataFrame | ibis.Table` gives more flexibility. Ensure that all downstream logic can handle both data structures, especially any operations that might differ between pandas and ibis.
🏁 Script executed:

```shell
#!/bin/bash
# Description: Verify that no downstream call expects only a pandas DataFrame.
# We'll look for methods that might call df.* methods not supported by ibis.
rg -A 5 "df\."
```

Length of output: 58417
Attention: Verify Downstream Handling of Mixed DataFrame Types

The parameter change to allow both `pd.DataFrame` and `ibis.Table` has been partially verified via our code search. In the affected file (`pyretailscience/analysis/product_association.py`), methods such as accessing `df.columns` and calling `df.group_by(…)` appear to be used consistently. These methods are available on ibis tables, and similar operations are used elsewhere in the codebase. However, note that some downstream calls (e.g. pandas-specific methods like `groupby` or `to_dict` in other modules) may behave differently if a native pandas DataFrame is passed.

In `product_association.py`, the following is observed:

- Column checks: `missing_cols = set(required_cols) - set(df.columns)` is valid on both `pd.DataFrame` and `ibis.Table` (which exposes a `columns` property).
- Aggregation operations: the use of `df.group_by(group_col).aggregate(…)` is consistent with ibis semantics.
- Across the repository, both pandas-specific and ibis-specific methods are invoked.

It is recommended to:

- Confirm at runtime that when a pandas DataFrame is provided, any pandas-only operations are either skipped or preceded by an appropriate conversion step.
- Ensure that unit tests cover the scenarios for both data types so that any subtle differences are caught early.

Please verify that all downstream logic correctly handles both data structure types during integration testing.
PR Overview
This PR refactors the product association functionality to use ibis for flexible data input and streamlines the association calculation process while updating error messaging. Key changes include:
- Converting input data handling to support both pandas.DataFrame and ibis.Table.
- Simplifying and unifying the number_of_combinations argument using a constant.
- Adjusting tests to reflect the removal of triple combinations and updated exception messages.
Reviewed Changes
| File | Description |
|---|---|
| pyretailscience/analysis/product_association.py | Refactored association logic to support ibis tables, updated error messaging, and streamlined renaming of intermediate columns. |
| tests/analysis/test_product_association.py | Updated tests to match new error messages and removed unsupported combination scenarios. |
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (2)
pyretailscience/analysis/product_association.py:259

- [nitpick] The repeated renaming of product occurrences (e.g. for product_occurrences_1 and product_occurrences_2) may be error-prone; consider extracting the renaming logic into a helper function to improve clarity and reduce duplication.

`product_occurrences_1 = product_occurrences.rename(`
pyretailscience/analysis/product_association.py:318

- [nitpick] The mapping in the renaming of inverse_pairs swaps column names; adding an inline comment to explain the rationale behind this mapping would help improve code clarity.

`inverse_pairs = result.rename({`
d7742f0 to 59a63dc (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (4)
pyretailscience/analysis/product_association.py (4)
258-263: ⚠️ Potential issue

Column renaming inconsistency in product_occurrences.

The `product_occurrences` table has columns `["item", "occurrences", "occurrence_probability"]`, but the rename operations are looking for columns that don't exist in this context (`item_1`, `occurrences_x`, etc.).

```diff
-product_occurrences_1 = product_occurrences.rename(
-    {"item_1": "item", "occurrences_x": "occurrences", "occurrence_probability_x": "occurrence_probability"},
-)
-product_occurrences_2 = product_occurrences.rename(
-    {"item_2": "item", "occurrences_y": "occurrences", "occurrence_probability_y": "occurrence_probability"},
-)
+# Correctly rename columns for joining with merged_df
+product_occurrences_1 = product_occurrences.rename({"item": "item_1"})
+product_occurrences_2 = product_occurrences.rename({"item": "item_2"})
```
326-328: ⚠️ Potential issue

Inconsistent column access for product_occurrences.

The rename operations are referencing columns like `{value_col}_1` and `{value_col}_2` but it's not clear where these column names came from, as previous operations created columns named "item".

```diff
-product_occurrences_1_rename2 = product_occurrences.rename({f"{value_col}_1": "item"})
-product_occurrences_2_rename2 = product_occurrences.rename({f"{value_col}_2": "item"})
+# Use consistent column naming based on the actual structure
+product_occurrences_1_rename2 = product_occurrences.rename({"item": f"{value_col}_1"})
+product_occurrences_2_rename2 = product_occurrences.rename({"item": f"{value_col}_2"})
```
248-249: ⚠️ Potential issue

Column renaming issue with non-existent columns.

The code is attempting to rename columns `item_1` and `item_2` in `unique_transactions`, but according to the previous operations, this table only has columns `[group_col, "item"]`. This will likely cause runtime errors.

```diff
-left_table = unique_transactions.rename({"item_1": "item"})
-right_table = unique_transactions.rename({"item_2": "item"})
+# These tables need correct column names for the join operation
+left_table = unique_transactions.rename({"item": "item_1"})
+right_table = unique_transactions.rename({"item": "item_2"})
```
290-295: ⚠️ Potential issue

Inconsistent column naming in product_occurrences renames.

These rename operations reference columns `occurrences_1`, `prob_1`, etc. that don't appear to exist in the `product_occurrences` table based on the previous operations. Check these column names to ensure they match what's available.

```diff
-product_occurrences_1_rename = product_occurrences.rename(
-    {"item_1": "item", "occurrences_1": "occurrences", "prob_1": "occurrence_probability"},
-)
-product_occurrences_2_rename = product_occurrences.rename(
-    {"item_2": "item", "occurrences_2": "occurrences", "prob_2": "occurrence_probability"},
-)
+# Use consistent column naming based on what exists in product_occurrences
+product_occurrences_1_rename = product_occurrences.rename(
+    {"item": "item_1"}
+)
+product_occurrences_2_rename = product_occurrences.rename(
+    {"item": "item_2"}
+)
```
🧹 Nitpick comments (1)
pyretailscience/analysis/product_association.py (1)
226-393: Comprehensive refactoring to use ibis instead of numpy/scipy.

The refactoring to use ibis operations is a good strategic move as it will likely improve performance for large datasets and enable integration with various data sources. However, ensure that all column naming is consistent throughout the implementation to prevent runtime errors.
Some key recommendations:
- Verify all column references match the actual columns available in each table
- Test thoroughly with both pandas DataFrames and ibis Tables
- Consider adding more inline comments explaining the complex data transformation steps
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- pyretailscience/analysis/product_association.py (6 hunks)
- tests/analysis/test_product_association.py (1 hunks)
🔇 Additional comments (10)
tests/analysis/test_product_association.py (3)
228-228: Error message updated to reflect current implementation.

The error message has been correctly updated to specify that only `number_of_combinations=2` is currently supported, which aligns with the changes in the ProductAssociation class implementation.

235-235: Error message updated for consistency.

This error message has been consistently updated to match the first instance, reflecting the current implementation constraints.
1-306: ✅ Verification successful

Test coverage concern due to removed test methods.

Two test methods (`test_calc_association_all_pair_items` and `test_calc_association_target_pair_items`) have been removed from this file. While this aligns with the refactoring to use ibis instead of numpy/scipy, we should ensure that the new implementation is still adequately tested with similar test coverage.
🏁 Script executed:

```shell
#!/bin/bash
# Check for any other tests that might cover the removed functionality
echo "Checking for other tests that might test pair combinations..."
rg -i "test.*pair" tests/
```

Length of output: 461
Pair Association Test Coverage Verified

The functionality previously validated by the removed tests now appears to be covered. Our search found that the pair association logic is exercised in `tests/test_range_planning.py`, specifically in tests that verify the correct generation of pairs dataframes when `exclude_same_transaction_products` is set to True and False. This indicates that the new implementation has been adequately tested. Please review the tests in `tests/test_range_planning.py` to confirm that all edge cases for pair associations are covered.

🧰 Tools
🪛 Ruff (0.8.2)
15-15: Use `@pytest.fixture` over `@pytest.fixture()`; remove parentheses (PT001)

26-26: Use `@pytest.fixture` over `@pytest.fixture()`; remove parentheses (PT001)

61-61: Use `@pytest.fixture` over `@pytest.fixture()`; remove parentheses (PT001)
pyretailscience/analysis/product_association.py (7)
39-44: Good addition of ibis import and constant definition.

Adding the ibis import and defining `SUPPORTED_COMBINATIONS` as a constant improves code maintainability by centralizing the supported value.

92-97: Enhanced flexibility with expanded type annotation.

Updating the parameter type to accept both pandas DataFrame and ibis Table enhances the flexibility of the class, allowing it to work with different data sources.

213-214: Simplified validation with clear error message.

The error message now clearly states that only `number_of_combinations=2` is supported, which is more explicit than the previous message.

226-228: Good defensive type handling.

The code correctly handles both pandas DataFrames and ibis Tables by converting DataFrames to ibis Tables when needed, ensuring a consistent implementation.

370-377: Comprehensive approach to association rule generation.

The code correctly executes ibis expressions and combines direct and inverse pairs to create a comprehensive set of association rules. The sorting and index resetting ensure consistent output formatting.

379-393: Target item filtering and column selection are appropriate.

The implementation correctly filters by target_item when specified and selects all relevant columns for the final result. This maintains compatibility with the previous implementation while leveraging ibis for improved performance.
1-394: ✅ Verification successful

Verify test coverage for the refactored implementation.

With the removal of two test methods from the test file and the complete refactoring of the implementation, it's important to ensure adequate test coverage for the new ibis-based logic.

🏁 Script executed:

```shell
#!/bin/bash
# Let's check test coverage for the ProductAssociation class
echo "Checking for tests of the ProductAssociation class..."
rg -l "ProductAssociation" tests/
echo "Checking coverage of key ibis operations..."
rg "ibis\.(join|group_by|aggregate|filter)" pyretailscience/analysis/product_association.py
```

Length of output: 617
Test Coverage for Product Association Refactoring Verified

The refactored implementation is supported by tests. Notably:

- The test file `tests/analysis/test_product_association.py` confirms that the `ProductAssociation` class is being tested.
- Key ibis operations such as `join`, `group_by`, `aggregate`, and `filter` are actively used in the implementation, ensuring that the new data-processing paths are exercised.

Please ensure that all new branches, including error handling and parameter validations, are comprehensively covered within the tests.
```python
product_pairs = product_pairs.mutate(
    confidence=product_pairs["cooccurrences"] / product_pairs["occurrences_1"],
    uplift=product_pairs["support"] / (product_pairs["prob_1"] * product_pairs["prob_2"]),
)
```
59a63dc to 2743f04 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (5)
pyretailscience/analysis/product_association.py (5)
241-246: ⚠️ Potential issue

Column name references in rename operations might be incorrect.

The rename operations reference columns that don't appear to exist in `product_occurrences`. Based on lines 221-229, `product_occurrences` should have columns `["item", "occurrences", "occurrence_probability"]`, not the columns being renamed.

Revisit these rename operations to ensure they match the actual column structure of the dataframe.
273-278: ⚠️ Potential issue

Inconsistent column naming in rename operations.

Similar to the previous rename operations, these don't match the expected columns in `product_occurrences`. Additionally, there's a mismatch with the columns referenced in the uplift calculation below (prob_1, prob_2).

Ensure the column naming is consistent throughout the code.
291-294: ⚠️ Potential issue

Issues with uplift calculation column references.

The uplift calculation uses `prob_1` and `prob_2` columns which don't appear to be properly defined in the preceding code. This was previously flagged in a review comment and still needs to be addressed.

Update the uplift calculation to use the correct column references based on your join/rename strategy:

```diff
 product_pairs = product_pairs.mutate(
     confidence=product_pairs["cooccurrences"] / product_pairs["occurrences_1"],
-    uplift=product_pairs["support"] / (product_pairs["prob_1"] * product_pairs["prob_2"]),
+    uplift=product_pairs["support"] / (product_pairs["occurrence_probability_x"] * product_pairs["occurrence_probability_y"]),
 )
```

Adjust the column names to match your actual column structure after the joins.
322-325: ⚠️ Potential issue

Same uplift calculation issue in inverse pairs.

The same issue with column references exists in the inverse_pairs uplift calculation.

```diff
 inverse_pairs = inverse_pairs.mutate(
     confidence=inverse_pairs["cooccurrences"] / inverse_pairs["occurrences_1"],
-    uplift=inverse_pairs["support"] / (inverse_pairs["prob_1"] * inverse_pairs["prob_2"]),
+    uplift=inverse_pairs["support"] / (inverse_pairs["occurrence_probability_x"] * inverse_pairs["occurrence_probability_y"]),
 )
```
231-232: ⚠️ Potential issue

Potential column name mismatch in rename operations.

The rename operations here don't seem to match the actual column structure. Based on the preceding code, `unique_transactions` has columns `[group_col, "item"]`, not `"item_1"` or `"item_2"`.

Verify the column names and adjust the rename operations, possibly:

```diff
-left_table = unique_transactions.rename({"item_1": "item"})
-right_table = unique_transactions.rename({"item_2": "item"})
+# If trying to create aliases for self-join, create copies instead
+left_table = unique_transactions
+right_table = unique_transactions
```
🧹 Nitpick comments (1)
pyretailscience/analysis/product_association.py (1)
212-218: Consider using the table's unnest method directly.

The code is using the internal `ibis.expr.operations.Unnest` operation directly. Consider using the table's `unnest()` method instead for better readability and to avoid depending on internal implementation details.

```diff
 unique_transactions = unique_transactions.mutate(
-    item=ibis.expr.operations.Unnest(unique_transactions["products"]),
+    item=unique_transactions["products"].unnest(),
 ).drop("products")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- pyretailscience/analysis/product_association.py (5 hunks)
- tests/analysis/test_product_association.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/analysis/test_product_association.py
🔇 Additional comments (10)
pyretailscience/analysis/product_association.py (10)
38-40: Good update to include ibis import.

The addition of the ibis import is appropriate for the refactoring of the class to support both Pandas DataFrame and ibis Table inputs.

44-45: Simplified approach with constant instead of parameter.

Replacing the `number_of_combinations` parameter with a constant `SUPPORTED_COMBINATIONS = 2` streamlines the API. This simplification makes the code more maintainable by removing parameter validation logic.

88-89: Good enhancement of type signature.

The updated type annotation correctly reflects that the class now supports both Pandas DataFrame and ibis Table inputs.

102-103: Docstring updated to match new type signature.

The documentation properly reflects the updated parameter type to include both DataFrame and ibis Table.

147-148: Method signature update consistent with class signature.

The static method signature properly mirrors the class initialization parameter type.

162-164: Docstring updated to match method signature.

Good job updating the docstring to match the new parameter type.

209-211: Good pattern for handling both input types.

This is a clean approach to handle both pandas DataFrame and ibis Table inputs, ensuring that the rest of the implementation only needs to deal with ibis Tables.

356-360: Good final result construction.

The concatenation of the direct and inverse pairs, sorting, and index reset produces a well-structured final result.

362-364: Clean target item filtering.

The conditional filter for target_item is implemented clearly and efficiently.

365-376: Proper column selection for final output.

The final column selection ensures a consistent output schema with all the important metrics.
```python
unique_transactions = (
    df.group_by(group_col).aggregate(products=lambda t, col=value_col: t[col].collect()).order_by(group_col)
)
unique_transactions = unique_transactions.mutate(
    item=ibis.expr.operations.Unnest(unique_transactions["products"]),
).drop("products")
```
This looks like it's doing something very different from the CTE in the example query. Eg,

```sql
unique_transactions AS (
    SELECT DISTINCT
        customer_id,
        product_id
    FROM transactions
)
```
Based on the Ibis docs I think this should do the same thing as the SQL
```diff
-unique_transactions = (
-    df.group_by(group_col).aggregate(products=lambda t, col=value_col: t[col].collect()).order_by(group_col)
-)
-unique_transactions = unique_transactions.mutate(
-    item=ibis.expr.operations.Unnest(unique_transactions["products"]),
-).drop("products")
+unique_transactions = (
+    df.select(df[group_col], df.product_id.name("item")).distinct()
+)
```
```python
    right_table,
    predicates=[left_table[group_col] == right_table[group_col]],
)
merged_df = merged_df.filter(merged_df["item_1"] < merged_df["item_2"])
```
I think you definitely want to do this as part of the join. Otherwise, I believe some DBs will do this after the join.
```python
    merged_df,
    product_occurrences_2,
    predicates=[merged_df["item_2"] == product_occurrences_2["item_2"]],
).order_by([group_col, "item_1", "item_2"])
```
We should avoid `order_by` on the non-final data. This could end up sorting millions or billions of rows of data.
…ature/product-association

d925e64 to 79b8cc6 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (6)
pyretailscience/analysis/product_association.py (6)
222-223: ⚠️ Potential issue

Incorrect column renaming.

The rename operations here appear problematic because `unique_transactions` doesn't contain columns named "item_1" or "item_2". After line 210, it only has columns `[group_col, "item"]`.

These renames won't work as intended. Consider:

```diff
-left_table = unique_transactions.rename({"item_1": "item"})
-right_table = unique_transactions.rename({"item_2": "item"})
+left_table = unique_transactions.rename({"item": "item_1"})
+right_table = unique_transactions.rename({"item": "item_2"})
```
233-238: ⚠️ Potential issue

Incorrect column renaming in product_occurrences.

The rename operations for `product_occurrences_1` and `product_occurrences_2` are using column names that don't exist in `product_occurrences`, which only has `["item", "occurrences", "occurrence_probability"]`.

This will cause runtime errors. Fix the renames:

```diff
 product_occurrences_1 = product_occurrences.rename(
-    {"item_1": "item", "occurrences_1": "occurrences", "occurrence_probability_1": "occurrence_probability"},
+    {"item": "item_1", "occurrences": "occurrences_1", "occurrence_probability": "occurrence_probability_1"},
 )
 product_occurrences_2 = product_occurrences.rename(
-    {"item_2": "item", "occurrences_2": "occurrences", "occurrence_probability_2": "occurrence_probability"},
+    {"item": "item_2", "occurrences": "occurrences_2", "occurrence_probability": "occurrence_probability_2"},
 )
```
279-282: ⚠️ Potential issue

Uplift calculation uses undefined columns.

The uplift calculation uses "prob_1" and "prob_2" columns, but these appear to be undefined or incorrectly referenced based on previous rename operations.

This is the same issue mentioned in previous review comments. Fix the column references:

```diff
 product_pairs = product_pairs.mutate(
     confidence=product_pairs["cooccurrences"] / product_pairs["occurrences_1"],
-    uplift=product_pairs["support"] / (product_pairs["prob_1"] * product_pairs["prob_2"]),
+    uplift=product_pairs["support"] / (product_pairs["occurrence_probability_1"] * product_pairs["occurrence_probability_2"]),
 )
```

Ensure that the column names match those created in your rename operations.
315-339: ⚠️ Potential issue

Column selection refers to non-existent columns.

Lines 315-339 select columns using names that may not match the actual columns in the result and inverse_pairs DataFrames.

Ensure that all column references match the actual column names in your tables. For example, in line 315:

```diff
-result = result.rename({f"{value_col}_1": "item_1", f"{value_col}_2": "item_2"})
+# If result already has "item_1" and "item_2" columns, this rename is unnecessary
+# Otherwise, make sure you're renaming from the correct source column names
```

Similarly, verify that all column selections in the following lists refer to columns that actually exist in the respective DataFrames.
261-266: ⚠️ Potential issue

Renamed columns don't match existing columns.

Similar to the previous issues, these renames use column names like "prob_1" and "prob_2" that don't appear to exist in `product_occurrences`.

Fix the renames to use the correct column names:

```diff
 product_occurrences_1_rename = product_occurrences.rename(
-    {"item_1": "item", "occurrences_1": "occurrences", "prob_1": "occurrence_probability"},
+    {"item": "item_1", "occurrences": "occurrences_1", "occurrence_probability": "prob_1"},
 )
 product_occurrences_2_rename = product_occurrences.rename(
-    {"item_2": "item", "occurrences_2": "occurrences", "prob_2": "occurrence_probability"},
+    {"item": "item_2", "occurrences": "occurrences_2", "occurrence_probability": "prob_2"},
 )
```
288-295: ⚠️ Potential issue

Incorrect column references in inverse_pairs.

The rename operation for inverse_pairs uses column references like `f"{value_col}_1"` and `f"{value_col}_2"` but these columns don't exist in the result DataFrame at this point.

Fix the column references:

```diff
 inverse_pairs = result.rename(
     {
-        f"{value_col}_1": "item_1",
-        f"{value_col}_2": "item_2",
+        "item_1": "item_2",
+        "item_2": "item_1",
         "occurrences_1": "occurrences_2",
         "occurrences_2": "occurrences_1",
     },
 )
```
🧹 Nitpick comments (3)
pyretailscience/analysis/product_association.py (3)
207-212: Potential performance issue with early execution.

The `total_transactions` calculation forces an immediate execution with `.execute()`, breaking the lazy evaluation chain that ibis typically provides. This could negatively impact performance for large datasets.

Consider making this part of the lazy evaluation chain by using:

```diff
-total_transactions = unique_transactions[group_col].nunique().execute()
+total_transactions = unique_transactions[group_col].nunique().name("total_count")
```

This matches the suggestion in previous comments and would allow the database to optimize the entire query.
85-141: Documentation needs update to reflect Ibis support.

The class docstring (lines 46-82) still refers only to pandas.DataFrame input, but the implementation now supports both pandas DataFrame and ibis Table.

Update the class docstring to reflect the support for ibis.Table:

```diff
 Args:
-    df (pandas.DataFrame): The input DataFrame containing transaction data.
+    df (pandas.DataFrame | ibis.Table): The input DataFrame or ibis Table containing transaction data.
```

Also, consider adding an example using ibis.Table to demonstrate the new functionality.
207-364: Inconsistent naming conventions for column suffixes.

Throughout the implementation, there's an inconsistent use of suffix naming conventions. Sometimes "_1"/"_2" are used, and other times "_x"/"_y" or other variations.

Standardize on a single naming convention for column suffixes (either "_1"/"_2" or "_x"/"_y") to improve code readability and maintenance. This will make it easier to track which columns are being used throughout the complex joins and transformations.
Actionable comments posted: 6
♻️ Duplicate comments (3)
pyretailscience/analysis/product_association.py (3)
316-340: ⚠️ Potential issue

Column selection inconsistency in result and inverse_pairs.

The column selection references `f"{value_col}_1"` and `f"{value_col}_2"`, but after renaming they should be `"item_1"` and `"item_2"`.

Fix the column selection to match the renamed columns:

```diff
 result = result.rename({f"{value_col}_1": "item_1", f"{value_col}_2": "item_2"})
 result = result[
     [
-        f"{value_col}_1",
-        f"{value_col}_2",
+        "item_1",
+        "item_2",
         "occurrences_1",
         "occurrences_2",
         "cooccurrences",
         "support",
         "confidence",
         "uplift",
     ]
 ]
 inverse_pairs = inverse_pairs[
     [
-        f"{value_col}_1",
-        f"{value_col}_2",
+        "item_1",
+        "item_2",
         "occurrences_1",
         "occurrences_2",
         "cooccurrences",
         "support",
         "confidence",
         "uplift",
     ]
 ]
```
354-365: ⚠️ Potential issue

Incorrect column names in final result selection.

The final result selection uses `f"{value_col}_1"` and `f"{value_col}_2"` but should use `"item_1"` and `"item_2"` after renaming.

Correct the column selection:

```diff
 return final_result[
     [
-        f"{value_col}_1",
-        f"{value_col}_2",
+        "item_1",
+        "item_2",
         "occurrences_1",
         "occurrences_2",
         "cooccurrences",
         "support",
         "confidence",
         "uplift",
     ]
 ]
```
298-310: ⚠️ Potential issue

Incorrect column references in join predicates for inverse pairs.

The join operations use column references like `f"{value_col}_1"` that don't match the renamed columns, which would cause runtime errors.

Fix the join predicates to use the correct column names:

```diff
 product_occurrences_1_rename2 = product_occurrences.rename(
-    {f"{value_col}_1": value_col}
+    {"item_1": value_col}
 )
 product_occurrences_2_rename2 = product_occurrences.rename(
-    {f"{value_col}_2": value_col}
+    {"item_2": value_col}
 )
 inverse_pairs = ibis.join(
     inverse_pairs,
     product_occurrences_1_rename2,
-    predicates=[inverse_pairs[f"{value_col}_1"] == product_occurrences_1_rename2[f"{value_col}_1"]],
+    predicates=[inverse_pairs["item_1"] == product_occurrences_1_rename2["item_1"]],
 )
 inverse_pairs = ibis.join(
     inverse_pairs,
     product_occurrences_2_rename2,
-    predicates=[inverse_pairs[f"{value_col}_2"] == product_occurrences_2_rename2[f"{value_col}_2"]],
+    predicates=[inverse_pairs["item_2"] == product_occurrences_2_rename2["item_2"]],
 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pyretailscience/analysis/product_association.py
(5 hunks)
🔇 Additional comments (5)
pyretailscience/analysis/product_association.py (5)
86-96: Good type annotation update to support ibis Tables. The parameter type has been correctly updated to accommodate both Pandas DataFrames and ibis Tables, aligning with the PR objective to convert functionality to the Ibis framework.
207-209: Appropriate conversion from pandas to ibis table. The code correctly checks if the input is a pandas DataFrame and converts it to an ibis memtable when needed.
210-220: Simplified transaction aggregation with ibis. The implementation uses ibis expressions to get unique transactions and calculate total transaction count, which is more scalable for large datasets compared to the previous implementation.
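As a rough illustration of the aggregation pattern this comment describes (a hypothetical pandas sketch with made-up data, not the PR's ibis code), an item's occurrence count is the number of distinct transactions containing it:

```python
import pandas as pd

# Hypothetical mini dataset: one row per (transaction, item) line.
tx = pd.DataFrame({
    "transaction_id": [1, 1, 2, 3],
    "item": ["apple", "bread", "apple", "apple"],
})

# Occurrences per item = number of distinct transactions containing it,
# analogous to the ibis group_by(...).aggregate(occurrences=t[group_col].nunique()).
occurrences = tx.groupby("item")["transaction_id"].nunique()

# Total transaction count, analogous to the ibis nunique() on the group column.
total_transactions = tx["transaction_id"].nunique()

print(occurrences.to_dict())   # {'apple': 3, 'bread': 1}
print(total_transactions)      # 3
```

Using `nunique` rather than `count` matters here: a transaction that lists the same item on two lines would otherwise be double-counted.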
222-233: Good use of ibis join predicates for generating pairs. The predicate `left_table["item_1"] < right_table["item_2"]` efficiently ensures we only get each pair once in a single direction.
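The effect of that predicate can be sketched in pandas (a toy example with invented data, not the PR's code): a self-join on the transaction key followed by an `item_1 < item_2` filter keeps each unordered pair exactly once per transaction.

```python
import pandas as pd

# Hypothetical basket data: transaction 1 holds three items, transaction 2 holds two.
tx = pd.DataFrame({
    "transaction_id": [1, 1, 1, 2, 2],
    "item": ["apple", "bread", "milk", "apple", "milk"],
})

# Self-join on the transaction key, then keep only item_1 < item_2 so that
# (apple, bread) appears once rather than twice as (apple, bread)/(bread, apple).
left = tx.rename(columns={"item": "item_1"})
right = tx.rename(columns={"item": "item_2"})
pairs = left.merge(right, on="transaction_id")
pairs = pairs[pairs["item_1"] < pairs["item_2"]]

print(sorted(map(tuple, pairs[["item_1", "item_2"]].itertuples(index=False))))
# [('apple', 'bread'), ('apple', 'milk'), ('apple', 'milk'), ('bread', 'milk')]
```

The strict inequality also drops self-pairs like (apple, apple) for free.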
211-211: Efficient use of ibis for total transaction count. Using ibis's `nunique()` method with proper naming is a good approach for calculating the total transaction count efficiently.
Actionable comments posted: 1
♻️ Duplicate comments (4)
pyretailscience/analysis/product_association.py (4)
234-239: ⚠️ Potential issue: Column naming inconsistency in rename operations.
The rename operations reference columns like "item_1", "occurrences_1", and "occurrence_probability_1" that don't appear to exist in the `product_occurrences` table at this point.
These issues are similar to previous review comments about column naming inconsistencies. Ensure that you're renaming columns that actually exist in the source table.
259-264: ⚠️ Potential issue: Column naming mismatch in product_occurrences rename operations.
The rename operations reference columns like "prob_1" and "prob_2" that don't appear to be created earlier in the code. This is likely to cause runtime errors.
This issue is similar to a previous review comment about column naming. The source table `product_occurrences` has "occurrence_probability" but not "prob_1" or "prob_2".
275-278: ⚠️ Potential issue: Column references don't match renamed columns.
The uplift calculation references "prob_1" and "prob_2" columns, but these don't match the renamed columns from earlier operations. This will likely cause runtime errors.
This issue is similar to a previous review comment. The columns should be renamed consistently throughout the code. If you're using "occurrence_probability_1" and "occurrence_probability_2" in some places, you should use the same names here.
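Whatever the probability columns end up being called, the metric definitions these comments rely on are the standard association-rule ones. A small worked example with made-up counts (not numbers from the PR):

```python
# Toy counts, purely illustrative: 100 transactions, item A appears in 20,
# item B in 25, and both together in 10.
total_transactions = 100
occurrences_a, occurrences_b, cooccurrences = 20, 25, 10

support = cooccurrences / total_transactions        # P(A and B) = 0.1
confidence_a_to_b = cooccurrences / occurrences_a   # P(B | A)   = 0.5
prob_a = occurrences_a / total_transactions         # P(A)       = 0.2
prob_b = occurrences_b / total_transactions         # P(B)       = 0.25
uplift = support / (prob_a * prob_b)                # 0.1 / 0.05 = 2.0

print(support, confidence_a_to_b, uplift)
```

An uplift above 1 means the pair co-occurs more often than independence would predict, which is why the code filters on a minimum uplift threshold.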
284-292: ⚠️ Potential issue: Inconsistent column naming in inverse_pairs creation.
The inverse_pairs creation references "prob_1" and "prob_2" columns that may not exist in the result dataframe.
Similar to previous issues with column naming. Ensure consistent column naming throughout the code, especially for probability columns.
🧹 Nitpick comments (3)
pyretailscience/analysis/product_association.py (3)
209-219: Simplify the group_by aggregate operation. The trailing comma in line 215 is unnecessary and should be removed for consistency with coding standards.
.aggregate( - occurrences=lambda t: t[group_col].nunique(), + occurrences=lambda t: t[group_col].nunique() )
253-254: Remove trailing comma in single-item mutate calls. Similar to line 215, the trailing comma in line 253 should be removed for consistency.
cooccurrences = cooccurrences.mutate( - support=cooccurrences.cooccurrences / total_transactions, + support=cooccurrences.cooccurrences / total_transactions )
322-327: Consider optimizing the execution flow. The current implementation executes the two ibis expressions separately and then concatenates them with pandas. Consider if this can be done within ibis to allow for backend optimization.
- result = result.execute() - inverse_pairs = inverse_pairs.execute() - - final_result = ( - pd.concat([result, inverse_pairs], ignore_index=True) - .sort_values(by=["item_1", "item_2"]) - .reset_index(drop=True) - ) + combined = ibis.union(result, inverse_pairs) + combined = combined.order_by(["item_1", "item_2"]) + final_result = combined.execute() + final_result = final_result.reset_index(drop=True)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pyretailscience/analysis/product_association.py
(5 hunks)tests/analysis/test_product_association.py
(0 hunks)
💤 Files with no reviewable changes (1)
- tests/analysis/test_product_association.py
🔇 Additional comments (3)
pyretailscience/analysis/product_association.py (3)
85-85: Good type annotation enhancement. The parameter type has been properly updated to accept both pandas DataFrame and ibis Table, which enhances flexibility in the interface.
206-208: Good practice handling different input types. Converting pandas DataFrame to ibis Table is a clean approach that allows the rest of the code to operate uniformly regardless of input type.
327-329: ✅ Verification successful
Target item filtering needs verification.
The filtering operation at line 329 assumes that the column name will be `f"{value_col}_1"`, which should be verified against the actual column names in the final result.
🏁 Script executed:
#!/bin/bash # Check how target_item filtering is handled in similar code rg -A 2 -B 2 "target_item.*reset_index" pyretailscience/Length of output: 583
Verified: Target item filtering logic is valid and consistent.
The shell script output confirms that the column renaming (line 327) correctly sets the column name as `f"{value_col}_1"` before filtering (line 329). This verifies that the filter condition is applied on the intended column, and the target item filtering logic is implemented correctly.
Force-pushed from 2b2a018 to 23641c8 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (4)
pyretailscience/analysis/product_association.py (4)
234-239: ⚠️ Potential issue: Rename operations reference non-existent columns.
Similar to the previous issue, these rename operations reference columns that don't exist in `product_occurrences`. The columns "item_1", "occurrences_1", and "occurrence_probability_1" (and their "_2" counterparts) aren't defined in the source table.
These operations would fail at runtime. Consider correcting the rename operations to match the actual column structure:
- product_occurrences_1 = product_occurrences.rename( - {"item_1": value_col, "occurrences_1": "occurrences", "occurrence_probability_1": "occurrence_probability"}, - ) + product_occurrences_1 = product_occurrences.rename( + {value_col: "item_1", "occurrences": "occurrences_1", "occurrence_probability": "occurrence_probability_1"}, + )
275-278: ⚠️ Potential issue: Column naming mismatch in uplift calculation.
The uplift calculation references "prob_1" and "prob_2" columns, which don't match the renamed columns defined earlier (they should be "occurrence_probability_1" and "occurrence_probability_2").
Fix the column references in the uplift calculation to match the renamed columns:
product_pairs = product_pairs.mutate( confidence=product_pairs["cooccurrences"] / product_pairs["occurrences_1"], - uplift=product_pairs["support"] / (product_pairs["prob_1"] * product_pairs["prob_2"]), + uplift=product_pairs["support"] / (product_pairs["occurrence_probability_1"] * product_pairs["occurrence_probability_2"]), )
259-264: ⚠️ Potential issue: Column naming inconsistency in rename operations.
The rename operations reference "prob_1" and "prob_2", but these columns don't appear to be defined. Earlier code uses "occurrence_probability" instead.
Fix the column names in the rename operations to match the actual column structure:
- product_occurrences_1_rename = product_occurrences.rename( - {"item_1": value_col, "occurrences_1": "occurrences", "prob_1": "occurrence_probability"}, - ) + product_occurrences_1_rename = product_occurrences.rename( + {value_col: "item_1", "occurrences": "occurrences_1", "occurrence_probability": "occurrence_probability_1"}, + )
221-232: ⚠️ Potential issue: Rename operations and join predicates reference columns that don't exist.
The rename operations in lines 221-222 attempt to rename "item_1" and "item_2" columns that don't exist in `unique_transactions` at this point. Additionally, the join predicates in lines 228-229 reference these same non-existent columns.
This appears to be a critical implementation bug that would cause runtime errors. Consider either:
at this point. Additionally, the join predicates in lines 228-229 reference these same non-existent columns.This appears to be a critical implementation bug that would cause runtime errors. Consider either:
- Using a mutate operation to create these columns first:
- left_table = unique_transactions.rename({"item_1": value_col}) - right_table = unique_transactions.rename({"item_2": value_col}) + left_table = unique_transactions.mutate(item_1=unique_transactions[value_col]) + right_table = unique_transactions.mutate(item_2=unique_transactions[value_col])
- Or directly using the value_col in the join predicates:
predicates=[ left_table[group_col] == right_table[group_col], - left_table["item_1"] < right_table["item_2"], + left_table[value_col] < right_table[value_col], ],
🧹 Nitpick comments (2)
pyretailscience/analysis/product_association.py (2)
309-323: Consider deferring execute() call for better performance.
Calling `.execute()` forces immediate evaluation of the query, which can impact performance with large datasets. In some cases, it might be better to return the ibis expression and let the caller decide when to execute it.
Consider making the execution optional:
- final_result = result.execute().sort_values(by=["item_1", "item_2"]).reset_index(drop=True)
+ # Either return the ibis expression
+ if not execute:
+     return result
+ # Or execute it
+ final_result = result.execute().sort_values(by=["item_1", "item_2"]).reset_index(drop=True)

This would require adding an `execute: bool = True` parameter to the method signature.
307-307: Potential performance improvement for min_confidence filtering.
The min_confidence filter is applied after potentially expensive operations like union and other calculations. Since confidence values differ between direct and inverse pairs, this placement is correct but could be optimized.
Consider applying the confidence filter for target_item queries before any union operations, since they don't need inverse pairs:
result = result.filter(result.uplift >= min_uplift) + # Apply confidence filter early when we don't need inverse pairs + if target_item is not None: + result = result.filter(result.confidence >= min_confidence) + if target_item is not None: result = result.filter(result["item_1"] == target_item)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pyretailscience/analysis/product_association.py
(5 hunks)tests/analysis/test_product_association.py
(0 hunks)
💤 Files with no reviewable changes (1)
- tests/analysis/test_product_association.py
🔇 Additional comments (5)
pyretailscience/analysis/product_association.py (5)
38-38: Added ibis import reflects the refactored implementation. The addition of the ibis import supports the new implementation that accepts both pandas DataFrames and ibis Tables as inputs, aligning with the PR objective of converting functionality to the Ibis framework.
85-85: Type annotations correctly updated to support both pandas and ibis inputs. The type annotations have been properly updated to accept both DataFrame and ibis.Table types, making the implementation more flexible for different data sources. This change is consistent throughout the class and well-documented in the docstrings.
Also applies to: 99-99, 144-145, 160-160
206-208: Appropriate data type conversion. Converting pandas DataFrames to ibis memtables when needed ensures backward compatibility while enabling the Ibis-based implementation.
282-284: Optimized filtering for target_item. The code efficiently applies target_item filtering early in the process if specified, avoiding unnecessary calculations for inverse pairs. This is a performance improvement over computing everything and filtering later.
Also applies to: 285-306
209-219: ❓ Verification inconclusive
Verify unique transactions logic.
The implementation of unique_transactions has changed significantly from the previous implementation. This code doesn't seem to be doing any deduplication or ensuring uniqueness per transaction. If transactions can contain multiple entries for the same product, this could affect your association calculations.
🏁 Script executed:
#!/bin/bash # Check how unique_transactions was implemented in the previous version rg -A 5 -B 5 "unique_transactions" --glob "*.py" | grep -v "pyretailscience/analysis/product_association.py"Length of output: 106
Action: Verify Transaction Deduplication Logic
The current implementation defines `unique_transactions` as a direct alias for `df` without any explicit deduplication step. While the aggregation later uses `.nunique()` on the `group_col` to count unique transactions, it's not clear whether this meets your intended deduplication requirements, especially if individual transactions might list the same product multiple times.
- The previous version's implementation for managing unique transactions wasn't located by the automated search; please manually verify if a deduplication operation existed or was intentionally removed.
- Confirm if `df` is already preprocessed to ensure uniqueness, or if an explicit operation (e.g., a `drop_duplicates`) should be applied before the aggregation.
- Once verified, adjust the logic accordingly to prevent potential inaccuracies in association counts if duplicate entries per transaction are possible.
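To make the concern concrete, here is a hypothetical pandas sketch (invented data, not the PR's code) showing how duplicate lines within one transaction inflate a self-join unless they are dropped first:

```python
import pandas as pd

# Hypothetical basket where transaction 1 lists "apple" on two separate lines.
df = pd.DataFrame({
    "transaction_id": [1, 1, 1],
    "item": ["apple", "apple", "bread"],
})

# Without deduplication, a self-join on transaction_id pairs each duplicate
# line, so (apple, bread) is counted twice for a single basket.
raw_pairs = df.merge(df.rename(columns={"item": "item_2"}), on="transaction_id")
raw_pairs = raw_pairs[raw_pairs["item"] < raw_pairs["item_2"]]

# Dropping duplicate (transaction, item) lines first gives the intended count.
deduped = df.drop_duplicates(subset=["transaction_id", "item"])
clean_pairs = deduped.merge(deduped.rename(columns={"item": "item_2"}), on="transaction_id")
clean_pairs = clean_pairs[clean_pairs["item"] < clean_pairs["item_2"]]

print(len(raw_pairs), len(clean_pairs))  # 2 1
```

If the input is guaranteed one row per (transaction, item), the explicit deduplication is redundant; otherwise the co-occurrence counts are silently inflated.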
if target_item is not None:
    result = result.filter(result["item_1"] == target_item)
Related to the comment above, you'd then remove this code.
if target_item is not None:
    result = result.filter(result["item_1"] == target_item)
Force-pushed …ature/product-association from 23641c8 to 89b66d5 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (5)
pyretailscience/analysis/product_association.py (5)
221-222: ⚠️ Potential issue: Column renaming issue: attempting to rename non-existent columns.
The code attempts to rename columns "item_1" and "item_2" in `unique_transactions`, but this table only has `[group_col, value_col]` columns at this point.
- left_table = unique_transactions.rename({"item_1": value_col})
- right_table = unique_transactions.rename({"item_2": value_col})
+ left_table = unique_transactions.rename({value_col: "item_1"})
+ right_table = unique_transactions.rename({value_col: "item_2"})
282-285: ⚠️ Potential issue: Incorrect column references in uplift calculation.
The uplift calculation references columns `prob_1` and `prob_2` which aren't properly defined in previous rename operations.
- uplift=product_pairs["support"] / (product_pairs["prob_1"] * product_pairs["prob_2"]),
+ uplift=product_pairs["support"] / (product_pairs["occurrence_probability_1"] * product_pairs["occurrence_probability_2"]),
300-308: ⚠️ Potential issue: Inconsistent column references in inverse pairs creation.
The inverse pairs creation references columns like `prob_1` and `prob_2` which aren't properly defined in the previous operations.
  item_1=result["item_2"],
  item_2=result["item_1"],
  occurrences_1=result["occurrences_2"],
  occurrences_2=result["occurrences_1"],
- prob_1=result["prob_2"],
- prob_2=result["prob_1"],
+ occurrence_probability_1=result["occurrence_probability_2"],
+ occurrence_probability_2=result["occurrence_probability_1"],
  confidence=result["cooccurrences"] / result["occurrences_2"],
241-246: ⚠️ Potential issue: Column renaming issue: attempting to rename non-existent columns.
Similar to the previous issue, these rename operations reference columns like "item_1", "occurrences_1", and "occurrence_probability_1" that don't exist in the `product_occurrences` table.
- product_occurrences_1 = product_occurrences.rename(
-     {"item_1": value_col, "occurrences_1": "occurrences", "occurrence_probability_1": "occurrence_probability"},
- )
- product_occurrences_2 = product_occurrences.rename(
-     {"item_2": value_col, "occurrences_2": "occurrences", "occurrence_probability_2": "occurrence_probability"},
- )
+ product_occurrences_1 = product_occurrences.rename({value_col: "item_1", "occurrences": "occurrences_1", "occurrence_probability": "occurrence_probability_1"})
+ product_occurrences_2 = product_occurrences.rename({value_col: "item_2", "occurrences": "occurrences_2", "occurrence_probability": "occurrence_probability_2"})
266-271: ⚠️ Potential issue: Column renaming issue: inconsistent column naming pattern.
These rename operations are redundant with lines 241-246 and use different column names ("prob_1" instead of "occurrence_probability_1"). This inconsistency will cause issues.
- product_occurrences_1_rename = product_occurrences.rename( - {"item_1": value_col, "occurrences_1": "occurrences", "prob_1": "occurrence_probability"}, - ) - product_occurrences_2_rename = product_occurrences.rename( - {"item_2": value_col, "occurrences_2": "occurrences", "prob_2": "occurrence_probability"}, - ) + # Either remove these redundant rename operations and use product_occurrences_1/2 from earlier, + # or ensure consistent column naming between these and the earlier rename operations
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pyretailscience/analysis/product_association.py
(5 hunks)tests/analysis/test_product_association.py
(0 hunks)
💤 Files with no reviewable changes (1)
- tests/analysis/test_product_association.py
🔇 Additional comments (12)
pyretailscience/analysis/product_association.py (12)
38-39: LGTM: Added import for ibis framework. The code now imports the `ibis` library, which is necessary for the new implementation using the Ibis framework instead of sparse matrix operations.
85-86: LGTM: Updated parameter type to support Ibis tables. The constructor now accepts both pandas DataFrames and ibis Tables as input, enhancing flexibility.
99-100: LGTM: Updated docstring to reflect new parameter type. The docstring correctly documents that the `df` parameter now accepts both pandas DataFrames and ibis Tables.
143-145: LGTM: Updated static method parameter type. The `_calc_association` method now accepts both pandas DataFrames and ibis Tables as input, consistent with the constructor.
160-161: LGTM: Updated method docstring to reflect new parameter type. The docstring correctly documents that the `df` parameter now accepts both pandas DataFrames and ibis Tables.
206-208: LGTM: Automatically convert pandas DataFrame to ibis table. This is a good approach to handle both input types seamlessly.
209-211: LGTM: Efficiently capture unique transactions and total count. This approach correctly creates a table of unique transactions and calculates the total count in an efficient manner.
212-219: LGTM: Product occurrences calculation using ibis operations. The code effectively uses ibis group_by and aggregate operations to calculate product occurrences and filters by the minimum threshold.
224-239: LGTM: Well-structured join logic with target_item handling. The join logic effectively handles both scenarios (with or without target_item) and uses appropriate ibis join features with named suffixes.
258-264: LGTM: Effective cooccurrences calculation and filtering. The code correctly calculates cooccurrences by grouping on item pairs and applies proper filtering based on minimum thresholds.
287-299: LGTM: Appropriate filtering and column ordering. The code correctly filters based on the minimum uplift threshold and defines a clear column order for the result.
311-327: LGTM: Final result formatting and column selection. The code correctly filters the final results, executes the ibis expression, sorts and renames the columns, and returns a properly formatted pandas DataFrame.
merged_df = merged_df.join(
    product_occurrences_1,
    predicates=[merged_df["item_1"] == product_occurrences_1["item_1"]],
)
base_items = [target_item]
if number_of_combinations == 2:  # noqa: PLR2004
    if target_item is None:
        base_items = unique_combo_df[value_col].cat.categories
    items = [([unique_combo_df[value_col].cat.categories.get_loc(cat)], cat) for cat in base_items]
elif number_of_combinations == 3:  # noqa: PLR2004
    if target_item is None:
        base_items = sorted(combinations(unique_combo_df[value_col].cat.categories, 2))
    items = [
        ([unique_combo_df[value_col].cat.categories.get_loc(i) for i in cats], cats) for cats in base_items
    ]
merged_df = merged_df.join(
    product_occurrences_2,
    predicates=[merged_df["item_2"] == product_occurrences_2["item_2"]],
)
Join predicates may reference columns that don't match the renamed columns.
The join predicates reference columns like `merged_df["item_1"]` and `product_occurrences_1["item_1"]`, but given the rename issues above, these columns might not exist as expected.
This will need to be fixed after addressing the rename operations in lines 241-246. The join predicates should reference the correctly named columns.
product_pairs = cooccurrences.join(
    product_occurrences_1_rename,
    predicates=[cooccurrences["item_1"] == product_occurrences_1_rename["item_1"]],
)
product_pairs = product_pairs.join(
    product_occurrences_2_rename,
    predicates=[product_pairs["item_2"] == product_occurrences_2_rename["item_2"]],
)
Inconsistency in join predicates with renamed tables.
The join operations use predicates that reference columns like `cooccurrences["item_1"]` and `product_occurrences_1_rename["item_1"]`, but given the rename issues above, these columns might not match as expected.
This will need to be fixed after addressing the rename operations in lines 266-271.
feat: convert product_association into ibis
Summary by CodeRabbit
New Features
Refactor
Bug Fixes