
Product Association #131


Merged: 4 commits, Mar 19, 2025

Conversation

mayurkmmt
Collaborator

@mayurkmmt mayurkmmt commented Mar 10, 2025

feat: convert product_association into ibis

Summary by CodeRabbit

  • New Features

    • Enabled product association calculations to accept a wider range of data formats, increasing flexibility.
  • Refactor

    • Optimized the association computation process for improved performance and scalability.
    • Streamlined configuration settings by removing unnecessary parameters.
  • Bug Fixes

    • Removed outdated test cases related to association calculations and refined error handling for the number of combinations.


coderabbitai bot commented Mar 10, 2025

Walkthrough

This pull request refactors the ProductAssociation class by updating the type for the df parameter to accept both a Pandas DataFrame and an ibis Table. The number_of_combinations parameter has been removed, simplifying the constructor. Error handling related to number_of_combinations has been eliminated. The association calculation logic has been overhauled to use ibis operations (e.g., select, group_by, aggregate, join) instead of numpy/scipy-based methods. Corresponding tests have been updated by removing two methods and adjusting error messages.
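The pair-building logic described in the walkthrough can be sketched in plain pandas (this is an illustration of the general technique, not the PR's ibis implementation; the column names customer_id and product are invented for the example):

```python
import pandas as pd

# Hypothetical transactions: one row per (customer, product).
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "product":     ["a", "b", "c", "a", "b", "a"],
})

# Deduplicate, then self-merge on the transaction key to form product pairs.
uniq = df.drop_duplicates()
pairs = uniq.merge(uniq, on="customer_id", suffixes=("_1", "_2"))

# Keep each unordered pair exactly once.
pairs = pairs[pairs["product_1"] < pairs["product_2"]]

# Count how many transactions contain each pair.
cooccurrences = (
    pairs.groupby(["product_1", "product_2"]).size()
    .rename("cooccurrences").reset_index()
)
print(cooccurrences)
```

The ibis version in the PR expresses the same self-join, filter, and aggregation as deferred table expressions rather than in-memory operations.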

Changes

File(s) | Change Summary
pyretailscience/.../product_association.py | Updated the ProductAssociation constructor and _calc_association method; the df parameter now accepts `pd.DataFrame | ibis.Table`.
tests/analysis/.../test_product_association.py | Removed the all-pair and target-pair item tests; updated error messages in the invalid-combinations test to indicate that only number_of_combinations=2 is supported.

Possibly related PRs

  • feat: improved performance of product assocation calculation #70: The changes in the main PR directly modify the constructor and _calc_association method of the ProductAssociation class, while the retrieved PR also alters these same methods by adding a new parameter, indicating a strong connection between the two.
  • feat: added production association rule module #69: The changes in the main PR directly modify the ProductAssociation class's constructor and _calc_association method, while the retrieved PR introduces a new ProductAssociation class with similar methods, indicating a strong connection at the code level.
  • Prod assoc perf improvements #71: The changes in the main PR focus on refactoring the ProductAssociation class and its methods to utilize the ibis library, while the retrieved PR modifies the handling of sparse matrices in the _calc_association method, indicating that they both alter the same method but in different contexts and approaches.

Suggested labels

enhancement, Bug fix, documentation, Review effort [1-5]: 3

Poem

I'm a happy rabbit, hopping through code,
Switching DataFrames for ibis on the road.
New types and tweaks make the flow so neat,
Only twos in play—oh, what a treat!
With each byte I nibble, my circuits sing,
Cheers to refactored joy and a smoother fling!
🐰🎉


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


codecov bot commented Mar 10, 2025

Codecov Report

Attention: Patch coverage is 97.14286% with 1 line in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
pyretailscience/analysis/product_association.py | 97.14% | 0 Missing and 1 partial ⚠️

Files with missing lines | Coverage Δ
pyretailscience/analysis/product_association.py | 87.50% <97.14%> (-0.64%) ⬇️

... and 1 file with indirect coverage changes


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (3)
pyretailscience/analysis/product_association.py (3)

44-44: Constant for supported combinations.

Introducing SUPPORTED_COMBINATIONS as a constant clarifies the current functionality. If future support is planned for more combinations, consider structuring the code so it's easy to extend.


213-214: Validation restricting combinations to 2 only.

Raising an error for unsupported combination values enforces the new constraint. This also aligns with the updated tests. Ensure the docstrings are consistent with only supporting pairs.

-            ValueError: If the number of combinations is not 2 or 3,
+            ValueError: If the number of combinations is not 2,

380-382: Filtering by target item.

Filtering the final result to only rows where value_col_1 == target_item is a clean way to handle optional targeting. For greater flexibility, consider allowing a list of target items in the future.
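A minimal pandas sketch of that optional filter (the value_col_1 column and target_item variable names follow the comment above; the data is invented):

```python
import pandas as pd

# Hypothetical association results.
result = pd.DataFrame({
    "value_col_1": ["a", "a", "b"],
    "value_col_2": ["b", "c", "c"],
    "uplift": [1.2, 0.9, 1.5],
})

target_item = "a"
# Apply the target filter only when a target item was requested.
if target_item is not None:
    result = result[result["value_col_1"] == target_item]
print(result)
```

Supporting a list of targets would only change the predicate to an `isin` check.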

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3eb32a8 and d7742f0.

📒 Files selected for processing (2)
  • pyretailscience/analysis/product_association.py (6 hunks)
  • tests/analysis/test_product_association.py (1 hunks)
🔇 Additional comments (5)
tests/analysis/test_product_association.py (2)

228-228: Updated error message aligns with new logic.

The updated error message is consistent with restricting number_of_combinations to only 2. This test correctly verifies that invalid values trigger the new error.


235-235: Duplicate check for invalid combination values.

Having a separate check for other invalid values is good for completeness. The updated message matches the refactor limiting supported combinations to 2.

pyretailscience/analysis/product_association.py (3)

39-39: New ibis import.

Importing ibis is necessary for the memtable conversion and subsequent ibis-based operations. Make sure all dependencies are properly declared within your environment or setup files.


309-316: Confidence and uplift calculations.

This step is logically correct (confidence = cooccurrences / occurrences_1, uplift = support / (prob_1 * prob_2)) once the data is joined properly. However, ensure the preceding rename operations produce the correct column references.
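To make those formulas concrete, a small worked example with made-up counts (all numbers here are invented for illustration):

```python
# Invented counts for illustration.
total_transactions = 50
occurrences_1 = 20      # transactions containing item 1
occurrences_2 = 10      # transactions containing item 2
cooccurrences = 5       # transactions containing both items

support = cooccurrences / total_transactions   # 5/50  = 0.10
prob_1 = occurrences_1 / total_transactions    # 20/50 = 0.40
prob_2 = occurrences_2 / total_transactions    # 10/50 = 0.20

confidence = cooccurrences / occurrences_1     # 5/20 = 0.25
uplift = support / (prob_1 * prob_2)           # 0.10 / 0.08 = 1.25

print(confidence, uplift)
```

An uplift above 1 means the pair co-occurs more often than independence would predict.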


92-96:

❓ Verification inconclusive

Parameter type change to accept ibis tables.

Allowing df: pd.DataFrame | ibis.Table gives more flexibility. Ensure that all downstream logic can handle both data structures, especially any operations that might differ between pandas and ibis.


🏁 Script executed:

#!/bin/bash
# Description: Verify that no downstream call expects only a pandas DataFrame.
# We'll look for methods that might call df.* methods not supported by ibis.
rg -A 5 "df\." 

Length of output: 58417


Attention: Verify Downstream Handling of Mixed DataFrame Types

The parameter change to allow both pd.DataFrame and ibis.Table has been partially verified via our code search. In the affected file (pyretailscience/analysis/product_association.py), methods such as accessing df.columns and calling df.group_by(…) appear to be used consistently. These methods are available on ibis tables, and similar operations are used elsewhere in the codebase. However, note that some downstream calls (e.g. pandas‑specific methods like groupby or to_dict in other modules) may behave differently if a native pandas DataFrame is passed.

  • In product_association.py, the following is observed:
    • Column Checks:
      missing_cols = set(required_cols) - set(df.columns)
      This is valid on both pd.DataFrame and ibis.Table (which exposes a columns property).
    • Aggregation Operations:
      The use of df.group_by(group_col).aggregate(…) is consistent with ibis semantics.
  • Across the repository, both pandas‑specific and ibis‑specific methods are invoked. It is recommended to:
    • Confirm at runtime that when a pandas DataFrame is provided, any pandas‑only operations are either skipped or preceded by an appropriate conversion step.
    • Ensure that unit tests cover the scenarios for both data types so that any subtle differences are caught early.

Please verify that all downstream logic correctly handles both data structure types during integration testing.

@mvanwyk mvanwyk requested review from Copilot and mvanwyk March 11, 2025 06:15
Contributor

@Copilot Copilot AI left a comment

PR Overview

This PR refactors the product association functionality to use ibis for flexible data input and streamlines the association calculation process while updating error messaging. Key changes include:

  • Converting input data handling to support both pandas.DataFrame and ibis.Table.
  • Simplifying and unifying the number_of_combinations argument using a constant.
  • Adjusting tests to reflect the removal of triple combinations and updated exception messages.

Reviewed Changes

File Description
pyretailscience/analysis/product_association.py Refactored association logic to support ibis tables, updated error messaging, and streamlined renaming of intermediate columns.
tests/analysis/test_product_association.py Updated tests to match new error messages and removed unsupported combination scenarios.

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

pyretailscience/analysis/product_association.py:259

  • [nitpick] The repeated renaming of product occurrences (e.g. for product_occurrences_1 and product_occurrences_2) may be error-prone; consider extracting the renaming logic into a helper function to improve clarity and reduce duplication.
product_occurrences_1 = product_occurrences.rename(

pyretailscience/analysis/product_association.py:318

  • [nitpick] The mapping in the renaming of inverse_pairs swaps column names; adding an inline comment to explain the rationale behind this mapping would help improve code clarity.
inverse_pairs = result.rename({

@mayurkmmt mayurkmmt force-pushed the feature/product-association branch from d7742f0 to 59a63dc Compare March 11, 2025 06:56
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (4)
pyretailscience/analysis/product_association.py (4)

258-263: ⚠️ Potential issue

Column renaming inconsistency in product_occurrences.

The product_occurrences table has columns ["item", "occurrences", "occurrence_probability"], but the rename operations are looking for columns that don't exist in this context (item_1, occurrences_x, etc.).

-product_occurrences_1 = product_occurrences.rename(
-    {"item_1": "item", "occurrences_x": "occurrences", "occurrence_probability_x": "occurrence_probability"},
-)
-product_occurrences_2 = product_occurrences.rename(
-    {"item_2": "item", "occurrences_y": "occurrences", "occurrence_probability_y": "occurrence_probability"},
-)
+# Correctly rename columns for joining with merged_df
+product_occurrences_1 = product_occurrences.rename({"item": "item_1"})
+product_occurrences_2 = product_occurrences.rename({"item": "item_2"})

326-328: ⚠️ Potential issue

Inconsistent column access for product_occurrences.

The rename operations are referencing columns like {value_col}_1 and {value_col}_2 but it's not clear where these column names came from, as previous operations created columns named "item".

-product_occurrences_1_rename2 = product_occurrences.rename({f"{value_col}_1": "item"})
-product_occurrences_2_rename2 = product_occurrences.rename({f"{value_col}_2": "item"})
+# Use consistent column naming based on the actual structure
+product_occurrences_1_rename2 = product_occurrences.rename({"item": f"{value_col}_1"})
+product_occurrences_2_rename2 = product_occurrences.rename({"item": f"{value_col}_2"})

248-249: ⚠️ Potential issue

Column renaming issue with non-existent columns.

The code is attempting to rename columns item_1 and item_2 in unique_transactions, but according to the previous operations, this table only has columns [group_col, "item"]. This will likely cause runtime errors.

-left_table = unique_transactions.rename({"item_1": "item"})
-right_table = unique_transactions.rename({"item_2": "item"})
+# These tables need correct column names for the join operation
+left_table = unique_transactions.rename({"item": "item_1"})
+right_table = unique_transactions.rename({"item": "item_2"})

290-295: ⚠️ Potential issue

Inconsistent column naming in product_occurrences renames.

These rename operations reference columns occurrences_1, prob_1, etc. that don't appear to exist in the product_occurrences table based on the previous operations. Check these column names to ensure they match what's available.

-product_occurrences_1_rename = product_occurrences.rename(
-    {"item_1": "item", "occurrences_1": "occurrences", "prob_1": "occurrence_probability"},
-)
-product_occurrences_2_rename = product_occurrences.rename(
-    {"item_2": "item", "occurrences_2": "occurrences", "prob_2": "occurrence_probability"},
-)
+# Use consistent column naming based on what exists in product_occurrences
+product_occurrences_1_rename = product_occurrences.rename(
+    {"item": "item_1"}
+)
+product_occurrences_2_rename = product_occurrences.rename(
+    {"item": "item_2"}
+)
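Much of the back-and-forth above hinges on the direction of the rename mapping: pandas DataFrame.rename maps old name → new name, while in recent ibis versions Table.rename takes the reverse, a {new_name: old_name} mapping, so a diff that looks backwards in one API can be correct in the other. A pandas-only sketch of the old → new direction (the ibis direction is noted in a comment, not executed here):

```python
import pandas as pd

# Mirrors the product_occurrences schema discussed above; values are invented.
product_occurrences = pd.DataFrame({
    "item": ["a", "b"],
    "occurrences": [20, 10],
    "occurrence_probability": [0.4, 0.2],
})

# pandas: the mapping is {old_name: new_name}.
renamed = product_occurrences.rename(columns={"item": "item_1"})

# ibis (not run here): Table.rename takes {new_name: old_name},
# e.g. product_occurrences.rename({"item_1": "item"})
print(renamed.columns.tolist())
```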
🧹 Nitpick comments (1)
pyretailscience/analysis/product_association.py (1)

226-393: Comprehensive refactoring to use ibis instead of numpy/scipy.

The refactoring to use ibis operations is a good strategic move as it will likely improve performance for large datasets and enable integration with various data sources. However, ensure that all column naming is consistent throughout the implementation to prevent runtime errors.

Some key recommendations:

  1. Verify all column references match the actual columns available in each table
  2. Test thoroughly with both pandas DataFrames and ibis Tables
  3. Consider adding more inline comments explaining the complex data transformation steps
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d7742f0 and 59a63dc.

📒 Files selected for processing (2)
  • pyretailscience/analysis/product_association.py (6 hunks)
  • tests/analysis/test_product_association.py (1 hunks)
🔇 Additional comments (10)
tests/analysis/test_product_association.py (3)

228-228: Error message updated to reflect current implementation.

The error message has been correctly updated to specify that only number_of_combinations=2 is currently supported, which aligns with the changes in the ProductAssociation class implementation.


235-235: Error message updated for consistency.

This error message has been consistently updated to match the first instance, reflecting the current implementation constraints.


1-306:

✅ Verification successful

Test coverage concern due to removed test methods.

Two test methods (test_calc_association_all_pair_items and test_calc_association_target_pair_items) have been removed from this file. While this aligns with the refactoring to use ibis instead of numpy/scipy, we should ensure that the new implementation is still adequately tested with similar test coverage.


🏁 Script executed:

#!/bin/bash
# Check for any other tests that might cover the removed functionality

echo "Checking for other tests that might test pair combinations..."
rg -i "test.*pair" tests/

Length of output: 461


Pair Association Test Coverage Verified

The functionality previously validated by the removed tests now appears to be covered. Our search found that the pair association logic is exercised in tests/test_range_planning.py—specifically, tests that verify the correct generation of pairs dataframes when exclude_same_transaction_products is set to True and False. This indicates that the new implementation has been adequately tested. Please review the tests in tests/test_range_planning.py to confirm that all edge cases for pair associations are covered.

🧰 Tools
🪛 Ruff (0.8.2)

15-15: Use @pytest.fixture over @pytest.fixture()

Remove parentheses

(PT001)


26-26: Use @pytest.fixture over @pytest.fixture()

Remove parentheses

(PT001)


61-61: Use @pytest.fixture over @pytest.fixture()

Remove parentheses

(PT001)

pyretailscience/analysis/product_association.py (7)

39-44: Good addition of ibis import and constant definition.

Adding the ibis import and defining SUPPORTED_COMBINATIONS as a constant improves code maintainability by centralizing the supported value.


92-97: Enhanced flexibility with expanded type annotation.

Updating the parameter type to accept both pandas DataFrame and ibis Table enhances the flexibility of the class, allowing it to work with different data sources.


213-214: Simplified validation with clear error message.

The error message now clearly states that only number_of_combinations=2 is supported, which is more explicit than the previous message.


226-228: Good defensive type handling.

The code correctly handles both pandas DataFrames and ibis Tables by converting DataFrames to ibis Tables when needed, ensuring a consistent implementation.


370-377: Comprehensive approach to association rule generation.

The code correctly executes ibis expressions and combines direct and inverse pairs to create a comprehensive set of association rules. The sorting and index resetting ensure consistent output formatting.


379-393: Target item filtering and column selection are appropriate.

The implementation correctly filters by target_item when specified and selects all relevant columns for the final result. This maintains compatibility with the previous implementation while leveraging ibis for improved performance.


1-394:

✅ Verification successful

Verify test coverage for the refactored implementation.

With the removal of two test methods from the test file and the complete refactoring of the implementation, it's important to ensure adequate test coverage for the new ibis-based logic.


🏁 Script executed:

#!/bin/bash
# Let's check test coverage for the ProductAssociation class

echo "Checking for tests of the ProductAssociation class..."
rg -l "ProductAssociation" tests/

echo "Checking coverage of key ibis operations..."
rg "ibis\.(join|group_by|aggregate|filter)" pyretailscience/analysis/product_association.py

Length of output: 617


Test Coverage for Product Association Refactoring Verified

The refactored implementation is supported by tests. Notably:

  • The test file tests/analysis/test_product_association.py confirms that the ProductAssociation class is being tested.
  • Key IBIS operations such as join, group_by, aggregate, and filter are actively used in the implementation, ensuring that the new data-processing paths are exercised.

Please ensure that all new branches—including error handling and parameter validations—are comprehensively covered within the tests.

Comment on lines +308 to +294
product_pairs = product_pairs.mutate(
confidence=product_pairs["cooccurrences"] / product_pairs["occurrences_1"],
uplift=product_pairs["support"] / (product_pairs["prob_1"] * product_pairs["prob_2"]),
)

This comment was marked as resolved.

@mayurkmmt mayurkmmt force-pushed the feature/product-association branch from 59a63dc to 2743f04 Compare March 11, 2025 08:59
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (5)
pyretailscience/analysis/product_association.py (5)

241-246: ⚠️ Potential issue

Column name references in rename operations might be incorrect

The rename operations reference columns that don't appear to exist in product_occurrences. Based on line 221-229, product_occurrences should have columns ["item", "occurrences", "occurrence_probability"], not the columns being renamed.

Revisit these rename operations to ensure they match the actual column structure of the dataframe.


273-278: ⚠️ Potential issue

Inconsistent column naming in rename operations

Similar to the previous rename operations, these don't match the expected columns in product_occurrences. Additionally, there's a mismatch with the columns referenced in the uplift calculation below (prob_1, prob_2).

Ensure the column naming is consistent throughout the code.


291-294: ⚠️ Potential issue

Issues with uplift calculation column references

The uplift calculation uses prob_1 and prob_2 columns which don't appear to be properly defined in the preceding code. This was previously flagged in a review comment and still needs to be addressed.

Update the uplift calculation to use the correct column references based on your join/rename strategy:

 product_pairs = product_pairs.mutate(
     confidence=product_pairs["cooccurrences"] / product_pairs["occurrences_1"],
-    uplift=product_pairs["support"] / (product_pairs["prob_1"] * product_pairs["prob_2"]),
+    uplift=product_pairs["support"] / (product_pairs["occurrence_probability_x"] * product_pairs["occurrence_probability_y"]),
 )

Adjust the column names to match your actual column structure after the joins.


322-325: ⚠️ Potential issue

Same uplift calculation issue in inverse pairs

The same issue with column references exists in the inverse_pairs uplift calculation.

 inverse_pairs = inverse_pairs.mutate(
     confidence=inverse_pairs["cooccurrences"] / inverse_pairs["occurrences_1"],
-    uplift=inverse_pairs["support"] / (inverse_pairs["prob_1"] * inverse_pairs["prob_2"]),
+    uplift=inverse_pairs["support"] / (inverse_pairs["occurrence_probability_x"] * inverse_pairs["occurrence_probability_y"]),
 )

231-232: ⚠️ Potential issue

Potential column name mismatch in rename operations

The rename operations here don't seem to match the actual column structure. Based on the preceding code, unique_transactions has columns [group_col, "item"], not "item_1" or "item_2".

Verify the column names and adjust the rename operations, possibly:

-left_table = unique_transactions.rename({"item_1": "item"})
-right_table = unique_transactions.rename({"item_2": "item"})
+# If trying to create aliases for self-join, create copies instead
+left_table = unique_transactions
+right_table = unique_transactions
🧹 Nitpick comments (1)
pyretailscience/analysis/product_association.py (1)

212-218: Consider using the table's unnest method directly

The code is using the internal ibis.expr.operations.Unnest operation directly. Consider using the table's unnest() method instead for better readability and to avoid depending on internal implementation details.

 unique_transactions = unique_transactions.mutate(
-    item=ibis.expr.operations.Unnest(unique_transactions["products"]),
+    item=unique_transactions["products"].unnest(),
 ).drop("products")
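The pandas analogue of that unnest step is DataFrame.explode, which may help when checking behavior against a DataFrame input (the data below is invented):

```python
import pandas as pd

# One row per transaction, with a list-valued products column.
unique_transactions = pd.DataFrame({
    "customer_id": [1, 2],
    "products": [["a", "b"], ["a"]],
})

# Explode the list column into one row per item, mirroring ibis's unnest.
exploded = (
    unique_transactions.explode("products")
    .rename(columns={"products": "item"})
    .reset_index(drop=True)
)
print(exploded)
```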
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 59a63dc and 2743f04.

📒 Files selected for processing (2)
  • pyretailscience/analysis/product_association.py (5 hunks)
  • tests/analysis/test_product_association.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/analysis/test_product_association.py
🔇 Additional comments (10)
pyretailscience/analysis/product_association.py (10)

38-40: Good update to include ibis import

The addition of the ibis import is appropriate for the refactoring of the class to support both Pandas DataFrame and ibis Table inputs.


44-45: Simplified approach with constant instead of parameter

Replacing the number_of_combinations parameter with a constant SUPPORTED_COMBINATIONS = 2 streamlines the API. This simplification makes the code more maintainable by removing parameter validation logic.


88-89: Good enhancement of type signature

The updated type annotation correctly reflects that the class now supports both Pandas DataFrame and ibis Table inputs.


102-103: Docstring updated to match new type signature

The documentation properly reflects the updated parameter type to include both DataFrame and ibis Table.


147-148: Method signature update consistent with class signature

The static method signature properly mirrors the class initialization parameter type.


162-164: Docstring updated to match method signature

Good job updating the docstring to match the new parameter type.


209-211: Good pattern for handling both input types

This is a clean approach to handle both pandas DataFrame and ibis Table inputs, ensuring that the rest of the implementation only needs to deal with ibis Tables.


356-360: Good final result construction

The concatenation of the direct and inverse pairs, sorting, and index reset produces a well-structured final result.


362-364: Clean target item filtering

The conditional filter for target_item is implemented clearly and efficiently.


365-376: Proper column selection for final output

The final column selection ensures a consistent output schema with all the important metrics.

Comment on lines 212 to 217
unique_transactions = (
df.group_by(group_col).aggregate(products=lambda t, col=value_col: t[col].collect()).order_by(group_col)
)
unique_transactions = unique_transactions.mutate(
item=ibis.expr.operations.Unnest(unique_transactions["products"]),
).drop("products")
Contributor

This looks like it's doing something very different from the CTE in the example query, e.g.:

unique_transactions AS (
    SELECT DISTINCT
        customer_id,
        product_id
    FROM transactions
)

Based on the Ibis docs I think this should do the same thing as the SQL

Suggested change
unique_transactions = (
df.group_by(group_col).aggregate(products=lambda t, col=value_col: t[col].collect()).order_by(group_col)
)
unique_transactions = unique_transactions.mutate(
item=ibis.expr.operations.Unnest(unique_transactions["products"]),
).drop("products")
unique_transactions = (
df.select(df[group_col], df.product_id.name("item")).distinct()
)
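For comparison, the pandas equivalent of that SELECT DISTINCT is drop_duplicates on the two key columns (column names follow the SQL above; the data is invented):

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "product_id": ["a", "a", "b", "a"],
})

# Distinct (customer, product) rows, with product_id renamed to item.
unique_transactions = (
    transactions[["customer_id", "product_id"]]
    .drop_duplicates()
    .rename(columns={"product_id": "item"})
    .reset_index(drop=True)
)
print(unique_transactions)
```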

right_table,
predicates=[left_table[group_col] == right_table[group_col]],
)
merged_df = merged_df.filter(merged_df["item_1"] < merged_df["item_2"])
Contributor

I think you definitely want to do this as part of the join. Otherwise, I believe some DBs will do this after the join.

merged_df,
product_occurrences_2,
predicates=[merged_df["item_2"] == product_occurrences_2["item_2"]],
).order_by([group_col, "item_1", "item_2"])
Contributor

We should avoid order_by on the non-final data. This could end up sorting millions or billions of rows of data.
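The same principle shown as a pandas sketch: leave intermediate frames unsorted and sort only the final, already-aggregated result, which is far smaller (toy data, invented column names):

```python
import pandas as pd

# Small final result; intermediates were never sorted.
pairs = pd.DataFrame({
    "item_1": ["b", "a", "a"],
    "item_2": ["c", "c", "b"],
    "uplift": [1.1, 0.9, 1.5],
})

# A single sort on the aggregated output is cheap; sorting before
# joins or aggregations can force the engine to order huge row sets.
final = pairs.sort_values(["item_1", "item_2"]).reset_index(drop=True)
print(final)
```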

@mayurkmmt mayurkmmt force-pushed the feature/product-association branch from d925e64 to 79b8cc6 Compare March 12, 2025 11:22
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (6)
pyretailscience/analysis/product_association.py (6)

222-223: ⚠️ Potential issue

Incorrect column renaming

The rename operations here appear problematic because unique_transactions doesn't contain columns named "item_1" or "item_2". After line 210, it only has columns [group_col, "item"].

These renames won't work as intended. Consider:

-left_table = unique_transactions.rename({"item_1": "item"})
-right_table = unique_transactions.rename({"item_2": "item"})
+left_table = unique_transactions.rename({"item": "item_1"})
+right_table = unique_transactions.rename({"item": "item_2"})

233-238: ⚠️ Potential issue

Incorrect column renaming in product_occurrences

The rename operations for product_occurrences_1 and product_occurrences_2 are using column names that don't exist in product_occurrences, which only has ["item", "occurrences", "occurrence_probability"].

This will cause runtime errors. Fix the renames:

product_occurrences_1 = product_occurrences.rename(
-   {"item_1": "item", "occurrences_1": "occurrences", "occurrence_probability_1": "occurrence_probability"},
+   {"item": "item_1", "occurrences": "occurrences_1", "occurrence_probability": "occurrence_probability_1"},
)
product_occurrences_2 = product_occurrences.rename(
-   {"item_2": "item", "occurrences_2": "occurrences", "occurrence_probability_2": "occurrence_probability"},
+   {"item": "item_2", "occurrences": "occurrences_2", "occurrence_probability": "occurrence_probability_2"},
)

279-282: ⚠️ Potential issue

Uplift calculation uses undefined columns

The uplift calculation uses "prob_1" and "prob_2" columns, but these appear to be undefined or incorrectly referenced based on previous rename operations.

This is the same issue mentioned in previous review comments. Fix the column references:

product_pairs = product_pairs.mutate(
    confidence=product_pairs["cooccurrences"] / product_pairs["occurrences_1"],
-   uplift=product_pairs["support"] / (product_pairs["prob_1"] * product_pairs["prob_2"]),
+   uplift=product_pairs["support"] / (product_pairs["occurrence_probability_1"] * product_pairs["occurrence_probability_2"]),
)

Ensure that the column names match those created in your rename operations.


315-339: ⚠️ Potential issue

Column selection refers to non-existent columns

Lines 315-339 select columns using names that may not match the actual columns in the result and inverse_pairs DataFrames.

Ensure that all column references match the actual column names in your tables. For example, in line 315:

-result = result.rename({f"{value_col}_1": "item_1", f"{value_col}_2": "item_2"})
+# If result already has "item_1" and "item_2" columns, this rename is unnecessary
+# Otherwise, make sure you're renaming from the correct source column names

Similarly, verify that all column selections in the following lists refer to columns that actually exist in the respective DataFrames.


261-266: ⚠️ Potential issue

Renamed columns don't match existing columns

Similar to the previous issues, these renames use column names like "prob_1" and "prob_2" that don't appear to exist in product_occurrences.

Fix the renames to use the correct column names:

product_occurrences_1_rename = product_occurrences.rename(
-   {"item_1": "item", "occurrences_1": "occurrences", "prob_1": "occurrence_probability"},
+   {"item": "item_1", "occurrences": "occurrences_1", "occurrence_probability": "prob_1"},
)
product_occurrences_2_rename = product_occurrences.rename(
-   {"item_2": "item", "occurrences_2": "occurrences", "prob_2": "occurrence_probability"},
+   {"item": "item_2", "occurrences": "occurrences_2", "occurrence_probability": "prob_2"},
)

288-295: ⚠️ Potential issue

Incorrect column references in inverse_pairs

The rename operation for inverse_pairs uses column references like f"{value_col}_1" and f"{value_col}_2" but these columns don't exist in the result DataFrame at this point.

Fix the column references:

inverse_pairs = result.rename(
    {
-       f"{value_col}_1": "item_1",
-       f"{value_col}_2": "item_2",
+       "item_1": "item_2",
+       "item_2": "item_1",
        "occurrences_1": "occurrences_2",
        "occurrences_2": "occurrences_1",
    },
)
🧹 Nitpick comments (3)
pyretailscience/analysis/product_association.py (3)

207-212: Potential performance issue with early execution

The total_transactions calculation forces an immediate execution with .execute(), breaking the lazy evaluation chain that ibis typically provides. This could negatively impact performance for large datasets.

Consider making this part of the lazy evaluation chain by using:

-total_transactions = unique_transactions[group_col].nunique().execute()
+total_transactions = unique_transactions[group_col].nunique().name("total_count")

This matches the suggestion in previous comments and would allow the database to optimize the entire query.


85-141: Documentation needs update to reflect Ibis support

The class docstring (lines 46-82) still refers only to pandas.DataFrame input, but the implementation now supports both pandas DataFrame and ibis Table.

Update the class docstring to reflect the support for ibis.Table:

    Args:
-       df (pandas.DataFrame): The input DataFrame containing transaction data.
+       df (pandas.DataFrame | ibis.Table): The input DataFrame or ibis Table containing transaction data.

Also, consider adding an example using ibis.Table to demonstrate the new functionality.


207-364: Inconsistent naming conventions for column suffixes

Throughout the implementation, there's an inconsistent use of suffix naming conventions. Sometimes "_1"/"_2" are used, and other times you use "_x"/"_y" or other variations.

Standardize on a single naming convention for column suffixes (either "_1"/"_2" or "_x"/"_y") to improve code readability and maintenance. This will make it easier to track which columns are being used throughout the complex joins and transformations.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2743f04 and 79b8cc6.

📒 Files selected for processing (1)
  • pyretailscience/analysis/product_association.py (5 hunks)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (3)
pyretailscience/analysis/product_association.py (3)

316-340: ⚠️ Potential issue

Column selection inconsistency in result and inverse_pairs.

The column selection references f"{value_col}_1" and f"{value_col}_2", but after renaming they should be "item_1" and "item_2".

Fix the column selection to match the renamed columns:

result = result.rename({f"{value_col}_1": "item_1", f"{value_col}_2": "item_2"})
result = result[
    [
-       f"{value_col}_1",
-       f"{value_col}_2",
+       "item_1",
+       "item_2",
        "occurrences_1",
        "occurrences_2",
        "cooccurrences",
        "support",
        "confidence",
        "uplift",
    ]
]
inverse_pairs = inverse_pairs[
    [
-       f"{value_col}_1",
-       f"{value_col}_2",
+       "item_1",
+       "item_2",
        "occurrences_1",
        "occurrences_2",
        "cooccurrences",
        "support",
        "confidence",
        "uplift",
    ]
]

354-365: ⚠️ Potential issue

Incorrect column names in final result selection.

The final result selection uses f"{value_col}_1" and f"{value_col}_2" but should use "item_1" and "item_2" after renaming.

Correct the column selection:

return final_result[
    [
-       f"{value_col}_1",
-       f"{value_col}_2",
+       "item_1",
+       "item_2",
        "occurrences_1",
        "occurrences_2",
        "cooccurrences",
        "support",
        "confidence",
        "uplift",
    ]
]

298-310: ⚠️ Potential issue

Incorrect column references in join predicates for inverse pairs.

The join operations use column references like f"{value_col}_1" that don't match the renamed columns, which would cause runtime errors.

Fix the join predicates to use the correct column names:

product_occurrences_1_rename2 = product_occurrences.rename(
-   {f"{value_col}_1": value_col}
+   {"item_1": value_col}
)
product_occurrences_2_rename2 = product_occurrences.rename(
-   {f"{value_col}_2": value_col}
+   {"item_2": value_col}
)

inverse_pairs = ibis.join(
    inverse_pairs,
    product_occurrences_1_rename2,
-   predicates=[inverse_pairs[f"{value_col}_1"] == product_occurrences_1_rename2[f"{value_col}_1"]],
+   predicates=[inverse_pairs["item_1"] == product_occurrences_1_rename2["item_1"]],
)
inverse_pairs = ibis.join(
    inverse_pairs,
    product_occurrences_2_rename2,
-   predicates=[inverse_pairs[f"{value_col}_2"] == product_occurrences_2_rename2[f"{value_col}_2"]],
+   predicates=[inverse_pairs["item_2"] == product_occurrences_2_rename2["item_2"]],
)
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 79b8cc6 and 35f0677.

📒 Files selected for processing (1)
  • pyretailscience/analysis/product_association.py (5 hunks)
🔇 Additional comments (5)
pyretailscience/analysis/product_association.py (5)

86-96: Good type annotation update to support ibis Tables.

The parameter type has been correctly updated to accommodate both Pandas DataFrames and ibis Tables, aligning with the PR objective to convert functionality to the Ibis framework.


207-209: Appropriate conversion from pandas to ibis table.

The code correctly checks if the input is a pandas DataFrame and converts it to an ibis memtable when needed.


210-220: Simplified transaction aggregation with ibis.

The implementation uses ibis expressions to get unique transactions and calculate total transaction count, which is more scalable for large datasets compared to the previous implementation.


222-233: Good use of ibis join predicates for generating pairs.

The predicate left_table["item_1"] < right_table["item_2"] efficiently ensures we only get each pair once in a single direction.


211-211: Efficient use of ibis for total transaction count.

Using ibis's nunique() method with proper naming is a good approach for calculating the total transaction count efficiently.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
pyretailscience/analysis/product_association.py (4)

234-239: ⚠️ Potential issue

Column naming inconsistency in rename operations.

The rename operations reference columns like "item_1", "occurrences_1", and "occurrence_probability_1" that don't appear to exist in the product_occurrences table at this point.

These issues are similar to previous review comments about column naming inconsistencies. Ensure that you're renaming columns that actually exist in the source table.


259-264: ⚠️ Potential issue

Column naming mismatch in product_occurrences rename operations.

The rename operations reference columns like "prob_1" and "prob_2" that don't appear to be created earlier in the code. This is likely to cause runtime errors.

This issue is similar to a previous review comment about column naming. The source table product_occurrences has "occurrence_probability" but not "prob_1" or "prob_2".


275-278: ⚠️ Potential issue

Column references don't match renamed columns.

The uplift calculation references "prob_1" and "prob_2" columns, but these don't match the renamed columns from earlier operations. This will likely cause runtime errors.

This issue is similar to a previous review comment. The columns should be renamed consistently throughout the code. If you're using "occurrence_probability_1" and "occurrence_probability_2" in some places, you should use the same names here.


284-292: ⚠️ Potential issue

Inconsistent column naming in inverse_pairs creation.

The inverse_pairs creation references "prob_1" and "prob_2" columns that may not exist in the result dataframe.

Similar to previous issues with column naming. Ensure consistent column naming throughout the code, especially for probability columns.

🧹 Nitpick comments (3)
pyretailscience/analysis/product_association.py (3)

209-219: Simplify the group_by aggregate operation.

The trailing comma in line 215 is unnecessary and should be removed for consistency with coding standards.

            .aggregate(
-                occurrences=lambda t: t[group_col].nunique(),
+                occurrences=lambda t: t[group_col].nunique()
            )

253-254: Remove trailing comma in single-item mutate calls.

Similar to line 215, the trailing comma in line 253 should be removed for consistency.

        cooccurrences = cooccurrences.mutate(
-            support=cooccurrences.cooccurrences / total_transactions,
+            support=cooccurrences.cooccurrences / total_transactions
        )

322-327: Consider optimizing the execution flow.

The current implementation executes the two ibis expressions separately and then concatenates them with pandas. Consider if this can be done within ibis to allow for backend optimization.

- result = result.execute()
- inverse_pairs = inverse_pairs.execute()
-
- final_result = (
-     pd.concat([result, inverse_pairs], ignore_index=True)
-     .sort_values(by=["item_1", "item_2"])
-     .reset_index(drop=True)
- )
+ combined = ibis.union(result, inverse_pairs)
+ combined = combined.order_by(["item_1", "item_2"])
+ final_result = combined.execute()
+ final_result = final_result.reset_index(drop=True)
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 35f0677 and 2b2a018.

📒 Files selected for processing (2)
  • pyretailscience/analysis/product_association.py (5 hunks)
  • tests/analysis/test_product_association.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/analysis/test_product_association.py
🔇 Additional comments (3)
pyretailscience/analysis/product_association.py (3)

85-85: Good type annotation enhancement.

The parameter type has been properly updated to accept both pandas DataFrame and ibis Table, which enhances flexibility in the interface.


206-208: Good practice handling different input types.

Converting pandas DataFrame to ibis Table is a clean approach that allows the rest of the code to operate uniformly regardless of input type.


327-329:

✅ Verification successful

Target item filtering needs verification.

The filtering operation at line 329 assumes that the column name will be f"{value_col}_1", which should be verified against the actual column names in the final result.


🏁 Script executed:

#!/bin/bash
# Check how target_item filtering is handled in similar code
rg -A 2 -B 2 "target_item.*reset_index" pyretailscience/

Length of output: 583


Verified: Target item filtering logic is valid and consistent.

The shell script output confirms that the column renaming (line 327) correctly sets the column name as f"{value_col}_1" before filtering (line 329). This verifies that the filter condition is applied on the intended column, and the target item filtering logic is implemented correctly.

@mayurkmmt force-pushed the feature/product-association branch from 2b2a018 to 23641c8 on March 18, 2025 at 13:41

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (4)
pyretailscience/analysis/product_association.py (4)

234-239: ⚠️ Potential issue

Rename operations reference non-existent columns.

Similar to the previous issue, these rename operations reference columns that don't exist in product_occurrences. The columns "item_1", "occurrences_1", and "occurrence_probability_1" (and their "_2" counterparts) aren't defined in the source table.

These operations would fail at runtime. Consider correcting the rename operations to match the actual column structure:

- product_occurrences_1 = product_occurrences.rename(
-     {"item_1": value_col, "occurrences_1": "occurrences", "occurrence_probability_1": "occurrence_probability"},
- )
+ product_occurrences_1 = product_occurrences.rename(
+     {value_col: "item_1", "occurrences": "occurrences_1", "occurrence_probability": "occurrence_probability_1"},
+ )

275-278: ⚠️ Potential issue

Column naming mismatch in uplift calculation.

The uplift calculation references "prob_1" and "prob_2" columns, which don't match the renamed columns defined earlier (they should be "occurrence_probability_1" and "occurrence_probability_2").

Fix the column references in the uplift calculation to match the renamed columns:

product_pairs = product_pairs.mutate(
    confidence=product_pairs["cooccurrences"] / product_pairs["occurrences_1"],
-   uplift=product_pairs["support"] / (product_pairs["prob_1"] * product_pairs["prob_2"]),
+   uplift=product_pairs["support"] / (product_pairs["occurrence_probability_1"] * product_pairs["occurrence_probability_2"]),
)

259-264: ⚠️ Potential issue

Column naming inconsistency in rename operations.

The rename operations reference "prob_1" and "prob_2", but these columns don't appear to be defined. Earlier code uses "occurrence_probability" instead.

Fix the column names in the rename operations to match the actual column structure:

- product_occurrences_1_rename = product_occurrences.rename(
-     {"item_1": value_col, "occurrences_1": "occurrences", "prob_1": "occurrence_probability"},
- )
+ product_occurrences_1_rename = product_occurrences.rename(
+     {value_col: "item_1", "occurrences": "occurrences_1", "occurrence_probability": "occurrence_probability_1"},
+ )

221-232: ⚠️ Potential issue

Rename operations and join predicates reference columns that don't exist.

The rename operations in lines 221-222 attempt to rename "item_1" and "item_2" columns that don't exist in unique_transactions at this point. Additionally, the join predicates in lines 228-229 reference these same non-existent columns.

This appears to be a critical implementation bug that would cause runtime errors. Consider either:

  1. Using a mutate operation to create these columns first:
- left_table = unique_transactions.rename({"item_1": value_col})
- right_table = unique_transactions.rename({"item_2": value_col})
+ left_table = unique_transactions.mutate(item_1=unique_transactions[value_col])
+ right_table = unique_transactions.mutate(item_2=unique_transactions[value_col])
  2. Or directly using the value_col in the join predicates:
predicates=[
    left_table[group_col] == right_table[group_col],
-   left_table["item_1"] < right_table["item_2"],
+   left_table[value_col] < right_table[value_col],
],
🧹 Nitpick comments (2)
pyretailscience/analysis/product_association.py (2)

309-323: Consider deferring execute() call for better performance.

Calling .execute() forces immediate evaluation of the query, which can impact performance with large datasets. In some cases, it might be better to return the ibis expression and let the caller decide when to execute it.

Consider making the execution optional:

- final_result = result.execute().sort_values(by=["item_1", "item_2"]).reset_index(drop=True)
+ # Either return the ibis expression
+ if not execute:
+     return result
+ # Or execute it
+ final_result = result.execute().sort_values(by=["item_1", "item_2"]).reset_index(drop=True)

This would require adding an execute: bool = True parameter to the method signature.


307-307: Potential performance improvement for min_confidence filtering.

The min_confidence filter is applied after potentially expensive operations like union and other calculations. Since confidence values differ between direct and inverse pairs, this placement is correct but could be optimized.

Consider applying the confidence filter for target_item queries before any union operations, since they don't need inverse pairs:

result = result.filter(result.uplift >= min_uplift)

+ # Apply confidence filter early when we don't need inverse pairs
+ if target_item is not None:
+     result = result.filter(result.confidence >= min_confidence)
+

if target_item is not None:
    result = result.filter(result["item_1"] == target_item)
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 2b2a018 and 23641c8.

📒 Files selected for processing (2)
  • pyretailscience/analysis/product_association.py (5 hunks)
  • tests/analysis/test_product_association.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/analysis/test_product_association.py
🔇 Additional comments (5)
pyretailscience/analysis/product_association.py (5)

38-38: Added ibis import reflects the refactored implementation.

The addition of the ibis import supports the new implementation that accepts both pandas DataFrames and ibis Tables as inputs, aligning with the PR objective of converting functionality to the Ibis framework.


85-85: Type annotations correctly updated to support both pandas and ibis inputs.

The type annotations have been properly updated to accept both DataFrame and ibis.Table types, making the implementation more flexible for different data sources. This change is consistent throughout the class and well-documented in the docstrings.

Also applies to: 99-99, 144-145, 160-160


206-208: Appropriate data type conversion.

Converting pandas DataFrames to ibis memtables when needed ensures backward compatibility while enabling the Ibis-based implementation.


282-284: Optimized filtering for target_item.

The code efficiently applies target_item filtering early in the process if specified, avoiding unnecessary calculations for inverse pairs. This is a performance improvement over computing everything and filtering later.

Also applies to: 285-306


209-219:

❓ Verification inconclusive

Verify unique transactions logic.

The implementation of unique_transactions has changed significantly from the previous implementation. This code doesn't seem to be doing any deduplication or ensuring uniqueness per transaction. If transactions can contain multiple entries for the same product, this could affect your association calculations.


🏁 Script executed:

#!/bin/bash
# Check how unique_transactions was implemented in the previous version
rg -A 5 -B 5 "unique_transactions" --glob "*.py" | grep -v "pyretailscience/analysis/product_association.py"

Length of output: 106


Action: Verify Transaction Deduplication Logic

The current implementation defines unique_transactions as a direct alias for df without any explicit deduplication step. While the aggregation later uses .nunique() on the group_col to count unique transactions, it’s not clear whether this meets your intended deduplication requirements—especially if individual transactions might list the same product multiple times.

  • The previous version’s implementation for managing unique transactions wasn’t located by the automated search; please manually verify if a deduplication operation existed or was intentionally removed.
  • Confirm if df is already preprocessed to ensure uniqueness, or if an explicit operation (e.g., a drop_duplicates) should be applied before the aggregation.
  • Once verified, adjust the logic accordingly to prevent potential inaccuracies in association counts if duplicate entries per transaction are possible.

Comment on lines 282 to 284
if target_item is not None:
    result = result.filter(result["item_1"] == target_item)

Contributor


Related to the comment above, you'd then remove this code.

Suggested change
if target_item is not None:
    result = result.filter(result["item_1"] == target_item)

@mayurkmmt force-pushed the feature/product-association branch from 23641c8 to 89b66d5 on March 19, 2025 at 05:50

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (5)
pyretailscience/analysis/product_association.py (5)

221-222: ⚠️ Potential issue

Column renaming issue: attempting to rename non-existent columns.

The code attempts to rename columns "item_1" and "item_2" in unique_transactions, but this table only has [group_col, value_col] columns at this point.

-        left_table = unique_transactions.rename({"item_1": value_col})
-        right_table = unique_transactions.rename({"item_2": value_col})
+        left_table = unique_transactions.rename({value_col: "item_1"})
+        right_table = unique_transactions.rename({value_col: "item_2"})

282-285: ⚠️ Potential issue

Incorrect column references in uplift calculation.

The uplift calculation references columns prob_1 and prob_2 which aren't properly defined in previous rename operations.

-            uplift=product_pairs["support"] / (product_pairs["prob_1"] * product_pairs["prob_2"]),
+            uplift=product_pairs["support"] / (product_pairs["occurrence_probability_1"] * product_pairs["occurrence_probability_2"]),

300-308: ⚠️ Potential issue

Inconsistent column references in inverse pairs creation.

The inverse pairs creation references columns like prob_1 and prob_2 which aren't properly defined in the previous operations.

                item_1=result["item_2"],
                item_2=result["item_1"],
                occurrences_1=result["occurrences_2"],
                occurrences_2=result["occurrences_1"],
-                prob_1=result["prob_2"],
-                prob_2=result["prob_1"],
+                occurrence_probability_1=result["occurrence_probability_2"],
+                occurrence_probability_2=result["occurrence_probability_1"],
                confidence=result["cooccurrences"] / result["occurrences_2"],

241-246: ⚠️ Potential issue

Column renaming issue: attempting to rename non-existent columns.

Similar to the previous issue, these rename operations reference columns like "item_1", "occurrences_1", and "occurrence_probability_1" that don't exist in the product_occurrences table.

-        product_occurrences_1 = product_occurrences.rename(
-            {"item_1": value_col, "occurrences_1": "occurrences", "occurrence_probability_1": "occurrence_probability"},
-        )
-        product_occurrences_2 = product_occurrences.rename(
-            {"item_2": value_col, "occurrences_2": "occurrences", "occurrence_probability_2": "occurrence_probability"},
-        )
+        product_occurrences_1 = product_occurrences.rename({value_col: "item_1", "occurrences": "occurrences_1", "occurrence_probability": "occurrence_probability_1"})
+        product_occurrences_2 = product_occurrences.rename({value_col: "item_2", "occurrences": "occurrences_2", "occurrence_probability": "occurrence_probability_2"})

266-271: ⚠️ Potential issue

Column renaming issue: inconsistent column naming pattern.

These rename operations are redundant with lines 241-246 and use different column names ("prob_1" instead of "occurrence_probability_1"). This inconsistency will cause issues.

-        product_occurrences_1_rename = product_occurrences.rename(
-            {"item_1": value_col, "occurrences_1": "occurrences", "prob_1": "occurrence_probability"},
-        )
-        product_occurrences_2_rename = product_occurrences.rename(
-            {"item_2": value_col, "occurrences_2": "occurrences", "prob_2": "occurrence_probability"},
-        )
+        # Either remove these redundant rename operations and use product_occurrences_1/2 from earlier,
+        # or ensure consistent column naming between these and the earlier rename operations
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 23641c8 and 89b66d5.

📒 Files selected for processing (2)
  • pyretailscience/analysis/product_association.py (5 hunks)
  • tests/analysis/test_product_association.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/analysis/test_product_association.py
🔇 Additional comments (12)
pyretailscience/analysis/product_association.py (12)

38-39: LGTM: Added import for ibis framework.

The code now imports the ibis library, which is necessary for the new implementation using the Ibis framework instead of sparse matrix operations.


85-86: LGTM: Updated parameter type to support Ibis tables.

The constructor now accepts both pandas DataFrames and ibis Tables as input, enhancing flexibility.


99-100: LGTM: Updated docstring to reflect new parameter type.

The docstring correctly documents that the df parameter now accepts both pandas DataFrames and ibis Tables.


143-145: LGTM: Updated static method parameter type.

The _calc_association method now accepts both pandas DataFrames and ibis Tables as input, consistent with the constructor.


160-161: LGTM: Updated method docstring to reflect new parameter type.

The docstring correctly documents that the df parameter now accepts both pandas DataFrames and ibis Tables.


206-208: LGTM: Automatically convert pandas DataFrame to ibis table.

This is a good approach to handle both input types seamlessly.


209-211: LGTM: Efficiently capture unique transactions and total count.

This approach correctly creates a table of unique transactions and calculates the total count in an efficient manner.


212-219: LGTM: Product occurrences calculation using ibis operations.

The code effectively uses ibis group_by and aggregate operations to calculate product occurrences and filters by the minimum threshold.


224-239: LGTM: Well-structured join logic with target_item handling.

The join logic effectively handles both scenarios (with or without target_item) and uses appropriate ibis join features with named suffixes.


258-264: LGTM: Effective cooccurrences calculation and filtering.

The code correctly calculates cooccurrences by grouping on item pairs and applies proper filtering based on minimum thresholds.


287-299: LGTM: Appropriate filtering and column ordering.

The code correctly filters based on the minimum uplift threshold and defines a clear column order for the result.


311-327: LGTM: Final result formatting and column selection.

The code correctly filters the final results, executes the ibis expression, sorts and renames the columns, and returns a properly formatted pandas DataFrame.

Comment on lines +248 to +256
merged_df = merged_df.join(
    product_occurrences_1,
    predicates=[merged_df["item_1"] == product_occurrences_1["item_1"]],
)

-base_items = [target_item]
-if number_of_combinations == 2:  # noqa: PLR2004
-    if target_item is None:
-        base_items = unique_combo_df[value_col].cat.categories
-    items = [([unique_combo_df[value_col].cat.categories.get_loc(cat)], cat) for cat in base_items]
-elif number_of_combinations == 3:  # noqa: PLR2004
-    if target_item is None:
-        base_items = sorted(combinations(unique_combo_df[value_col].cat.categories, 2))
-    items = [
-        ([unique_combo_df[value_col].cat.categories.get_loc(i) for i in cats], cats) for cats in base_items
-    ]
merged_df = merged_df.join(
    product_occurrences_2,
    predicates=[merged_df["item_2"] == product_occurrences_2["item_2"]],
)


⚠️ Potential issue

Join predicates reference may not match renamed columns.

The join predicates reference columns like merged_df["item_1"] and product_occurrences_1["item_1"], but given the rename issues above, these columns might not exist as expected.

This will need to be fixed after addressing the rename operations in lines 241-246. The join predicates should reference the correctly named columns.

Comment on lines +273 to +280
product_pairs = cooccurrences.join(
product_occurrences_1_rename,
predicates=[cooccurrences["item_1"] == product_occurrences_1_rename["item_1"]],
)
product_pairs = product_pairs.join(
product_occurrences_2_rename,
predicates=[product_pairs["item_2"] == product_occurrences_2_rename["item_2"]],
)


⚠️ Potential issue

Inconsistency in join predicates with renamed tables.

The join operations use predicates that reference columns like cooccurrences["item_1"] and product_occurrences_1_rename["item_1"], but given the rename issues above, these columns might not match as expected.

This will need to be fixed after addressing the rename operations in lines 266-271.

@mayurkmmt merged commit e0f4878 into main on Mar 19, 2025
3 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Mar 21, 2025