Setup GCP BigQuery Integration Tests for Analysis Modules #244


Open: mayurkmmt wants to merge 3 commits into main from feature/bigquery-integration-analysis

Conversation

@mayurkmmt (Collaborator) commented May 14, 2025

feat: Setup GCP BigQuery Integration Tests for Analysis Modules

Summary by CodeRabbit

  • New Features

    • Added integration tests for all major analysis modules with Google BigQuery, ensuring compatibility and robustness.
    • Introduced a GitHub Actions workflow for running BigQuery integration tests on demand.
    • Added support for BigQuery and environment variable management in development dependencies.
  • Documentation

    • Expanded README with detailed instructions for running BigQuery integration tests, prerequisites, and GitHub Actions usage.
  • Tests

    • Comprehensive integration test suite added for cohort analysis, composite rank, cross shop, customer decision hierarchy, HML segmentation, product association, revenue tree, RFM segmentation, segstats segmentation, threshold segmentation, and haversine distance using BigQuery data.
  • Chores

    • Updated pre-commit configuration to exclude BigQuery integration tests from standard test runs (see the hook sketch after this summary).
    • Added environment variable placeholders for Google Cloud configuration.
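
For reference, the pre-commit exclusion mentioned above would typically look like the following hypothetical excerpt of .pre-commit-config.yaml; the repository's actual hook id, entry command, and other settings may differ.

repos:
  - repo: local
    hooks:
      - id: pytest
        name: pytest
        # Run the unit test suite but skip the BigQuery integration tests,
        # which require GCP credentials and network access.
        entry: uv run pytest --ignore=tests/integration/bigquery
        language: system
        pass_filenames: false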


coderabbitai bot commented May 14, 2025

Walkthrough

This update introduces comprehensive integration testing for BigQuery compatibility across multiple analysis modules. It adds BigQuery integration tests, supporting fixtures, and a GitHub Actions workflow for automated testing. Documentation and configuration files are updated to guide usage, manage dependencies, and exclude BigQuery tests from pre-commit checks. Minor code refactoring optimizes conditional logic.

Changes

File(s) Change Summary
.env_sample Added GCP_PROJECT_ID placeholder for Google Cloud Project configuration.
.github/workflows/bigquery-integration.yml Introduced a new GitHub Actions workflow for running parameterized BigQuery integration tests with authentication and concurrency control.
.pre-commit-config.yaml Updated pytest hook to ignore tests/integration/bigquery during pre-commit runs.
README.md Added documentation on running BigQuery integration tests, prerequisites, and workflow usage.
pyproject.toml Added ibis-framework[bigquery] and python-dotenv as development dependencies.
pyretailscience/segmentation/threshold.py Refactored conditional logic to use ibis.cases() with tuple arguments instead of chained .when() calls (see the sketch after this table).
tests/integration/bigquery/conftest.py Added fixtures to load environment variables and initialize BigQuery connections and tables for tests.
tests/integration/bigquery/test_cohort_analysis.py Added integration test for CohortAnalysis using BigQuery data.
tests/integration/bigquery/test_composite_rank.py Added integration tests for CompositeRank with BigQuery data, including tie handling and multiple ranking columns.
tests/integration/bigquery/test_cross_shop.py Added integration test for CrossShop analysis using BigQuery data and dynamic group column selection.
tests/integration/bigquery/test_customer_decision_hierarchy.py Added parameterized integration tests for CustomerDecisionHierarchy with various methods and configurations using BigQuery data.
tests/integration/bigquery/test_hml_segmentation.py Added parameterized integration tests for HMLSegmentation with different zero-value customer handling strategies using BigQuery data.
tests/integration/bigquery/test_product_association.py Added parameterized integration tests for ProductAssociation with and without a target item using BigQuery data.
tests/integration/bigquery/test_revenue_tree.py Added integration tests for RevenueTree and calc_tree_kpis with various grouping and quantity handling scenarios using BigQuery data.
tests/integration/bigquery/test_rfm_segmentation.py Added parameterized integration tests for RFMSegmentation with different current date inputs using BigQuery data.
tests/integration/bigquery/test_segstats_segmentation.py Added parameterized integration tests for SegTransactionStats with different aggregation and total calculation options using BigQuery data.
tests/integration/bigquery/test_threshold_segmentation.py Added parameterized integration tests for ThresholdSegmentation with different zero-value handling using BigQuery data.
tests/integration/bigquery/test_haversine.py Added integration test for the haversine_distance function, verifying output accuracy with fixed coordinates.
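
To illustrate the threshold.py change noted in the table above, here is a minimal sketch of the two styles on a stand-in ibis table. The column names and segment labels are illustrative, not the module's actual ones, and the chained builder is deprecated (removed in recent ibis releases).

import ibis

t = ibis.table({"customer_id": "int64", "unit_spend": "float64"}, name="transactions")

# Old style: a case builder with chained .when() calls
# (deprecated; may be unavailable in recent ibis versions)
segment_old = (
    ibis.case()
    .when(t.unit_spend > 100, "Heavy")
    .when(t.unit_spend > 50, "Medium")
    .else_("Light")
    .end()
)

# New style: ibis.cases() with (condition, result) tuples
segment_new = ibis.cases(
    (t.unit_spend > 100, "Heavy"),
    (t.unit_spend > 50, "Medium"),
    else_="Light",
)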

Sequence Diagram(s)

sequenceDiagram
    participant Developer
    participant GitHub Actions
    participant GCP
    participant BigQuery
    participant Test Suite

    Developer->>GitHub Actions: Trigger BigQuery Integration Tests (workflow_dispatch)
    GitHub Actions->>GitHub Actions: Set up Python, install dependencies
    GitHub Actions->>GCP: Authenticate using service account key
    GitHub Actions->>BigQuery: Set up SDK and environment
    GitHub Actions->>Test Suite: Run pytest with selected test suite
    Test Suite->>BigQuery: Query test data
    Test Suite-->>GitHub Actions: Report test results
    GitHub Actions-->>Developer: Display results in workflow logs
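
Pieced together from the fragments quoted in the review comments below, the workflow likely resembles this sketch; the job name, concurrency group, and auth action are assumptions rather than the file's exact contents.

name: BigQuery Integration Tests

on:
  workflow_dispatch:
    inputs:
      test_suite:
        description: "Test suite to run, or 'all' for everything"
        default: "all"

# Prevent overlapping runs against the same BigQuery project (assumed grouping key)
concurrency:
  group: bigquery-integration-${{ github.ref }}
  cancel-in-progress: true

jobs:
  bigquery-integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install uv Package
        run: pip install uv
      - name: Authenticate to GCP
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - name: Run Integration Tests
        env:
          TEST_SUITE: ${{ inputs.test_suite }}
          GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
        run: |
          uv run pytest tests/integration/bigquery -v \
            $(if [ "$TEST_SUITE" != "all" ]; then echo "-k $TEST_SUITE"; fi)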

Poem

In the warren of code, a new path is paved,
BigQuery tests hop in, robust and brave!
From fixtures to workflows, dependencies align,
Integration now sparkles—oh, isn’t it fine?
With docs and configs, the garden’s complete—
This bunny’s delighted, thumping its feet! 🐇✨


codecov bot commented May 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines Coverage Δ
pyretailscience/segmentation/threshold.py 95.12% <100.00%> (-0.12%) ⬇️

@mayurkmmt mayurkmmt force-pushed the feature/bigquery-integration-analysis branch from 6a78624 to af4e56f Compare May 14, 2025 07:09

@coderabbitai coderabbitai bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 9

🧹 Nitpick comments (26)
.env_sample (1)

1-2: Add documentation and ensure proper file format

Consider adding a comment to explain what GCP_PROJECT_ID is used for and how users can obtain their Google Cloud Project ID for running the integration tests.

+# Your Google Cloud Platform project ID for BigQuery integration tests
 GCP_PROJECT_ID =
+
tests/integration/bigquery/conftest.py (2)

30-34: Add error handling for missing table

Consider adding error handling in case the "test_data.transactions" table doesn't exist. This would provide a more helpful error message to users setting up the integration tests for the first time.

 @pytest.fixture(scope="session")
 def transactions_table(bigquery_connection):
     """Get the transactions table for testing."""
-    return bigquery_connection.table("test_data.transactions")
+    try:
+        table = bigquery_connection.table("test_data.transactions")
+        # Verify table exists by executing a simple query
+        table.limit(1).execute()
+        return table
+    except Exception as e:
+        logger.error(f"Failed to access test_data.transactions: {e}")
+        logger.error("Please ensure the test_data.transactions table exists in your BigQuery project")
+        raise

1-34: Consider adding test data schema documentation

It would be helpful to document the expected schema of the "test_data.transactions" table either in this file or in README documentation to help others set up the test environment correctly.

 """BigQuery integration test fixtures."""
 
+"""
+Expected schema for test_data.transactions table:
+- transaction_id: STRING
+- customer_id: STRING
+- transaction_dt: TIMESTAMP
+- store_id: STRING
+- product_id: STRING
+- quantity: NUMERIC
+- amount: NUMERIC
+"""
+
 import os
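
The bigquery_connection fixture referenced throughout these tests is not quoted in full; here is a minimal sketch of what it plausibly looks like with ibis and python-dotenv, assuming the GCP_PROJECT_ID variable from .env_sample.

import os

import ibis
import pytest
from dotenv import load_dotenv

# Load GCP_PROJECT_ID (and any other settings) from a local .env file
load_dotenv()


@pytest.fixture(scope="session")
def bigquery_connection():
    """Create one ibis BigQuery connection shared across the test session."""
    return ibis.bigquery.connect(project_id=os.environ["GCP_PROJECT_ID"])


@pytest.fixture(scope="session")
def transactions_table(bigquery_connection):
    """Get the transactions table for testing."""
    return bigquery_connection.table("test_data.transactions")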
tests/integration/bigquery/test_customer_decision_hierarchy.py (2)

65-83: Enhance initialization test with more specific assertions

While the test correctly verifies basic initialization with BigQuery data, it could be improved by adding more specific assertions about the initialized object's properties.

 assert cdh is not None
 assert isinstance(cdh, rp.CustomerDecisionHierarchy)
+assert hasattr(cdh, 'pairs_df'), "The CDH instance should have a pairs_df attribute"
+assert hasattr(cdh, 'distances_matrix'), "The CDH instance should have a distances_matrix attribute"

95-114: Add assertion for non-empty pairs DataFrame

The conditional check at line 108 is good for avoiding errors when no pairs are found, but it might mask potential issues if pairs_df is unexpectedly empty. Consider adding an assertion to verify that pairs are actually found.

 )

 assert cols.customer_id in pairs_df.columns
 assert "product_name" in pairs_df.columns
+assert len(pairs_df) > 0, "Expected pairs DataFrame to contain at least some rows"
README.md (1)

286-301: Comprehensive GitHub Actions instructions with inconsistent formatting

The section on running tests via GitHub Actions is well-documented, but there's inconsistent formatting in the list of required secrets.

 To run the workflow in GitHub Actions, add these secrets to your repository:
 
-  - `GCP_SA_KEY`: The entire JSON content of your GCP service account key file
-  - `GCP_PROJECT_ID`: Your GCP project ID
+ - `GCP_SA_KEY`: The entire JSON content of your GCP service account key file
+ - `GCP_PROJECT_ID`: Your GCP project ID

tests/integration/bigquery/test_cohort_analysis.py (2)

12-24: Enhance cohort computation test with more specific assertions

While the test correctly verifies basic functionality, it could be improved by adding more specific assertions about the structure of the resulting DataFrame.

 assert not result.empty, "Cohort table should not be empty for valid BigQuery data"
 assert isinstance(result, pd.DataFrame)
+# Verify the structure of the cohort table
+assert result.index.name == "min_period_shopped", "Index should be named 'min_period_shopped'"
+assert result.columns.dtype == 'int64', "Columns should represent period_since as integers"
+assert result.dtypes.nunique() == 1, "All columns should have the same data type"

38-49: Enhance percentage test with stronger assertions

The test verifies that percentage values are bounded by 1.0, but it could be improved by also checking that values are non-negative and that the first column has all 1.0 values (the base cohort).

 assert not result.empty
 assert result.max().max() <= 1.0, "Values should be <= 1 when percentage=True"
+assert result.min().min() >= 0.0, "Values should be >= 0 when percentage=True"
+if 0 in result.columns:
+    assert (result[0] == 1.0).all(), "First period should always be 100% (1.0)"
.github/workflows/bigquery-integration.yml (2)

38-47: Consider adding dependency caching

The workflow installs dependencies each run, which could be optimized by adding a caching step for Python dependencies.

 - name: Setup Python
   uses: actions/setup-python@v5
   with:
     python-version: "3.11"
+    cache: 'pip'
+    cache-dependency-path: |
+      pyproject.toml
+      requirements*.txt

 - name: Install uv Package
   run: |

61-67: Enhance test filtering for robustness

The conditional expression for filtering tests works but could be improved to handle potential issues with special characters in test names.

 - name: Run Integration Tests
   env:
     TEST_SUITE: ${{ inputs.test_suite }}
   run: |
-    uv run pytest tests/integration/bigquery -v \
-      $(if [ "$TEST_SUITE" != "all" ]; then echo "-k $TEST_SUITE"; fi)
+    if [ "$TEST_SUITE" = "all" ]; then
+      uv run pytest tests/integration/bigquery -v
+    else
+      uv run pytest tests/integration/bigquery -v -k "$TEST_SUITE"
+    fi
tests/integration/bigquery/test_composite_rank.py (1)

31-33: Reconsider using customer_id for ranking

Using customer_id as a ranking column doesn't make business sense since it's an identifier, not a metric. In a real-world scenario, you'd typically rank by business metrics like revenue, frequency, or recency.

Consider using meaningful business metrics for ranking instead of identifier columns:

     rank_cols = [
         ("unit_spend", "desc"),
-        ("customer_id", "desc"),
+        ("transaction_frequency", "desc"),  # A more meaningful business metric
         ("spend_per_customer", "desc"),
     ]
tests/integration/bigquery/test_product_association.py (2)

57-58: Add more descriptive error message for association test skip

The current skip message is not informative enough to help diagnose why the test was skipped.

     if len(expected_results_df) == 0:
-        pytest.skip("No association rules found in data")
+        pytest.skip("No association rules found in data - the dataset may not have enough transactions or product variety")

298-320: Consolidate data suitability checks into a utility function

The data suitability checks for category and brand association tests are very similar but repeated with slight differences. This creates potential for inconsistency and makes maintenance harder.

Extract the repeated logic into a utility function:

+    def _check_association_data_suitability(self, df, value_col, skip_message_prefix):
+        """Check if data is suitable for association testing and skip if not."""
+        # Check for missing values
+        if df[value_col].isna().mean() > MISSING_VALUE_THRESHOLD:
+            pytest.skip(f"{skip_message_prefix}: Too many missing values in {value_col}")
+        
+        # Check for enough unique values
+        unique_values = df[value_col].nunique()
+        if unique_values < MIN_CATEGORIES_REQUIRED:
+            pytest.skip(f"{skip_message_prefix}: Not enough unique values ({unique_values}) in {value_col}")
+        
+        # Check for multi-value transactions
+        value_counts = df.groupby("transaction_id")[value_col].nunique()
+        multi_value_transactions = (value_counts > 1).sum()
+        if multi_value_transactions < MIN_MULTI_CATEGORY_TRANSACTIONS:
+            pytest.skip(f"{skip_message_prefix}: Not enough transactions with multiple values ({multi_value_transactions})")
+        
+        return True
+
     def test_real_world_category_association(self, transactions_table):
         """Test real-world category association analysis using BigQuery data."""
         value_col = "category_1_name"
@@ -297,12 +318,7 @@

         df = query.execute()

-        unique_categories = df[value_col].nunique()
-        if unique_categories < MIN_CATEGORIES_REQUIRED:
-            pytest.skip(f"Not enough unique categories ({unique_categories}) for association testing")
-
-        category_counts = df.groupby("transaction_id")[value_col].nunique()
-        multi_category_transactions = (category_counts > 1).sum()
-
-        if multi_category_transactions < MIN_MULTI_CATEGORY_TRANSACTIONS:
-            pytest.skip(f"Not enough transactions with multiple categories ({multi_category_transactions})")
+        # Check if data is suitable for association testing
+        self._check_association_data_suitability(df, value_col, "Category association")

Use the same function in the brand association test with similar logic.

tests/integration/bigquery/test_revenue_tree.py (4)

11-14: Use descriptive constant names instead of magic numbers

The constants have generic names that don't convey their purpose in the tests.

-EXPECTED_LENGTH_2 = 2
-EXPECTED_LENGTH_5 = 5
-EXPECTED_SUM_3 = 3
-EXPECTED_SUM_2 = 2
+PERIODS_COUNT = 2  # Expected number of periods (P1, P2)
+MAX_BRANDS_COUNT = 5  # Expected maximum number of brands in test data
+P1_BRANDS_COUNT = 3  # Expected number of brands in period P1
+P2_BRANDS_COUNT = 2  # Expected number of brands in period P2

173-175: Use a more general solution for calculating tree_data index

Using a hard-coded index of 0 assumes knowledge about how the data will be structured, which could lead to brittle tests if the implementation changes.

-    tree_index = 0
-    tree_data = rt.df.to_dict(orient="records")[tree_index]
+    # Use the first row for simple tests without a group_col
+    if rt.df.index.name is None:
+        tree_data = rt.df.iloc[0].to_dict()
+    else:
+        # With a group_col, get the first group
+        tree_data = rt.df.iloc[0].to_dict()

289-294: Use a more representative test dataset

The test dataset is constructed with arbitrary values that don't reflect real-world scenarios where brands might have clusters of transactions.

Create a more meaningful test dataset:

     test_df = pd.DataFrame(
         {
             "brand_id": [
-                unique_brands[0],
-                unique_brands[0],
-                unique_brands[0],
-                unique_brands[1],
-                unique_brands[1],
-                unique_brands[1],
-                unique_brands[2],
+                unique_brands[0], unique_brands[0], unique_brands[0],  # Brand 0 has transactions in both periods
+                unique_brands[1], unique_brands[1], unique_brands[1],  # Brand 1 has transactions in both periods
+                unique_brands[2], unique_brands[2],                    # Brand 2 only has transactions in P1
             ],
-            "customer_id": [1, 2, 3, 4, 5, 6, 7],
-            "transaction_id": [1, 2, 3, 4, 5, 6, 7],
-            "unit_spend": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0],
-            "period": ["P1", "P2", "P1", "P2", "P1", "P2", "P1"],
+            "customer_id": [1, 2, 3, 4, 5, 6, 7, 8],
+            "transaction_id": [1, 2, 3, 4, 5, 6, 7, 8],
+            "unit_spend": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0],
+            "period": ["P1", "P2", "P1", "P2", "P1", "P2", "P1", "P1"],
         },
     )

359-361: Add tests for edge case price elasticity calculations

The test only verifies a simple case for price elasticity but doesn't test edge cases like zero price change or zero quantity change.

Add test cases for edge scenarios:

     if include_quantity:
         q1, q2 = 10, 12
         p1, p2 = 900.0 / q1, 1100.0 / q2
         expected_elasticity = ((q2 - q1) / ((q2 + q1) / 2)) / ((p2 - p1) / ((p2 + p1) / 2))

         assert round(result["price_elasticity"].iloc[0], 6) == round(expected_elasticity, 6)
+        
+        # Test edge cases
+        edge_cases = [
+            # Same price, different quantity
+            {"qty": [10, 12], "spend": [1000, 1200], "expected": float('inf')},
+            # Same quantity, different price
+            {"qty": [10, 10], "spend": [1000, 1200], "expected": 0.0},
+            # No change in either
+            {"qty": [10, 10], "spend": [1000, 1000], "expected": None},
+        ]
+        
+        for case in edge_cases:
+            edge_df = pd.DataFrame({
+                cols.agg_customer_id: [3, 3],
+                cols.agg_transaction_id: [3, 3],
+                cols.agg_unit_spend: case["spend"],
+                cols.agg_unit_qty: case["qty"],
+            }, index=["p1", "p2"])
+            
+            edge_result = calc_tree_kpis(edge_df, [True, False], [False, True])
+            
+            # For the no-change case, the value might be NaN or infinity
+            if case["expected"] is None:
+                assert pd.isna(edge_result["price_elasticity"].iloc[0]) or np.isinf(edge_result["price_elasticity"].iloc[0])
+            else:
+                assert round(edge_result["price_elasticity"].iloc[0], 6) == round(case["expected"], 6)
tests/integration/bigquery/test_threshold_segmentation.py (2)

207-226: Split test case into two separate tests for clarity

The test combines two different cases that should be tested separately for clarity.

-def test_thresholds_too_few_segments(self, transactions_table):
-    """Test that the method raises an error when there are too few/many segments for the number of thresholds."""
+def test_thresholds_more_than_segments(self, transactions_table):
+    """Test that the method raises an error when there are more thresholds than segments."""
     query = transactions_table.select(
         transactions_table.customer_id,
         transactions_table.unit_spend,
     ).limit(100)

     df = query.execute()

     thresholds = [0.4, 0.6, 0.8, 1]
     segments = ["Low", "High"]

     with pytest.raises(ValueError):
         ThresholdSegmentation(df, thresholds, segments)
+
+def test_thresholds_fewer_than_segments(self, transactions_table):
+    """Test that the method raises an error when there are fewer thresholds than segments."""
+    query = transactions_table.select(
+        transactions_table.customer_id,
+        transactions_table.unit_spend,
+    ).limit(100)
+
+    df = query.execute()

     segments = ["Low", "Medium", "High"]
+    thresholds = [0.4, 0.6, 0.8, 1]

     with pytest.raises(ValueError):
         ThresholdSegmentation(df, thresholds, segments)

227-246: Split test case and improve test names for clarity

Similar to the previous comment, this test combines two different cases and has a confusing name.

-def test_thresholds_too_few_thresholds(self, transactions_table):
-    """Test that the method raises an error when there are too few/many thresholds for the number of segments."""
+def test_fewer_thresholds_than_segments(self, transactions_table):
+    """Test that the method raises an error when there are fewer thresholds than segments."""
     query = transactions_table.select(
         transactions_table.customer_id,
         transactions_table.unit_spend,
     ).limit(100)

     df = query.execute()

     thresholds = [0.4, 1]
     segments = ["Low", "Medium", "High"]

     with pytest.raises(ValueError):
         ThresholdSegmentation(df, thresholds, segments)
+
+def test_more_thresholds_than_segments(self, transactions_table):
+    """Test that the method raises an error when there are more thresholds than segments."""
+    query = transactions_table.select(
+        transactions_table.customer_id,
+        transactions_table.unit_spend,
+    ).limit(100)
+
+    df = query.execute()
+    segments = ["Low", "Medium", "High"]
     thresholds = [0.2, 0.5, 0.6, 0.8, 1]

     with pytest.raises(ValueError):
         ThresholdSegmentation(df, thresholds, segments)
tests/integration/bigquery/test_rfm_segmentation.py (3)

12-12: Add a comment explaining the purpose of the REASONABLE_THRESHOLD constant.

Adding a brief comment would clarify what this threshold is used for and why the value of 0.1 was chosen.

-REASONABLE_THRESHOLD = 0.1
+# Threshold for comparing frequency values to raw transaction counts
+# 0.1 means ratios as extreme as 1:10 are still considered reasonable
+REASONABLE_THRESHOLD = 0.1

106-112: Consider simplifying the ratio comparison logic.

The current implementation to check if values are reasonably close is clever but somewhat complex. A more straightforward approach might improve readability.

-            ratio = max(actual_frequency, 1) / max(raw_transaction_count, 1)
-            inverse_ratio = max(raw_transaction_count, 1) / max(actual_frequency, 1)
-            reasonable = min(ratio, inverse_ratio) >= REASONABLE_THRESHOLD
+            # Check if the values are within a factor of 1/REASONABLE_THRESHOLD of each other
+            max_value = max(actual_frequency, raw_transaction_count, 1)
+            min_value = min(actual_frequency, raw_transaction_count, 1)
+            reasonable = min_value / max_value >= REASONABLE_THRESHOLD

167-184: Consider parameterizing the limit value.

The hard-coded limit of 10,000 records may cause unpredictable test durations. A parameterized or configurable value would provide more flexibility.

+    @pytest.fixture(params=[100, 1000, 10000])
+    def large_data_limit(self, request):
+        """Return different size limits for testing performance."""
+        return request.param
+
-    def test_large_dataset_performance(self, transactions_table):
+    def test_large_dataset_performance(self, transactions_table, large_data_limit):
         """Test RFMSegmentation performance with a larger dataset."""
         query = transactions_table.select(
             transactions_table.transaction_id,
             transactions_table.transaction_date,
             transactions_table.customer_id,
             transactions_table.unit_spend,
-        ).limit(10000)
+        ).limit(large_data_limit)
tests/integration/bigquery/test_segstats_segmentation.py (2)

26-26: Consider explicitly specifying region values instead of using modulo arithmetic.

Using modulo operations on store_id to create test regions makes the test data generation less transparent. Consider using explicit mapping for clarity.

-        df["region"] = np.where(df["store_id"] % 3 == 0, "North", np.where(df["store_id"] % 3 == 1, "South", "East"))
+        # Map store_id to regions using a more readable approach
+        region_map = {0: "North", 1: "South", 2: "East"}
+        df["region"] = df["store_id"].apply(lambda x: region_map[x % 3])

187-190: Add comment explaining the brand_name/brand_id fallback logic.

The fallback from brand_name to brand_id is important for test resilience, but its purpose isn't documented.

+        # Some test datasets might have brand_name, others might use brand_id
+        # We're flexible to support both schemas by creating segments from either
         if "brand_name" in transaction_df.columns:
             transaction_df["brand_segment"] = transaction_df["brand_name"].astype(str).str[:2]
         else:
             transaction_df["brand_segment"] = transaction_df["brand_id"].astype(str).str[:2]
tests/integration/bigquery/test_cross_shop.py (2)

17-31: Consider using the Ibis API instead of raw SQL.

For consistency with other tests, consider using the Ibis API directly rather than SQL strings.

 @pytest.fixture
 def transactions_df(transactions_table):
     """Fetch transaction data for testing from BigQuery."""
-    query = """
-    SELECT
-        customer_id,
-        category_1_name,
-        unit_spend
-    FROM
-        test_data.transactions
-    WHERE
-        category_1_name IN ('Jeans', 'Shoes', 'Dresses', 'Hats')
-    LIMIT 100
-    """
-    return transactions_table.sql(query).execute()
+    return (transactions_table
+            .select(
+                transactions_table.customer_id,
+                transactions_table.category_1_name,
+                transactions_table.unit_spend
+            )
+            .filter(transactions_table.category_1_name.isin(['Jeans', 'Shoes', 'Dresses', 'Hats']))
+            .limit(100)
+            .execute())

156-156: Simplify integer check for improved readability.

The current check for integer values is complex and hard to read. Consider a clearer approach.

-    assert all(isinstance(val, int | float) and val.is_integer() for val in cross_shop_table[cols.customer_id])
+    # Check that all customer_id counts are integers (even if stored as float)
+    for val in cross_shop_table[cols.customer_id]:
+        assert isinstance(val, (int, float))
+        if isinstance(val, float):
+            assert val.is_integer(), f"Expected integer value, got {val}"
🛑 Comments failed to post (9)
tests/integration/bigquery/conftest.py (1)

11-13: ⚠️ Potential issue

Remove hardcoded project ID

There's an inconsistency in how BigQuery clients are created. The bigquery.Client is initialized with a hardcoded project ID "pyretailscience-infra", but it's never used. Meanwhile, the bigquery_connection fixture uses the environment variable.

 load_dotenv()
-client = bigquery.Client(project="pyretailscience-infra")
tests/integration/bigquery/test_composite_rank.py (3)

46-53: 🛠️ Refactor suggestion

Verify rank column results beyond just presence

The test only checks for the presence of rank columns but doesn't verify their values or ranking logic.

Add assertions to verify the actual ranking logic:

     for col in expected_rank_columns:
         assert col in cr.df.columns
+    
+    # Verify that ranks are correctly sorted
+    unit_spend_ranks = cr.df["unit_spend_rank"]
+    assert unit_spend_ranks.is_monotonic_increasing or unit_spend_ranks.is_monotonic_decreasing
+    
+    # Verify that composite rank is calculated correctly (e.g., as the mean of individual ranks)
+    calculated_mean = cr.df[["unit_spend_rank", "customer_id_rank", "spend_per_customer_rank"]].mean(axis=1)
+    assert cr.df["composite_rank"].equals(calculated_mean)

20-22: ⚠️ Potential issue

Check the calculation of spend_per_customer

The calculation divides unit_spend by customer_id, which is problematic since customer_id is typically an identifier rather than a quantity. This division will produce meaningless results that vary based on ID values.

-    if "spend_per_customer" not in df.columns:
-        df["spend_per_customer"] = df["unit_spend"] / df["customer_id"]
+    if "spend_per_customer" not in df.columns:
+        # Calculate spend per customer by first grouping by customer_id
+        customer_counts = df.groupby("customer_id").size()
+        customer_spend = df.groupby("customer_id")["unit_spend"].sum()
+        spend_per_customer_series = customer_spend / customer_counts
+        # Map the values back to each row
+        df["spend_per_customer"] = df["customer_id"].map(spend_per_customer_series)

76-96: 🛠️ Refactor suggestion

Enhance tie-breaking test to validate actual ranking behavior

The current test only checks for the presence of rank columns but doesn't verify that tie-breaking behavior is different between the two scenarios.

Add assertions to verify the actual tie-breaking behavior:

     assert "unit_spend_rank" in cr_with_ties.df.columns
     assert "unit_spend_rank" in cr_no_ties.df.columns
+    
+    # Find duplicate unit_spend values to check tie behavior
+    spend_counts = test_transactions_df["unit_spend"].value_counts()
+    tied_spend_values = spend_counts[spend_counts > 1].index
+    
+    if len(tied_spend_values) > 0:
+        tied_value = tied_spend_values[0]
+        tied_rows_with_ties = cr_with_ties.df[cr_with_ties.df["unit_spend"] == tied_value]
+        tied_rows_no_ties = cr_no_ties.df[cr_no_ties.df["unit_spend"] == tied_value]
+        
+        # When ignore_ties=False, tied values should have the same rank
+        if len(tied_rows_with_ties) > 1:
+            assert tied_rows_with_ties["unit_spend_rank"].nunique() == 1
+        
+        # When ignore_ties=True, tied values should have different ranks
+        if len(tied_rows_no_ties) > 1:
+            assert tied_rows_no_ties["unit_spend_rank"].nunique() == len(tied_rows_no_ties)
tests/integration/bigquery/test_product_association.py (1)

33-43: 🛠️ Refactor suggestion

Handle potential execution failures in expected_results_df fixture

The fixture doesn't handle potential failure scenarios when executing the association calculation.

Add error handling to prevent test failures caused by issues in the fixture:

     """Dynamically generate expected results based on actual BigQuery data."""
+    try:
         associations_df = ProductAssociation._calc_association(
             df=transactions_df,
             value_col="product",
             group_col="transaction_id",
         )
 
         if not isinstance(associations_df, pd.DataFrame) and hasattr(associations_df, "execute"):
-            associations_df = associations_df.execute()
+            try:
+                associations_df = associations_df.execute()
+            except Exception as e:
+                pytest.skip(f"Error executing association calculation: {str(e)}")
+                return pd.DataFrame()
 
         return associations_df
+    except Exception as e:
+        pytest.skip(f"Error calculating product associations: {str(e)}")
+        return pd.DataFrame()
tests/integration/bigquery/test_revenue_tree.py (1)

232-236: ⚠️ Potential issue

Duplicate assertion check - Same test is performed twice

There's a redundant assertion that checks the same condition twice.

     assert (
         tree_data[cols.calc_spend_per_cust_p2] - tree_data[cols.calc_spend_per_cust_p1]
         == tree_data[cols.calc_spend_per_cust_diff]
     )
-    assert (
-        tree_data[cols.calc_spend_per_cust_p2] - tree_data[cols.calc_spend_per_cust_p1]
-        == tree_data[cols.calc_spend_per_cust_diff]
-    )
tests/integration/bigquery/test_threshold_segmentation.py (1)

55-73: ⚠️ Potential issue

Test name doesn't match test implementation

The test name suggests it's testing with a single customer, but it's actually testing a mismatch between thresholds and segments count.

-def test_single_customer(self, transactions_table):
-    """Test that the method correctly handles a DataFrame with only one customer."""
+def test_thresholds_segments_mismatch(self, transactions_table):
+    """Test that the method raises an error when thresholds and segments count don't match."""
     query = transactions_table.select(
         transactions_table.customer_id,
         transactions_table.unit_spend,
     ).limit(1)

     df = query.execute()

     thresholds = [0.5, 1]
     segments = ["Low"]

     with pytest.raises(ValueError):
         ThresholdSegmentation(
             df=df,
             thresholds=thresholds,
             segments=segments,
         )

Also, add a separate test for the single customer case:

def test_single_customer(self, transactions_table):
    """Test that the method correctly handles a DataFrame with only one customer."""
    query = transactions_table.select(
        transactions_table.customer_id,
        transactions_table.unit_spend,
    ).limit(1)

    df = query.execute()

    thresholds = [1]
    segments = ["Single"]

    # This should work with a single customer
    seg = ThresholdSegmentation(
        df=df,
        thresholds=thresholds,
        segments=segments,
    )
    
    result_df = seg.df
    assert len(result_df) == 1
    assert result_df["segment_name"].iloc[0] == "Single"
tests/integration/bigquery/test_segstats_segmentation.py (1)

157-162: 🛠️ Refactor suggestion

Avoid catching generic Exception.

Using a broad Exception catch can hide unexpected errors. Specify the exceptions you expect to catch.

         try:
             seg_stats_no_total.plot("spend")
             plotting_no_total_worked = True
-        except Exception as e:  # noqa: BLE001
+        except (ValueError, TypeError, RuntimeError, AttributeError, ImportError) as e:
             plotting_no_total_worked = False
             pytest.fail(f"Plotting with calc_total=False failed: {e!s}")
tests/integration/bigquery/test_hml_segmentation.py (1)

34-41: 🛠️ Refactor suggestion

Extract the zero-spend customer creation to a separate fixture.

This code is duplicated in three test methods. Consider extracting it into a reusable fixture.

+    @pytest.fixture
+    def ensure_zero_spend_customer(self, customers_spend_df):
+        """Ensure at least one zero-spend customer exists in the DataFrame."""
+        if not (customers_spend_df[cols.unit_spend] == 0).any():
+            zero_customer = pd.DataFrame(
+                {
+                    get_option("column.customer_id"): [999999],
+                    cols.unit_spend: [0],
+                },
+            )
+            customers_spend_df = pd.concat([customers_spend_df, zero_customer], ignore_index=True)
+
+        zero_spend_customer_id = customers_spend_df[customers_spend_df[cols.unit_spend] == 0][
+            get_option("column.customer_id")
+        ].iloc[0]
+        
+        return customers_spend_df, zero_spend_customer_id

Then use it in each test:

def test_handles_zero_spend_customers_are_excluded_in_result(self, ensure_zero_spend_customer):
    """Test that the method correctly handles zero spend customers when zero_value_customers is 'exclude'."""
    customers_spend_df, zero_spend_customer_id = ensure_zero_spend_customer
    
    hml_segmentation = HMLSegmentation(customers_spend_df, zero_value_customers="exclude")
    result_df = hml_segmentation.df
    
    assert zero_spend_customer_id not in result_df.index.values
    
    assert "Heavy" in result_df["segment_name"].values
    assert "Medium" in result_df["segment_name"].values
    assert "Light" in result_df["segment_name"].values

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/integration/bigquery/test_revenue_tree.py (2)

11-14: Consider more descriptive constant names

These constants use numeric suffixes that don't clearly indicate their purpose. Consider using more descriptive names that explain what these values represent in the tests.

-EXPECTED_LENGTH_2 = 2
-EXPECTED_LENGTH_5 = 5
-EXPECTED_SUM_3 = 3
-EXPECTED_SUM_2 = 2
+EXPECTED_PERIODS_COUNT = 2
+EXPECTED_GROUPS_COUNT = 5
+EXPECTED_P1_GROUPS_COUNT = 3
+EXPECTED_P2_GROUPS_COUNT = 2

61-103: Extract test data setup into helper methods

The test data setup logic is repeated across multiple tests. Consider extracting this common setup code into helper methods to reduce duplication.

You could create helper methods for:

  1. Preparing a DataFrame with basic transaction data
  2. Adding period flags
  3. Applying data type conversions

Example refactoring:

def _prepare_base_transaction_df(self, transactions_table, columns, limit=6):
    """Prepare a base transaction DataFrame with common columns."""
    query = transactions_table.select(columns).limit(limit)
    df = query.execute()
    return df

def _add_period_flags(self, df):
    """Add period flags (P1/P2) to the DataFrame."""
    middle_index = len(df) // 2
    df["period"] = ["P1"] * middle_index + ["P2"] * (len(df) - middle_index)
    return df

def _convert_data_types(self, df, cols, include_quantity=False):
    """Apply common data type conversions."""
    df["customer_id"] = df["customer_id"].astype(int)
    df["transaction_id"] = df["transaction_id"].astype(int)
    df["unit_spend"] = df["unit_spend"].astype(float)
    
    if include_quantity and "unit_quantity" in df.columns:
        df[cols.unit_qty] = df["unit_quantity"].astype(int)
    
    return df

Then, test methods could use these helpers:

def test_agg_data_no_group(self, cols: ColumnHelper, include_quantity: bool, transactions_table):
    """Test the _agg_data method with no group_col using BigQuery data."""
    columns = ["customer_id", "transaction_id", "unit_spend", "transaction_date"]
    
    if include_quantity:
        columns.append("unit_quantity")
    
    df = self._prepare_base_transaction_df(transactions_table, columns)
    df = self._add_period_flags(df)
    df = self._convert_data_types(df, cols, include_quantity)
    
    # Rest of the test...
README.md (1)

285-289: Fix spacing in list formatting

There's an extra space after the colon before the list marker.

-To run the workflow in GitHub Actions, add these secrets to your repository:  - `GCP_SA_KEY`: The entire JSON content of your GCP service account key file
-
+To run the workflow in GitHub Actions, add these secrets to your repository:
+- `GCP_SA_KEY`: The entire JSON content of your GCP service account key file

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a78624 and af4e56f.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (16)
  • .env_sample (1 hunks)
  • .github/workflows/bigquery-integration.yml (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • README.md (2 hunks)
  • pyproject.toml (1 hunks)
  • tests/integration/bigquery/conftest.py (1 hunks)
  • tests/integration/bigquery/test_cohort_analysis.py (1 hunks)
  • tests/integration/bigquery/test_composite_rank.py (1 hunks)
  • tests/integration/bigquery/test_cross_shop.py (1 hunks)
  • tests/integration/bigquery/test_customer_decision_hierarchy.py (1 hunks)
  • tests/integration/bigquery/test_hml_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_product_association.py (1 hunks)
  • tests/integration/bigquery/test_revenue_tree.py (1 hunks)
  • tests/integration/bigquery/test_rfm_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_segstats_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_threshold_segmentation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (14)
  • .env_sample
  • pyproject.toml
  • .pre-commit-config.yaml
  • tests/integration/bigquery/test_customer_decision_hierarchy.py
  • tests/integration/bigquery/conftest.py
  • tests/integration/bigquery/test_threshold_segmentation.py
  • tests/integration/bigquery/test_cohort_analysis.py
  • tests/integration/bigquery/test_segstats_segmentation.py
  • tests/integration/bigquery/test_product_association.py
  • .github/workflows/bigquery-integration.yml
  • tests/integration/bigquery/test_composite_rank.py
  • tests/integration/bigquery/test_hml_segmentation.py
  • tests/integration/bigquery/test_cross_shop.py
  • tests/integration/bigquery/test_rfm_segmentation.py

🔇 Additional comments (5)
tests/integration/bigquery/test_revenue_tree.py (4)

317-361: Good test design for calc_tree_kpis function

This test is well-structured with good organization:

  1. Clear test data setup with explicit values
  2. Proper handling of the parametrized test cases
  3. Validation of return type and structure
  4. Exhaustive column verification
  5. Advanced verification of elasticity calculations when quantity data is present

1-9: Well-structured imports and module docstring

The imports are properly organized with standard library imports first (math), followed by third-party libraries (pandas, pytest), and then project-specific imports. The module docstring clearly describes the purpose of the test file.


31-44: Good error handling test

This test properly verifies that the RevenueTree class raises appropriate ValueErrors when required columns are missing, including checking the error message content.


260-315: Comprehensive edge case testing

The test successfully verifies an important edge case where some groups are only present in the P1 period but not in P2. This is crucial for ensuring correct handling of imbalanced data across time periods.

README.md (1)

291-309:

❓ Verification inconclusive

Verify expected BigQuery table schema

The documentation lists the required table schema for BigQuery integration tests. Ensure that this matches exactly what the tests expect to avoid potential test failures.


🏁 Script executed:

#!/bin/bash
# Verify all column names mentioned in the test files match those in the README

# Find all required columns mentioned in the test files
echo "Searching for column references in test files..."
# Look for column names in BigQuery test files
rg -o --no-filename "(\"[a-zA-Z0-9_]+\"|\[\"[a-zA-Z0-9_]+\"\])" tests/integration/bigquery/ | sort | uniq

Length of output: 2326


Ensure BigQuery integration test schema aligns with README

The grep across tests/integration/bigquery/ reveals which column names your tests actually reference versus what’s documented. There are mismatches that need reconciliation to avoid test failures and confusion:

• README.md (lines 291–309): currently lists

  • transaction_id
  • transaction_date
  • transaction_time
  • customer_id
  • product_id
  • product_name
  • category_0_name
  • category_0_id
  • category_1_name
  • category_1_id
  • brand_name
  • brand_id
  • unit_quantity
  • unit_cost
  • unit_spend
  • store_id

• Tests reference (among many others):

  • transaction_id, transaction_date
  • customer_id, product_id, product_name
  • brand_id, brand_name
  • store_id, unit_quantity, unit_cost, unit_spend
  • plus derived fields like category_segment, brand_segment, avg_unit_cost, composite_rank, etc.

Key discrepancies:

  • Missing in tests: transaction_time, category_0_id, category_1_id
  • Missing in docs: any derived columns (category_segment, brand_segment, etc.)

Please verify and update either the README schema or the test assertions so that they match exactly.

  • Update README.md to reflect only the columns your tests create/assert
  • Or adjust your tests to cover all documented base columns

@mayurkmmt mayurkmmt force-pushed the feature/bigquery-integration-analysis branch from af4e56f to f51ea78 Compare May 14, 2025 07:19
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (7)
tests/integration/bigquery/test_revenue_tree.py (4)

11-14: Use more descriptive constant names.

The constant names (EXPECTED_LENGTH_2, EXPECTED_LENGTH_5, etc.) are not very descriptive and don't indicate their purpose in the tests. Consider using more meaningful names that explain what these values represent.

-EXPECTED_LENGTH_2 = 2
-EXPECTED_LENGTH_5 = 5
-EXPECTED_SUM_3 = 3
-EXPECTED_SUM_2 = 2
+# Constants for expected array lengths and sums in test assertions
+EXPECTED_PERIODS_LENGTH = 2  # Number of periods (p1, p2)
+EXPECTED_TEST_GROUPS_LENGTH = 5  # Number of test groups in the grouped test data
+EXPECTED_P1_GROUPS_COUNT = 3  # Number of groups that appear in period 1
+EXPECTED_P2_GROUPS_COUNT = 2  # Number of groups that appear in period 2

168-203: The test method could benefit from additional documentation.

This test method is comprehensive but could be clearer about what specific aspects of the contribution calculations are being verified. Consider adding more inline comments to explain the logical flow and purpose of each set of assertions.

@pytest.mark.parametrize("include_quantity", [True, False])
def test_contrib_sum(self, cols: ColumnHelper, include_quantity: bool, transactions_table):
    """Test that the contributions add up to the total using BigQuery data."""
+    # This test verifies that the revenue tree components correctly sum up
+    # to their totals and that the calculated differences match the actual differences
+    # between period values
    columns = [
        "customer_id",
        "transaction_id",
        "unit_spend",
        "transaction_date",
    ]

312-356: Consider improving price elasticity calculation clarity.

The price elasticity calculation in test_calc_tree_kpis_basic is implemented correctly, but its purpose and the formula used could be better documented for future maintainers.

if include_quantity:
    q1, q2 = 10, 12
    p1, p2 = 900.0 / q1, 1100.0 / q2
+    # Calculate price elasticity using arc elasticity formula:
+    # (% change in quantity / % change in price)
+    # Using midpoint formula for percentage changes for more accurate results with large changes
    expected_elasticity = ((q2 - q1) / ((q2 + q1) / 2)) / ((p2 - p1) / ((p2 + p1) / 2))

    assert round(result["price_elasticity"].iloc[0], 6) == round(expected_elasticity, 6)
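For reference, a worked pass through that arc-elasticity formula using the sample values above (q1 = 10, q2 = 12, so p1 = 90 and p2 ≈ 91.67); this is purely illustrative, not part of the PR:

$$
E \;=\; \frac{(q_2 - q_1)\,/\,\bigl((q_1 + q_2)/2\bigr)}{(p_2 - p_1)\,/\,\bigl((p_1 + p_2)/2\bigr)}
\;=\; \frac{2/11}{1.667/90.833}
\;\approx\; \frac{0.1818}{0.01835}
\;\approx\; 9.91
$$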

74-93: Consider parameterizing test data size.

The test uses hardcoded limits for BigQuery data sampling. Consider making these limits configurable parameters for better test flexibility.

-        query = transactions_table.limit(6)
+        sample_size = 6  # Can be adjusted based on test needs
+        query = transactions_table.limit(sample_size)
README.md (3)

287-287: Fix spacing after colon in the bullet list.

There is an extra space after the colon in this line which was flagged by static analysis.

-To run the workflow in GitHub Actions, add these secrets to your repository:  
+To run the workflow in GitHub Actions, add these secrets to your repository:
🧰 Tools
🪛 LanguageTool

[uncategorized] ~287-~287: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)


236-244: Consider adding version requirements for GCP components.

The prerequisites section could be enhanced by specifying any version requirements for GCP components or BigQuery API.

You might want to add information about minimum GCP API versions or client library versions required for the tests to run successfully.


245-271: Consider mentioning test runtime expectations.

The documentation provides clear instructions on running tests but doesn't mention how long they might take to execute or any resource consumption expectations.

Consider adding information about typical test duration and any resource consumption implications when running against BigQuery, as this could help developers plan their testing accordingly.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af4e56f and f51ea78.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (16)
  • .env_sample (1 hunks)
  • .github/workflows/bigquery-integration.yml (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • README.md (2 hunks)
  • pyproject.toml (1 hunks)
  • tests/integration/bigquery/conftest.py (1 hunks)
  • tests/integration/bigquery/test_cohort_analysis.py (1 hunks)
  • tests/integration/bigquery/test_composite_rank.py (1 hunks)
  • tests/integration/bigquery/test_cross_shop.py (1 hunks)
  • tests/integration/bigquery/test_customer_decision_hierarchy.py (1 hunks)
  • tests/integration/bigquery/test_hml_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_product_association.py (1 hunks)
  • tests/integration/bigquery/test_revenue_tree.py (1 hunks)
  • tests/integration/bigquery/test_rfm_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_segstats_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_threshold_segmentation.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/bigquery/test_segstats_segmentation.py
🚧 Files skipped from review as they are similar to previous changes (13)
  • pyproject.toml
  • .env_sample
  • .pre-commit-config.yaml
  • .github/workflows/bigquery-integration.yml
  • tests/integration/bigquery/conftest.py
  • tests/integration/bigquery/test_threshold_segmentation.py
  • tests/integration/bigquery/test_customer_decision_hierarchy.py
  • tests/integration/bigquery/test_cohort_analysis.py
  • tests/integration/bigquery/test_rfm_segmentation.py
  • tests/integration/bigquery/test_product_association.py
  • tests/integration/bigquery/test_composite_rank.py
  • tests/integration/bigquery/test_hml_segmentation.py
  • tests/integration/bigquery/test_cross_shop.py
🧰 Additional context used
🪛 LanguageTool
README.md

[uncategorized] ~287-~287: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)

🔇 Additional comments (2)
tests/integration/bigquery/test_revenue_tree.py (1)

225-233: The previously identified duplicate assertion has been fixed.

The code no longer contains the duplicate assertion for calc_spend_per_cust_diff that was flagged in previous reviews. Good job addressing this issue.

README.md (1)

213-312: Well-structured and comprehensive documentation for BigQuery integration tests.

The documentation for BigQuery integration tests is very thorough, covering all aspects from overview to prerequisites, setup, and execution instructions. This will be very helpful for developers working with these tests.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~287-~287: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)

Comment on lines 58 to 59
- name: Set up Google Cloud SDK
uses: google-github-actions/setup-gcloud@v2
Contributor

Do we need the Google Cloud SDK? Anytime I've used it before it's been pretty heavyweight.

README.md Outdated
Comment on lines 259 to 260
uv pip install -e .
uv pip install "ibis-framework[bigquery]>=10.0.0,<11"
Contributor

I think it'd be better to just run uv sync, no?

Suggested change
uv pip install -e .
uv pip install "ibis-framework[bigquery]>=10.0.0,<11"
uv sync

README.md Outdated
Comment on lines 223 to 234
The integration tests cover the following analysis modules:

- **Cohort Analysis** - Tests customer cohort retention metrics
- **Cross Shop Analysis** - Tests product/category cross-shopping patterns
- **Customer Analysis** - Tests customer lifetime value and purchase frequency metrics
- **Gain Loss Analysis** - Tests comparative performance analysis
- **Haversine Analysis** - Tests geographic distance calculations
- **Product Association Analysis** - Tests market basket analysis
- **Customer Decision Hierarchy** - Tests customer purchase decision patterns
- **Revenue Tree Analysis** - Tests hierarchical revenue breakdowns
- **Composite Rank Analysis** - Tests weighted ranking of entities
- **Segmentation Analysis** - Tests RFM and value-frequency customer segmentation
Contributor

Let's not list all the modules. I'm a bit worried about all the additional steps required as we implement new modules.

Suggested change
The integration tests cover the following analysis modules:
- **Cohort Analysis** - Tests customer cohort retention metrics
- **Cross Shop Analysis** - Tests product/category cross-shopping patterns
- **Customer Analysis** - Tests customer lifetime value and purchase frequency metrics
- **Gain Loss Analysis** - Tests comparative performance analysis
- **Haversine Analysis** - Tests geographic distance calculations
- **Product Association Analysis** - Tests market basket analysis
- **Customer Decision Hierarchy** - Tests customer purchase decision patterns
- **Revenue Tree Analysis** - Tests hierarchical revenue breakdowns
- **Composite Rank Analysis** - Tests weighted ranking of entities
- **Segmentation Analysis** - Tests RFM and value-frequency customer segmentation
The integration tests cover the analysis modules.

README.md Outdated
Comment on lines 290 to 309
## Test Data

The tests expect a BigQuery dataset named `test_data` with a table named `transactions` containing the following columns:

- `transaction_id`
- `transaction_date`
- `transaction_time`
- `customer_id`
- `product_id`
- `product_name`
- `category_0_name`
- `category_0_id`
- `category_1_name`
- `category_1_id`
- `brand_name`
- `brand_id`
- `unit_quantity`
- `unit_cost`
- `unit_spend`
- `store_id`
Contributor

Let's remove this too. The file should be changing in the next several weeks, I hope.

Comment on lines 25 to 49
def test_invalid_period(self, transactions_table):
"""Test if an invalid period raises an error."""
invalid_period = "m"
with pytest.raises(
ValueError,
match=f"Invalid period '{invalid_period}'. Allowed values: {CohortAnalysis.VALID_PERIODS}",
):
CohortAnalysis(
df=transactions_table,
aggregation_column="unit_spend",
period=invalid_period,
)

def test_cohort_percentage(self, transactions_table):
"""Tests cohort analysis with percentage=True."""
cohort = CohortAnalysis(
df=transactions_table,
aggregation_column="unit_spend",
agg_func="sum",
period="month",
percentage=True,
)
result = cohort.table
assert not result.empty
assert result.max().max() <= 1.0, "Values should be <= 1 when percentage=True"
Contributor

I think the first test is enough for now. Basically, it just checks that the computation doesn't error on BigQuery

Suggested change
def test_invalid_period(self, transactions_table):
"""Test if an invalid period raises an error."""
invalid_period = "m"
with pytest.raises(
ValueError,
match=f"Invalid period '{invalid_period}'. Allowed values: {CohortAnalysis.VALID_PERIODS}",
):
CohortAnalysis(
df=transactions_table,
aggregation_column="unit_spend",
period=invalid_period,
)
def test_cohort_percentage(self, transactions_table):
"""Tests cohort analysis with percentage=True."""
cohort = CohortAnalysis(
df=transactions_table,
aggregation_column="unit_spend",
agg_func="sum",
period="month",
percentage=True,
)
result = cohort.table
assert not result.empty
assert result.max().max() <= 1.0, "Values should be <= 1 when percentage=True"

Comment on lines 25 to 53
def test_composite_rank_with_bigquery_data(self, test_transactions_df):
"""Test CompositeRank functionality with real BigQuery data.

This test demonstrates using CompositeRank with BigQuery-sourced data.
"""
rank_cols = [
("unit_spend", "desc"),
("customer_id", "desc"),
("spend_per_customer", "desc"),
]

cr = CompositeRank(
df=test_transactions_df,
rank_cols=rank_cols,
agg_func="mean",
ignore_ties=False,
)

assert "composite_rank" in cr.df.columns
assert len(cr.df) > 0

expected_rank_columns = [
"unit_spend_rank",
"customer_id_rank",
"spend_per_customer_rank",
"composite_rank",
]
for col in expected_rank_columns:
assert col in cr.df.columns
Contributor

Doesn't this function overlap with the next one in terms of functionality?


def test_different_agg_functions_with_bigquery(self, test_transactions_df):
"""Test different aggregation functions with BigQuery data."""
agg_functions = ["mean", "sum", "min", "max"]
Contributor

Can you implement this with pytest parametrize instead of a loop in the function, please?
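For illustration, a minimal sketch of the parametrized form (the fixture name and CompositeRank arguments follow the excerpts above; the import path is an assumption):

```python
import pytest

from pyretailscience.analysis.composite_rank import CompositeRank  # import path assumed


@pytest.mark.parametrize("agg_func", ["mean", "sum", "min", "max"])
def test_agg_functions_with_bigquery(test_transactions_df, agg_func):
    """Each aggregation function should build a composite rank without errors."""
    cr = CompositeRank(
        df=test_transactions_df,
        rank_cols=[("unit_spend", "desc")],
        agg_func=agg_func,
        ignore_ties=False,
    )
    assert "composite_rank" in cr.df.columns
```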

"""Test tie-breaking behavior with BigQuery data."""
rank_cols = [("unit_spend", "desc")]

cr_with_ties = CompositeRank(
Contributor

Similarly here, can you implement this with pytest parametrize, please?

Comment on lines 97 to 113
def test_ibis_table_input(self, transactions_table):
"""Explicitly test Ibis table input for CompositeRank."""
cr = CompositeRank(
df=transactions_table,
rank_cols=[("unit_spend", "desc"), ("customer_id", "desc")],
agg_func="mean",
ignore_ties=False,
)

expected_columns = [
"unit_spend_rank",
"customer_id_rank",
"composite_rank",
]

for col in expected_columns:
assert col in cr.df.columns
Contributor

I don't think we need this test for now. I think the others cover my big concerns

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
tests/integration/bigquery/test_haversine.py (1)

20-21: Consider adding a more descriptive assertion message

While the test includes an error message, it could be more descriptive to aid in debugging.

- assert pytest.approx(result, rel=1e-3) == expected_distance, "Distance calculation error"
+ assert pytest.approx(result, rel=1e-3) == expected_distance, f"Distance calculation error: expected {expected_distance}, got {result}"
README.md (1)

275-275: Fix formatting in the list of required secrets

There appears to be extra spacing before the list item.

-To run the workflow in GitHub Actions, add these secrets to your repository:  
+To run the workflow in GitHub Actions, add these secrets to your repository:
🧰 Tools
🪛 LanguageTool

[uncategorized] ~275-~275: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)

tests/integration/bigquery/test_cross_shop.py (3)

28-41: Consider adding assertions to the integration test

The test currently verifies that CrossShop can be instantiated without errors, but doesn't assert any expected behavior or results. While this is in line with the previous review comments to focus on Ibis compatibility, adding a basic assertion would strengthen the test.

-    CrossShop(
+    cs = CrossShop(
         df=transactions_df,
         group_1_col="category_1_name",
         group_1_val="Jeans",
         group_2_col="category_1_name",
         group_2_val="Shoes",
     )
+    # Assert that CrossShop object was successfully created with data
+    assert len(cs.cross_shop_df) > 0, "CrossShop should have processed data"

43-58: Consider adding assertions to the three-group test

Similar to the previous test, this one verifies that the three-group configuration doesn't raise errors, but doesn't assert any expected behavior.

-    CrossShop(
+    cs = CrossShop(
         df=transactions_df,
         group_1_col="category_1_name",
         group_1_val="Jeans",
         group_2_col="category_1_name",
         group_2_val="Shoes",
         group_3_col="category_1_name",
         group_3_val="Dresses",
     )
+    # Assert that CrossShop correctly handled three groups
+    assert cs.cross_shop_table_df is not None, "CrossShop should generate cross_shop_table_df"

60-74: Verify custom aggregation results in the test

This test validates that custom aggregation functions like "nunique" work with BigQuery, but doesn't assert the expected behavior. Consider adding a simple assertion to confirm the aggregation was applied correctly.

-    CrossShop(
+    cs = CrossShop(
         df=transactions_df,
         group_1_col="category_1_name",
         group_1_val="Jeans",
         group_2_col="category_1_name",
         group_2_val="Shoes",
         value_col=cols.customer_id,
         agg_func="nunique",
     )
+    # Assert that customer count aggregation was successful
+    assert cs.cross_shop_table_df is not None, "CrossShop should generate table with customer counts"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f51ea78 and 7bec26b.

📒 Files selected for processing (14)
  • .github/workflows/bigquery-integration.yml (1 hunks)
  • README.md (2 hunks)
  • pyretailscience/segmentation/threshold.py (1 hunks)
  • tests/integration/bigquery/test_cohort_analysis.py (1 hunks)
  • tests/integration/bigquery/test_composite_rank.py (1 hunks)
  • tests/integration/bigquery/test_cross_shop.py (1 hunks)
  • tests/integration/bigquery/test_customer_decision_hierarchy.py (1 hunks)
  • tests/integration/bigquery/test_haversine.py (1 hunks)
  • tests/integration/bigquery/test_hml_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_product_association.py (1 hunks)
  • tests/integration/bigquery/test_revenue_tree.py (1 hunks)
  • tests/integration/bigquery/test_rfm_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_segstats_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_threshold_segmentation.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • pyretailscience/segmentation/threshold.py
🚧 Files skipped from review as they are similar to previous changes (10)
  • tests/integration/bigquery/test_revenue_tree.py
  • tests/integration/bigquery/test_customer_decision_hierarchy.py
  • .github/workflows/bigquery-integration.yml
  • tests/integration/bigquery/test_cohort_analysis.py
  • tests/integration/bigquery/test_rfm_segmentation.py
  • tests/integration/bigquery/test_product_association.py
  • tests/integration/bigquery/test_threshold_segmentation.py
  • tests/integration/bigquery/test_hml_segmentation.py
  • tests/integration/bigquery/test_composite_rank.py
  • tests/integration/bigquery/test_segstats_segmentation.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
tests/integration/bigquery/test_haversine.py (1)
pyretailscience/analysis/haversine.py (1)
  • haversine_distance (28-55)
tests/integration/bigquery/test_cross_shop.py (3)
pyretailscience/analysis/cross_shop.py (1)
  • CrossShop (13-200)
pyretailscience/options.py (1)
  • ColumnHelper (408-560)
tests/integration/bigquery/conftest.py (1)
  • transactions_table (31-33)
🪛 LanguageTool
README.md

[uncategorized] ~275-~275: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)

🔇 Additional comments (11)
tests/integration/bigquery/test_haversine.py (4)

1-2: Good documentation

Clear docstring that succinctly describes the purpose of this test file.


9-10: Well-named test function with clear docstring

The test function name and docstring clearly communicate the purpose of the test.


11-14: Good use of Ibis literals for test coordinates

Using ibis.literal with specified type "float64" ensures compatibility with BigQuery's type system. The coordinates represent San Francisco and New York, making this a realistic test case.


16-18: Clean expression construction and execution

The code correctly builds and executes the haversine distance expression using the imported function.
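Putting those pieces together, a hedged reconstruction of the test under review (the exact coordinates, the expected distance of roughly 4,129 km, and the haversine_distance argument order are assumptions for illustration):

```python
import ibis
import pytest

from pyretailscience.analysis.haversine import haversine_distance


def test_haversine_distance_on_bigquery():
    """Build the distance expression from float64 literals and execute it."""
    lat1 = ibis.literal(37.7749, type="float64")   # San Francisco
    lon1 = ibis.literal(-122.4194, type="float64")
    lat2 = ibis.literal(40.7128, type="float64")   # New York
    lon2 = ibis.literal(-74.0060, type="float64")

    result = haversine_distance(lat1, lon1, lat2, lon2).execute()

    expected_distance = 4129.09  # km, approximate SF -> NY great-circle distance
    assert pytest.approx(result, rel=1e-3) == expected_distance, "Distance calculation error"
```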

README.md (5)

213-219: Well-structured BigQuery Integration Tests documentation section

Clear introduction to the BigQuery integration tests with a concise overview of their purpose.


221-223: Concise test coverage statement

The simplified statement aligns with previous feedback to avoid listing all modules.


225-233: Comprehensive prerequisites section

Good documentation of all requirements needed to run the integration tests, including GCP account, service account, credentials, and test data.


235-259: Clear manual setup instructions

Step-by-step instructions for setting up authentication, installing dependencies, and running tests. The commands are provided with examples for both running all tests and specific test modules.


261-276: Detailed GitHub Actions workflow documentation

Thorough instructions for running tests via GitHub Actions, including the required repository secrets.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~275-~275: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)

tests/integration/bigquery/test_cross_shop.py (2)

1-7: Clear imports and docstring

Good documentation and appropriate imports for the CrossShop integration tests.


11-26: Well-implemented fixture for test data

The fixture uses appropriate filtering and limiting to create a manageable test dataset while including the necessary categories for cross-shop analysis.

@mayurkmmt mayurkmmt force-pushed the feature/bigquery-integration-analysis branch from 7bec26b to a9a6b58 Compare May 15, 2025 09:08
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
README.md (1)

275-275: Fix spacing in formatting.

There appears to be an extra space after the colon.

-To run the workflow in GitHub Actions, add these secrets to your repository:  
+To run the workflow in GitHub Actions, add these secrets to your repository:
🧰 Tools
🪛 LanguageTool

[uncategorized] ~275-~275: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 7bec26b and a9a6b58.

📒 Files selected for processing (15)
  • .github/workflows/bigquery-integration.yml (1 hunks)
  • README.md (2 hunks)
  • pyretailscience/segmentation/threshold.py (1 hunks)
  • tests/integration/bigquery/test_cohort_analysis.py (1 hunks)
  • tests/integration/bigquery/test_composite_rank.py (1 hunks)
  • tests/integration/bigquery/test_cross_shop.py (1 hunks)
  • tests/integration/bigquery/test_customer_decision_hierarchy.py (1 hunks)
  • tests/integration/bigquery/test_gain_loss_analysis.py (1 hunks)
  • tests/integration/bigquery/test_haversine.py (1 hunks)
  • tests/integration/bigquery/test_hml_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_product_association.py (1 hunks)
  • tests/integration/bigquery/test_revenue_tree.py (1 hunks)
  • tests/integration/bigquery/test_rfm_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_segstats_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_threshold_segmentation.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/bigquery/test_rfm_segmentation.py
🚧 Files skipped from review as they are similar to previous changes (12)
  • pyretailscience/segmentation/threshold.py
  • tests/integration/bigquery/test_customer_decision_hierarchy.py
  • tests/integration/bigquery/test_cohort_analysis.py
  • tests/integration/bigquery/test_haversine.py
  • tests/integration/bigquery/test_revenue_tree.py
  • tests/integration/bigquery/test_cross_shop.py
  • tests/integration/bigquery/test_threshold_segmentation.py
  • tests/integration/bigquery/test_composite_rank.py
  • tests/integration/bigquery/test_hml_segmentation.py
  • tests/integration/bigquery/test_product_association.py
  • .github/workflows/bigquery-integration.yml
  • tests/integration/bigquery/test_segstats_segmentation.py
🧰 Additional context used
🪛 LanguageTool
README.md

[uncategorized] ~275-~275: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)

🔇 Additional comments (10)
README.md (3)

213-277: Comprehensive BigQuery integration tests documentation added.

This is a well-structured documentation section that clearly explains the purpose, prerequisites, and methods for running the BigQuery integration tests. The section provides both manual setup instructions and GitHub Actions workflow guidance.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~275-~275: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)


248-249: Command matches previous reviewer's suggestion.

The command uv sync aligns with murray-ds's previous comment suggesting this format.


223-223: Documentation follows previous reviewer's suggestion.

You've implemented murray-ds's suggestion to not list all the modules, which helps reduce maintenance overhead as new modules are added.

tests/integration/bigquery/test_gain_loss_analysis.py (7)

1-11: Well-structured imports and module documentation.

The file has appropriate imports and clear module-level documentation explaining the purpose of these integration tests.


16-38: Robust fixture with proper error handling.

The BigQuery connection fixture includes appropriate error handling for GoogleCloudError exceptions, ensuring tests are skipped gracefully when data cannot be loaded.
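A minimal sketch of that skip-on-error pattern (the fixture name, row limit, and upstream transactions_table fixture are assumptions rather than the PR's exact code):

```python
import pytest
from google.cloud.exceptions import GoogleCloudError


@pytest.fixture
def transactions_df(transactions_table):
    """Materialize a bounded sample, skipping gracefully on BigQuery errors."""
    try:
        return transactions_table.limit(5000).execute()
    except GoogleCloudError as e:
        pytest.skip(f"Could not load BigQuery data: {e}")
```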


41-104: Comprehensive instantiation test with thorough assertions.

This test effectively verifies that the GainLoss class works with BigQuery data, including proper validation of the resulting dataframes and expected columns.


106-158: Well-designed parametrized group column test.

The test covers different grouping scenarios and properly verifies the structure of both regular column lists and MultiIndex columns in the results.


160-256: Thorough testing of customer group processing logic.

This parametrized test provides comprehensive coverage of various customer movement scenarios, with appropriate assertions for each case.


258-312: Good coverage of aggregation functions and value columns.

The test ensures that GainLoss works correctly with different aggregation functions and value columns, including proper handling of missing columns.


277-281: Handles missing columns gracefully.

The code appropriately creates the transaction_count column when needed and skips tests when required columns are missing, providing good test robustness.

Comment on lines +50 to +51
uv sync
uv sync --group dev
Contributor

You don't need both here. uv sync will also install the dev group.

Suggested change
uv sync
uv sync --group dev
uv sync

@@ -25,7 +25,7 @@ repos:
hooks:
- id: pytest
name: pytest
entry: uv run pytest --cov=pyretailscience --cov-report=xml --cov-branch tests
entry: uv run pytest --cov=pyretailscience --cov-report=xml --cov-branch tests --ignore=tests/integration/bigquery
Contributor

Eventually we'll also have other DB integration tests. Maybe we just ignore the whole tests/integration directory?

Suggested change
entry: uv run pytest --cov=pyretailscience --cov-report=xml --cov-branch tests --ignore=tests/integration/bigquery
entry: uv run pytest --cov=pyretailscience --cov-report=xml --cov-branch tests --ignore=tests/integration

@@ -83,14 +83,11 @@ def __init__(
window = ibis.window(order_by=ibis.asc(df[value_col]))
df = df.mutate(ptile=ibis.percent_rank().over(window))

case = ibis.case()

case_args = []
Contributor

I think you can do this with a list comprehension, i.e.

Suggested change
case_args = []
case_args = [(df["ptile"] <= quantile, segment) for quantile, segment in zip(thresholds, segments, strict=True)]
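For context, a self-contained sketch of how that comprehension feeds ibis.cases(), the tuple-branch API this PR adopts (the toy memtable, segment names, and mutate target are illustrative, and executing it assumes a local default backend such as DuckDB):

```python
import ibis

t = ibis.memtable({"ptile": [0.1, 0.5, 0.9]})
thresholds = [0.33, 0.66, 1.0]
segments = ["Low", "Medium", "High"]

# Build (condition, result) branches, then collapse them in one call.
case_args = [
    (t["ptile"] <= quantile, segment)
    for quantile, segment in zip(thresholds, segments, strict=True)
]
t = t.mutate(segment_name=ibis.cases(*case_args))
print(t.execute())
```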

.env_sample Outdated
Contributor

Can you please rename this file to .test_env_sample so it's clear it's only for testing purposes.

Comment on lines 25 to 36
def test_invalid_period(self, transactions_table):
"""Test if an invalid period raises an error."""
invalid_period = "m"
with pytest.raises(
ValueError,
match=f"Invalid period '{invalid_period}'. Allowed values: {CohortAnalysis.VALID_PERIODS}",
):
CohortAnalysis(
df=transactions_table,
aggregation_column="unit_spend",
period=invalid_period,
)
Contributor

Do we need this test? This is only testing Ibis functionality, correct?

Comment on lines 216 to 247
def test_ibis_bigquery_brand_association(self, transactions_table):
"""Test brand association analysis using BigQuery backend.

This ensures that brand-level association works with BigQuery.
"""
value_col = "brand_name"
query = transactions_table.select(
"transaction_id",
value_col,
).limit(1000)

df = query.execute()

if df[value_col].isna().mean() > MAX_NULL_RATIO or df[value_col].nunique() < MIN_UNIQUE_CATEGORIES:
pytest.skip("Brand data not suitable for association testing")

associations = ProductAssociation(
df=df,
value_col=value_col,
group_col="transaction_id",
min_cooccurrences=1,
)

result = associations.df
if hasattr(result, "execute"):
result = result.execute()

assert isinstance(result, pd.DataFrame)

if len(result) > 0:
assert f"{value_col}_1" in result.columns
assert f"{value_col}_2" in result.columns
Contributor

Isn't this test already covered by the above?

"transaction_date",
]
query = transactions_table.select(required_cols).limit(10)
df = query.execute()
Contributor

This isn't testing the function with BQ. You need to pass the Ibis table so the aggregations in RevenueTree run in BigQuery.

Comment on lines 74 to 98
def test_ibis_bigquery_with_quantity(self, cols, transactions_table):
"""Test that RevenueTree works with additional quantity column in BigQuery."""
required_cols = [
"customer_id",
"unit_spend",
"transaction_id",
"transaction_date",
"unit_quantity",
]
query = transactions_table.select(required_cols).limit(15)
df = query.execute()

middle_index = len(df) // 2
df["period"] = ["P1"] * middle_index + ["P2"] * (len(df) - middle_index)

rt = RevenueTree(
df=df,
period_col="period",
p1_value="P1",
p2_value="P2",
)

assert rt.df is not None
assert cols.calc_units_per_trans_contrib in rt.df.columns
assert cols.calc_price_per_unit_contrib in rt.df.columns
Contributor

I don't think this tests anything additional.

Comment on lines 100 to 123
def test_ibis_bigquery_uneven_periods(self, cols, transactions_table):
"""Test that RevenueTree works with uneven period data in BigQuery."""
required_cols = [
"customer_id",
"unit_spend",
"transaction_id",
"transaction_date",
"brand_id",
]
query = transactions_table.select(required_cols).limit(15)
df = query.execute()

df["period"] = ["P1"] * 10 + ["P2"] * (len(df) - 10)

rt = RevenueTree(
df=df,
period_col="period",
p1_value="P1",
p2_value="P2",
group_col="brand_id",
)

assert rt.df is not None
assert len(rt.df) > 0
Contributor

Why might uneven periods cause an issue with the BQ backend?

Comment on lines 125 to 151
def test_ibis_bigquery_column_selection(self, cols, transactions_table):
"""Test that RevenueTree works with specific column selection in BigQuery."""
query = transactions_table.select(
[
"customer_id",
"unit_spend",
"transaction_id",
"transaction_date",
],
).limit(10)
df = query.execute()

middle_index = len(df) // 2
df["period"] = ["P1"] * middle_index + ["P2"] * (len(df) - middle_index)

rt = RevenueTree(
df=df,
period_col="period",
p1_value="P1",
p2_value="P2",
)

assert rt.df is not None
assert cols.agg_customer_id_p1 in rt.df.columns
assert cols.agg_customer_id_p2 in rt.df.columns
assert cols.agg_unit_spend_p1 in rt.df.columns
assert cols.agg_unit_spend_p2 in rt.df.columns
Contributor

Why do we need to test this?

@mayurkmmt mayurkmmt force-pushed the feature/bigquery-integration-analysis branch from a9a6b58 to 00783ab Compare May 15, 2025 13:21
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
tests/integration/bigquery/test_customer_decision_hierarchy.py (3)

11-19: Good use of parametrization to test various combinations.

The test effectively covers multiple product columns, both methods (truncated_svd and yules_q), and both values for exclude_same_transaction_products as requested in the previous review comment.

Consider adding one more parameter combination to test yules_q with exclude_same_transaction=False for more complete coverage.


36-42: Consider validating the output structure.

Currently, the test only verifies that the class initializes without errors. Consider adding assertions to verify that the resulting object's structure is as expected (e.g., check that distances are calculated, pairs_df is populated).

Example addition:

        CustomerDecisionHierarchy(
            df=transactions_df,
            product_col=product_col,
            exclude_same_transaction_products=exclude_same_transaction,
            method=method,
            min_var_explained=min_var_explained if min_var_explained is not None else 0.8,
        )
+       # Add assertions to verify that the object was initialized correctly
+       cdh = CustomerDecisionHierarchy(
+           df=transactions_df,
+           product_col=product_col,
+           exclude_same_transaction_products=exclude_same_transaction,
+           method=method,
+           min_var_explained=min_var_explained if min_var_explained is not None else 0.8,
+       )
+       # Verify that the object has the expected structure
+       assert cdh.pairs_df is not None and len(cdh.pairs_df) > 0, "pairs_df should be populated"
+       assert cdh.distances is not None, "distances should be calculated"

43-44: Consider catching more specific exceptions.

While using a broad exception handler is acceptable for tests, catching more specific exceptions would provide better insights into what's failing.

-    except Exception as e:  # noqa: BLE001
+    except (ValueError, TypeError, ImportError) as e:
         pytest.fail(f"CustomerDecisionHierarchy failed with product_col={product_col}, method={method}: {e}")
README.md (3)

232-233: Consider adding more details about the test dataset.

It would be helpful to provide more information about the required test dataset structure or schema to help users set up their test environment correctly.

Consider adding:

  • Information about the schema of the transactions table
  • Minimum required columns based on the tests
  • A sample schema definition or data generation script

240-243: Clarify environment variable usage.

The manual setup uses GOOGLE_APPLICATION_CREDENTIALS while GitHub Actions uses GCP_SA_KEY. Consider explaining the reason for this difference.

- Set up authentication:
+ Set up authentication (for manual testing):

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-key.json
export GCP_PROJECT_ID=your-project-id
```

  • Note: GOOGLE_APPLICATION_CREDENTIALS is the standard environment variable used by Google Cloud client libraries, while GitHub Actions uses GCP_SA_KEY to store the same information as a repository secret.

272-276: Fix formatting of list items.

There's a formatting issue with the list of GitHub secrets.

```
To run the workflow in GitHub Actions, add these secrets to your repository:

- `GCP_SA_KEY`: The entire JSON content of your GCP service account key file
- `GCP_PROJECT_ID`: Your GCP project ID
```
🧰 Tools
🪛 LanguageTool

[uncategorized] ~275-~275: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between a9a6b58 and 00783ab.

📒 Files selected for processing (15)
  • .github/workflows/bigquery-integration.yml (1 hunks)
  • README.md (2 hunks)
  • pyretailscience/segmentation/threshold.py (1 hunks)
  • tests/integration/bigquery/test_cohort_analysis.py (1 hunks)
  • tests/integration/bigquery/test_composite_rank.py (1 hunks)
  • tests/integration/bigquery/test_cross_shop.py (1 hunks)
  • tests/integration/bigquery/test_customer_decision_hierarchy.py (1 hunks)
  • tests/integration/bigquery/test_gain_loss_analysis.py (1 hunks)
  • tests/integration/bigquery/test_haversine.py (1 hunks)
  • tests/integration/bigquery/test_hml_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_product_association.py (1 hunks)
  • tests/integration/bigquery/test_revenue_tree.py (1 hunks)
  • tests/integration/bigquery/test_rfm_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_segstats_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_threshold_segmentation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
  • pyretailscience/segmentation/threshold.py
  • tests/integration/bigquery/test_threshold_segmentation.py
  • tests/integration/bigquery/test_cross_shop.py
  • tests/integration/bigquery/test_segstats_segmentation.py
  • tests/integration/bigquery/test_haversine.py
  • tests/integration/bigquery/test_rfm_segmentation.py
  • tests/integration/bigquery/test_product_association.py
  • .github/workflows/bigquery-integration.yml
  • tests/integration/bigquery/test_cohort_analysis.py
  • tests/integration/bigquery/test_hml_segmentation.py
  • tests/integration/bigquery/test_composite_rank.py
  • tests/integration/bigquery/test_gain_loss_analysis.py
  • tests/integration/bigquery/test_revenue_tree.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/integration/bigquery/test_customer_decision_hierarchy.py (3)
pyretailscience/analysis/customer_decision_hierarchy.py (1)
  • CustomerDecisionHierarchy (16-289)
pyretailscience/options.py (1)
  • ColumnHelper (408-560)
tests/integration/bigquery/conftest.py (1)
  • transactions_table (31-33)
🪛 LanguageTool
README.md

[uncategorized] ~275-~275: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)

🔇 Additional comments (2)
tests/integration/bigquery/test_customer_decision_hierarchy.py (1)

33-33:

❓ Verification inconclusive

Verify data limit sufficiency.

The 5000 row limit may be sufficient for testing initialization, but consider whether it provides enough data density for meaningful analysis results, especially for sparse customer-product relationships.

You could run a quick check to ensure the data contains sufficient variety:


🏁 Script executed:

#!/bin/bash
# Check if the limited dataset has enough variety for meaningful tests
count_unique=$(cat <<EOF | python -
import sys
import pandas as pd

# Create a simple dataframe with sample data for illustration
df = pd.DataFrame({
    'customer_id': range(10),
    'product_name': [f'product_{i % 3}' for i in range(10)],
    'transaction_id': range(10)
})

# Print example counts that would be useful to check in the actual test data
print("Sample counts that would be useful to check in your actual BigQuery data:")
print(f"- Unique customers: {df['customer_id'].nunique()}")
print(f"- Unique products: {df['product_name'].nunique()}")
print(f"- Unique transactions: {df['transaction_id'].nunique()}")
print(f"- Average products per customer: {len(df) / df['customer_id'].nunique():.2f}")
EOF
)

echo "$count_unique"

Length of output: 164


Verify dataset variety in tests/integration/bigquery/test_customer_decision_hierarchy.py:33

The hard-coded limit of 5,000 rows may not cover enough unique customer-product relationships for meaningful assertions. Please manually run a query against your test BigQuery table to confirm:

WITH sample AS (
  SELECT *
  FROM `your_project.your_dataset.transactions_table`
  LIMIT 5000
),
per_customer AS (
  SELECT customer_id, COUNT(*) AS txn_count
  FROM sample
  GROUP BY customer_id
)
SELECT
  (SELECT COUNT(DISTINCT customer_id) FROM sample)  AS unique_customers,
  (SELECT COUNT(DISTINCT product_name) FROM sample) AS unique_products,
  (SELECT AVG(txn_count) FROM per_customer)         AS avg_txns_per_customer;

Adjust the row limit or add more diverse test data if these metrics fall below your expected thresholds.

README.md (1)

217-219: Clear and concise introduction to the BigQuery integration tests.

The overview effectively explains the purpose of these tests and their importance in ensuring compatibility with BigQuery.

Comment on lines 16 to 18
df = transactions_table.to_pandas()

df["spend_per_customer"] = df["unit_spend"] / df["customer_id"]
Contributor

This needs to be done in Ibis and passed as an ibis table.

("product_col", "method", "min_var_explained", "exclude_same_transaction"),
[
("product_name", "truncated_svd", 0.8, True),
("category_1_name", "yules_q", None, True),
Contributor

We also need a version where method="yules_q" and exclude_same_transaction=False
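For illustration, the parameter list above could gain one row, along these lines (a sketch; the product column choice is arbitrary and the test body is unchanged from the existing test):

```python
import pytest


@pytest.mark.parametrize(
    ("product_col", "method", "min_var_explained", "exclude_same_transaction"),
    [
        ("product_name", "truncated_svd", 0.8, True),
        ("category_1_name", "yules_q", None, True),
        ("category_1_name", "yules_q", None, False),  # the missing combination
    ],
)
def test_cdh_methods(product_col, method, min_var_explained, exclude_same_transaction):
    ...  # body as in the existing test
```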



@pytest.mark.parametrize(
("aggregation_column", "agg_func", "period", "percentage"),
Contributor

Do we need to test with different columns? Does that affect Ibis functionality? I get that it might give the wrong answer if we're not correctly switching the column, but I don't think that will cause an Ibis failure.

Comment on lines 15 to 20
("brand_name", "category_1_name", None, "unit_spend", "sum"),
("category_0_name", "brand_name", None, "unit_quantity", "max"),
("category_1_name", "category_0_name", None, "unit_cost", "mean"),
# Test with 3 groups
("brand_name", "category_0_name", "category_1_name", "unit_spend", "count"),
("category_0_name", "category_1_name", "brand_name", "unit_quantity", "mean"),
Contributor

Do we need to test with different columns? Does that affect Ibis functionality? I get that it might give the wrong answer if we're not correctly switching the column, but I don't think that will cause an Ibis failure.



@pytest.mark.parametrize(
("focus_brand_id", "comparison_brand_id", "group_col", "value_col,agg_func"),
Contributor

Do we need to test with different group columns? Does that affect Ibis functionality? I get that it might give the wrong answer if we're not correctly switching the column, but I don't think that will cause an Ibis failure.

Same thing with the different brand IDs

@pytest.mark.parametrize(
("value_col", "group_col", "target_item"),
[
("product_name", "customer_id", None),
Contributor

Do we need to test with different columns? Does that affect Ibis functionality? I get that it might give the wrong answer if we're not correctly switching the column, but I don't think that will cause an Ibis failure. Target item is good, though.

Comment on lines 45 to 52
@pytest.mark.parametrize(
("min_occurrences", "min_cooccurrences", "min_support", "min_confidence", "min_uplift"),
[
(1, 1, 0.0, 0.0, 0.0),
(5, 3, 0.01, 0.05, 1.0),
(10, 5, 0.02, 0.1, 1.5),
],
)
Contributor

I think it's enough to test with one set of values, so we don't need parametrize here.

Comment on lines 16 to 17
("transaction_date", "brand_name"),
("transaction_date", "store_id"),
Contributor

I think it's enough to test with None and one group_col value
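i.e., a reduced parametrization roughly like this sketch (parameter names are assumed from the excerpt above, with brand_name kept as the single non-None example):

```python
import pytest


@pytest.mark.parametrize(
    ("period_col", "group_col"),
    [
        ("transaction_date", None),          # no grouping
        ("transaction_date", "brand_name"),  # one representative group column
    ],
)
def test_revenue_tree_group_col(period_col, group_col):
    ...  # body unchanged from the existing test
```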

limited_transactions = transactions_table.limit(10000)

try:
period_values = limited_transactions.select(period_col).distinct().limit(2).execute()
Contributor

This needs to be an ibis table
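One reading of this comment, sketched: extract the two period values, but keep the RevenueTree input as the Ibis expression so the aggregations run in BigQuery (names follow the excerpt above; this is illustrative, not the PR's code):

```python
# Pick two distinct period values, then pass the Ibis table, not a DataFrame.
periods = limited_transactions.select(period_col).distinct().limit(2).execute()
p1_value, p2_value = periods[period_col].tolist()

rt = RevenueTree(
    df=limited_transactions,  # Ibis table: aggregations execute in BigQuery
    period_col=period_col,
    p1_value=p1_value,
    p2_value=p2_value,
)
```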

@mayurkmmt mayurkmmt force-pushed the feature/bigquery-integration-analysis branch from 6fe7d14 to 54db275 Compare May 16, 2025 13:19
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/integration/bigquery/test_threshold_segmentation.py (1)

1-39: Test design is good, but consider enhancing assertions.

The parameterized test correctly verifies that ThresholdSegmentation can process BigQuery data with different zero_value_handling options without exceptions. The test is properly structured and uses parameterization as requested in previous reviews.

However, the test could be enhanced to verify that the segmentation produces expected results rather than just checking for exceptions. Consider adding assertions that validate the segment distribution or other key properties of the segmentation output.

You could improve this by:

        ThresholdSegmentation(
            df=limited_table,
            thresholds=[0.33, 0.66],
            segments=["Low", "High"],
            value_col=cols.unit_spend,
            agg_func="mean",
            zero_value_customers=zero_value_handling,
-        )
+        ).table.to_pandas().head()  # Verify we can execute the table
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 6fe7d14 and 54db275.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
  • .env_sample (1 hunks)
  • .github/workflows/bigquery-integration.yml (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • README.md (2 hunks)
  • pyproject.toml (1 hunks)
  • pyretailscience/segmentation/threshold.py (1 hunks)
  • tests/integration/bigquery/conftest.py (1 hunks)
  • tests/integration/bigquery/test_cohort_analysis.py (1 hunks)
  • tests/integration/bigquery/test_composite_rank.py (1 hunks)
  • tests/integration/bigquery/test_cross_shop.py (1 hunks)
  • tests/integration/bigquery/test_customer_decision_hierarchy.py (1 hunks)
  • tests/integration/bigquery/test_haversine.py (1 hunks)
  • tests/integration/bigquery/test_hml_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_product_association.py (1 hunks)
  • tests/integration/bigquery/test_revenue_tree.py (1 hunks)
  • tests/integration/bigquery/test_rfm_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_segstats_segmentation.py (1 hunks)
  • tests/integration/bigquery/test_threshold_segmentation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (15)
  • .env_sample
  • pyproject.toml
  • tests/integration/bigquery/test_hml_segmentation.py
  • .pre-commit-config.yaml
  • tests/integration/bigquery/test_cohort_analysis.py
  • pyretailscience/segmentation/threshold.py
  • tests/integration/bigquery/test_customer_decision_hierarchy.py
  • tests/integration/bigquery/test_rfm_segmentation.py
  • tests/integration/bigquery/test_product_association.py
  • tests/integration/bigquery/test_cross_shop.py
  • tests/integration/bigquery/test_segstats_segmentation.py
  • tests/integration/bigquery/test_haversine.py
  • tests/integration/bigquery/test_revenue_tree.py
  • tests/integration/bigquery/conftest.py
  • .github/workflows/bigquery-integration.yml
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/integration/bigquery/test_threshold_segmentation.py (3)
pyretailscience/options.py (1)
  • ColumnHelper (408-560)
pyretailscience/segmentation/threshold.py (2)
  • ThresholdSegmentation (22-102)
  • df (98-102)
tests/integration/bigquery/conftest.py (1)
  • transactions_table (31-33)
🪛 LanguageTool
README.md

[uncategorized] ~275-~275: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)

🔇 Additional comments (1)
README.md (1)

213-277: Documentation for BigQuery integration tests is clear and complete.

The new section provides comprehensive guidance on running the BigQuery integration tests, both manually and through GitHub Actions. The instructions are well-structured with clear prerequisites, setup steps, and execution commands.

The changes address previous review feedback by using uv sync for dependency installation and not listing all individual modules covered by the tests.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~275-~275: Loose punctuation mark.
Context: ...rets to your repository: - GCP_SA_KEY: The entire JSON content of your GCP ser...

(UNLIKELY_OPENING_PUNCTUATION)

Comment on lines +16 to +18
ibis_table = transactions_table.mutate(
spend_per_customer=transactions_table.unit_spend / transactions_table.customer_id,
)

⚠️ Potential issue

Creating spend_per_customer using division by customer_id looks incorrect.

Dividing unit_spend by customer_id is mathematically questionable, as customer IDs are typically not meaningful numerical values but rather identifiers. This calculation won't produce a meaningful "spend per customer" metric.

Consider one of these alternatives:

-        ibis_table = transactions_table.mutate(
-            spend_per_customer=transactions_table.unit_spend / transactions_table.customer_id,
-        )
+        # Option 1: Create a more meaningful derived metric
+        ibis_table = transactions_table.mutate(
+            spend_per_customer=transactions_table.unit_spend / transactions_table.quantity,
+        )

Or if you need a synthetic column just for testing:

-        ibis_table = transactions_table.mutate(
-            spend_per_customer=transactions_table.unit_spend / transactions_table.customer_id,
-        )
+        # Option 2: Create a synthetic column that's more clearly for testing purposes
+        ibis_table = transactions_table.mutate(
+            spend_per_customer=transactions_table.unit_spend * 0.5,  # Arbitrary calculation for testing
+        )
🤖 Prompt for AI Agents
In tests/integration/bigquery/test_composite_rank.py around lines 16 to 18, the
calculation of spend_per_customer by dividing unit_spend by customer_id is
incorrect because customer_id is an identifier, not a numeric value for
division. To fix this, replace the division by customer_id with a meaningful
aggregation or calculation, such as summing unit_spend grouped by customer_id or
using a constant or synthetic value if the column is only needed for testing
purposes.

Comment on lines +45 to +58
@pytest.mark.parametrize("ignore_ties", [False, True])
def test_tie_handling(test_transactions_table, ignore_ties):
"""Test handling of ties during rank calculation."""
rank_cols = [("unit_spend", "desc")]
try:
result = CompositeRank(
df=test_transactions_table,
rank_cols=rank_cols,
agg_func="mean",
ignore_ties=ignore_ties,
)
assert result is not None
except Exception as e: # noqa: BLE001
pytest.fail(f"CompositeRank failed with ignore_ties={ignore_ties}: {e}")

🛠️ Refactor suggestion

Parameterization is well implemented, but test lacks meaningful assertions.

The parameterized test for tie handling is a good approach but, like the basic test, it only verifies that the CompositeRank class can be instantiated without errors. It does not verify that the tie handling is working as expected with different ignore_ties values.

Consider adding assertions that specifically validate the tie handling behavior:

        result = CompositeRank(
            df=test_transactions_table,
            rank_cols=rank_cols,
            agg_func="mean",
            ignore_ties=ignore_ties,
        )
        assert result is not None
+        
+        # Execute the table to get results
+        result_df = result.table.to_pandas()
+        
+        # Find records with tied values to verify tie handling
+        if ignore_ties:
+            # With ignore_ties=True, ranks should be unique
+            assert result_df["rank"].nunique() == len(result_df)
+        else:
+            # With ignore_ties=False, we might have duplicate ranks for tied values
+            # This is a basic check to ensure the behavior is different
+            # A more comprehensive test would look for specific tied values
+            group_sizes = result_df.groupby("unit_spend").size()
+            has_ties = any(size > 1 for size in group_sizes)
+            if has_ties:
+                # If there are ties in the data, we should have some duplicate ranks
+                assert result_df["rank"].nunique() < len(result_df)
🤖 Prompt for AI Agents
In tests/integration/bigquery/test_composite_rank.py around lines 45 to 58, the
test_tie_handling function only checks that CompositeRank instantiates without
error but does not verify correct tie handling behavior. Enhance the test by
adding assertions that confirm the ranking results reflect the expected behavior
when ignore_ties is True versus False, such as checking if tied ranks are
assigned correctly or if ties are ignored as intended. This will ensure the
parameterized test validates the functional impact of the ignore_ties parameter.
