
Commit f51ea78

feat: Setup GCP BigQuery Integration Tests for Analysis Modules
1 parent 446b876 commit f51ea78

17 files changed: 2,665 additions, 2 deletions

.env_sample

Lines changed: 1 addition & 0 deletions
GCP_PROJECT_ID=
Lines changed: 66 additions & 0 deletions
name: BigQuery Integration Tests

on:
  workflow_dispatch:
    inputs:
      test_suite:
        type: choice
        description: Test Suite to Run
        default: "all"
        options:
          - all
          - cohort_analysis
          - composite_rank
          - cross_shop
          - customer_decision_hierarchy
          - hml_segmentation
          - product_association
          - revenue_tree
          - rfm_segmentation
          - segstats_segmentation
          - threshold_segmentation

permissions:
  contents: read

concurrency:
  group: "bigquery-tests"
  cancel-in-progress: true

jobs:
  integration-tests:
    name: Run BigQuery Integration Tests
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install uv Package
        run: |
          pip install --upgrade pip
          pip install uv==0.5.30

      - name: Install Dependencies
        run: |
          uv sync
          uv sync --group dev

      - name: Set up GCP Authentication
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Set up Google Cloud SDK
        uses: google-github-actions/setup-gcloud@v2

      - name: Run Integration Tests
        env:
          TEST_SUITE: ${{ inputs.test_suite }}
        run: |
          uv run pytest tests/integration/bigquery -v \
            $(if [ "$TEST_SUITE" != "all" ]; then echo "-k $TEST_SUITE"; fi)
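The final step assembles the pytest command dynamically: when a suite other than `all` is selected, the shell substitution appends a `-k` keyword filter (selecting `cohort_analysis`, for example, yields `uv run pytest tests/integration/bigquery -v -k cohort_analysis`); with the default `all`, the filter is omitted and the whole directory runs.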

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ repos:
     hooks:
       - id: pytest
         name: pytest
-        entry: uv run pytest --cov=pyretailscience --cov-report=xml --cov-branch tests
+        entry: uv run pytest --cov=pyretailscience --cov-report=xml --cov-branch tests --ignore=tests/integration/bigquery
         language: system
         types: [python]
         pass_filenames: false

README.md

Lines changed: 99 additions & 0 deletions
@@ -1,3 +1,4 @@
+<!-- README.md -->
 ![PyRetailScience Logo](https://raw.githubusercontent.com/Data-Simply/pyretailscience/main/readme_assets/logo.png)

 # PyRetailScience
@@ -208,3 +209,101 @@

## License

This project is licensed under the Elastic License 2.0 - see the [LICENSE](LICENSE) file for details.

## BigQuery Integration Tests

### Overview

The `tests/integration/bigquery` directory contains integration tests that verify all PyRetailScience analysis modules work correctly with Google BigQuery as a backend. These tests confirm that the Ibis-based code paths function correctly when connected to BigQuery.
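The shape of these tests is simple: connect with Ibis, grab the table, and hand it to an analysis class. A minimal sketch, assuming `GCP_PROJECT_ID` is set and the `test_data.transactions` table exists:

```python
import os

import ibis

from pyretailscience.analysis.cohort import CohortAnalysis

# Connect to BigQuery through Ibis; credentials come from
# GOOGLE_APPLICATION_CREDENTIALS, the project from GCP_PROJECT_ID.
conn = ibis.bigquery.connect(project_id=os.environ["GCP_PROJECT_ID"])
transactions = conn.table("test_data.transactions")

# Analysis classes accept the Ibis table in place of a pandas DataFrame;
# the heavy lifting is compiled to SQL and executed in BigQuery.
cohort = CohortAnalysis(
    df=transactions,
    aggregation_column="unit_spend",
    agg_func="nunique",
    period="month",
)
print(cohort.table.head())  # cohort.table materializes as a pandas DataFrame
```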
### Test Coverage

The integration tests cover the following analysis modules:

- **Cohort Analysis** - Tests customer cohort retention metrics
- **Cross Shop Analysis** - Tests product/category cross-shopping patterns
- **Customer Analysis** - Tests customer lifetime value and purchase frequency metrics
- **Gain Loss Analysis** - Tests comparative performance analysis
- **Haversine Analysis** - Tests geographic distance calculations
- **Product Association Analysis** - Tests market basket analysis
- **Customer Decision Hierarchy** - Tests customer purchase decision patterns
- **Revenue Tree Analysis** - Tests hierarchical revenue breakdowns
- **Composite Rank Analysis** - Tests weighted ranking of entities
- **Segmentation Analysis** - Tests RFM and value-frequency customer segmentation
### Prerequisites

To run these tests, you need:

1. Access to a Google Cloud Platform account
2. A service account with BigQuery permissions
3. The service account key JSON file
4. The test dataset loaded in BigQuery (dataset: `test_data`, table: `transactions`)
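A quick way to check items 2-4 before running the suite is to query the table metadata directly; a minimal sketch, assuming the environment variables from the next section are already set:

```python
import os

from google.cloud import bigquery

# Uses GOOGLE_APPLICATION_CREDENTIALS for auth and GCP_PROJECT_ID for the project.
client = bigquery.Client(project=os.environ["GCP_PROJECT_ID"])

# Raises NotFound if the dataset/table is missing, Forbidden if permissions are wrong.
table = client.get_table("test_data.transactions")
print(f"{table.full_table_id}: {table.num_rows} rows")
```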
### Running the Tests

#### Manual Setup

- Set up authentication:

  ```bash
  export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-key.json
  export GCP_PROJECT_ID=your-project-id
  ```

- Install dependencies:

  ```bash
  uv pip install -e .
  uv pip install "ibis-framework[bigquery]>=10.0.0,<11"
  ```

- Run the tests:

  ```bash
  # Run all tests
  uv run pytest tests/integration/bigquery -v

  # Run a specific test module
  uv run pytest tests/integration/bigquery/test_cohort_analysis.py -v
  ```
### Using GitHub Actions

These tests can be run manually in GitHub Actions via the "BigQuery Integration Tests" workflow. To run:

1. Go to the "Actions" tab in the GitHub repository
2. Select the "BigQuery Integration Tests" workflow
3. Click "Run workflow"
4. Choose a test suite from the dropdown (e.g., `cohort_analysis`), or keep the default `all`
5. Click "Run workflow"

#### Required Secrets

To run the workflow in GitHub Actions, add these secrets to your repository:

- `GCP_SA_KEY`: The entire JSON content of your GCP service account key file
- `GCP_PROJECT_ID`: Your GCP project ID
### Test Data

The tests expect a BigQuery dataset named `test_data` with a table named `transactions` containing the following columns:

- `transaction_id`
- `transaction_date`
- `transaction_time`
- `customer_id`
- `product_id`
- `product_name`
- `category_0_name`
- `category_0_id`
- `category_1_name`
- `category_1_id`
- `brand_name`
- `brand_id`
- `unit_quantity`
- `unit_cost`
- `unit_spend`
- `store_id`
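If the table does not exist yet, it can be created with the `google-cloud-bigquery` client. A sketch follows; the project ID is a placeholder and the column types are assumptions inferred from the names, so adjust both to match your source data:

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # hypothetical project ID
client.create_dataset("test_data", exists_ok=True)

# Assumed types; only the column names are specified by the tests.
schema = [
    bigquery.SchemaField("transaction_id", "STRING"),
    bigquery.SchemaField("transaction_date", "DATE"),
    bigquery.SchemaField("transaction_time", "TIME"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("product_id", "STRING"),
    bigquery.SchemaField("product_name", "STRING"),
    bigquery.SchemaField("category_0_name", "STRING"),
    bigquery.SchemaField("category_0_id", "STRING"),
    bigquery.SchemaField("category_1_name", "STRING"),
    bigquery.SchemaField("category_1_id", "STRING"),
    bigquery.SchemaField("brand_name", "STRING"),
    bigquery.SchemaField("brand_id", "STRING"),
    bigquery.SchemaField("unit_quantity", "INTEGER"),
    bigquery.SchemaField("unit_cost", "FLOAT"),
    bigquery.SchemaField("unit_spend", "FLOAT"),
    bigquery.SchemaField("store_id", "STRING"),
]
table = bigquery.Table("your-project-id.test_data.transactions", schema=schema)
client.create_table(table, exists_ok=True)
```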

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -27,11 +27,13 @@ name = "Murray Vanwyk"
 [dependency-groups]
 dev = [
     "freezegun>=1.5.1,<2",
+    "ibis-framework[bigquery]>=10.0.0,<11",
     "nbstripout>=0.7.1,<0.8",
     "pre-commit>=3.6.2,<4",
     "pytest-cov>=4.1.0,<5",
     "pytest-mock>=3.14.0,<4",
     "pytest>=8.0.0,<9",
+    "python-dotenv>=1.0.0,<2",
     "ruff>=0.9,<0.10",
     "tomlkit>=0.12,<1",
 ]
Lines changed: 33 additions & 0 deletions
"""BigQuery integration test fixtures."""

import os

import ibis
import pytest
from dotenv import load_dotenv
from loguru import logger

# Pull GCP_PROJECT_ID (and friends) from a local .env file when present.
load_dotenv()


@pytest.fixture(scope="session")
def bigquery_connection():
    """Connect to BigQuery for integration tests."""
    try:
        conn = ibis.bigquery.connect(
            project_id=os.environ.get("GCP_PROJECT_ID"),
        )
        logger.info("Connected to BigQuery")
    except Exception as e:
        logger.error(f"Failed to connect to BigQuery: {e}")
        raise
    else:
        return conn


@pytest.fixture(scope="session")
def transactions_table(bigquery_connection):
    """Get the transactions table for testing."""
    return bigquery_connection.table("test_data.transactions")
Lines changed: 49 additions & 0 deletions
"""Integration tests for Cohort Analysis with BigQuery."""

import pandas as pd
import pytest

from pyretailscience.analysis.cohort import CohortAnalysis


class TestCohortAnalysisBigQuery:
    """Integration tests for Cohort Analysis using real BigQuery data."""

    def test_cohort_computation_bigquery(self, transactions_table):
        """Tests cohort computation logic using BigQuery data."""
        cohort = CohortAnalysis(
            df=transactions_table,
            aggregation_column="unit_spend",
            agg_func="nunique",
            period="month",
            percentage=False,
        )
        result = cohort.table
        assert not result.empty, "Cohort table should not be empty for valid BigQuery data"
        assert isinstance(result, pd.DataFrame)

    def test_invalid_period(self, transactions_table):
        """Test that an invalid period raises an error."""
        invalid_period = "m"
        with pytest.raises(
            ValueError,
            match=f"Invalid period '{invalid_period}'. Allowed values: {CohortAnalysis.VALID_PERIODS}",
        ):
            CohortAnalysis(
                df=transactions_table,
                aggregation_column="unit_spend",
                period=invalid_period,
            )

    def test_cohort_percentage(self, transactions_table):
        """Tests cohort analysis with percentage=True."""
        cohort = CohortAnalysis(
            df=transactions_table,
            aggregation_column="unit_spend",
            agg_func="sum",
            period="month",
            percentage=True,
        )
        result = cohort.table
        assert not result.empty
        assert result.max().max() <= 1.0, "Values should be <= 1 when percentage=True"
Lines changed: 113 additions & 0 deletions
"""Integration tests for Composite Rank Analysis with BigQuery."""

import pytest

from pyretailscience.analysis.composite_rank import CompositeRank


class TestCompositeRank:
    """Tests for the CompositeRank class."""

    @pytest.fixture(scope="class")
    def test_transactions_df(self, transactions_table):
        """Fetch test transactions data from BigQuery and convert it to a DataFrame.

        This fixture assumes a table with columns like product_id, spend, customers, etc.
        Modify the query and column names as per your actual BigQuery table structure.
        """
        df = transactions_table.to_pandas()

        if "spend_per_customer" not in df.columns:
            # Synthetic numeric column added purely to exercise ranking; dividing
            # spend by the customer ID has no business meaning.
            df["spend_per_customer"] = df["unit_spend"] / df["customer_id"]

        return df

    def test_composite_rank_with_bigquery_data(self, test_transactions_df):
        """Test CompositeRank functionality with real BigQuery data.

        This test demonstrates using CompositeRank with BigQuery-sourced data.
        """
        rank_cols = [
            ("unit_spend", "desc"),
            ("customer_id", "desc"),
            ("spend_per_customer", "desc"),
        ]

        cr = CompositeRank(
            df=test_transactions_df,
            rank_cols=rank_cols,
            agg_func="mean",
            ignore_ties=False,
        )

        assert "composite_rank" in cr.df.columns
        assert len(cr.df) > 0

        expected_rank_columns = [
            "unit_spend_rank",
            "customer_id_rank",
            "spend_per_customer_rank",
            "composite_rank",
        ]
        for col in expected_rank_columns:
            assert col in cr.df.columns

    def test_different_agg_functions_with_bigquery(self, test_transactions_df):
        """Test different aggregation functions with BigQuery data."""
        agg_functions = ["mean", "sum", "min", "max"]

        rank_cols = [
            ("unit_spend", "desc"),
            ("customer_id", "desc"),
            ("spend_per_customer", "desc"),
        ]

        for agg_func in agg_functions:
            cr = CompositeRank(
                df=test_transactions_df,
                rank_cols=rank_cols,
                agg_func=agg_func,
                ignore_ties=False,
            )

            assert "composite_rank" in cr.df.columns
            assert len(cr.df) > 0

    def test_ignore_ties_with_bigquery(self, test_transactions_df):
        """Test tie-breaking behavior with BigQuery data."""
        rank_cols = [("unit_spend", "desc")]

        cr_with_ties = CompositeRank(
            df=test_transactions_df,
            rank_cols=rank_cols,
            agg_func="mean",
            ignore_ties=False,
        )

        cr_no_ties = CompositeRank(
            df=test_transactions_df,
            rank_cols=rank_cols,
            agg_func="mean",
            ignore_ties=True,
        )

        assert "unit_spend_rank" in cr_with_ties.df.columns
        assert "unit_spend_rank" in cr_no_ties.df.columns

    def test_ibis_table_input(self, transactions_table):
        """Explicitly test Ibis table input for CompositeRank."""
        cr = CompositeRank(
            df=transactions_table,
            rank_cols=[("unit_spend", "desc"), ("customer_id", "desc")],
            agg_func="mean",
            ignore_ties=False,
        )

        expected_columns = [
            "unit_spend_rank",
            "customer_id_rank",
            "composite_rank",
        ]

        for col in expected_columns:
            assert col in cr.df.columns