
Commit 2517ec9

Merge pull request #244 from Data-Simply/feature/bigquery-integration-analysis
Setup GCP BigQuery Integration Tests for Analysis Modules
2 parents fcf1d91 + 0cd7587 commit 2517ec9

22 files changed: +939 -10 lines

.env_sample

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+GCP_PROJECT_ID =
```
BigQuery Integration Tests workflow (new GitHub Actions file; path not shown in this view)

Lines changed: 63 additions & 0 deletions

```yaml
name: BigQuery Integration Tests

on:
  workflow_dispatch:
    inputs:
      test_suite:
        type: choice
        description: Test Suite to Run
        default: "all"
        options:
          - all
          - cohort_analysis
          - composite_rank
          - cross_shop
          - customer_decision_hierarchy
          - haversine
          - hml_segmentation
          - product_association
          - revenue_tree
          - rfm_segmentation
          - segstats_segmentation
          - threshold_segmentation

permissions:
  contents: read

concurrency:
  group: "bigquery-tests"
  cancel-in-progress: true

jobs:
  integration-tests:
    name: Run BigQuery Integration Tests
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install uv Package
        run: |
          pip install --upgrade pip
          pip install uv==0.5.30

      - name: Install Dependencies
        run: |
          uv sync

      - name: Set up GCP Authentication
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Run Integration Tests
        env:
          TEST_SUITE: ${{ inputs.test_suite }}
        run: |
          uv run pytest tests/integration/bigquery -v \
            $(if [ "$TEST_SUITE" != "all" ]; then echo "-k $TEST_SUITE"; fi)
```
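When a suite other than `all` is selected, the shell substitution in the final step expands to a keyword filter, so the command becomes, for example, `uv run pytest tests/integration/bigquery -v -k cohort_analysis`; when `all` is selected the substitution is empty and the entire directory runs.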

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -25,8 +25,8 @@ repos:
     hooks:
       - id: pytest
         name: pytest
-        entry: uv run pytest --cov=pyretailscience --cov-report=xml --cov-branch tests
         language: system
+        entry: uv run pytest --cov=pyretailscience --cov-report=xml --cov-branch tests --ignore=tests/integration
         types: [python]
         pass_filenames: false
         always_run: true
```
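The effect is that the pre-commit hook stays fast and credential-free: `tests/integration` is skipped locally, and the BigQuery suite runs only through the workflow above or an explicit `uv run pytest tests/integration/bigquery` invocation.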

README.md

Lines changed: 66 additions & 0 deletions
````diff
@@ -1,3 +1,4 @@
+<!-- README.md -->
 ![PyRetailScience Logo](https://raw.githubusercontent.com/Data-Simply/pyretailscience/main/readme_assets/logo.png)
 
 # PyRetailScience
@@ -208,3 +209,68 @@ Built with expertise doing analytics and data science for scale-ups to multi-nat
 ## License
 
 This project is licensed under the Elastic License 2.0 - see the [LICENSE](LICENSE) file for details.
+
+# BigQuery Integration Tests
+
+## Overview
+
+The `tests/integration/bigquery` directory contains integration tests that verify all
+PyRetailScience analysis modules work correctly with Google BigQuery as a backend. These
+tests confirm that the Ibis-based code paths function correctly when connected to BigQuery.
+
+## Test Coverage
+
+The integration tests cover the analysis modules offered as workflow test-suite options:
+cohort analysis, composite rank, cross shop, customer decision hierarchy, haversine,
+HML segmentation, product association, revenue tree, RFM segmentation, segstats
+segmentation, and threshold segmentation.
+
+## Prerequisites
+
+To run these tests, you need:
+
+1. Access to a Google Cloud Platform account
+2. A service account with BigQuery permissions
+3. The service account key JSON file
+4. The test dataset loaded in BigQuery (dataset: `test_data`, table: `transactions`)
+
+## Running the Tests
+
+### Manual Setup
+
+- Set up authentication:
+
+```bash
+export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-key.json
+export GCP_PROJECT_ID=your-project-id
+```
+
+- Install dependencies:
+
+```bash
+uv sync
+```
+
+- Run the tests:
+
+```bash
+# Run all tests
+uv run pytest tests/integration/bigquery -v
+
+# Run a specific test module
+uv run pytest tests/integration/bigquery/test_cohort_analysis.py -v
+```
+
+## Using GitHub Actions
+
+These tests can be run manually in GitHub Actions via the "BigQuery Integration Tests" workflow. To run:
+
+1. Go to the "Actions" tab in the GitHub repository
+2. Select the "BigQuery Integration Tests" workflow
+3. Click "Run workflow"
+4. Optionally pick a test suite from the dropdown (e.g., `cohort_analysis`)
+5. Click "Run workflow"
+
+### Required Secrets
+
+To run the workflow in GitHub Actions, add these secrets to your repository:
+
+- `GCP_SA_KEY`: The entire JSON content of your GCP service account key file
+- `GCP_PROJECT_ID`: Your GCP project ID
````
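Taken together, the connection setup and a single analysis run compose as follows. A minimal end-to-end sketch (not part of the commit), assuming `GCP_PROJECT_ID` is exported and the `test_data.transactions` table exists; it mirrors the fixture and cohort-test code shown later in this diff:

```python
# Hedged sketch: mirrors conftest.py and test_cohort_analysis.py from this
# commit; assumes GCP_PROJECT_ID is set and test_data.transactions exists.
import os

import ibis

from pyretailscience.analysis.cohort import CohortAnalysis

conn = ibis.bigquery.connect(project_id=os.environ["GCP_PROJECT_ID"])
transactions = conn.table("test_data.transactions")

# Run one analysis module against BigQuery via the Ibis backend.
CohortAnalysis(
    df=transactions.limit(5000),
    aggregation_column="unit_spend",
    agg_func="sum",
    period="week",
    percentage=True,
)
```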

pyproject.toml

Lines changed: 2 additions & 0 deletions
```diff
@@ -27,11 +27,13 @@ name = "Murray Vanwyk"
 [dependency-groups]
 dev = [
     "freezegun>=1.5.1,<2",
+    "ibis-framework[bigquery]>=10.0.0,<11",
     "nbstripout>=0.7.1,<0.8",
     "pre-commit>=3.6.2,<4",
     "pytest-cov>=4.1.0,<5",
     "pytest-mock>=3.14.0,<4",
     "pytest>=8.0.0,<9",
+    "python-dotenv>=1.0.0,<2",
     "ruff>=0.9,<0.10",
     "tomlkit>=0.12,<1",
 ]
```
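Both additions support the new test code: `ibis-framework[bigquery]` supplies the Ibis BigQuery backend behind `ibis.bigquery.connect`, and `python-dotenv` supplies the `load_dotenv()` call that reads `GCP_PROJECT_ID` from a local `.env` file (see `.env_sample` above).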

pyretailscience/segmentation/rfm.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -104,7 +104,9 @@ def _compute_rfm(self, df: ibis.Table, current_date: datetime.date) -> ibis.Table:
         current_date_expr = ibis.literal(current_date)
 
         customer_metrics = df.group_by(cols.customer_id).aggregate(
-            recency_days=(current_date_expr - df[cols.transaction_date].max().cast("date")).cast("int32"),
+            recency_days=current_date_expr.delta(df[cols.transaction_date].max().cast("date"), unit="day").cast(
+                "int32",
+            ),
             frequency=df[cols.transaction_id].nunique(),
             monetary=df[cols.unit_spend].sum(),
         )
```
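Subtracting two dates yields an interval, which backends such as BigQuery will not cast straight to an integer; `delta(..., unit="day")` instead asks the backend for a whole-day count. A standalone sketch of the same pattern (not part of the commit), with illustrative data run on Ibis's default in-memory backend rather than BigQuery:

```python
# Hedged sketch of the recency_days rewrite; the data and the default
# (DuckDB) execution are illustrative — the commit targets BigQuery.
import datetime

import ibis

t = ibis.memtable({"transaction_date": [datetime.date(2024, 1, 1), datetime.date(2024, 3, 1)]})
today = ibis.literal(datetime.date(2024, 3, 31))

# Old form: (today - t.transaction_date.max()).cast("int32") casts an
# interval to an integer, which BigQuery rejects.
# New form: delta() returns the signed number of whole days directly.
recency_days = today.delta(t.transaction_date.max().cast("date"), unit="day").cast("int32")
print(recency_days.execute())  # expected: 30
```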

pyretailscience/segmentation/threshold.py

Lines changed: 2 additions & 7 deletions
```diff
@@ -83,14 +83,9 @@ def __init__(
         window = ibis.window(order_by=ibis.asc(df[value_col]))
         df = df.mutate(ptile=ibis.percent_rank().over(window))
 
-        case = ibis.case()
+        case_args = [(df["ptile"] <= quantile, segment) for quantile, segment in zip(thresholds, segments, strict=True)]
 
-        for quantile, segment in zip(thresholds, segments, strict=True):
-            case = case.when(df["ptile"] <= quantile, segment)
-
-        case = case.end()
-
-        df = df.mutate(segment_name=case).drop(["ptile"])
+        df = df.mutate(segment_name=ibis.cases(*case_args)).drop(["ptile"])
 
         if zero_value_customers == "separate_segment":
             df = ibis.union(df, zero_df)
```
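Newer Ibis releases (the `>=10.0.0,<11` pin added in `pyproject.toml` above) replace the chained `ibis.case().when(...).end()` builder with `ibis.cases()`, which takes `(condition, value)` pairs in a single call; like SQL `CASE`, the first matching condition wins. A standalone sketch (not part of the commit) with illustrative thresholds:

```python
# Hedged sketch of the ibis.cases() pattern; thresholds and segments are
# illustrative stand-ins for the class arguments used above.
import ibis

t = ibis.memtable({"ptile": [0.10, 0.45, 0.90]})
thresholds = [0.33, 0.66, 1.0]
segments = ["Light", "Medium", "Heavy"]

# (condition, value) branches in ascending threshold order, so each row
# lands in the first segment whose quantile bound it does not exceed.
case_args = [(t.ptile <= q, s) for q, s in zip(thresholds, segments, strict=True)]
print(t.mutate(segment_name=ibis.cases(*case_args)).execute())
```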
BigQuery test fixtures (new file; path not shown in this view — by pytest convention, `tests/integration/bigquery/conftest.py`)

Lines changed: 33 additions & 0 deletions

```python
"""BigQuery integration test fixtures."""

import os

import ibis
import pytest
from dotenv import load_dotenv
from google.cloud import bigquery
from loguru import logger

load_dotenv()
client = bigquery.Client(project="pyretailscience-infra")


@pytest.fixture(scope="session")
def bigquery_connection():
    """Connect to BigQuery for integration tests."""
    try:
        conn = ibis.bigquery.connect(
            project_id=os.environ.get("GCP_PROJECT_ID"),
        )
        logger.info("Connected to BigQuery")
    except Exception as e:
        logger.error(f"Failed to connect to BigQuery: {e}")
        raise
    else:
        return conn


@pytest.fixture(scope="session")
def transactions_table(bigquery_connection):
    """Get the transactions table for testing."""
    return bigquery_connection.table("test_data.transactions")
```
tests/integration/bigquery/test_cohort_analysis.py

Lines changed: 20 additions & 0 deletions

```python
"""Integration tests for Cohort Analysis with BigQuery."""

from pyretailscience.analysis.cohort import CohortAnalysis


def test_cohort_analysis_with_bigquery(transactions_table):
    """Integration test for CohortAnalysis using BigQuery backend and Ibis table.

    This test ensures that the CohortAnalysis class initializes and executes successfully
    using BigQuery data with various combinations of aggregation parameters.
    """
    limited_table = transactions_table.limit(5000)

    CohortAnalysis(
        df=limited_table,
        aggregation_column="unit_spend",
        agg_func="sum",
        period="week",
        percentage=True,
    )
```
Composite rank integration test (new file; path not shown in this view — likely `tests/integration/bigquery/test_composite_rank.py`)

Lines changed: 23 additions & 0 deletions

```python
"""Integration tests for Composite Rank Analysis with BigQuery."""

import pytest

from pyretailscience.analysis.composite_rank import CompositeRank


@pytest.mark.parametrize("ignore_ties", [False, True])
def test_tie_handling(transactions_table, ignore_ties):
    """Test handling of ties during rank calculation."""
    rank_cols = [
        ("unit_spend", "desc"),
        ("customer_id", "desc"),
    ]
    result = CompositeRank(
        df=transactions_table,
        rank_cols=rank_cols,
        agg_func="mean",
        ignore_ties=ignore_ties,
    )
    assert result is not None
    executed_result = result.df
    assert executed_result is not None
```
