Skip to content

docs: add data structures guide for preparing retail data#515

Open
murray-ds wants to merge 7 commits into
mainfrom
claude/trusting-hypatia-0r661z
Open

docs: add data structures guide for preparing retail data#515
murray-ds wants to merge 7 commits into
mainfrom
claude/trusting-hypatia-0r661z

Conversation

@murray-ds

@murray-ds murray-ds commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Adds docs/getting_started/data_structures.md covering data granularity
(line-item vs transaction level), star-schema denormalization, the
single-column ID requirement and composite-ID creation, data-quality
expectations (types, nulls, returns/refunds), and a validation checklist.
Wires the page into the MkDocs "Getting Started" nav.

Closes #338

Co-Authored-By: Claude Opus 4.8 noreply@anthropic.com
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

claude added 2 commits June 22, 2026 19:17
Adds docs/getting_started/data_structures.md covering data granularity
(line-item vs transaction level), star-schema denormalization, the
single-column ID requirement and composite-ID creation, data-quality
expectations, a per-function required-columns reference matrix with
backend support, and a validation checklist. Wires the page into the
MkDocs "Getting Started" nav.

Closes #338

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx
Code review of the data structures guide surfaced two factual
inaccuracies against the source:

- Only DaysBetweenPurchases and TransactionChurn enforce temporal /
  timezone-naive date columns (via ensure_tznaive_datetime in
  analysis/customer.py). RFMSegmentation and CohortAnalysis do not, so
  the Types section no longer claims they validate and reject
  non-temporal columns.
- ProductAssociation's group_col defaults to customer_id, so the default
  unit of co-occurrence is the customer, not the basket. The matrix
  Level is corrected from "basket" to "customer".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx
@claude

claude Bot commented Jun 22, 2026

Copy link
Copy Markdown

Claude finished @murray-ds's task in 2m 44s —— View job


PR Review

  • Gather context
  • Read docs/getting_started/data_structures.md
  • Read mkdocs.yml changes
  • Provide review feedback

This is a high-quality documentation addition and has clearly been well-tended through five prior review rounds. Content is accurate, the writing is clear, and the structure flows logically. A few small observations:

Confirmed correct

  • API example is valid. SegTransactionStats(data=q1, segment_col="category_0_name", grouping_sets="total") and .df all match the actual implementation (segstats.py:213, segstats.py:1031). grouping_sets="total" is an accepted literal (segstats.py:484).
  • data/transactions.parquet exists at the repo root — the con.read_parquet("data/transactions.parquet") call will work when run from the project root, consistent with other getting-started examples.
  • String date bounds in .between("2023-01-01", "2023-03-31") are consistent with connecting_to_data.md (same pattern appears four times there). The data-quality section is correctly about column dtypes, not filter literals.
  • mkdocs.yml changes are minimal and correct: nav entry wired in, pymdownx.tasklist added for the checklist to render.

Minor / non-blocking

PR description is stale — it still says the guide includes "a per-function required-columns reference matrix with backend support", but that section was removed in commit cab7038. The description doesn't gate merging, but it creates a mismatch with what reviewers and future git-searchers will find. Worth a quick update to the PR body.

ibis.duckdb.connect() vs ibis.duckdb.connect("warehouse.ddb") — the star-schema section example (line 103) uses "warehouse.ddb" while the closing example (line 263) uses the in-memory form. Both are valid Ibis calls, but the .ddb extension is uncommon (.duckdb or no extension is the typical default). Not wrong, just worth noting.


No blocking issues. The documentation is ready to merge.

@github-actions

github-actions Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

🚀 Cloudflare Pages Deployment

✅ Preview deployed successfully!

Preview URL: https://f3c3cbb3.pyretailscience-docs.pages.dev
Deployment ID: f3c3cbb3-1fc3-4068-b85d-6a6fde99129c


This preview will be updated automatically when you push new changes to this PR.

@codecov

codecov Bot commented Jun 22, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
see 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Use the group_col role name (not its customer_id default) for CrossShop
  and ProductAssociation in the requirements matrix, consistent with how
  other you-named roles are listed; add a note that group_col defaults to
  customer_id and how to change the unit of analysis.
- Show a .select() in the Ibis star-schema join example so it
  demonstrates the lightweight-join guidance in the following paragraph.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

Copy link
Copy Markdown
Contributor Author

Thanks for the review. Addressed in 5224c98:

  1. ProductAssociation matrix row — fixed, and applied the same correction to CrossShop (its group_col likewise defaults to customer_id). Both now list the group_col role rather than its default, with a note explaining the default and how to switch the unit of analysis (group_col="transaction_id" for per-basket).

  2. Ibis join example — added a .select(...) so the example demonstrates the lightweight-join pattern the following paragraph recommends.

  3. String date bounds in .between(...) — keeping as-is. The guide's temporal-type guidance is about stored column dtypes (a transaction_date column stored as strings), not filter literals. String bounds on a date column are idiomatic and coerced by Ibis/SQL, and this exact form is the established pattern in the sibling connecting_to_data.md; switching to ibis.date(...) here would diverge from it. Happy to revisit if you'd prefer the explicit-literal style across both guides.

Markdownlint passes and the docs build renders cleanly.


Generated by Claude Code

- Drop group_2_col from CrossShop's required columns (it defaults to
  group_1_col); add a note explaining the group_n_col defaulting.
- Note that the standard column-names table lists recognised names, not
  a per-dataset required set (e.g. unit_price is absent from the sample).
- Use singular "product" for the CustomerDecisionHierarchy level for
  consistency with the other rows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

Copy link
Copy Markdown
Contributor Author

Thanks — second-round feedback addressed in 82cff11:

  • Issue 2 (CrossShop group_2_col) — confirmed at cross_shop.py:178-179 (if group_2_col is None: group_2_col = group_1_col). Dropped group_2_col from the required columns and added a note that group_2_col/group_3_col default to group_1_col, so comparing values within one column needs only group_1_col.
  • Nit (unit_price not in sample) — added a line clarifying the table lists recognised names, not a per-dataset required set (calling out that the sample has no unit_price).
  • Nit (products casing) — changed the CustomerDecisionHierarchy level to singular product.

On Issue 1 (unit_quantity missing from the standard names table) — this one I'll leave, as unit_quantity is already present in that table (between product_id and unit_price); looks like it was counted as missing by mistake.


Generated by Claude Code

…note

- Enable pymdownx.tasklist (custom_checkbox) so the data-structures
  validation checklist renders as checkboxes instead of literal [ ] text.
- Reword the line-item aggregation sentence to note that the segment or
  period level is set by each function's column arguments.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

Copy link
Copy Markdown
Contributor Author

Thanks — third-round feedback addressed in 02b6ddb:

  • Issue 1 (checklist not rendering) — good catch, real bug. Added pymdownx.tasklist (with custom_checkbox: true) to mkdocs.yml. Verified with a local build: the validation checklist now renders as a proper task-list with checkbox inputs for all 7 items.
  • Nit (L65 RevenueTree) — reworded so it no longer implies an identical granularity argument: the line now says these functions aggregate to the segment or period level "their column arguments define."

On Issue 2 (CrossShop matrix) — keeping the current entry. As you noted, this is a deliberate, consistent choice: the "Required columns" column documents the effective set of columns that must exist in the data, using role names for you-named columns. By that rule CrossShop genuinely needs group_col, value_col, and group_1_col to exist (the first two via their defaults), and the footnotes explain the defaults — consistent with how ProductAssociation is listed. Listing only group_1_col would be less actionable for someone preparing their data.


Generated by Claude Code

- Remove the function requirements matrix section per maintainer request.
- Add up-front framing that the data expectations come from working with
  dozens of retailers and follow industry-standard patterns.
- Tighten flow and cut duplication (reviewed via subagents): drop the
  Troubleshooting section (it restated errors already covered under
  Column requirements and Data quality), fold the validation checklist
  and the runnable example into a single closing "Putting it together"
  section, and condense the intro admonition.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx
Negative unit_spend rows (returns/refunds) are common in retail data and
can skew spend aggregations; add a one-line note that they are valid data
but should be deliberately included or excluded before analysis.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

Copy link
Copy Markdown
Contributor Author

Thanks for the review. Addressed in 1a152a6:

  • Returns/refunds (negative unit_spend) — good call, real gap. Added a note in the Data quality → Nulls section that negative-spend rows are valid returns/refunds but should be deliberately included or excluded, since most spend metrics assume positive transactions.

On the nav ordering suggestion (Data Structures before Connecting to Your Data) — keeping the current order. It's a deliberate "get connected, then learn the data shape, then configure column names" flow, and it was approved as "the right order logically" / "correct" in the two prior review rounds. Happy to swap if the maintainer prefers the preparation-first ordering, but I'll leave it rather than churn on a non-blocking, self-reversing nit.

Everything else in the review was confirmation of accuracy. Lint passes and the docs build renders.


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create Data Structures Guide for PyRetailScience

2 participants