docs: add data structures guide for preparing retail data by murray-ds · Pull Request #515 · Data-Simply/openretailscience

murray-ds · 2026-06-22T19:41:13Z

Adds docs/getting_started/data_structures.md covering data granularity
(line-item vs transaction level), star-schema denormalization, the
single-column ID requirement and composite-ID creation, data-quality
expectations (types, nulls, returns/refunds), and a validation checklist.
Wires the page into the MkDocs "Getting Started" nav.

Closes #338

Co-Authored-By: Claude Opus 4.8 noreply@anthropic.com
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

Adds docs/getting_started/data_structures.md covering data granularity (line-item vs transaction level), star-schema denormalization, the single-column ID requirement and composite-ID creation, data-quality expectations, a per-function required-columns reference matrix with backend support, and a validation checklist. Wires the page into the MkDocs "Getting Started" nav. Closes #338 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

Code review of the data structures guide surfaced two factual inaccuracies against the source: - Only DaysBetweenPurchases and TransactionChurn enforce temporal / timezone-naive date columns (via ensure_tznaive_datetime in analysis/customer.py). RFMSegmentation and CohortAnalysis do not, so the Types section no longer claims they validate and reject non-temporal columns. - ProductAssociation's group_col defaults to customer_id, so the default unit of co-occurrence is the customer, not the basket. The matrix Level is corrected from "basket" to "customer". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

claude · 2026-06-22T19:41:37Z

Claude finished @murray-ds's task in 2m 44s —— View job

PR Review

Gather context
Read docs/getting_started/data_structures.md
Read mkdocs.yml changes
Provide review feedback

This is a high-quality documentation addition and has clearly been well-tended through five prior review rounds. Content is accurate, the writing is clear, and the structure flows logically. A few small observations:

Confirmed correct

API example is valid. SegTransactionStats(data=q1, segment_col="category_0_name", grouping_sets="total") and .df all match the actual implementation (segstats.py:213, segstats.py:1031). grouping_sets="total" is an accepted literal (segstats.py:484).
data/transactions.parquet exists at the repo root — the con.read_parquet("data/transactions.parquet") call will work when run from the project root, consistent with other getting-started examples.
String date bounds in .between("2023-01-01", "2023-03-31") are consistent with connecting_to_data.md (same pattern appears four times there). The data-quality section is correctly about column dtypes, not filter literals.
mkdocs.yml changes are minimal and correct: nav entry wired in, pymdownx.tasklist added for the checklist to render.

Minor / non-blocking

PR description is stale — it still says the guide includes "a per-function required-columns reference matrix with backend support", but that section was removed in commit cab7038. The description doesn't gate merging, but it creates a mismatch with what reviewers and future git-searchers will find. Worth a quick update to the PR body.

ibis.duckdb.connect() vs ibis.duckdb.connect("warehouse.ddb") — the star-schema section example (line 103) uses "warehouse.ddb" while the closing example (line 263) uses the in-memory form. Both are valid Ibis calls, but the .ddb extension is uncommon (.duckdb or no extension is the typical default). Not wrong, just worth noting.

No blocking issues. The documentation is ready to merge.

github-actions · 2026-06-22T19:42:35Z

🚀 Cloudflare Pages Deployment

✅ Preview deployed successfully!

Preview URL: https://f3c3cbb3.pyretailscience-docs.pages.dev
Deployment ID: f3c3cbb3-1fc3-4068-b85d-6a6fde99129c

This preview will be updated automatically when you push new changes to this PR.

codecov · 2026-06-22T19:44:00Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
see 3 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Use the group_col role name (not its customer_id default) for CrossShop and ProductAssociation in the requirements matrix, consistent with how other you-named roles are listed; add a note that group_col defaults to customer_id and how to change the unit of analysis. - Show a .select() in the Ibis star-schema join example so it demonstrates the lightweight-join guidance in the following paragraph. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

murray-ds · 2026-06-22T19:48:29Z

Thanks for the review. Addressed in 5224c98:

ProductAssociation matrix row — fixed, and applied the same correction to CrossShop (its group_col likewise defaults to customer_id). Both now list the group_col role rather than its default, with a note explaining the default and how to switch the unit of analysis (group_col="transaction_id" for per-basket).
Ibis join example — added a .select(...) so the example demonstrates the lightweight-join pattern the following paragraph recommends.
String date bounds in .between(...) — keeping as-is. The guide's temporal-type guidance is about stored column dtypes (a transaction_date column stored as strings), not filter literals. String bounds on a date column are idiomatic and coerced by Ibis/SQL, and this exact form is the established pattern in the sibling connecting_to_data.md; switching to ibis.date(...) here would diverge from it. Happy to revisit if you'd prefer the explicit-literal style across both guides.

Markdownlint passes and the docs build renders cleanly.

Generated by Claude Code

- Drop group_2_col from CrossShop's required columns (it defaults to group_1_col); add a note explaining the group_n_col defaulting. - Note that the standard column-names table lists recognised names, not a per-dataset required set (e.g. unit_price is absent from the sample). - Use singular "product" for the CustomerDecisionHierarchy level for consistency with the other rows. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

murray-ds · 2026-06-22T19:53:35Z

Thanks — second-round feedback addressed in 82cff11:

Issue 2 (CrossShop group_2_col) — confirmed at cross_shop.py:178-179 (if group_2_col is None: group_2_col = group_1_col). Dropped group_2_col from the required columns and added a note that group_2_col/group_3_col default to group_1_col, so comparing values within one column needs only group_1_col.
Nit (unit_price not in sample) — added a line clarifying the table lists recognised names, not a per-dataset required set (calling out that the sample has no unit_price).
Nit (products casing) — changed the CustomerDecisionHierarchy level to singular product.

On Issue 1 (unit_quantity missing from the standard names table) — this one I'll leave, as unit_quantity is already present in that table (between product_id and unit_price); looks like it was counted as missing by mistake.

Generated by Claude Code

…note - Enable pymdownx.tasklist (custom_checkbox) so the data-structures validation checklist renders as checkboxes instead of literal [ ] text. - Reword the line-item aggregation sentence to note that the segment or period level is set by each function's column arguments. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

murray-ds · 2026-06-22T19:59:34Z

Thanks — third-round feedback addressed in 02b6ddb:

Issue 1 (checklist not rendering) — good catch, real bug. Added pymdownx.tasklist (with custom_checkbox: true) to mkdocs.yml. Verified with a local build: the validation checklist now renders as a proper task-list with checkbox inputs for all 7 items.
Nit (L65 RevenueTree) — reworded so it no longer implies an identical granularity argument: the line now says these functions aggregate to the segment or period level "their column arguments define."

On Issue 2 (CrossShop matrix) — keeping the current entry. As you noted, this is a deliberate, consistent choice: the "Required columns" column documents the effective set of columns that must exist in the data, using role names for you-named columns. By that rule CrossShop genuinely needs group_col, value_col, and group_1_col to exist (the first two via their defaults), and the footnotes explain the defaults — consistent with how ProductAssociation is listed. Listing only group_1_col would be less actionable for someone preparing their data.

Generated by Claude Code

- Remove the function requirements matrix section per maintainer request. - Add up-front framing that the data expectations come from working with dozens of retailers and follow industry-standard patterns. - Tighten flow and cut duplication (reviewed via subagents): drop the Troubleshooting section (it restated errors already covered under Column requirements and Data quality), fold the validation checklist and the runnable example into a single closing "Putting it together" section, and condense the intro admonition. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

Negative unit_spend rows (returns/refunds) are common in retail data and can skew spend aggregations; add a one-line note that they are valid data but should be deliberately included or excluded before analysis. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx

murray-ds · 2026-06-26T19:01:07Z

Thanks for the review. Addressed in 1a152a6:

Returns/refunds (negative unit_spend) — good call, real gap. Added a note in the Data quality → Nulls section that negative-spend rows are valid returns/refunds but should be deliberately included or excluded, since most spend metrics assume positive transactions.

On the nav ordering suggestion (Data Structures before Connecting to Your Data) — keeping the current order. It's a deliberate "get connected, then learn the data shape, then configure column names" flow, and it was approved as "the right order logically" / "correct" in the two prior review rounds. Happy to swap if the maintainer prefers the preparation-first ordering, but I'll leave it rather than churn on a non-blocking, self-reversing nit.

Everything else in the review was confirmation of accuracy. Lint passes and the docs build renders.

Generated by Claude Code

claude added 2 commits June 22, 2026 19:17

github-actions Bot deployed to preview June 22, 2026 19:42 View deployment

github-actions Bot deployed to preview June 22, 2026 19:49 View deployment

github-actions Bot deployed to preview June 22, 2026 19:54 View deployment

github-actions Bot deployed to preview June 22, 2026 20:00 View deployment

github-actions Bot deployed to preview June 26, 2026 18:58 View deployment

github-actions Bot deployed to preview June 26, 2026 19:02 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

docs: add data structures guide for preparing retail data#515

docs: add data structures guide for preparing retail data#515
murray-ds wants to merge 7 commits into
mainfrom
claude/trusting-hypatia-0r661z

murray-ds commented Jun 22, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 22, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 22, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 22, 2026 •

edited

Loading

Uh oh!

murray-ds commented Jun 22, 2026

Uh oh!

murray-ds commented Jun 22, 2026

Uh oh!

murray-ds commented Jun 22, 2026

Uh oh!

murray-ds commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

murray-ds commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review

Confirmed correct

Minor / non-blocking

Uh oh!

github-actions Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🚀 Cloudflare Pages Deployment

Uh oh!

codecov Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

murray-ds commented Jun 22, 2026

Uh oh!

murray-ds commented Jun 22, 2026

Uh oh!

murray-ds commented Jun 22, 2026

Uh oh!

murray-ds commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

murray-ds commented Jun 22, 2026 •

edited

Loading

claude Bot commented Jun 22, 2026 •

edited

Loading

github-actions Bot commented Jun 22, 2026 •

edited

Loading

codecov Bot commented Jun 22, 2026 •

edited

Loading