docs: add data structures guide for preparing retail data #515
murray-ds wants to merge 7 commits
Adds docs/getting_started/data_structures.md covering data granularity (line-item vs transaction level), star-schema denormalization, the single-column ID requirement and composite-ID creation, data-quality expectations, a per-function required-columns reference matrix with backend support, and a validation checklist. Wires the page into the MkDocs "Getting Started" nav.
Closes #338
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx
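For readers unfamiliar with the composite-ID step mentioned above, here is a minimal pandas sketch of collapsing several key columns into one ID column. The column names `store_id`, `till_id`, `receipt_no`, and `transaction_id` are illustrative assumptions, not names taken from the guide, and this is not necessarily how the guide itself builds the ID.

```python
import pandas as pd

# Hypothetical line-item data where a receipt number is only unique
# within a given store and till (all column names are assumptions).
df = pd.DataFrame(
    {
        "store_id": [1, 1, 2],
        "till_id": [10, 10, 10],
        "receipt_no": [500, 501, 500],
        "unit_spend": [4.50, 2.00, 7.25],
    }
)

# Join the parts with a separator so that, for example, (1, 10) and (11, 0)
# cannot collapse into the same string.
key_parts = ["store_id", "till_id", "receipt_no"]
df["transaction_id"] = df[key_parts].astype(str).agg("_".join, axis=1)
```

The separator matters: concatenating without one can make two different key combinations produce the same ID.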
Code review of the data structures guide surfaced two factual inaccuracies against the source:
- Only DaysBetweenPurchases and TransactionChurn enforce temporal / timezone-naive date columns (via ensure_tznaive_datetime in analysis/customer.py). RFMSegmentation and CohortAnalysis do not, so the Types section no longer claims they validate and reject non-temporal columns.
- ProductAssociation's group_col defaults to customer_id, so the default unit of co-occurrence is the customer, not the basket. The matrix Level is corrected from "basket" to "customer".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx
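As an illustration of what a timezone-naive date column looks like in practice, the sketch below normalizes timestamps to UTC and then drops the timezone using plain pandas. It does not call or reproduce ensure_tznaive_datetime; the `transaction_date` column name and the choice to normalize to UTC first are assumptions made for the example.

```python
import pandas as pd

# Hypothetical input with mixed UTC offsets stored as strings.
df = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "transaction_date": [
            "2024-01-05T09:30:00+00:00",
            "2024-01-06T18:45:00+01:00",
        ],
    }
)

# Parse everything to UTC so the offsets are reconciled first...
dates_utc = pd.to_datetime(df["transaction_date"], utc=True)

# ...then drop the timezone, leaving a timezone-naive datetime column.
df["transaction_date"] = dates_utc.dt.tz_localize(None)
```

Whether UTC or local store time is the right wall-clock value depends on the analysis; the point is only that the column ends up temporal and without a timezone.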
Claude finished @murray-ds's task in 2m 44s
PR Review
This is a high-quality documentation addition and has clearly been well-tended through five prior review rounds. Content is accurate, the writing is clear, and the structure flows logically. A few small observations:
Confirmed correct
Minor / non-blocking
PR description is stale: it still says the guide includes "a per-function required-columns reference matrix with backend support", but that section was removed in commit
No blocking issues. The documentation is ready to merge.
🚀 Cloudflare Pages Deployment
✅ Preview deployed successfully!
Preview URL: https://f3c3cbb3.pyretailscience-docs.pages.dev
This preview will be updated automatically when you push new changes to this PR.
Codecov Report
✅ All modified and coverable lines are covered by tests.
- Use the group_col role name (not its customer_id default) for CrossShop and ProductAssociation in the requirements matrix, consistent with how other you-named roles are listed; add a note that group_col defaults to customer_id and how to change the unit of analysis.
- Show a .select() in the Ibis star-schema join example so it demonstrates the lightweight-join guidance in the following paragraph.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx
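The lightweight-join idea can be sketched in pandas as well: trim the dimension table to the columns the analysis needs before joining it onto the fact table. This is a pandas analogue, not the Ibis example the commit refers to, and every table and column name here is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical fact table: one row per line item.
transactions = pd.DataFrame(
    {
        "transaction_id": [1, 1, 2],
        "customer_id": [100, 100, 200],
        "product_id": [10, 11, 10],
        "unit_spend": [3.0, 5.0, 3.0],
    }
)

# Hypothetical product dimension with columns the analysis does not need.
products = pd.DataFrame(
    {
        "product_id": [10, 11],
        "category": ["dairy", "bakery"],
        "supplier_notes": ["n/a", "n/a"],
    }
)

# Keep only the join key and the attributes required downstream.
product_cols = products[["product_id", "category"]]

# validate="many_to_one" raises if product_id is duplicated in the dimension,
# which would otherwise silently multiply line items.
flat = transactions.merge(
    product_cols, on="product_id", how="left", validate="many_to_one"
)
```

Selecting before joining keeps the denormalized table narrow, and the `validate` argument guards against row duplication from a non-unique dimension key.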
Thanks for the review. Addressed in 5224c98:
Markdownlint passes and the docs build renders cleanly.
Generated by Claude Code
- Drop group_2_col from CrossShop's required columns (it defaults to group_1_col); add a note explaining the group_n_col defaulting.
- Note that the standard column-names table lists recognised names, not a per-dataset required set (e.g. unit_price is absent from the sample).
- Use singular "product" for the CustomerDecisionHierarchy level for consistency with the other rows.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx
Thanks — second-round feedback addressed in 82cff11:
On Issue 1 (
Generated by Claude Code
…note
- Enable pymdownx.tasklist (custom_checkbox) so the data-structures validation checklist renders as checkboxes instead of literal [ ] text.
- Reword the line-item aggregation sentence to note that the segment or period level is set by each function's column arguments.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx
Thanks — third-round feedback addressed in 02b6ddb:
On Issue 2 (CrossShop matrix): keeping the current entry. As you noted, this is a deliberate, consistent choice: the "Required columns" column documents the effective set of columns that must exist in the data, using role names for you-named columns. By that rule
Generated by Claude Code
- Remove the function requirements matrix section per maintainer request.
- Add up-front framing that the data expectations come from working with dozens of retailers and follow industry-standard patterns.
- Tighten flow and cut duplication (reviewed via subagents): drop the Troubleshooting section (it restated errors already covered under Column requirements and Data quality), fold the validation checklist and the runnable example into a single closing "Putting it together" section, and condense the intro admonition.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx
Negative unit_spend rows (returns/refunds) are common in retail data and can skew spend aggregations; add a one-line note that they are valid data but should be deliberately included or excluded before analysis.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx
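To make the effect of returns concrete, here is a small pandas sketch comparing spend per customer with and without negative unit_spend rows. The data is made up, and treating every negative row as a return is an assumption for the example; real datasets may flag returns differently.

```python
import pandas as pd

# Hypothetical line items: customer 1 buys an item and later returns it.
df = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "unit_spend": [20.0, -20.0, 15.0],
    }
)

# Assumption: a negative unit_spend marks a return or refund.
is_return = df["unit_spend"] < 0

# Option A: keep returns, so spend is net of refunds.
net_spend = df.groupby("customer_id")["unit_spend"].sum()

# Option B: exclude returns, so spend reflects gross purchases only.
gross_spend = df[~is_return].groupby("customer_id")["unit_spend"].sum()
```

Customer 1 nets to zero under option A but shows 20.0 under option B, which is why the choice should be made explicitly before running any spend-based analysis.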
Thanks for the review. Addressed in 1a152a6:
On the nav ordering suggestion (Data Structures before Connecting to Your Data): keeping the current order. It's a deliberate "get connected, then learn the data shape, then configure column names" flow, and it was approved as "the right order logically" / "correct" in the two prior review rounds. Happy to swap if the maintainer prefers the preparation-first ordering, but I'll leave it rather than churn on a non-blocking, self-reversing nit.
Everything else in the review was confirmation of accuracy. Lint passes and the docs build renders.
Generated by Claude Code
Adds docs/getting_started/data_structures.md covering data granularity
(line-item vs transaction level), star-schema denormalization, the
single-column ID requirement and composite-ID creation, data-quality
expectations (types, nulls, returns/refunds), and a validation checklist.
Wires the page into the MkDocs "Getting Started" nav.
Closes #338
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Ei3maqUvpytVcemk9cJYcx