refactor: move data simulation to another package #55
Conversation
Walkthrough
The recent changes primarily focus on shifting the data workflow from generating simulated data to loading pre-simulated data. This impacts multiple notebooks and documentation files, altering the instructions and examples accordingly. Additionally, the
Changes
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (2)
- data/transactions.parquet is excluded by !**/*.parquet
- poetry.lock is excluded by !**/*.lock
Files selected for processing (8)
- .gitignore (1 hunks)
- README.md (1 hunks)
- docs/examples/cross_shop.ipynb (7 hunks)
- docs/examples/data_contracts.ipynb (6 hunks)
- docs/examples/gain_loss.ipynb (7 hunks)
- docs/examples/retention.ipynb (3 hunks)
- mkdocs.yml (1 hunks)
- pyproject.toml (2 hunks)
Files skipped from review due to trivial changes (4)
- .gitignore
- README.md
- mkdocs.yml
- pyproject.toml
Additional comments not posted (23)
docs/examples/data_contracts.ipynb (7)
13-13: LGTM! The change from "creating and simulating some data" to "loading some simulated data" is consistent with the PR objective.
62-181: LGTM! The transaction data table has been updated with new data values. These changes align with the PR objective of using pre-simulated data.
192-192: LGTM! The code now loads data from a Parquet file, which is consistent with the PR objective.
812-817: LGTM! The function signature now includes type annotations, which enhance code readability and maintainability.
882-891: LGTM! The new class CustomCustomerLevelContract is well-documented and follows best practices for extending data contracts.
Line range hint 890-918: LGTM! The __init__ method now includes type annotations and additional expectations for the total_price column, improving code readability and maintainability.
812-817: LGTM! The code cell validates the custom contract and clips the total_price values to meet the contract expectations, ensuring data integrity.
docs/examples/gain_loss.ipynb (7)
17-17: LGTM! The markdown cell correctly reflects the change in data loading.
66-69: LGTM! The displayed table data is consistent and correctly formatted.
Line range hint 84-88: LGTM! The displayed table data is consistent and correctly formatted.
Line range hint 102-106: LGTM! The displayed table data is consistent and correctly formatted.
118-185: LGTM! The displayed table data is consistent and correctly formatted.
198-199: LGTM! The code correctly updates to load data from a Parquet file.
263-263: LGTM! The code correctly reassigns rows and applies discounts based on the new data.
docs/examples/cross_shop.ipynb (7)
14-14: Update text to reflect loading of simulated data. The text change correctly reflects the new approach of loading simulated data instead of generating it.
63-63: Verify data formatting and consistency. Ensure that the displayed data values are correctly formatted and consistent with the rest of the dataset.
81-81: Verify data formatting and consistency. Ensure that the displayed data values are correctly formatted and consistent with the rest of the dataset.
99-99: Verify data formatting and consistency. Ensure that the displayed data values are correctly formatted and consistent with the rest of the dataset.
115-182: Verify data formatting and consistency. Ensure that the displayed data values are correctly formatted and consistent with the rest of the dataset.
193-196: Import necessary libraries and load data from a Parquet file. The imports and data loading code appear to be correct. Ensure that the data file path "../../data/transactions.parquet" is valid and accessible.
247-247: Randomly assign category name for shoes. The code appears to correctly randomly assign the category name "Shoes" or "Jeans" to the rows where the category name is currently "Shoes". Ensure that the random assignment logic is correct and necessary.
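The random reassignment described in the 247-247 comment could look roughly like the sketch below. The notebook's actual code is not shown in this thread, so the column name `category_0_name`, the helper `split_category`, the fixed seed, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def split_category(df, column="category_0_name", source="Shoes",
                   options=("Shoes", "Jeans"), seed=42):
    """Return a copy of df where rows labelled `source` are randomly
    relabelled to one of `options` (hypothetical helper, not from the PR)."""
    rng = np.random.default_rng(seed)  # seeded so the example is reproducible
    out = df.copy()
    mask = out[column] == source
    # Draw one label per matching row; all other rows are left untouched.
    out.loc[mask, column] = rng.choice(list(options), size=int(mask.sum()))
    return out


# Toy data: six "Shoes" rows and two rows from another category.
toy = pd.DataFrame({
    "category_0_name": ["Shoes"] * 6 + ["Toys"] * 2,
    "total_price": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 5.0, 7.5],
})
result = split_category(toy)
print(result["category_0_name"].value_counts())
```

Working on a copy keeps the loaded dataset unchanged, which makes it easier to check whether the reassignment is actually needed for the cross-shop example.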
docs/examples/retention.ipynb (2)
213-214: LGTM! The output text provides useful statistics about the dataset.
311-311: LGTM! The changes to the plot aesthetics are appropriate and enhance the visualization.
" <td>AMD</td>\n",
" <td>102</td>\n",
" <td>120.00</td>\n",
" <td>3</td>\n",
" <td>360.00</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4553</td>\n",
" <td>2023-02-05 09:31:42</td>\n",
" <td>1</td>\n",
" <td>735</td>\n",
" <td>Linden Wood Paneled Mirror</td>\n",
" <td>Home</td>\n",
" <td>5</td>\n",
" <td>Home Decor</td>\n",
" <td>30</td>\n",
" <td>Pottery Barn</td>\n",
" <td>147</td>\n",
" <td>599.00</td>\n",
" <td>1</td>\n",
" <td>599.00</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4553</td>\n",
" <td>2023-02-05 09:31:42</td>\n",
" <td>1</td>\n",
" <td>1107</td>\n",
" <td>Pro-V Daily Moisture Renewal Conditioner</td>\n",
" <td>Beauty</td>\n",
" <td>7</td>\n",
" <td>Hair Care</td>\n",
" <td>45</td>\n",
" <td>Pantene</td>\n",
" <td>222</td>\n",
" <td>4.99</td>\n",
" <td>1</td>\n",
" <td>4.99</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" transaction_id transaction_datetime customer_id product_id \\\n",
"0 7108 2023-01-12 17:44:29 1 15 \n",
"1 7108 2023-01-12 17:44:29 1 1317 \n",
"2 4553 2023-02-05 09:31:42 1 509 \n",
"3 4553 2023-02-05 09:31:42 1 735 \n",
"4 4553 2023-02-05 09:31:42 1 1107 \n",
"\n",
" product_name category_0_name category_0_id \\\n",
"0 Spawn Figure Toys 1 \n",
"1 Gone Girl Books 8 \n",
"2 Ryzen 3 3300X Electronics 3 \n",
"3 Linden Wood Paneled Mirror Home 5 \n",
"4 Pro-V Daily Moisture Renewal Conditioner Beauty 7 \n",
"\n",
" category_1_name category_1_id brand_name brand_id unit_price \\\n",
"0 Action Figures 1 McFarlane Toys 3 27.99 \n",
"1 Mystery & Thrillers 53 Alfred A. Knopf 264 10.49 \n",
"2 Computer Components 21 AMD 102 120.00 \n",
"3 Home Decor 30 Pottery Barn 147 599.00 \n",
"4 Hair Care 45 Pantene 222 4.99 \n",
"\n",
" quantity total_price store_id \n",
"0 2 55.98 6 \n",
"1 1 10.49 6 \n",
"2 3 360.00 4 \n",
"3 1 599.00 4 \n",
"4 1 4.99 4 "
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import strictyaml as yaml\n",
"\n",
"from pyretailscience.data.simulation import config_schema\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"\n",
"config_file = \"../../data/default_data_config.yaml\"\n",
"with open(config_file, \"r\") as f:\n",
" config = yaml.load(f.read(), config_schema).data\n",
"from pyretailscience import customer\n",
"\n",
"config[\"transactions\"][\"start_date\"] = datetime(year=2023, month=1, day=1)\n",
"sim = Simulation(seed=42, config=config)\n",
"sim.run()\n",
"df = pd.DataFrame(sim.transactions)"
"df = pd.read_parquet(\"../../data/transactions.parquet\")\n",
"df.head()"
Tip
Codebase Verification
File not found: ../../data/transactions.parquet
The Parquet file ../../data/transactions.parquet does not exist in the specified path. Please ensure that the file is available or update the path accordingly to avoid errors when loading the data in the notebook.
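One way to make a missing file fail with a clearer message is to guard the load. The sketch below is only an illustration: `load_transactions` is a hypothetical helper, not part of pyretailscience or this PR, and reading an existing file with `pd.read_parquet` assumes a Parquet engine such as pyarrow is installed.

```python
from pathlib import Path

import pandas as pd


def load_transactions(path="../../data/transactions.parquet"):
    """Load the pre-simulated transactions, failing early with a clear
    message if the file is missing (hypothetical helper)."""
    parquet_path = Path(path)
    if not parquet_path.is_file():
        raise FileNotFoundError(
            f"Expected pre-simulated data at {parquet_path.resolve()}; "
            "check the working directory or update the path."
        )
    return pd.read_parquet(parquet_path)


# Demonstrate the failure mode with a path that does not exist.
try:
    load_transactions("does/not/exist.parquet")
except FileNotFoundError as err:
    message = str(err)
    print(message)
```

Reporting the resolved absolute path in the error makes it obvious which working directory the relative path was interpreted against.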
Analysis chain
LGTM! Verify the existence of the Parquet file.
The changes to display the dataframe and the imports are appropriate.
However, ensure that the Parquet file ../../data/transactions.parquet exists and is accessible.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the existence of the Parquet file.
# Test: Check if the Parquet file exists. Expect: File exists.
if [ -f "../../data/transactions.parquet" ]; then
  echo "File exists."
else
  echo "File does not exist."
fi
Length of output: 95
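Note that the result of this check depends on the directory it runs from: a relative path such as ../../data/transactions.parquet only points at the repository's data/ folder when resolved from docs/examples/. The thread does not say where the script was executed, so the sketch below only illustrates that possibility. It builds a temporary directory mimicking the assumed layout; `resolve_from_notebook` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def resolve_from_notebook(repo_root, notebook_rel, data_rel):
    """Resolve a data path the way a notebook would: relative to the
    directory containing the notebook, not the repository root."""
    notebook_dir = (Path(repo_root) / notebook_rel).parent
    return (notebook_dir / data_rel).resolve()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Assumed layout: docs/examples/*.ipynb and data/transactions.parquet
    (root / "docs" / "examples").mkdir(parents=True)
    (root / "data").mkdir()
    (root / "data" / "transactions.parquet").touch()

    resolved = resolve_from_notebook(
        root,
        "docs/examples/data_contracts.ipynb",
        "../../data/transactions.parquet",
    )
    expected = (root / "data" / "transactions.parquet").resolve()
    found_from_notebook = resolved.exists()
    matches_expected = resolved == expected
    print(found_from_notebook, matches_expected)
```

Under this layout the file is found when the path is resolved from the notebook's directory, whereas the same relative path evaluated from the repository root would point two levels above the repository.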
PR Type
Enhancement, Documentation
Description
... data_contracts.ipynb.
Changes walkthrough 📝
- data_contracts.ipynb: Refactor data contracts example to load data from parquet file (docs/examples/data_contracts.ipynb)
  ... top_customers function.
- retention.ipynb: Refactor retention example to load data from parquet file (docs/examples/retention.ipynb)
- gain_loss.ipynb: Refactor gain/loss example to load data from parquet file (docs/examples/gain_loss.ipynb)
- cross_shop.ipynb: Refactor cross-shop example to load data from parquet file (docs/examples/cross_shop.ipynb)
- README.md: Update README to remove data simulation instructions (README.md)
- mkdocs.yml: Update mkdocs configuration to reflect new examples structure (mkdocs.yml)
- segmentation.ipynb: ... (docs/examples/segmentation.ipynb) ...
Summary by CodeRabbit
New Features
Documentation
- ... mkdocs.yml for better clarity and structure.
Chores
- ... .gitignore to exclude .csv files instead of .parquet files.
- ... click dependency and a script entry in pyproject.toml.