
Conversation

@mvanwyk (Contributor) commented Jul 3, 2024

PR Type

Enhancement, Documentation


Description

  • Refactored multiple Jupyter Notebook examples to load data from a parquet file instead of simulating it (see the sketch after this list).
  • Updated transaction data examples in notebooks to reflect new data.
  • Improved exception handling and added type annotations in data_contracts.ipynb.
  • Removed data simulation instructions from README and added placeholder text.
  • Updated mkdocs configuration to reflect new examples structure.
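
At its core, each affected notebook swaps its simulation setup for a single load of the shared pre-simulated dataset; a minimal sketch of the new pattern (the path is as quoted from the notebooks):

import pandas as pd

# Pre-simulated transaction data committed with the repo; the relative
# path is as used from within docs/examples/.
df = pd.read_parquet("../../data/transactions.parquet")
df.head()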

Changes walkthrough 📝

Relevant files

Enhancement

data_contracts.ipynb: Refactor data contracts example to load data from a parquet file
docs/examples/data_contracts.ipynb (+101/-110)
  • Replaced data simulation code with data loading from a parquet file.
  • Updated transaction data examples to reflect the new data.
  • Improved exception handling in the top_customers function.
  • Added type annotations and docstrings for better clarity.

retention.ipynb: Refactor retention example to load data from a parquet file
docs/examples/retention.ipynb (+173/-33)
  • Removed data simulation setup.
  • Added data loading from a parquet file.
  • Updated transaction data examples to reflect the new data.
  • Minor formatting improvements.

gain_loss.ipynb: Refactor gain/loss example to load data from a parquet file
docs/examples/gain_loss.ipynb (+66/-41)
  • Replaced data simulation code with data loading from a parquet file.
  • Updated transaction data examples to reflect the new data.
  • Minor formatting improvements.

cross_shop.ipynb: Refactor cross-shop example to load data from a parquet file
docs/examples/cross_shop.ipynb (+67/-42)
  • Replaced data simulation code with data loading from a parquet file.
  • Updated transaction data examples to reflect the new data.
  • Minor formatting improvements.

Documentation

README.md: Update README to remove data simulation instructions
README.md (+1/-27)
  • Removed the section on generating simulated data.
  • Added placeholder text for future updates.

Configuration changes

mkdocs.yml: Update mkdocs configuration to reflect the new examples structure
mkdocs.yml (+1/-3)
  • Reorganized the examples section.
  • Removed the reference to the data simulation example.

Additional files (token-limit)

segmentation.ipynb: ...
docs/examples/segmentation.ipynb (+294/-269)


    Summary by CodeRabbit

    • New Features

      • Updated documentation to reflect the transition from simulating to loading pre-simulated data.
    • Documentation

      • README.md now mentions that simulated transaction data functionality is "Coming Soon."
      • Updated multiple example notebooks to load pre-simulated data instead of generating it.
      • Revised navigation in mkdocs.yml for better clarity and structure.
    • Chores

      • Updated .gitignore to exclude .csv files instead of .parquet files.
      • Removed click dependency and a script entry in pyproject.toml.


    coderabbitai bot commented Jul 3, 2024

    Walkthrough

    The recent changes primarily focus on shifting the data workflow from generating simulated data to loading pre-simulated data. This impacts multiple notebooks and documentation files, altering the instructions and examples accordingly. Additionally, the .gitignore file was updated to exclude .csv instead of .parquet files, and the pyproject.toml was modified to remove certain dependencies and script entries. Navigation in mkdocs.yml was also restructured for better clarity and organization.

Changes

.gitignore: Updated to exclude .csv instead of *.parquet files.
README.md: Removed the section on generating simulated transaction data; replaced it with "Coming Soon."
docs/examples/cross_shop.ipynb: Changed from simulating data to loading pre-simulated data; updated the displayed data.
docs/examples/data_contracts.ipynb: Updated text and functionality for loading data; added a new class and type hints in function parameters.
docs/examples/gain_loss.ipynb: Switched from simulating to loading pre-simulated data; updated brand names and prices.
docs/examples/retention.ipynb: Significant changes to load data from a file; included new imports and updated output visualizations.
…/examples/… (multiple files): Similar changes grouped across multiple notebook files for brevity.
mkdocs.yml: Rearranged the navigation structure; removed outdated sections and links.
pyproject.toml: Removed the click dependency; reordered some package versions; removed a script entry.

    Poem

    In the realm where data flows,
    Files transformed and notebooks glowed,
    From simulating days to pre-simulated ways,
    Cleaner paths now boldly showed.
    CSVs we shall hide,
    In structured lines, our progress pried.
    🌟🚀 A celebratory leap, with code we keep! 🚀🌟


    @qodo-merge-pro bot added the documentation (Improvements or additions to documentation), enhancement (New feature or request), and Review effort [1-5]: 3 labels on Jul 3, 2024

    qodo-merge-pro bot commented Jul 3, 2024

    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Key issues to review

    Data Consistency:
    Ensure that the new data source (parquet files) maintains consistency with the previous simulated data, especially in terms of data structure and content.

    Exception Handling:
    Review the changes in exception handling in data_contracts.ipynb to ensure they are appropriate and provide clear error messages.

    Documentation Updates:
    Verify that all documentation and comments accurately reflect the changes made, especially in Jupyter notebooks and the README file.
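
For context, the function at the center of the exception-handling and annotation changes can be pieced together from the diff hunks quoted in the code suggestions below (with the pd.DataFrame typo fixed). The validation check and message here are illustrative stand-ins for lines not shown on this page:

import pandas as pd


def top_customers(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the top n customers by total spend."""
    # Illustrative check; the notebook builds msg from its own validation logic.
    if "total_price" not in df.columns:
        msg = "dataframe does not contain the expected total_price column"
        raise ValueError(msg)
    return df.sort_values("total_price", ascending=False).head(n).reset_index(drop=True)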


    qodo-merge-pro bot commented Jul 3, 2024

    PR Code Suggestions ✨

    Possible bug
    Correct the case sensitivity in the DataFrame type hint

    Replace the use of pd.Dataframe with pd.DataFrame to correct the case sensitivity issue in
    the type hint, which could lead to runtime errors or issues with static type checkers.

    docs/examples/data_contracts.ipynb [812]

    -def top_customers(df: pd.Dataframe, n: int=5) -> pd.DataFrame:
    +def top_customers(df: pd.DataFrame, n: int=5) -> pd.DataFrame:
     
    Suggestion importance[1-10]: 10

    Why: The correction from pd.Dataframe to pd.DataFrame is essential: pandas has no Dataframe attribute, and since annotations in a def are evaluated at definition time (absent from __future__ import annotations), the cell raises AttributeError as soon as the function is defined. Static type checkers would flag it as well.
    Best practice
    Add data validation after loading the dataframe to ensure it contains all expected columns

    It's recommended to validate the data loaded from external sources to ensure it meets
    expected formats and constraints. This can prevent issues arising from malformed or
    unexpected data.

    docs/examples/segmentation.ipynb [197-198]

     df = pd.read_parquet("../../data/transactions.parquet")
    +# Ensure the dataframe contains expected columns
    +expected_columns = {'transaction_id', 'transaction_datetime', 'customer_id', 'product_id', 'product_name', 'category_0_name', 'category_0_id', 'category_1_name', 'category_1_id', 'brand_name', 'brand_id', 'unit_price', 'quantity', 'total_price', 'store_id'}
    +assert expected_columns.issubset(df.columns), "Dataframe is missing one or more expected columns"
     df.head()
     
    Suggestion importance[1-10]: 9

    Why: This suggestion adds a crucial validation step to ensure the data meets expected formats, which can prevent downstream errors due to malformed data.

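    One caveat on the assert-based check above: assertions are stripped when Python runs with -O, so an explicit raise is more robust. A sketch reusing the same names:

    # expected_columns and df as defined in the suggestion above
    missing_columns = expected_columns - set(df.columns)
    if missing_columns:
        raise ValueError(f"Dataframe is missing expected columns: {sorted(missing_columns)}")
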
    Set the random seed outside the function call for consistent outputs

    Ensure that the random seed is set outside the function call for reproducibility. This
    practice helps in maintaining consistent outputs for the random choices made in the
    notebook.

    docs/examples/cross_shop.ipynb [246-248]

    -df.loc[shoes_idx, "category_1_name"] = np.random.RandomState(42).choice(
    +rng = np.random.RandomState(42)
    +df.loc[shoes_idx, "category_1_name"] = rng.choice(
         ["Shoes", "Jeans"], size=shoes_idx.sum(), p=[0.5, 0.5],
     )
     
    Suggestion importance[1-10]: 8

    Why: Naming the generator keeps the seeding in one place and makes it reusable for subsequent random operations. Note that the original call already seeds RandomState(42), so the outputs were reproducible before this change; the gain here is maintainability.
    Use a custom exception for clearer error handling

    Instead of raising a generic ValueError, raise a more specific custom exception to provide
    clearer error handling specific to the domain or application.

    docs/examples/data_contracts.ipynb [817]

    -raise ValueError(msg)
    +class ContractValidationError(Exception):
    +    pass
    +raise ContractValidationError(msg)
     
    Suggestion importance[1-10]: 7

    Why: Using a custom exception provides clearer, domain-specific error handling, which is a best practice for maintainable code. In practice the exception class should be defined once at module level rather than immediately before the raise, as the inline diff might suggest.
    Robustness
    Add error handling around the file reading operation to manage potential exceptions

    Consider adding error handling for file reading operations to manage exceptions that may
    occur if the file is missing or corrupt.

    docs/examples/segmentation.ipynb [197]

    -df = pd.read_parquet("../../data/transactions.parquet")
    +try:
    +    df = pd.read_parquet("../../data/transactions.parquet")
    +except Exception as e:
    +    print(f"Failed to read data: {e}")
    +    # Handle the error appropriately, possibly re-raise or log
     
    Suggestion importance[1-10]: 9

    Why: Adding error handling improves robustness when the data file is missing or corrupt. Note that merely printing the error, as sketched, leaves df undefined for later cells, so re-raising or falling back to a known-good source is usually the better choice in a notebook.
    Enhancement
    Add a data type expectation for the 'total_price' column

    Ensure that the ExpectationConfiguration for the 'total_price' column includes a check for
    the column's data type, enhancing data validation and consistency.

    docs/examples/data_contracts.ipynb [895-897]

     ExpectationConfiguration(
         expectation_type="expect_column_to_exist",
         kwargs={"column": "total_price"},
     ),
    +ExpectationConfiguration(
    +    expectation_type="expect_column_values_to_be_of_type",
    +    kwargs={"column": "total_price", "type_": "float"},
    +),
     
    Suggestion importance[1-10]: 9

    Why: Including a data type expectation for the 'total_price' column enhances data validation and consistency, ensuring that the data meets expected standards.

    Add a check for an empty DataFrame to prevent errors

    Add a check to ensure that the DataFrame df is not empty before proceeding with sorting
    and returning the top customers. This prevents potential errors when operating on an empty
    DataFrame.

    docs/examples/data_contracts.ipynb [819]

    +if df.empty:
    +    return df
     return df.sort_values("total_price", ascending=False).head(n).reset_index(drop=True)
     
    Suggestion importance[1-10]: 8

    Why: Adding a check for an empty DataFrame enhances the robustness of the function by preventing potential errors when operating on an empty DataFrame.

    Use pandas to_html for dynamic HTML table generation

    Replace the hard-coded HTML table with a dynamic generation using pandas DataFrame to_html
    method, which can be customized with CSS classes and other HTML attributes. This approach
    enhances code readability and maintainability.

    docs/examples/retention.ipynb [36-148]

    -<table border="1" class="dataframe">
    -    <thead>
    -        ...
    -    </thead>
    -    <tbody>
    -        ...
    -    </tbody>
    -</table>
    +df.to_html(classes='dataframe', border=1)
     
    Suggestion importance[1-10]: 8

    Why: This suggestion enhances code readability and maintainability by leveraging pandas' built-in functionality, reducing the need for hard-coded HTML.

    Possible issue
    Add a check to ensure the DataFrame is not empty to prevent runtime errors

    To ensure that the DataFrame is not empty before performing operations, add a check to
    confirm that df is not empty after loading the data. This check prevents potential errors
    in subsequent operations if the data file is missing or empty.

    docs/examples/cross_shop.ipynb [195-196]

     df = pd.read_parquet("../../data/transactions.parquet")
    +if df.empty:
    +    raise ValueError("Data file is empty or not found.")
     df.head()
     
    Suggestion importance[1-10]: 9

    Why: This suggestion addresses a potential runtime error, which is crucial for ensuring the robustness of the code.

    Maintainability
    Replace hardcoded file paths with environment variables for better flexibility and maintainability

    To avoid hardcoding file paths, consider using a configuration file or environment
    variables to manage file paths, making the code more flexible and easier to maintain
    across different environments.

    docs/examples/segmentation.ipynb [197]

    -df = pd.read_parquet("../../data/transactions.parquet")
    +import os
    +data_path = os.getenv('DATA_PATH', '../../data/')
    +df = pd.read_parquet(data_path + "transactions.parquet")
     
    Suggestion importance[1-10]: 8

    Why: Using environment variables for file paths enhances the flexibility and maintainability of the code, making it easier to adapt to different environments.

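    A pathlib variant of the same idea avoids manual string concatenation (a sketch; the DATA_PATH variable is the assumption made in the suggestion above):

    import os
    from pathlib import Path

    import pandas as pd

    data_dir = Path(os.getenv("DATA_PATH", "../../data"))
    df = pd.read_parquet(data_dir / "transactions.parquet")
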
    Encapsulate data loading logic into a function for improved readability and reusability

    For better readability and maintenance, consider using a function to encapsulate the data
    loading logic, especially if similar data loading patterns are used multiple times in the
    notebook.

    docs/examples/segmentation.ipynb [197-198]

    -df = pd.read_parquet("../../data/transactions.parquet")
    +def load_data(file_path):
    +    return pd.read_parquet(file_path)
    +
    +df = load_data("../../data/transactions.parquet")
     df.head()
     
    Suggestion importance[1-10]: 7

    Why: Encapsulating the data loading logic into a function enhances code readability and reusability, especially if similar patterns are used multiple times in the notebook.

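    Taken together with the validation and error-handling suggestions above, one possible shape for such a loader (a sketch; the column set is abridged from the earlier suggestion):

    from pathlib import Path

    import pandas as pd

    # Abridged from the expected_columns set in the earlier suggestion.
    EXPECTED_COLUMNS = {"transaction_id", "transaction_datetime", "customer_id", "total_price"}


    def load_transactions(file_path: str) -> pd.DataFrame:
        """Load the pre-simulated transactions and verify the expected schema."""
        path = Path(file_path)
        if not path.is_file():
            raise FileNotFoundError(f"Data file not found: {path}")
        df = pd.read_parquet(path)
        missing = EXPECTED_COLUMNS - set(df.columns)
        if missing:
            raise ValueError(f"Dataframe is missing expected columns: {sorted(missing)}")
        return df


    df = load_transactions("../../data/transactions.parquet")
    df.head()
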
    Replace inline CSS with external CSS file for DataFrame styling

    Consider using CSS classes instead of inline styles for the DataFrame HTML representation
    to improve maintainability and separation of concerns. This change will make it easier to
    manage styles globally and reduce redundancy in the notebook.

    docs/examples/retention.ipynb [23-35]

    -<style scoped>
    -    .dataframe tbody tr th:only-of-type {
    -        vertical-align: middle;
    -    }
    -    ...
    -</style>
    +<link rel="stylesheet" type="text/css" href="dataframe_style.css">
     
    Suggestion importance[1-10]: 7

    Why: Using an external CSS file improves maintainability and separation of concerns, but it requires additional setup to ensure the CSS file is available and correctly linked.

    Use a variable for the file path to enhance flexibility and maintainability

    Replace the hard-coded file path with a variable that can be set at the top of the
    notebook. This change makes the notebook more flexible and easier to maintain, especially
    when the data source changes or when the notebook is used in different environments.

    docs/examples/cross_shop.ipynb [195]

    -df = pd.read_parquet("../../data/transactions.parquet")
    +data_file_path = "../../data/transactions.parquet"
    +df = pd.read_parquet(data_file_path)
     
    Suggestion importance[1-10]: 7

    Why: Using a variable for the file path makes the code more flexible and easier to maintain, which is a good practice but not critical.

    Improve variable naming for better readability

    Consider using a more descriptive variable name instead of shoes_idx to enhance code
    readability. For example, shoes_category_filter would provide more context about the
    purpose of the variable.

    docs/examples/cross_shop.ipynb [245]

    -shoes_idx = df["category_1_name"] == "Shoes"
    +shoes_category_filter = df["category_1_name"] == "Shoes"
     
    Suggestion importance[1-10]: 6

    Why: The suggestion improves code readability by using a more descriptive variable name, which is beneficial for maintainability but not critical.

    Readability
    Improve DataFrame text display formatting in the notebook

    Ensure the DataFrame display in 'text/plain' output is properly formatted for better
    readability. Consider using pd.set_option to adjust display settings like max_columns,
    max_rows, or precision.

    docs/examples/retention.ipynb [153-179]

    -"   transaction_id transaction_datetime  customer_id  product_id  \\\n",
    -"0            7108  2023-01-12 17:44:29            1          15   \n",
    -...
    +pd.set_option('display.max_columns', None)
    +pd.set_option('display.precision', 2)
    +df.head()
     
    Suggestion importance[1-10]: 6

    Why: Adjusting display settings can improve readability, but the current formatting is already fairly readable. This is a minor enhancement.

    Use Python dictionary syntax for arrow properties to enhance readability

    Replace the manually quoted arrowprops dictionary literal (a matplotlib annotation property, not HTML) with Python's dict() keyword syntax, which enhances code readability and maintainability.

    docs/examples/retention.ipynb [311]

    -"arrowprops={\"facecolor\": \"black\", \"arrowstyle\": \"-|>\", \"connectionstyle\": \"arc3,rad=-0.25\", \"mutation_scale\": 25},\n",
    +"arrowprops=dict(facecolor='black', arrowstyle='-|>', connectionstyle='arc3,rad=-0.25', mutation_scale=25),\n",
     
    Suggestion importance[1-10]: 5

    Why: The existing code is already quite readable, and this change offers only a slight improvement in readability and maintainability.


    @coderabbitai bot left a comment

    Actionable comments posted: 1

    Review details

    Configuration used: CodeRabbit UI
    Review profile: CHILL

    Commits

    Files that changed from the base of the PR, between commits 9a3a0b9 and 405b312.

    Files ignored due to path filters (2)
    • data/transactions.parquet is excluded by !**/*.parquet
    • poetry.lock is excluded by !**/*.lock
    Files selected for processing (8)
    • .gitignore (1 hunks)
    • README.md (1 hunks)
    • docs/examples/cross_shop.ipynb (7 hunks)
    • docs/examples/data_contracts.ipynb (6 hunks)
    • docs/examples/gain_loss.ipynb (7 hunks)
    • docs/examples/retention.ipynb (3 hunks)
    • mkdocs.yml (1 hunks)
    • pyproject.toml (2 hunks)
    Files skipped from review due to trivial changes (4)
    • .gitignore
    • README.md
    • mkdocs.yml
    • pyproject.toml
    Additional comments not posted (23)
    docs/examples/data_contracts.ipynb (7)

    13-13: LGTM!

    The change from "creating and simulating some data" to "loading some simulated data" is consistent with the PR objective.


    62-181: LGTM!

    The transaction data table has been updated with new data values. These changes align with the PR objective of using pre-simulated data.


    192-192: LGTM!

    The code now loads data from a Parquet file, which is consistent with the PR objective.


    812-817: LGTM!

    The function signature now includes type annotations, which enhance code readability and maintainability.


    882-891: LGTM!

    The new class CustomCustomerLevelContract is well-documented and follows best practices for extending data contracts.


    Line range hint 890-918: LGTM!

    The __init__ method now includes type annotations and additional expectations for the total_price column, improving code readability and maintainability.


    812-817: LGTM!

    The code cell validates the custom contract and clips the total_price values to meet the contract expectations, ensuring data integrity.

    docs/examples/gain_loss.ipynb (7)

    17-17: LGTM!

    The markdown cell correctly reflects the change in data loading.


    66-69: LGTM!

    The displayed table data is consistent and correctly formatted.


    Line range hint 84-88: LGTM!

    The displayed table data is consistent and correctly formatted.


    Line range hint 102-106: LGTM!

    The displayed table data is consistent and correctly formatted.


    118-185: LGTM!

    The displayed table data is consistent and correctly formatted.


    198-199: LGTM!

    The code correctly updates to load data from a Parquet file.


    263-263: LGTM!

    The code correctly reassigns rows and applies discounts based on the new data.

    docs/examples/cross_shop.ipynb (7)

    14-14: Update text to reflect loading of simulated data.

    The text change correctly reflects the new approach of loading simulated data instead of generating it.


    63-63: Verify data formatting and consistency.

    Ensure that the displayed data values are correctly formatted and consistent with the rest of the dataset.


    81-81: Verify data formatting and consistency.

    Ensure that the displayed data values are correctly formatted and consistent with the rest of the dataset.


    99-99: Verify data formatting and consistency.

    Ensure that the displayed data values are correctly formatted and consistent with the rest of the dataset.


    115-182: Verify data formatting and consistency.

    Ensure that the displayed data values are correctly formatted and consistent with the rest of the dataset.


    193-196: Import necessary libraries and load data from a Parquet file.

    The imports and data loading code appear to be correct.

    Ensure that the data file path "../../data/transactions.parquet" is valid and accessible.


    247-247: Randomly assign category name for shoes.

    The code appears to correctly randomly assign the category name "Shoes" or "Jeans" to the rows where the category name is currently "Shoes".

    Ensure that the random assignment logic is correct and necessary.

    docs/examples/retention.ipynb (2)

    213-214: LGTM!

    The output text provides useful statistics about the dataset.


    311-311: LGTM!

    The changes to the plot aesthetics are appropriate and enhance the visualization.

    Comment on lines +20 to +194 (docs/examples/retention.ipynb)

    The quoted hunk is the notebook's first data cell. Its output shows the first five rows of the transactions dataframe:

       transaction_id transaction_datetime  customer_id  product_id  \
    0            7108  2023-01-12 17:44:29            1          15
    1            7108  2023-01-12 17:44:29            1        1317
    2            4553  2023-02-05 09:31:42            1         509
    3            4553  2023-02-05 09:31:42            1         735
    4            4553  2023-02-05 09:31:42            1        1107

                                    product_name category_0_name  category_0_id  \
    0                               Spawn Figure            Toys              1
    1                                  Gone Girl           Books              8
    2                              Ryzen 3 3300X     Electronics              3
    3                 Linden Wood Paneled Mirror            Home              5
    4   Pro-V Daily Moisture Renewal Conditioner          Beauty              7

           category_1_name  category_1_id       brand_name  brand_id  unit_price  \
    0       Action Figures              1   McFarlane Toys         3       27.99
    1  Mystery & Thrillers             53  Alfred A. Knopf       264       10.49
    2  Computer Components             21              AMD       102      120.00
    3           Home Decor             30     Pottery Barn       147      599.00
    4            Hair Care             45          Pantene       222        4.99

       quantity  total_price  store_id
    0         2        55.98         6
    1         1        10.49         6
    2         3       360.00         4
    3         1       599.00         4
    4         1         4.99         4

    The source cell drops the old simulation setup (loading ../../data/default_data_config.yaml with strictyaml, setting config["transactions"]["start_date"], running Simulation(seed=42, config=config), and building the dataframe from sim.transactions) and instead ends with a direct load of the pre-simulated data:

    df = pd.read_parquet("../../data/transactions.parquet")
    df.head()

    Tip

    Codebase Verification

    File not found: ../../data/transactions.parquet

    The Parquet file ../../data/transactions.parquet does not exist at that path relative to the repository root, where the verification script below ran. From the notebooks' own directory (docs/examples/), however, the same relative path resolves to data/transactions.parquet, which this PR adds (it was excluded from review by the !**/*.parquet path filter), so this finding is likely a false positive. Ensure the file is present at data/transactions.parquet when running the notebooks.

    Analysis chain

    LGTM! Verify the existence of the Parquet file.

    The changes to display the dataframe and the imports are appropriate.

    However, ensure that the Parquet file ../../data/transactions.parquet exists and is accessible.

    Scripts executed

    The following scripts were executed for the analysis:

    Script:

    #!/bin/bash
    # Description: Verify the existence of the Parquet file.
    
    # Test: Check if the Parquet file exists. Expect: File exists.
    if [ -f "../../data/transactions.parquet" ]; then
      echo "File exists."
    else
      echo "File does not exist."
    fi
    

    Length of output: 95

    @mvanwyk mvanwyk merged commit 1a55803 into main Jul 3, 2024
    @mvanwyk mvanwyk deleted the split_out_data_sim branch July 3, 2024 18:43