This project simulates real-world data analysis using Pandas. Your task is to analyze a provided dataset of e-commerce orders and answer questions about customer behavior and purchase trends.
Dataset:
You will be given a CSV file named orders.csv containing data on customer orders, including:
- order_id: Unique identifier for each order
- customer_id: Unique identifier for each customer
- product_id: Unique identifier for each product
- product_name: Name of the product
- category: Category the product belongs to (e.g., electronics, clothing, home)
- price: Price of the product
- order_date: Date of the order (format: YYYY-MM-DD)
- city: Customer's city
- country: Customer's country
Tasks:
Import Libraries and Load Data:
- Begin by importing pandas and reading the orders.csv file into a pandas DataFrame.
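A minimal sketch of the loading step. So that it runs without the real file, it reads a small inline sample whose rows are invented for illustration; with the actual dataset you would pass the path "orders.csv" to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for orders.csv (values are made up).
SAMPLE_CSV = """order_id,customer_id,product_id,product_name,category,price,order_date,city,country
1,C001,P10,Headphones,electronics,59.99,2023-01-15,Berlin,Germany
2,C002,P11,T-shirt,clothing,19.50,2023-01-17,Paris,France
3,C001,P12,Lamp,home,35.00,2023-02-03,Berlin,Germany
"""

# With the real file this would be: orders = pd.read_csv("orders.csv")
orders = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Show the first rows to confirm the columns were read as expected.
print(orders.head())
```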
Initial Data Exploration:
- Get basic information about the DataFrame (shape, data types, summary statistics).
- Identify and handle any missing values present in the data.
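The exploration step might look like the sketch below. The small DataFrame is a made-up stand-in for the loaded orders data, with two missing values added on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing price and one missing city.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "category": ["electronics", "clothing", "home", "clothing"],
    "price": [59.99, np.nan, 35.00, 19.50],
    "city": ["Berlin", "Paris", None, "Paris"],
})

print(orders.shape)       # (number of rows, number of columns)
print(orders.dtypes)      # data type of each column
orders.info()             # non-null counts and memory usage
print(orders.describe())  # summary statistics for numeric columns

# Count missing values per column before deciding how to handle them.
missing_per_column = orders.isna().sum()
print(missing_per_column)
```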
Data Cleaning and Preprocessing:
- Check for inconsistencies or errors in the data (e.g., invalid dates, negative prices).
- Clean the data by correcting errors, removing outliers, or imputing missing values (if applicable).
- Convert the order_date column to a datetime format for further analysis.
- Decide whether to fill, drop, or leave the missing values based on your analysis.
- Ensure all data types are appropriate for analysis (e.g., convert price to numeric).
- Remove any duplicates if present.
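One possible cleaning pass is sketched below on invented data that contains a duplicate row, an impossible date, a negative price, and a non-numeric price. Dropping the bad rows is only one policy, chosen here for brevity; your own analysis may justify filling or keeping some of them.

```python
import pandas as pd

# Hypothetical raw data with typical problems (all values are made up).
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "price": ["59.99", "19.50", "19.50", "-5", "n/a"],
    "order_date": ["2023-01-15", "2023-01-17", "2023-01-17",
                   "2023-02-30", "2023-03-01"],
})

# Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Unparseable values become NaN / NaT instead of raising an error.
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")
clean["order_date"] = pd.to_datetime(
    clean["order_date"], format="%Y-%m-%d", errors="coerce"
)

# Assumed policy: keep only rows with a valid, non-negative price
# and a valid date.
valid = (
    clean["price"].notna()
    & (clean["price"] >= 0)
    & clean["order_date"].notna()
)
clean = clean[valid]

print(clean)
print(clean.dtypes)
```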
Data Manipulation:
- Filter Data: Filter orders based on specific criteria like order date range, product category, or customer location.
- Sort Data: Sort orders by various attributes like total order value, order date, or product price.
- Group Data: Group orders by customer, product category, or city and calculate aggregations like total sales, average order value, or most frequent purchases.
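The three operations above can be sketched as follows. The data is invented, and because the dataset has no quantity column, the sketch treats price as the value of an order, which is an assumption.

```python
import pandas as pd

# Hypothetical cleaned orders (values are made up).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": ["C001", "C002", "C001", "C003", "C002"],
    "category": ["electronics", "clothing", "home", "electronics", "clothing"],
    "price": [59.99, 19.50, 35.00, 120.00, 42.00],
    "order_date": pd.to_datetime(
        ["2023-01-15", "2023-01-17", "2023-02-03", "2023-02-10", "2023-03-05"]
    ),
    "city": ["Berlin", "Paris", "Berlin", "Madrid", "Paris"],
})

# Filter: electronics orders placed on or after 1 February 2023.
recent_electronics = orders[
    (orders["category"] == "electronics")
    & (orders["order_date"] >= "2023-02-01")
]

# Sort: most expensive orders first.
by_price = orders.sort_values("price", ascending=False)

# Group: total sales, average price, and order count per category.
category_summary = orders.groupby("category").agg(
    total_sales=("price", "sum"),
    average_price=("price", "mean"),
    order_count=("order_id", "count"),
)

print(recent_electronics)
print(by_price.head())
print(category_summary)
```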
Advanced Analysis:
- Calculate the number of orders per customer and identify the most frequent buyers.
- Analyze purchase trends by category or city over time (e.g., monthly sales).
- Explore the relationship between order value and customer location.
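A sketch of how these analyses could be approached, again on invented data. Monthly periods are one reasonable time granularity; weekly or quarterly periods would work the same way.

```python
import pandas as pd

# Hypothetical cleaned orders (values are made up).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": ["C001", "C002", "C001", "C003", "C001"],
    "category": ["electronics", "clothing", "home", "electronics", "clothing"],
    "price": [59.99, 19.50, 35.00, 120.00, 42.00],
    "order_date": pd.to_datetime(
        ["2023-01-15", "2023-01-17", "2023-02-03", "2023-02-10", "2023-03-05"]
    ),
    "city": ["Berlin", "Paris", "Berlin", "Madrid", "Berlin"],
})

# Orders per customer, most frequent buyers first.
orders_per_customer = orders["customer_id"].value_counts()
top_buyers = orders_per_customer.head(3)

# Monthly sales per category: rows are months, columns are categories.
monthly_sales = (
    orders
    .groupby([orders["order_date"].dt.to_period("M"), "category"])["price"]
    .sum()
    .unstack(fill_value=0)
)

# Order value by location: count, mean, and median price per city.
price_by_city = orders.groupby("city")["price"].agg(["count", "mean", "median"])

print(top_buyers)
print(monthly_sales)
print(price_by_city)
```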
Data Visualization:
Create various visualizations to explore the data, such as:
- Line plots for trends over time (if applicable).
- Scatter plots to explore relationships between numerical variables.
- Bar plots for categorical data counts.
- Histograms to understand the distribution of numerical data.
- Box plots to detect outliers.
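The sketch below draws three of the plot types listed above (line, bar, histogram) with Matplotlib on invented data. The Agg backend line is only there so the example also runs without a display; inside a notebook you can remove it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical cleaned orders (values are made up).
orders = pd.DataFrame({
    "category": ["electronics", "clothing", "home", "electronics", "clothing"],
    "price": [59.99, 19.50, 35.00, 120.00, 42.00],
    "order_date": pd.to_datetime(
        ["2023-01-15", "2023-01-17", "2023-02-03", "2023-02-10", "2023-03-05"]
    ),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Line plot: total sales per month.
monthly = orders.groupby(orders["order_date"].dt.to_period("M"))["price"].sum()
axes[0].plot(monthly.index.astype(str), monthly.values, marker="o")
axes[0].set(title="Monthly sales", xlabel="Month", ylabel="Total sales")

# Bar plot: number of orders per category.
counts = orders["category"].value_counts()
axes[1].bar(counts.index, counts.values)
axes[1].set(title="Orders per category", xlabel="Category", ylabel="Orders")

# Histogram: distribution of prices.
axes[2].hist(orders["price"], bins=5)
axes[2].set(title="Price distribution", xlabel="Price", ylabel="Frequency")

fig.tight_layout()
# In a script, call fig.savefig(...) or plt.show() to see the result.
```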
Exploratory Analysis:
Answer key questions about the data:
- What is the price distribution of products?
- Are there any patterns in product purchases over time?
- How do prices vary by city and country?
- Which product category has the highest total sales?
- In which city do customers tend to place the most expensive orders?
- Which customers are the most frequent buyers?
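Some of these questions reduce to a groupby followed by idxmax, as the sketch below shows on invented data. Reading "most expensive orders" as the highest mean price per city is an assumption; the median or the maximum are equally defensible choices.

```python
import pandas as pd

# Hypothetical cleaned orders (values are made up).
orders = pd.DataFrame({
    "category": ["electronics", "clothing", "home", "electronics", "clothing"],
    "price": [59.99, 19.50, 35.00, 120.00, 42.00],
    "city": ["Berlin", "Paris", "Berlin", "Madrid", "Berlin"],
    "country": ["Germany", "France", "Germany", "Spain", "Germany"],
})

# Which product category has the highest total sales?
sales_by_category = orders.groupby("category")["price"].sum()
top_category = sales_by_category.idxmax()

# In which city do customers tend to place the most expensive orders?
# Interpreted here as the city with the highest mean price (an assumption).
mean_price_by_city = orders.groupby("city")["price"].mean()
priciest_city = mean_price_by_city.idxmax()

# How do prices vary by city and country?
price_by_location = orders.groupby(["country", "city"])["price"].describe()

print(top_category)
print(priciest_city)
print(price_by_location)
```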
Save Results:
- Save the cleaned and analyzed DataFrame to a Parquet file (orders_cleaned.parquet) for efficient storage and later use.
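A minimal sketch of the save step. The helper names save_cleaned and load_cleaned are hypothetical, introduced only for this example. Note that DataFrame.to_parquet needs a Parquet engine such as pyarrow or fastparquet to be installed.

```python
import pandas as pd


def save_cleaned(df: pd.DataFrame, path: str = "orders_cleaned.parquet") -> None:
    """Write df to a Parquet file (requires pyarrow or fastparquet)."""
    # index=False leaves the default integer index out of the file.
    df.to_parquet(path, index=False)


def load_cleaned(path: str = "orders_cleaned.parquet") -> pd.DataFrame:
    """Read the Parquet file back into a DataFrame."""
    return pd.read_parquet(path)
```

After cleaning, calling save_cleaned(orders) writes the file, and load_cleaned() in a later session returns the data without repeating the type conversions, since Parquet stores the column types alongside the values.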
Additional Tips:
- Comment your code to explain each step.
- Make sure to handle any potential edge cases, such as missing data or incorrect data types.
- Test your code with different subsets of the data to ensure it works correctly in various scenarios.
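One way to act on the last tip is to wrap an analysis step in a function and run it on several subsets, including an empty one. The function name and the data below are hypothetical and serve only as an illustration.

```python
import math

import pandas as pd


def total_sales_by_category(df: pd.DataFrame) -> pd.Series:
    """Sum the price column for each category."""
    return df.groupby("category")["price"].sum()


# Hypothetical cleaned orders (values are made up).
orders = pd.DataFrame({
    "category": ["electronics", "clothing", "home", "electronics", "clothing"],
    "price": [59.99, 19.50, 35.00, 120.00, 42.00],
})

subsets = {
    "full data": orders,
    "random sample": orders.sample(n=2, random_state=0),
    "single category": orders[orders["category"] == "clothing"],
    "empty": orders.iloc[0:0],
}

for name, subset in subsets.items():
    result = total_sales_by_category(subset)
    # The per-category totals must add up to the subset's overall total.
    assert math.isclose(result.sum(), subset["price"].sum()), name
    print(f"{name}: ok ({len(subset)} rows)")
```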
Deliverables:
- A Jupyter Notebook or Python script documenting your analysis process, including code for each task.
This project allows students to practice essential Pandas techniques like data loading, cleaning, filtering, sorting, grouping, aggregation, and saving to Parquet files. It encourages them to explore real-world data analysis scenarios and gain practical experience working with e-commerce datasets.
The objective of this assignment is to perform an exploratory data analysis (EDA) on a given dataset using Pandas, NumPy, and Matplotlib. You will load the dataset, clean and preprocess the data, and create various visualizations to uncover insights and patterns.
Dataset Selection:
- Choose one of the following datasets for your analysis. These datasets are available on Kaggle:
Loading the Dataset:
- Import the necessary libraries (pandas, numpy, matplotlib.pyplot).
- Load the dataset into a Pandas DataFrame.
- Display the first few rows of the DataFrame.
Data Cleaning and Preprocessing:
- Check for missing values and handle them appropriately (e.g., filling, dropping).
- Convert categorical variables to numerical if necessary (e.g., using pd.get_dummies or LabelEncoder).
- Remove any duplicates if present.
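The sketch below shows both encoding styles on a made-up column. One-hot encoding uses pd.get_dummies. For integer labels it uses pandas category codes rather than scikit-learn's LabelEncoder, purely to avoid an extra dependency; LabelEncoder would give similar integer codes.

```python
import pandas as pd

# Hypothetical data; the last row duplicates the second one.
df = pd.DataFrame({
    "category": ["electronics", "clothing", "home", "clothing", "clothing"],
    "price": [59.99, 19.50, 35.00, 42.00, 19.50],
})

# Remove exact duplicate rows first.
deduplicated = df.drop_duplicates().reset_index(drop=True)

# One-hot encoding: one 0/1 column per category value.
one_hot = pd.get_dummies(deduplicated, columns=["category"], dtype=int)

# Label encoding: one integer per category value, in sorted order.
labels = deduplicated["category"].astype("category").cat.codes

print(one_hot)
print(labels.tolist())
```

One-hot encoding avoids implying an order between categories, while label encoding keeps a single column; which one fits depends on the later analysis.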
Descriptive Statistics:
- Generate descriptive statistics for numerical columns (mean, median, standard deviation, etc.).
- Provide summary statistics for categorical columns.
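Both kinds of summary can be produced with describe, as sketched below on invented data. Selecting the non-numeric columns with select_dtypes is one way to get the categorical summary.

```python
import pandas as pd

# Hypothetical data (values are made up).
df = pd.DataFrame({
    "category": ["electronics", "clothing", "home", "clothing"],
    "price": [60.0, 20.0, 35.0, 45.0],
})

# Numerical columns: count, mean, std, min, quartiles, max.
numeric_summary = df.describe()
medians = df.median(numeric_only=True)

# Categorical columns: count, number of unique values, most frequent
# value (top), and its frequency (freq).
categorical_summary = df.select_dtypes(exclude="number").describe()
category_counts = df["category"].value_counts()

print(numeric_summary)
print(medians)
print(categorical_summary)
print(category_counts)
```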
Data Visualization:
- Create at least five different types of visualizations using Matplotlib:
  - Line Plot
  - Scatter Plot
  - Bar Plot
  - Histogram
  - Box Plot
- Customize the plots with appropriate titles, labels, and legends.
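As an illustration of the customization step, the sketch below draws a scatter plot and a box plot with titles, axis labels, and a legend. The data is synthetic, and the names feature_x and feature_y are placeholders for columns of whichever dataset you choose.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: feature_y depends roughly linearly on feature_x.
rng = np.random.default_rng(0)
feature_x = rng.normal(50, 10, 100)
feature_y = 2 * feature_x + rng.normal(0, 15, 100)

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot with a fitted straight line and a legend.
ax_scatter.scatter(feature_x, feature_y, alpha=0.6, label="observations")
slope, intercept = np.polyfit(feature_x, feature_y, 1)
xs = np.sort(feature_x)
ax_scatter.plot(xs, slope * xs + intercept, color="red", label="linear fit")
ax_scatter.set_title("feature_y vs. feature_x")
ax_scatter.set_xlabel("feature_x")
ax_scatter.set_ylabel("feature_y")
ax_scatter.legend()

# Box plot of both features, with named tick labels.
ax_box.boxplot([feature_x, feature_y])
ax_box.set_xticks([1, 2])
ax_box.set_xticklabels(["feature_x", "feature_y"])
ax_box.set_title("Distribution of each feature")
ax_box.set_ylabel("Value")

fig.tight_layout()
```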
Exploratory Questions:
- Answer the following questions using the visualizations and analyses:
  - What are the key characteristics of the dataset?
  - Are there any noticeable patterns or trends in the data?
  - How do different features relate to each other?
  - Are there any outliers or anomalies in the data?
  - What are the distributions of the numerical features?
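Two of these questions have simple numerical counterparts to the plots: a correlation matrix for relationships between features, and the 1.5 × IQR rule for outliers. The sketch below applies both to invented data. The 1.5 multiplier is a common convention, not a requirement of the assignment.

```python
import pandas as pd

# Hypothetical data; the weight of 300 is deliberately anomalous.
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190, 175],
    "weight": [50, 58, 69, 80, 92, 300],
})

# How do different features relate to each other?
correlations = df.corr()

# Are there outliers? Flag values outside 1.5 * IQR from the quartiles.
q1 = df["weight"].quantile(0.25)
q3 = df["weight"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["weight"] < q1 - 1.5 * iqr) | (df["weight"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

# What do the distributions look like? Skewness near 0 suggests symmetry.
skewness = df.skew()

print(correlations)
print(outliers)
print(skewness)
```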
Reporting:
- Write a summary report (500-1000 words) detailing your findings from the EDA.
- Include the visualizations and describe the insights gained from each.
- Submit a Jupyter Notebook containing all the code and visualizations.
- Completeness: All steps of the assignment are completed.
- Code Quality: Code is clean, well-documented, and follows best practices.
- Visualizations: Plots are clear, well-labeled, and provide meaningful insights.
- Analysis: Answers to exploratory questions are thorough and demonstrate a good understanding of the data.
- https://www.kaggle.com/code/imoore/intro-to-exploratory-data-analysis-eda-in-python
- https://www.kaggle.com/code/spscientist/a-simple-tutorial-on-exploratory-data-analysis
- https://www.kaggle.com/code/ekami66/detailed-exploratory-data-analysis-with-python