Medical Insurance Cost Prediction Dashboard 💊

This project is an end‑to‑end machine learning + Streamlit application that predicts annual medical insurance charges based on a person’s profile (age, BMI, smoker status, region, etc.). It covers data analysis, model training, evaluation, and cloud deployment.

🔗 Live app: https://pratik-icp.streamlit.app/
📦 GitHub repo: https://github.com/pratikrath126/insurance-cost-prediction

1. Problem statement

Health insurance companies need to estimate how much a customer is likely to cost in medical expenses. The goal of this project is to:

Analyse which factors drive insurance charges.
Build a regression model that predicts annual charges for a new customer.
Provide an interactive web dashboard where users can enter their details and see the estimated cost immediately.

This is designed as a portfolio / placement project to demonstrate skills in data analysis, machine learning, and basic cloud deployment.

2. Dataset

Name: Medical Cost Personal Datasets (insurance.csv)
Source: Public educational dataset (commonly used on Kaggle)
Rows: 1,338
Target: charges – annual medical insurance cost in USD

Features

age: Age of the insured individual
sex: male / female
bmi: Body mass index
children: Number of dependents
smoker: yes / no
region: southeast, southwest, northeast, northwest
charges: Medical cost billed by insurance (target)

The raw file is stored at data/insurance.csv.
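As a minimal sketch of how the file could be loaded and checked against the feature list above: the helper name `load_insurance` and the three sample rows are made up for illustration, so the snippet runs without the real CSV.

```python
import io

import pandas as pd

# Column names as listed in the Features section.
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]

# Made-up rows in the same layout, used only so the sketch is self-contained.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.92
18,male,33.8,1,no,southeast,1725.55
28,male,33.0,3,no,southeast,4449.46
"""


def load_insurance(source):
    """Read the CSV and fail early if an expected column is missing."""
    df = pd.read_csv(source)
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return df[EXPECTED_COLUMNS]


# In the repository this would be load_insurance("data/insurance.csv").
df = load_insurance(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
```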

3. EDA summary

Key insights from the exploratory data analysis:

The distribution of charges is highly right‑skewed; a small number of people incur very high costs compared to the majority.
Charges tend to rise with both age and BMI, and the increase is most visible among older and high‑BMI individuals.
Smoker status is the most important categorical feature: smokers have a much higher median and upper‑quartile cost than non‑smokers.
Regional differences exist but are relatively small compared to the impact of smoking and BMI.

The EDA is implemented inside the Streamlit app under the EDA tab.
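The skew and smoker observations above can be checked numerically with two pandas calls. The sketch below uses a tiny made-up table (not the real dataset), so the printed values are illustrative only.

```python
import pandas as pd

# Toy data invented for this example: a few low charges and two very high ones.
toy = pd.DataFrame(
    {
        "smoker": ["no", "no", "no", "no", "no", "no", "yes", "yes"],
        "charges": [1000, 1200, 1500, 2000, 2500, 3000, 20000, 40000],
    }
)

# A positive skewness value indicates a long right tail.
skewness = toy["charges"].skew()

# Median charges per smoker group, the quantity a box plot centres on.
medians = toy.groupby("smoker")["charges"].median()

print(f"skewness: {skewness:.2f}")
print(medians)
```

On the real data the same two calls would run against the `charges` and `smoker` columns of data/insurance.csv.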

4. Model development

4.1 Candidate models

Several regression models were experimented with (in a separate notebook):

Linear Regression
Ridge Regression (α = 1.0)
Random Forest Regressor

All models used the same train/test split and feature preprocessing.
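One way to guarantee that every candidate sees the same split and preprocessing is to wrap each model in the same scikit-learn pipeline. The sketch below is an assumption about how this could be done, not the notebook's actual code: the one-hot encoding, the `random_state` values, the tree count, and the synthetic data generator `make_synthetic` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

CATEGORICAL = ["sex", "smoker", "region"]


def make_synthetic(n=300, seed=0):
    """Generate made-up rows with the same columns as insurance.csv."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        {
            "age": rng.integers(18, 65, n),
            "sex": rng.choice(["male", "female"], n),
            "bmi": rng.normal(30, 6, n).round(1),
            "children": rng.integers(0, 5, n),
            "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
            "region": rng.choice(
                ["southeast", "southwest", "northeast", "northwest"], n
            ),
        }
    )
    is_smoker = (df["smoker"] == "yes").to_numpy()
    noise = rng.normal(0, 2000, n)
    df["charges"] = 250 * df["age"] + 300 * df["bmi"] + 20000 * is_smoker + noise
    return df


def build_pipeline(model):
    """One-hot encode the categorical columns, pass numeric ones through."""
    preprocess = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
        remainder="passthrough",
    )
    return Pipeline([("preprocess", preprocess), ("model", model)])


df = make_synthetic()
X, y = df.drop(columns="charges"), df["charges"]

# A single split, reused by every candidate (20% held out).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, model in candidates.items():
    pipe = build_pipeline(model).fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)  # R² on the held-out set
    print(name, round(scores[name], 3))
```

Because the encoder lives inside the pipeline, it is fitted on the training rows only, which avoids leaking test-set information into preprocessing.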

4.2 Evaluation metrics

Models were compared on a 20% held‑out test set using:

MAE – Mean Absolute Error
RMSE – Root Mean Squared Error
R² – Coefficient of determination

Example results on the held‑out test set (any placeholder values should be replaced with the measured numbers):

Model	MAE (USD)	RMSE (USD)	R²
Linear Regression	3,100	5,300	0.78
Ridge Regression (α=1)	4,187.30	5,798.27	0.783
Random Forest (final)	2,538.91	4,602.51	0.864
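For reference, the three metrics can be computed directly from their definitions. The sketch below uses NumPy and four made-up target/prediction pairs; the function names are chosen for this example only.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more than MAE."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up values for illustration.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

print(mae(y_true, y_pred))   # 15.0
print(rmse(y_true, y_pred))  # about 15.81
print(r2(y_true, y_pred))    # 0.98
```

MAE and RMSE are in the same unit as the target (USD here), while R² is unitless. scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` give the same quantities.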

4.3 Final selected model and why

The Random Forest Regressor was chosen as the final model because:

It achieved the lowest MAE and RMSE and the highest R² on the test set.
It can capture non‑linear relationships between features and charges (for example, the impact of BMI and smoking).
It is relatively robust to outliers and does not require heavy feature engineering.
Ridge Regression, by comparison, slightly improved R² over plain Linear Regression but still produced noticeably higher MAE and RMSE than the Random Forest, so it was not selected.

The trained model is saved as models/insurance_model.pkl and loaded directly into the Streamlit app.
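A minimal sketch of the save/load round trip follows. It assumes the standard-library `pickle` module (the README does not say whether `pickle` or `joblib` was used) and trains a small stand-in model on random data, writing to a temporary directory instead of models/.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in model trained on random data, only to have something to serialise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "insurance_model.pkl"

    # Save the fitted model.
    with path.open("wb") as f:
        pickle.dump(model, f)

    # Load it back, as the app would at startup.
    with path.open("rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same = bool(np.allclose(model.predict(X), restored.predict(X)))
print(same)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same scikit-learn version that created it, which is one reason to pin versions in requirements.txt.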

5. Application features (Streamlit)

The Streamlit app is organised into four tabs:

📊 Overview
- Shows the first few rows of the dataset.
- Displays summary statistics for numeric columns.
- Lists a short description of each feature.
🔍 EDA
- Distribution of charges (histogram + KDE).
- Scatter plots: age vs charges, BMI vs charges.
- Box plots: smoker vs charges, region vs charges.
- Correlation heatmap for numeric features.
🤖 Model & Prediction
- Input form for age, sex, BMI, number of children, smoker status, and region.
- Predicts the estimated annual insurance cost using the Random Forest model.
- Shows test‑set performance metrics (MAE, RMSE, R²).
- Actual vs predicted plot to visualise model accuracy.
ℹ️ About
- Explains the project briefly.
- Contains author contact information.
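The input form in the Model & Prediction tab could be wired up roughly as below. This is a hypothetical sketch rather than the code in app.py: the function names, widget labels, and value ranges are assumptions. The row-building helper is kept separate from the Streamlit code so it can be tested on its own.

```python
import pandas as pd

FEATURES = ["age", "sex", "bmi", "children", "smoker", "region"]
REGIONS = ["southeast", "southwest", "northeast", "northwest"]


def build_input_row(age, sex, bmi, children, smoker, region):
    """Turn form values into a one-row DataFrame in training column order."""
    if sex not in ("male", "female"):
        raise ValueError(f"unexpected sex: {sex!r}")
    if smoker not in ("yes", "no"):
        raise ValueError(f"unexpected smoker value: {smoker!r}")
    if region not in REGIONS:
        raise ValueError(f"unexpected region: {region!r}")
    row = {
        "age": int(age),
        "sex": sex,
        "bmi": float(bmi),
        "children": int(children),
        "smoker": smoker,
        "region": region,
    }
    return pd.DataFrame([row], columns=FEATURES)


def render_prediction_form(model):
    """Draw the form and show a prediction (call this from the Streamlit app)."""
    import streamlit as st  # imported here so the helper above works without it

    with st.form("prediction_form"):
        age = st.number_input("Age", min_value=18, max_value=100, value=30)
        sex = st.selectbox("Sex", ["male", "female"])
        bmi = st.number_input("BMI", min_value=10.0, max_value=60.0, value=25.0)
        children = st.number_input("Children", min_value=0, max_value=10, value=0)
        smoker = st.selectbox("Smoker", ["yes", "no"])
        region = st.selectbox("Region", REGIONS)
        submitted = st.form_submit_button("Predict")

    if submitted:
        row = build_input_row(age, sex, bmi, children, smoker, region)
        estimate = model.predict(row)[0]
        st.metric("Estimated annual charges (USD)", f"{estimate:,.2f}")


example_row = build_input_row(35, "female", 27.5, 2, "no", "northwest")
print(example_row)
```

Wrapping the widgets in `st.form` means the prediction only reruns when the button is pressed, not on every widget change.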

6. How to run the project locally

6.1 Clone the repository

git clone https://github.com/pratikrath126/insurance-cost-prediction
cd insurance-cost-prediction

6.2 Create and activate a virtual environment (optional but recommended)

python -m venv .venv

On Linux / macOS: source .venv/bin/activate

On Windows: .venv\Scripts\activate

6.3 Install dependencies

pip install -r requirements.txt

6.4 Run the app

streamlit run app.py

Then open the URL displayed in the terminal (usually http://localhost:8501) in your browser.

7. Challenges and what I learned

Some of the main difficulties and learnings:

Feature–target relationship:
Understanding that the charges distribution is skewed and that smoker status dominates other features helped in choosing a model that can handle non‑linearities.
Model selection and overfitting:
Linear Regression underfit the data, while more complex models risked overfitting; cross‑validation and test metrics guided the choice of Random Forest.
Deployment issues:
Preparing a clean requirements.txt, keeping file paths relative (data/insurance.csv, models/insurance_model.pkl), and ensuring the app works both locally and on Streamlit Community Cloud were key steps.
User experience:
Designing the EDA and prediction sections so that non‑technical users can understand what the model is doing improved the overall usefulness of the dashboard.

8. Possible future improvements

Hyperparameter tuning (GridSearchCV / RandomizedSearchCV) to further improve the model.
Trying gradient boosting models (XGBoost, LightGBM, CatBoost).
Adding prediction intervals / uncertainty estimates.
Logging user inputs and predictions (with consent) to build a larger dataset.
Deploying the model as a REST API (FastAPI + AWS / other cloud provider).
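As an illustration of the first item, a grid search over Random Forest hyperparameters might look like the sketch below. The parameter grid, the 3-fold setting, and the synthetic data are assumptions chosen to keep the example fast; a real search would use the insurance features and a wider grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data with a non-linear term, made up for this example.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.3, size=120)

# Deliberately small grid; every combination is fitted once per fold.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximises, hence the sign
    cv=3,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated MAE:", -search.best_score_)
```

`RandomizedSearchCV` takes the same estimator and scoring arguments but samples a fixed number of combinations, which scales better when the grid grows.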

9. Author

Name: Pratik Rath
College: KIIT University
Email: pratikrath28@gmail.com
GitHub: https://github.com/pratikrath126

Feel free to reach out for feedback, collaborations, or internship opportunities.
