Medical Insurance Cost Prediction Dashboard 💊

This project is an end‑to‑end machine learning + Streamlit application that predicts annual medical insurance charges based on a person’s profile (age, BMI, smoker status, region, etc.). It covers data analysis, model training, evaluation, and cloud deployment.

🔗 Live app: https://pratik-icp.streamlit.app/
📦 GitHub repo: https://github.com/pratikrath126/insurance-cost-prediction


1. Problem statement

Health insurance companies need to estimate how much a customer is likely to cost in medical expenses. The goal of this project is to:

  • Analyse which factors drive insurance charges.
  • Build a regression model that predicts annual charges for a new customer.
  • Provide an interactive web dashboard where users can enter their details and see the estimated cost immediately.

This is designed as a portfolio / placement project to demonstrate skills in data analysis, machine learning, and basic cloud deployment.


2. Dataset

  • Name: Medical Cost Personal Datasets (insurance.csv)
  • Source: Public educational dataset (commonly used on Kaggle)
  • Rows: 1,338
  • Target: charges – annual medical insurance cost in USD

Features

  • age: Age of the insured individual
  • sex: male / female
  • bmi: Body mass index
  • children: Number of dependents
  • smoker: yes / no
  • region: southeast, southwest, northeast, northwest
  • charges: Medical cost billed by insurance (target)

The raw file is stored at data/insurance.csv.
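
As a rough illustration of the schema above, the sketch below parses rows with the listed columns and checks the categorical values. The two sample rows are made up for the example (they are not taken from insurance.csv), and `parse_rows` is a hypothetical helper name, not something from the repository.

```python
import csv
import io

EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]
ALLOWED_VALUES = {
    "sex": {"male", "female"},
    "smoker": {"yes", "no"},
    "region": {"southeast", "southwest", "northeast", "northwest"},
}

# Made-up rows that only mimic the layout of data/insurance.csv.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
25,female,22.5,0,no,northwest,3200.50
52,male,31.2,2,yes,southeast,41000.00
"""


def parse_rows(handle):
    """Read CSV rows, validate the header and categories, and convert types."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for raw in reader:
        for column, allowed in ALLOWED_VALUES.items():
            if raw[column] not in allowed:
                raise ValueError(f"bad value for {column}: {raw[column]!r}")
        rows.append({
            "age": int(raw["age"]),
            "sex": raw["sex"],
            "bmi": float(raw["bmi"]),
            "children": int(raw["children"]),
            "smoker": raw["smoker"],
            "region": raw["region"],
            "charges": float(raw["charges"]),
        })
    return rows


rows = parse_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[0]["region"])
```

With the real file, the same function could be called on a handle opened with open("data/insurance.csv", newline="").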


3. EDA summary

Key insights from the exploratory data analysis:

  • The distribution of charges is highly right‑skewed; a small number of people incur very high costs compared to the majority.
  • Charges rise with both age and BMI; the highest charges are concentrated among older and high-BMI individuals.
  • Smoker status is the most important categorical feature: smokers have a much higher median and upper‑quartile cost than non‑smokers.
  • Regional differences exist but are relatively small compared to the impact of smoking and BMI.

The EDA is implemented inside the Streamlit app under the EDA tab.
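
The first and third insights can be checked with two small statistics: a skewness coefficient for charges and the median charge per smoker group. The sketch below uses only the standard library and a made-up sample of (charges, smoker) pairs, so the numbers it prints say nothing about the real dataset; the function names are illustrative.

```python
import statistics

# Made-up (charges in USD, smoker) pairs, for illustration only.
sample = [
    (2100.0, "no"), (3400.0, "no"), (4800.0, "no"),
    (6200.0, "no"), (8900.0, "no"), (11500.0, "no"),
    (19800.0, "yes"), (34500.0, "yes"), (47200.0, "yes"),
]


def skewness(values):
    """Population skewness: the mean of cubed z-scores (positive = right tail)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def median_by_group(pairs):
    """Group values by label and return the median of each group."""
    groups = {}
    for value, label in pairs:
        groups.setdefault(label, []).append(value)
    return {label: statistics.median(vals) for label, vals in groups.items()}


charges = [value for value, _ in sample]
charges_skew = skewness(charges)
medians = median_by_group(sample)
print(round(charges_skew, 2), medians)
```

A positive skewness value corresponds to the long right tail described above, and the gap between the two medians mirrors the smoker vs non-smoker box plot.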


4. Model development

4.1 Candidate models

Three regression models were compared in a separate notebook:

  • Linear Regression
  • Ridge Regression (α = 1.0)
  • Random Forest Regressor

All models used the same train/test split and feature preprocessing.
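
One way to guarantee that every candidate sees the same split and preprocessing is to wrap each model in an identical scikit-learn pipeline. The sketch below is an assumption about how this could be set up, not the notebook's actual code: the data is synthetic (same column names, made-up values), and the one-hot encoding, `random_state` values, and `build_pipeline` helper are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for data/insurance.csv: same columns, made-up values.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["male", "female"], n),
    "bmi": rng.normal(30, 6, n).round(1),
    "children": rng.integers(0, 5, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(["southeast", "southwest", "northeast", "northwest"], n),
})
df["charges"] = (
    250 * df["age"] + 300 * df["bmi"]
    + 20000 * (df["smoker"] == "yes") + rng.normal(0, 2000, n)
)

X, y = df.drop(columns="charges"), df["charges"]
# One shared 80/20 split for every candidate model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

CATEGORICAL = ["sex", "smoker", "region"]


def build_pipeline(model):
    """Attach the same preprocessing (one-hot categoricals) to any regressor."""
    preprocess = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
        remainder="passthrough",  # numeric columns pass through unchanged
    )
    return Pipeline([("preprocess", preprocess), ("model", model)])


candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
fitted = {
    name: build_pipeline(model).fit(X_train, y_train)
    for name, model in candidates.items()
}
for name, pipe in fitted.items():
    print(name, round(pipe.score(X_test, y_test), 3))  # R² on the shared test set
```

Because the preprocessing lives inside each pipeline, it is fitted on the training split only, which avoids leaking test-set information into the encoder.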

4.2 Evaluation metrics

Models were compared on a 20% held‑out test set using:

  • MAE – Mean Absolute Error
  • RMSE – Root Mean Squared Error
  • R² – Coefficient of determination
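
For reference, the three metrics can be written out in a few lines of plain Python. This is a minimal sketch of the standard definitions; the four value pairs are made up to keep the arithmetic easy to follow, and in practice the equivalent functions in `sklearn.metrics` would normally be used.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors, in the target's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: like MAE, but penalises large errors more."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values in USD, chosen only to make the arithmetic easy to check.
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred = [1100.0, 1900.0, 3200.0, 3800.0]

print(mae(y_true, y_pred))             # 150.0
print(round(rmse(y_true, y_pred), 2))  # 158.11
print(round(r2(y_true, y_pred), 4))    # 0.98
```

MAE and RMSE are both in USD here; RMSE is always at least as large as MAE, and a wide gap between them points to a few large errors.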

Example results (fill with your actual numbers):

Model                    MAE (USD)   RMSE (USD)   R²
Linear Regression        3,100       5,300        0.78
Ridge Regression (α=1)   4,187.30    5,798.27     0.783
Random Forest (final)    2,538.91    4,602.51     0.864

4.3 Final selected model and why

The Random Forest Regressor was chosen as the final model because:

  • It achieved the lowest MAE and RMSE and the highest R² on the test set.
  • Ridge Regression improved R² only slightly over plain Linear Regression and still had noticeably higher MAE and RMSE than the Random Forest, so it was not selected.
  • It can capture non‑linear relationships between features and charges (for example, the impact of BMI and smoking).
  • It is relatively robust to outliers and does not require heavy feature engineering.

The trained model is saved as models/insurance_model.pkl and loaded directly into the Streamlit app.
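
The save-and-load round trip can be sketched as follows. The README does not say whether `pickle` or `joblib` was used, so the standard-library `pickle` module is an assumption here, as are the helper names; a plain dictionary stands in for the fitted model so the example runs without scikit-learn.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialise a model object to disk, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a previously saved model. Only unpickle files you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Placeholder object standing in for the trained Random Forest.
stand_in_model = {"kind": "placeholder", "n_features": 6}

# A temporary directory keeps the example from touching the real models/ folder.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "models" / "insurance_model.pkl"
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)
```

A pickled scikit-learn model should be loaded with the same library version it was saved with, which is one reason to pin versions in requirements.txt.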


5. Application features (Streamlit)

The Streamlit app is organised into four tabs:

  1. 📊 Overview

    • Shows the first few rows of the dataset.
    • Displays summary statistics for numeric columns.
    • Lists a short description of each feature.
  2. 🔍 EDA

    • Distribution of charges (histogram + KDE).
    • Scatter plots: age vs charges, BMI vs charges.
    • Box plots: smoker vs charges, region vs charges.
    • Correlation heatmap for numeric features.
  3. 🤖 Model & Prediction

    • Input form for age, sex, BMI, number of children, smoker status, and region.
    • Predicts the estimated annual insurance cost using the Random Forest model.
    • Shows test‑set performance metrics (MAE, RMSE, R²).
    • Actual vs predicted plot to visualise model accuracy.
  4. ℹ️ About

    • Explains the project briefly.
    • Contains author contact information.
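
For the Model & Prediction tab, the form values have to be turned into one input row with the same columns the model was trained on. The sketch below shows that step as a plain function so it can be tested without Streamlit; `build_input_row`, the validation ranges, and the column order are assumptions for illustration, not the app's actual code.

```python
FEATURE_ORDER = ["age", "sex", "bmi", "children", "smoker", "region"]
REGIONS = ("southeast", "southwest", "northeast", "northwest")


def build_input_row(age, sex, bmi, children, smoker, region):
    """Validate form values and return them as a single feature row."""
    if not 0 < age <= 120:
        raise ValueError(f"age out of range: {age}")
    if bmi <= 0:
        raise ValueError(f"bmi must be positive: {bmi}")
    if children < 0:
        raise ValueError(f"children cannot be negative: {children}")
    if sex not in ("male", "female"):
        raise ValueError(f"unknown sex: {sex!r}")
    if smoker not in ("yes", "no"):
        raise ValueError(f"unknown smoker status: {smoker!r}")
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region!r}")
    return {
        "age": int(age),
        "sex": sex,
        "bmi": float(bmi),
        "children": int(children),
        "smoker": smoker,
        "region": region,
    }


row = build_input_row(age=35, sex="female", bmi=27.4, children=1,
                      smoker="no", region="northeast")
print(row)
```

In the app, the arguments would come from the form widgets, and the row would typically be wrapped in a one-row DataFrame before being passed to the model's predict method.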

6. How to run the project locally

6.1 Clone the repository

git clone https://github.com/pratikrath126/insurance-cost-prediction
cd insurance-cost-prediction

6.2 Create and activate a virtual environment (optional but recommended)

python -m venv .venv

On Linux / macOS: source .venv/bin/activate

On Windows: .venv\Scripts\activate

6.3 Install dependencies

pip install -r requirements.txt

6.4 Run the app

streamlit run app.py

Then open the URL displayed in the terminal (usually http://localhost:8501) in your browser.


7. Challenges and what I learned

Some of the main difficulties and learnings:

  • Feature–target relationship:
    Understanding that the charges distribution is skewed and that smoker status dominates other features helped in choosing a model that can handle non‑linearities.

  • Model selection and overfitting:
    Linear Regression underfit the data, while more complex models risked overfitting; cross‑validation and test metrics guided the choice of Random Forest.

  • Deployment issues:
    Preparing a clean requirements.txt, keeping file paths relative (data/insurance.csv, models/insurance_model.pkl), and ensuring the app works both locally and on Streamlit Community Cloud were key steps.

  • User experience:
    Designing the EDA and prediction sections so that non‑technical users can understand what the model is doing improved the overall usefulness of the dashboard.
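
On the file-path point, one common approach is to resolve data and model paths relative to the app file rather than the current working directory, so the same code works locally and on a hosted runner. This is a minimal sketch under that assumption; `project_paths` is a hypothetical helper, not taken from the repository.

```python
from pathlib import Path


def project_paths(app_file):
    """Build absolute paths to the data and model files next to the app file."""
    base = Path(app_file).resolve().parent
    return {
        "data": base / "data" / "insurance.csv",
        "model": base / "models" / "insurance_model.pkl",
    }


# Inside app.py this would be project_paths(__file__).
paths = project_paths("app.py")
print(paths["data"].name, paths["model"].name)
```

The paths stay relative to the repository layout, but no longer depend on which directory the streamlit command is launched from.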


8. Possible future improvements

  • Hyperparameter tuning (GridSearchCV / RandomizedSearchCV) to further improve the model.
  • Trying gradient boosting models (XGBoost, LightGBM, CatBoost).
  • Adding prediction intervals / uncertainty estimates.
  • Logging user inputs and predictions (with consent) to build a larger dataset.
  • Deploying the model as a REST API (FastAPI + AWS / other cloud provider).
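
As a starting point for the uncertainty idea, a Random Forest already exposes its individual trees, so the spread of their predictions gives a rough sense of how much the ensemble disagrees. The sketch below uses synthetic numeric data and a hypothetical `tree_spread` helper; the percentile band it returns is not a calibrated prediction interval, and methods such as quantile regression or conformal prediction would be needed for that.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, unrelated to the insurance dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(0, 0.5, 300)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)


def tree_spread(forest, X_new, lower=5, upper=95):
    """Return (low, mean, high) across the per-tree predictions for each row."""
    per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
    low = np.percentile(per_tree, lower, axis=0)
    high = np.percentile(per_tree, upper, axis=0)
    return low, per_tree.mean(axis=0), high


low, mean, high = tree_spread(forest, X[:5])
for lo_i, mean_i, hi_i in zip(low, mean, high):
    print(f"{mean_i:.2f} (tree spread {lo_i:.2f} to {hi_i:.2f})")
```

The mean across trees is the forest's usual prediction, so the band can be shown next to the existing estimate in the dashboard without changing the model.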

9. Author

Feel free to reach out for feedback, collaborations, or internship opportunities.
