This project is an end‑to‑end machine learning + Streamlit application that predicts annual medical insurance charges based on a person’s profile (age, BMI, smoker status, region, etc.). It covers data analysis, model training, evaluation, and cloud deployment.
- 🔗 Live app: https://pratik-icp.streamlit.app/
- 📦 GitHub repo: https://github.com/pratikrath126/insurance-cost-prediction
Health insurance companies need to estimate how much a customer is likely to cost in medical expenses. The goal of this project is to:
- Analyse which factors drive insurance charges.
- Build a regression model that predicts annual charges for a new customer.
- Provide an interactive web dashboard where users can enter their details and see the estimated cost immediately.
This is designed as a portfolio / placement project to demonstrate skills in data analysis, machine learning, and basic cloud deployment.
- Name: Medical Cost Personal Datasets (`insurance.csv`)
- Source: Public educational dataset (commonly used on Kaggle)
- Rows: 1,338
- Target: `charges`, the annual medical insurance cost in USD
Features
- `age`: Age of the insured individual
- `sex`: `male` / `female`
- `bmi`: Body mass index
- `children`: Number of dependents
- `smoker`: `yes` / `no`
- `region`: `southeast`, `southwest`, `northeast`, `northwest`
- `charges`: Medical cost billed by insurance (target)
The raw file is stored at data/insurance.csv.
Key insights from the exploratory data analysis:
- The distribution of charges is highly right‑skewed; a small number of people incur very high costs compared to the majority.
- Age and BMI both show a positive relationship with charges, especially for older and high‑BMI individuals.
- Smoker status is the most important categorical feature: smokers have a much higher median and upper‑quartile cost than non‑smokers.
- Regional differences exist but are relatively small compared to the impact of smoking and BMI.
The EDA is implemented inside the Streamlit app under the EDA tab.
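
The insights above can be checked numerically with a few pandas calls. The following is a minimal sketch, not the code used in the app: the function name `summarise_charges` is hypothetical, and it assumes the DataFrame has the columns listed in the dataset description.

```python
import pandas as pd


def summarise_charges(df: pd.DataFrame) -> dict:
    """Return a few summary numbers behind the EDA insights (illustrative helper)."""
    numeric = df[["age", "bmi", "children", "charges"]]
    return {
        # Positive skewness indicates a long right tail of expensive cases.
        "charges_skew": float(df["charges"].skew()),
        # Median charges for each smoker group ("yes" / "no").
        "median_by_smoker": df.groupby("smoker")["charges"].median().to_dict(),
        # Median charges per region, for comparison with the smoker gap.
        "median_by_region": df.groupby("region")["charges"].median().to_dict(),
        # Pearson correlation of each numeric feature with charges.
        "corr_with_charges": numeric.corr()["charges"].drop("charges").to_dict(),
    }
```

Calling `summarise_charges(pd.read_csv("data/insurance.csv"))` should return a positive skew and a clearly higher median for smokers if the insights above hold.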
Several regression models were experimented with (in a separate notebook):
- Linear Regression
- Ridge Regression (α = 1.0)
- Random Forest Regressor
All models used the same train/test split and feature preprocessing.
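
One way to guarantee that every model sees the same split and preprocessing is to wrap each estimator in an identical scikit-learn `Pipeline`. The sketch below is an illustration under stated assumptions, not the notebook's actual code: the one-hot encoding step, `n_estimators=200`, `random_state=42`, and the helper names `build_pipelines` and `fit_all` are all assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

NUMERIC = ["age", "bmi", "children"]
CATEGORICAL = ["sex", "smoker", "region"]


def build_pipelines(random_state: int = 42) -> dict:
    """Return the three candidate models, each behind the same preprocessing."""

    def make(estimator) -> Pipeline:
        # One-hot encode the categorical columns; numeric columns pass through.
        prep = ColumnTransformer(
            [("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
            remainder="passthrough",
        )
        return Pipeline([("prep", prep), ("model", estimator)])

    return {
        "Linear Regression": make(LinearRegression()),
        "Ridge Regression": make(Ridge(alpha=1.0)),
        # n_estimators is an assumed value, not taken from the project.
        "Random Forest": make(
            RandomForestRegressor(n_estimators=200, random_state=random_state)
        ),
    }


def fit_all(df: pd.DataFrame, test_size: float = 0.2, random_state: int = 42):
    """Split once, then fit every pipeline on the same training rows."""
    X = df[NUMERIC + CATEGORICAL]
    y = df["charges"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    fitted = {
        name: pipe.fit(X_train, y_train)
        for name, pipe in build_pipelines(random_state).items()
    }
    return fitted, X_test, y_test
```

Because the split happens once inside `fit_all`, any difference in test metrics comes from the estimator rather than from the data each model saw.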
Models were compared on a 20% held‑out test set using:
- MAE – Mean Absolute Error
- RMSE – Root Mean Squared Error
- R² – Coefficient of determination
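
For reference, the three metrics can be written out in plain Python. This is a sketch of the standard definitions, with a hypothetical function name `regression_metrics`; in practice `sklearn.metrics` provides equivalent functions.

```python
import math
from typing import Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> dict:
    """Compute MAE, RMSE and R² from paired true and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: average absolute error, in the same unit as charges (USD).
    mae = sum(abs(e) for e in errors) / n

    # RMSE: square root of the mean squared error; penalises large misses more.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R²: share of the variance in y_true explained by the predictions.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mae": mae, "rmse": rmse, "r2": r2}
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two points to a few very large errors, which is plausible here given the right-skewed charges.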
Example results (fill with your actual numbers):
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Linear Regression | 3,100 | 5,300 | 0.78 |
| Ridge Regression (α=1) | 4,187.30 | 5,798.27 | 0.783 |
| Random Forest (final) | 2,538.91 | 4,602.51 | 0.864 |
The Random Forest Regressor was chosen as the final model because:
- It achieved the lowest MAE and RMSE and the highest R² on the test set.
- It can capture non‑linear relationships between features and charges (for example, the impact of BMI and smoking).
- It is relatively robust to outliers and does not require heavy feature engineering.

Ridge Regression slightly improved R² over plain Linear Regression but still produced noticeably higher MAE and RMSE than the Random Forest, so it was not selected.
The trained model is saved as models/insurance_model.pkl and loaded directly
into the Streamlit app.
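
Loading the saved model and scoring a single profile might look like the sketch below. It rests on assumptions that the source does not confirm: that the file was written with `pickle`, that the saved object is a full pipeline accepting the raw feature columns, and that the columns follow the dataset's order. The helper names `load_model` and `predict_charges` are hypothetical.

```python
import pickle
from pathlib import Path

import pandas as pd

# Assumed column order, taken from the dataset description.
FEATURE_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region"]


def load_model(path: str = "models/insurance_model.pkl"):
    """Load the pickled model; only unpickle files from a trusted source."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def predict_charges(model, profile: dict) -> float:
    """Predict annual charges for one person described by a dict of features."""
    missing = [c for c in FEATURE_COLUMNS if c not in profile]
    if missing:
        raise ValueError(f"missing features: {missing}")
    # Build a one-row DataFrame so the model sees named columns.
    row = pd.DataFrame([{c: profile[c] for c in FEATURE_COLUMNS}])
    return float(model.predict(row)[0])
```

In a Streamlit app, `load_model` could be wrapped in `st.cache_resource` so the file is read once per session rather than on every rerun.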
The Streamlit app is organised into four tabs:
- 📊 Overview
  - Shows the first few rows of the dataset.
  - Displays summary statistics for numeric columns.
  - Lists a short description of each feature.
- 🔍 EDA
  - Distribution of charges (histogram + KDE).
  - Scatter plots: age vs charges, BMI vs charges.
  - Box plots: smoker vs charges, region vs charges.
  - Correlation heatmap for numeric features.
- 🤖 Model & Prediction
  - Input form for age, sex, BMI, number of children, smoker status, and region.
  - Predicts the estimated annual insurance cost using the Random Forest model.
  - Shows test‑set performance metrics (MAE, RMSE, R²).
  - Actual vs predicted plot to visualise model accuracy.
- ℹ️ About
  - Explains the project briefly.
  - Contains author contact information.
- Clone the repository and enter it: `git clone https://github.com/pratikrath126/insurance-cost-prediction`, then `cd insurance-cost-prediction`
- Create a virtual environment: `python -m venv .venv`
- Activate it:
  - On Linux / macOS: `source .venv/bin/activate`
  - On Windows: `.venv\Scripts\activate`
- Install the dependencies: `pip install -r requirements.txt`
- Start the app: `streamlit run app.py`

Then open the URL displayed in the terminal (usually http://localhost:8501) in your browser.
Some of the main difficulties and learnings:
- Feature–target relationship: Understanding that the charges distribution is skewed and that smoker status dominates other features helped in choosing a model that can handle non‑linearities.
- Model selection and overfitting: Linear Regression underfit the data, while more complex models risked overfitting; cross‑validation and test metrics guided the choice of Random Forest.
- Deployment issues: Preparing a clean `requirements.txt`, keeping file paths relative (`data/insurance.csv`, `models/insurance_model.pkl`), and ensuring the app works both locally and on Streamlit Community Cloud were key steps.
- User experience: Designing the EDA and prediction sections so that non‑technical users can understand what the model is doing improved the overall usefulness of the dashboard.
- Hyperparameter tuning (GridSearchCV / RandomizedSearchCV) to further improve the model.
- Trying gradient boosting models (XGBoost, LightGBM, CatBoost).
- Adding prediction intervals / uncertainty estimates.
- Logging user inputs and predictions (with consent) to build a larger dataset.
- Deploying the model as a REST API (FastAPI + AWS / other cloud provider).
- Name: Pratik Rath
- College: KIIT University
- Email: pratikrath28@gmail.com
- GitHub: https://github.com/pratikrath126
Feel free to reach out for feedback, collaborations, or internship opportunities.