This project uses data provided by Arvato. The analyses performed on these data target the following:
- Implementation of unsupervised models for customer and general-population segmentation
- Implementation of supervised models to predict the outcome of future company campaigns and make them more efficient
- Installation
- Project Motivation
- Files
- Support files
- Instructions
- Methodology
- Result
- Post of Medium
- Source, Licensing, Authors, and Acknowledgements
- Conclusion
To run the Jupyter notebooks and Python scripts, you will need a standard installation of Anaconda with Python 3.7.x plus the additional libraries listed below:
- matplotlib
- seaborn
- H2o
- sklearn
- xgboost
- imblearn
- lightgbm
- catboost
H2O's XGBoost estimator is not supported on Windows, so you must use a different platform; I used Google Colab.
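Assuming a working Anaconda environment, the libraries above can be installed with pip (the PyPI package names for sklearn and imblearn are `scikit-learn` and `imbalanced-learn`):

```shell
# install the analysis libraries used in the notebooks
pip install matplotlib seaborn scikit-learn xgboost imbalanced-learn lightgbm catboost h2o
```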
In this project, the purpose was to characterize what types of individuals are more likely to be customers of a mail-order retailer and predict which customers would respond positively to marketing campaigns.
The information used in this project was provided by Arvato. There are four data files and two attribute-information files associated with the project:
- Udacity_AZDIAS_052018.csv: demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- Udacity_CUSTOMERS_052018.csv: demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- Udacity_MAILOUT_052018_TRAIN.csv: demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 features (columns).
- Udacity_MAILOUT_052018_TEST.csv: demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 features (columns).
- DIAS Information Levels - Attributes.xls: gives the meaning of the column names.
- DIAS Attributes - Values.xls: gives what the values in each column mean.
Note: The data used for this project are not publicly available. They were provided for a limited time, only to participants in the competition.
In this project, some of the functions used are developed inside the file utils.py.
To make use of the project, access the repository notebooks and execute the commands presented in them. This project uses three Jupyter notebooks and a Python file, which must be executed in the order indicated:
- ../000_Preprocessing.ipynb : Contains data preprocessing and feature engineering.
- ../001_Unsupervised_Learning.ipynb : Contains unsupervised learning techniques.
- ../002_SupervisedLearning.ipynb : Contains supervised learning models, metric evaluation, and the predictions for the Kaggle submission.
- ../myutils/utils.py : Support file; some of the functions used in the project are developed inside it.
- ../Arvato-Report of CS.pdf : Contains the report for this project.
- ../Last/data/kaggle_submission_file.csv : Kaggle submission file for the prediction competition.
The analysis process of the project consists of four main sections.
In the first section, an initial exploration of the relevant data and metrics was carried out, along with cleaning and feature engineering for the later steps.
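A minimal sketch of that cleaning step, using a hypothetical mini-frame in place of the real AZDIAS data (the actual unknown-value codes come from the DIAS Attributes - Values spreadsheet; the `-1` code and the 50% threshold here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the AZDIAS demographics data.
df = pd.DataFrame({
    "ALTERSKATEGORIE_GROB": [1, 2, -1, 3, 9],
    "ANREDE_KZ": [1, 2, 1, -1, 2],
    "MOSTLY_MISSING": [np.nan, np.nan, np.nan, 1, np.nan],
})

# Step 1: convert "unknown" codes to NaN (here assumed to be -1).
df = df.replace(-1, np.nan)

# Step 2: drop columns where more than 50% of the values are missing.
keep = df.columns[df.isna().mean() <= 0.5]
df_clean = df[keep]
```

The same two steps (decode unknowns, then drop sparse columns) scale directly to the full 366-feature frame.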
Using a K-means model, high-potential customer classes were created within the general population with an unsupervised learning method.
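The segmentation idea can be sketched as follows, with random data standing in for the demographic features and assumed component/cluster counts (the project's actual numbers may differ):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))  # stand-in for the cleaned demographic features

# Scale, reduce dimensionality, then cluster the general population.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=10, random_state=42).fit_transform(X_scaled)
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(X_pca)

# Cluster shares in the general population; customers are assigned to the
# same clusters via kmeans.predict, and over-represented clusters mark
# high-potential customer segments.
population_share = np.bincount(kmeans.labels_, minlength=8) / len(X_pca)
```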
Implementation of supervised models such as LightGBM, XGBoost, CatBoost, Random Forest, Logistic Regression, and finally a VotingClassifier, to forecast the response to future company campaigns and improve their performance.
The chosen model was then used to make predictions on the campaign data as part of a Kaggle competition.
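A soft-voting ensemble can be sketched with scikit-learn alone; here two sklearn base models stand in for the project's LightGBM/XGBoost/CatBoost estimators, and a synthetic imbalanced data set stands in for MAILOUT_TRAIN:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data (~5% positives), like the campaign responses.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Soft voting averages the predicted class probabilities of the base models.
vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting="soft",
)
vote.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, vote.predict_proba(X_te)[:, 1])
```

ROC-AUC is the natural metric here because accuracy is misleading on data this imbalanced.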
- This is a classification problem, so I chose and tried LightGBM, XGBoost, CatBoost, Random Forest, and Logistic Regression classifier models via the scikit-learn interface. My best ROC-AUC score among these models was 0.80574, from LightGBM.
- After oversampling the data set, the best ROC-AUC score improved to around 0.815448 from LightGBM, and to 0.871796 from the VotingClassifier. The test labels were predicted with this VotingClassifier model.
- The Mailout-Test predictions are {1: 3658, 0: 33542} with the VotingClassifier model.
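Random oversampling of the minority class can be sketched with plain scikit-learn (imbalanced-learn's RandomOverSampler does the same job; the sizes below are illustrative, not the project's):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 950 + [1] * 50)  # ~5% positives, like the MAILOUT labels

# Resample the minority class with replacement until both classes match.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=950,
                      random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
```

Oversampling is applied only to the training split, never to the held-out test data.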
- Kaggle submission file -> '../Last/data/kaggle_submission_file.csv'
- The detailed analysis of the results can be read in this Medium post or in Arvato-Report of Customer Segmentation.pdf.
The dataset owner is Bertelsmann-Arvato. The data used for this project are not publicly available; they were provided for a limited time, only to participants in the competition. You may use only the software code.
Huseyin ELCI
Github | Kaggle | Linkedin
Thanks to Bertelsmann-Arvato for providing cool data with which we can create a cutting edge project.
- Trained a K-means model on the general population data set, used it to cluster the customer data for segmentation, and then compared the cluster distributions.
- Stacking and voting were more useful than a single model's results.
- It would be nice to present our findings to the customer and receive feedback.
- Two challenges of this project were the large data size and the class imbalance. Cleaning such big data and applying GridSearchCV to the models requires serious time and machine performance. As a solution, one should spend more time getting to know the columns and seek high performance without discarding the important ones. Implementing GridSearchCV with 10-128 variations for each model was a mistake; it would have been wiser to focus only on the LightGBM and XGBoost models.
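That lesson amounts to keeping the search grid small. A minimal sketch with a single sklearn model and a four-combination grid (the model and parameter values are illustrative, not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# A handful of impactful parameters, not 10-128 combinations per model.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
best_params = grid.best_params_
```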
It was instructive, and it was worth it. Feel free to explore the code. Enjoy. :)
