Customer review sentiment analysis with Python and NLP.
The project uses a synthetic review dataset (positive, neutral, negative), applies text preprocessing (cleaning, tokenization, stopword removal, lemmatization), converts the text to TF-IDF features, and trains three classifiers (Naive Bayes, Logistic Regression, Random Forest).
The best model is selected by macro F1-score, and results are visualized with a confusion matrix, word clouds, and the top TF-IDF features per class.
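Macro F1 is the unweighted mean of the per-class F1 scores, so each sentiment class counts equally regardless of how many reviews it has. The sketch below computes it by hand and picks the highest-scoring model. The function names `macro_f1` and `select_best` and the model keys are hypothetical, chosen for illustration; the project itself may well rely on scikit-learn's `f1_score(..., average="macro")` instead.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def select_best(predictions, y_true, labels):
    """Return (best model name, all scores) given per-model predictions."""
    scores = {name: macro_f1(y_true, y_pred, labels)
              for name, y_pred in predictions.items()}
    return max(scores, key=scores.get), scores


# Toy example with made-up predictions (not real project results).
labels = ["negative", "neutral", "positive"]
y_true = ["positive", "positive", "negative", "neutral"]
predictions = {
    "always_positive": ["positive"] * 4,
    "mostly_right": ["positive", "positive", "negative", "negative"],
}
best_name, all_scores = select_best(predictions, y_true, labels)
print(best_name, all_scores)
```

A model that only ever predicts the majority class can reach decent accuracy yet a low macro F1, which is one reason to select on this metric for a three-class problem.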
- Generate synthetic review dataset
- Text preprocessing:
  - lowercasing, URL & punctuation removal
  - stopword filtering
  - lemmatization
- TF-IDF vectorization (unigrams + bigrams)
- Models: Multinomial Naive Bayes, Logistic Regression, Random Forest
- Evaluation: accuracy, precision, recall, F1-score
- Visuals: confusion matrix, word clouds, top features per class
- Saved artifacts: best model + vectorizer (joblib), metrics JSON
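The preprocessing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the code in `src/utils.py`: the `preprocess` function, the tiny stopword set, and the pluggable `lemmatize` callable are all assumptions. A real run would likely use a full stopword list and a proper lemmatizer (for example from NLTK or spaCy), which the README does not specify.

```python
import re
import string

# Tiny illustrative stopword list (assumption: the real list is much larger).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "was", "this", "to", "of", "i", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Map every punctuation character to a space so words stay separated.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, lemmatize=lambda token: token):
    """Lowercase, strip URLs and punctuation, drop stopwords, lemmatize.

    `lemmatize` defaults to the identity function; pass a real lemmatizer
    to reduce tokens to their base forms.
    """
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove URLs before punctuation
    text = text.translate(PUNCT_TABLE)   # punctuation -> spaces
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(lemmatize(t) for t in tokens)


print(preprocess("The delivery was LATE, and the box arrived damaged! "
                 "See https://example.com/track"))
# -> delivery late box arrived damaged see
```

URLs are removed before punctuation on purpose: stripping punctuation first would split a URL into stray tokens such as `https` and `com`.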
sentiment-analysis-nlp/
├─ README.md
├─ LICENSE
├─ requirements.txt
├─ data/
│ └─ generate_reviews.py
├─ src/
│ ├─ train_nlp.py
│ └─ utils.py
└─ outputs/
  └─ figures & reports (auto-created)
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt

python data/generate_reviews.py --n 8000 --seed 42 --out data/reviews.csv

python src/train_nlp.py --input data/reviews.csv --outdir outputs --test-size 0.2 --seed 42

Outputs:
- metrics.json – per-model scores & best model
- classification_report.txt
- confusion_matrix.png
- wordcloud_positive.png, wordcloud_negative.png
- top_features.txt
- best_model.joblib, vectorizer.joblib
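Once training has run, the saved model and vectorizer can be reloaded to label new reviews. The sketch below assumes that `best_model.joblib` holds a fitted scikit-learn estimator and `vectorizer.joblib` a fitted TF-IDF vectorizer, which the file names suggest but the README does not confirm. `load_and_predict` is a hypothetical helper, and the demo writes stand-in artifacts to a temporary directory so the example runs on its own.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def load_and_predict(outdir, texts):
    """Load the saved vectorizer and model, then label new texts."""
    outdir = Path(outdir)
    vectorizer = joblib.load(str(outdir / "vectorizer.joblib"))
    model = joblib.load(str(outdir / "best_model.joblib"))
    return list(model.predict(vectorizer.transform(texts)))


# Demo with stand-in artifacts (toy data, not the project's trained model).
with tempfile.TemporaryDirectory() as tmp:
    train_texts = ["great product", "love it", "terrible quality", "awful service"]
    train_labels = ["positive", "positive", "negative", "negative"]
    vec = TfidfVectorizer().fit(train_texts)
    clf = MultinomialNB().fit(vec.transform(train_texts), train_labels)
    joblib.dump(vec, str(Path(tmp) / "vectorizer.joblib"))
    joblib.dump(clf, str(Path(tmp) / "best_model.joblib"))

    demo_predictions = load_and_predict(tmp, ["great love", "terrible awful"])
    print(demo_predictions)
```

With the real artifacts, point `outdir` at `outputs` and apply the same preprocessing used during training before calling the helper; otherwise the vectorizer sees tokens it was not fitted on.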
Best model performance across classes is reported in outputs/classification_report.txt and visualized in outputs/confusion_matrix.png.

File: outputs/top_features.txt
Lists the most discriminative words and phrases learned by the classifier for each class.
| column | description |
|---|---|
| review_id | unique id |
| text | raw review text |
| label | sentiment {negative, neutral, positive} |
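A generator that produces rows matching this schema might look like the sketch below. It is not the contents of `data/generate_reviews.py`: the phrase pools, the `generate_reviews` and `to_csv` names, and the uniform class sampling are assumptions made for illustration. Only the three column names and the label set come from the table above.

```python
import csv
import io
import random

# Hypothetical phrase pools, one per sentiment label.
PHRASES = {
    "positive": ["Great quality", "Works perfectly", "Very happy with it"],
    "neutral": ["It is okay", "Does the job", "Nothing special"],
    "negative": ["Stopped working", "Poor quality", "Very disappointed"],
}


def generate_reviews(n, seed=42):
    """Build n rows with the columns review_id, text, label."""
    rng = random.Random(seed)  # local RNG keeps runs reproducible
    labels = sorted(PHRASES)
    rows = []
    for review_id in range(1, n + 1):
        label = rng.choice(labels)
        text = ". ".join(rng.sample(PHRASES[label], k=2)) + "."
        rows.append({"review_id": review_id, "text": text, "label": label})
    return rows


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["review_id", "text", "label"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = generate_reviews(5, seed=42)
print(to_csv(sample))
```

Seeding a local `random.Random` instance rather than the global generator means the same `--seed` value always yields the same dataset, independent of other code that uses `random`.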