Sentiment Analysis of N.Y. Times Articles about 2020 U.S. Presidential Candidates

Experiment Abstract: Application of Sentiment Analysis on Presidential Candidates using the N.Y. Times

This Python-based project performs sentiment analysis of U.S. presidential candidates based on content from N.Y. Times articles. I'm an avid reader of the N.Y. Times, and after reading numerous articles about presidential polling results, I wondered: what is the general sentiment toward these candidates, based on the content of the articles written about them?

The N.Y. Times is generally regarded as politically left-leaning, so there is inherent bias going into this exercise. However, the availability of data via the N.Y. Times Article Search API, the journalistic quality of the writing, and its analysis of political elections still make the N.Y. Times a viable data source for this experiment.

The package performs EDA on the data and trains a stacked machine learning model to perform sentiment analysis of U.S. presidential candidates. The data comes from the N.Y. Times Article Search API: 11,000 N.Y. Times article abstracts serve as the text from which sentiment is predicted. This is a multi-class classification problem, predicting positive, neutral and negative sentiment, using Natural Language Processing techniques to pre-process the text and engineer features.

So What?

The sentiment analysis model that I created for this project achieves a 60% F1 score (the harmonic mean of precision and recall). It predicts that Bernie Sanders has the highest average sentiment, while Donald Trump has the lowest. The N.Y. Times data was collected through February 2020, and the results displayed here are in line with the results of the Iowa caucuses and the New Hampshire primary. Please refer to the "Findings & Results" section below for more information on the relevance of the findings.

What does the Sentiment Analysis Pipeline technically do?

At a high level, the sentiment_analysis_pipe() function in the nyt_sentiment_analyzer.py script performs the following steps:

1) Reads data from multiple N.Y. Times .csv files in a specified directory into a single DataFrame. The files contain data about N.Y. Times articles on U.S. presidential candidates, scraped from the Article Search API.
2) Pre-processes the data, engineers additional features, and labels the data using TextBlob sentiment analysis scores.
3) Generates a series of graphs to assist with EDA in the Data Science workflow.
4) Trains user-selected scikit-learn models and prints F1 and Accuracy score metrics, using 5-fold cross-validation. This lets the user compare how models with numeric and text-based features perform, and check whether a model is overfitting to the training data.
5) Tunes the hyper-parameters of both models (numeric and text-based features) and prints the most important features for each model type.
6) Trains and evaluates a stacked model pipeline: using the predictions from the text-based model and the numeric model as features, it trains a second-layer Logistic Regression model.
7) Makes predictions on the data and produces a final graph showing the average sentiment of the N.Y. Times articles about a particular candidate over time.
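
As an illustration of the labeling in step 2, here is a minimal sketch of how a polarity score could be turned into one of the three sentiment classes. It assumes the score is a float in [-1.0, 1.0], as TextBlob's `sentiment.polarity` returns. The function name `label_from_polarity` and the `threshold` cutoff are hypothetical and are not taken from this project's code.

```python
def label_from_polarity(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1.0, 1.0] to a sentiment class.

    The threshold is a hypothetical cutoff chosen for illustration only.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1.0, 1.0]")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Made-up polarity scores standing in for values such as
# TextBlob(abstract).sentiment.polarity
scores = [0.40, 0.00, -0.25, 0.03]
labels = [label_from_polarity(s) for s in scores]
print(labels)
```

A wider neutral band (a larger `threshold`) would move more abstracts into the neutral class, which directly affects how imbalanced the resulting labels are.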

Model Training Methodology:

The model training methodology for this project takes a two-step approach:

Train text models on BOW and TfIdf scores using content from the articles retrieved via the N.Y. Times Article Search API. All text models were evaluated with bi-gram Bag of Words (BOW) frequencies and TfIdf scores as features. Standardization was performed only on the TfIdf scores, and dimensionality reduction using the Chi-Square test was performed on both text-based feature types.
Train numeric models using date features, character counts, total article word counts, and TextBlob subjectivity. All numeric features were Min-Max scaled prior to model fitting.
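
To make the two feature types concrete, the following is a small standard-library sketch of bi-gram Bag of Words counting and Min-Max scaling. It is not this project's implementation. The helper names (`bigram_counts`, `min_max_scale`), the whitespace tokenization, and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def bigram_counts(text: str) -> Counter:
    """Count adjacent word pairs (bi-grams) in a lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical abstract text and hypothetical article word counts
abstract = "the senator won the primary and the senator spoke"
counts = bigram_counts(abstract)
word_counts = [120, 80, 200]
scaled = min_max_scale(word_counts)

print(counts[("the", "senator")])
print(scaled)
```

Min-Max scaling keeps every numeric feature in the same [0, 1] range, so a large-valued feature such as a character count does not dominate smaller-valued ones such as subjectivity.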

Models Trained & Evaluated (for both Text & Numeric Models):

Logistic Regression
XGBoost Classifier
Random Forest Classifier
Linear SVM Classifier
Multinomial Naive Bayes Classifier
Recurrent Neural Network (LSTM)

Findings & Results:

Model Results

F1 Score was used to evaluate the multi-class classification because the classes are imbalanced.
XGBoost Classifier outperformed the other text-based models, with a 45.7% F1 Score using bi-gram TfIdf weights as features.
XGBoost Classifier also outperformed the other models using numeric features, with a 66% F1 Score.
The stacked Logistic Regression model achieves a 60% F1 Score, which is 10 percentage points better than a random guess. However, while the numeric features involved in the model stacking significantly improve F1 Score (by approximately 15 percentage points), F1 Score is ill-defined for the negative class, because the model predicted no samples as negative. This finding is consistent with the results for the XGBoost Classifier text model.
The LSTM had a 68% F1 Score; however, it only predicted two of the three classes. The LSTM Classifier needed more training data.
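
The ill-defined F1 Score for a class that is never predicted can be shown with a short sketch. The code below computes per-class F1 in plain Python on made-up labels, not on this project's data. The function names are hypothetical, and counting an undefined score as 0.0 in the macro average is one common convention, assumed here for illustration.

```python
def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}, using None when a label is never predicted."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        if tp + fp == 0:
            # No predicted samples for this class: precision is 0/0,
            # so F1 is undefined.
            scores[label] = None
            continue
        precision = tp / (tp + fp)
        # If the class never occurs in y_true, recall is treated as 0.0 here.
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Average per-class F1, counting undefined scores as 0.0 (an assumed convention)."""
    return sum(s or 0.0 for s in scores.values()) / len(scores)


# Made-up labels in which "negative" is never predicted
y_true = ["positive", "neutral", "negative", "positive", "neutral", "negative"]
y_pred = ["positive", "neutral", "neutral", "positive", "positive", "neutral"]
scores = per_class_f1(y_true, y_pred, ["positive", "neutral", "negative"])
macro = macro_f1(scores)
print(scores, macro)
```

In this toy example the never-predicted class pulls the macro average down, which is why a headline F1 Score should be read alongside the per-class breakdown.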

Sentiment Score Prediction Interpretation

While the model results show a 60% F1 Score (the harmonic mean of precision & recall), the interpretation of the sentiment score predictions is quite interesting.

Consider the table below that shows the results for candidates and their average sentiment predictions from the model:

candidate	avg. predicted sentiment
Bernie Sanders	0.56
Amy Klobuchar	0.50
Elizabeth Warren	0.43
Joe Biden	0.18
Donald Trump	-0.02

The model predicts that Bernie Sanders has the highest average sentiment, while Donald Trump has the lowest. The prediction for Senator Sanders is in line with polls at the time and with the results of the Iowa caucuses and the New Hampshire primary (around when the data was last pulled from the N.Y. Times Article Search API). It would be interesting to extend the data through the rest of the election to the present, and see whether Joe Biden ends up with the highest average sentiment score. On the other hand, it's interesting that President Trump has the lowest sentiment of any presidential candidate. The N.Y. Times has a reputation for being a progressive news organization. Even so, it's interesting that Trump's average sentiment is the lowest among the candidates, given that the President's impeachment trial was taking place during the period in which these articles were collected.
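
The per-candidate averages in the table above could be produced with an aggregation along the following lines. This is a sketch under stated assumptions: it assumes each prediction is encoded as -1 (negative), 0 (neutral) or 1 (positive), which the source does not confirm, and the function name, candidate names and rows are placeholders rather than project data.

```python
from collections import defaultdict


def average_sentiment_by_candidate(rows):
    """Average predicted sentiment per candidate, sorted from highest to lowest."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for candidate, score in rows:
        totals[candidate] += score
        counts[candidate] += 1
    averages = {name: totals[name] / counts[name] for name in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Placeholder (candidate, predicted sentiment) pairs, one per article
rows = [
    ("Candidate A", 1),
    ("Candidate A", 0),
    ("Candidate B", -1),
    ("Candidate B", 0),
    ("Candidate B", 1),
]
ranking = average_sentiment_by_candidate(rows)
print(ranking)
```

Under this encoding, an average near 0 can mean either mostly neutral articles or an even mix of positive and negative ones, so the class distribution per candidate is worth checking alongside the mean.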

Next Steps

Train the model on a wider universe of N.Y. Times articles (through the present) to more accurately predict the election outcome
Use negation rules to generate intelligent labels
Train the LSTM on a larger set of data
