Repo for the -- ENRON EMAIL TOPIC MODEL AND CLUSTERING -- project

Author: Shayne Longpre

Description:

Topic modelling of the Enron dataset (520,901 emails) using Non-Negative Matrix Factorization (NMF),
Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). These dimensionality reduction techniques
are applied to the TF-IDF processed emails. Topic clustering is then conducted on the dimensionality-reduced matrices
using Mean Shift (MS), Affinity Propagation (AF), Agglomerative Clustering (AG) and Spectral Clustering (SP).
MS and AF determine the number of clusters from the data, rather than optimizing for a given number. The cluster
centroids are annotated with the most frequently occurring TextRank keywords from the member emails.

Dataset:

Download at: https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz
More information at: https://www.cs.cmu.edu/~./enron/

Layout of the Data/ directory:

Data/Raw - should contain all 150 original email accounts from the dataset
Data/emails.txt - parsed emails in EmailUnit class format (see src/Textrank/Units.py)
Data/feature_labels.txt - the word labels for the tf_idf feature matrix 
Data/x_tfidf.matrix - the TF-IDF feature matrix 

All text files in Data/ are generated by 'python src/parser.py Data/Raw'

Layout of the src/ directory:

src/Util - Contains helper classes and functions used within other files
src/Textrank - Used to compute textrank_keyword scores for emails
src/parser.py - Run 'python src/parser.py Data/Raw' to generate emails.txt, feature_labels.txt and x_tfidf.matrix
src/topicmodels.py -
	Run `python src/topicmodels.py <dr_model> <cluster_model>`.
		where <dr_model> must be one of 'MNF', 'LDA', 'LSA' or 'ALL'.
		where <cluster_model> must be one of 'MS', 'SP', 'AF', 'AG' or 'ALL'.
	EG: Running 'python src/topicmodels.py MNF MS' will print topics for MNF and then cluster topics for MNF followed by MS.
	EG: Running 'python src/topicmodels.py ALL ALL' will print topics for all dimensionality reduction models (3) as 
		well as cluster topics for all combinations (12).
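
To make the 'ALL' behaviour concrete, here is a minimal sketch of how the two arguments could expand into runs. The argument values are copied from the usage text above; the helper names `expand` and `plan_runs` are hypothetical and are not taken from src/topicmodels.py.

```python
from itertools import product

# Argument values as listed in this README; the expansion logic is an assumption.
DR_MODELS = ["MNF", "LDA", "LSA"]
CLUSTER_MODELS = ["MS", "SP", "AF", "AG"]


def expand(choice, options):
    """Return every option for 'ALL', otherwise the single validated choice."""
    if choice == "ALL":
        return list(options)
    if choice not in options:
        raise ValueError("expected one of %s or 'ALL', got %r" % (options, choice))
    return [choice]


def plan_runs(dr_choice, cluster_choice):
    """List the (dr_model, cluster_model) pairs one invocation would cover."""
    return list(product(expand(dr_choice, DR_MODELS),
                        expand(cluster_choice, CLUSTER_MODELS)))


if __name__ == "__main__":
    print(len(plan_runs("ALL", "ALL")))  # 3 models x 4 clusterers = 12 pairs
```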

Layout of the results/ directory:

results/topicModelResults/ - Contains the human-readable, topic-labelled results for the NMF, LDA and LSA models.
results/topicClusterResults/ - Contains the human readable cluster labelled results for all combinations of
	dimensionality reduction models followed by a clustering model.

Pipeline:

(1) Parsing:

- Emails are processed and converted to EmailUnits, extracting the owner, sender, recipient and subject.
- The email body is filtered to remove stopwords, punctuation, numerics and other commonly occurring uninformative patterns (matched with regexes).
- The TextRank keywords for each email are calculated using PageRank.
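
The parsing steps above can be sketched roughly as follows. This is an illustration only, not the code in src/parser.py or src/Textrank: the stopword list, the co-occurrence window, the damping factor and the function names `clean_tokens` and `textrank_keywords` are all assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; the project's actual list is not shown here.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "for", "on", "at"}


def clean_tokens(text):
    """Lowercase, keep alphabetic runs only (drops punctuation and numerics), remove stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by PageRank over an undirected word co-occurrence graph."""
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the next `window` words that follow it.
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    if not neighbours:
        return []
    scores = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        # Standard PageRank update: each neighbour shares its score equally among its links.
        scores = {
            w: (1 - damping) + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[w])
            for w in neighbours
        }
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


if __name__ == "__main__":
    body = "The gas pipeline meeting is on 12 May. The pipeline contract and the gas contract."
    print(textrank_keywords(clean_tokens(body)))
```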

(2) Feature Extraction:

- From the cleaned email text we create a 'bag of words' matrix.
- From the 'bag of words' matrix we compute the TF-IDF matrix (520,901 emails X ~60,000 words).
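
As a small worked illustration of these two steps, the sketch below builds a count matrix and weights it with one common TF-IDF variant (raw count times log(N / document frequency)). The exact weighting used by src/parser.py is not stated in this README, so the formula and the function names `bag_of_words` and `tfidf` are assumptions.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of token lists."""
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for word, n in Counter(doc).items():
            counts[i][index[word]] = n
    return vocab, counts


def tfidf(counts):
    """Weight raw counts by log(N / document frequency); assumed variant."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Number of documents in which each term appears at least once.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / doc_freq[j]) if row[j] else 0.0 for j in range(n_terms)]
        for row in counts
    ]


if __name__ == "__main__":
    vocab, counts = bag_of_words([["gas", "price", "gas"], ["meeting", "price"]])
    print(vocab)
    print(tfidf(counts))
```

Under this variant a word that appears in every email (here "price") gets weight zero, which is why very common terms contribute little to the later topic models.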

(3) Topic Modelling/Extraction:

- We run NMF / LDA / LSA to obtain the fitted component matrix (num_components X num_features).
- We output the highest weighted words for the vector components as preliminary results.
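
Reading topics off the fitted components amounts to sorting each row by weight. A minimal sketch, assuming the components are available as a list of rows aligned with the labels in Data/feature_labels.txt; the function name `top_words` and the toy numbers are hypothetical.

```python
def top_words(components, feature_labels, n_top=3):
    """For each component row, return the n_top labels with the largest weights."""
    topics = []
    for weights in components:
        # Column indices ordered from heaviest to lightest weight.
        order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        topics.append([feature_labels[j] for j in order[:n_top]])
    return topics


if __name__ == "__main__":
    labels = ["gas", "meeting", "price", "trade"]
    components = [[0.9, 0.0, 0.7, 0.1],   # toy "energy" component
                  [0.0, 0.8, 0.1, 0.3]]   # toy "scheduling" component
    for k, words in enumerate(top_words(components, labels, n_top=2)):
        print("Topic %d: %s" % (k, ", ".join(words)))
```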

(4) Topic Clustering:

- We use the fitted dimensionality reduction model to transform the TF-IDF matrix into a new feature matrix (520,901 emails X num_components).
- We run MS / SP / AG / AF on the dimensionality-reduced feature matrix to obtain cluster labels.
- We obtain human-readable word labels for cluster centroids by extracting the most frequently occurring TextRank 
	keywords from the member emails.
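
The labelling step can be sketched as a frequency count per cluster. This assumes one cluster label and one list of TextRank keywords per email; the function name `label_clusters` and the alphabetical tie-breaking are assumptions rather than the behaviour of src/topicmodels.py.

```python
from collections import Counter, defaultdict


def label_clusters(cluster_labels, email_keywords, n_top=2):
    """Map each cluster id to its most frequent keywords across member emails."""
    counts = defaultdict(Counter)
    for cluster, keywords in zip(cluster_labels, email_keywords):
        counts[cluster].update(keywords)
    return {
        # Most frequent first; ties are broken alphabetically for a stable result.
        cluster: [w for w, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:n_top]]
        for cluster, c in counts.items()
    }


if __name__ == "__main__":
    labels = [0, 0, 1]
    keywords = [["gas", "price"], ["gas", "pipeline"], ["meeting", "lunch"]]
    print(label_clusters(labels, keywords))
```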
