Classifying the Emails of HRC

Comparing SVM and Random Forest as predictors for text

Summary

In this project for a Machine Learning class, we used a document-term matrix built from about 3,500 emails, each labeled with its sender, to train SVM and Random Forest classifiers. Because we were working with about 9,000 predictors, using Random Forest for feature selection yielded the best estimated MSE. Our test accuracy was 76%. See the Repo Here.

Project Outline

Tools

All scripting was done in R. We used the tm package for text processing, e1071 for SVM, and randomForest for Random Forest. We also experimented with feature selection using Boruta and topicmodels. Most plots were made with ggplot2. The deliverable for the project was the full GitHub repository our team used to collaborate, which includes the data, scripts, and written reports.
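As a rough illustration of how these packages can fit together, here is a minimal sketch of the general workflow: build a document-term matrix with tm, fit a Random Forest, keep the highest-importance terms, and train an SVM on that reduced set. The toy emails, the sender labels, the 45/15 train/test split, the number of trees, and the choice of keeping five terms are all assumptions made up for this example, not the project's data or settings; the actual scripts are in the repository.

```r
library(tm)
library(e1071)
library(randomForest)

set.seed(1)

# Hypothetical toy corpus: two senders with partly different vocabularies.
vocab_a <- c("schedule", "meeting", "call", "tomorrow", "office")
vocab_b <- c("report", "draft", "memo", "statement", "press")
shared  <- c("please", "thanks", "update", "today")
make_email <- function(vocab) {
  paste(sample(c(vocab, shared), 8, replace = TRUE), collapse = " ")
}
emails <- c(replicate(30, make_email(vocab_a)),
            replicate(30, make_email(vocab_b)))
sender <- factor(rep(c("sender_a", "sender_b"), each = 30))

# Document-term matrix: one row per email, one column per term.
corpus <- VCorpus(VectorSource(emails))
dtm <- DocumentTermMatrix(
  corpus,
  control = list(tolower = TRUE, removePunctuation = TRUE, stopwords = TRUE)
)
x <- as.matrix(dtm)

# Assumed 45/15 train/test split.
train_idx <- sample(seq_along(sender), 45)

# Random Forest on all terms; rank terms by mean decrease in Gini impurity.
rf_fit <- randomForest(x = x[train_idx, ], y = sender[train_idx], ntree = 200)
imp <- importance(rf_fit, type = 2)[, 1]
top_terms <- names(sort(imp, decreasing = TRUE))[1:5]

# SVM trained only on the terms the forest ranked highest.
svm_fit <- svm(x = x[train_idx, top_terms], y = sender[train_idx],
               kernel = "linear", scale = FALSE)

# Held-out accuracy for both models.
acc_rf  <- mean(predict(rf_fit, x[-train_idx, ]) == sender[-train_idx])
acc_svm <- mean(predict(svm_fit, x[-train_idx, top_terms]) == sender[-train_idx])
print(c(random_forest = acc_rf, svm_on_top_terms = acc_svm))
```

With thousands of terms instead of a handful, the same pattern applies: the forest's importance scores give a ranking, and the cutoff for how many terms to keep is a tuning choice that would normally be made by cross-validation rather than fixed in advance.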

See the Repo Here.