This was an open-ended project for a class focused on reproducible data science. Using a federally maintained dataset on US colleges, we defined “outcome gaps”: the difference in graduation rates between ethnic groups, and the difference in post-graduation earnings between students from different household income groups. We then looked for patterns in these gaps across schools and found a significant relationship between school expenditures per student and the difference in earnings between students from high- and low-income backgrounds.
Select relevant variables from the original dataset (2000+ variables available). Each row in the dataset is a US college or university.
Calculate gap statistics
Exploratory data analysis on the full dataset and on familiar school cohorts (UCs, CSUs, elite private schools)
Perform Analysis of Variance on gap statistics with respect to type of school (private, public, for-profit private)
Regress gap statistics on school expenditures per student
Test null hypothesis on regression coefficients
Use Random Forest to evaluate the relative importance of other possible predictors
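As a rough illustration of the gap-statistic, ANOVA, and regression steps listed above, here is a minimal base-R sketch. It does not reproduce the project's scripts: the data frame is synthetic, and every column name (`school_type`, `spend_per_student`, `grad_rate_group_a`, and so on) is a hypothetical stand-in for the variables selected from the real dataset. The random forest step is left out because it depends on an add-on package.

```r
# Synthetic stand-in for the selected variables; one row per school.
# All column names and value ranges here are hypothetical.
set.seed(1)
n <- 300
colleges <- data.frame(
  school_type       = factor(sample(c("public", "private", "for_profit"),
                                    n, replace = TRUE)),
  spend_per_student = runif(n, 5000, 40000),
  grad_rate_group_a = runif(n, 0.4, 0.9),
  grad_rate_group_b = runif(n, 0.3, 0.9),
  earn_high_income  = rnorm(n, 55000, 8000),
  earn_low_income   = rnorm(n, 45000, 8000)
)

# Gap statistics: differences between group-level outcomes within each school.
colleges$grad_gap <- colleges$grad_rate_group_a - colleges$grad_rate_group_b
colleges$earn_gap <- colleges$earn_high_income - colleges$earn_low_income

# One-way analysis of variance of the earnings gap across school types.
anova_fit <- aov(earn_gap ~ school_type, data = colleges)
print(summary(anova_fit))

# Regress the earnings gap on expenditures per student.
lm_fit <- lm(earn_gap ~ spend_per_student, data = colleges)

# The coefficient table holds the estimate, t statistic, and p-value
# used to test the null hypothesis that the slope is zero.
coef_table <- summary(lm_fit)$coefficients
print(coef_table["spend_per_student", c("Estimate", "t value", "Pr(>|t|)")])
```

With real data in place of the synthetic frame, the `spend_per_student` row of `coef_table` is where the null hypothesis of a zero slope would be assessed; the same pattern applies to `grad_gap`.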
All scripts were written in R: plots were generated with ggplot2, the report was written in Rnw, and the slides in Rmd.
Here’s a simple Shiny app visualizing the relationship between expenditures per student and the earnings gap.
The full repo is available here.
Direct link to the paper.