Introduction
Kalamazoo is a city of around 80,000 people located in southwest Michigan. In 2005, The Kalamazoo Promise was announced with the declaration that students who graduate from Kalamazoo Public Schools (and meet the basic requirements) will receive a free college education at any in-state public institution. The Kalamazoo Promise was funded by anonymous donors who aimed to boost educational achievement within the Kalamazoo area.
In Michigan, 1.7 million K-12 students are provided free or reduced-cost lunches in public schools based on financial need. In Kalamazoo, about 70% of students receive free or reduced-cost lunch, and one in three students in the Kalamazoo school district lives below the poverty line. For this reason, there have been many critiques of The Kalamazoo Promise stating that free tuition is a start but does not come close to resolving the inequality within the community. The free tuition provided by The Kalamazoo Promise does not cover the costs of college room, board, and books, which are just a few of the barriers low-income students face.
This project explores a data set of Kalamazoo Promise students' college graduation rates based on their enrollment in the free or reduced-cost lunch program and their sex. The goal of this project is to investigate how The Kalamazoo Promise fails to support low-income students, using enrollment in the free and reduced-cost lunch program as a proxy for income. Furthermore, this study aims to understand the role of an individual's sex in their college graduation status.
Data Processing
Clean Data
# Load Libraries:
library(readxl) # read Excel files
library(dplyr) # needed for %>% and mutate()

# Load Data:
raw_data <- read_excel("C:/Users/Kelly Nickelson/Desktop/Kalamazoo Promise Project/Free Reduced Lunch Data v5.xlsx")

# Mutate Variables:
cleaned_data <- raw_data %>%
  mutate( # change the below variables into factors
    Graduated = as.factor(Graduated),
    FreeLunch = as.factor(Lunch),
    Sex = as.factor(Sex))

#view(cleaned_data)
Data Partition
The data is split into training and test data sets using a 70:30 ratio.
# Shuffle the Data Frame by rows:
shuffled_data = cleaned_data[sample(1:nrow(cleaned_data)), ]
#view(shuffled_data)

# Split Data:
splitMe <- sample(2, nrow(shuffled_data),
                  replace = TRUE, prob = c(0.7, 0.3)) # split data 70% as train & 30% as test

# Set Train Data Frame:
train <- shuffled_data[splitMe == 1, ] # labeling the new data set for training

# Set Test Data Frame:
test <- shuffled_data[splitMe == 2, ] # labeling the new data set for testing
#view(test)
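Because sample() draws randomly, re-running this partition yields a different train/test split each time. A minimal sketch, not part of the original analysis, of making the split reproducible by fixing the random seed first (the seed value here is arbitrary):

```r
# Fix the random seed so the shuffle and split are identical on every run
set.seed(2022) # arbitrary, hypothetical seed value

shuffled_data <- cleaned_data[sample(1:nrow(cleaned_data)), ]
splitMe <- sample(2, nrow(shuffled_data), replace = TRUE, prob = c(0.7, 0.3))
train <- shuffled_data[splitMe == 1, ]
test <- shuffled_data[splitMe == 2, ]
```

With a fixed seed, the coefficients and error rates reported below could be reproduced exactly on another machine.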
Build Model
Logistic regression is a classification algorithm that models the relationship between one or more independent variables and a binary response variable. The response variable for this data is graduation status. The independent variables provided in this public data set are free and reduced lunch status, sex, and high school graduation year.
The generalized linear model is created using the glm function with the family set to binomial, since the response variable is binary: students either graduated from college or they did not. Below is the creation of the full model, including all three independent variables provided by the public data set.
Full Model
# Training the model
logitmodel <- glm(Graduated ~ Sex + FreeLunch + Year, data = train, family = "binomial")
# Checking the model
summary(logitmodel)
##
## Call:
## glm(formula = Graduated ~ Sex + FreeLunch + Year, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3436 -0.6512 -0.5006 1.0276 2.0829
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.22557 32.90592 -0.341 0.733
## SexMale -0.57074 0.08526 -6.694 2.17e-11 ***
## FreeLunchNot Free Lunch 1.82084 0.08665 21.014 < 2e-16 ***
## Year 0.00486 0.01637 0.297 0.767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3933.2 on 3175 degrees of freedom
## Residual deviance: 3404.2 on 3172 degrees of freedom
## AIC: 3412.2
##
## Number of Fisher Scoring iterations: 4
In this model, the p-values demonstrate that the independent variables sex and free and reduced lunch status are statistically significant in determining whether a student will graduate college. However, the p-value for a student's high school graduation year indicates that this variable does not significantly impact graduation results, as expected. Let's drop the high school graduation year and re-run the model.
Reduced Model
# Training the model
logisticmodel <- glm(Graduated ~ Sex + FreeLunch, data = train, family = "binomial")
# Checking the model
summary(logisticmodel)
##
## Call:
## glm(formula = Graduated ~ Sex + FreeLunch, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3347 -0.6477 -0.4976 1.0279 2.0738
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.45518 0.07194 -20.227 < 2e-16 ***
## SexMale -0.57127 0.08524 -6.702 2.06e-11 ***
## FreeLunchNot Free Lunch 1.81761 0.08594 21.150 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3933.2 on 3175 degrees of freedom
## Residual deviance: 3404.2 on 3173 degrees of freedom
## AIC: 3410.2
##
## Number of Fisher Scoring iterations: 4
The Akaike information criterion (AIC) score for this regression model, without the high school graduation year variable, is lower (3410.2 vs. 3412.2), indicating this model is better. Let's use this model going forward.
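Since the reduced model is nested inside the full model, the AIC comparison can be cross-checked with a likelihood-ratio test. A sketch, assuming logitmodel (full) and logisticmodel (reduced) are fit as above:

```r
# Likelihood-ratio (chi-squared) test: does adding Year significantly improve fit?
# A large p-value here supports dropping Year from the model.
anova(logisticmodel, logitmodel, test = "Chisq")

# AIC values side by side; the lower value indicates the preferred model
AIC(logitmodel, logisticmodel)
```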
Interpreting Model
Impact of Student’s Sex
Creating a model with just the independent variable of student’s sex allows us to understand the impact this has on their likelihood of college graduation.
model <- glm(Graduated ~ Sex, family = binomial(), data = test)
model$coefficients
## (Intercept) SexMale
## -0.7363194 -0.2249228
The difference in the log-odds of graduation between males and females is -0.2249, meaning the chance of graduating college is lower for male students than for female students. The p-value for this coefficient is < 0.05, so we can reject the null hypothesis that there is no difference in the log-odds between male and female students.
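Log-odds are hard to read directly; exponentiating a coefficient converts it to an odds ratio. A sketch using the SexMale coefficient printed above:

```r
# Convert all log-odds coefficients to odds ratios
exp(model$coefficients)

# For the SexMale coefficient alone:
exp(-0.2249228) # ~0.80: male students' odds of graduating are about 80% of female students' odds
```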
Impact of Free & Reduced Lunch Status
Now creating another model with just the independent variable of student’s free and reduced lunch status allows us to understand the impact income level may have on the likelihood of college graduation.
model <- glm(Graduated ~ FreeLunch, family = binomial(), data = test)
model$coefficients
## (Intercept) FreeLunchNot Free Lunch
## -1.762201 1.740617
The difference in the log-odds of graduation between students that did not receive free lunches and students that did is 1.7406, meaning the chance of graduating college is much higher for students that were living above the poverty line (did not receive free and reduced-cost lunches) than for students that lived below the poverty line (received free and reduced-cost lunches). Again, the p-value for this coefficient is < 0.05.
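The intercept and coefficient can also be converted into predicted graduation probabilities with the inverse-logit function, plogis(). A sketch using the two values printed above:

```r
# Predicted probability of graduating for free-lunch recipients (the baseline group)
plogis(-1.762201) # ~0.15

# Predicted probability for students not receiving free lunch (intercept + coefficient)
plogis(-1.762201 + 1.740617) # ~0.49
```

On this scale the gap is stark: roughly 15% of free-lunch recipients are predicted to graduate, versus roughly 49% of non-recipients.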
Visualization of Data
The difference in the log-odds of graduation between students based on their income level, demonstrated through students' enrollment in the free and reduced-cost lunch program, is very important. The graph below shows the number of Kalamazoo Promise students that graduated from college based on their free and reduced-cost lunch status. This graph includes all students that graduated from a Kalamazoo public high school between 2006 and 2014.
# Create Plot of College Graduation by Lunch Status:
library(ggplot2) # needed for ggplot()

p <- ggplot(data = cleaned_data, aes(FreeLunch, ..count..)) + # select data and x variable
  geom_bar(
    aes(fill = Graduated), # select fill variable
    position = "dodge", # center bars
    col = "black" # add black outline to bars
  ) +
  theme_classic() + # apply classic ggplot2 in-built theme
  theme(
    plot.title = element_text(hjust = 0.5), # center the plot title
    axis.text.x = element_text(colour = "Black"), # change x axis text color
    axis.text.y = element_text(colour = "Black"), # change y axis text color
    axis.ticks.x = element_blank(), # remove x axis tick marks
    axis.ticks.y = element_blank(), # remove y axis tick marks
    axis.line = element_line(colour = "Black") # change axis line color
  ) +
  labs(title = "Kalamazoo Promise Students' Rate of College Graduation") + # add plot title
  ylab("Number of Students") + # modify y axis title
  xlab("Student's Free and Reduced Cost Lunch Status") + # modify x axis title
  scale_fill_discrete(name = "Graduated?", labels = c("No", "Yes")) + # modify legend title & labels
  scale_x_discrete(labels = c("Recipient of Free Lunch", "Not Recipient of Free Lunch")) # change x axis labels
p
Assessing Model Predictive Ability
Misclassification Errors
An important measure of a model's predictive ability is its rate of misclassification errors: the number of students we know belong to one category but that the model classified into the other.
This table demonstrates the number of students that were classified to each group by our model compared with the number of students actually in that group.
p2 <- predict(logisticmodel, test, type = 'response')
pred2 <- ifelse(p2 > 0.5, 1, 0)
tab2 <- table(Predicted = pred2, Actual = test$Graduated)
tab2
## Actual
## Predicted No Yes
## 0 768 238
## 1 113 140
These are the rates of the misclassification errors and the accuracy of our model.
misClasificError <- 1 - sum(diag(tab2))/sum(tab2)

print(paste('Misclassifcation Rate', misClasificError))
## [1] "Misclassifcation Rate 0.278792692613185"
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.721207307386815"
Area Under the ROC Curve
Another way to investigate the success of our model is the area under the ROC curve, an aggregate metric that measures how well our logistic regression model classifies positive and negative outcomes at all possible cutoffs.
library(ROCR) # needed for prediction() and performance()

p <- predict(logisticmodel, test, type = "response") # generate predictions for ROC
pr <- prediction(p, test$Graduated)
prf <- performance(pr, measure = "tpr", x.measure = "fpr") # assess performance for ROC
plot(prf, main = "ROC Curve") # plot of the model curve
The area under the curve (AUC) provides an aggregate measure of our model's performance. This AUC value demonstrates that our model is moderately accurate at determining whether a Kalamazoo Promise student will graduate college based on their sex and free and reduced-cost lunch status.
# Calculate Area Under the Curve:
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7259382
Conclusion
This logistic regression model is moderately accurate at determining a Kalamazoo Promise student's likelihood of graduating college. A more extensive data set with additional independent variables about individual students would greatly improve the model's accuracy. Unfortunately, no such data set is publicly available at this time.
Kalamazoo Promise students that received free and reduced cost lunches clearly have a lower rate of college graduation. I believe this demonstrates the reality that The Kalamazoo Promise does not provide equitable support for all students. The Promise is a great start to creating positive change in the Kalamazoo community but more needs to be done to support all students.
Citation
Datasciencebeginners. (2020, May 27). Binary logistic regression with R: R-bloggers. R. Retrieved March 8, 2022, from https://www.r-bloggers.com/2020/05/binary-logistic-regression-with-r/
Digest of Education Statistics, 2017. National Center for Education Statistics (NCES) Home Page, a part of the U.S. Department of Education. (n.d.). Retrieved March 8, 2022, from https://nces.ed.gov/programs/digest/d17/tables/dt17_204.10.asp
Ready, T. (2016, July 29). Free college is not enough: The unavoidable limits of the kalamazoo promise. Brookings. Retrieved March 8, 2022, from https://www.brookings.edu/blog/social-mobility-memos/2015/06/24/free-college-is-not-enough-the-unavoidable-limits-of-the-kalamazoo-promise/
Data Set Source:
The Kalamazoo Promise: Data Collection. Upjohn Institute. (n.d.). Retrieved March 8, 2022, from https://www.upjohn.org/about/research-initiatives/promise-investing-community/kalamazoo-promise-data-collection