Introduction
Kalamazoo is a city of around 80,000 people located in southwest Michigan. In 2005, The Kalamazoo Promise was announced with the declaration that students who graduate from Kalamazoo Public Schools (and meet the basic requirements) will receive a free college education at any in-state public institution. The Kalamazoo Promise was funded by anonymous donors who aimed to boost educational achievement within the Kalamazoo area.
In Michigan, 1.7 million K-12 students are provided free or reduced-cost lunches in public schools based on financial need. In Kalamazoo, about 70% of students receive free or reduced-cost lunch, and one in three students in the Kalamazoo school district lives below the poverty line. For this reason, there have been many critiques of The Kalamazoo Promise stating that free tuition is a start but does not come close to resolving the inequality within the community. The free tuition provided by The Kalamazoo Promise does not cover the costs of college room, board, and books, which are just a few of the barriers low-income students face.
This project explores a data set of Kalamazoo Promise students' college graduation rates based on their enrollment in the free or reduced-cost lunch program and their sex. The goal of this project is to investigate how The Kalamazoo Promise fails to support low-income students, using enrollment in the free and reduced-cost lunch program as a proxy for income. Furthermore, this study aims to understand the role of an individual's sex in their college graduation status.
Data Processing
Clean Data
# Load Libraries:
library(readxl) # read Excel files
library(dplyr) # needed for %>% and mutate()

# Load Data:
raw_data <- read_excel("C:/Users/Kelly Nickelson/Desktop/Kalamazoo Promise Project/Free Reduced Lunch Data v5.xlsx")

# Mutate Variables:
cleaned_data <- raw_data %>%
  mutate( # change the below variables into factors
    Graduated = as.factor(Graduated),
    FreeLunch = as.factor(Lunch),
    Sex = as.factor(Sex))

#view(cleaned_data)
Data Partition
The data is split into training and test data sets using a 70:30 ratio.
# Shuffle the Data Frame by rows:
shuffled_data = cleaned_data[sample(1:nrow(cleaned_data)), ]
#view(shuffled_data)

# Split Data:
splitMe <- sample(2, nrow(shuffled_data),
                  replace = TRUE, prob = c(0.7, 0.3)) # split data 70% as train & 30% as test

# Set Train Data Frame:
train <- shuffled_data[splitMe == 1, ] # labeling the new data set for training

# Set Test Data Frame:
test <- shuffled_data[splitMe == 2, ] # labeling the new data set for testing
#view(test)
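Because sample() draws randomly, re-running this partition yields a different train/test split each time. A minimal sketch, not part of the original analysis, of making the split reproducible by fixing the random seed first (the seed value here is arbitrary):

```r
# Fix the random seed so the shuffle and split are identical on every run
set.seed(2022) # arbitrary, hypothetical seed value

shuffled_data <- cleaned_data[sample(1:nrow(cleaned_data)), ]
splitMe <- sample(2, nrow(shuffled_data), replace = TRUE, prob = c(0.7, 0.3))
train <- shuffled_data[splitMe == 1, ]
test <- shuffled_data[splitMe == 2, ]
```

With a fixed seed, the coefficients and error rates reported below could be reproduced exactly on another machine.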
Build Model
Logistic regression is a classification algorithm that models the relationship between one or more independent variables and a binary response variable. The response variable for this data is graduation status. The independent variables provided in this public data set are free and reduced lunch status, sex, and high school graduation year.
The generalized linear model is created using the glm function with the family set to binomial, since the response variable is binary: students either graduated from college or they did not. Below is the creation of the full model, including all three independent variables provided by the public data set.
Full Model
# Training the model
logitmodel <- glm(Graduated ~ Sex + FreeLunch + Year, data = train, family = "binomial")
# Checking the model
summary(logitmodel)
##
## Call:
## glm(formula = Graduated ~ Sex + FreeLunch + Year, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3436 -0.6512 -0.5006 1.0276 2.0829
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.22557 32.90592 -0.341 0.733
## SexMale -0.57074 0.08526 -6.694 2.17e-11 ***
## FreeLunchNot Free Lunch 1.82084 0.08665 21.014 < 2e-16 ***
## Year 0.00486 0.01637 0.297 0.767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3933.2 on 3175 degrees of freedom
## Residual deviance: 3404.2 on 3172 degrees of freedom
## AIC: 3412.2
##
## Number of Fisher Scoring iterations: 4
In this model, the p-values demonstrate that the independent variables sex and free and reduced lunch status are statistically significant in determining whether a student will graduate college. However, the p-value for a student's high school graduation year indicates that this variable does not significantly impact graduation results, as expected. Let's drop the high school graduation year and re-run the model.
Reduced Model
# Training the model
logisticmodel <- glm(Graduated ~ Sex + FreeLunch, data = train, family = "binomial")
# Checking the model
summary(logisticmodel)
##
## Call:
## glm(formula = Graduated ~ Sex + FreeLunch, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3347 -0.6477 -0.4976 1.0279 2.0738
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.45518 0.07194 -20.227 < 2e-16 ***
## SexMale -0.57127 0.08524 -6.702 2.06e-11 ***
## FreeLunchNot Free Lunch 1.81761 0.08594 21.150 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3933.2 on 3175 degrees of freedom
## Residual deviance: 3404.2 on 3173 degrees of freedom
## AIC: 3410.2
##
## Number of Fisher Scoring iterations: 4
The Akaike information criterion (AIC) score for this regression model, without the high school graduation year variable, is lower (3410.2 vs. 3412.2), indicating this model is better. Let's use this model going forward.
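Since the reduced model is nested inside the full model, the AIC comparison can be cross-checked with a likelihood-ratio test. A sketch, assuming logitmodel (full) and logisticmodel (reduced) are fit as above:

```r
# Likelihood-ratio (chi-squared) test: does adding Year significantly improve fit?
# A large p-value here supports dropping Year from the model.
anova(logisticmodel, logitmodel, test = "Chisq")

# AIC values side by side; the lower value indicates the preferred model
AIC(logitmodel, logisticmodel)
```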
Interpreting Model
Impact of Student’s Sex
Creating a model with just the independent variable of student’s sex allows us to understand the impact this has on their likelihood of college graduation.
model <- glm(Graduated ~ Sex, family = binomial(), data = test)
model$coefficients
## (Intercept) SexMale
## -0.7363194 -0.2249228
The difference in the log-odds of graduation between males and females is -0.2249, meaning the chance of graduating college is lower for male students than for female students. The p-value for this coefficient is < 0.05, so we can reject the null hypothesis that there is no difference in the log-odds between male and female students.
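Log-odds are hard to read directly; exponentiating a coefficient converts it to an odds ratio. A sketch using the SexMale coefficient printed above:

```r
# Convert all log-odds coefficients to odds ratios
exp(model$coefficients)

# For the SexMale coefficient alone:
exp(-0.2249228) # ~0.80: male students' odds of graduating are about 80% of female students' odds
```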
Impact of Free & Reduced Lunch Status
Now creating another model with just the independent variable of student’s free and reduced lunch status allows us to understand the impact income level may have on the likelihood of college graduation.
model <- glm(Graduated ~ FreeLunch, family = binomial(), data = test)
model$coefficients
## (Intercept) FreeLunchNot Free Lunch
## -1.762201 1.740617
The difference in the log-odds of graduation between students that did not receive free lunches and students that did is 1.7406, meaning the chance of graduating college is much higher for students that were living above the poverty line (did not receive free and reduced-cost lunches) than for students that lived below the poverty line (received free and reduced-cost lunches). Again, the p-value for this coefficient is < 0.05.
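The intercept and coefficient can also be converted into predicted graduation probabilities with the inverse-logit function, plogis(). A sketch using the two values printed above:

```r
# Predicted probability of graduating for free-lunch recipients (the baseline group)
plogis(-1.762201) # ~0.15

# Predicted probability for students not receiving free lunch (intercept + coefficient)
plogis(-1.762201 + 1.740617) # ~0.49
```

On this scale the gap is stark: roughly 15% of free-lunch recipients are predicted to graduate, versus roughly 49% of non-recipients.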
Visualization of Data
The difference in the log-odds of graduation between students based on their income level, demonstrated through students' enrollment in the free and reduced-cost lunch program, is very important. The graph below shows the number of Kalamazoo Promise students that graduated from college based on their free and reduced-cost lunch status. This graph includes all students that graduated from a Kalamazoo public high school between 2006 and 2014.
# Create Plot of College Graduation by Lunch Status:
library(ggplot2) # needed for ggplot()

p <- ggplot(data = cleaned_data, aes(FreeLunch, ..count..)) + # select data and x variable
  geom_bar(
    aes(fill = Graduated), # select fill variable
    position = "dodge", # center bars
    col = "black" # add black outline to bars
  ) +
  theme_classic() + # apply classic ggplot2 in-built theme
  theme(
    plot.title = element_text(hjust = 0.5), # center the plot title
    axis.text.x = element_text(colour = "Black"), # change x axis text color
    axis.text.y = element_text(colour = "Black"), # change y axis text color
    axis.ticks.x = element_blank(), # remove x axis tick marks
    axis.ticks.y = element_blank(), # remove y axis tick marks
    axis.line = element_line(colour = "Black") # change axis line color
  ) +
  labs(title = "Kalamazoo Promise Students' Rate of College Graduation") + # add plot title
  ylab("Number of Students") + # modify y axis title
  xlab("Student's Free and Reduced Cost Lunch Status") + # modify x axis title
  scale_fill_discrete(name = "Graduated?", labels = c("No", "Yes")) + # modify legend title & labels
  scale_x_discrete(labels = c("Recipient of Free Lunch", "Not Recipient of Free Lunch")) # change x axis labels
p
Assessing Model Predictive Ability
Misclassification Errors
An important measure of a model's predictive ability is its rate of misclassification errors: the number of students we know belong to one category but that the model classified into the other.
This table demonstrates the number of students that were classified to each group by our model compared with the number of students actually in that group.
p2 <- predict(logisticmodel, test, type = 'response')
pred2 <- ifelse(p2 > 0.5, 1, 0)
tab2 <- table(Predicted = pred2, Actual = test$Graduated)
tab2
## Actual
## Predicted No Yes
## 0 768 238
## 1 113 140
These are the rates of the misclassification errors and the accuracy of our model.
misClasificError <- 1 - sum(diag(tab2))/sum(tab2)

print(paste('Misclassifcation Rate', misClasificError))
## [1] "Misclassifcation Rate 0.278792692613185"
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.721207307386815"
Area Under the ROC Curve
Another way to investigate the success of our model is the area under the ROC curve, an aggregate metric that measures how well our logistic regression model classifies positive and negative outcomes at all possible cutoffs.
library(ROCR) # needed for prediction() and performance()

p <- predict(logisticmodel, test, type = "response") # generate predictions for ROC
pr <- prediction(p, test$Graduated)
prf <- performance(pr, measure = "tpr", x.measure = "fpr") # assess performance for ROC
plot(prf, main = "ROC Curve") # plot of the model curve
The area under the curve (AUC) provides an aggregate measure of our model's performance. This AUC value demonstrates that our model is moderately accurate at determining whether a Kalamazoo Promise student will graduate college based on their sex and free and reduced-cost lunch status.
# Calculate Area Under the Curve:
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7259382
Conclusion
This logistic regression model is moderately accurate at determining a Kalamazoo Promise student's likelihood of graduating college. A more extensive data set with additional independent variables about individual students would greatly improve the model's accuracy. Unfortunately, no such data set is publicly available at this time.
Kalamazoo Promise students that received free and reduced cost lunches clearly have a lower rate of college graduation. I believe this demonstrates the reality that The Kalamazoo Promise does not provide equitable support for all students. The Promise is a great start to creating positive change in the Kalamazoo community but more needs to be done to support all students.
Citation
Datasciencebeginners. (2020, May 27). Binary logistic regression with R: R-bloggers. R. Retrieved March 8, 2022, from https://www.r-bloggers.com/2020/05/binary-logistic-regression-with-r/
Digest of Education Statistics, 2017. National Center for Education Statistics (NCES) Home Page, a part of the U.S. Department of Education. (n.d.). Retrieved March 8, 2022, from https://nces.ed.gov/programs/digest/d17/tables/dt17_204.10.asp
Ready, T. (2016, July 29). Free college is not enough: The unavoidable limits of the kalamazoo promise. Brookings. Retrieved March 8, 2022, from https://www.brookings.edu/blog/social-mobility-memos/2015/06/24/free-college-is-not-enough-the-unavoidable-limits-of-the-kalamazoo-promise/
Data Set Source:
The Kalamazoo Promise: Data Collection. Upjohn Institute. (n.d.). Retrieved March 8, 2022, from https://www.upjohn.org/about/research-initiatives/promise-investing-community/kalamazoo-promise-data-collection