Decision Tree & Random Forest Model: Esophageal Cancer

Introduction

Esophageal cancer is a disease in which cancer cells form within the esophageal tissues. The esophagus is the tube-like structure running from the mouth to the stomach. Tobacco use and heavy alcohol use increase an individual's risk of developing esophageal cancer. Esophageal cancer has a greater chance of being successfully treated if it is caught early. This project aims to create a model of esophageal cancer rates based on an individual's age group, daily alcohol consumption, and daily tobacco consumption. The data used to generate this decision tree and these models are from a case-control study of esophageal cancer in Ille-et-Vilaine, France.

Clean Data

This chunk loads the data, removes unnecessary columns, and converts the variables to factors. It then removes any NA or blank entries and displays the head of the cleaned data.

# Load Data:
library(readxl)
library(dplyr) # provides %>% and mutate()
raw_data <- read_excel("C:/Users/Kelly Nickelson/Desktop/esophv3.xlsx")
#View(raw_data)

# Mutate Variables:
cleaned_data <- raw_data %>% 
  mutate( # convert the variables below to factors
     Alcohol_Use = as.factor(Alcohol_Use) 
    ,Tobacco_Use = as.factor(Tobacco_Use) 
    ,Cancer = as.factor(Cancer))


# View Cleaned Data Head:
#head(cleaned_data)
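
The spreadsheet esophv3.xlsx is not distributed with this report, so its exact layout is not known here. As a rough sketch only, base R ships the same Ille-et-Vilaine study as the grouped `esoph` data set, and it can be expanded to one row per individual. The column names Alcohol_Use, Tobacco_Use, and Cancer are chosen to match the variables converted above, while Age_Group and the helper `expand_rows` are hypothetical; none of this is confirmed to match the actual spreadsheet.

```r
# Sketch: expand the grouped built-in esoph data to one row per individual.
data(esoph)  # columns: agegp, alcgp, tobgp, ncases, ncontrols

# Hypothetical helper: repeat each group row once per counted individual
expand_rows <- function(counts, label) {
  idx <- rep(seq_len(nrow(esoph)), times = counts)
  out <- esoph[idx, c("agegp", "alcgp", "tobgp")]
  names(out) <- c("Age_Group", "Alcohol_Use", "Tobacco_Use")  # assumed names
  out$Cancer <- rep(label, nrow(out))
  rownames(out) <- NULL
  out
}

individual_data <- rbind(
  expand_rows(esoph$ncontrols, "No"),
  expand_rows(esoph$ncases, "Yes")
)
individual_data$Cancer <- factor(individual_data$Cancer, levels = c("No", "Yes"))

table(individual_data$Cancer)  # class counts after expansion
```

Setting the factor levels explicitly keeps "No" as the first level, which is the level order that appears in the confusion matrices later in this report.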

Generate Test & Train Data Sets

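This chunk splits the cleaned data into two data sets: the training data set (approximately 70% of the data) and the test data set (approximately 30% of the data). Because each row is assigned at random, the exact proportions vary from run to run.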

# Shuffle the Data Frame by rows:
shuffled_data <- cleaned_data[sample(1:nrow(cleaned_data)), ]
#View(shuffled_data)


# Split Data:
splitMe <- sample(2, nrow(shuffled_data), 
      replace = TRUE, prob = c(0.7,0.3)) # split data 70% as train & 30% as test

# Set Train Data Frame:
train <- shuffled_data[splitMe==1,] # labeling the new data set for training

# Set Test Data Frame:
test <- shuffled_data[splitMe==2,]  # labeling the new data set for testing
#view(test)
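
The split above assigns each row independently with `sample(2, ..., replace = TRUE, prob = ...)`, so the 70/30 proportions are approximate and, with no seed set, the rows in each set change on every run. A minimal sketch of a reproducible, exact split is shown below. The seed value and the data frame `example_data` are hypothetical stand-ins for illustration, not part of the original analysis.

```r
set.seed(2022)  # arbitrary seed, chosen only so the example is repeatable

# Hypothetical stand-in for cleaned_data
example_data <- data.frame(
  id = 1:100,
  Cancer = factor(rep(c("No", "Yes"), times = c(80, 20)))
)

n <- nrow(example_data)
train_idx <- sample(n, size = round(0.7 * n))  # 70% of row indices, no replacement

train_example <- example_data[train_idx, ]   # rows drawn for training
test_example  <- example_data[-train_idx, ]  # all remaining rows for testing

c(train = nrow(train_example), test = nrow(test_example))
```

Sampling row indices without replacement already randomizes which rows land in each set, so a separate shuffling step is not needed with this approach.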

Decision Tree

This chunk builds a decision tree classification model and generates the tree plot.

# Build Tree Classification Model:
library(rpart)
library(rpart.plot)
tree.esophcancer <- rpart(Cancer~.,data=train) # fit classification tree on training data

# Generate Tree Plot
tree<-rpart.plot(
      tree.esophcancer # select fitted tree
      ,shadow.col="gray" # selects shadow color
      ,nn=TRUE) # includes node number
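
A fitted rpart tree also stores a complexity parameter (CP) table, which is what pruning works from. Taking the smallest CP in that table generally returns the full tree unchanged, so one common alternative is to prune at the CP with the lowest cross-validated error. The sketch below illustrates that alternative on the `kyphosis` data that ships with rpart. It is an illustration under those assumptions, not the procedure used in this report, and the seed is arbitrary.

```r
library(rpart)  # the kyphosis example data ships with rpart

set.seed(1)  # cross-validation folds are random, so fix a seed for repeatability
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

cp_table <- fit$cptable  # columns: CP, nsplit, rel error, xerror, xstd
best_row <- which.min(cp_table[, "xerror"])  # row with lowest cross-validated error
best_cp  <- cp_table[best_row, "CP"]

pruned_fit <- prune(fit, cp = best_cp)  # cut back splits that are not worth their CP
printcp(pruned_fit)
```

Because `xerror` comes from random cross-validation folds, the selected CP can differ between seeds on small data sets.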

Pruned Tree Model

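This chunk prunes the tree to prevent overfitting and makes a prediction on the test data using the pruned tree. The pruned tree reaches an accuracy of about 78% on the test data, although this is only slightly above the no-information rate of 77.4%, and the model correctly identifies just 6 of the 66 cancer cases.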

# Prune Tree to Prevent Overfitting:
library(caret) # provides confusionMatrix()
cp <- min(tree.esophcancer$cptable[,1]) # smallest complexity parameter (CP) in the table
pruned.tree.esophcancer <- prune(tree.esophcancer, cp=cp) # generates pruned tree

# Make Prediction on Test Data with Pruned Tree:
tree.esophcancer.predict <- predict(
      pruned.tree.esophcancer # select pruned tree
      ,test # select test data set
      ,type="class") # return predicted class labels

confusionMatrix(tree.esophcancer.predict, test$Cancer) # print confusion matrix for prediction

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  222  60
##        Yes   4   6
##                                           
##                Accuracy : 0.7808          
##                  95% CI : (0.7289, 0.8269)
##     No Information Rate : 0.774           
##     P-Value [Acc > NIR] : 0.4217          
##                                           
##                   Kappa : 0.1046          
##                                           
##  Mcnemar's Test P-Value : 6.199e-12       
##                                           
##             Sensitivity : 0.98230         
##             Specificity : 0.09091         
##          Pos Pred Value : 0.78723         
##          Neg Pred Value : 0.60000         
##              Prevalence : 0.77397         
##          Detection Rate : 0.76027         
##    Detection Prevalence : 0.96575         
##       Balanced Accuracy : 0.53660         
##                                           
##        'Positive' Class : No              
##
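
To make the statistics above easier to read, the sketch below recomputes a few of them by hand from the four counts in the confusion matrix. Note that the 'Positive' class is "No", so sensitivity describes how well non-cancer cases are recognized and specificity describes how well cancer cases are recognized. The variable names are illustrative only.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(
  c(222, 4, 60, 6), nrow = 2,
  dimnames = list(Prediction = c("No", "Yes"), Reference = c("No", "Yes"))
)

accuracy    <- sum(diag(cm)) / sum(cm)               # correct predictions / all predictions
sensitivity <- cm["No", "No"] / sum(cm[, "No"])      # share of true "No" predicted "No"
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # share of true "Yes" predicted "Yes"
balanced    <- (sensitivity + specificity) / 2       # mean of the two class-wise rates

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```

The specificity of 6/66 is the reason the balanced accuracy sits close to 0.5 even though the overall accuracy is near 0.78.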

Pruned Tree Model Performance

This chunk evaluates the performance of the pruned tree using a receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate.

# Generate ROC curve for pruned tree model
library(ROCR) # provides prediction() and performance()
tree.esophcancer.predict <- predict( # generate prediction
      pruned.tree.esophcancer # select pruned tree model
      ,test # select test data set
      ,type="prob") # return class probabilities

predROC <- prediction(tree.esophcancer.predict[,2], test$Cancer) # pair the "Yes" probabilities with the true labels
perfROC <- performance(predROC, "tpr", "fpr") # assess performance for ROC
plot(perfROC, main = "Pruned Tree Curve") # plot of the ROC curve
abline(a=0, b=1) # diagonal reference line for a random classifier

This chunk calculates the area under the ROC curve (AUC) for the pruned tree model. The AUC of about 0.66 is low, indicating the model is poor at predicting esophageal cancer based on the factors provided.

# Calculate Area Under the Curve:
perfROC <- performance(predROC, "auc") 
perfROC@y.values[[1]]

## [1] 0.6605323
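
For intuition, the AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks (the Mann-Whitney U statistic) on four made-up scores. The function name `auc_by_rank` and the toy values are hypothetical and are not taken from the esophageal cancer data.

```r
# Hypothetical helper: AUC from ranks, equivalent to the Mann-Whitney U statistic
auc_by_rank <- function(scores, is_positive) {
  r <- rank(scores)            # average ranks, so tied scores count as half
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  u <- sum(r[is_positive]) - n_pos * (n_pos + 1) / 2  # positive-over-negative pairs
  u / (n_pos * n_neg)          # fraction of pairs ranked correctly
}

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
scores      <- c(0.10, 0.40, 0.35, 0.80)
is_positive <- c(FALSE, FALSE, TRUE, TRUE)
auc_by_rank(scores, is_positive)  # 0.75
```

A value of 0.5 corresponds to the diagonal reference line on the ROC plot, and 1 corresponds to perfect ranking.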

Random Forest Model

This chunk fits a random forest model to improve prediction accuracy and plots the error rate against the number of trees, which helps identify the optimal number of trees.

# Use Random Forest Model to Improve Predictive Accuracy:
library(randomForest)
rf.tree.esophcancer <- randomForest(Cancer~., data=train) # fit forest on training data
plot(rf.tree.esophcancer, main="Random Forest Model") # plot error rate vs. number of trees
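
The plot above is drawn from the `err.rate` matrix stored on the fitted forest, which has one row per tree count, an "OOB" (out-of-bag) column, and one column per class. A minimal sketch of reading that matrix is shown below, using the built-in `iris` data as a stand-in. The seed, the `ntree` value, and the variable names are assumptions for illustration.

```r
library(randomForest)

set.seed(42)  # forests are random, so fix a seed for repeatability
rf_example <- randomForest(Species ~ ., data = iris, ntree = 200)

colnames(rf_example$err.rate)  # "OOB" followed by one column per class

oob_error  <- rf_example$err.rate[, "OOB"]  # out-of-bag error after 1, 2, ..., 200 trees
best_ntree <- which.min(oob_error)          # first tree count with the lowest OOB error
best_ntree
```

The OOB curve is noisy, so the minimizing tree count can shift between seeds; it is a rough guide rather than a precise optimum.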

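This chunk builds a random forest using the selected number of trees and makes a prediction with it on the test data set. The random forest reaches an accuracy of about 79% on the test data, which is only slightly above the no-information rate of 77.4%, and it correctly identifies 15 of the 66 cancer cases.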

# Build Random Forest with Model Parameter:
ntrees <- which.min(rf.tree.esophcancer$err.rate[,1]) # number of trees with the lowest out-of-bag (OOB) error

rf.tree.esophcancer <- randomForest(  # build model 
      Cancer~., # model formula: predict Cancer from all other variables
      data=train,  # select data
      ntree=ntrees) # select model parameter

rf.trees.predict <- predict( # make a prediction
     rf.tree.esophcancer # select model
     ,test # select data
     ,type="class") # return predicted class labels

confusionMatrix( # print confusion matrix
    rf.trees.predict # select the prediction
    ,test$Cancer) # true labels from the test data set

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  216  51
##        Yes  10  15
##                                           
##                Accuracy : 0.7911          
##                  95% CI : (0.7399, 0.8363)
##     No Information Rate : 0.774           
##     P-Value [Acc > NIR] : 0.267           
##                                           
##                   Kappa : 0.2346          
##                                           
##  Mcnemar's Test P-Value : 3.032e-07       
##                                           
##             Sensitivity : 0.9558          
##             Specificity : 0.2273          
##          Pos Pred Value : 0.8090          
##          Neg Pred Value : 0.6000          
##              Prevalence : 0.7740          
##          Detection Rate : 0.7397          
##    Detection Prevalence : 0.9144          
##       Balanced Accuracy : 0.5915          
##                                           
##        'Positive' Class : No              
##

Random Forest Model Performance

This chunk evaluates the performance of the random forest model using a receiver operating characteristic (ROC) curve.

# Generate ROC curve for random forest model
tree.esophcancer.predict2 <- predict( # generate prediction
      rf.tree.esophcancer # select random forest model
      ,test # select test data set
      ,type="prob") # return class probabilities

predROC <- prediction(tree.esophcancer.predict2[,2], test$Cancer) # pair the "Yes" probabilities with the true labels
perfROC <- performance(predROC, "tpr", "fpr") # assess performance for ROC
plot(perfROC, main = "Random Forest Model Curve") # plot of the ROC curve
abline(a=0, b=1) # diagonal reference line for a random classifier

This chunk calculates the area under the ROC curve (AUC) for the random forest model. The AUC of about 0.75 is moderately high, indicating the model is adequate at predicting esophageal cancer based on the factors provided.

# Calculate Area Under the Curve:
perfROC <- performance(predROC, "auc") 
perfROC@y.values[[1]]

## [1] 0.7504358

Citation

The skills needed to create this decision tree were learned through the Titanic survival examples provided at the sites below.

Dean, J. (n.d.). Titanic Decision Trees. JasonTDean. Retrieved February 27, 2022, from http://jasontdean.com/R/Titanic_Decision_Trees.html

Frushicheva, M. P. (n.d.). Titanic Data Decision Trees. RPubs. Retrieved February 27, 2022, from https://rpubs.com/violetgirl/201322

Maklin, C. (2019, July 30). Random Forest in R. Medium. Retrieved February 27, 2022, from https://towardsdatascience.com/random-forest-in-r-f66adf80ec9

R decision trees tutorial: Examples & code in R for Regression & Classification. DataCamp Community. (n.d.). Retrieved February 27, 2022, from https://www.datacamp.com/community/tutorials/decision-trees-R

Data From:

Breslow, N. E. and Day, N. E. (1980) Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. IARC Lyon / Oxford University Press.