Decision Tree & Random Forest Model: Esophageal Cancer
Introduction
Esophageal cancer is a disease in which cancer cells form within the esophageal tissues. The esophagus is the tube-like structure running from the mouth to the stomach. Tobacco and heavy alcohol use increase an individual's risk of developing esophageal cancer. Esophageal cancer has a greater chance of being successfully treated if it is caught early. This project aims to create a model of esophageal cancer rates based on an individual's age group, daily alcohol consumption, and daily tobacco consumption. The data used to generate this decision tree and these models are from a case-control study of esophageal cancer in Ille-et-Vilaine, France.
Clean Data
This chunk loads the data, removes unnecessary columns, and mutates the variables. It then removes any NA or blank entries and displays the head of the cleaned data.
# Load Data:
library(readxl)
library(dplyr) # provides %>% and mutate()
raw_data <- read_excel("C:/Users/Kelly Nickelson/Desktop/esophv3.xlsx")
#View(esoph)
# Mutate Variables:
cleaned_data <- raw_data %>%
  mutate( # change these below variables into a factor
    Alcohol_Use = as.factor(Alcohol_Use)
    ,Tobacco_Use = as.factor(Tobacco_Use)
    ,Cancer = as.factor(Cancer))
# View Cleaned Data Head:
#head(cleaned_data)
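The description above mentions dropping NA or blank entries, a step the chunk itself does not show. A minimal sketch of one way to do it in base R, using a small made-up data frame (the column names mirror those above, but the values and the names `toy_data` and `toy_clean` are hypothetical):

```r
# Hypothetical toy data; the values are invented for illustration only.
toy_data <- data.frame(
  Alcohol_Use = c("0-39", "40-79", "", "120+"),
  Tobacco_Use = c("0-9", NA, "10-19", "30+"),
  Cancer      = c("No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Flag cells that are either NA or an empty string.
is_missing <- is.na(toy_data) | toy_data == ""

# Keep only rows with no flagged cells.
toy_clean <- toy_data[rowSums(is_missing) == 0, ]

# Convert every remaining column to a factor.
toy_clean[] <- lapply(toy_clean, as.factor)

str(toy_clean)
```

Only the first and last rows survive here, because the second row has an NA and the third has a blank entry.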
Generate Test & Train Data Sets
This chunk splits the cleaned data into two data sets: the training data set (about 70% of the data) and the test data set (about 30% of the data).
# Shuffle the Data Frame by rows:
shuffled_data <- cleaned_data[sample(1:nrow(cleaned_data)), ]
#view(shuffled_data)
# Split Data:
splitMe <- sample(2, nrow(shuffled_data),
                  replace = TRUE, prob = c(0.7,0.3)) # split data 70% as train & 30% as test
# Set Train Data Frame:
train <- shuffled_data[splitMe==1,] # labeling the new data set for training
# Set Test Data Frame:
test <- shuffled_data[splitMe==2,] # labeling the new data set for testing
#view(test)
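Because `sample()` with `prob` assigns each row to a group independently, the split above is only approximately 70/30 and changes from run to run. A sketch of an alternative, assuming a repeatable and exact split is wanted; the seed value and the `toy` data frame are hypothetical stand-ins:

```r
set.seed(123)  # arbitrary seed, chosen only so the split is repeatable

# Hypothetical stand-in for cleaned_data with 100 rows.
toy <- data.frame(
  id     = 1:100,
  Cancer = factor(rep(c("No", "Yes"), times = c(78, 22)))
)

# Draw exactly 70% of the row indices without replacement.
n_train   <- round(0.7 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

toy_train <- toy[train_idx, ]   # sampled rows
toy_test  <- toy[-train_idx, ]  # all remaining rows

c(train = nrow(toy_train), test = nrow(toy_test))
```

Sampling indices without replacement guarantees that no row appears in both sets and that the set sizes are fixed.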
Decision Tree
This chunk builds a decision tree classification model and generates the tree plot.
# Build Tree Classification Model:
library(rpart)
library(rpart.plot)
tree.esophcancer <- rpart(Cancer~.,data=train) # select data for model
# Generate Tree Plot
tree <- rpart.plot(
  tree.esophcancer # select data
  ,shadow.col="gray" # selects shadow color
  ,nn=TRUE) # includes node number
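For classification, `rpart` chooses splits by Gini impurity by default. A small worked sketch of that calculation in base R may help in reading the tree; the helper name `gini_impurity` and the class counts are made up for illustration and are not taken from the fitted model:

```r
# Gini impurity of a node, given the number of observations in each class.
gini_impurity <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical parent node: 70 "No" and 30 "Yes".
parent <- c(No = 70, Yes = 30)

# Hypothetical split of the parent into two child nodes.
left  <- c(No = 60, Yes = 10)
right <- c(No = 10, Yes = 20)

# Weighted average impurity of the two children.
n <- sum(parent)
children <- (sum(left) / n) * gini_impurity(left) +
            (sum(right) / n) * gini_impurity(right)

# Impurity reduction achieved by the split; larger is better.
gain <- gini_impurity(parent) - children
gain
```

A pure node (all one class) has impurity 0, and an evenly mixed two-class node has impurity 0.5, so a useful split moves the children toward 0.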
Pruned Tree Model
This chunk prunes the tree to prevent overfitting and makes a prediction on the test data using the pruned tree. The pruned tree's accuracy is moderately high (0.78), although it is only marginally above the no-information rate of 0.774.
# Prune Tree to Prevent Overfitting:
library(caret) # provides confusionMatrix()
cp <- min(tree.esophcancer$cptable[,1]) # select the smallest complexity parameter (CP) for pruning
pruned.tree.esophcancer <- prune(tree.esophcancer, cp=cp) # generates pruned tree
# Make Prediction on Test Data with Pruned Tree:
tree.esophcancer.predict <- predict(
  pruned.tree.esophcancer # select pruned tree
  ,test # select test data set
  ,type="class") # select type
confusionMatrix(tree.esophcancer.predict, test$Cancer) # print confusion matrix for prediction
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  222  60
##        Yes   4   6
##
## Accuracy : 0.7808
## 95% CI : (0.7289, 0.8269)
## No Information Rate : 0.774
## P-Value [Acc > NIR] : 0.4217
##
## Kappa : 0.1046
##
## Mcnemar's Test P-Value : 6.199e-12
##
## Sensitivity : 0.98230
## Specificity : 0.09091
## Pos Pred Value : 0.78723
## Neg Pred Value : 0.60000
## Prevalence : 0.77397
## Detection Rate : 0.76027
## Detection Prevalence : 0.96575
## Balanced Accuracy : 0.53660
##
## 'Positive' Class : No
##
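Taking the minimum of the CP column selects the smallest complexity parameter in the table, which leaves the tree close to how it was grown. A common alternative is to prune at the CP with the lowest cross-validated error (`xerror`). The sketch below illustrates that alternative on the `kyphosis` data that ships with `rpart`, not on the esophageal data, and it is not the method used above; the seed and object names are arbitrary:

```r
library(rpart)

set.seed(42)  # cross-validation inside rpart is random; seed is arbitrary

# Grow a classification tree on a built-in example data set.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# cptable has one row per candidate tree size, with columns
# "CP", "nsplit", "rel error", "xerror", and "xstd".
best_row <- which.min(fit$cptable[, "xerror"])
best_cp  <- fit$cptable[best_row, "CP"]

# Prune back to the tree with the lowest cross-validated error.
pruned <- prune(fit, cp = best_cp)

printcp(pruned)
```

Pruning on `xerror` can return a noticeably smaller tree than the one grown, which is the point of pruning against overfitting.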
Pruned Tree Model Performance
This chunk evaluates the performance of the pruned tree using a receiver operating characteristic (ROC) curve.
# Generate ROC curve for pruned tree model
library(ROCR) # provides prediction() and performance()
tree.esophcancer.predict <- predict( # generate prediction
  pruned.tree.esophcancer # select pruned tree model
  ,test # select test data set
  ,type="prob") # select type of output
predROC <- prediction(tree.esophcancer.predict[,2], test$Cancer) # generate prediction for ROC
perfROC <- performance(predROC, "tpr", "fpr") # assess performance for ROC
plot(perfROC, main = "Pruned Tree Curve") # plot of the model curve
abline(a=0, b=1) # diagonal reference line for a model with no discrimination
This chunk calculates the area under the pruned tree model's ROC curve (AUC). This model has a low AUC (0.66), indicating the model is poor at predicting esophageal cancer based on the factors provided.
# Calculate Area Under the Curve:
perfROC <- performance(predROC, "auc")
perfROC@y.values[[1]]
## [1] 0.6605323
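AUC can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen non-case. A sketch of that calculation in base R using the rank-sum (Mann-Whitney) identity, with made-up scores and labels; `auc_rank` is a hypothetical helper and not part of ROCR:

```r
# AUC from ranks: labels are 1 for cases and 0 for non-cases.
auc_rank <- function(scores, labels) {
  r     <- rank(scores)            # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  u     <- sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2
  u / (n_pos * n_neg)
}

# Invented example: two non-cases followed by two cases.
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)

# 3 of the 4 case/non-case pairs are ordered correctly, so this gives 0.75.
auc_rank(scores, labels)
```

An AUC of 0.5 corresponds to the diagonal line on the ROC plot, and 1 corresponds to perfect separation of cases from non-cases.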
Random Forest Model
This chunk fits a random forest model to improve prediction accuracy and plots the error rate against the number of trees, which helps identify the optimal number of trees.
# Use Random Forest Model to Improve Predictive Accuracy:
library(randomForest)
rf.tree.esophcancer <- randomForest(Cancer~., data=train) # select data and variable
plot(rf.tree.esophcancer, main="Random Forest Model") # prints the plot
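The plotted error rates are also stored in the fitted model's `err.rate` matrix, whose first column is the out-of-bag (OOB) error after each additional tree. A sketch of reading it directly, using the built-in `iris` data as a stand-in for the esophageal data; the seed, the `ntree` value, and the object names are arbitrary choices for illustration:

```r
library(randomForest)

set.seed(1)  # arbitrary; random forests differ between runs without a seed
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 200)

# One row per number of trees; column "OOB" is the overall out-of-bag error.
oob_error <- rf_demo$err.rate[, "OOB"]

# Smallest number of trees at which the OOB error reaches its minimum.
best_ntree <- which.min(oob_error)
best_ntree
```

Since the OOB curve is noisy, the exact minimum depends on the seed; it is usually enough to pick a tree count past the point where the curve flattens.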
This chunk builds a random forest using the selected number of trees and makes a prediction with it on the test data set. The random forest model's accuracy is moderately high (0.79).
# Build Random Forest with Model Parameter:
ntrees <- which.min(rf.tree.esophcancer$err.rate[,1]) # selects the optimal number of trees
rf.tree.esophcancer <- randomForest( # build model
  Cancer~., # select variable
  data=train, # select data
  ntree=ntrees) # select model parameter
rf.trees.predict <- predict( # make a prediction
  rf.tree.esophcancer # select model
  ,test # select data
  ,type="class") # select type
confusionMatrix( # print confusion matrix
  rf.trees.predict # select the prediction
  ,test$Cancer) # select dataset & variable
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  216  51
##        Yes  10  15
##
## Accuracy : 0.7911
## 95% CI : (0.7399, 0.8363)
## No Information Rate : 0.774
## P-Value [Acc > NIR] : 0.267
##
## Kappa : 0.2346
##
## Mcnemar's Test P-Value : 3.032e-07
##
## Sensitivity : 0.9558
## Specificity : 0.2273
## Pos Pred Value : 0.8090
## Neg Pred Value : 0.6000
## Prevalence : 0.7740
## Detection Rate : 0.7397
## Detection Prevalence : 0.9144
## Balanced Accuracy : 0.5915
##
## 'Positive' Class : No
##
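Both reports treat `No` as the positive class, so the reported sensitivity measures how well non-cases are identified. A sketch that recomputes the rates with `Yes` as the class of interest, using the counts printed in the two confusion matrices above; the helper name `rates_for_yes` is made up for this illustration:

```r
# Rows are predictions and columns are reference labels, as in the reports.
pruned_tree <- matrix(c(222, 4, 60, 6), nrow = 2,
                      dimnames = list(Prediction = c("No", "Yes"),
                                      Reference  = c("No", "Yes")))
random_forest <- matrix(c(216, 10, 51, 15), nrow = 2,
                        dimnames = list(Prediction = c("No", "Yes"),
                                        Reference  = c("No", "Yes")))

# Accuracy, sensitivity, and specificity with "Yes" as the positive class.
rates_for_yes <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["Yes", "Yes"] / sum(cm[, "Yes"]),  # cases correctly flagged
    specificity = cm["No", "No"] / sum(cm[, "No"]))     # non-cases correctly cleared
}

round(rbind(pruned_tree   = rates_for_yes(pruned_tree),
            random_forest = rates_for_yes(random_forest)), 4)
```

Read this way, the pruned tree flags 6 of the 66 true cases and the random forest flags 15 of 66. The same view can be requested from `confusionMatrix()` by passing `positive = "Yes"`.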
Random Forest Model Performance
This chunk evaluates the performance of the random forest model using a receiver operating characteristic (ROC) curve.
# Generate ROC curve for random forest model
tree.esophcancer.predict2 <- predict( # generate prediction
  rf.tree.esophcancer # select random forest model
  ,test # select test data set
  ,type="prob") # select type of output
predROC <- prediction(tree.esophcancer.predict2[,2], test$Cancer) # generate prediction for ROC
perfROC <- performance(predROC, "tpr", "fpr") # assess performance for ROC
plot(perfROC, main = "Random Forest Model Curve") # plot of the model curve
abline(a=0, b=1) # diagonal reference line for a model with no discrimination
This chunk calculates the area under the random forest model's ROC curve (AUC). This model has a moderately high AUC (0.75), indicating the model is sufficient at predicting esophageal cancer based on the factors provided.
# Calculate Area Under the Curve:
perfROC <- performance(predROC, "auc")
perfROC@y.values[[1]]
## [1] 0.7504358
Citation
The skills needed to create this decision tree were learned through the Titanic survival examples provided at the sites below.
Dean, J. (n.d.). Titanic Decision Trees. JasonTDean. Retrieved February 27, 2022, from http://jasontdean.com/R/Titanic_Decision_Trees.html
Frushicheva, M. P. (n.d.). Titanic Data Decision Trees. RPubs. Retrieved February 27, 2022, from https://rpubs.com/violetgirl/201322
Maklin, C. (2019, July 30). Random Forest in R. Medium. Retrieved February 27, 2022, from https://towardsdatascience.com/random-forest-in-r-f66adf80ec9
R decision trees tutorial: Examples & code in R for Regression & Classification. DataCamp Community. (n.d.). Retrieved February 27, 2022, from https://www.datacamp.com/community/tutorials/decision-trees-R
Data From:
Breslow, N. E. and Day, N. E. (1980) Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. IARC Lyon / Oxford University Press.