Simple logistic regression model and ROC Curve with R + intuitive explanation of ROC curve


I recently discovered a cool package in R called pROC that is very convenient for making ROC curves.

Here is an example of how to use it in R after creating a model:

I've included the entire process starting from splitting the data into training and test, fitting the model, validating by predicting the test set, and finally drawing the ROC curve.

library(sqldf)
library(ggplot2)
library(reshape)
library(pROC)
setwd('/home/willie/data_science/XXXXX')
system('ls')


###generate training and test set using 80/20 split########
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  # sample 1/5 of the rows for the test set; the remaining 4/5 become the training set
  testindex <- sample(index, trunc(length(index) / 5))
  trainset <- dataframe[-testindex, ]
  testset <- dataframe[testindex, ]
  list(trainset = trainset, testset = testset)
}

splits <- splitdf(mydata, seed = 888)

training_data <- splits$trainset
test_data <- splits$testset


###create a logistic regression model
mymodel <- glm(y ~ x1 + x2 + x3, family = 'binomial', data = training_data)

summary(mymodel)

###validate the model by computing ROC curve.
library(pROC)
prediction <- predict(mymodel, newdata = test_data, type = 'response')
test_data$prob = prediction
g <- roc(y ~ prob, data = test_data)
plot(g) 

###Now compute the confusion matrix: 
table(test_data$y, test_data$prob > .5)  
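Beyond the plot, pROC can summarize the curve for you. A small sketch, using the `g` roc object fitted above (the `best.method = "youden"` choice of threshold rule is my assumption, not something from the original post):

```r
# Area under the curve, a single-number summary of the ROC plot
auc(g)

# The cutoff that maximizes sensitivity + specificity - 1 (Youden's J),
# one common way to pick a threshold other than the default 0.5
coords(g, "best", best.method = "youden")
```

This can be handy when the 0.5 cutoff used in the confusion matrix above isn't appropriate for your problem.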


The result is the ROC curve for the model.

Intuition behind an ROC curve:

So what is the intuition behind the ROC curve?  It is not obvious.  For me, the best way to understand it is by thinking through an example such as the following:

Suppose you had a binary classifier for spam detection.  Your spam classifier outputs the probability of an email being spam, a number between 0 and 100%.   You use this classifier to score 1000 emails.  You then choose a single cutoff: if an email's probability is greater than the cutoff you classify it as spam (1), and if it is lower than the cutoff you classify it as not spam (0).
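In R that thresholding step is a one-liner.  A quick sketch with made-up probabilities for five emails (the scores and the 0.5 cutoff are hypothetical, just for illustration):

```r
# Hypothetical spam probabilities for five emails
probs  <- c(0.92, 0.65, 0.48, 0.30, 0.07)
cutoff <- 0.5

# Above the cutoff -> spam (1), below -> not spam (0)
predicted_class <- ifelse(probs > cutoff, 1, 0)
predicted_class  # 1 1 0 0 0
```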

Now, if you had a perfect classifier, then obviously everything above the cutoff would actually be a spam email and everything below the cutoff would be a non-spam email.  

To assess the accuracy of your model, you begin drawing your ROC curve.   You first sort your classifications in descending order by probability, and then you begin going down the list.

You begin drawing your ROC curve from the bottom left of the graph.  Now, starting from the top of the list, you see that your model classified the email as spam (1), and you check to see if it was actually a spam email.  

If it was actually a spam email (true positive), then you draw a vertical line upwards.
If it wasn't a spam email (false positive), then you draw a horizontal line towards the right

Eventually you go further down the list and you end up below the cutoff threshold.  These are examples that your model classified as not spam.

If it wasn't a spam email (true negative), then you draw a horizontal line towards the right.
If it was a spam email (false negative), then you draw a vertical line upwards.
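The drawing procedure above can be sketched in a few lines of R.  The labels and probabilities below are hypothetical; the key point is that every actual spam email moves the curve up one step and every actual non-spam email moves it right one step:

```r
# Hypothetical true labels (1 = spam, 0 = not spam) and model scores
labels <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
probs  <- c(.95, .9, .8, .7, .6, .5, .4, .3, .2, .1)

ord <- order(probs, decreasing = TRUE)  # sort descending by probability
lab <- labels[ord]

# Cumulative vertical steps (spam) and horizontal steps (non-spam),
# each normalized so the curve ends at (1, 1)
tpr <- c(0, cumsum(lab) / sum(lab))
fpr <- c(0, cumsum(1 - lab) / sum(1 - lab))

plot(fpr, tpr, type = "s",
     xlab = "False positive rate", ylab = "True positive rate",
     main = "Hand-drawn ROC")
abline(0, 1, lty = 2)  # diagonal = random guessing

# AUC via the trapezoid rule over the staircase
sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)  # ≈ 0.833
```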

Keep doing this for every email, and the path you trace out is the ROC curve!



Now, if your model had classified everything perfectly, you would have a perfect curve that goes vertically to the top left, then horizontally to the top right, with an AUC of 1.00.

But most ROC curves are not perfect, so you will end up with an AUC < 1.  (At the other extreme, a classifier that guesses randomly hugs the diagonal, with an AUC of about 0.5.)

Let me know in the comments if anything is unclear!

