Wednesday, November 26, 2014

Accuracy Testing Predictive Models

Hi fellow data scientists and others who accidentally stumbled into what must seem like a tutorial of an Asian language. Today the goal is to make sure we don't architect a library without accounting for the weight of the books, or, in predictive model speak, we are going to out-of-sample test the accuracy of the models we construct. In the following tutorial I'm going to walk through generating
and testing of predictive models for accuracy against the mtcars data set which is readily available in R. First off, if you are playing along at home, let's fire up an R session and open up the mtcars data set. That looks like:
------------------------
data(mtcars)
head(mtcars)
------------------------





Laid out before us we have  a number of observations of different cars with various statistics describing them. While we could spend days just on this data set analyzing information on each variable, let's concentrate on one and how to predict it given the others. So as I'm writing this gas is less than $3.00 a gallon, but let's assume that people still care about miles per gallon (mpg) as much as they did back when it was in the $4.00 range. For you research people, our 'y' variable will thus be mpg with the remaining variables being our vector X of independent or predictor variables. Thus, our formula will be mpg = cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb.

Let's hop, skip and jump over all of the modeling technique discussion and straight to validation now. In validating data models, one of the most widely used methods is out-of-sample testing. All of those hyphens are because we now need to take a perfectly good data sample and split the it into what is
typically called a training and a testing data set. Randomly we will select some of the observations to be placed in the training set and the remainder will fall in the test set. The most frequent split is 70/30 with 70% of the data used in training and 30% in testing. The training data will be used to build our model and then we will blindly predict the test data set and score how we did against what the actual values are. Yay! Just like school. Don't you just love grading your own quizzes?

Let's now split our data, run a random forest model against the train data and test the accuracy in the holdout or test data.

-------------------------------------------------------------------------------------------------------------------
indexes=sample(2,length(mtcars),replace=TRUE, prob=c(.3,.7)) 
train<-mtcars[indexes==2,] #split into train with ~70% of the data
test<-mtcars[indexes==1,] #and test data with ~30%
library(randomForest)
variables<-c('cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb') #X variables
fm <- as.formula(paste('mpg'," ~", paste(variables, collapse = "+")))
rf_model<-randomForest(fm,data=train) #training a random forest model
pred<-predict(rf_model,newdata=test) #predicting onto the holdout set

library(hydroGOF)

sim<-as.matrix(pred)
obs<-as.matrix(test$mpg) 

mse(sim,obs)  #obtaining the mean squared error of my predicted values
-------------------------------------------------------------------------------------------------------------------

When I run the above, I get a mean square error (mse) of 5.76 but you probably won't. In fact, when I run the SAME EXACT code a second time, I get a mse of 5.31. This brings me to my next point, South Park is an amazing show. Or, in other words, randomness is something to guard against in data modeling.

Oftentimes you can generate a model that will predict beautifully onto a single test set because it randomly got lucky and picked the right answers. To make things more vivid, if I were to give you a multiple choice test with only 1 question having 4 possible answers, the probability of you blindly
 getting a 100 is 25%. That said, a guessing model with 25% accuracy is atrocious, but if you were to score it once, you would have a 25% chance of believing that simple guessing was the best route or a 25% chance of getting fired, canned, receiving your walking papers, crying like a little baby who just had it's blanky stolen... you get the point. So, how do we guard against random chance here?

Cross validation is a method which scores the model against several test sets with several different training sets. And how will that prevent you from getting canned? If you do what's called 2-fold cross validation, you are randomly splitting the data twice and scoring your accuracy twice. In our example with the simple guessing, you will reduce the chance of getting fired to 12.5%. And, if we dive even further into this, if we did 10-fold cross validation we will reduce the odds of losing our job to 1 in 1024. Not too bad for a 1 question test.

For those of you who just want to jack some code I've spent countless time locked away in my office dungeon learning, here's a quick function I built that takes as arguments the data, X variables, Y variable and a function for how to model and spits out the individual mean squared errors of each split in the data and the average of all of the errors.

-------------------------------------------------------------------------------------------------------------------
x.validate<-function(data,vtu,outVar,FUN) {
    library(hydroGOF)

    k = 10
    err.vect = rep(NA,k)
    holder<-matrix(nrow=nrow(data),ncol=2)
    
    for (i in 1:k) { #10 fold cross validation
        indexes=sample(2,length(data),replace=TRUE, prob=c(.3,.7)) 
        
        train<-data[indexes==2,] #split into train
        test<-data[indexes==1,] #and test data
        
        pred<-FUN(train_dat=train,test_dat=test,variables=vtu,y_var=outVar)
       
        sim<-as.matrix(pred)
        obs<-as.matrix(test[,outVar]) 
        
        this_cross<-cbind(test$pred,test[,outVar])
        holder<-rbind(holder,this_cross)
        
        err.vect[i] = mse(sim,obs) 
        print(paste('error for fold',i,'=',err.vect[i]))
        
    }
    
    print(paste('average error =',mean(err.vect)))
    
}
-------------------------------------------------------------------------------------------------------------------

Here's a simple model function as well that partners up with the above to make predictions. I use a cubist model here.

-------------------------------------------------------------------------------------------------------------------
Cube<-function(train_dat,test_dat,variables,y_var,committees=20){
    
    library(Cubist)
    x<-train_dat[,names(train_dat) %in% variables]
    
    committeeModel <- cubist(x = x, y = train_dat[,y_var], committees = committees)
    
    pred <- predict(committeeModel, newdata=test_dat)
    return(pred)
}
-------------------------------------------------------------------------------------------------------------------

And why don't you watch a quick video of me running it for both the cubist model and a separate linear model.

Make sure to set to 720p. For some reason it doesn't auto to it.

Here the cubist model has an average mean squared error of ~11 and the bands on my error aren't that high. The linear model has an average of ~20 and in fold 9 has an error of 80 which is extreme and would likely be large enough to have huge business ramifications if I was building a model for a company. That said, in the first iteration, the linear model had a mse less than the cubist model! Thus, if I didn't use a cross-validation technique, I would incorrectly choose a model that is half as good as the other, whups!!! Here, if I were going to choose an individual modeling method, I would pick the cubist hands down over the linear, so it's a good thing we did this cross-validation thingy.

So, today we walked through the method of scoring different models that we build and learned the importance of scoring them several times to reduce the chance of choosing an inaccurate model. I hope you enjoyed it or at least were able to steal some techniques or code to help out. Next time I'm going to write a post about how to select variables for models as selecting too many variables will introduce the much dreaded white noise and garner more errors than a well fit model.