Saturday, February 28, 2015

Data Mining Sports

Deflate-gate. Whether you care about sports or not, a lot of people have a lot to say about the recent New England Patriots scandal over deflating the balls in the Conference Championship. Personally, I follow football very little, preferring to devote my sports watching time to the amazing sport of Ultimate Frisbee. But, since rarely do Frisbee scandals make their way to the front page, I have instead taken the time to scrape some NFL data from the web and run my own analysis on whether the New England Patriots were in fact benefiting from improperly inflated footballs. Now I have to say, I'm not trying to make ANY claims as to whether the Patriots cheated the rules of the National Football League. The goal of my post is simply that of teaching data analysis and having a little fun with interesting data.

In a recent article which I will be following for some of my analysis, the Patriots plays per fumble were put under a microscope before and after the 2007 rules change which allowed away teams to supply their own footballs to games. In order to analyze this, we first have to get NFL data. Let's take data from 2000-2014 and see what trends we can find. Here I use Python to scrape data from the web (you can also download but I wanted to walk through how to scrape data and have the code available so others can validate) and all of my code is available here including the code in R which does the analysis.

A powerful method of determining changes in trends over time is what is referred to as a difference in differences estimation. In this case, it refers to taking the difference among the Patriots in plays per fumble before and after 2007 and then subtracting the difference among the rest of the league before and after 2007. Yes, that seems confusing so let's make it simple. The Pats averaged 42.96 plays per 1 fumble before the rules change in 2007. After they averaged 72.4. The rest of the league was at 42.8 before and 48.1 after. So, the difference in differences comes out to (72.4 - 42.96) - (48.1-42.8) or 24.2 which is around a 50% jump in plays per fumble! Let's view a graph of what that looks like.



For the statisticians out there, here's a print out of the associated regression:



As we can see, this result comes back as significant at a very high level (24.2, <.01). What that means is that 99.526% of the time, we will not see these results if we were to randomly pick from the league's normal distribution of fumbles per play. So, maybe the Pats ARE actually cheaters... Hold up Colts fans and just about everybody else. There is another team that has similar numbers when this style of regression is run. If I run the numbers for all of the teams in the league and only print out those with a high significance I also find that the Atlanta Falcons have suspiciously high numbers of plays before a fumble (refer to "team_after").


So maybe it's a case of multiple cheaters! We should just expel the lot of them and teach these kids to behave right?! But, let's do a little inspection into the Falcons before we rush in to judge their ridiculously outrageous statistics. While the Patriots had the same type of QB for the entire period from 2000-2014, the Falcons went from Mike Vick to Matt Ryan coincidentally in 2007. Mr. Vick happens to be a large outlier in the data. I mean, the guy had 16 fumbles in 2004. That's one per game! So, let's go ahead and re-run the numbers accounting for that. All I'm going to do is replace Vick's fumbles with Matt Ryans yearly average (4.4) and here is the new output:


Wow! The results almost completely disappear now. Once again, we are back to thinking that there is a lone shooter. But, skeptics might say that it's unfair to compare the Patriots to the entirety of the rest of the league, which is what this analysis currently does. That's a very astute analysis skeptics. Instead, let's use a technique which involves using "synthetic controls". Sounds confusing, but in reality it's just taking an average of the teams which are closest to the Pats leading up to the 2007 rules change (we will call that "synthetic Patriots") and then comparing what that team does to what the Patriots do. Below is a print out of teams and how much weight they get in the creation of the "synthetic Patriots".


So, here we can see that the Denver Broncos receive 47% of the weight and the Rams come in second at 24%. Now that we have a synthetically created team, let's see how they match up. Here is a graph of that:

Well, it looks like the Patriots are a little closer to the synthetic team. But, they still outperform them in every year except 2013.

In conclusion, maybe the Patriots were fully aware of their actions and maybe they deserve some sort of punishment. I honestly don't care either way, but the data is interesting and I hope you could follow along and perhaps glean some information on how to go about your own sports research.


Wednesday, November 26, 2014

Accuracy Testing Predictive Models

Hi fellow data scientists and others who accidentally stumbled into what must seem like a tutorial of an Asian language. Today the goal is to make sure we don't architect a library without accounting for the weight of the books, or, in predictive model speak, we are going to out-of-sample test the accuracy of the models we construct. In the following tutorial I'm going to walk through generating
and testing of predictive models for accuracy against the mtcars data set which is readily available in R. First off, if you are playing along at home, let's fire up an R session and open up the mtcars data set. That looks like:
------------------------
data(mtcars)
head(mtcars)
------------------------





Laid out before us we have  a number of observations of different cars with various statistics describing them. While we could spend days just on this data set analyzing information on each variable, let's concentrate on one and how to predict it given the others. So as I'm writing this gas is less than $3.00 a gallon, but let's assume that people still care about miles per gallon (mpg) as much as they did back when it was in the $4.00 range. For you research people, our 'y' variable will thus be mpg with the remaining variables being our vector X of independent or predictor variables. Thus, our formula will be mpg = cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb.

Let's hop, skip and jump over all of the modeling technique discussion and straight to validation now. In validating data models, one of the most widely used methods is out-of-sample testing. All of those hyphens are because we now need to take a perfectly good data sample and split the it into what is
typically called a training and a testing data set. Randomly we will select some of the observations to be placed in the training set and the remainder will fall in the test set. The most frequent split is 70/30 with 70% of the data used in training and 30% in testing. The training data will be used to build our model and then we will blindly predict the test data set and score how we did against what the actual values are. Yay! Just like school. Don't you just love grading your own quizzes?

Let's now split our data, run a random forest model against the train data and test the accuracy in the holdout or test data.

-------------------------------------------------------------------------------------------------------------------
indexes=sample(2,length(mtcars),replace=TRUE, prob=c(.3,.7)) 
train<-mtcars[indexes==2,] #split into train with ~70% of the data
test<-mtcars[indexes==1,] #and test data with ~30%
library(randomForest)
variables<-c('cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb') #X variables
fm <- as.formula(paste('mpg'," ~", paste(variables, collapse = "+")))
rf_model<-randomForest(fm,data=train) #training a random forest model
pred<-predict(rf_model,newdata=test) #predicting onto the holdout set

library(hydroGOF)

sim<-as.matrix(pred)
obs<-as.matrix(test$mpg) 

mse(sim,obs)  #obtaining the mean squared error of my predicted values
-------------------------------------------------------------------------------------------------------------------

When I run the above, I get a mean square error (mse) of 5.76 but you probably won't. In fact, when I run the SAME EXACT code a second time, I get a mse of 5.31. This brings me to my next point, South Park is an amazing show. Or, in other words, randomness is something to guard against in data modeling.

Oftentimes you can generate a model that will predict beautifully onto a single test set because it randomly got lucky and picked the right answers. To make things more vivid, if I were to give you a multiple choice test with only 1 question having 4 possible answers, the probability of you blindly
 getting a 100 is 25%. That said, a guessing model with 25% accuracy is atrocious, but if you were to score it once, you would have a 25% chance of believing that simple guessing was the best route or a 25% chance of getting fired, canned, receiving your walking papers, crying like a little baby who just had it's blanky stolen... you get the point. So, how do we guard against random chance here?

Cross validation is a method which scores the model against several test sets with several different training sets. And how will that prevent you from getting canned? If you do what's called 2-fold cross validation, you are randomly splitting the data twice and scoring your accuracy twice. In our example with the simple guessing, you will reduce the chance of getting fired to 12.5%. And, if we dive even further into this, if we did 10-fold cross validation we will reduce the odds of losing our job to 1 in 1024. Not too bad for a 1 question test.

For those of you who just want to jack some code I've spent countless time locked away in my office dungeon learning, here's a quick function I built that takes as arguments the data, X variables, Y variable and a function for how to model and spits out the individual mean squared errors of each split in the data and the average of all of the errors.

-------------------------------------------------------------------------------------------------------------------
x.validate<-function(data,vtu,outVar,FUN) {
    library(hydroGOF)

    k = 10
    err.vect = rep(NA,k)
    holder<-matrix(nrow=nrow(data),ncol=2)
    
    for (i in 1:k) { #10 fold cross validation
        indexes=sample(2,length(data),replace=TRUE, prob=c(.3,.7)) 
        
        train<-data[indexes==2,] #split into train
        test<-data[indexes==1,] #and test data
        
        pred<-FUN(train_dat=train,test_dat=test,variables=vtu,y_var=outVar)
       
        sim<-as.matrix(pred)
        obs<-as.matrix(test[,outVar]) 
        
        this_cross<-cbind(test$pred,test[,outVar])
        holder<-rbind(holder,this_cross)
        
        err.vect[i] = mse(sim,obs) 
        print(paste('error for fold',i,'=',err.vect[i]))
        
    }
    
    print(paste('average error =',mean(err.vect)))
    
}
-------------------------------------------------------------------------------------------------------------------

Here's a simple model function as well that partners up with the above to make predictions. I use a cubist model here.

-------------------------------------------------------------------------------------------------------------------
Cube<-function(train_dat,test_dat,variables,y_var,committees=20){
    
    library(Cubist)
    x<-train_dat[,names(train_dat) %in% variables]
    
    committeeModel <- cubist(x = x, y = train_dat[,y_var], committees = committees)
    
    pred <- predict(committeeModel, newdata=test_dat)
    return(pred)
}
-------------------------------------------------------------------------------------------------------------------

And why don't you watch a quick video of me running it for both the cubist model and a separate linear model.

Make sure to set to 720p. For some reason it doesn't auto to it.

Here the cubist model has an average mean squared error of ~11 and the bands on my error aren't that high. The linear model has an average of ~20 and in fold 9 has an error of 80 which is extreme and would likely be large enough to have huge business ramifications if I was building a model for a company. That said, in the first iteration, the linear model had a mse less than the cubist model! Thus, if I didn't use a cross-validation technique, I would incorrectly choose a model that is half as good as the other, whups!!! Here, if I were going to choose an individual modeling method, I would pick the cubist hands down over the linear, so it's a good thing we did this cross-validation thingy.

So, today we walked through the method of scoring different models that we build and learned the importance of scoring them several times to reduce the chance of choosing an inaccurate model. I hope you enjoyed it or at least were able to steal some techniques or code to help out. Next time I'm going to write a post about how to select variables for models as selecting too many variables will introduce the much dreaded white noise and garner more errors than a well fit model.

Monday, October 27, 2014

Intro to Predictive Analytics: The Economist's Perspective

Relatively recently the data science moniker has been getting a lot of attention garnering such titles as "sexiest job of the 21st century". Well, I have met quite a few data scientists in my day and must concur, we are a sexy bunch.

me at a bar looking stylish
But seriously, while discussions around forecasts, customer clustering analyses, and consumer prediction might not exactly be proverbial panty-droppers, for business they absolutely are. Many businesses and even industries are currently functioning on simple "gut-instinct" with their day-to-day operations and data science is the sleek, powerful solution to an archaic operational architecture in regards to almost every nook and cranny of a firm.

For me it has been quite a transformation, from getting a PhD in the so-called "dismal science" of Economics to the transition into the seductive data science arts. One might say it was similar to a caterpillar's transformation into a beautiful butterfly... or one might not.

Economics at its heart is a world driven by causal analysis (variable x causes variable y to move and by exactly z units) while data analytics focuses mainly on prediction (variables a, b, and c predict variable y to be z amount). As such, you could certainly overlay some Venn diagrams between the two and find ample space in the middle, but methodically they are starkly opposed to each other. Whereas in causal research, we focus on randomized trials, instrumental variables, discontinuities, difference-in-differences and other methods that are intended to isolate the effect of one or a few variables on an outcome variable, the methods in predictive analytics have such catchy titles as random forests, support vector machines, k-means clustering along with numerous others all with the intent of predicting the outcome of a single variable based on all variables that can be considered important. In that vein, today I'm going to focus in on the differences between the two and the goals of each along with delving into some interesting conclusions on the actual power of a simple linear regression.

As is done so often in Economics, let's be reductionist and create a very basic world where we have some variable, call it INCOME, which predicts CONSUMPTION. It's a simple idea and can be written in an algebraic equation: CONSUMPTION = b*INCOME and you can plot this on a number line just like we all did way back in Jr. High. Yes it's true, a large portion of my PhD was spent trying to understand the concepts that were taught to me when I was 12. Woe is me!

Now let's write some code in R that generates random data for the variable INCOME and maps that onto CONSUMPTION. Below I take 1,000 random draws on numbers between 50 and 100 with replacement as my INCOME. You can think of it as spinning a wheel and having some dollar amount between $50 and $100 poof right into your hand. If only...









And, for kicks, let's run a basic regression and find 'b' from earlier.








So b ~ .45 which closely follows how we generated the data. Now, in a real world, how much you consume is based on quite a few variables, not just "income" so let's add in other things which might matter like  INTEREST RATES, HEALTH and AGE.














Alrighty then, let's view the original equation again in a world that's now slightly more dynamic and compare it to a regression which includes all important variables.















If this were Econometrics 101, I would highlight that the coefficient on INCOME doesn't really change between the two regressions because the omitted variables are orthogonal to it (meaning they don't predict INCOME at all). And, at this juncture we could write out some elaborate matrix algebra proof to demonstrate this statistical fact which I'm pretty sure tickles nobody's fancy so let's move on shall we?

Ok, with that tedious intro behind us, let's get to the fun stuff. How about we try our hand at some prediction, specifically of CONSUMPTION? I mean, in all honesty what question do you want to answer, whether INCOME effects people's CONSUMPTION or what CONSUMPTION is going to be this upcoming quarter?

Let's take the more realistic state of the world (example 2) and try to predict what CONSUMPTION will be given the all of our other variables. Here I first separate my data by taking 70% of the data for building a prediction model and the rest for testing my model (this is a concept called out-of-sample prediction). Below you can see some simple code and a quick plot of our "residuals" or the differences between what we predicted in the testing data set and what consumption actually was.



As you can see, we are doing decently well. Our predictions aren't missing by a lot and seem to be highly centered around zero. But, what if we modified the world even further? If you can remember back to "piece-wise" equations in Algebra II (I know, painful memories), oftentimes a single linear equation doesn't describe all state-spaces. Let's assume that at different levels of income you have different linear equations. So, I'm going to generate three states of the world, "poverty", "middle-class" and "wealthy" and build three piece-wise functions for each state. Disclaimer: these equations were randomly thought up by someone laying in bed at 2 in the morning and as such probably don't come close to describing reality but will be used for instructional purposes.










Now let's check out our residuals after we train a linear function on this new set of data.


Wow! It looks like we aren't doing so hot here. There are some pretty big misses even though our random variation term, "e", isn't very large. How about we try a more powerful prediction model, a Cubist function, which partitions the data and then trains a linear function on each partition? Below is the graph for the residuals of the Cubist.


Cha ching! This looks very similar to our initial residual plot. And, for good measure let us check out a selection of the data with predictions (linear and cubist) and the two functions' mean squared errors (a way of quantifying our prediction mistakes).









As we can clearly see, in a somewhat dynamic world the simple linear regression we have all spent a large portion of our lives learning and perfecting is relatively powerless at predicting while the more robust tool, the Cubist, gives us the ability to predict at a very high level.

I hope this high-level foray into the world of predictive analytics has been intriguing. In the future I plan on delving deeper into the art of prediction while making some pit-stops in web development and data visualization since these are also useful tools in the utility belt of a macho suave data guy.