Data Science: October 2014

Relatively recently, the data science moniker has been getting a lot of attention, garnering such titles as "sexiest job of the 21st century". Well, I have met quite a few data scientists in my day and must concur: we are a sexy bunch.

[Photo: me at a bar looking stylish]

But seriously, while discussions around forecasts, customer clustering analyses, and consumer prediction might not exactly be proverbial panty-droppers, for business they absolutely are. Many businesses, and even entire industries, currently run their day-to-day operations on simple "gut instinct", and data science is the sleek, powerful alternative to that archaic way of operating in almost every nook and cranny of a firm.

For me it has been quite a transformation, from getting a PhD in the so-called "dismal science" of Economics to moving into the seductive data science arts. One might say it was similar to a caterpillar's transformation into a beautiful butterfly... or one might not.

Economics at its heart is a world driven by causal analysis (variable x causes variable y to move, and by exactly z units), while data analytics focuses mainly on prediction (variables a, b, and c predict variable y to be z amount). As such, you could certainly overlay some Venn diagrams between the two and find ample space in the middle, but methodologically they are starkly opposed. In causal research, we focus on randomized trials, instrumental variables, discontinuities, difference-in-differences, and other methods intended to isolate the effect of one or a few variables on an outcome variable. The methods in predictive analytics have such catchy titles as random forests, support vector machines, and k-means clustering, along with numerous others, all with the intent of predicting a single outcome variable from every variable that can be considered important. In that vein, today I'm going to focus on the differences between the two and the goals of each, along with some interesting conclusions on the actual power of a simple linear regression.

As is done so often in Economics, let's be reductionist and create a very basic world where we have some variable, call it INCOME, which predicts CONSUMPTION. It's a simple idea and can be written as an algebraic equation, CONSUMPTION = b*INCOME, which you can plot as a straight line on a graph just like we all did way back in Jr. High. Yes it's true, a large portion of my PhD was spent trying to understand the concepts that were taught to me when I was 12. Woe is me!

Now let's write some code in R that generates random data for the variable INCOME and maps that onto CONSUMPTION. Below I take 1,000 random draws on numbers between 50 and 100 with replacement as my INCOME. You can think of it as spinning a wheel and having some dollar amount between $50 and $100 poof right into your hand. If only...
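The post's own R snippet is not shown here, so here is a minimal Python sketch of the same step. The slope of 0.45 and the noise term e are assumptions chosen to match the estimate quoted later in the post. The actual data-generating equation is not given.

```python
import random

random.seed(42)  # fixed seed so the draws are reproducible

N = 1000
# 1,000 integer draws between 50 and 100, sampled with replacement
income = random.choices(range(50, 101), k=N)

# Assumed data-generating process (not stated in the post):
# CONSUMPTION = 0.45 * INCOME + e, with small Gaussian noise e
B_TRUE = 0.45
consumption = [B_TRUE * x + random.gauss(0, 1) for x in income]

print(income[:5])
print([round(c, 2) for c in consumption[:5]])
```

Changing the seed gives a different spin of the wheel while the relationship between the two variables stays the same.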

And, for kicks, let's run a basic regression and find 'b' from earlier.
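As a rough Python illustration (not the post's R code), the slope of the no-intercept model CONSUMPTION = b*INCOME can be computed by hand as sum(x*y) / sum(x*x). The helper name `fit_through_origin` is hypothetical, and so are the true slope of 0.45 and the noise level used to build the data.

```python
import random

def fit_through_origin(x, y):
    """Least-squares slope b for the no-intercept model y = b * x."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / sxx

random.seed(42)
income = random.choices(range(50, 101), k=1000)
# Assumed data-generating process: true slope 0.45 plus small Gaussian noise
consumption = [0.45 * x + random.gauss(0, 1) for x in income]

b_hat = fit_through_origin(income, consumption)
print(f"estimated b = {b_hat:.3f}")
```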

So b ~ 0.45, which closely follows how we generated the data. Now, in the real world, how much you consume depends on quite a few variables, not just "income", so let's add in other things that might matter, like INTEREST RATES, HEALTH, and AGE.

Alrighty then, let's view the original equation again in a world that's now slightly more dynamic and compare it to a regression which includes all important variables.

If this were Econometrics 101, I would highlight that the coefficient on INCOME doesn't really change between the two regressions because the omitted variables are orthogonal to it (meaning they are uncorrelated with INCOME and so tell you nothing about it). And at this juncture we could write out some elaborate matrix algebra proof to demonstrate this statistical fact, which I'm pretty sure tickles nobody's fancy, so let's move on, shall we?
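For anyone who would rather see it than prove it, here is a small Python sketch of the same point. Everything numeric in it is made up for illustration: the ranges for INTEREST RATES, HEALTH and AGE and all four coefficients are assumptions. The `ols` helper is a hypothetical hand-rolled solver of the normal equations. Both regressions include an intercept, because the omitted variables have non-zero means.

```python
import random

def ols(columns, y):
    """OLS with an intercept via the normal equations.

    Returns [intercept, b1, b2, ...] in the order of `columns`.
    """
    rows = [[1.0] + list(r) for r in zip(*columns)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

random.seed(0)
n = 1000
income = random.choices(range(50, 101), k=n)
# The extra variables are drawn independently of INCOME (assumed ranges)
rate = [random.uniform(1, 10) for _ in range(n)]
health = [random.uniform(0, 10) for _ in range(n)]
age = [random.uniform(20, 70) for _ in range(n)]
# Hypothetical coefficients, for illustration only
consumption = [0.45 * i - 0.5 * r + 0.3 * h + 0.1 * g + random.gauss(0, 1)
               for i, r, h, g in zip(income, rate, health, age)]

b_short = ols([income], consumption)[1]                    # INCOME only
b_long = ols([income, rate, health, age], consumption)[1]  # all variables
print(f"INCOME coefficient, short regression: {b_short:.3f}")
print(f"INCOME coefficient, long regression:  {b_long:.3f}")
```

If the extra variables were instead built from INCOME, the short-regression coefficient would absorb part of their effect and drift away from 0.45.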

Ok, with that tedious intro behind us, let's get to the fun stuff. How about we try our hand at some prediction, specifically of CONSUMPTION? I mean, in all honesty, which question do you want answered: whether INCOME affects people's CONSUMPTION, or what CONSUMPTION is going to be this upcoming quarter?

Let's take the more realistic state of the world (example 2) and try to predict what CONSUMPTION will be given all of our other variables. Here I first separate my data by taking 70% of the data for building a prediction model and the rest for testing my model (this is a concept called out-of-sample prediction). Below you can see some simple code and a quick plot of our "residuals", or the differences between what we predicted in the testing data set and what consumption actually was.
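A minimal Python sketch of a 70/30 out-of-sample check might look like the following. It is not the post's R code. The coefficients and variable ranges are assumptions, `ols` is a hypothetical hand-rolled solver, and it prints summary numbers in place of a residual plot.

```python
import random

def ols(columns, y):
    """OLS with an intercept via the normal equations (Gauss-Jordan solve)."""
    rows = [[1.0] + list(r) for r in zip(*columns)]
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

random.seed(1)
n = 1000
income = random.choices(range(50, 101), k=n)
rate = [random.uniform(1, 10) for _ in range(n)]
health = [random.uniform(0, 10) for _ in range(n)]
age = [random.uniform(20, 70) for _ in range(n)]
# Hypothetical coefficients, for illustration only
consumption = [0.45 * i - 0.5 * r + 0.3 * h + 0.1 * g + random.gauss(0, 1)
               for i, r, h, g in zip(income, rate, health, age)]
features = [income, rate, health, age]

# Shuffle the row indices, then keep 70% for training and 30% for testing
idx = list(range(n))
random.shuffle(idx)
cut = n * 7 // 10
train, test = idx[:cut], idx[cut:]

coef = ols([[col[i] for i in train] for col in features],
           [consumption[i] for i in train])

def predict(i):
    """Predicted CONSUMPTION for row i from the fitted coefficients."""
    return coef[0] + sum(c * col[i] for c, col in zip(coef[1:], features))

# Residual = actual minus predicted, computed on the held-out rows only
residuals = [consumption[i] - predict(i) for i in test]
mean_resid = sum(residuals) / len(residuals)
max_abs_resid = max(abs(r) for r in residuals)
print(f"mean residual: {mean_resid:.3f}, largest miss: {max_abs_resid:.3f}")
```

The point of scoring on the held-out 30% is that the model never saw those rows, so the residuals reflect predictive accuracy and not just fit.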

As you can see, we are doing decently well. Our predictions aren't missing by a lot and seem to be tightly centered around zero. But what if we modified the world even further? If you can remember back to "piece-wise" equations in Algebra II (I know, painful memories), oftentimes a single linear equation doesn't describe every state of the world. Let's assume that at different levels of income you have different linear equations. So, I'm going to generate three states of the world, "poverty", "middle-class" and "wealthy", and build a piece-wise function with one linear equation for each state. Disclaimer: these equations were randomly thought up by someone lying in bed at 2 in the morning and as such probably don't come close to describing reality, but they will be used for instructional purposes.
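The post's actual equations are not shown, so the Python sketch below invents its own. The bracket cut-offs (65 and 85), the intercepts and the slopes are all hypothetical, and only INCOME is used to keep things short.

```python
import random

# Hypothetical brackets: (name, upper bound, intercept, slope).
# None of these numbers come from the post.
BRACKETS = [
    ("poverty", 65, 0.0, 0.9),        # INCOME below 65
    ("middle-class", 85, 26.0, 0.5),  # INCOME from 65 up to 84
    ("wealthy", 101, 60.0, 0.1),      # INCOME of 85 and above
]

def state_of(income):
    """Return (name, intercept, slope) for the first bracket whose bound exceeds income."""
    for name, upper, intercept, slope in BRACKETS:
        if income < upper:
            return name, intercept, slope
    raise ValueError("income above 100 is not covered by this sketch")

random.seed(7)
income = random.choices(range(50, 101), k=1000)

consumption, states = [], []
for x in income:
    name, intercept, slope = state_of(x)
    states.append(name)
    # Small random variation term e, assumed to be N(0, 1)
    consumption.append(intercept + slope * x + random.gauss(0, 1))

for name, *_ in BRACKETS:
    print(name, states.count(name))
```

Each state has its own slope, so an extra dollar of INCOME moves CONSUMPTION by a different amount depending on which bracket you are in.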

Now let's check out our residuals after we train a linear function on this new set of data.

Wow! It looks like we aren't doing so hot here. There are some pretty big misses even though our random variation term, "e", isn't very large. How about we try a more powerful prediction model, a Cubist model, which partitions the data and then trains a linear function on each partition? Below is the graph for the residuals of the Cubist.

Cha ching! This looks very similar to our initial residual plot. And, for good measure let us check out a selection of the data with predictions (linear and cubist) and the two functions' mean squared errors (a way of quantifying our prediction mistakes).
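To make the comparison concrete without Cubist itself, here is a Python sketch that pits one straight line against a much cruder stand-in: a separate least-squares line per income bracket, with the breakpoints handed to the model. Cubist learns its partitions from the data, so this is only an illustration of the "partition, then fit lines" idea. The brackets, coefficients and helper names are all hypothetical.

```python
import random

# Hypothetical brackets: (name, upper bound, intercept, slope)
BRACKETS = [("poverty", 65, 0.0, 0.9),
            ("middle-class", 85, 26.0, 0.5),
            ("wealthy", 101, 60.0, 0.1)]

def state_of(income):
    """Return (name, intercept, slope) for the first bracket whose bound exceeds income."""
    for name, upper, intercept, slope in BRACKETS:
        if income < upper:
            return name, intercept, slope
    raise ValueError("income above 100 is not covered by this sketch")

def fit_line(x, y):
    """Simple least-squares line; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def mse(predicted, actual):
    """Mean squared error between two equal-length sequences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

random.seed(7)
income = random.choices(range(50, 101), k=1000)
consumption = [state_of(x)[1] + state_of(x)[2] * x + random.gauss(0, 1)
               for x in income]

# 70/30 split; the draws are already in random order
cut = len(income) * 7 // 10
x_train, y_train = income[:cut], consumption[:cut]
x_test, y_test = income[cut:], consumption[cut:]

# Model 1: a single line for everyone
a_lin, b_lin = fit_line(x_train, y_train)
pred_linear = [a_lin + b_lin * x for x in x_test]

# Model 2: one line per bracket (breakpoints given, not learned)
fits = {}
for name, *_ in BRACKETS:
    pairs = [(x, y) for x, y in zip(x_train, y_train) if state_of(x)[0] == name]
    fits[name] = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
pred_piecewise = []
for x in x_test:
    intercept, slope = fits[state_of(x)[0]]
    pred_piecewise.append(intercept + slope * x)

mse_linear = mse(pred_linear, y_test)
mse_piecewise = mse(pred_piecewise, y_test)
print(f"MSE, single line:      {mse_linear:.2f}")
print(f"MSE, line per bracket: {mse_piecewise:.2f}")
```

The single line has to average over three different slopes, so its misses include the shape it cannot capture on top of the noise.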

As we can clearly see, in a somewhat dynamic world the simple linear regression we have all spent a large portion of our lives learning and perfecting is relatively powerless at predicting, while the more flexible tool, the Cubist, lets us predict far more accurately.

I hope this high-level foray into the world of predictive analytics has been intriguing. In the future I plan on delving deeper into the art of prediction while making some pit-stops in web development and data visualization since these are also useful tools in the utility belt of a macho suave data guy.

Data Science

Monday, October 27, 2014

Intro to Predictive Analytics: The Economist's Perspective