April 20, 2014 1 Comment
A large chunk of machine learning (although not all of it) is concerned with predictive modeling, usually in the form of designing an algorithm that takes in some data set and returns an algorithm (or sometimes, a description of an algorithm) for making predictions based on future data. In terminology more friendly to the philosophy of science, we may say that we are defining a rule of induction that will tell us how to turn past observations into a hypothesis for making future predictions. Of course, Hume tells us that if we are completely skeptical then there is no justification for induction — in machine learning we usually know this as a no-free lunch theorem. However, we still use induction all the time, usually with some confidence because we assume that the world has regularities that we can extract. Unfortunately, this just shifts the problem since there are countless possible regularities and we have to identify ‘the right one’.
Thankfully, this restatement of the problem is more approachable if we assume that our data set did not conspire against us. That being said, every data-set, no matter how ‘typical’ has some idiosyncrasies, and if we tune in to these instead of ‘true’ regularity then we say we are over-fitting. Being aware of and circumventing over-fitting is usually one of the first lessons of an introductory machine learning course. The general technique we learn is cross-validation or out-of-sample validation. One round of cross-validation consists of randomly partitioning your data into a training and validating set then running our induction algorithm on the training data set to generate a hypothesis algorithm which we test on the validating set. A ‘good’ machine learning algorithm (or rule for induction) is one where the performance in-sample (on the training set) is about the same as out-of-sample (on the validating set), and both performances are better than chance. The technique is so foundational that the only reliable way to earn zero on a machine learning assignments is by not doing cross-validation of your predictive models. The technique is so ubiquotes in machine learning and statistics that the StackExchange dedicated to statistics is named CrossValidated. The technique is so…
You get the point.
If you are a regular reader, you can probably induce from past post to guess that my point is not to write an introductory lecture on cross validation. Instead, I wanted to highlight some cases in science and society when cross validation isn’t used, when it needn’t be used, and maybe even when it shouldn’t be used.
Read more of this post