Machine learning and prediction without understanding
June 5, 2013 29 Comments
Big data is the buzzword du jour, permeating from machine learning to hadoop powered distributed computing, from giant scientific projects to individual social science studies, and from careful statistics to the witchcraft of web-analytics. As we are overcome by petabytes of data and as more of it becomes public, it is tempting for a would-be theorist to simply run machine learning and big-data algorithms on these data sets and take the computer’s conclusions as understanding. I think this has the danger of overshadowing more traditional approaches to theory and the feedback between theory and experiment.
Of course, I am not saying anything new, Chomsky has famously expressed his skepticism along similar lines with the prevalence of Machine Learning in linguistics research:
It’s true there’s been a lot of work on trying to apply statistical models to various linguistic problems. I think there have been some successes, but a lot of failures. There is a notion of success … which I think is novel in the history of science. It interprets success as approximating unanalyzed data.
In this case, though, it seems that Chomsky might be fighting a losing battle. A lot of the funding for research is driven by results and predictions, and — for many — understanding is just an inessential step in that process. In linguistics, this is best summarized by a quip attributed to Frederick Jelinek:
Every time I fire a linguist, the performance of our speech recognition system goes up.
Jelinek is referring to the advantage that purely statistical machine learning methods provide over explicit linguistic models. Coincidently, machine learning can also provide us with the tools to formally look at this debate. We can model learning a theory from data as proper learning: our real world is given by some real function and our algorithm tries to find some function that closely approximates this. This function class is usually something structured that we can understand, like deterministic finite state machines, Markov decision processes, or probabilistic finite state automata. However, we can make the learning algorithm’s job easier by giving it more freedom in the representations it uses, by asking for a hypothesis . In fact, we can consider the most general hypotheses class by asking the learning algorithm to generate a prediction instead of a theory. This choice makes the algorithm itself the hypothesis and thus the hypotheses class is equivalent to all efficiently computable functions. This is how a computer scientist can operationalize prediction and understanding. In a classic result, Pitt & Valiant (1988) showed that there are certain concept classes that cannot be efficiently properly learnt, but that we can learn to predict. In other words, it really is easier to predict than to understand.
Ishanu Chattopadhyay — the third speaker on the second day of the 2nd workshop on Natural Algorithms and the Sciences — considered a very natural variant on this idea. A common task in science is, given finite samples from two sources of data, to determine if the sources implement the same underlying noisy process or two different ones. Further, if the data is generated by two different processes then can we say how different these processes are? If you set one of your sources as your null hypothesis then this is equivalent to hypothesis testing
Chattopadhyay, Wen, & Ray (2010) considered a processes that is governed by (or well approximated by) probabilistic finite state automata (PFSAs). We can think of this source as a PFSA inside black box: we can press a button that will generate a symbolic stream of observations but we cannot peek inside to see how the machine works. The authors also assume there are given a finite number of candidate PFSAs that they can examine as potential candidates for approximating the environment, their goal is to determine (in an online manner) , which candidate approximates the black box best and to what extent. Before they start receiving data from the black box, the authors use the candidate PFSAs to build annihilators that serve as a type of filter for the incoming data. The data that makes it through the filter is then compared to the simplest model that just outputs uncorrelated letters uniformly at random, and if it matches this behavior then the PSFA that generated this annihilator is a good approximation of the data.
More formally, the authors build an abelian group on the space of probability measures that induces a group structure on a restricted set of irreducible PFSAs. The restriction is fundamental since if this procedure worked for reducible PFSAs then it could be used to break the RSA cryptosystem — that secures all our online transactions — by the method of Kearns & Valiant (1994). The group operation corresponds to their annihilation procedure on data streams, the identity element in the null PFSA that generates symbols uniformly at random, and the inverse is computed during their preprocessing of the candidate PFSAs into annihilation procedures. The algorithm comes down to adding the unknown stream to inverses of the candidates in this abelian group and seeing if the result matches the identity. If the resulting stream (statistically) matches the identity then the inverted element must have been (a close approximation of) our unknown function.
The above procedure is efficient but inadvertently builds a PFSA representation of the original data — it provides some understanding. In his talk, Chattopadhyay showed that the procedure can be extended to the problem of comparing two data sources without first learning either one’s representation as a PFSA. We can compute the inverse directly on the raw stream data instead of first building a PFSA and then inverting it. Unfortunately, if we have output symbols then we need independent instances of the stream we want to invert. However, in this understanding-free setting, the total number of samples needed to determine with high accuracy if two sources are generated by the same process is lower than if we had to gain understanding by first building the PFSA. It is easier to compare than to understand.
This work has an obvious application in fields like finance, where a real model is impossible to come by. A trader can use old data to create the inverted stream annihilator and use it to annihilate new data tic-by-tic as it comes in to detect anomalies: the process suddenly shifting from its historic process to a new one. In the rest of his talk, Chattopadhyay stressed similar successes for his approach in other domains, and showed that it can match the behavior of much more carefully constructed understanding-rich models of anomaly detection in domains like the detection of gravitational waves.
If we can skip understanding and just get quick results, then why bother with theory and analytic treatments? It is tempting to say that more data and statistical analysis can’t possibly hurt (except maybe the opportunity cost of collecting it), they just give more playthings for theorists. But for me, the real problem is that a focus on data changes priorities. Scientists (due mostly to science funding) start to move away from trying to explain and understand phenomena and more to collecting data and data-mining it to make predictions without understanding. A focus on blind statistics on big data often provides great practical results in the short term (and that is why it wins funding) at the expense of the understanding needed for long term development of science. In particular I suspect that after a short success it will produce a field that doesn’t know how to ask new and interesting questions. A field that must rely on continuing to mine slight variations on the same sort of data for harder and harder to find practical gems. Unfortunately, this social consideration is not built into our machine learning based analysis.
Note: this is the sixth (and last!) review blogpost of a series (1, 2, 3, 4, and 5) on the recent workshop on Natural Algorithms and the Sciences at Princeton’s Center for Computational Intractability. Still to come is a general overview post summarizing the six reviews and making them easier to navigate.
Chattopadhyay, Ishanu, Wen, Yicheng, & Ray, Asok (2010). Pattern Classification In Symbolic Streams via Semantic Annihilation of Information American Control Conference arXiv: 1008.3667v1
Kearns, M., & Valiant, L. (1994). Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM, 41(1), 67-95.
Pitt, L., & Valiant, L. G. (1988). Computational limitations on learning from examples. Journal of the ACM, 35(4), 965-984.