What Machine Learning Means for the Future of Data Science
Read the full article on Huffington Post
How will data science evolve with the rising popularity of machine learning in industry? originally appeared on Quora: the knowledge sharing network where compelling questions are answered by people with unique insights.
Before answering this question, one needs to think a bit about the relationship between data science and machine learning. To me personally, data science includes machine learning. Machine learning is, by definition, the ability of a machine to generalize knowledge from data (call it learning, or induction if you like). Without data, there is little machines can learn. So, if anything, the broader use of machine learning across many different industries will be a catalyst that pushes data science to increasing relevance. Machine learning is only as good as the data it is given and the ability of algorithms to consume it. My expectation is that, moving forward, a basic level of machine learning proficiency will become a standard requirement for data scientists.
This being said, for me, one of the most relevant data science skills is the ability to evaluate machine learning models. In data science we do not lack for cool stuff to do and shiny new algorithms to throw at data. What we still lack is a really good grasp of why things work and how to solve non-standard problems. One major concern I have about the (academic) machine learning perspective is its continuing focus on simple out-of-sample performance. 99% of all research papers are accepted based on their accuracy on some holdout set.
One realization I have come to over the last twelve years of working in industry is that standard academic evaluation is close to useless in most application domains. Models that perform well on some random test set can be outright harmful. This is really a rather long discussion, but I have misgivings about a number of things:
a) The typical metrics people consider (accuracy, defined as the percentage of cases a classifier gets correct, being the worst offender); see The Basics of Classifier Evaluation, Part 2 – Silicon Valley Data Science.
b) The fact that the model most often predicts the wrong thing to start with (mostly because you have no data for the ‘right’ thing); see All the Data and Still Not Enough!
c) Models being evaluated outside the context of their use. You want to evaluate how the outcome improves after you take some actions based on the predictions, not just the predictions.
d) Huge sampling problems relative to the problem one really needs to solve; people build models on the data they have, not on the data they should use, and, more problematically, evaluate them on a highly non-representative sample.
e) Adversarial situations with a mixture generating distribution where in the end the model only identifies the ‘wrong’ positives.
f) Leakage: the signal the model found is purely an artifact of the data collection, and the real performance of the model will be terrible; see Working With Data and Machine Learning in Advertising from Talking Machines.
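To make point a) concrete, here is a minimal sketch in Python. It assumes a hypothetical holdout set of 1,000 examples, of which only 10 are positive; the data and the helper names (accuracy, recall) are invented for illustration and do not come from any of the linked articles. A "model" that always predicts the majority class scores very well on percent correct while finding none of the cases one presumably cares about.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the predictions recover."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Hypothetical, heavily imbalanced holdout set: 990 negatives, 10 positives.
y_true = [0] * 990 + [1] * 10

# A trivial baseline that always predicts the most common class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)

print(f"accuracy: {acc:.2f}")  # 0.99
print(f"recall:   {rec:.2f}")  # 0.00
```

On this toy holdout the baseline is 99% accurate and still useless for finding positives, which is one reason to report class-sensitive measures alongside (or instead of) plain accuracy.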
So I do hope that data science (as the more applied arm) can channel some of the machine learning work towards practically relevant advances.