I think there is a difference between machine learning and data mining, even though the two overlap a great deal: a lot of the pioneers in data mining are still around, and many of them are also pioneers in machine learning.
In my mind part of the difference is the emphasis in the terms themselves. One emphasizes mining; the other emphasizes learning, in terms of the branding. That's one of my observations. The other thing, of course, is the need to separate empirical results from theory. People in industry care about theory to some extent, but at the end of the day it's empirical results that matter.
I think the connotation data mining sometimes brings up, whether that's a historical consequence or not, is that it's about torturing the data, mining it until it confesses to whatever preconceived notions you had going in. I think that's a little unfortunate. Learning from the data is probably more where we want to be than in that drill-till-you-find-something mentality.
Machine learning has always been used as a part of data mining. Data mining involves all the data storage and data manipulation, while machine learning, or statistics, is the part where we learn from the data. I think a lot of these comments are leading us toward the theme of automation.
There is one more point I wanted to emphasize, which is that, for whatever reason, the techniques statisticians have historically brought to the table became ill-suited at some point because they didn't scale, particularly with the number of variables.
In images the variables are highly correlated, and in text the data can be very sparse. Our friends the statisticians are working extremely hard to develop new parsimonious, interpretable, theoretically supported approaches to deal with all of this, but I think it's just easier right now to reach for a machine learning algorithm that can handle a broader set of data.
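As a small illustration of what a parsimonious approach to sparse text data can look like, here is a minimal sketch assuming scikit-learn is available; the documents, labels, and regularization strength are hypothetical, not anything discussed by the panel.

```python
# Minimal sketch: a parsimonious, interpretable model on sparse text data.
# Assumes scikit-learn; the documents and labels are hypothetical examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

documents = ["great product, works well", "terrible, broke after a day",
             "works as advertised", "broke immediately, very poor"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# TF-IDF yields a sparse matrix with one column per token; L1 regularization
# drives most coefficients to zero, keeping the model parsimonious.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
model.fit(documents, labels)
print(model.predict(["works great", "broke right away"]))
```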
Let's get onto automation, because it's a topic I speak very passionately about to data scientists and to organizations, because I think it's really important. There are a lot of people on this WebEx today, and there's a lot of interest out there in what data science can do, which is very encouraging. Still, within any individual organization there's always more demand than the actual data scientists can fill.
So how do we make the best use of those people's time, the most precious commodity they have, with the data? I know for myself, in the analysis I've done, I like to do the whole end-to-end process, from collecting the data to preparing the data. In each step you learn something about what is and isn't contained in that data, what its limitations are, and what its potential is.
The only thing I would add is that once you've got that data stream established, it shouldn't be your job to repeat that work, to make sure the data keeps coming in consistently to feed whatever production jobs and analytics are running in your organization's production environments. For that very reason, those tasks need robust automation. The same goes for running production models, scoring, and deploying those scoring jobs; in my opinion that's not something a data scientist should be spending a lot of time on.
They should be able to do that very quickly and transparently, handing it off to an IT department without a lot of effort, with a lot of confidence, and with a real clear handshake about what is being handed over, whether IT deploys it in a batch environment, in an online application, or against streaming data closer to where the data is created. They need to be sure, first of all, that what gets pushed will run, and that from a business perspective it generates exactly what it needs to. Sometimes these models need to be embedded with business logic to make sure that happens, which is why you sometimes need business experts involved in this process as well.
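As a rough illustration of that kind of handoff, here is a minimal sketch, assuming scikit-learn, joblib, and pandas, of a batch scoring job that loads a serialized model and embeds a simple business rule around its scores; the file names, feature handling, threshold, and override rule are all hypothetical.

```python
# Minimal sketch of a batch scoring handoff: a serialized model plus a
# business-rule wrapper that IT can schedule in production.
# Assumes scikit-learn, joblib, and pandas; names and rules are hypothetical.
import joblib
import pandas as pd

def score_batch(model_path: str, input_csv: str, output_csv: str) -> None:
    model = joblib.load(model_path)   # model trained and saved by the data scientist
    batch = pd.read_csv(input_csv)

    # Raw model probabilities for the positive class (assumes the model was
    # fitted on a DataFrame, so feature_names_in_ is available).
    batch["score"] = model.predict_proba(batch[model.feature_names_in_])[:, 1]

    # Business logic embedded alongside the model: act only above a threshold,
    # and never auto-approve records flagged for manual review.
    batch["decision"] = (batch["score"] >= 0.7) & (~batch["manual_review_flag"])

    batch.to_csv(output_csv, index=False)

if __name__ == "__main__":
    score_batch("churn_model.joblib", "daily_accounts.csv", "daily_decisions.csv")
```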
Making the productionalization of a machine learning model as transparent and as fluid as possible is very important. When I say transparent, think about heavily regulated industries like pharmaceuticals and financial services: it's really important that each step along the way be transparent, especially as operational decisions are being taken.
It's really important to have that whole audit trail of what data went in, what process was applied to it, and at what time. That's why having a robust decision management solution coordinating these tasks frees up the data scientist to spend more time on things like machine learning.
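To give a sense of what such an audit trail might record, here is a minimal sketch, with hypothetical file and field names, that logs a hash of the input data, the process applied, the model version, and a timestamp around each scoring run.

```python
# Minimal sketch of an audit-trail record around a scoring step.
# The log destination, field names, and example files are hypothetical.
import hashlib
import json
from datetime import datetime, timezone

def log_audit_record(input_bytes: bytes, process_name: str, model_version: str,
                     log_path: str = "audit_log.jsonl") -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),    # when it ran
        "process": process_name,                                 # what was applied
        "model_version": model_version,                          # which model
        "input_sha256": hashlib.sha256(input_bytes).hexdigest()  # what data went in
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example: record a daily batch scoring run.
with open("daily_accounts.csv", "rb") as f:
    log_audit_record(f.read(), "batch_scoring", "churn_model_v3")
```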
As a data scientist, I want to do the fun, cool stuff. I don't want to spend all day doing manual things that should be automated. As industries everywhere move towards automation, we should expect that data mining, machine learning, statistics, analytics, whatever we call it, is also going to move towards automation.
We're going to be able to do more and better things as processes that used to take a lot of our time get automated, and as we learn to trust that automation. For me that's the key: using machine learning to automate machine learning workflows.
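One small example of that idea, a minimal sketch assuming scikit-learn, is letting an automated search choose a model's hyperparameters instead of tuning them by hand; the data, model, and parameter grid here are hypothetical, not anything endorsed by the speakers.

```python
# Minimal sketch of automating part of a machine learning workflow:
# an automated hyperparameter search in place of manual tuning.
# Assumes scikit-learn; the data and parameter grid are hypothetical.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 5, 10],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=10,
    cv=5,
    random_state=0,
)
search.fit(X, y)  # the search automates model selection over the grid
print(search.best_params_, round(search.best_score_, 3))
```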