When we talk about current machine learning adoption trends, are we expecting data scientists to be business domain experts? Is that a realistic expectation?
I think that there's this myth of the 23-year-old data scientist who knows about math, knows about computers, and knows about business. I think there are plenty of really smart, super sharp young people out there who may really know math and statistics and may know their computer hacking skills extremely well, but that substantive expertise can take years to develop.
If we are lumping business expertise into the data scientist definition, then I think that takes some time to cultivate for sure. One problem that we see with managing data scientists is that, if this is a new function in an organization, management might not really understand how they need to work.
One classic pitfall is getting bogged down with algorithms: spending all your time picking the best algorithm or trying to tune algorithms instead of focusing on the fastest way to solve the business problem. There can also be all kinds of issues between data science and IT that can cause conflict: who owns the data, who owns the tools, who owns the hardware.
Spread of Machine Learning
We're seeing machine learning in organizations now. This isn't coming out of the blue. This has a long history, and so we wanted to spend a little bit of time here. One good thing to do at first is, of course, to define machine learning, and that's really tricky.
I think, for better or for worse, machine learning has taken on a sort of pop-culture meaning: it's just a rebranding of analytics or data mining. Then there is this other, academic definition, because machine learning has been studied for so long within computer science departments at universities.
We are going to have to straddle those definitions today, because at SAS and in other places we use machine learning in both of these ways: as a rebranding of analytics, but to me also as a true branch of computer science.
To define machine learning I'm going to contrast it with statistics, and I'm not saying that machine learning is better than statistics. I'm just saying that machine learning is different from statistics. I think this is one of the easiest ways to define it.
Machine learning techniques tend to make fewer assumptions about data. In statistics, we typically look for normality of the data, or for the data to obey certain distributions. With machine learning we can often relax those expectations on the data, which is really nice. Machine learning methods also tend to sacrifice interpretability to promote greater accuracy.
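To make that contrast a little more concrete, here is a minimal sketch in Python, not something taken from the talk. It assumes a toy, noise-free data set with a curved relationship (y = x squared) and compares a straight-line least-squares fit, which assumes linearity, with a k-nearest-neighbor regressor, which assumes no functional form. The helper names `fit_line`, `fit_knn`, and `mse` are hypothetical and exist only for this illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (assumes a linear relationship)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=3):
    """k-nearest-neighbor regression: averages the k closest training targets."""
    points = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)

    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on held-out points."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Assumed toy data: a curved, noise-free relationship y = x^2.
train_x = [i / 2 for i in range(-10, 11)]        # -5.0 to 5.0 in steps of 0.5
train_y = [x ** 2 for x in train_x]
test_x = [i / 2 + 0.25 for i in range(-10, 10)]  # points between the training points
test_y = [x ** 2 for x in test_x]

linear_mse = mse(fit_line(train_x, train_y), test_x, test_y)
knn_mse = mse(fit_knn(train_x, train_y), test_x, test_y)

print(f"straight-line MSE: {linear_mse:.2f}")
print(f"k-NN MSE:          {knn_mse:.2f}")
```

On this particular data the nearest-neighbor model has the lower error, because the straight-line assumption does not hold. The trade-off runs the other way on interpretability: the line comes with a slope and an intercept you can read, while the nearest-neighbor model offers no coefficients to inspect.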
Difference Between Traditional Statisticians And Data Scientists
Going back to this difference between traditional statisticians and data scientists, your traditional statistician was taught in school about concepts like bias, confounding variables, confounding features, the scientific method, design of experiments, and parsimonious results. These are all super valuable things to bring to people like me.
I came from a chemistry background. I had to learn about this stuff on the job, so I think there are things that the self-taught data scientist might not know that the PhD statistician would have had drilled into their head. Those are all very good things. One thing I think my friends who are statisticians are really good at remembering is that just because you have a lot of data doesn't mean that you have the right data.
As this data science function develops within organizations, there are a lot of pros and cons that we've seen, that we want people to watch out for, and that we want them to make the most of. What I wanted to say about algorithms: I don't know whether I'm a data scientist or what to call what I do. I've been doing data analysis for upwards of 20 years, trying to bring value to organizations. I do love the algorithms, and I do love trying different ones to find the best answers.
I love trying out new algorithms to see if they get closer to an approximation of complex phenomena. I find it reassuring somehow, from an existential perspective, that these algorithms can provide some clarity on a trend or a phenomenon that's otherwise very elusive. That said, it may be very interesting to me to try out learning vector quantization on a new body of text, to see whether the algorithm sorts it into more relevant groups of text.
But if I can do the job effectively with an algorithm that I already know, I'm not sure I'm making the best use of my organization's time by just exploring that new algorithm. If I can explore it quickly and efficiently and in a robust environment, that's fine, but if I can't, then I'm probably better off limiting myself to the algorithms that I know well, that I know are going to give me some good results, instead of just feeding my intellectual curiosity.
I think that's a balance we have to seek. Innovation in algorithms has to come from somewhere, and if it doesn't come from us, who is it going to come from? So I think we definitely have to seek a healthy balance between the intellectual curiosity that we have in trying out new algorithms and getting certain tasks done.