InetSoft Webinar: Best Practices in Data Mining

This is the transcript of a Webinar hosted by InetSoft on the topic of "Best Practices in Data Mining." The speaker is Mark Flaherty, CMO at InetSoft.

Moderator: When you talk about best practices in data mining, what are some of the first things that you tell people to keep in mind?

Flaherty: What I really try to stress is to think about the data. Actually spend some time trying to understand it, trying to go beyond what you can get just out of the box using some analytical software. It seems like there is so much capability there that sometimes we are tempted to turn our brains off when we get close to a data set.

Moderator: Right. So you really have to kind of think about the data and the context of the data. You really have to focus on what is the goal that you have in mind, right?

Flaherty: Absolutely, you need to approach it from both ends, like you said: the goal of where you want to get to, and where you are starting from. What is the population of data I am working with? Is it my customers? Am I thinking about sites for retail locations? Am I thinking about production jobs that I am trying to run out of a manufacturing plant?


All of those things might be my population or the level of analysis that I am trying to do. And you really have to put that into the proper context. I always find myself asking, “compared to what?” Now if I look at a set of customers, you think okay well, I have got a lot of one gender, for instance. Well, how do I know that? Compared to what?

Moderator: Yes, and speaking in terms of context and so forth, maybe you can talk about some of the most common use cases for data mining and specifically when you can get value from data mining? What are some of the more common examples of success stories in data mining? What do people get out of it?

Flaherty: Right, probably the most common use case is modeling some form of customer response. You are reaching out to your customers and seeing their response to various kinds of marketing or promotions, and then trying to figure out why some didn't respond, so that you can target your efforts better and message those efforts better in future contacts.

The world is changing a lot with the move from paper-based promotions to so much more being online and via e-mail, so that's certainly an area where people have picked up data mining and analysis and run with it for a long time. But I have certainly seen a lot more interest lately in social media, in trying to apply sentiment analysis.

This can include doing some text-based data mining to try and understand not only the behaviors of those customers, but what their attitudes are, what they care about, what they are thinking, and what's leading them up to that buying decision. Those are just a couple off the top of my head. If you have some other interest, let's talk about it.

The basics that I talk about with customers and clients are that they have to do their data mining with an understanding of their data and an understanding of the problem they are solving. I think the current state of the art of predictive analytics still lives mostly in the scientific realm rather than in the realm of the big decision makers, and it really is important that those decision makers understand what the goals of predictive analytics are and how it can help them.

What I really try to do is help the customer or the problem solver frame their predictive problem as a classification problem. Most times when I talk to customers, they say they want to do forecasting. They want to forecast, for example, the number of customers who are going to churn, or they want to forecast inventory levels. There are a variety of things they have to go back and think about.


The classification problem in data mining is a fundamental task where the objective is to categorize data into predefined classes or labels based on input features. This problem is central to various applications, such as fraud detection, spam email filtering, medical diagnosis, and customer segmentation. The process involves training a model on a labeled dataset, where the algorithm learns to associate input features with the correct output labels. Once trained, the model can predict the class of new, unseen instances. This ability to generalize from training data to new data is crucial for the effective application of classification models.
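To make the train-then-predict loop concrete, here is a minimal sketch of a classifier in plain Python. It is not any particular product's algorithm, just an illustrative nearest-centroid model: training computes a mean feature vector per class label, and prediction assigns a new instance to the class with the closest centroid. The "responder"/"non-responder" labels and toy feature values are invented for the example.

```python
from collections import defaultdict
import math

def train_centroids(samples, labels):
    """Compute the mean feature vector (centroid) for each class label."""
    sums = defaultdict(list)
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if not sums[y]:
            sums[y] = list(x)
        else:
            sums[y] = [a + b for a, b in zip(sums[y], x)]
        counts[y] += 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def classify(centroids, x):
    """Assign x to the class whose centroid is nearest (Euclidean distance)."""
    def dist(c):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(c, x)))
    return min(centroids, key=lambda y: dist(centroids[y]))

# Toy training set: two features per customer, hypothetical labels
X = [[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.5, 5.8]]
y = ["non-responder", "non-responder", "responder", "responder"]
model = train_centroids(X, y)
print(classify(model, [5.2, 6.1]))  # predict the class of a new, unseen customer
```

The ability to generalize mentioned above is exactly what the last line exercises: the point [5.2, 6.1] was never seen during training.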

One of the primary challenges in classification is dealing with imbalanced datasets, where some classes are significantly underrepresented compared to others. This imbalance can lead to models that are biased towards the majority class, resulting in poor performance on the minority class. Techniques such as oversampling, undersampling, and the use of specialized algorithms like SMOTE (Synthetic Minority Over-sampling Technique) can help address this issue. Ensuring that the classifier performs well across all classes is essential for applications where misclassification of the minority class can have serious consequences, such as in medical diagnoses or fraud detection.
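The simplest of the rebalancing techniques mentioned above is random oversampling: duplicate minority-class examples until the classes are the same size. (SMOTE goes further by interpolating synthetic points between minority neighbors rather than duplicating; that is not shown here.) This sketch uses invented "legit"/"fraud" labels to mimic a fraud-detection imbalance.

```python
import random
from collections import Counter

def oversample_minority(samples, labels, seed=0):
    """Duplicate under-represented classes at random until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

# 9 legitimate transactions vs. 1 fraudulent one: a 9:1 imbalance
X = [[float(i)] for i in range(10)]
y = ["legit"] * 9 + ["fraud"]
Xb, yb = oversample_minority(X, y)
print(Counter(yb))  # both classes now have 9 examples
```

Oversampling is applied only to the training split, never to the evaluation data, so the reported metrics still reflect the real-world class distribution.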

Another critical aspect of the classification problem is feature selection and engineering. The quality and relevance of input features significantly impact the performance of the classification model. Feature selection involves identifying the most informative features that contribute to the prediction, while feature engineering involves creating new features from the existing ones to better capture underlying patterns. Techniques like recursive feature elimination, principal component analysis (PCA), and domain-specific knowledge can enhance the model's ability to differentiate between classes. Proper feature selection and engineering can improve model accuracy, reduce overfitting, and shorten training times.
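As a very crude stand-in for the fuller methods named above (recursive feature elimination, PCA), this sketch ranks features by variance and keeps the top k: a feature that barely varies across the dataset, like the all-zero middle column below, cannot help separate classes. The data is invented for illustration.

```python
def feature_variances(samples):
    """Variance of each feature column; near-constant features carry little signal."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(dims)]
    return [sum((row[j] - means[j]) ** 2 for row in samples) / n
            for j in range(dims)]

def select_top_k(samples, k):
    """Keep the k highest-variance features, preserving column order."""
    var = feature_variances(samples)
    keep = sorted(sorted(range(len(var)), key=lambda j: -var[j])[:k])
    return [[row[j] for j in keep] for row in samples], keep

# Three features; the middle one is constant and thus uninformative
X = [[1.0, 0.0, 10.0], [2.0, 0.0, 20.0], [3.0, 0.0, 30.0]]
reduced, kept = select_top_k(X, 2)
print(kept)  # indices of the retained feature columns
```

Real pipelines would rank features by their relationship to the target label rather than raw variance, but the shape of the operation, score each feature and drop the weakest, is the same.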

Evaluating the performance of classification models is also a complex task that requires careful consideration of various metrics. Common metrics include accuracy, precision, recall, F1-score, and the area under the Receiver Operating Characteristic (ROC) curve (AUC-ROC). Each metric provides different insights into the model's performance, and the choice of metric often depends on the specific application and the cost of false positives versus false negatives. For instance, in a medical diagnosis application, recall (sensitivity) might be more important than precision, as missing a true positive case (a patient with a disease) could be more critical than incorrectly identifying a healthy patient as having the disease. Balancing these metrics and understanding their implications is key to developing effective classification models.
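The three core metrics above follow directly from counting true positives (tp), false positives (fp), and false negatives (fn): precision = tp/(tp+fp), recall = tp/(tp+fn), and F1 is their harmonic mean. A minimal sketch, with an invented "sick"/"well" diagnosis example echoing the point about recall:

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the 'positive' class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical diagnoses: one sick patient is missed (a false negative)
truth = ["sick", "sick", "sick", "well", "well", "well"]
preds = ["sick", "well", "sick", "well", "sick", "well"]
p, r, f = precision_recall_f1(truth, preds, positive="sick")
```

Here precision and recall both work out to 2/3; in the medical setting described above, the one missed sick patient (the false negative driving recall down) would typically be weighted as the costlier error.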
