Data Mining FAQ
For those not familiar with data mining, simply mentioning the term ‘data mining’ often leads people to mentally check out. The term is carelessly thrown around, leaving the definition unclear. The truth is, the subject is full of jargon, tedious detail, and complicated math, but if you can understand some of the basic concepts, it can be extremely valuable.
To give you insight into this important topic I enlisted the help of OG’s Director of IT, Jim Kenyon. Jim has helped some of the largest brands in the world make sense of their data. At OG, we tend to look at this from a marketing perspective. In other words, what does the data tell us about our client’s marketing?
Before we look at some of the fundamentals of data mining, it’s important to understand one principle: data mining is most effective when you ask specific questions. We like to use the illustration of peeling an onion. By peeling the onion one layer at a time, you can see more of what’s really going on with your marketing and ask more specific questions.
To help you ‘peel back the onion’ and better understand data mining, specifically as it relates to marketing, we’ve asked Jim Kenyon to answer some of the most important data mining questions in language that’s easy to understand. Each question is designed to help you get the most out of data mining and understand what it takes to get started.
1. How much historical data do you recommend a client has before they can do any meaningful data mining?
We want as many observations as we can get, though we’ve had reasonable success with three years of monthly observations (36 observations). The quality of the model goes up with more data (to a point).
2. Is there a recommended time period classification (observation frequency)? In other words, does it make a difference if the data is recorded daily, weekly, monthly, or annually?
Models that use macro-economic data tend to use monthly sampling as most econometric data is reported monthly. It’s a “least common denominator”.
3. How do I know if my data is in good or bad shape? What are the indicators?
- Do you have monthly observations across a continuous time range?
- Are there valid values for all observations?
- Valid values are values that fall within the expected range for a field. For example, if the field is “age of person”, negative numbers would not be valid. Missing values are another case: if TV spending is “missing”, is it because there was no spending (in which case it should be a zero, not missing), or because accounting lost the data for that month?
- Is the data recorded on the same scale / unit of measure for each observation?
- Is the format (file layout) of the data consistent from observation to observation (or year to year)?
- Is the data recorded in a common file format (CSV, Excel, fixed column width)?
If someone answers “NO” to more than one of these questions, chances are, the data is in bad shape and will need significant work. It doesn’t necessarily mean that their data is unusable – on the contrary, most of the engagements we see have data that’s in pretty rough shape.
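A couple of the checks above (continuous monthly observations, valid and non-missing values) are easy to automate. The following is a minimal sketch in Python, not the tooling OG uses; the field names, the expected ranges, and the sample rows are all made up for illustration.

```python
def month_range(start, end):
    """Yield (year, month) tuples from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def check_monthly_data(rows, valid_ranges):
    """Return a list of human-readable problems found in the data.

    rows: dict mapping (year, month) -> dict of field values (None = missing)
    valid_ranges: dict mapping field name -> (lowest, highest) expected value
    """
    problems = []
    months = sorted(rows)

    # Check 1: is there an observation for every month in the range?
    for year, month in month_range(months[0], months[-1]):
        if (year, month) not in rows:
            problems.append(f"{year}-{month:02d}: no observation")

    # Check 2: is every value present and within its expected range?
    for year, month in months:
        for field, (lowest, highest) in valid_ranges.items():
            value = rows[(year, month)].get(field)
            if value is None:
                problems.append(f"{year}-{month:02d}: {field} is missing")
            elif not lowest <= value <= highest:
                problems.append(
                    f"{year}-{month:02d}: {field}={value} "
                    f"outside [{lowest}, {highest}]"
                )
    return problems


# Hypothetical sample: January 2024 is absent, one value is missing,
# and one value is negative.
rows = {
    (2023, 11): {"tv_spend": 12000.0, "sales": 90000.0},
    (2023, 12): {"tv_spend": None, "sales": 110000.0},
    (2024, 2): {"tv_spend": -500.0, "sales": 95000.0},
}
valid_ranges = {"tv_spend": (0.0, 1_000_000.0), "sales": (0.0, 10_000_000.0)}

problems = check_monthly_data(rows, valid_ranges)
for problem in problems:
    print(problem)
```

A script like this only flags the problems. Deciding whether a missing TV spend is really a zero or a lost record still takes a conversation with whoever provided the data.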
4. If my data is in bad shape, how is it cleaned and prepared for data mining?
OG data scientists will restructure the data and load it into a relational database. Once in the database, it is transformed into monthly observations of features. Descriptive statistics (mean, median, mode, standard deviation, frequency distributions, etc.) are then calculated for each feature.
Plots are created for each feature. Typical plots are frequency distribution and time series (value of a feature by time period, from the start of the range through the end, in chronological order). These statistics and plots are reviewed internally (OG) and cross-checked with original (raw – as provided by the client) data to make sure the transformation did not alter the data. The data review is then conducted with client data stakeholders/providers to check for and explain anomalies.
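To make the descriptive statistics concrete, here is a small sketch that uses only Python's standard `statistics` module. The `tv_spend` numbers are invented, and the `describe` and `frequency_distribution` helpers are illustrative names rather than part of any OG tool.

```python
import statistics
from collections import Counter


def describe(values):
    """Basic descriptive statistics for one feature."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def frequency_distribution(values, bin_width):
    """Count the values falling in each bin [start, start + bin_width)."""
    bins = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Hypothetical monthly TV spend for one year, in thousands of dollars.
tv_spend = [10, 12, 12, 15, 18, 12, 20, 22, 15, 12, 30, 14]

summary = describe(tv_spend)
histogram = frequency_distribution(tv_spend, 5)

print(summary)
print(histogram)  # keys are the lower edge of each bin
```

Running the same summary on the raw client file and on the transformed table is one simple way to cross-check that the transformation did not alter the data: the counts and totals should match.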
5. How does the quality of my data affect any potential data mining project?
Poor data quality can reduce the predictive accuracy of a model. It can, in the extreme, prevent model development entirely.
6. I’m not sure where all my data is. What places do clients most commonly store their data?
In their sock drawer next to old Playboys. Seriously, in an ideal world, all data is stored in a data warehouse. More commonly, it comes from spreadmarts – an Excel spreadsheet from Bob in finance; another one from Jill in media planning; a CSV from some legacy mainframe application; and three external SaaS applications that two different guys in sales bought because they saw them in an airline magazine…
7. What are the most overlooked and underappreciated aspects of data mining?
Data mining doesn’t require “big data”. That is, you don’t need millions of customer records to take advantage of the power of machine learning techniques. Monthly marketing spending and sales data, over at least three years, can produce very useful models to improve the effectiveness of your marketing dollars.
8. From a marketing perspective, what are some of the most common questions addressed through data mining?
- Are my marketing efforts working?
- Where should I spend my next marketing dollar?
- Have I reached minimum / maximum spending thresholds for media X?
- What marketing efforts “work better together” than individually?
9. How are outside variables (weather, MCSI, etc.) incorporated into a data mining project? How is the impact of outside variables measured alongside internal marketing variables?
External variables are included at the same observation frequency (typically monthly) as client-provided data. The machine learning tools include these features while constructing the candidate models and determine whether any of them contribute to the predictive accuracy of the model. If they contribute, they are included in the model. If not, they aren’t. They are not treated differently than client-provided features.
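In practice, "the same observation frequency" means external and client data end up side by side in one record per month. A rough sketch of that alignment step follows; the column names and every number in it are invented for illustration.

```python
def join_monthly(client_rows, external_rows):
    """Combine two month-keyed tables into one record per month.

    Only months present in both sources are kept, so every record has a
    value for every feature.
    """
    shared_months = sorted(set(client_rows) & set(external_rows))
    return [
        {"month": month, **client_rows[month], **external_rows[month]}
        for month in shared_months
    ]


# Hypothetical client data, keyed by "YYYY-MM" (which sorts chronologically).
client_rows = {
    "2024-01": {"tv_spend": 12.0, "sales": 90.0},
    "2024-02": {"tv_spend": 9.0, "sales": 84.0},
    "2024-03": {"tv_spend": 15.0, "sales": 101.0},
}

# Hypothetical external data at the same monthly frequency.
external_rows = {
    "2024-01": {"avg_temp_c": 1.5, "consumer_sentiment": 79.0},
    "2024-02": {"avg_temp_c": 3.0, "consumer_sentiment": 76.9},
    "2024-04": {"avg_temp_c": 11.0, "consumer_sentiment": 77.2},
}

records = join_monthly(client_rows, external_rows)
for record in records:
    print(record)
```

Once joined, an external column such as `avg_temp_c` is just another feature in the record, which matches the point that external features are not treated differently from client-provided ones.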
10. Does it matter how many variables are in a data mining project? How does this affect time, cost, etc.?
There is a limit on the number of features a mining tool can handle (it varies by tool), though we haven’t reached this limit with client projects. The number of features is reduced through “feature selection”, an iterative process that looks for features that are highly correlated or that “tell the same story”. For example, temperature reported in both Celsius and Fahrenheit tells the same story in different units, so the second “copy” doesn’t add information to the model. Only one feature from each such group is carried into the modeling phase.
Additional features that are not in an analytic-ready format (one record per observation period, with each variable as an additional column in that record) add to ETL (extract, transform, load) time. This can be expensive if the data requires significant work to get it into an analytic-ready format.
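The Celsius and Fahrenheit example can be shown with a simple correlation filter. This is only a sketch of one common approach (a greedy filter on Pearson correlation), not necessarily the feature selection process OG runs; the 0.95 threshold and the data are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (that would divide by zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name in features:
        if all(
            abs(pearson(features[name], features[other])) < threshold
            for other in kept
        ):
            kept.append(name)
    return kept


# Hypothetical monthly observations.
temp_c = [0.0, 5.0, 10.0, 20.0, 25.0, 15.0]
features = {
    "temp_c": temp_c,
    "temp_f": [c * 9 / 5 + 32 for c in temp_c],  # same story, different units
    "tv_spend": [12.0, 9.0, 15.0, 11.0, 14.0, 10.0],
}

kept = drop_redundant(features)
print(kept)
```

Here `temp_f` is dropped because it is perfectly correlated with `temp_c`, while `tv_spend` survives. Which member of a correlated group to keep is a judgment call; this sketch simply keeps the first one it sees.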
11. What’s the first step in every data mining project?
The first step is to understand the business problem being solved. If this step is ignored or given short shrift, one ends up with a very good answer to the wrong question.
12. What is typically the client’s role in a data mining project? What things fall on the client?
- Defining the business problem
- Providing data only the client can provide
- To the extent possible and/or desired, delivering client data in an analytic-ready format
- Reviewing data during ETL to help make sure the process didn’t introduce errors
- Reviewing candidate model(s) to see if they make sense
13. What steps are taken to make sure the data model delivered is the most accurate model?
Interestingly, we tend to ignore “the most accurate model”, as such models tend to be precisely wrong rather than generally accurate. That is, they can suffer from “overfitting”: fitting the historical data so closely that they predict new data poorly. Rather, we look for candidate models that:
- Make sense
- Are explainable
- Are simple
- Have good predictive accuracy
- Are biased in the way that best suits the client’s needs. For example, when sending direct mail advertising pieces, it’s better to have a model that makes some Type I errors (false positives) than one that makes Type II errors (false negatives). In this case, you spend a few extra cents per piece mailing to people who won’t respond rather than failing to mail to people who would respond and generate revenue.
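The direct mail trade-off in the last point can be worked through with a little arithmetic. The sketch below compares a strict and a lenient mailing threshold; the scores, the 50-cent cost per piece, and the $40 revenue per response are all invented numbers.

```python
def campaign_result(customers, threshold, cost_per_piece, revenue_per_response):
    """Return (profit, false_positives, false_negatives) for a mailing.

    customers: list of (model_score, would_respond) pairs
    A customer is mailed when the model score is at or above the threshold.
    """
    profit = 0.0
    false_positives = 0  # Type I: mailed, but would not respond
    false_negatives = 0  # Type II: not mailed, but would have responded
    for score, would_respond in customers:
        if score >= threshold:
            profit -= cost_per_piece
            if would_respond:
                profit += revenue_per_response
            else:
                false_positives += 1
        elif would_respond:
            false_negatives += 1
    return profit, false_positives, false_negatives


# Hypothetical model scores and actual outcomes for eight customers.
customers = [
    (0.9, True), (0.8, False), (0.7, True), (0.6, False),
    (0.4, True), (0.3, False), (0.2, False), (0.1, False),
]

strict = campaign_result(customers, 0.65, 0.50, 40.0)
lenient = campaign_result(customers, 0.25, 0.50, 40.0)

print("strict: ", strict)   # (78.5, 1, 1)
print("lenient:", lenient)  # (117.0, 3, 0)
```

With these made-up numbers, the lenient threshold wastes two more mail pieces (one extra dollar) but picks up a responder worth $40 that the strict threshold misses. The balance would shift if each piece were expensive relative to the revenue from a response.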