The Role of Data Preparation in Predictive Modeling

Why up to 90% of your modeling timeline could be spent on data preparation

I recently came across an analogy that compared predictive modeling to launching a rocket. Our focus is typically on the type of model to be built, the techniques, the algorithm, and the assumptions – all analogous to building a faster, more efficient rocket, which is certainly important. Data preparation, in that comparison, is the work of pointing the rocket in the right direction.

Given the immense importance of data preparation, it’s no surprise that the average modeler spends 60 percent of his or her time preparing data; in some cases, the number can be as high as 90 percent.

Figure 1: Steps in modeling process along with typical timelines

Four Key Challenges in Data Preparation:

1. Completeness – Do you have all relevant data for potential dependent and independent variables?

The 80/20 rule usually applies when assessing data completeness: eighty percent of the data is relatively easy to gather, but the last 20 percent can require significant time, effort, and cost. So a decision on completeness needs to be made with the model objectives in mind – what are you trying to achieve with this model? What business decisions will be made using the model results? Over what period of time will these results be applied? At what level will they be implemented (national rollout vs. localized pilot)? The idea is to strike the right balance between not letting perfect become the enemy of good and maintaining the credibility of the model output.

2. Sourcing – Who owns the most up-to-date data for each component in the model? Does data need to come from external vendors, and do you have the required legal agreements in place with each?

Sourcing can easily derail the project timeline if not preceded by extensive planning. I’ve worked on projects where several different marketing tactics were implemented by up to nine separate vendors, and sourcing campaign data from each one in a usable format was a major undertaking.

3. Cleanup – How do you address duplicates, missing values, outliers, inconsistent coding across sources, data granularity, and so on?

In predictive modeling, if you cannot separate the signal from the noise, there is no insight. Achieving that balance is as tricky as it is critical: you do not want to eliminate data that could have been signal, nor keep data that could have been noise.
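To make this concrete, here is a minimal pandas sketch of that kind of cleanup (the column names and values are purely hypothetical, not from any actual project), covering duplicates, missing values, and inconsistent coding across sources:

```python
import numpy as np
import pandas as pd

# Hypothetical raw activity extract; column names and values are illustrative only.
raw = pd.DataFrame({
    "hcp_id":  ["A1", "A1", "A2", "A3", "A3"],
    "channel": ["Email", "Email", "DETAIL", "detail", "Print"],
    "touches": [2, 2, np.nan, 5, 1],
    "week":    ["2023-01-02", "2023-01-02", "2023-01-02", "2023-01-09", "2023-01-09"],
})

clean = (
    raw.drop_duplicates()                                    # remove exact duplicate rows
       .assign(channel=lambda d: d["channel"].str.lower())   # harmonize coding across sources
       .assign(touches=lambda d: d["touches"].fillna(0))     # treat missing activity as zero touches (a judgment call)
       .assign(week=lambda d: pd.to_datetime(d["week"]))     # consistent date type for later aggregation
)
print(clean)
```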

4. Model considerations – What is the best dependent variable to answer the business questions? How should lag/decay factors be applied to the input data?

These questions are typically addressed in the modeling phase, but it is useful to think about them in advance, during data preparation.

Promotion Mix Model Example

Consider the promotion mix modeling example below, where we measured the relative impact of each HCP tactic, along with the absolute impact per touch and the ROI. The model was built at the HCP level, so it outlined channel responsiveness for each physician and segment, which in turn led to tactic bundles that delivered the highest ROI as well as an optimal customer experience.
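To show what such a model can look like structurally, here is a minimal sketch; it is not the actual model behind this example, and the tactic names, costs, and simulated data are hypothetical. The coefficients read as impact per touch, and dividing by a cost per touch gives a rough ROI comparison across tactics:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated HCP-level data: one row per HCP with total touches by tactic and sales.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "details": rng.poisson(6, n),
    "emails": rng.poisson(10, n),
    "speaker_programs": rng.poisson(1, n),
})
df["sales"] = (
    120 + 40 * df["details"] + 8 * df["emails"] + 90 * df["speaker_programs"]
    + rng.normal(0, 60, n)
)

# Simple linear model at the HCP level; each coefficient reads as the
# incremental sales per additional touch of that tactic (impact per touch).
X = sm.add_constant(df[["details", "emails", "speaker_programs"]])
fit = sm.OLS(df["sales"], X).fit()
impact_per_touch = fit.params.drop("const")

# Dividing impact per touch by a (hypothetical) cost per touch gives a rough ROI view.
cost_per_touch = pd.Series({"details": 150.0, "emails": 2.0, "speaker_programs": 2500.0})
print(impact_per_touch / cost_per_touch)
```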

In the context of this example, completeness of data and sourcing are key, because excluding activity for a subset of promotional tactics could inflate the contribution attributed to other channels.

Figure 2: Individual data components overlaid in the predictive model

Another issue that can be critical is outlier management. It starts with identifying which data points are outliers: keep them in the model and they could add noise that mutes the signal; remove them and they could take useful learnings away from the model.

Significant dips in activity for a tactic without corresponding dips in sales will dampen that tactic's promotional impact in the model; the model will conclude the tactic must not be very effective. But one could argue that such data points are a gold mine of information. Variability in the data over time is key to drawing insights from predictive modeling (which is why it is so hard to measure the impact of static billboards). Clients often ask us, "What would be the impact of reducing my promotional spend by fifty percent?" The model is rarely able to answer that question with high confidence because it has not seen that happen in the historical data. So when you do have such data points, they can be invaluable.

Certain outliers can be managed in the model using structured variables, while others will require data manipulation or even the elimination of certain data points. Structured variables are particularly useful for addressing data variances driven by external market events, such as competitive launches, pricing changes, or regulatory action. However, there needs to be a clear, defined process for handling outliers so that it can be applied consistently from cycle to cycle and across brands, enabling an apples-to-apples comparison. In some cases, automating that process is key.
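As one way to illustrate this, here is a minimal sketch (with a hypothetical weekly sales series and a hypothetical event date) that flags candidate outliers with a simple z-score rule and adds an indicator, or structured, variable for a known market event so the model can absorb the shift without dropping the rows:

```python
import pandas as pd

# Hypothetical weekly brand sales; values are illustrative only.
weeks = pd.date_range("2023-01-02", periods=20, freq="W-MON")
sales = pd.Series(
    [100, 102, 98, 101, 99, 103, 100, 97, 160, 101,   # week 9 spikes (e.g., a one-time bulk order)
     100, 99, 102, 98, 70, 72, 71, 69, 68, 70],       # level drops after a competitive launch
    index=weeks, name="sales",
)
df = sales.to_frame()

# Simple z-score flag to surface candidate outliers for review (the threshold is a judgment call).
z = (df["sales"] - df["sales"].mean()) / df["sales"].std()
df["outlier_flag"] = z.abs() > 2

# Structured (indicator) variable for a known external event, e.g. a competitive launch,
# so the model attributes the level shift to the event rather than to promotion.
competitive_launch_week = pd.Timestamp("2023-04-10")  # hypothetical date
df["post_comp_launch"] = (df.index >= competitive_launch_week).astype(int)

print(df[df["outlier_flag"]])
```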

Another question that comes up often in relation to data preparation is about granularity. At what level should we summarize all data? And at what level should we build the model?

Model at the level at which you will make business decisions

If key decisions will be driven at the HCP level, all data should be summarized and attributed to HCPs. If media planning decisions will be made at the Designated Market Area (DMA) level, that is where the data needs to be summarized. The time dimension is a key component as well – for example, weekly vs. monthly data.
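For illustration, here is a minimal pandas sketch, with hypothetical column names, that rolls the same raw activity feed up to two different decision levels: HCP by week and DMA by month.

```python
import pandas as pd

# Hypothetical touch-level activity feed; column names and values are illustrative only.
activity = pd.DataFrame({
    "hcp_id": ["A1", "A1", "A2", "A3"],
    "dma": ["501", "501", "501", "602"],
    "channel": ["email", "detail", "email", "detail"],
    "touch_date": pd.to_datetime(["2023-01-03", "2023-01-05", "2023-01-10", "2023-02-02"]),
    "touches": [1, 1, 2, 1],
})

# HCP-level, weekly: suitable if decisions (e.g., next-best-action) are made per physician.
hcp_weekly = (
    activity
    .assign(week=lambda d: d["touch_date"].dt.to_period("W"))
    .groupby(["hcp_id", "week", "channel"], as_index=False)["touches"].sum()
)

# DMA-level, monthly: suitable if media planning decisions are made per market.
dma_monthly = (
    activity
    .assign(month=lambda d: d["touch_date"].dt.to_period("M"))
    .groupby(["dma", "month", "channel"], as_index=False)["touches"].sum()
)
```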

A more granular level is preferred because:

  • It uncovers more variability, which improves the model's overall predictive power
  • It allows insight into a more granular level of detail within a medium (daypart, for example)

However, granularity brings challenges:

  • The data contains more noise (random variation)
  • The data may have to be brought down to granular levels (e.g., allocating GRPs by market)

Selecting the right dependent variable

Another key element of data preparation, one that became even more apparent to me recently, is selecting the right dependent variable. In the example above, where we measure the relative impact of each channel's contribution, a model with one fixed effect across all accounts will miss the inherent bias in the data due to different account sizes. It could imply that, given sufficient activity, smaller clinics' sales could start matching larger hospitals', which is obviously not true. That is why picking the right dependent variable is so important: instead of raw sales or Rx, a variable such as sales per target HCP or sales per number of beds better incorporates the size component.
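A minimal sketch of that normalization, using hypothetical account-level fields, could look like this:

```python
import pandas as pd

# Hypothetical account-level data; names and figures are illustrative only.
accounts = pd.DataFrame({
    "account_id": ["clinic_1", "clinic_2", "hospital_1"],
    "sales": [1_200, 900, 45_000],
    "target_hcps": [3, 2, 60],
})

# Raw sales would let account size dominate the model; a per-capacity dependent
# variable (here, sales per target HCP) puts small clinics and large hospitals
# on a comparable scale before fitting.
accounts["sales_per_target_hcp"] = accounts["sales"] / accounts["target_hcps"]
print(accounts)
```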

Lag and decay considerations

Finally, lag and decay are also key data decisions that need to be made before the modeling phase begins. Decay is applied so that a "touch" is moved into the time period in which its impact will be observed – sometimes spread fractionally across several periods. The rate of decay differs by channel: digital tactics decay much faster than print, and TV falls somewhere in between.
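One common way to implement decay is a geometric adstock transform; the sketch below uses purely illustrative touch counts and decay rates, which in practice would be estimated or tested per channel:

```python
import numpy as np
import pandas as pd

def adstock(touches: pd.Series, decay: float) -> pd.Series:
    """Geometric adstock: each period carries over `decay` of the prior period's value."""
    out = np.zeros(len(touches))
    carry = 0.0
    for i, x in enumerate(touches.to_numpy(dtype=float)):
        carry = x + decay * carry
        out[i] = carry
    return pd.Series(out, index=touches.index)

# Hypothetical weekly touch counts by channel.
weeks = pd.date_range("2023-01-02", periods=8, freq="W-MON")
email = pd.Series([5, 0, 0, 3, 0, 0, 0, 2], index=weeks)
print_ads = pd.Series([1, 0, 0, 0, 1, 0, 0, 0], index=weeks)

transformed = pd.DataFrame({
    "email_adstock": adstock(email, decay=0.3),       # digital decays quickly
    "print_adstock": adstock(print_ads, decay=0.7),   # print lingers longer
})
print(transformed)
```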

In conclusion, there are several issues related to data preparation that make this step absolutely critical to predictive modeling. And even though it could take 60 percent or more of your modeling timeline, it’s time well spent!

For more on how analytics can improve the customer experience, view our on-demand webinar, “Unlocking Customer Experience through Analytics.”
