Wednesday, April 13, 2016

Multi-Channel Attribution and Understanding Interaction

I'm no cosmologist, but this post is going to rely on a concept well known to astrophysicists, who often have something in common with today's marketers (as much as they might be loath to admit it). So what is it that links marketing analytics to one of the coolest and most 'pure' sciences known to man?

I'll give you a hint: it has to do with such awesome topics as black holes, distant planets, and dark matter.

The answer? It has to do with measuring the impact of things that we can't actually see directly, but that still make their presence felt. This is common practice for scientists who study the universe, and yet not nearly common enough among marketers and the people who evaluate media spend and results. Like physics, marketing analytics has progressed in stages, but we have the advantage of entering a much more mature field, and so we can avoid the mistakes of earlier eras.

Marketing analytics approaches over the years, and the assumptions each one created:

  • Overall Business Results (i.e. revenue): if good, marketing is working!
  • Reach/Audience Measures (i.e. GRPs/TRPs): more eyeballs = better marketing!
  • Last-click Attribution (i.e. click conversions): put more money into paid search!
  • Path-based Attribution (i.e. weighted conversions): I can track a linear path to purchase!
  • Model-based Attribution (i.e. beta coefficients): marketing is a complex web of influences!

So what does this last one mean, and how does it relate to space? When trying to find objects in the distant regions of the cosmos, scientists often rely on indirect means of locating and measuring their targets, because they can't be observed directly. For instance, we can't see planets orbiting distant stars even with our best telescopes. However, based on things like the bend in the light emitted from a star and the composition of gases detected, we can 'know' that there is a planet of a certain size and density in orbit, one that is affecting the measurements we would otherwise expect to get from that star if no such planet existed. Similarly, we can't see black holes, but we can detect the radiation signature created when gases under their immense gravitational force give off X-rays.

This is basically what a good media mix/attribution model is attempting to do, and it's why regression models can work so well. You are trying to isolate the effect of a particular marketing channel or effort, not in a vacuum, but in the overall context of the consumer environment. I first remember seeing white papers about this in the context of measuring brand lift from exposure to TV or display ads, but those were usually simple linear regression problems connecting a single predictor variable to a response, or chi-square-style hypothesis tests. Outside of a controlled experiment, though, that method simply won't give you an accurate picture of your marketing ecosystem, one that takes into account the whole customer journey.
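To make that concrete, here is a minimal sketch in R of what a model-based attribution regression can look like. The data file and column names (conversions plus spend or activity per channel) are hypothetical placeholders, not anything from a real dataset.

```r
# Minimal sketch of a model-based attribution regression. The file and the
# columns (conversions, ppc_spend, facebook_spend, email_sends, display_impr)
# are hypothetical placeholders.
mkt <- read.csv("weekly_marketing.csv")

# Multiple regression: conversions as a function of several channels at once,
# so each coefficient estimates a channel's effect while the other channels
# stay at their observed levels.
fit <- lm(conversions ~ ppc_spend + facebook_spend + email_sends + display_impr,
          data = mkt)

summary(fit)
```

Each coefficient is an indirect measurement in exactly the sense above: an estimate of one channel's pull on the outcome, inferred from how the outcome moves when that channel moves and everything else is accounted for.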

As a marketer, you've surely been asked at some point, "What's the ROI of channel x?" or "How many sales did advertisement x drive?" And perhaps, once upon a time, you would have been content to pull a quick conversion number out of your web analytics platform and call it a day. However, any company that does things this way isn't just going to get a completely incorrect (and therefore useless) answer; it isn't even asking the right question.

Modern marketing models tell us that channels can't be evaluated in isolation, even if you can make a reasonably accurate attempt to isolate a specific channel's contribution to overall marketing outcomes within a particular holistic context.

Why does that last part matter? Because even if you can build a great model out of clean data that is highly predictive, all of the 'contribution' measuring that you are doing is dependent on the other variables.

So, for example, if you determine that PPC is responsible for 15% of all conversions, Facebook for 9%, and email for 6%, and then back into an ROI value based on the cost of each channel and the value of the conversions, you still have to be very careful with what you do with that information. The nature of many common predictive modeling methods is such that if your boss says, "Well, based on your model PPC has the best ROI and Facebook has the worst, so take the Facebook budget and put it into PPC," you have no reason to think that your results will improve, or even change in the way you assume.
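As a toy illustration of that back-calculation (every number below is made up), the arithmetic is simple enough, which is part of why it gets over-trusted:

```r
# Toy ROI back-calculation -- all figures here are invented for illustration.
total_conversions <- 10000
value_per_conv    <- 50                                   # assumed revenue per conversion

share <- c(ppc = 0.15, facebook = 0.09, email = 0.06)     # modeled attribution shares
cost  <- c(ppc = 60000, facebook = 40000, email = 5000)   # hypothetical channel spend

attributed_revenue <- share * total_conversions * value_per_conv
roi <- (attributed_revenue - cost) / cost
round(roi, 2)   # this is the ranking the boss wants to reallocate budget on
```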

Why not? Because hidden interactivity between channels is built into the models, so some of the value that PPC is providing in your initial model (as well as any error term) is based on the levels of Facebook activity that were measured during your sample period.
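One way to see that dependence in the model itself is to include an interaction term. A hedged sketch, reusing the same hypothetical columns as before:

```r
# Same hypothetical data as above, now with an interaction between PPC and
# Facebook. In R, '*' expands to both main effects plus their interaction.
fit_int <- lm(conversions ~ ppc_spend * facebook_spend + email_sends, data = mkt)
coef(fit_int)

# The marginal effect of PPC is no longer a single number:
#   d(conversions)/d(ppc_spend) = b_ppc + b_(ppc:facebook) * facebook_spend
# so "zeroing out" Facebook pushes the model far outside the conditions it was
# estimated under, and the PPC contribution you measured no longer applies.
```

With the interaction in place, it's explicit that the PPC slope is conditional on Facebook activity. In a model without the interaction term, that dependence doesn't disappear; it just gets smeared across the main-effect coefficients and the error term.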

It's a subtle distinction, but an important one. If you truly want to have an accurate understanding of the real world that your marketing takes place in, be ready to do a few things:
  1. Ask slightly different questions; look at overall marketing ROI with the current channel mix, and how each channel contributes, taking into account interaction
  2. Use that information to make incremental changes to your budget allocations and marketing strategies, while continuously updating your models to make sure they still predict out-of-sample data accurately
  3. If you are testing something across channels or running a new campaign, try adding it as a binary categorical variable to your model, or a split in your decision tree
Just remember: ROI is a top-level metric and shouldn't necessarily be applied at the channel level the way people are used to. Say this to your boss: "The marketing ROI, given our current/recent marketing mix, is xxxxxxx, with relative attribution between the channels being yyyyyyy. Knowing that, I would recommend increasing/decreasing investment in channel (variable) A for a few weeks, which according to the model would increase conversions by Z, and then seeing whether that prediction holds." Re-run the model, check assumptions, rinse, repeat.
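In code, that test-and-verify loop might look something like the sketch below, again with hypothetical column names and a made-up campaign start date:

```r
# Sketch of the incremental loop: flag the new campaign, refit, predict a
# proposed change out of sample, then check the prediction against what
# actually happens before reallocating further. Columns are hypothetical.
mkt$new_campaign <- factor(ifelse(as.Date(mkt$week) >= as.Date("2016-03-01"), 1, 0))

fit2 <- lm(conversions ~ ppc_spend + facebook_spend + email_sends + new_campaign,
           data = mkt)

proposed <- data.frame(ppc_spend = 12000, facebook_spend = 8000,
                       email_sends = 50000,
                       new_campaign = factor(1, levels = c(0, 1)))
predict(fit2, newdata = proposed, interval = "prediction")
```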


Friday, April 8, 2016

Own Your Data (Or at least house a copy)

There is a common trope among data analysts that 80% of your time is spent collecting, cleaning, and organizing your data, and just 20% is spent on analyzing or modeling. While the exact numbers may vary, if you work with data much you have probably heard something like that and found it to be more or less true, if exaggerated. As society has moved to collecting exponentially more information, we have of course seen a proliferation in the types, formats, and structures of that information, and thus in the technologies we use to store it. Moreover, very often you want to build a model that incorporates data someone else is holding, stored according to their own methods, which may or may not mesh well with your own.

For something as seemingly simple as a marketing channel attribution model, you might be starting with a flat file containing upwards of 50-100 variables pulled from 10+ sources. I recently went through this process to update two years' worth of weekly data, and it no joke took days and days of data prep just to get a CSV that could be imported into R or SAS for the actual analysis and modeling steps. Facebook, Twitter, AdWords, Bing, LinkedIn, YouTube, Google Analytics, a whole host of display providers... the list goes on and on. All of them using different date formats, calling similar or identical variables by different names, limiting data exports to 30 or 90 days at a time, etc.
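The cleanup for each export was some variation on the same theme; here is a hedged base-R sketch, with file names, column names, and date formats as hypothetical stand-ins for whatever the various platforms actually hand you:

```r
# Normalize one platform export into a common shape. File names, column names,
# and date formats below are hypothetical examples.
standardize <- function(path, date_col, date_format, spend_col, source_name) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  data.frame(date   = as.Date(df[[date_col]], format = date_format),
             spend  = as.numeric(df[[spend_col]]),
             source = source_name,
             stringsAsFactors = FALSE)
}

adwords  <- standardize("adwords_export.csv",  "Day",  "%Y-%m-%d", "Cost",   "adwords")
facebook <- standardize("facebook_export.csv", "Date", "%m/%d/%Y", "Amount", "facebook")

combined <- rbind(adwords, facebook)   # one consistent table instead of ten layouts
```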

Obviously, that's worth the effort for a big one-time study, but what about actually building a model for production? What about wanting to update your data set periodically and make sure the coefficients haven't changed too much? When dealing with in-house data (customer behavior, revenue forecasting, lead scoring, etc.), we often get spoiled by our databases, because we can just bang out a SQL query to return whatever information we want, in whatever shape we want. Plus, most tools like Tableau or R will plug right into a database, so you don't even have to transfer files manually.
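For comparison, once the data lives in a warehouse, a modeling-ready table is one query away. A sketch assuming a Postgres/Redshift-style endpoint, with hypothetical connection details, table, and columns (Redshift speaks the Postgres wire protocol, so a standard Postgres driver generally works):

```r
# Query the warehouse directly from R instead of shuffling CSVs around.
# Host, credentials, and the table/column names are hypothetical.
library(DBI)
library(RPostgreSQL)

con <- dbConnect(dbDriver("PostgreSQL"),
                 host = "warehouse.example.com", port = 5439, dbname = "analytics",
                 user = "analyst", password = Sys.getenv("DB_PASS"))

weekly <- dbGetQuery(con, "
  SELECT date_trunc('week', event_date) AS week,
         channel,
         SUM(spend)       AS spend,
         SUM(conversions) AS conversions
  FROM marketing_events
  GROUP BY 1, 2
  ORDER BY 1, 2")

dbDisconnect(con)
```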

At day's end, it quickly became apparent to me that having elements of our data, from CRM to social to advertising, live in an environment that I can't query or code against is just not compatible with solving the kinds of problems we want to solve. So of course, the next call I made was to our superstar data pipeline architect, an all-around genius who was building services for the dev team running off our many AWS instances. I asked him to start thinking about how we should implement a data warehouse and connections to all of these sources if I could hook up the APIs, and he of course said he had not only already thought of it, but started building it. Turns out, he had a Redshift database up and was running Hadoop MapReduce jobs to populate it from our internal MongoDB!

So with that box checked, I started listing out the API calls we would want and all of the fields we should pull in, and figuring out the hookups to the third-party access points. Of course, since we have an agency partner for a lot of our paid media, that became the biggest remaining roadblock on the path to my data heaven. I scheduled a call with the rep in charge of their display & programmatic trade desk unit, just so we could chat about the best way to hook up and siphon off all of our daily ad traffic data from their proprietary system. After some back and forth, we finally arrived at a mostly satisfying strategy (with a few gaps due to how they calculate costs and the risk of exposing other clients' data to us), but here is the kicker:

As we were trying to figure this out, he said that we were the first client to even ask about this.

I was so worried about being late to the game that it didn't even occur to me that we would have to blaze this trail for them.

The takeaway? In an age of virtually limitless cheap cloud storage and DevOps tools to automate API calls and database jobs, there is no reason data analysts shouldn't have consistent access to a large, refreshing data lake (pun fully intended). The old model created a problem where we spent too much time gathering and pre-processing data, but the same technological advances that threaten to compound the problem can also solve it. JSON, SQL, unstructured, and every other kind of data can live together, extracted, blended, and loaded by Hadoop/HDFS jobs into a temporary cloud instance as needed.
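For example, a scheduled job can make the API call and land the results in the warehouse without anyone ever touching a CSV. A rough sketch, where the endpoint, auth scheme, fields, and table name are all hypothetical:

```r
# Automated pull-and-load job a scheduler could run daily. The API endpoint,
# auth scheme, response shape, and table name are hypothetical.
library(httr)
library(jsonlite)
library(DBI)
library(RPostgreSQL)

yesterday <- format(Sys.Date() - 1)

resp <- GET("https://api.example-adplatform.com/v1/report",
            query = list(start_date = yesterday, end_date = yesterday),
            add_headers(Authorization = paste("Bearer", Sys.getenv("AD_API_TOKEN"))))
stop_for_status(resp)

# Assumes the API returns a flat array of records that parses to a data frame.
daily <- fromJSON(content(resp, as = "text"), flatten = TRUE)

con <- dbConnect(dbDriver("PostgreSQL"), host = "warehouse.example.com",
                 dbname = "analytics", user = "etl", password = Sys.getenv("DB_PASS"))
dbWriteTable(con, "ad_traffic_daily", daily, append = TRUE, row.names = FALSE)
dbDisconnect(con)
```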

The old 80/20 time split existed, and still exists, because doing the right thing is harder and takes more up-front work, but I'm pretty excited to take this journey and see how much time it saves over the long run.

(Famous last words before a 6 month project that ultimately fails to deliver on expectations; hope springs eternal)

What do you think? Have you tried to pull outside data into your own warehouse structure? Already solved this issue, or run into problems along the way? Share your experience in the comments!