
The Experimentation Gap

“Analytics”, “algorithms”, “data-driven”, “data science”: these are terms typically pinned to an organization’s strategy to use data to be “better”. Better can mean a lot of things: more internal efficiency, improved sales, stronger cyber security, smarter organizational decisions. You name it, data is there to help.

My belief with regard to data initiatives is that every problem you are trying to solve has a direction and multiple modes, and those two factors are what should drive your technology solutions and implementations. Making critical technology decisions before you carefully examine the direction and modes of your problems is a mistake.

Let’s start by describing direction. This really means the direction in which you are looking for answers: backwards in time or forwards in time. In most cases, to look forward you have to look backwards first. This is what most would call learning from the past.

Backwards: You are looking at bounded historical data to unearth insight that wasn’t obvious until you started looking. These are your classic Business Intelligence tools that help visualize data for discovery and clarification. For example: show me a chart of the average temperature every August since 2000. What “Big Data” did was make it possible to look at bounded historical data that was massive in scale.
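To make the “backwards” idea concrete, here’s a minimal sketch of that kind of question, assuming a hypothetical list of daily temperature readings. No algorithm is involved, only aggregation over bounded historical data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily readings: (date, temperature in F). In reality this could
# be billions of rows, which is where Big Data tooling earns its keep.
readings = [
    (date(2000, 8, 1), 88.0), (date(2000, 8, 2), 91.5),
    (date(2001, 8, 1), 85.2), (date(2001, 8, 3), 89.9),
]

# "Show me the average temperature every August since 2000" is pure aggregation.
august_by_year = defaultdict(list)
for d, temp in readings:
    if d.month == 8 and d.year >= 2000:
        august_by_year[d.year].append(temp)

for year, temps in sorted(august_by_year.items()):
    print(year, round(sum(temps) / len(temps), 1))
```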

Forwards: This is making predictions – what is going to happen in the future based on things I’ve seen when I was looking backwards. Here’s a super simple example: when you wake up on a Saturday in August, do you check the weather to see if you should wear a sweater or a tee shirt? No, you just grab a tee shirt, because you “know” it’s hot in August and you “predict” it will be hot today (yah, yah, you get it, Southern Hemisphere). You learned from looking backwards in order to look forward. Sometimes you need an algorithm to learn from the past and predict the future, but in this case, the algorithm is called common sense.

Here’s a more complicated example. If you have an Android phone, on weekday mornings you get a (creepy) alert from the Google Now app that says, “Commute to work 5 minutes longer than usual”. What’s happening here? First, Google “knows” where you currently are. Google also “predicts” where you work. Google “knows” the amount of traffic between where you are and your work. Lastly, Google “predicts” how long it’s going to take you to get there and tells you about it before you leave for work, meaning it also “predicts” when you leave for work.

backwards = knows

forwards = predicts

To do the looking-forward piece, you need to build an algorithm. Looking backwards, the knowns, does not require an algorithm; it simply requires processing power against potentially massive data (ahem, Big Data).

Remember all those data scientists you hired for your data initiatives? Those gals and guys write algorithms. They look forward.

Here’s the thing: do you need to look at how hot it was every single August since the 1800s to determine whether it’s going to be hot today? No.

Does Google really need to look at every single phone subscriber to build its Google Now algorithms? No.

Here’s what I think the algorithms could be doing:

  • Predict where you work: where does your phone remain stationary most often during the hours of 9–5, Monday through Friday? Stationary = not pinging off new towers or new wifi hotspots.
  • Predict when you leave for work: when does the most “bouncing” between towers and wifi hotspots occur before and after your stationary periods? That’s your commute.
  • Predict how you drive there: use the road network between where you are and your predicted work location.
  • Predict how long it will take: look at the traffic on those roads and determine the average time it has taken in the past to traverse a similar traffic density.
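To make the first prediction concrete, here’s a minimal sketch, assuming a hypothetical list of phone pings (timestamps and tower IDs). The data, field names, and the 9-to-5 weekday window are my own illustration, not Google’s actual logic.

```python
from collections import Counter
from datetime import datetime

# Hypothetical ping records: (timestamp, cell_tower_id).
pings = [
    (datetime(2016, 8, 1, 10, 15), "tower_42"),
    (datetime(2016, 8, 1, 14, 30), "tower_42"),
    (datetime(2016, 8, 2, 11, 5), "tower_42"),
    (datetime(2016, 8, 2, 19, 45), "tower_7"),   # evening, ignored
    (datetime(2016, 8, 6, 13, 0), "tower_7"),    # Saturday, ignored
]

def predict_work_location(pings):
    """Guess 'work' as the tower the phone sits on most often, 9-5, Mon-Fri."""
    workday = [
        tower
        for ts, tower in pings
        if ts.weekday() < 5 and 9 <= ts.hour < 17
    ]
    if not workday:
        return None
    return Counter(workday).most_common(1)[0][0]

print(predict_work_location(pings))  # -> "tower_42"
```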

There are flaws in those algorithms (what if someone doesn’t work Monday through Friday?), but you get the idea. So, to the point: did I have to look at every single user to figure out how to calculate these, i.e. to build the algorithm? No. You look at some, specifically some for whom you already know where they live, where they work, and how they commute: known truths. You hold those truths back, and then you validate your predictions, the algorithms, against them to see how well they did. All of that is done with samples of data. In fact, you can’t possibly build the algorithm against everyone, because you’d have no way to validate your findings; you don’t know the truth for everyone, and if you did, there’d be no reason to build the algorithm in the first place.
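Here’s an equally small sketch of that validation step: hold back users whose work location you already know, then score the algorithm’s guesses against those held-back truths. The user IDs and predictions are hypothetical.

```python
# Users for whom we already know the truth (their real work tower).
# This is the "known truths" sample we hold back to score the algorithm.
known_truths = {
    "user_1": "tower_42",
    "user_2": "tower_7",
    "user_3": "tower_19",
}

# Predictions produced by the algorithm (normally you'd call
# predict_work_location on each held-back user's pings).
predictions = {
    "user_1": "tower_42",
    "user_2": "tower_7",
    "user_3": "tower_3",   # a miss
}

# Validate: compare predictions to the held-back truths.
hits = sum(1 for user, truth in known_truths.items() if predictions.get(user) == truth)
accuracy = hits / len(known_truths)
print(f"accuracy on the holdout sample: {accuracy:.0%}")  # -> 67%
```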

So wait: wouldn’t I always look forward using samples with truths? And if all my data scientists do is look forward, what do I need Big Data tools for again?

(OK, OK, data scientists don’t always just look forward, but I would argue the work done looking forward is the work that brings the most ROI.)

This question brings us to the other factor for every problem: modes. In the above example we were in experimentation mode, using samples of data to build a meaningful algorithm to look forward. There’s another mode: production mode. Production mode is where you sometimes need your Big Data tools because, for example, Google may want to pre-calculate some of these results for all of its subscribers instead of doing it one by one. Production mode means putting your shiny new algorithms from experimentation mode to work, at scale.
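In code terms, production mode is roughly the experiment-grade function run over every subscriber on a schedule, rather than for one user at a time. Here’s a toy stand-in, with a hypothetical predict function and user IDs.

```python
# A stand-in for the algorithm that graduated from experimentation mode.
def predict(pings):
    return max(set(pings), key=pings.count) if pings else None

all_users_pings = {
    "user_1": ["tower_42", "tower_42", "tower_7"],
    "user_2": ["tower_19"],
}

# Pre-calculate results for everyone instead of answering one query at a time.
precomputed = {user: predict(pings) for user, pings in all_users_pings.items()}
print(precomputed)  # -> {'user_1': 'tower_42', 'user_2': 'tower_19'}

# At real scale this comprehension becomes a distributed batch or streaming
# job (Spark, etc.); the algorithm itself doesn't change.
```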

So we finally get to the point:

Big Data = looking backwards, knowns, production mode.

? = looking forward, predictions, experimentation mode.

That question mark is what we call the “Experimentation Gap”. When you invest in Big Data, and Big Data only, you force the wrong mode on your forward problems. You stymie experimentation and algorithm development in exchange for an engine to run the algorithms you don’t have yet. Let’s face it, most of your problems have a forward component. You’ve effectively put the cart before the horse.

This is why everyone says your Data Lake is a Data Swamp. It’s not because it’s disorganized or because there’s not enough governance; it’s because you’re forcing a production tool to be an experimentation tool (which makes it a mess and a policy hazard).

We believe Immuta fills that experimentation gap. It allows rapid access to data without forcing data centralization up front, while still enforcing policy controls and letting your data scientists experiment however they want. Take chunks of knowns and build algorithms quickly. Then you can take those algorithms and put them into production, maybe in a Big Data solution, maybe in a streaming solution; that’s for you to decide, but only after you’ve experimented, found algorithms and solutions, and understand the ROI in your data.