Data Governance Anti-Patterns [Part III]: The Copy & Paste Data Sharing Method

(This is part three of our “Data Governance Anti-Patterns” series. Part II can be found here.)

Anti-patterns are behaviors that take bad problems and lead to even worse solutions.  In the world of data governance, they’re everywhere. Today’s anti-pattern probably isn’t thought of as a “pattern” at all because it feels so obvious, and (on the surface) is too “easy” – you share data by giving someone a copy.  This is plain wrong, and we’ll explain why.

Imagine you’re a small mom-and-pop pizza place and have been diligent about storing all of your data about inventory, deliveries, and orders in a database. You know you could use powerful machine learning techniques to be more predictive about all that you do, but you don’t have data scientists on staff to build the models. Instead, you hire external consultants and share your data and business goals with them so that they can build the models for you.

This isn’t an uncommon scenario. In big organizations, such as banks, this sharing occurs internally between groups.  Also, large organizations share with each other in order to gain value from their combined data.

The key word here is “share.” How does that actually happen? In almost every organization we work with, the method is consistent and problematic (at least before Immuta comes in): copy and paste. Specifically, the data is copied out of the database, likely transformed in some way, and then shared as a static file. In fact, this explains why data scientists are so familiar with working with files: they’re always at the tail end of the copy and paste process.

Why does everyone gravitate to copy and paste? A few reasons.

  1. Anonymization of data must occur: Going back to our pizza joint example, it’s likely that the owners would want to mask their customers’ names and personal information before sharing their data with the consultants. This requires some kind of extract/transform/load (ETL) process to copy the data out of the database, mask it appropriately, and then dump it somewhere to use as the final output to share. This technique isn’t limited to small pizza shops. It’s used in almost every large organization we meet with.
  2. Legal:  Anonymization techniques can be very complicated and involve sign-off from experts as well as auditing of where data is being shared (think GDPR).  This requires manual workflows and the involvement of several employees beyond just IT. In many cases, data owners use this as an excuse not to share at all.
  3. Database security: IT doesn’t want to manage more database accounts, especially for third parties, so they copy and paste instead. Remember how complex adding accounts can get if IT is also following our anti-patterns one and especially two.
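
To make the first reason concrete, here is a minimal sketch of the kind of extract, mask, and dump job described above. Everything in it is an assumption for illustration: the `orders` table, its columns, the sample rows, and the salted-hash `mask` rule are hypothetical, and a salted hash is only a stand-in for whatever anonymization legal actually signs off on.

```python
import csv
import hashlib
import io
import sqlite3


def mask(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a truncated salted SHA-256 digest (hypothetical rule)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def export_masked_orders(conn: sqlite3.Connection) -> str:
    """Extract every order, mask the customer name, and return a static CSV snapshot."""
    rows = conn.execute(
        "SELECT order_id, customer_name, pizza, total FROM orders ORDER BY order_id"
    ).fetchall()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["order_id", "customer", "pizza", "total"])
    for order_id, name, pizza, total in rows:
        writer.writerow([order_id, mask(name), pizza, total])
    return out.getvalue()


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database standing in for the pizza shop's system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer_name TEXT, pizza TEXT, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [(1, "Ada Lovelace", "margherita", 12.5), (2, "Alan Turing", "pepperoni", 14.0)],
    )
    conn.commit()
    return conn


conn = build_demo_db()
snapshot = export_masked_orders(conn)  # this string is the file that gets shared

# A new order arrives after the export; the shared file never sees it.
conn.execute("INSERT INTO orders VALUES (3, 'Grace Hopper', 'veggie', 13.0)")
live_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
snapshot_count = len(snapshot.strip().splitlines()) - 1  # subtract the header row
```

Even in this toy version the snapshot falls behind the database as soon as the next order is placed, and every additional audience with different masking needs would require another export like it.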

The Problems With Copy and Paste

It turns out that the copy and paste method of data sharing directly hurts data science programs and leads to massive frustrations for the downstream data consumers. There are a few reasons behind this:

  1. Data consumers have no process to discover or request access to the data across the organization.
  2. They typically have to wait months (yes, months) for the data to arrive.
  3. They’re working with a static snapshot of the data, which is typically months old – if not simply outdated – because of the above process.  
  4. They’re required to sign data usage agreements and therefore must be very careful about how they subsequently share the data with their colleagues.

And on top of that, the organization is frustrated because:

  1. They lose insight into who has what data and how it’s being used.
  2. They significantly increase storage costs, as many anonymized versions of the data need to be stored for different user scenarios.
  3. They have complex ETL “spaghetti” to manage the creation of all the anonymized copies.
  4. It isn’t clear how the anonymization policies are actually implemented (or if they are correct) across different data systems (see anti-pattern 1).
  5. Biggest of all, their data science initiatives are stymied because frankly, none of this works and nobody can access data in the way they need.

So how should you avoid this anti-pattern?

A great analogy is how Netflix led to the demise of Blockbuster.  Blockbuster was the copy and paste method. After searching through the store on foot, you got the raw video, watched it, remembered to rewind, and returned it.  Netflix changed that. Instead of copy and paste, they provided a live feed to the movie over the internet. The value here was discovering the movie you wanted through a web search then having immediate access from your living room, without moving from your couch.  

Data science programs need Netflix, not Blockbuster. With Blockbuster, they fall apart.

The Value of a Data Control Plane

In order to provide live access to data, we recommend implementing what Immuta terms a “data control plane.” This control plane is placed on top of your databases as an abstraction layer which provides discoverability, policy authoring, data access, full audit and request history, and dynamic anonymization at data-access-time.  Taking this approach resolves all of the above issues.

  • Data owners can share their data with complex access and anonymization policies that are enforced at request and query-time.  No ETL jobs, no extra storage.
  • Policies can be reviewed or enhanced by legal and compliance – the policies should be written in a way that’s simple to understand by all employees.
  • The control plane, which reduces your surface area for a security breach, acts as an abstraction to your database and doesn’t require new accounts to be created.
  • Data scientists can rapidly discover data, request access, and immediately be entitled to the data based on the logic of the access policy.
  • The data scientists are accessing live, up-to-date data through industry standard access patterns.
  • The data scientists 1) are comforted knowing they followed a recognized access process and 2) can share work (code) with their colleagues, knowing they’ll be able to access the data through this same control plane.
  • A full audit of all actions is captured, with reporting capability to fully understand who is using what data for what purpose.
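
The idea of enforcing policy at query time, rather than baking it into copies, can be sketched in a few lines. This is a toy illustration under stated assumptions, not Immuta’s API or implementation: the `Policy` and `ControlPlane` classes, the purpose strings, and the `orders` table are all hypothetical names invented for the example.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set, Tuple


@dataclass
class Policy:
    """Hypothetical policy: which purposes may read a table, and which columns to mask."""
    allowed_purposes: Set[str]
    masked_columns: Dict[str, Callable[[Any], Any]]


@dataclass
class ControlPlane:
    """Toy abstraction layer: checks policy, masks on read, and audits every request."""
    conn: sqlite3.Connection
    policies: Dict[str, Policy]
    audit_log: List[Tuple[str, str, str, bool]] = field(default_factory=list)

    def query(self, user: str, purpose: str, table: str) -> List[Dict[str, Any]]:
        policy = self.policies.get(table)
        allowed = policy is not None and purpose in policy.allowed_purposes
        self.audit_log.append((user, purpose, table, allowed))  # log denials too
        if not allowed:
            raise PermissionError(f"{user} may not read {table} for {purpose}")
        # The table name is only interpolated after matching a known policy key.
        cursor = self.conn.execute(f'SELECT * FROM "{table}"')
        columns = [d[0] for d in cursor.description]
        results = []
        for row in cursor.fetchall():
            record = dict(zip(columns, row))
            for column, mask_fn in policy.masked_columns.items():
                record[column] = mask_fn(record[column])  # masked at access time
            results.append(record)
        return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_name TEXT, pizza TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Ada Lovelace", "margherita"), (2, "Alan Turing", "pepperoni")],
)

plane = ControlPlane(
    conn=conn,
    policies={
        "orders": Policy(
            allowed_purposes={"demand-forecasting"},
            masked_columns={"customer_name": lambda _value: "REDACTED"},
        )
    },
)

first = plane.query("consultant", "demand-forecasting", "orders")
conn.execute("INSERT INTO orders VALUES (3, 'Grace Hopper', 'veggie')")
second = plane.query("consultant", "demand-forecasting", "orders")  # sees the new row

try:
    plane.query("consultant", "marketing", "orders")
    denied = False
except PermissionError:
    denied = True
```

Nothing is exported here: the consultant reads the live table through the layer, the customer name is masked on every read, and the audit log records each request, including the one that was refused.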

The data control plane is likely overkill for the pizza joint, but is absolutely critical for large organizations with disparate data silos as well as small organizations with very complex data policies to enforce (such as HIPAA and GDPR). With Immuta, this control plane can serve as the foundation for your data privacy initiatives.

Click here for a 1:1 demo to see how Immuta’s platform can help you rapidly personalize data access and dramatically improve the creation, deployment, and auditability of machine learning and AI.