Data Governance Anti-Patterns: Data Policies Shouldn’t Be Snowflakes

(This is part one of our “Data Governance Anti-Patterns” series; part two can be found here.)

Anti-patterns are behaviors that take bad problems and lead to even worse solutions. In the world of data governance, they’re everywhere.

The first step to enabling algorithm-driven organizations is eliminating your data policy snowflakes: they’re your anti-pattern. What’s a data policy snowflake, you ask? In short, it’s the “from scratch” implementation of data policies in every database across your organization, each one unique, just like a snowflake. That leaves no consistent way to manage, enforce, understand, and audit policies on data, which costs your organization time and money and puts your data science initiatives at risk.

Instead of building unique policy snowflakes across your databases, this post highlights the benefits of a single, consistent abstraction layer for building and enforcing your policies across any database, written in plain English and applied on the fly. Keep reading if you’re excited to snowplow some policy snowflakes.

The data policy snowflake

If you’re a part of a typical organization, data is stored in various databases in various locations. There are people who are allowed to see that data for their job, others who may be able to see some of it, and others who shouldn’t see that data at all. So there are always three parts to this equation: people, policies, and enforcement of those policies.

In almost all of our engagements with customers, we find that policy enforcement is fragmented. It’s typically implemented uniquely by the database administrator or data engineering team, platform by platform, across the organization. In practice, this means that the administrator of each platform has to take policies, which are typically based on a document or email from legal/compliance, and translate them into policy enforcement rules all on their own.

In short, they manage policies on their data in their own unique way – like a snowflake. And it’s bad to have snowflakes all over your organization for several reasons. Here are just a few:

  1. Mistakes: Legal and compliance policies are being interpreted by database administrators and data engineers. This is what lawyers are for. Policy should not be open to individual interpretation.
  2. Validation: Legal and compliance professionals, who likely don’t know how to write a SQL statement or Spark job, have no way to understand how those policies are actually being enforced, and without proof they cannot defend the actions of their organization.
  3. Fragility: When a policy changes, organizations have to make sure every policy snowflake across their organization enforces that change accordingly. If you think centralizing all your data in Hadoop or a data warehouse solves this problem, you’re dead wrong. The warehouse is just a second copy of all your data, and the politics involved in actually migrating that data are exactly the issues we’re discussing here.
  4. Fear: Probably the worst issue of all. Data sharing is slowed or completely stopped because, given the issues above, data owners have no digital transfer “handshake” recognized by their organization that shows how their data is protected and shared.

A similar issue existed in programming before Object Oriented Programming (OOP) came to the rescue. As computers became more advanced, their associated software programs also became much more complex and difficult to manage. In these situations, a computer program would have a “from scratch” set of procedures for every objective it aimed to solve.

Sound familiar? Software programs were like snowflakes.

This led to the same issues as above: common objectives were interpreted differently across programs, testing was complex, and changes that spanned common objectives were fragile, which led to a fear of changing code. And that’s when OOP came in.

OOP has three central principles that solve these problems for software programs.

  1. Abstraction: only focus on what is necessary for the object and abstract the complexities under the covers. For example, you’ll never buy a “phone device”; you’ll always buy something more specific: an iPhone, a Samsung Galaxy, a Google Pixel 2. Those are concrete things; “phone device” is abstract. You don’t care how they work, but you do know you can make a call, because each one is a phone device.
  2. Encapsulation: you need this to do abstraction. Think of this as blueprinting how you intend an object to behave. For example, you’ve got several devices, and all of them have a USB port. You don’t know what kind of circuit there is behind that port; you just have to know you’ll be able to plug a USB cable in.
  3. Inheritance: this means you can create new objects that inherit from others. For example, you can assume an iPhone has a SIM card just like a Samsung Galaxy would, since they would inherit that from the parent “phone device”.
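To make the three principles concrete, here is a minimal Python sketch of the phone example. The class and method names (PhoneDevice, make_call, has_sim_card) are invented purely for illustration; treat it as one possible way to express the ideas, not a reference design.

```python
from abc import ABC, abstractmethod
from typing import List


class PhoneDevice(ABC):
    """Abstraction: callers only need to know that a phone can make a call."""

    def __init__(self, sim_card: str) -> None:
        # Encapsulation: the SIM card is internal state, reached only
        # through the methods this class chooses to expose.
        self._sim_card = sim_card

    def has_sim_card(self) -> bool:
        return bool(self._sim_card)

    @abstractmethod
    def make_call(self, number: str) -> str:
        """Each concrete phone decides how a call is actually placed."""


class IPhone(PhoneDevice):
    # Inheritance: SIM handling comes from PhoneDevice for free.
    def make_call(self, number: str) -> str:
        return f"iPhone calling {number}"


class SamsungGalaxy(PhoneDevice):
    def make_call(self, number: str) -> str:
        return f"Galaxy calling {number}"


def call_everyone(phones: List[PhoneDevice], number: str) -> List[str]:
    # Works for any PhoneDevice without knowing which model it is.
    return [phone.make_call(number) for phone in phones]
```

Note that call_everyone never checks which model it was handed. That is the point: the common objective (“make a call”) is written once against the abstract parent and reused by every concrete phone.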

Programming this way avoids “from scratch” code and allows reuse of common objectives. But OOP concepts have never entered the world of data protection, because data protection spans well beyond a computer program into real-world people and processes.

That’s where Immuta comes in.

Instead of building policy snowflakes across all your databases, we provide a single consistent abstraction layer to build your policies. We’ve encapsulated what a policy should look like across any database, in plain English, which can be applied on-the-fly through our abstraction of how the databases work under the covers.

What does this mean? It means there are no more mistakes of interpretation, because all the policies are built consistently and can be validated by anyone in plain English. It also means that if a policy changes, you only need to change that policy in one place. And it allows data owners to more freely share data inside and outside the organization, as there is a digital “handshake” spanning technology and business that is referenceable and defensible.
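As a rough illustration of the “define once, enforce everywhere” idea, here is a small Python sketch. It is not Immuta’s actual API: MaskingPolicy, SOURCES, and query are hypothetical names, the two in-memory “databases” stand in for real platforms, and the records are made up.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set

Row = Dict[str, str]


@dataclass(frozen=True)
class MaskingPolicy:
    """Hypothetical policy object: one definition shared by every data source."""

    column: str
    allowed_group: str

    def describe(self) -> str:
        # Plain-English form that legal and compliance can read and validate.
        return (
            f"Mask column '{self.column}' for everyone except "
            f"members of group '{self.allowed_group}'."
        )

    def apply(self, rows: Iterable[Row], user_groups: Set[str]) -> List[Row]:
        # Enforced at query time; the stored data is never modified.
        if self.allowed_group in user_groups:
            return [dict(row) for row in rows]
        return [
            {key: ("***" if key == self.column else value) for key, value in row.items()}
            for row in rows
        ]


# Stand-ins for two separate platforms (illustrative data only).
SOURCES: Dict[str, List[Row]] = {
    "warehouse": [{"name": "Ada", "ssn": "000-00-0001"}],
    "data_lake": [{"name": "Grace", "ssn": "000-00-0002"}],
}


def query(source: str, policy: MaskingPolicy, user_groups: Set[str]) -> List[Row]:
    # Every source goes through the same policy object: no per-database snowflake.
    return policy.apply(SOURCES[source], user_groups)
```

In this sketch, tightening or loosening the rule means editing one MaskingPolicy instance, and the sentence returned by describe() changes along with the enforcement, so what compliance reads and what the system does cannot drift apart.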

Why is this make or break for algorithm-driven organizations?

As organizations begin their digital transformations to better leverage their data for machine learning models, every single one will find itself at odds with compliance and regulatory controls. And these controls are growing across the globe and becoming more complex. Most organizations focus on the data science models at the start of this transformation and leave policy for later, but they can no longer afford this luxury. Data management without thoughtful, long-term policy management is quite simply destined to fail. The first step in enabling algorithms starts by eliminating your data policy snowflakes.