Establishing data management is a foundational step in creating a productive data science organization. Data management and business unit accountability are often overlooked, but they are critical to success. The following post provides an overview of what data management is, why it is hard, and why it is so important for freeing your team of Data Scientists to focus on data science rather than data wrangling.
What is Data Management?
Better data science requires better data management. Data management should:
- Allow you to understand the corpus of data across your company and make it easily discoverable and accessible.
- Allow you to control access to data at the most granular level: each piece of data to each user. That control must be dynamic in nature, allowing Data Owners to make policies around data more restrictive or less restrictive, on the fly, without retagging data.
- Provide the ability to log and audit all actions by users against your data: Who is accessing what, when, and why.
- Avoid adding load to the production systems that hold your data.
- Provide a “single sign on” abstraction layer so consumers of your data can access it easily and in ways that are familiar to them.
- Create a lab-to-factory workflow which allows experimentation to happen in sandboxes, and then allow successful experiments to be promoted to production easily without upfront data migration and security tasks.
- React: do not copy/consolidate all your data up front, rather, let the actions of your data consumers drive the consolidation of your data over time.
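The second point, dynamic access control without retagging, can be illustrated with a minimal sketch. Here policies are functions evaluated at access time rather than labels baked into the data, so a Data Owner can tighten a policy on the fly by swapping the function. Everything below (the attribute names, the `can_access` helper) is hypothetical and for illustration only, not any particular product's API:

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    attributes: set = field(default_factory=set)


# Policies are functions evaluated at access time, not labels stamped on rows.
# Changing a policy means swapping the function; the data itself is untouched.
policies = {
    "telemetry": lambda user: "engineering" in user.attributes,
}


def can_access(dataset: str, user: User) -> bool:
    """Evaluate the dataset's current policy against the user's attributes."""
    policy = policies.get(dataset)
    return policy is not None and policy(user)


alice = User("alice", {"engineering"})
bob = User("bob", {"marketing"})

assert can_access("telemetry", alice)
assert not can_access("telemetry", bob)

# The Data Owner tightens the policy on the fly -- no retagging required.
policies["telemetry"] = lambda user: {"engineering", "cleared"} <= user.attributes
assert not can_access("telemetry", alice)  # alice now also needs "cleared"
```

The key design point is that the policy decision happens at query time against the user's current attributes, which is what makes "more restrictive or less restrictive, on the fly" possible.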
Data management spans several roles across the organization:
Data Owners: the owners of the sensitive data. Data is power, and Data Owners realize this; their concerns and goals must not be minimized simply for the benefit of data science. They have production systems to maintain, auditing requirements, and security restrictions. The Data Owners must be empowered by the data management platform, not victimized by it.
Compliance Officer: if they mess up, they could potentially go to jail. They must follow all data regulations and ensure that data within a company is only seen by those with the proper authority or training. They must also follow data retention and auditing guidelines. The concerns of the Compliance Officer cannot be ignored; in most scenarios the Compliance Officer works closely with the Data Owners and understands the legacy production systems and their security and auditing capabilities.
Data Consumer: in our case, we are referring to the Data Scientists, but this could also be any end consumer of the data, such as a Business Analyst, Chief Data Officer, or Sales or Marketing consultant. They need relevant data quickly, and legally, to do their jobs.
What Data Management Is Not
Data management software is not data preparation software. Although data preparation software can be very useful to Data Scientists and Data Consumers in general, it remains downstream from your data management platform as an optional capability. Data preparation involves translating and unifying data sources from various disparate systems into a unified “single truth” of all your data combined (note that the single truth can change as new data arrives, but it is still a single static truth when accessed). This is useful for business analysts who require investigation and discovery from their data. However, requiring a Data Scientist to work with data already manipulated into a single truth can introduce data bias. The single truth was built by a person, persons, and/or software with biases about how to translate and merge the data; there could be errors, assumptions, insights about unification that only arrived later, and/or critical data dropped on the floor. The downstream analytics will only be as good as the upstream single truth used as input.
Data preparation is in fact a data science process.
There is data preparation software that will do this through probabilistic means; however, as soon as the actual unification of the data occurs and the single truth is created, it becomes a deterministic model, and you are left working with that single truth.
By providing a data management platform, the Data Scientist can reach back into the raw data to build their own single truths (models) specific to the problem at hand, as needed. Or the Data Scientists can leverage models previously created, either by downstream data preparation tools or by other data scientists. A data management platform can manage this “multiple models” workflow while inheriting the security controls from the upstream data (or other models). This allows the model creation process, the multiple representations of truth, to move at the speed of knowledge. There is no single right truth, because different combinations of algorithms will produce different types of insights in your data. They are all equally valid if you can prove their statistical validity. You don’t want to confine yourself to a single right answer. You want to pull out statistically significant ideas from all of them.
Thus, data management remains a required step even when utilizing data preparation software. As discussed, it can provide the means to create new models of the data (multiple truths) while still meeting the requirements of the Data Owners and Compliance Officers.
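The "inherited security controls" idea above can be sketched simply: a derived model is at least as restricted as every input it was built from, so its controls are the union of its parents' controls. The dataset names and control labels below are made up for illustration:

```python
# Hypothetical sketch: a derived model inherits the union of its parents'
# controls, so it can never be less restricted than any of its inputs.
def inherited_controls(parents):
    """Union of the security controls of every upstream dataset or model."""
    controls = set()
    for parent in parents:
        controls |= parent["controls"]
    return controls


raw_claims = {"name": "claims_raw", "controls": {"mask_pii", "us_only"}}
raw_billing = {"name": "billing_raw", "controls": {"finance_team"}}

# A new model built from both sources carries all of their restrictions.
fraud_model = {
    "name": "fraud_features_v1",
    "controls": inherited_controls([raw_claims, raw_billing]),
}

assert fraud_model["controls"] == {"mask_pii", "us_only", "finance_team"}
```

This is what lets multiple truths proliferate safely: each new model satisfies the Data Owners' and Compliance Officers' requirements by construction.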
Why is Data Management Hard?
Data management is often an afterthought, but even if you considered it up front, it is a massive challenge. When a data science organization decides to abandon a project, the root cause can almost always be traced back to failed data management execution. Why is it so hard?
Governance and control of institutional data is the core of the data problem. Managing disparate sets of data within a single enterprise is complex and challenging, let alone managing external consumption and analysis of the data. Even if internal organizations agreed to collaborate, it would take a significant investment to develop a common means of governing different types of data across the enterprise. It boils down to three problems: access, audit, and security enforcement.
How can data scientists access data?
The current enterprise approach is to either:
- Directly ask Data Owners for their data, or
- Initiate a plan to unify data to a single “data lake” with a common access pattern and ontology.
Approach 1 is not scalable for two primary reasons. First, it requires the Data Owner to either provide direct access to the data or export it in bulk to the requesting user; a bulk export does not provide a true representation of the data's state, since the dump is missing new data the second it leaves the owner. Second, the requestor of the data bears the burden of making sense of the data without help from the Data Owner. Even if the Data Owner provides help, the scalability of that approach breaks down quickly.
Approach 2, which sounds promising on the surface, is unachievable, particularly in an environment where data security controls are heterogeneous.
This is the core issue of the data sharing problem. Data Owners are not apt to give up control of their data for the good of the enterprise.
Not only do they risk letting data slip into the wrong hands, but the legal policies around the data will likely need to be modified in order to migrate the raw data to a new management system; in some cases, this isn’t even possible.
Legal roadblocks, Data Owner “slow rolling” (data is power), and ontology debates will ultimately and slowly crush the data lake utopian vision.
Who is managing our data?
One of the major issues with data lake approaches and the bulk export of data to requestors is maintaining audit, provenance, and purge control of data. How many copies are out there? How can I pull that data back? Who is using it?
Data Owners and Compliance Officers have the right to be concerned about providing copies of their locally managed data to a consumer or data lake out of their purview, specifically because:
- Compliance Officers require auditing that allows retroactive investigations to determine actions against their data. If a Data Owner needed to report who has seen their exported data during a specific slice of time, how would they do so quickly and in an automated way?
- Purge is also very difficult. If a Data Owner must purge their local copy of the data, that can be done, but how can that purge be reported to, and executed against, the external copies? What if there is a temporal purge requirement, where all data must be purged within 30 days? This becomes exceedingly difficult for exported data.
- Lastly, maintaining provenance of data is nearly impossible. If a consumer publishes new data (a report, an analytic result, etc.) built from the owner's data, how can the owner maintain their security controls, purge requirements, and audit requirements on this newly generated data?
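The purge and provenance concerns above are connected: if the platform records which outputs were derived from which inputs, a purge of a source can be propagated transitively to every downstream artifact, instead of being lost the moment data is exported out of the owner's purview. The artifact names and the `derived_from` structure below are hypothetical, a minimal sketch of the idea:

```python
# Hypothetical provenance graph: each artifact maps to the inputs it was
# built from. A real platform would record this automatically.
derived_from = {
    "quarterly_report": ["telemetry_2023"],
    "anomaly_model": ["telemetry_2023", "maintenance_logs"],
    "exec_summary": ["quarterly_report"],
}


def purge_targets(source: str) -> set:
    """Everything transitively built from `source` must be purged with it."""
    targets = set()
    stack = [source]
    while stack:
        current = stack.pop()
        for artifact, inputs in derived_from.items():
            if current in inputs and artifact not in targets:
                targets.add(artifact)
                stack.append(artifact)
    return targets


# Purging the raw telemetry reaches even the second-hop exec summary.
assert purge_targets("telemetry_2023") == {
    "quarterly_report", "anomaly_model", "exec_summary",
}
```

Without such a graph, the Data Owner has no automated answer to "how many copies are out there, and who is using them?"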
How can I enforce complex data specific security policies?
In a fractured enterprise with many data silos, there is inevitably a vast number of security policies and user authentication implementations. If one moves to a data lake approach, how can those policies be boiled down to a consistent security enforcement model? Some cases may require complex logic to determine access to data. Other policies may distinguish between read and edit permissions. There could be cases where portions of the data may be released, but the raw data can never be accessed. Depending on the level of heterogeneity of the security policies across the enterprise’s data silos, representing those policies in a generic way leads to classic data policy enforcement issues:
- Inability to migrate data with complex policies to the data lake.
- Relaxing/modifying policies to fit into the generic security representation (square peg, round hole).
- Over-restricting or “system-highing” data so coarsely that it eliminates access to those that should have access.
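The "square peg, round hole" problem can be made concrete with a hypothetical example. Suppose a silo's policy contains logic (training must have been completed within the last year), but the generic data-lake model can only grant or deny a static role, assigned once at migration time. The policy names and user fields below are invented for illustration:

```python
from datetime import date, timedelta


# The silo's real policy contains logic: training must be current.
def silo_policy(user) -> bool:
    return user["training_completed"] >= date.today() - timedelta(days=365)


# The generic data-lake model can only check a static role, so the logic is
# flattened into a one-time grant that never re-checks whether training lapsed.
def flattened_policy(user) -> bool:
    return "trained" in user["roles"]


stale_user = {
    "roles": {"trained"},                    # granted long ago, never revoked
    "training_completed": date(2020, 1, 1),  # training has long since expired
}

assert not silo_policy(stale_user)   # the real policy would deny access
assert flattened_policy(stale_user)  # the coarse role model still allows it
```

The flattened model is either too permissive (as here) or, if the grant is revoked defensively, too restrictive, which is exactly the relaxing/over-restricting dilemma listed above.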
Bottom line: approaches 1 and 2 for data sharing are unworkable due to access, auditing, and security enforcement.
The Problem is Real
Consider a satellite company with many satellites circling the globe and transmitting back telemetry data about their current state, functions of the system, and general health. This data is massive, not just in size but in importance, and requires advanced analysis to catch issues and aid in future design. The data is also very sensitive, so sensitive that it is controlled for export by the U.S. International Traffic in Arms Regulations (ITAR).
This satellite company also has several scientists who need to analyze this data; however, some are not US citizens. The satellite company seeks approval from the State Department to release some of the data from ITAR restrictions. Their bid is successful, but for only certain components of the satellites. Now they have an IT problem on their hands. The US citizen scientists can see all the data, but the non-US citizens can only see certain portions.
What is the classically wrong approach to solving this problem? Data copies. The satellite company will now bifurcate their telemetry data to two destinations, one for US users and the other for foreign persons. This requires an engineering effort to ensure the bifurcation is occurring properly. It is also a cost problem, because the company is now storing the same data twice. Further, this approach is not scalable. For example, what if more components are deemed releasable? What if new scientists come on board with different restrictions under the ITAR? This type of situation can quickly devolve into a nightmare for the Data Owners, the Compliance Officers, and unfortunately, the Data Scientists themselves.
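The policy-driven alternative to bifurcating the data can be sketched as one telemetry store filtered per user at query time. The component names and the releasable set below are invented for illustration; the point is that expanding the released set or adding a new class of user is a policy change, not a re-engineering effort:

```python
# Hypothetical: components the State Department has released from ITAR control.
RELEASABLE_COMPONENTS = {"solar_array", "thermal"}

# One store of telemetry -- no bifurcated copies.
telemetry = [
    {"component": "solar_array", "reading": 42.0},
    {"component": "propulsion", "reading": 17.3},  # still ITAR-restricted
    {"component": "thermal", "reading": 3.1},
]


def query(user, rows):
    """US persons see everything; others see only released components."""
    if user["us_person"]:
        return rows
    return [row for row in rows if row["component"] in RELEASABLE_COMPONENTS]


us_scientist = {"us_person": True}
foreign_scientist = {"us_person": False}

assert len(query(us_scientist, telemetry)) == 3
assert {row["component"] for row in query(foreign_scientist, telemetry)} == {
    "solar_array", "thermal",
}
```

If more components are later deemed releasable, only `RELEASABLE_COMPONENTS` changes: no new pipelines, no duplicate storage, and the audit trail stays in one place.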
Similar conflicts of data sensitivity vs data science goals can be seen across all verticals of business; insurance, finance, telecommunications, health, intelligence, aerospace engineering, manufacturing, the list goes on. Data is powerful yet sensitive. That delicate balance between data value and sensitivity must be maintained to reap the benefits of data science.
The modern enterprise needs the support of a high-level data brokering mechanism to allow data scientists to gain access to data in a simple, secure, and audited way. This is why we started Immuta and created a platform to address these issues within the enterprise. Immuta provides a scalable software framework, deployable on any cloud, to tackle these challenges and create the required data broker layer to supply data in a scalable, policy-driven manner that allows dynamic policy changes. It accomplishes this through a unique data virtualization approach to data management. We believe that deploying Immuta in an enterprise will not only enable effective sharing of disparate sets of critical siloed data, but will also enable the enterprise to establish fine-grained access controls. Immuta will enable your company to create a controlled data management environment that allows Data Scientists, Data Owners, Compliance Officers, and Business Decision Makers to act in concert, seamlessly meeting the needs of each group while maximizing the potential of your data.