Data is more critical than ever. Large Language Models (LLMs) and Generative AI are taking the world by storm. Suddenly, “Data Strategy” and “AI Strategy” are board-level discussion topics, and every organization is racing to figure out how to leverage its data to take advantage of this new paradigm.
For a data professional, the momentum is both exciting and puzzling. The latest LLMs, whether proprietary models like GPT-4 or openly available models like Llama 2, are a step-function change compared to what was previously possible. Yet, we all know that Machine Learning and Artificial Intelligence are not new topics. Models already power a variety of use cases in production, from fraud detection to insurance risk management and recommendation engines.
The technology is complicated, but the challenge with leveraging data and innovating is always a different one – how to move quickly but securely, while also adhering to all regulations and rules that govern your organization.
For a large and complex organization, it always starts with deeply understanding your data. You can’t have an AI Strategy without a Data Strategy. Your Data Strategy starts with understanding what data you possess, especially in your data warehouse and data lake. The best place to get clarity is your enterprise data catalog. Tools like Alation allow your data consumers to browse all available data sets, understand their data quality, and see which data sets their colleagues utilize, which helps them pick the correct data to accomplish their objectives. If you are just getting started, even vendor-specific solutions might be sufficient, and you can browse your data assets directly in Databricks or Snowflake.
While those approaches work for your data analysts and scientists, they do not fully meet the needs of your organization in terms of data security and governance. Tags in the data catalog help you find the right data or understand its quality, but they are not purpose-built for data security and compliance. Often, the tagging is only at the table level, which is insufficient for row- or column-level access controls.
To protect your data, you must granularly understand what’s stored in each column and table. Inaccurate classification is highly consequential, and false negatives most of all, because tagging sensitive data as non-sensitive may grant many more employees access. Your data estate is never static, so you must continuously monitor it and dynamically re-classify whenever the data or the schema changes.
Without correctly identifying your sensitive data, it’s impossible to understand what data you possess and how to categorize it according to regulatory frameworks like CCPA, GDPR, HIPAA, or PCI. On top of that, you need to adhere to internal policies and guidelines that categorize all your data into non-sensitive, sensitive, highly sensitive, and potentially more supplemental impact levels. Without these steps, any audit becomes a nightmare, resulting in manual work to reconcile what data was utilized by whom, for which purpose, and in which dashboard, model, or data product.
The Ideal Data Discovery Workflow For Data Security
In a perfect world, you would continuously inventory your data and the full schema of your data platforms like Snowflake. It’s a one-time exercise to create the first inventory, but ongoing schema monitoring is needed to ensure that new columns or other schema changes immediately trigger data inventory updates.
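The schema-monitoring step can be sketched in a few lines. This is a minimal illustration, not Immuta Discover's implementation: it assumes you already have two schema snapshots as plain dictionaries (for example, built from your platform's information schema), and the names `diff_schemas` and `columns_to_reclassify` are hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumption: a schema snapshot maps table name -> {column name: data type}.
Schema = Dict[str, Dict[str, str]]
Column = Tuple[str, str]  # (table, column)


def diff_schemas(previous: Schema, current: Schema) -> Dict[str, Set[Column]]:
    """Return the (table, column) pairs that were added, removed, or retyped."""
    prev_cols = {(t, c): dtype for t, cols in previous.items() for c, dtype in cols.items()}
    curr_cols = {(t, c): dtype for t, cols in current.items() for c, dtype in cols.items()}
    shared = set(prev_cols) & set(curr_cols)
    return {
        "added": set(curr_cols) - set(prev_cols),
        "removed": set(prev_cols) - set(curr_cols),
        "retyped": {key for key in shared if prev_cols[key] != curr_cols[key]},
    }


def columns_to_reclassify(changes: Dict[str, Set[Column]]) -> Set[Column]:
    """New and retyped columns need a fresh classification pass."""
    return changes["added"] | changes["retyped"]


previous = {"customers": {"id": "NUMBER", "name": "VARCHAR"}}
current = {"customers": {"id": "VARCHAR", "name": "VARCHAR", "email": "VARCHAR"}}
changes = diff_schemas(previous, current)
print(sorted(columns_to_reclassify(changes)))
```

Only the changed columns are queued for classification, which is what keeps the ongoing monitoring cheap compared to the initial full inventory.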
The data inventory process establishes sensitive data elements through out-of-the-box classifiers that find PII like email addresses, or PHI like Social Security numbers. Domain-specific sensitive data that’s unique to your organization is found automatically through custom classifiers. All those findings get categorized according to regulatory frameworks such as CCPA, GDPR, HIPAA, or PCI to ensure compliance at every level. To satisfy your data governance and security needs, on top of the regulatory frameworks, you also categorize all the data under your company-specific security framework by mapping it to multiple different sensitivity levels.
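To make the classify-then-categorize flow concrete, here is a rough sketch under stated assumptions. The regular expressions, the `LOYALTY_ID` custom classifier, the 80% match threshold, and the tag-to-framework and tag-to-sensitivity mappings are all illustrative placeholders, not legal guidance and not the classifiers Immuta Discover ships.

```python
import re
from typing import Dict, List

# Assumption: simple regex classifiers; production classifiers are more robust.
CLASSIFIERS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    # Hypothetical custom classifier for an organization-specific identifier.
    "LOYALTY_ID": re.compile(r"LOY-\d{8}"),
}

# Illustrative mappings only; your compliance team owns the real ones.
TAG_FRAMEWORKS = {"EMAIL": ["CCPA", "GDPR"], "US_SSN": ["CCPA", "HIPAA"], "LOYALTY_ID": []}
TAG_SENSITIVITY = {"EMAIL": "sensitive", "US_SSN": "highly sensitive", "LOYALTY_ID": "sensitive"}


def classify_column(samples: List[str], threshold: float = 0.8) -> List[str]:
    """Tag a column when at least `threshold` of its non-empty samples match."""
    values = [v.strip() for v in samples if v and v.strip()]
    if not values:
        return []
    tags = []
    for tag, pattern in CLASSIFIERS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= threshold:
            tags.append(tag)
    return tags


def categorize(tags: List[str]) -> Dict[str, Dict[str, object]]:
    """Attach regulatory frameworks and an internal sensitivity level to each tag."""
    return {
        tag: {"frameworks": TAG_FRAMEWORKS[tag], "sensitivity": TAG_SENSITIVITY[tag]}
        for tag in tags
    }


tags = classify_column(["123-45-6789", "987-65-4321", ""])
print(categorize(tags))
```

Matching on a share of sampled values rather than a single hit is one simple way to keep false positives down; the right threshold depends on your data.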
Ultimately, although your data is constantly changing, you automatically maintain a highly accurate and granular metadata inventory to understand what data you possess.
Why Today’s Data Environments Require a Different Data Discovery Tool
Today’s sensitive data discovery tools give you a shallow overview of your data corpus across a long list of platforms. They give you pointers on where you have sensitive data without the granularity to drive your column- or row-level access controls. They help you understand what data you possess according to a regulatory framework like HIPAA or PCI, but without the details needed to automate your audits or compliance reporting. Knowing that you need to drive east to west on a road map from New York to California is helpful, but ultimately insufficient to get from a specific point A to point B.
Existing tools promise a high degree of automation, yet their many false positives result in painful manual work that never stops. Although data gets scanned automatically, performance breaks down at scale, or you need to manually tune the computing resources of the scanners. Last but not least, your security team objects to the agent-based processing that requires taking data out of your data platform, and the associated data residency concerns give you pause.
At Immuta, we believe that data security should not be painful. We believe that you can innovate and move quickly, while at the same time protecting your data and adhering to your internal policies and external regulations. Technology and automation allow you to make the right trade-off decisions quickly. It all starts with highly accurate and actionable metadata. If you trust your metadata and if it’s actionable, you can leverage it to automatically grant access to data, mask sensitive information, and automate your audit reporting.
We built Immuta Discover to tackle those challenges and address them through a unique architecture that we designed in collaboration with the largest financial institutions, healthcare companies, and government agencies in the world. The cloud and AI paradigm requires a fundamentally different approach. You must assume that your data is dynamic, constantly changes, is unique, and is collected in a multitude of different geographies and legal jurisdictions. Immuta Discover is built for this new world and its specific demands.
Scalability Through In-Platform Processing
Identifying and classifying data requires analyzing and looking at the data – there’s no way around it. Immuta Discover does all the analysis and processing inside any data platform, including Snowflake and Databricks. It takes advantage of those platforms’ inherent scalability to enable you to analyze large amounts of data quickly, efficiently, and without the need for separate resource optimization for containers or Virtual Machines.
Data Residency Compliance By Design
By processing data directly inside the data platform, Immuta Discover automatically adheres to data residency and locality requirements. If you run your data warehouse or lake globally, across North America, the European Union, and Asia, Immuta processes the data in the region where your data is stored. No data ever leaves the data platform, and it will never move around across different cloud regions.
Improved Security and Simplicity Due To Agentless Scanning
In-platform processing greatly reduces risk and improves your data security posture. Provisioning agents, whether they’re in a container, Virtual Machine, or AMI, creates complexity and an unnecessary security risk. Not only can those agents become compromised, but their misconfiguration might lead to data leaking to other parts of your cloud infrastructure. An agentless approach can better leverage data platform optimizations to process data in place, instead of transferring it out to be re-optimized and analyzed elsewhere. This simplifies operations and increases efficiency for your infrastructure teams.
Cross-Platform Consistency
The advantages of in-platform processing are obvious, but implementing it across a multitude of platforms is challenging. Immuta helps bypass the obstacles by doing all the heavy lifting for you and building in specific implementations for each technology. Although all those implementations are ultimately different, Immuta abstracts the results to one standardized taxonomy, so you can have consistently accurate and granular metadata across all your data stores.
Granular Query-Level Classification
Immuta Discover classifies on a column level and instantaneously identifies schema changes. Only with that level of granularity and automation can you adhere to your audit requirements and understand what actions have been taken on your data. For example, if non-sensitive data is joined with sensitive data at query time, Immuta Discover will monitor and record that for your review. Continuous schema monitoring ensures schema changes never result in holes in your access controls and data security posture.
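The join example above can be illustrated with a small sketch. It assumes you already know which columns a query read and each column's sensitivity level; the level names come from the categories mentioned earlier, while the function names and the audit record layout are hypothetical and do not describe Immuta Discover's internals.

```python
from typing import Dict, Iterable, List

# Ordered from least to most sensitive.
LEVELS = ["non-sensitive", "sensitive", "highly sensitive"]


def column_level(column: str, column_levels: Dict[str, str]) -> str:
    """Assumption: unclassified columns are treated as highly sensitive (fail closed)."""
    return column_levels.get(column, "highly sensitive")


def query_sensitivity(columns_read: Iterable[str], column_levels: Dict[str, str]) -> str:
    """A query is as sensitive as the most sensitive column it reads."""
    ranks = [LEVELS.index(column_level(col, column_levels)) for col in columns_read]
    return LEVELS[max(ranks, default=0)]


def audit_record(user: str, columns_read: List[str], column_levels: Dict[str, str]) -> dict:
    """Record what was read and whether sensitive and non-sensitive data were combined."""
    seen = {column_level(col, column_levels) for col in columns_read}
    return {
        "user": user,
        "columns": sorted(columns_read),
        "sensitivity": query_sensitivity(columns_read, column_levels),
        "mixes_sensitive_and_non_sensitive": "non-sensitive" in seen and len(seen) > 1,
    }


levels = {"orders.total": "non-sensitive", "customers.ssn": "highly sensitive"}
record = audit_record("analyst_1", ["orders.total", "customers.ssn"], levels)
print(record)
```

Taking the maximum level across all columns read is a conservative design choice: a join never comes out less sensitive than its most sensitive input.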
Highly Accurate and Actionable Metadata
Trust in your metadata is critical for data security. To unblock your data consumers, you need to automate your data access controls. This requires first knowing that your classification and metadata are accurate and actionable. Immuta Discover provides you with highly accurate metadata and tags out-of-the-box, and assists you in fine-tuning the classification mechanism to deal with false positives quickly. That enables you to build policies that dynamically grant or restrict access to PII or PHI, depending on who is accessing it and what protections, like masking policies, you want to apply.
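As a rough sketch of what a tag-driven policy looks like in principle: the function below decides, per value, whether a user sees the raw data or a masked placeholder. The tag names, the entitlement names such as `pii-approved`, and the `[MASKED]` placeholder are assumptions made for illustration; this is not Immuta's policy syntax.

```python
from typing import Set

# Assumption: each sensitive tag requires a matching user entitlement.
REQUIRED_ENTITLEMENT = {"PII": "pii-approved", "PHI": "phi-approved"}
MASK = "[MASKED]"


def apply_policy(value: str, column_tags: Set[str], user_entitlements: Set[str]) -> str:
    """Return the raw value only if the user holds every entitlement the tags require."""
    needed = {REQUIRED_ENTITLEMENT[tag] for tag in column_tags if tag in REQUIRED_ENTITLEMENT}
    if needed <= user_entitlements:
        return value  # untagged data, or a fully entitled user
    return MASK


print(apply_policy("123-45-6789", {"PII"}, {"pii-approved"}))
print(apply_policy("123-45-6789", {"PII"}, set()))
```

Because the decision depends only on tags and entitlements, the same rule covers every column that carries the tag, which is why the accuracy of the underlying metadata matters so much.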
To see it in action, check out the video below.