Data Observability vs. Data Detection
The modern data stack is continually evolving, and confusion can arise when new frameworks experience a sudden spike in popularity, or when complementary solutions with overlapping functionality enter the market. Recently, data observability has emerged both as a new way of thinking about data quality and as a new technology category.
However, given its broad scope, there’s been some confusion within the data community on the differences between data observability, data detection, and data monitoring. This article will provide an overview of the similarities and differences between data observability and data detection, and describe how companies can use each capability to improve transparency across the data stack.
Data Observability vs. Data Detection: Similarities & Differences
Data observability and data detection both involve visibility into and information about data, but the terms shouldn’t be used interchangeably. On a functional level, observability and detection share some overlap, but companies deploy them in different ways to accomplish different goals.
What is Data Observability?
Data observability is a DevOps-inspired approach to ensuring data quality. It provides data engineers with full visibility into the health of their data, across the ingestion, storage, transformation, and analytics layers of the data stack. Just as site reliability engineers (SREs) use automated monitoring and alerting to stay informed of any application errors or downtime, data engineers use the same approach to get real-time insights into the health of their data pipelines and assets. Additionally, data observability includes lineage, or a map of the relationships between upstream sources and downstream consumers that helps data teams resolve incidents faster.
What is Data Detection?
Data detection involves surfacing insights about data and is accomplished through data monitoring, or tracking data over time. The goal of data detection is usually to identify anomalies, specific incidents, or risks.
As organizations work with increasing amounts of sensitive data, compliance expectations rise and data breaches become a greater threat. Data detection helps teams continuously monitor how data is used and identify when changes to security configurations or classifications occur. With timely insights into potentially risky user behavior, teams can proactively remediate risks and improve data security posture.
Outcomes of Data Observability vs. Data Detection
Data detection is made possible by data monitoring, which is a component of observability. The two concepts are closely related, but teams can use them to accomplish very different goals.
The core purpose of observability is to make data more trustworthy and reliable. For example, in addition to monitoring and alerting, leading data observability platforms include end-to-end lineage that enables more efficient root cause analysis and faster time-to-resolution for data incidents.
Meanwhile, data detection surfaces insights into how a company’s data is used or changed, which can serve many purposes beyond observability. For instance, data detection techniques can improve security by monitoring changes in user behavior and entitlements across modern data stacks, or prove compliance by maintaining data audit trails.
When to Use Data Observability vs. Data Detection
Just as data observability and data detection have different goals, they also have different use cases. Some of these overlap, but there are distinct situations that call for observability rather than detection, and vice versa.
Questions Data Observability Can Answer
Data observability can tell teams if thresholds around freshness, volume, and quality haven’t been met, and automatically alert relevant team members when anomalies occur. Some questions that data observability tools can answer include:
- Is this data accurate and up-to-date?
- Does this data fall within an expected range?
- Is this data complete?
- When was the schema last changed?
- What are the downstream dependencies of this data incident?
For example, if a data asset typically updates every 24 hours, but the data hasn’t changed for two days, observability tooling will surface an alert. Similarly, if a table usually has 200 million rows, but suddenly drops to 5 million, observability tooling will notify the relevant data owner.
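The two checks described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular platform works: the function names, the 24-hour interval, and the 50% volume tolerance are all assumptions chosen to mirror the example.

```python
from datetime import datetime, timedelta


def check_freshness(last_updated, now, expected_interval=timedelta(hours=24)):
    """Return True when the asset was updated within its expected interval."""
    return (now - last_updated) <= expected_interval


def check_volume(row_count, expected_rows, tolerance=0.5):
    """Return True when row_count is within a fractional tolerance of expected_rows.

    The 50% default tolerance is an assumed value for illustration only.
    """
    return abs(row_count - expected_rows) <= tolerance * expected_rows


# Hypothetical asset state: last updated two days ago, 5 million rows
# where roughly 200 million are expected.
now = datetime(2024, 1, 3, 12, 0)
last_updated = now - timedelta(days=2)

alerts = []
if not check_freshness(last_updated, now):
    alerts.append("freshness")
if not check_volume(5_000_000, 200_000_000):
    alerts.append("volume")

print(alerts)
```

In this sketch both checks fail, so both alert types are collected. A real tool would route them to the data owner instead of printing them.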
Additionally, when a data asset or pipeline breaks, observability tooling includes lineage that can help data engineers identify upstream and downstream dependencies. So, for example, if a table hasn’t been updated for two days, lineage can help an engineer conduct a root cause analysis by investigating the upstream source of the data. Similarly, lineage allows data teams to mitigate the impact of unreliable data by identifying any downstream reports and their users.
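To make the lineage idea concrete, here is a small sketch that models lineage as a directed graph and walks it in both directions. The asset names and the dictionary layout are hypothetical. Real platforms typically derive lineage from query logs or metadata rather than a hand-written mapping.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed in its value.
LINEAGE = {
    "crm_export": ["raw_orders"],
    "raw_orders": ["orders_clean"],
    "orders_clean": ["revenue_report", "churn_dashboard"],
    "revenue_report": [],
    "churn_dashboard": [],
}


def downstream(graph, start):
    """Breadth-first walk returning every asset that depends on `start`."""
    seen = []
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(graph.get(node, []))
    return seen


def upstream(graph, start):
    """Return every asset that `start` depends on, by walking reversed edges."""
    reversed_graph = {}
    for parent, children in graph.items():
        for child in children:
            reversed_graph.setdefault(child, []).append(parent)
    return downstream(reversed_graph, start)


# Root cause analysis: which sources feed the stale table?
print(upstream(LINEAGE, "orders_clean"))
# Impact analysis: which reports consume it?
print(downstream(LINEAGE, "orders_clean"))
```

The upstream walk supports root cause analysis, while the downstream walk identifies the reports, and by extension the users, affected by an incident.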
Questions Data Detection Can Answer
Data detection can surface insights based on metadata, usage logs, and other cataloged information. For example, if a team wanted to use data detection to improve data security, they could answer questions such as:
- Does a table contain sensitive data? What columns are sensitive?
- What data access activity took place in the last 24 hours?
- What are the busiest sources containing sensitive data?
- Which users accessed sensitive data most often?
- Who are the users that accessed a specific table?
- What tags changed, and who changed them?
For example, if a data governance team needs to identify which data sources contain personally identifiable information (PII), data detection can provide those insights. Similarly, if a data audit is underway and a team needs to generate a log of which users most frequently accessed sensitive data, data detection can catalog and surface that information.
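Several of the questions above reduce to simple aggregations over an access log. The sketch below assumes a hypothetical log format (a list of records with `user` and `table` fields) and a hypothetical set of tables already classified as sensitive. It is not the output format of any specific tool.

```python
from collections import Counter

# Hypothetical classification and access log, for illustration only.
SENSITIVE_TABLES = {"customers", "payments"}
ACCESS_LOG = [
    {"user": "ana", "table": "customers"},
    {"user": "ben", "table": "web_events"},
    {"user": "ana", "table": "payments"},
    {"user": "cho", "table": "customers"},
    {"user": "ana", "table": "web_events"},
]


def sensitive_access_counts(log, sensitive_tables):
    """Count how many times each user accessed a sensitive table."""
    return Counter(
        entry["user"] for entry in log if entry["table"] in sensitive_tables
    )


def users_for_table(log, table):
    """Return the sorted, de-duplicated users who accessed a given table."""
    return sorted({entry["user"] for entry in log if entry["table"] == table})


counts = sensitive_access_counts(ACCESS_LOG, SENSITIVE_TABLES)
print(counts.most_common(1))                      # most frequent sensitive-data user
print(users_for_table(ACCESS_LOG, "customers"))   # who touched this table
```

The same pattern extends to time windows (for example, the last 24 hours) by filtering on a timestamp field before aggregating.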
By aggregating this information and continuously monitoring data use, some data detection tools can also provide a profile of an organization’s risk levels. This profile helps teams proactively add or establish data security measures, and it can also help win leadership buy-in for increased attention to and investment in data security.
Top Tools for Data Observability and Data Detection
Data observability platforms typically include features like automated monitoring and alerting across the data stack, including data warehouses, lakehouses, transformation, and BI layers. End-to-end lineage should be included to enable better root cause analysis and incident resolution, and some platforms use machine learning to automatically set thresholds based on historical data trends. The most popular data observability platforms include Monte Carlo, Metaplane, and Databand.
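As a rough illustration of setting thresholds from historical trends, the sketch below derives bounds from the mean and standard deviation of past row counts. This simple statistical rule is a stand-in chosen for clarity. It is not the method used by the platforms named above, and the history values and the multiplier `k` are assumptions.

```python
from statistics import mean, stdev


def learn_bounds(history, k=3.0):
    """Derive lower and upper bounds as mean +/- k sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, history, k=3.0):
    """Return True when value falls outside the bounds learned from history."""
    low, high = learn_bounds(history, k)
    return value < low or value > high


# Hypothetical daily row counts, in millions of rows.
history = [198, 201, 200, 199, 202, 200, 201]

print(is_anomalous(200, history))  # a typical day
print(is_anomalous(5, history))    # the sudden drop from the earlier example
```

Because the bounds come from the data itself, no one has to hand-pick a threshold per table, which is the appeal of this approach at scale.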
Data monitoring and detection tools may provide coverage across the entire data stack, or be limited to the data storage layer. Features to look for include automated alerts when anomalies or potential issues are detected, as well as reporting capabilities. Top performers include Dynatrace and Datadog, while Immuta’s data security platform specializes in data detection for data security teams, and offers user behavior analytics, risk severity scoring, and integration with SIEM technologies.
Do You Need Data Observability and Data Detection?
Smaller organizations with less complex pipelines and fewer data assets may be able to do without data observability solutions, instead relying on manual testing to check for incidents. However, for organizations with larger amounts of data, or those that are looking to sustain long-term growth, automated monitoring, detection, and observability will allow teams to scale without compromising on data quality or the availability of data insights. Companies that manage sensitive data will find data detection especially useful for meeting compliance requirements and mitigating security risks.
Ultimately, observability and detection complement one another and help organizations improve the quality, security, and visibility of their data.