SQL: The Forgotten API

Data mesh has become a hot topic as organizations look to foster self-service analytics to make evidence-based decisions. There are plenty of articles discussing data mesh, including Zhamak Dehghani’s original 2019 blog post introducing the topic.

To get back to basics, data mesh is a socio-technical approach to creating a decentralized analytical data architecture. It organizes data by domains to make it more accessible by connecting data owners, producers, and consumers. Grounding the data mesh-enabling architecture are four guiding principles:

  • Domain Ownership
  • Data as a Product
  • Self-Service Data Infrastructure
  • Federated Computational Governance

Data products could be one of the most confusing pillars of the data mesh because they force organizations to rethink how they deliver data to customers using product thinking. Product thinking requires a cultural change: being externally focused rather than internally focused, solving problems rather than taking orders, and building value rather than breaking down tasks.

This change requires data product owners to have deep knowledge and understanding of the customer to deliver the right data product. It also forces data product owners to own the product over the entire lifecycle, adapting it to the customer’s needs and not just ending the project when the data is delivered. In other words, the data product must achieve product-market fit.

With this in mind, how do organizations deliver data as a product for business analysts and data scientists building dashboards and models? To meet the needs of these users, data must be delivered in a way that lets them keep using their existing tools and processes to produce the analytics they need. To do that, data products should be delivered using SQL.

What Is a Data Product?

Dehghani calls a data product an “architectural quantum,” and cites Evolutionary Architecture’s definition of the term: “The smallest unit of an architecture that can be independently deployed with high functional cohesion, and includes all the structural elements required for its function.”

If you’re like me, that feels too technical. Put simply, a data product is a valuable set of data for analytical consumption. While many would categorize dashboards, Excel spreadsheets, and ML/AI models as data products, I would argue that they are not, because you cannot combine them with other data to create new insights. At this point, it may sound like a data product is essentially just a data set.

What makes data products different is that they are created using product thinking. In other words, the data consumer is at the center, and the product is treated as an ever-changing entity meant to delight them. This is a mindset shift: simply providing data is no longer good enough; you now need to show that the data provides value to the organization.

To do this, new metrics such as net promoter score provide better insights than traditional technical metrics, such as the number of times the data was accessed. In addition, data products must have the following usability characteristics:

  • Discoverable through a central catalog;
  • Addressable through a unique global identifier (see the sketch after this list);
  • Trustworthy and truthful, providing service level objectives, data integrity/quality, lineage, and other key aspects;
  • Self-describing semantics and syntax, meaning they can be discovered, understood, and consumed without hand-holding;
  • Interoperable and governed by global standards; and
  • Secure and governed by global access control.
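
To make the first two of these characteristics concrete, here is a minimal sketch of how a consumer might discover and address a data product directly on a modern data platform. The schema and table names (sales, orders_daily) are hypothetical, and the exact catalog syntax varies by platform:

  -- Discoverable: browse the platform's information schema (or data catalog)
  -- to find the data products a domain has published.
  SELECT table_name
  FROM information_schema.tables
  WHERE table_schema = 'sales';

  -- Addressable: each data product has a stable, fully qualified identifier
  -- that consumers reference directly in their queries.
  SELECT order_id, order_date, order_total
  FROM sales.orders_daily
  LIMIT 10;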

These characteristics enable consumers to easily join, aggregate, and filter data for analysis using data from different domains. The question then becomes, how does a domain deliver a data product?

What’s Wrong with Web Services for Data Mesh?

Having worked with customers on their journey to creating a data mesh, I often see that their first instinct is to use a web service-based delivery model. Usually, the data product is delivered over a REST interface using JSON, but some customers have looked at using GraphQL or even providing an AWS S3 endpoint for a Parquet or CSV file.

Organizations fall back on web services because they are typically the way we deliver data via applications in a distributed architecture, like a data mesh. While these are perfectly valid options that meet the characteristics of a data product, they are not the most efficient way to deliver a data product. The number one reason is that a data mesh is built for analytics use cases, not operational systems.

The issue with using web services is that the data is difficult to join, filter, and aggregate, operations that are essential to delivering analytical results. If one data product delivers data via JSON and another via XML (yes, it is still used), then the data needs to be wrangled into a normalized form for analysis. Even if all of the data is in JSON, it must be brought into memory to perform the analysis.

This is inefficient because it becomes difficult to scale the BI tools and servers doing the work to produce a result. In most cases, the data ends up stored in a database as a duplicate copy that must then be kept up to date through additional data engineering work. In other words, delivering data products through APIs pushes teams into exactly the anti-pattern a data mesh implementation is meant to avoid.

Why SQL Is the Future of Data Product Delivery

History says that SQL will win out over web services for data product delivery in data mesh architectures. Why?

Enhanced Data Platform Capabilities

The first release of Hadoop in 2006 ushered in the “big data” and batch processing analytics revolution. Hadoop was exciting because it did not force analysts into the strict SQL constructs of traditional relational data warehouses.

As users worked with the data, they discovered the need to add structure to it. This led to the release of Apache Pig in 2008 and ultimately, Apache Hive in 2010. Hive brought the batch processing movement full circle by introducing SQL to the Hadoop platform.

Organizations forget that SQL can be used for data products because they are stuck in traditional RDBMS data warehouse thinking. Data platforms have evolved significantly since Hadoop was released, and today’s cloud technologies provide the perfect medium to deliver data products.

These new platforms decouple the compute layer from the storage layer, reducing the cost of owning and running an analytical data platform. Additionally, modern open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi let you take advantage of cheap object storage while providing RDBMS features like ACID compliance and capabilities such as time travel. This data platform revolution can be seen in Snowflake, Databricks, open source Trino and its commercial version Starburst, Google BigQuery, and others.
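
As a rough illustration, this is the kind of time travel query those table formats make possible. The example uses Trino’s Iceberg syntax and the same hypothetical table name as above; other platforms expose the feature through slightly different clauses:

  -- Query the current state of a data product...
  SELECT count(*) FROM sales.orders_daily;

  -- ...or the state it had at an earlier point in time (Iceberg time travel in Trino).
  SELECT count(*)
  FROM sales.orders_daily FOR TIMESTAMP AS OF TIMESTAMP '2023-01-01 00:00:00 UTC';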

Organizations often get stuck in legacy thinking that says you cannot grant access to directly query data. This too has changed, as modern data platforms provide security controls to deliver data products through direct access.

Query Language & Tool Integration

SQL is the industry standard for analytical workloads and an open standard developed by ISO/IEC Joint Technical Committee 1 (JTC 1) for Information Technology. Since all SQL systems base their query language on ISO/IEC 9075, the fear of vendor lock-in is reduced, because queries can be reused across systems with little to no change. This removes the need to use APIs to shield data product consumers from the underlying data platform technology.

SQL also checks all the boxes of a data product. Its integration with existing programming languages and BI tools allows analysts and data scientists to work within their preferred tools and workflows. And, since SQL lets analysts query exactly what they want from the underlying platform through joins, filters, and aggregates, it avoids the duplication and additional data engineering work that a data mesh is meant to prevent.
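
As a simple sketch of what that looks like in practice, the following query joins, filters, and aggregates hypothetical data products from two different domains in plain SQL, with no intermediate copies or wrangling code:

  -- Join, filter, and aggregate data products from two domains in a single query.
  -- All names (sales.orders_daily, marketing.campaigns) are illustrative.
  SELECT c.campaign_name,
         count(o.order_id)  AS orders,
         sum(o.order_total) AS revenue
  FROM sales.orders_daily o
  JOIN marketing.campaigns c ON o.campaign_id = c.campaign_id
  WHERE o.order_date >= DATE '2024-01-01'
  GROUP BY c.campaign_name
  ORDER BY revenue DESC;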

Most importantly, using tools such as Immuta to manage privacy and security policies allows you to securely query data products directly in your data platform of choice.

How Immuta Can Help

Immuta is a key component of a data mesh architecture, providing the computational governance to deliver data products as SQL through modern data platforms. Because Immuta enforces the data controls necessary to meet organizational and regulatory requirements, users can safely query data products using their existing tools, such as business intelligence tools or Jupyter notebooks. Instead of relying on in-house or custom solutions for data access control, Immuta’s policy automation unlocks the potential of data mesh for self-service analytics.

Because users query the data directly, rather than through an API, Immuta improves cybersecurity by logging query activity and enabling zero trust architectures at the data platform level. Combining user query information with network and application log data creates a complete picture of user activity to stop potential threats.

Immuta continually scans the data platform for sensitive data that needs to be protected, and automates access policy enforcement on that data. As new data products are onboarded, organizations can be assured that any sensitive data can be discovered and protected, lessening the possibility of data leakage.

Leveraging SQL to deliver data products within a data mesh architecture is the most practical choice for organizations aiming to achieve scalability and flexibility while unlocking true value from their cloud platforms. Layering Immuta into this equation helps ensure that no sensitive information slips through the cracks. With streamlined, efficient, and secure workflows, organizations will be better able to reap the full benefits of a data mesh architecture.

Simplify Data Mesh Implementation

Securing a data mesh doesn't have to be complicated. See how in our eBook.
