The data landscape is in a constant state of evolution, with new technologies and practices taking shape by the minute. When we published last year’s Data Engineering Survey: 2021 Impact Report, we wanted to get a baseline understanding of the major trends in cloud data use and what the coming years might bring. This year, we partnered with Gradient Flow to build on that first-year data with a survey aimed at understanding data teams’ technology choices and the challenges they face in preparing data for consumption.
This year’s data yielded interesting insights about the maturity of various DataOps and engineering processes, the popularity of various cloud data platforms and tools, and the top challenges faced by data engineers today. As data use continues its rapid evolution, data leaders and practitioners can benefit from understanding how these trends may shape their resource decisions, both human and technological.
The 2022 State of Data Engineering Report covers the emerging challenges – and what is further complicating them – in fine detail. Here’s a sneak peek at the top obstacles we uncovered. Stay tuned for additional articles highlighting the top cloud data platforms that data-driven organizations are adopting, and why investment in multiple platforms is accelerating.
1. Data Quality & Validation
With organizations relying on data more heavily than ever, and data architectures becoming more complex, it’s no wonder that data quality and validation topped the list of challenges cited by survey respondents.
BI, data science, and analytics initiatives are jeopardized from the start if the data quality is poor – yet, 27% of survey respondents were unsure what (if any) data quality solution their organization leverages. That number jumped to 39% for organizations with low DataOps maturity, indicating that data quality may not be a priority for data teams in the early stages of building their data infrastructures and strategies.
All organizations – but particularly startups and those in the process of building a data framework – would be wise to consider how they will approach data quality and validation early on to avoid major hurdles.
2. Data Discovery
In 2020 alone, the average amount of data created each day was 2.5 quintillion bytes. That number is projected to grow to 463 exabytes by the end of 2025 – an increase of more than 18,000% in daily data (to put things into perspective, there are 1 quintillion bytes in an exabyte). The scale is almost impossible to grasp, yet data teams are expected to keep up with an ever-growing number of data sources. That could be one of the reasons why an estimated 80% of an organization’s data goes unused.
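For those inclined to check the math, a quick back-of-the-envelope calculation using the figures above confirms the percentage:

```python
# Back-of-the-envelope check of the growth figures cited above.
# Assumes 1 exabyte = 1 quintillion bytes (10**18), as stated in the text.

BYTES_PER_EXABYTE = 10**18

daily_2020 = 2.5 * BYTES_PER_EXABYTE    # 2.5 quintillion bytes created per day in 2020
daily_2025 = 463 * BYTES_PER_EXABYTE    # projected 463 exabytes per day by end of 2025

percent_increase = (daily_2025 - daily_2020) / daily_2020 * 100
print(f"{percent_increase:,.0f}%")      # prints 18,420% – "more than 18,000%"
```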
Automated data discovery tools help data teams understand what data exists, who owns it, and who has access to it – without wasting precious time on manual data discovery. As the amount of sensitive data collected and used increases, tools that provide automated sensitive data discovery will be most efficient at easing this burden with minimal overhead.
3. Data Masking & Anonymization
According to survey respondents, three out of four organizations already collect and store sensitive data – a finding consistent with the previous year’s results. However, the advanced techniques needed to protect and de-identify personal data remain elusive.
One of the primary challenges is that techniques like k-anonymization and differential privacy can be complex to implement, particularly across diverse cloud environments. Manually creating and applying privacy controls consistently on each platform is not only time- and resource-intensive, but it can also increase the surface area for risk. Making matters worse, ensuring that masking and anonymization techniques satisfy the requirements of various data use rules and regulations can be convoluted due to unclear anonymization standards and definitions.
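To give a sense of why these techniques are non-trivial, here is a toy sketch of the most basic k-anonymity check. The records and quasi-identifiers are hypothetical, and real implementations (generalization, suppression, l-diversity, and so on) are far more involved:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records (a basic k-anonymity check)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical records with already-generalized quasi-identifiers
records = [
    {"age_band": "30-39", "zip3": "945", "diagnosis": "A"},
    {"age_band": "30-39", "zip3": "945", "diagnosis": "B"},
    {"age_band": "40-49", "zip3": "946", "diagnosis": "A"},
]

# False: the ("40-49", "946") group contains only one record,
# so that individual could still be re-identified.
print(is_k_anonymous(records, ["age_band", "zip3"], k=2))
```

Even this simplified check has to be applied per dataset, per platform – which is exactly where the manual overhead described above comes from.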
Still, since Gartner estimates that 60% of organizations will rely on privacy-enhancing technologies (PETs) by 2025, organizations with an automated approach to dynamic data masking will have a leg up on the competition.
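As a rough illustration of what dynamic data masking means in practice, the sketch below redacts a sensitive column at read time based on a user attribute. The function names and the "pii_reader" attribute are invented for the example – this is not any vendor’s actual API:

```python
def mask_email(value):
    """Redact the local part of an email address, keeping the domain."""
    local, _, domain = value.partition("@")
    return f"***@{domain}" if domain else "***"

def apply_masking(row, user_attrs, masked_columns):
    """Return a copy of the row with sensitive columns masked,
    unless the user holds the (hypothetical) 'pii_reader' attribute."""
    if "pii_reader" in user_attrs:
        return dict(row)
    return {col: mask_email(val) if col in masked_columns else val
            for col, val in row.items()}

row = {"name": "Ada", "email": "ada@example.com"}
print(apply_masking(row, user_attrs=set(), masked_columns={"email"}))
# prints {'name': 'Ada', 'email': '***@example.com'}
```

The key point is that the underlying data is never copied or altered – the mask is applied dynamically, per query, based on who is asking.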
4. Data Monitoring & Auditing
The proliferation of data compliance laws and regulations is impacting organizations across all industries, geographies, and sizes. Eighty-eight percent of survey respondents reported that their organizations are subject to at least one regulation. While the most well-known and wide-reaching laws and frameworks – GDPR, HIPAA, CCPA, and SOC 2 – topped the increasingly long list of regulations, almost one-third of data teams must also comply with internal, company-specific rules.
The need to adhere to a broad range of data use mandates puts additional strain on data teams to ensure that data access policies:
- Account for differing regulatory guidelines
- Are written in an easily understandable manner for transparency and efficiency with legal and compliance teams
- Are implemented consistently across platforms
- Can be audited on-demand to prove compliance
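The requirements above can be sketched, in miniature, as a single policy function that both decides and records every access – the audit-on-demand point in particular depends on capturing decisions as they happen. Everything here (the field names, the rule itself) is illustrative:

```python
import json
from datetime import datetime, timezone

AUDIT_LOG = []

def check_access(user, resource, purpose):
    """Evaluate one declarative rule and record the decision for audit.
    Illustrative only; real platforms evaluate many such rules."""
    allowed = (purpose in resource["approved_purposes"]
               and user["region"] == resource["region"])
    AUDIT_LOG.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user["id"],
        "resource": resource["name"],
        "purpose": purpose,
        "allowed": allowed,
    })
    return allowed

user = {"id": "analyst-7", "region": "eu"}
resource = {"name": "claims", "region": "eu",
            "approved_purposes": {"fraud-review"}}
check_access(user, resource, "fraud-review")   # True – and logged
print(json.dumps(AUDIT_LOG[-1], indent=2))
```

Because every decision flows through one readable rule, the same policy can be shown to legal teams, enforced on every platform, and replayed from the log during an audit.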
Given this, it was not surprising to find that data monitoring and auditing ranked among the most challenging tasks for data teams. The rapid adoption of multiple cloud data platforms, compounded by the exponential growth of users, regulations, and platforms, creates an incredibly complex data audit trail that requires automation.
Further magnifying challenges with data monitoring and auditing is the fact that many organizations are still using legacy data access control methods. Last year’s survey data revealed that more than 80% of respondents’ organizations relied on role-based access control (RBAC) or “all-or-nothing” access policies. The management overhead of these traditional approaches, and the subsequent “role explosion” they cause, makes data use monitoring and auditing much more difficult for data teams.
To put this into perspective, a study by GigaOm found that Apache Ranger’s RBAC method increased policy burden by 75x versus attribute-based access control (ABAC) policies, as offered by Immuta’s platform. If managing role-based approaches to access control is that much more challenging, it’s safe to assume that the downstream effects on monitoring and auditing will only multiply.
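To see where "role explosion" comes from, consider a toy model in which access depends on three attributes. An RBAC approach needs roughly one role per attribute combination, while an ABAC approach expresses the same logic as a single rule over the attributes. The numbers below are illustrative, not GigaOm’s methodology:

```python
from itertools import product

departments = ["finance", "hr", "sales", "eng"]
regions = ["us", "eu", "apac"]
sensitivity_levels = [0, 1, 2]   # e.g. public, internal, restricted

# RBAC: roughly one role (with its own policy) per attribute combination
rbac_roles = list(product(departments, regions, sensitivity_levels))

# ABAC: one rule written over the attributes themselves
def abac_policy(user, resource):
    """Single attribute-based rule covering every combination above."""
    return (user["department"] == resource["department"]
            and user["region"] == resource["region"]
            and user["clearance"] >= resource["sensitivity_level"])

print(len(rbac_roles))   # prints 36 – thirty-six roles vs. one ABAC rule
```

Add one more attribute, or one more value per attribute, and the role count multiplies while the ABAC rule stays a single readable function – which is the dynamic behind the 75x figure above.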
The challenges data teams face aren’t limited to data engineers. For data platform owners, data architects, and data consumers, completing data initiatives and achieving goals becomes immensely harder when DataOps processes are manual or inefficient. For leadership teams, unreliable data or delayed insights can lead to poor decision making and slow down data-powered innovation. For data-driven organizations today, the headaches caused by suboptimal data engineering practices threaten to increase tensions and lead to risky workarounds.
Yet, equipping teams with modern data access control, data discovery, and data auditing capabilities can accelerate speed to insights, enable greater scalability, and ensure the right data gets into the right hands at the right time.
To read the full findings of the survey, download the 2022 State of Data Engineering Report. If you want to see how Immuta helps simplify the most challenging tasks in the data pipeline, start your free trial today.