4 Ways a Data Science Company Enables Research at Scale

Published on August 17, 2020

Last updated on April 5, 2023

Heather Devane

Scalability is critical to driving accurate, applicable research-based predictions and insights. But without the proper tools, the process can be bogged down by information overload and by data privacy restrictions that are difficult to manage at scale. Together, Immuta and Databricks empower scalable quantitative research on sensitive data. Here’s how.

Imagine how your life would be different if you could predict the future. What would you have done differently? Perhaps more importantly, why didn’t you choose that path initially?

These are the types of questions WorldQuant Predictive seeks to answer for its clients before they make business decisions. How is WorldQuant Predictive able to anticipate such seemingly unpredictable outcomes? The answer lies in its team of more than 100 researchers, who have extensive backgrounds in AI, data science, machine learning, and data modeling, and in their access to hundreds of thousands of data sources.

These researchers frame business challenges as prediction problems and find ways of solving them using enhanced signal detection and modeling, combining a proven quantitative approach with public and proprietary data sets to enable clear, accurate predictions. The broad swath of accessible data is a goldmine — but also comes with substantial risk.

WorldQuant Predictive CTO Slava Frid provided a behind-the-scenes look at how his organization plans to form resilient prediction models and insights at scale without delaying or obstructing the research process. At the center of his method are Databricks and Immuta.

Here are four ways Databricks and Immuta enable organizations like WorldQuant Predictive to scale quantitative research on sensitive data more quickly, safely, and seamlessly than ever before:

1. Empowers researchers to work without unnecessary limitations

WorldQuant Predictive’s mission for its clients is to improve business decisions and ROI through predictive modeling. The company’s mission for itself is to do so in the fastest, most secure, and most cost-effective way possible. To meet these KPIs, researchers must be able to work as quickly and autonomously as possible. Therefore, WorldQuant Predictive allows them to add their own data sources, removing limitations that might unnecessarily restrict data access and utilization.

But if researchers are constrained by arbitrary restrictions or have hesitations about what they can and cannot upload, the depth and speed of their insights are severely limited. For a company like WorldQuant Predictive, which defines success based on speed to prediction and cost per prediction, hindrances stemming from data security ambiguity erode the business model and the quality of its researchers’ insights.

How does this look in practice? Take research on the spread of the COVID-19 virus. To gather as many hypotheses as possible to integrate into holistic, resilient prediction models, researchers needed to access and upload private and public data, epidemiological models, published COVID-19 reports, and demographic information to the WorldQuant Predictive workspace. Doing so makes the data available on demand in one location, but it also opens the data up to all other users, which risks exposing sensitive financial, personal, and healthcare information to the wrong people.

How would the company quickly make necessary data available to its researcher network without the risk of exposing sensitive information? Using Databricks and Immuta, WorldQuant Predictive has been able to achieve this in just a few easy steps:

A researcher uploads any data source to their WorldQuant Predictive workspace.
Apache Airflow picks up the new data set so Databricks Notebooks can analyze it.
Once analysis is complete, Databricks notifies the researcher with a link to a newly created Dashboard.
The researcher accesses the Dashboard to verify the findings. Their approval triggers the data to move into raw and trusted data stages.
Databricks and Immuta data sources are created and scanned for sensitive data so it can be edited as necessary, shared with other researchers, and mapped to the right data cluster.
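
The steps above can be read as a small staged pipeline. The following is a minimal Python sketch of that flow, not WorldQuant Predictive's implementation: it calls no Airflow, Databricks, or Immuta APIs, and the stage names, the `DataSource` class, and the regex-based scanner are all hypothetical stand-ins chosen for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical stage names mirroring the steps described in the article.
STAGES = ["uploaded", "analyzed", "approved", "raw", "trusted", "registered"]

# Toy patterns standing in for a real sensitive-data scanner (assumption).
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass
class DataSource:
    name: str
    rows: list  # list of dicts, one dict per record
    stage: str = "uploaded"
    tags: dict = field(default_factory=dict)


def advance(source, to_stage):
    """Move a source to the next stage; skipping or reversing stages is an error."""
    if STAGES.index(to_stage) != STAGES.index(source.stage) + 1:
        raise ValueError(f"cannot move from {source.stage} to {to_stage}")
    source.stage = to_stage


def scan_for_sensitive(source):
    """Return {column: {labels}} for columns whose values match a pattern."""
    tags = {}
    for row in source.rows:
        for column, value in row.items():
            if not isinstance(value, str):
                continue
            for label, pattern in SENSITIVE_PATTERNS.items():
                if pattern.search(value):
                    tags.setdefault(column, set()).add(label)
    return tags


def run_pipeline(source, researcher_approves):
    """Analyze the upload, wait for approval, then promote, scan, and register."""
    advance(source, "analyzed")
    if not researcher_approves:
        return source  # stays at "analyzed" until the researcher signs off
    for stage in ("approved", "raw", "trusted"):
        advance(source, stage)
    source.tags = scan_for_sensitive(source)
    advance(source, "registered")
    return source
```

In this sketch the researcher's approval is the only gate: nothing reaches the raw or trusted stages, and nothing is shared, until the findings have been verified and the source has been scanned.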

Immuta’s scalable, automated data access control provides trust and access that, when combined with Databricks’ speed and organization capabilities, has transformed the way WorldQuant Predictive’s researchers compile and share large data sets with the confidence of knowing no sensitive information is at risk.

“Databricks gives us scale and speed, Immuta gives us trust and privacy, Databricks and Immuta together are a good chunk of what we offer to our research team to work with,” says Frid. “The ability to deploy both together in a couple of days for proof of concept was pretty extraordinary.”

2. Streamlines data collection resources and integrations

One of the most burdensome and costly stumbling blocks organizations face is streamlining resources and determining their ROI. The COVID-19 pandemic and subsequent economic recession brought this reality to the forefront as businesses scrambled to audit their resources to determine which could be temporarily or indefinitely paused and which were necessary to keep moving forward.

Frid explained WorldQuant Predictive’s decision to invest in Databricks and Immuta as a way of streamlining the company’s data collection resources to avoid redundancy in its toolkit, while also scaling its capabilities to reach more researchers and have a greater impact. In assessing various enablement platforms, he asked the following questions:

What is our operational versus exploratory mix?
What other tools do we need our data collection platform to integrate with?
Do we need to repurpose existing dollars, and if so, to what extent?
How large and advanced is our team?

WorldQuant Predictive’s focus on empowering researchers to record their own findings and expand data source collections meant the operational versus exploratory mix was critical.

“Lots of organizations use Spark because their data volumes are just so large that it’s the only way to practically do what they need to do. That’s an operational use case,” says Frid. “We do a lot of exploratory [research and] we want to give researchers access to many data sources. So for us Notebooks were a big deal.”

Having a platform that supports both operational and exploratory needs, can be easily deployed, and seamlessly integrates with other technologies added to Databricks’ and Immuta’s appeal. Additionally, Immuta’s native integration made this solution more robust and cost-efficient than other options. As a result, WorldQuant Predictive is able to hit its speed-to-prediction and cost-per-prediction KPIs by enabling its researchers to source and upload data quickly and securely in a single platform.

3. Accelerates speed and organization

With hundreds of researchers referencing hundreds more data sources to create dozens of non-correlated hypotheses, it’s not hard to see how a data catalog at an organization like WorldQuant Predictive could quickly become overwhelmed, outdated, and disorganized. That slows researchers down and weakens their insights as information becomes harder to find, share, and use.

Take, for example, research on COVID-19’s potential economic disruption. Researchers looked at various micro- and macro-economic indicators, which were split into buckets, including credit card data, foot traffic data, digital marketing conversions, consumer spending trends, and US government and agency data. Each bucket could have as many as 50 data sources, which must be monitored and updated, particularly in such a fluid situation as the pandemic. With such a vast and fast-paced information flow, scaling with speed and structure would be impossible without a tool like Databricks.

“Even though not all of the data sets are large, as soon as you have a large catalog…it’s very difficult to join a few tens or hundreds of thousands of rows of data to something that has billions of rows of data, in a traditional environment,” says Frid. “The ability to quickly join across types of data regardless of where data is was a really big deal.”

As WorldQuant Predictive scales its model, Databricks has simplified researcher onboarding, reduced data upload and availability time, and accelerated query performance measurement, all without increasing researcher time or effort. This paves the way for more resilient predictions and insights to be gathered at a faster rate, driving timely action and results.

4. Provides context and transparency of data usage

With hundreds of thousands of data sources at researchers’ fingertips, how is it possible to track ways the data is being used and, even more critically, whether it’s being used correctly? The inability to answer this question puts organizations at risk of violating contractual and confidentiality agreements, which could have enduring ripple effects.

WorldQuant Predictive’s network of researchers has grown substantially and continues to grow. Databricks’ native Immuta integration gives administrators oversight of those researchers and lets them create distinct personas based on how each one interacts with and uses data. Just as companies monitor product usage to determine and communicate messages that resonate with different types of customers, Immuta enables administrators to set policies that deliver the right data to the right users without exposing irrelevant or sensitive information.

“Immuta allows us to say, ‘sure, this user is authenticated, but how are they actually using the data?’” says Frid. “By having this context, data made available through the same queries can be completely different.”
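
To make the idea of context-dependent results concrete, here is a minimal Python sketch in which the same rows come back differently depending on the purpose a user has declared. It does not use Immuta's API; the `UserContext` class, the `COLUMN_POLICY` table, and the column names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserContext:
    username: str
    purpose: str  # hypothetical attribute describing why the data is being used


# Hypothetical policy: column name -> purposes allowed to see raw values.
# Columns not listed here are treated as non-sensitive and always visible.
COLUMN_POLICY = {
    "patient_zip": {"epidemiology"},
    "card_number": set(),  # never shown in the clear
}


def apply_policy(rows, context):
    """Return a copy of rows with restricted columns redacted for this context."""
    result = []
    for row in rows:
        visible = {}
        for column, value in row.items():
            allowed = COLUMN_POLICY.get(column)
            if allowed is None or context.purpose in allowed:
                visible[column] = value
            else:
                visible[column] = "REDACTED"
        result.append(visible)
    return result
```

Two authenticated users running the same query would therefore see different values for `patient_zip` if only one of them is working under the `epidemiology` purpose.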

Additionally, Immuta’s differential privacy, k-anonymization, and randomized response capabilities introduce randomness into data sets that doesn’t alter their statistical properties but changes the data just enough to ensure its anonymization. By making subtle, context-based changes to the data that have no impact on its intended usage, Immuta further reinforces security without unnecessarily hindering researchers. Frid refers to the process as delivering “an on-the-fly synthetic data set that is statistically different but protects privacy.”
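
As an illustration of how randomness can protect individual records while keeping aggregate statistics usable, here is a minimal Python sketch of the classic randomized response technique on a yes/no attribute. It is a textbook version, not Immuta's implementation, and the function names and the `p_truth` parameter are assumptions made for this example.

```python
import random


def randomized_response(truth, p_truth, rng):
    """Report the true value with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(responses, p_truth):
    """Recover the population rate of True values from the noisy responses.

    The expected observed rate is p_truth * true_rate + (1 - p_truth) * 0.5,
    so solving for true_rate gives the estimator below.
    """
    observed = sum(responses) / len(responses)
    return (observed - (1 - p_truth) * 0.5) / p_truth
```

For example, passing each record's true value through `randomized_response` with `p_truth=0.5` means no single reported answer can be trusted on its own, yet `estimate_true_rate` applied to a large set of responses still lands close to the real proportion.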

The power to create privacy policies in data sets allows WorldQuant Predictive to scale research on sensitive data with the comfort of knowing all information is protected, auditable, and practical.

Today’s technological landscape puts the world at our fingertips. But with so much information available and constantly changing, experienced researchers are central to making sense of it so data can be used to create informed, resilient, and actionable business and policy decisions. Equally important, their methods must be scalable, fast, and secure.

Databricks and Immuta work together to allow researchers and organizations alike to share information broadly, generate hypotheses from a broad range of data sources, monitor client data and control access to it, and do it all quickly and confidently.

“One of the biggest things for us when taking Databricks and Immuta was that they work well together,” says Frid. “This is a nice set of features for us and our customers to feel comfortable with how we can protect their privacy and the privacy of the people they represent, and still do cutting edge research and data science.”
