This is the second part of a two-part blog post on Data Connective Tissue; here is Part 1 if you missed it.
Where do people document their data science work to collaborate with colleagues & document data sources? Wiki? Git? Share drive? Notebooks?
— Donald Miner (@donaldpminer) February 23, 2016
One of our well-respected data science colleagues posed this question on Twitter the other day, and it’s a great question. The answer seems like it should be obvious at this stage of technology and data science, but it isn’t. The data connective tissue discussed in the first part of this two-part blog is the foundation you need to get there. In this second part of the series, I’ll explain how a data foundation can create and nurture the collaborative environment your Data Scientists and Business Intelligence Analysts are craving (and need, so that critical mistakes aren’t made).
Establishing the persisted data knowledge foundation, which can turn an outsider into an expert in your company’s data holdings in a matter of days, is just the starting point, as discussed in the first part of this blog. That’s really the “tissue”; you now need to connect it. Digging deeper, I believe there are two answers to Don’s question: documenting data sources is the foundation, and the access patterns to those data sources are what establish true collaboration.
We at Immuta are somewhat of an anti-API shop. Why? Well, we are here to enable data science, not force hands. Data Scientists and Business Intelligence Analysts have enough on their plates; they are paid to analyze. As soon as you require them to learn another nascent API, you’ve lost your audience. Instead, we believe in presenting single sign-on access to your data foundation through the lowest common denominator, either SQL or the filesystem.
This allows the data to be accessed in a completely language-agnostic way (maybe saying SQL is language agnostic is a stretch, but I’d argue any data science language has SQL integration).
Once you have agnostic access patterns to the knowledge foundation of data you’ve created, you’re ready to build the “tissue connections”. Let me explain this a bit further. I briefly mentioned the fragility of using Git for establishing your foundation in the first blog. It is fragile because everyone is still using their own or shared connections across various database technologies throughout your organization. You need a single access point for your data; this helps with code portability and also serves as a security injection point. For example:
You can see we are connected to PostgreSQL and listing the tables. This connection actually represents the single sign-on access to the data sources across your organization. Remember from Part 1 that a data source could represent a complex join across tables. In fact, the “loans” table actually lives in SQL Server, not PostgreSQL, and it is itself a join across tables in SQL Server; however, it’s available through this PostgreSQL connection as a single table. Through our data platform, you can attach any SQL technology (Hive, Oracle, etc.) with your representative data sources and have it exposed as if it lives in the single PostgreSQL database (it does not). You can also create new data sources from other data sources. Imagine joining our loans data source (which is a join itself in SQL Server) to data in Hive, also exposed as a data source, to create a brand new data source that carries the captured knowledge, dos and don’ts, nuances, and purposes with it for discovery and usage.
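To make the idea of a join exposed as a single table more concrete, here is a minimal sketch. It is only an illustration under loud assumptions: Python’s built-in sqlite3 module stands in for the PostgreSQL connection so the code runs anywhere, a plain SQL view stands in for a data source, and the table names, column names, and rows are all hypothetical. It does not show how the platform itself is implemented.

```python
import sqlite3

# Stand-in for the single access point (assumption: the real connection is
# PostgreSQL; an in-memory SQLite database is used only so this runs anywhere).
conn = sqlite3.connect(":memory:")

conn.executescript("""
    CREATE TABLE borrowers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE loan_records (id INTEGER PRIMARY KEY, borrower_id INTEGER, amount REAL);
    INSERT INTO borrowers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO loan_records VALUES (10, 1, 5000.0), (11, 2, 12000.0);

    -- The hypothetical "loans" data source: a join that consumers see as one table.
    CREATE VIEW loans AS
        SELECT r.id AS loan_id, b.name AS borrower, r.amount
        FROM loan_records r JOIN borrowers b ON b.id = r.borrower_id;
""")


def list_data_sources(connection):
    """Return the names of the views (our stand-in data sources) on this connection."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'view' ORDER BY name"
    )
    return [name for (name,) in rows]


def read_data_source(connection, name):
    """Read all rows of one data source, ordered by its first column."""
    # Only names that list_data_sources reports are accepted, so no arbitrary
    # text is ever interpolated into the SQL statement.
    if name not in list_data_sources(connection):
        raise ValueError(f"unknown data source: {name}")
    return connection.execute(f'SELECT * FROM "{name}" ORDER BY 1').fetchall()


print(list_data_sources(conn))
print(read_data_source(conn, "loans"))
```

The consumer only ever sees `loans`; the underlying tables and the join stay out of the analytic code, which is the property the single access point is meant to provide.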
Similarly, we have the filesystem access pattern, where each data source is represented as a directory. That’s right, you have filesystem access to data sources across your organization; it’s almost like a shared drive that represents your databases. It’s all done through virtualization, and no data are copied unless they are read, similar to how PostgreSQL acts as a passthrough to the other database technologies. So, just like SQL, listing my directories (in Jupyter) I see:
(You can chunk up the data within those directories into subdirectories and files based on your data source configuration, but that is too much for this blog.)
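The directory-per-data-source idea can be sketched in a few lines of Python. This is a hypothetical mock-up, not the real virtualized filesystem: a temporary directory stands in for the mount, and the data source names and file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def build_demo_mount(root):
    """Create a fake mount with one directory per data source (names are hypothetical)."""
    layout = {"loans": ["part-0.csv"], "customers": ["part-0.csv", "part-1.csv"]}
    for source, file_names in layout.items():
        directory = Path(root) / source
        directory.mkdir(parents=True)
        for file_name in file_names:
            (directory / file_name).write_text("placeholder\n")


def list_data_sources(root):
    """Treat every top-level directory under the mount as one data source."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


def list_files(root, source):
    """List the files that make up one data source."""
    return sorted(p.name for p in (Path(root) / source).iterdir() if p.is_file())


mount = tempfile.mkdtemp()  # stand-in for the shared mount point
build_demo_mount(mount)
print(list_data_sources(mount))
print(list_files(mount, "customers"))
```

Because this is ordinary filesystem access, any language or tool that can read a directory can discover and read the data sources without a dedicated client library.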
Now that you have that understanding, let’s delve into why this is so critical and creates the “connective” in the “data connective tissue”:
Portability and Reuse
We were pretty excited to see the latest announcement from Kaggle (a site that hosts data science competitions, where the best data scientists around the world compete to solve hard problems). Kaggle released “Kaggle Scripts”; at its core, it is a way to share and collaborate on data visualizations and algorithms through Jupyter Notebook. What is not immediately evident, but makes this all work, is that when Kaggle spins up the Jupyter container for the Data Scientist to write their “script” and share it with others, it attaches the data behind the competition to the filesystem visible to Jupyter. That’s right: anyone can take that script and fork it for themselves, because the data are ubiquitous across the scripts. You see my point: if everyone accesses data the same way, through the same patterns, the data almost fade into the background and you are left to focus on the algorithm at hand. Kaggle Scripts creates the data connective tissue for the competition’s data, just as we strive to create your organization’s data connective tissue across all your holdings and also provide the capability to build and share scripts against those data. We also enable rapid productization of those scripts through this same paradigm using Docker technology, but that’s for another day/blog.
Single sign-on access to the data across your organization has a very large side benefit: it provides a single injection point for all policy logic. Instead of each database owner managing groups and access, data sources can include policies when they are hooked in. Those policies can restrict access to data objects, or to columns within those objects, and policies do not have to be only security driven; they can also be workflow driven. For example, if you only want the last 30 days of data to be used in reporting, that can be a policy against that data source. Let’s see this in action. We have two users, Mike and Jeff: Jeff can see everything, while Mike has some rows and columns restricted:
As you can see, Mike sees less data and has personal information in two columns hidden from him. Note, again, that this is the same PostgreSQL connection string, just logged in as each user, querying a loans data source created from a SQL Server join with these fine-grained policies included. The filesystem acts the exact same way.
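Here is a minimal sketch of how row and column policies applied at a single injection point might behave. Everything in it is an assumption made for illustration: the policy structure, the column names, the sample rows, and the choice to give Mike a 30-day window are all hypothetical, and real enforcement would happen in the data platform rather than in the analyst’s Python code.

```python
from datetime import date

TODAY = date(2016, 3, 1)  # fixed date so the example is reproducible

# Hypothetical rows of the "loans" data source.
LOANS = [
    {"loan_id": 1, "name": "A. Borrower", "email": "a@example.com",
     "amount": 5000, "issued": date(2016, 2, 20)},
    {"loan_id": 2, "name": "B. Borrower", "email": "b@example.com",
     "amount": 12000, "issued": date(2015, 11, 5)},
]

# Hypothetical per-user policies attached to the data source.
POLICIES = {
    "jeff": {"masked": set(), "max_age_days": None},           # sees everything
    "mike": {"masked": {"name", "email"}, "max_age_days": 30},  # restricted
}


def query_loans(user, today=TODAY):
    """Return the loans rows as the given user is allowed to see them."""
    policy = POLICIES[user]
    visible = []
    for row in LOANS:
        age_days = (today - row["issued"]).days
        # Row-level policy: drop rows older than the allowed window.
        if policy["max_age_days"] is not None and age_days > policy["max_age_days"]:
            continue
        # Column-level policy: hide the values of restricted columns.
        visible.append(
            {col: ("[hidden]" if col in policy["masked"] else val)
             for col, val in row.items()}
        )
    return visible


print(len(query_loans("jeff")), "rows for jeff")
print(query_loans("mike"))
```

Both users call the same function with the same query; only the identity differs, so the policy logic lives in one place instead of in every database or every script.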
The implication is that if Mike were to fork a Python analytic written by Jeff that reads from the filesystem and creates some visualizations, Mike would see what he’s allowed to see and Jeff would see what he’s allowed to see, with no data connection code to change at all: the exact same code executes for both. You are able to abstract away where the data are coming from and how they are being secured, and focus on the analytics.
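The forking scenario can be sketched as follows. This is an illustration under stated assumptions: in the setup described above, both users would read the same path and the platform would decide what each login sees, whereas here two temporary directories stand in for each user’s view of the mount. The file layout, column names, and rows are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def write_view(root, rows):
    """Materialize one user's (hypothetical) view of the loans data source as a CSV."""
    path = Path(root) / "loans" / "loans.csv"
    path.parent.mkdir(parents=True)
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["state", "amount"])
        writer.writeheader()
        writer.writerows(rows)


def total_by_state(mount):
    """The shared analytic: identical code for every user, only the visible data differ."""
    totals = {}
    with (Path(mount) / "loans" / "loans.csv").open(newline="") as handle:
        for row in csv.DictReader(handle):
            totals[row["state"]] = totals.get(row["state"], 0.0) + float(row["amount"])
    return totals


# Stand-ins for what each user sees at the same mount point.
jeff_mount, mike_mount = tempfile.mkdtemp(), tempfile.mkdtemp()
write_view(jeff_mount, [{"state": "MD", "amount": 5000}, {"state": "VA", "amount": 12000}])
write_view(mike_mount, [{"state": "MD", "amount": 5000}])  # a policy removed the VA row

print(total_by_state(jeff_mount))
print(total_by_state(mike_mount))
```

Note that `total_by_state` contains no connection strings, credentials, or policy checks; that is what makes the analytic safe to fork.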
So, back to Don’s question: if you expose a mechanism to catalog and capture knowledge about your data sources, make that knowledge discoverable, and expose the data through access patterns that are natural and agnostic, you provide the connectivity needed to truly begin sharing and collaborating effectively, securely, and accurately.