
Q: What’s the fastest way for management to lose faith in their newly formed data science team?

A: Bad predictions and invalid reporting leading to embarrassment, or worse, costing the company money.

This happens not only because the data science team isn’t skilled enough in the art of data science or business intelligence; just as often, if not more often, the fault lies in missing or misinterpreted data. If you’re not seeing the full picture, or you believe an “x” is really a “y”, you are inadvertently creating “garbage out”.

There’s always quite a bit of focus on which algorithms to use and on inventive feature engineering; I would call this the “Science” in Data Science. However, organizations seem to struggle with the “Data” part of Data Science because it seems easy and obvious from the start. It isn’t easy. It’s not that the data aren't there (they’re probably sitting in a big heaping mess in your Hadoop cluster); it’s that the connective tissue between the data and the science is too weak.

Data have dos and don'ts, nuances, and purposes. People understand those purposes, nuances, and dos and don’ts, and they may be people you know, don’t know, or may never know. So how do you commoditize the understanding of data across your organization and make it feel ubiquitous? How do you stop the “garbage in”? The answer is your data connective tissue: it allows the knowledge of many, gathered over long periods of time, to persist and seem commonplace, which makes accurate insights possible. Without it, you are risking the future of your data-driven aspirations.

So what are some examples of data connective tissue? I’ve seen teams that store their SQL queries in git. Those commits include the dos and don’ts, nuances, purposes, and versioning: what is in this column, and why you shouldn’t use these data for a purpose they look like they should serve. But more often than not, it’s up to the individual members of the team to tap someone on the shoulder to get a connection string to a database, or an account on the Hadoop cluster. It’s also on them to ask questions rather than assume what the data mean, if they can even find the right person to ask. Worse are the security and compliance implications. Should your team all share the same SQL connection? Do they really need to see the PII details in your data? Git sounds like a good solution, but it’s brittle: it is self-service and self-enforced, it isn’t discoverable, it assumes everyone has connectivity to everything, and it ignores security altogether.

We at Immuta believe in enabling data science teams, and it starts with this data connective tissue. Teams should build on each other’s knowledge: create a “data source” based on a SQL statement, perhaps a very complex join, then name it and save it, for example as “worldwide store transacting accounts”. Each data source has a lineage back to its data owner. Knowledge about the data source is captured within the source, sub-queries within it are captured and presented, scripts that process it are linked to it, and users can fork and/or version it, with permission from the original data owner. Data sources can be tagged for discovery, shopped for, suggested by an under-the-covers recommendation engine, and subscribed to.
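To make the idea concrete, here is a minimal Python sketch of what such a data source record could carry. It is an illustration only, not Immuta's API: the `DataSource` class, its fields, the `fork` and `find_by_tag` helpers, and the SQL text are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataSource:
    """Hypothetical record for a named, documented, SQL-backed data source."""
    name: str
    sql: str                       # the query that defines the source
    owner: str                     # the data owner, kept for lineage
    notes: tuple = ()              # dos and don'ts, nuances, purposes
    tags: frozenset = frozenset()  # used for discovery
    version: int = 1
    forked_from: Optional[str] = None

    def fork(self, new_name, new_owner, sql, approved_by):
        """Create a derived source; only the original owner can approve it."""
        if approved_by != self.owner:
            raise PermissionError("fork requires the original data owner's permission")
        return DataSource(
            name=new_name,
            sql=sql,
            owner=new_owner,
            notes=self.notes,      # accumulated knowledge travels with the fork
            tags=self.tags,
            forked_from=f"{self.name}@v{self.version}",
        )


def find_by_tag(catalog, tag):
    """Discovery: return every data source carrying the given tag."""
    return [source for source in catalog if tag in source.tags]


# Illustrative usage with made-up table and column names.
accounts = DataSource(
    name="worldwide store transacting accounts",
    sql="SELECT a.account_id, s.region FROM accounts a "
        "JOIN stores s ON a.store_id = s.store_id",
    owner="alice",
    notes=("region is the store's region, not the customer's",),
    tags=frozenset({"retail", "accounts"}),
)
emea = accounts.fork(
    new_name="EMEA store transacting accounts",
    new_owner="bob",
    sql=accounts.sql + " WHERE s.region = 'EMEA'",
    approved_by="alice",
)
catalog = [accounts, emea]
```

The point of the sketch is that the notes, tags, and owner live with the query itself, so a fork inherits what earlier users learned and still records where it came from.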

Data source access patterns are also important. You should not pass around multiple, varying connections to all your technologies and data silos. Instead, data sources should look like just that: a single source behind a single connection string. You need a single sign-on to all the data across your company, with a unified point for security injection. You can join across data sources and across database technologies through your single connection, and then share that insight and build more data sources. It’s a literal store of data knowledge: slices of data with purposes, and their associated analytics. A new data scientist will feel like a veteran in days.
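As a small-scale analogy for joining across sources through one connection, the sketch below uses Python's standard `sqlite3` module and SQLite's `ATTACH DATABASE` to expose two separate databases behind a single connection. This is only an analogy, not how Immuta works; the `crm` schema name, the tables, and the values are invented for the example.

```python
import sqlite3

# One connection; a second, separate database is attached under the name "crm".
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

# Two "silos": transactions live in the main database, accounts in crm.
conn.execute("CREATE TABLE main.transactions (account_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.accounts (account_id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO main.transactions VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)
conn.executemany(
    "INSERT INTO crm.accounts VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC")],
)

# A single query joins across both databases through the one connection.
rows = conn.execute(
    """
    SELECT a.region, SUM(t.amount)
    FROM main.transactions AS t
    JOIN crm.accounts AS a ON a.account_id = t.account_id
    GROUP BY a.region
    ORDER BY a.region
    """
).fetchall()
print(rows)
```

The analyst writes one query against one connection and never handles a second set of credentials, which is the access pattern described above.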

Feeling like a data veteran removes doubt, allows innovation, and most importantly, avoids mistakes. It’s the data connective tissue which provides this.

This is the first in a two-part blog series on data connective tissue. In Part 2 we’ll explain how the dynamic security provided by this paradigm can be extremely powerful, and how access patterns beyond SQL can provide ultimate flexibility in analytic sharing and code portability.