I’m just back from attending NYU Information Law Institute’s “Algorithms and Explanations” conference. The focus was on governing machine learning, from medical and criminal justice applications to credit scoring—issues we think a lot about at Immuta.
And while I’m still digesting the presentations and discussions that took place, I thought I’d post two immediate takeaways here on the blog:
First, the techniques that we lump together and call “machine learning” are much older than is commonly recognized. In the early 1940s, for example, Warren McCulloch and Walter Pitts were famously writing about neural networks. In the late 1950s, Frank Rosenblatt introduced the perceptron, a core building block of many neural nets. And academic literature in the 1970s was already describing back-propagation, another key element of the powerful machine learning methods in vogue today.
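To show just how simple those early ideas were, here’s a minimal sketch of Rosenblatt-style perceptron learning on a toy linearly separable problem (logical AND). The function names and defaults are my own illustration, not any particular library’s API:

```python
# A minimal sketch of the perceptron learning rule on logical AND.
# Names and defaults here are illustrative, not from any library.
def train_perceptron(samples, labels, lr=1, epochs=10):
    w = [0] * len(samples[0])  # one weight per input feature
    b = 0                      # bias term
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # Step activation: fire (1) if the weighted sum exceeds zero.
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            # Update only on mistakes, nudging the boundary toward the example.
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# AND is linearly separable, so the rule is guaranteed to converge.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
```

That’s the whole algorithm, and it dates to the 1950s. What it lacked then wasn’t cleverness; it was data and compute.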
So why does machine learning seem so new?
The answer, I think, is that its power is new. The core components of machine learning might not be so recent, but over the last decade or so we’ve seen a massive increase in the data that organizations are able to collect and to process. And that’s unlocked many machine learning methods that have been around for years.
Which brings me to the second insight: The key to regulating machine learning, and to understanding machine learning models, lies in understanding the data itself. Because so many models are deeply opaque (they’re often called “inscrutable systems”), one of the best opportunities to understand how they work lies in understanding the data on which they are trained.
Features in the data itself, for example, might need to dictate what types of models are used, or how the models are deployed in the real world (and what risks each deployment poses). And that gives us at Immuta a whole host of ways to make governing machine learning easier. Because we sit as a unified access point for our customers’ data, we can track key features of the data and key facts about that data, which can make understanding models easier. Because of our central role in the data usage lifecycle, we can help guide model deployment in ways that increase accuracy and reduce risk.
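To make that concrete, here’s a hypothetical sketch (my own illustration, not Immuta’s actual API) of the idea that facts tracked about the training data can gate how a model gets built: a simple policy check that excludes columns tagged as sensitive before training begins.

```python
# Hypothetical sketch: governance metadata attached to the data itself
# decides which features a model may be trained on.
SENSITIVE_TAGS = {"pii", "protected_class"}

def allowed_features(column_tags):
    """Return the columns whose tags contain nothing sensitive."""
    return [col for col, tags in column_tags.items()
            if not (set(tags) & SENSITIVE_TAGS)]

# Toy metadata: column name -> governance tags tracked for that column.
columns = {
    "zip_code":  ["pii"],
    "age":       ["protected_class"],
    "purchases": [],
    "tenure":    [],
}

print(sorted(allowed_features(columns)))
```

A real deployment would be far richer than a tag lookup, but the shape of the idea is the same: the governance decision lives with the data, upstream of the model.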
Is any of this aspirational? Sure. The field of machine learning is still in its infancy, and even more so the ways to govern it. But the promise of machine learning is already proving too great to ignore, and organizations—from governments to enterprises to NGOs—are rushing to take advantage of its benefits as a result. The sooner we can find ways to minimize the risks machine learning entails, and the easier we can make that process, the more powerful machine learning will actually be.
Some recommended reading from the conference:
- Equality of Opportunity in Supervised Learning, by Moritz Hardt, Eric Price, and Nathan Srebro: https://pdfs.semanticscholar.org/f185/58735e70147174e11c5d81f191cc67ccf425.pdf
- Human Decisions and Machine Predictions, by Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan: https://www.cs.cornell.edu/home/kleinber/w23180.pdf
- Private Traits and Attributes Are Predictable from Digital Records of Human Behavior, by Michal Kosinski, David Stillwell, and Thore Graepel: http://www.pnas.org/content/110/15/5802.full
- The Foundations of Algorithmic Bias, by Zachary Lipton: http://approximatelycorrect.com/2016/11/07/the-foundations-of-algorithmic-bias/
- The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, by Chris Anderson: https://www.wired.com/2008/06/pb-theory/
- Statistical Modeling: The Two Cultures, by Leo Breiman: https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726