
What Legal Scholars Should Learn About Machine Learning

This is an excerpt from the article “What Legal Scholars Should Learn About Machine Learning” by David Lehr & Paul Ohm.


Legal scholars have begun to focus intently on machine learning — the name for a large family of techniques used for sophisticated new forms of data analysis that are becoming key tools of prediction and decision-making. We think this burgeoning scholarship has tended to treat machine learning too much as a monolith and an abstraction, largely ignoring some of its most consequential stages. As a result, many potential harms and benefits of automated decision-making have not yet been articulated, and policy solutions for addressing those impacts remain underdeveloped.

To fill these gaps in legal scholarship, in this Article we provide a rich breakdown of the process of machine learning. We divide this process roughly into eight steps: problem definition, data collection, data cleaning, summary statistics review, data partitioning, model selection, model training, and model deployment. Far from a straight linear path, most machine learning dances back and forth across these steps, whirling through successive passes of model building and refinement.
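For readers who find code easier to follow than prose, the eight steps can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' method: the data, the one-parameter threshold "model," and every name (`raw`, `predict`, `running_model`, and so on) are hypothetical and chosen purely to make each step visible.

```python
import random
import statistics

# Step 1, problem definition (hypothetical): predict label 1 when a score is high.
# Step 2, data collection: toy hand-made records; None marks a missing value.
raw = [(0.1, 0), (0.2, 0), (None, 1), (0.35, 0), (0.4, 0),
       (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1), (0.15, 0)]

# Step 3, data cleaning: here we simply drop records with a missing feature.
clean = [(x, y) for x, y in raw if x is not None]

# Step 4, summary statistics review: inspect the data before modeling.
mean_x = statistics.mean(x for x, _ in clean)

# Step 5, data partitioning: hold out part of the data for testing.
rng = random.Random(0)  # fixed seed so the split is reproducible
shuffled = clean[:]
rng.shuffle(shuffled)
split = int(0.7 * len(shuffled))
train, test = shuffled[:split], shuffled[split:]

# Step 6, model selection: a deliberately simple one-parameter threshold rule.
def predict(threshold, x):
    return 1 if x >= threshold else 0

def accuracy(threshold, rows):
    return sum(predict(threshold, x) == y for x, y in rows) / len(rows)

# Step 7, model training: pick the threshold with the best training accuracy.
candidates = sorted(x for x, _ in train)
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# Step 8, model deployment: the "running model" that decides on new inputs.
def running_model(x):
    return predict(best_threshold, x)
```

Even in this toy, several human choices (dropping incomplete records, the 70/30 split, the family of models considered) are made before `running_model` ever sees a new input.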

Simplifying this mapping, we contend that legal scholars should think of machine learning as consisting of two distinct workflows: “playing with the data,” which comprises the first seven steps of our breakdown, and “the running model,” which describes a machine-learning algorithm deployed and making decisions in the real world. Our core claim is that almost all of the significant legal scholarship to date has focused on the implications of the running model — the predictive policing algorithm directing the deployment of officers, the face recognition system identifying suspects, or the autonomous automobile navigating a turn — and has neglected most of the possibilities and pitfalls of playing with the data. Particularly in the fields of criminal justice and criminal procedure, machine-learning systems are seen as inscrutable black boxes by scholars focused on the Fourth Amendment.

Black boxes also sit at the heart of important work by Frank Pasquale and Danielle Citron who, together and separately, have authored important articles on the rise of automated decision-making in many contexts, such as the delivery of government benefits and credit scoring. As important as we find this work, we think it can be strengthened by giving more attention to machine learning’s playing-with-the-data stages.

A few notable and important articles pay some attention to playing with the data. Most significantly, an article by Solon Barocas and Andrew Selbst on bias in employment decision-making focuses astutely on problems that creep in during data collection. The article does tend to neglect many of the other stages of playing with the data, but we recognize this as a side effect of the topic they are studying — the problem of bias. Bias seems to emerge in data-related stages most directly and perhaps even exclusively.

By not paying attention to other stages of machine learning, scholars have overlooked the fact that the two workflows of machine learning give rise to very different issues. The potential harms and benefits that can creep in while playing with the data differ from those of the running model. For example, Barocas and Selbst documented the “garbage in, garbage out” problem, which can make machine-learning models discriminatory, but, from the vantage point of the running model, this “garbage” is a static, unavoidable feature of the data. Only one who is attentive to the many ways in which data can be selected and shaped — say, during data cleaning or model training — will characterize fully the source of the stink. Similarly, a benefit of choosing certain machine-learning algorithms is the ability to place weight on particular types of errors over others — for example, to favor false negatives over false positives in criminal justice contexts — but this choice is one that must be made when playing with the data.
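The point about weighting one type of error over another can be made concrete with a small sketch. It assumes a model that outputs a risk score between 0 and 1 and a decision rule that flags anyone at or above a threshold; the function `choose_threshold`, the cost parameters, and the toy scores and labels are all hypothetical illustrations, not anything drawn from the article.

```python
def choose_threshold(scores, labels, cost_fp, cost_fn):
    """Return the threshold that minimizes total weighted error cost.

    A score at or above the threshold is predicted positive (label 1).
    cost_fp and cost_fn are the penalties assigned to each false positive
    and each false negative; setting them is a human policy choice.
    """
    best_t, best_cost = None, float("inf")
    # Try every observed score, plus "never predict positive".
    for t in sorted(set(scores)) + [float("inf")]:
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# Toy data: model scores and the true outcomes they are meant to predict.
scores = [0.1, 0.3, 0.45, 0.5, 0.55, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1]

# Treat a false positive as five times worse than a false negative.
cautious = choose_threshold(scores, labels, cost_fp=5, cost_fn=1)

# Reverse the weights: a false negative is now five times worse.
aggressive = choose_threshold(scores, labels, cost_fp=1, cost_fn=5)
```

On this toy data the first weighting yields a higher threshold than the second, so fewer people are flagged. The choice of weights is fixed before deployment, which is why it belongs to playing with the data rather than to the running model.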

Another reason legal scholars in particular need to focus on playing with the data is that combatting harms at the running-model stage is often too little too late. Because playing with the data occurs earlier in time and entails much more human involvement than the running model, this phase provides more opportunities and behavioral levers for policy prescriptions. As many have documented, a running model is often viewed as an inscrutable black box, but there are opportunities for auditing (record-keeping requirements, keystroke loggers, etc.) and mandated interpretability during playing with the data. We can ban certain approaches — say, deep learning techniques like convolutional neural nets, if our concern is inscrutability — during playing with the data, but, with a running model, all we can do is rue the choice that has already been made. These possibilities may be neither necessary nor sufficient to address the potential harms of machine learning, but they are likely to be missed by those with a single-minded focus on the running model.

Greater attention to playing with the data can also advance contemporary debates about machine learning. Regulation skeptics and industry members often rely on descriptions of machine learning as “more art than science,” but we think this inappropriately assumes that black-box algorithms have black-box workflows; as we show, the steps of playing with the data are actually quite articulable. Additionally, many commentators have argued that we must preserve a “human in the loop” of machine learning, but most of them are referring to the running model as the relevant loop. We think there are different — perhaps more imperative — reasons to maintain humans in the underappreciated playing-with-the-data loop as well. We are not saying that scholars ought to neglect the running model.

The best assessments of the promises, perils, and prescriptions for automated decision-making will consider both phases of machine learning. However, widening the view to encompass earlier stages will be crucial for solving some seemingly intractable problems of our increasingly automated world.

Our Article proceeds in three parts. Part I surveys the burgeoning literature at the intersection of law and machine learning, highlighting the relatively limited conception of machine learning these articles seem to adopt. Part II provides a detailed explication of the stages of machine learning. Here, we highlight aspects of machine learning that have yet to figure prominently in legal scholarship. Finally, Part III builds on this primer, demonstrating how a more complete understanding of machine learning will help us diagnose harms we have not yet recognized, as well as benefits and prescriptions we have not yet tried to deploy.

If you are interested in this topic, you can get the full article below.

David Lehr & Paul Ohm, Playing with the Data: What Legal Scholars Should Learn About Machine Learning, 51 U.C. DAVIS L. REV. 653 (2017), available at

© David Lehr & Paul Ohm
This article is from the free online course AI For Lawyers (II): Tools for Legal Professionals.
