This Guy Trains Computers to Find Future Criminals
Richard Berk says his algorithms take the bias out of
criminal justice. But could they make it worse?
by Joshua Brustein July 18, 2016
When historians look back at the turmoil over prejudice
and policing in the U.S. over the past few years, they’re unlikely to dwell on
the case of Eric Loomis. Police in La Crosse, Wis., arrested Loomis in February
2013 for driving a car that was used in a drive-by shooting. He had been
arrested a dozen times before. Loomis took a plea, and was sentenced to six
years in prison plus five years of probation.
The episode was unremarkable compared with the deaths of
Philando Castile and Alton Sterling at the hands of police, which were captured
on camera and distributed widely online. But Loomis’s story marks an important
point in a quieter debate over the role of fairness and technology in policing.
Before his sentence, the judge in the case received an automatically generated
risk score that determined Loomis was likely to commit violent crimes in the
future.
Risk scores, generated by algorithms, are an increasingly
common factor in sentencing. Computers crunch data—arrests, type of crime
committed, and demographic information—and a risk rating is generated. The idea
is to create a guide that’s less likely to be subject to unconscious biases,
the mood of a judge, or other human shortcomings. Similar tools are used to
decide which blocks police officers should patrol, where to put inmates in
prison, and whom to let out on parole. Supporters of these tools claim they’ll
help solve historical inequities, but their critics say they have the potential
to aggravate them, by hiding old prejudices under the veneer of computerized
precision. Some people see them as a sterilized version of what brought
protesters into the streets at Black Lives Matter rallies.
Loomis is a surprising fulcrum in this controversy: He’s
a white man. But when Loomis challenged the state’s use of a risk score in his
sentence, he cited many of the fundamental criticisms of the tools: that
they’re too mysterious to be used in court, that they punish people for the
crimes of others, and that they hold your demographics against you. Last week
the Wisconsin Supreme Court ruled against Loomis, but the decision validated
some of his core claims. The case, say legal experts, could serve as a jumping-off
point for legal challenges questioning the constitutionality of these kinds of
techniques.
To understand the algorithms being used all over the
country, it’s good to talk to Richard Berk. He’s been writing them for decades
(though he didn’t write the tool that created Loomis’s risk score). Berk, a
professor at the University of Pennsylvania, is a shortish, bald guy, whose
solid stature and I-dare-you-to-disagree-with-me demeanor might lead people to
mistake him for an ex-cop. In fact, he’s a career statistician.
His tools have been used by prisons to determine which
inmates to place in restrictive settings; parole departments to choose how
closely to supervise people being released from prison; and police officers to
predict whether people arrested for domestic violence will re-offend. He once
created an algorithm that would tell the Occupational Safety and Health
Administration which workplaces were likely to commit safety violations, but
says the agency never used it for anything. Starting this fall, the state of
Pennsylvania plans to run a pilot program using Berk’s system in sentencing
decisions.
As his work has been put into use across the country,
Berk’s academic pursuits have become progressively fantastical. He’s currently
working on an algorithm that he says will be able to predict at the time of
someone’s birth how likely she is to commit a crime by the time she turns 18.
The only limit to applications like this, in Berk’s mind, is the data he can
find to feed into them.
“The policy position that is taken is that it’s much more
dangerous to release Darth Vader than it is to incarcerate Luke Skywalker”
This kind of talk makes people uncomfortable, something
Berk was clearly aware of on a sunny Thursday morning in May as he headed into
a conference in the basement of a campus building at Penn to play the role of
least popular man in the room. He was scheduled to participate in the first
panel of the day, which was essentially a referendum on his work. Berk settled
into his chair and prepared for a spirited debate about whether what he does
all day is good for society.
The moderator, a researcher named Sandra Mayson, took the
podium. “This panel is the Minority Report panel,” she said, referring to the
Tom Cruise movie where the government employs a trio of psychic mutants to
identify future murderers, then arrests these “pre-criminals” before their
offenses occur. The comparison is so common it’s become a kind of joke. “I use
it too, occasionally, because there’s no way to avoid it,” Berk said
later.
For the next hour, the other members of the panel took
turns questioning the scientific integrity, utility, and basic fairness of
predictive techniques such as Berk’s. As it went on, he began to fidget in
frustration. Berk leaned all the way back in his chair and crossed his hands
over his stomach. He leaned all the way forward and flexed his fingers. He
scribbled a few notes. He rested his chin in one hand like a bored teenager and
stared off into space.
Eventually, the debate was too much for him: “Here’s what
I, maybe hyperbolically, get out of this,” Berk said. “No data are any good,
the criminal justice system sucks, and all the actors in the criminal justice
system are biased by race and gender. If that’s the takeaway message, we might
as well all go home. There’s nothing more to do.” The room tittered with
awkward laughter.
Berk’s work on crime started in the late 1960s, when he
was splitting his time between grad school and a social work job in Baltimore.
The city exploded in violence following the assassination of Martin Luther King
Jr. Berk’s graduate school thesis examined the looting patterns during the
riots. “You couldn’t really be alive and sentient at that moment in time and
not be concerned about what was going on in crime and justice,” he said. “Very
much like today with the Ferguson stuff.”
In the mid-1990s, Berk began focusing on machine
learning, where computers look for patterns in data sets too large for humans
to sift through manually. To make a model, Berk inputs tens of thousands of
profiles into a computer. Each one includes the data of someone who has been
arrested, including how old they were when first arrested, what neighborhood
they’re from, how long they’ve spent in jail, and so on. The data also contain
information about who was re-arrested. The computer finds patterns, and those
serve as the basis for predictions about which arrestees will re-offend.
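A minimal sketch of that workflow might look like the Python below. The feature names, the synthetic data, and the choice of a random-forest classifier are illustrative assumptions, not details of Berk’s actual models.

```python
# Illustrative sketch of training a re-arrest prediction model.
# Features, data, and model choice are hypothetical, not Berk's code.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20_000  # tens of thousands of arrest profiles

# Hypothetical features: age at first arrest, prior arrests, months in jail.
X = np.column_stack([
    rng.integers(12, 50, n),   # age at first arrest
    rng.poisson(3, n),         # number of prior arrests
    rng.exponential(12, n),    # months previously spent in jail
])
# Label: whether the person was later re-arrested (synthetic here).
y = rng.integers(0, 2, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The computer "finds patterns": fit a classifier to historical profiles.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# For a new arrestee, the model outputs a probability of re-arrest,
# which can then be binned into a low/medium/high risk score.
risk = model.predict_proba(X_test)[:, 1]
```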
To Berk, a big advantage of machine learning is that it
eliminates the need to understand what causes someone to be violent. “For these
problems, we don’t have good theory,” he said. Feed the computer enough data
and it can figure it out on its own, without deciding on a philosophy of the
origins of criminal proclivity. This is a seductive idea. But it’s also one
that comes under criticism each time a supposedly neutral algorithm in any
field produces worryingly non-neutral results. In one widely cited study,
researchers showed that Google’s automated ad-serving software was more likely
to show ads for high-paying jobs to men than to women. Another found that ads
for arrest records show up more often when searching the web for distinctly
black names than for white ones.
Computer scientists have a maxim, “Garbage in, garbage
out.” In this case, the garbage would be decades of racial and socioeconomic
disparities in the criminal justice system. Predictions about future crimes
based on data about historical crime statistics have the potential to equate
past patterns of policing with the predisposition of people in certain
groups—mostly poor and nonwhite—to commit crimes.
Berk readily acknowledges this as a concern, then quickly
dismisses it. Race isn’t an input in any of his systems, and he says his own
research has shown his algorithms produce similar risk scores regardless of
race. He also argues that the tools he creates aren’t used for punishment—more often
they’re used, he said, to reverse long-running patterns of overly harsh
sentencing, by identifying people whom judges and probation officers shouldn’t
worry about.
Berk began working with Philadelphia’s Adult Probation
and Parole Department in 2006. At the time, the city had a big murder problem
and a small budget. There were a lot of people in the city’s probation and
parole programs. City Hall wanted to know which people it truly needed to
watch. Berk and a small team of researchers from the University of Pennsylvania
wrote a model to identify which people were most likely to commit murder or
attempted murder while on probation or parole. Berk generally works for free,
and was never on Philadelphia’s payroll.
A common question, of course, is how accurate risk scores
are. Berk says that in his own work, between 29 percent and 38 percent of
predictions about whether someone is low-risk end up being wrong. But focusing
on accuracy misses the point, he says. When it comes to crime, sometimes the
best answers aren’t the most statistically precise ones. Just as weathermen
err on the side of predicting rain because no one wants to get caught without
an umbrella, court systems want technology that intentionally overpredicts the
chance that any given individual will commit a crime. The same person could end up being
described as either high-risk or not depending on where the government decides
to set that line. “The policy position that is taken is that it’s much more
dangerous to release Darth Vader than it is to incarcerate Luke Skywalker,”
Berk said.
“Every mark of poverty serves as a risk factor”
Philadelphia’s plan was to offer cognitive behavioral
therapy to the highest-risk people, and offset the costs by spending less money
supervising everyone else. When Berk posed the Darth Vader question, the parole
department initially determined it’d be 10 times worse, according to Geoffrey
Barnes, who worked on the project. Berk figured that at that threshold the
algorithm would name 8,000 to 9,000 people as potential pre-murderers.
Officials realized they couldn’t afford to pay for that much therapy, and asked
for a model that was less harsh. Berk’s team twisted the dials accordingly. “We’re
intentionally making the model less accurate, but trying to make sure it
produces the right kind of error when it does,” Barnes said.
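A back-of-the-envelope way to see how that dial works: in standard decision theory, if a missed re-offender (a false negative) is treated as r times more costly than an unnecessary intervention (a false positive), the probability cutoff that minimizes expected cost falls to 1/(1 + r). The sketch below uses the 10-to-1 ratio reported for Philadelphia; everything else is an illustrative assumption, not the parole department’s actual model.

```python
# Sketch: how an asymmetric cost ratio translates into a risk threshold.
# The 10:1 ratio is the one reported for Philadelphia; the rest is illustrative.

def cost_minimizing_threshold(false_negative_cost: float,
                              false_positive_cost: float = 1.0) -> float:
    """Probability cutoff that minimizes expected misclassification cost.

    Flag someone as high-risk when p(re-offend) exceeds
    c_FP / (c_FP + c_FN).
    """
    return false_positive_cost / (false_positive_cost + false_negative_cost)

# Treating a released "Darth Vader" as 10x worse than an incarcerated
# "Luke Skywalker" pushes the cutoff far below 50 percent:
print(cost_minimizing_threshold(false_negative_cost=10.0))  # ~0.09

# Softening the ratio (a "less harsh" model) raises the cutoff,
# so fewer people get flagged as potential re-offenders:
print(cost_minimizing_threshold(false_negative_cost=5.0))   # ~0.17
```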
The program later expanded to group everyone into high-,
medium-, and low-risk populations, and the city significantly reduced how
closely it watched parolees Berk’s system identified as low-risk. In a 2010
study, Berk and city officials reported that people who were given more lenient
treatment were less likely to be arrested for violent crimes than people with
similar risk scores who stayed with traditional parole or probation. People
classified as high-risk were almost four times more likely to be charged with
violent crimes.
Since then, Berk has created similar programs in
Maryland’s and Pennsylvania’s statewide parole systems. In Pennsylvania, an
internal analysis showed that between 2011 and 2014 about 15 percent of people
who came up for parole received different decisions because of their risk
scores. Those who were released during that period were significantly less
likely to be re-arrested than those who had been released in years past. The
conclusion: Berk’s software was helping the state make smarter decisions.
Laura Treaster, a spokeswoman for the state’s Board of
Probation and Parole, says Pennsylvania isn’t sure how its risk scores are
impacted by race. “This has not been analyzed yet,” she said. “However, it
needs to be noted that parole is very different than sentencing. The board is
not determining guilt or innocence. We are looking at risk.”
Sentencing, though, is the next frontier for Berk’s risk
scores. And using algorithms to decide how long someone goes to jail is proving
more controversial than using them to decide when to let people out early.
Wisconsin courts use Compas, a popular commercial tool
made by a Michigan-based company called Northpointe. By the company’s account,
the people it deems high-risk are re-arrested within two years in about 70
percent of cases. Part of Loomis’s challenge was specific to Northpointe’s
practice of declining to share specific information about how its tool
generates scores, citing competitive reasons. Not allowing a defendant to
assess the evidence against him violated due process, he argued. (Berk shares
the code for his systems, and criticizes commercial products such as
Northpointe’s for not doing the same.)
As the court was considering Loomis’s appeal, the
journalism website ProPublica published an investigation looking at 7,000
Compas risk scores in a single county in Florida over the course of 2013 and
2014. It found that black people were almost twice as likely as white people to
be labeled high-risk, then not commit a crime, while it was much more common
for white people who were labeled low-risk to re-offend than black people who
received a low-risk score. Northpointe challenged the findings, saying
ProPublica had miscategorized many risk scores and ignored results that didn’t
support its thesis. Its analysis of the same data found no racial disparities.
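The disagreement turns partly on which error rates get compared across groups. A rough sketch of the arithmetic involved, using invented confusion-matrix counts rather than the real Florida data:

```python
# Sketch of the group-wise error rates at issue in the ProPublica analysis.
# The counts below are invented for illustration; they are not the real data.

def error_rates(high_and_reoffended, high_no_reoffense,
                low_and_reoffended, low_no_reoffense):
    """Return (false positive rate, false negative rate) for one group."""
    fpr = high_no_reoffense / (high_no_reoffense + low_no_reoffense)
    fnr = low_and_reoffended / (low_and_reoffended + high_and_reoffended)
    return fpr, fnr

# Hypothetical counts for two groups scored by the same tool:
group_a = error_rates(300, 400, 150, 650)
group_b = error_rates(250, 200, 300, 750)

print(group_a)  # higher false positive rate: "high-risk" labels that never re-offend
print(group_b)  # higher false negative rate: "low-risk" labels that do re-offend
```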
Even as it upheld Loomis’s sentence, the Wisconsin
Supreme Court cited the research on race to raise concerns about the use of
tools like Compas. Going forward, it requires risk scores to be accompanied by
disclaimers about their nontransparent nature and various caveats about their
conclusions. It also says they can’t be used as the determining factor in a
sentencing decision. The decision was the first time that such a high court had
signaled ambivalence about the use of risk scores in sentencing.
Sonja Starr, a professor at the University of Michigan’s
law school and a prominent critic of risk assessment, thinks that Loomis’s case
foreshadows stronger legal arguments to come. Loomis made a demographic
argument, saying that Compas rated him as riskier because of his gender,
reflecting the historical patterns of men being arrested at higher rates than
women. But he didn’t frame it as an argument that Compas violated the Equal
Protection Clause of the 14th Amendment, which allowed the court to sidestep
the core issue.
Loomis also didn’t argue that the risk scores serve to
discriminate against poor people. “That’s the part that seems to concern
judges, that every mark of poverty serves as a risk factor,” Starr said. “We
should very easily see more successful challenges in other cases.”
Officials in Pennsylvania, which has been slowly
preparing to use risk assessment in sentencing for the past six years, are
sensitive to these potential pitfalls. The state’s experience shows how tricky
it is to create an algorithm through the public policy process. To come up with
a politically palatable risk tool, Pennsylvania established a sentencing
commission. It quickly rejected commercial products like Compas, saying they
were too expensive and too mysterious, so the commission began creating its own
system.
Race was discarded immediately as an input. But every
other factor became a matter of debate. When the state initially wanted to
include location, which it determined to be statistically useful in predicting
who would re-offend, the Pennsylvania Association of Criminal Defense Lawyers
argued that it was a proxy for race, given patterns of housing segregation. The
commission eventually dropped the use of location. Also in question: the
system’s use of arrests, instead of convictions, since it seems to punish
people who live in communities that are policed more aggressively.
Berk argues that eliminating sensitive factors weakens
the predictive power of the algorithms. “If you want me to do a totally
race-neutral forecast, you’ve got to tell me what variables you’re going to
allow me to use, and nobody can, because everything is confounded with race and
gender,” he said.
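One way to see both Berk’s point and the defense lawyers’ worry about proxies is a toy simulation: even after race is dropped as an input, a variable correlated with it, such as neighborhood, lets a model reproduce much of the same disparity. The data and the strength of the correlation below are invented for illustration.

```python
# Toy simulation: a feature correlated with race acts as a proxy for it,
# even when race itself is excluded from the model. All data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 10_000

race = rng.integers(0, 2, n)                       # protected attribute (never a model input)
neighborhood = (race + (rng.random(n) < 0.3)) % 2  # correlated with race via segregation
priors = rng.poisson(2, n)                         # prior arrests, independent of race here

# Outcome generated from neighborhood and priors only.
p = 1 / (1 + np.exp(-(0.8 * neighborhood + 0.4 * priors - 1.5)))
rearrest = rng.random(n) < p

# The model never sees race, only neighborhood and priors.
X = np.column_stack([neighborhood, priors])
scores = LogisticRegression().fit(X, rearrest).predict_proba(X)[:, 1]

# Yet average predicted risk still differs by race, via the proxy.
print(scores[race == 1].mean() - scores[race == 0].mean())
```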
Starr says this argument confuses the differing standards
in academic research and the legal system. In social science, it can be useful
to calculate the relative likelihood that members of certain groups will do
certain things. But that doesn’t mean a specific person’s future should be
calculated based on an analysis of population-wide crime stats, especially when
the data set being used reflects decades of racial and socioeconomic disparities.
It amounts to a computerized version of racial profiling, Starr argued. “If the
variables aren’t appropriate, you shouldn’t be relying on them,” she said.
Late this spring, Berk traveled to Norway to meet with a
group of researchers from the University of Oslo. The Norwegian government
gathers an immense amount of information about the country’s citizens and
connects each of them to a single identification file, presenting a tantalizing
set of potential inputs.
Torbjørn Skardhamar, a professor at the university, was
interested in exploring how he could use machine learning to make long-term
predictions. He helped set up Berk’s visit. Norway has lagged behind the U.S.
in using predictive analytics in criminal justice, and the men threw around a
few ideas.
Berk wants to predict at the moment of birth whether
people will commit a crime by their 18th birthday, based on factors such as
environment and the history of a new child’s parents. This would be almost
impossible in the U.S., given that much of a person’s biographical information
is spread out across many agencies and subject to many restrictions. He’s not
sure if it’s possible in Norway, either, and he acknowledges he also hasn’t
completely thought through how best to use such information.
Caveats aside, this has the potential to be a capstone
project of Berk’s career. It also takes all of the ethical and political
questions and extends them to their logical conclusion. Even in the movie
Minority Report, the government peered only hours into the future—not years.
Skardhamar, who is new to these techniques, said he’s not afraid of making
mistakes: They’re talking about them now, he said, so they can avoid future
errors. “These are tricky questions,” he said, mulling all the ways the project
could go wrong. “Making them explicit—that’s a good thing.”