The Computers are Listening
How the NSA Converts Spoken Words Into Searchable Text
Most people realize that emails and other digital
communications they once considered private can now become part of their
permanent record.
But even as they increasingly use apps that understand what they say, most
people don’t realize that the words they speak are not so private anymore,
either.
Top-secret documents from the archive of former NSA contractor Edward
Snowden show the National Security Agency can now automatically recognize the
content within phone calls by creating rough transcripts and phonetic
representations that can be easily searched and stored.
The documents show NSA analysts celebrating the development of what they
called “Google for Voice” nearly a decade ago.
Though perfect transcription of natural conversation apparently remains the
Intelligence Community’s “holy grail,”
the Snowden documents describe
extensive use of keyword searching as well as computer programs designed to
analyze and “extract” the content of voice conversations, and even use
sophisticated algorithms to flag conversations of interest.
The documents include vivid examples of the use of speech recognition in war
zones like Iraq and Afghanistan, as well as in Latin America. But they leave
unclear exactly how widely the spy agency uses this ability, particularly in
programs that pick up considerable amounts of conversations that include people
who live in or are citizens of the United States.
Spying on international telephone calls has always been a staple of NSA
surveillance, but the requirement that an actual person do the listening meant
it was effectively limited to a tiny percentage of the total traffic. By
leveraging advances in automated speech recognition, the NSA has entered the
era of bulk listening.
And this has happened with no apparent public oversight, hearings or
legislative action. Congress hasn’t shown signs of even knowing that it’s going
on.
The USA Freedom Act — the surveillance reform bill that Congress
is currently debating — doesn’t address the topic at all. The bill would
end an NSA program that does not collect voice content: the government’s bulk
collection of domestic calling data, showing who called whom and for how long.
Even if it becomes law, the bill would leave in place a multitude of mechanisms
exposed by Snowden that scoop up vast amounts of innocent people’s text and
voice communications in the U.S. and across the globe.
Civil liberties experts contacted by The Intercept said the NSA’s
speech-to-text capabilities are a disturbing example of the privacy invasions
that are becoming possible as our analog world transitions to a digital one.
“I think people don’t understand that the economics of surveillance have
totally changed,” Jennifer Granick, civil liberties director at the Stanford Center for Internet and Society, told The Intercept.
“Once you have this capability, then the question is: How will it be
deployed? Can you temporarily cache all American phone calls, transcribe all
the phone calls, and do text searching of the content of the calls?” she said.
“It may not be what they are doing right now, but they’ll be able to do it.”
And, she asked: “How would we ever know if they change the policy?”
Indeed, NSA officials have been secretive about their ability to convert
speech to text, and how widely they use it, leaving open any number of
possibilities.
That secrecy is the key, Granick said. “We don’t have any idea how many
innocent people are being affected, or how many of those innocent people are
also Americans.”
I Can Search Against It
NSA whistleblower Thomas Drake, who was trained as a voice processing
crypto-linguist and worked at the agency until 2008, told The Intercept
that he saw a huge push after the September 11, 2001 terror attacks to turn the
massive amounts of voice communications being collected into something more
useful.
Human listening was clearly not going to be the solution. “There weren’t
enough ears,” he said.
The transcripts that emerged from the new systems weren’t perfect, he said.
“But even if it’s not 100 percent, I can still get a lot more information. It’s
far more accessible. I can search against it.”
Converting speech to text makes it easier for the NSA to see what it has
collected and stored, according to Drake. “The breakthrough was being able to
do it on a vast scale,” he said.
More Data, More Power, Better Performance
The Defense Department, through its Defense Advanced Research Projects
Agency (DARPA), started funding academic and commercial research into
speech recognition in the early 1970s.
What emerged were several systems to turn speech into text, all of which
slowly but steadily improved as they were able to work with more data and
at faster speeds.
In a brief interview, Dan Kaufman, director of DARPA’s Information
Innovation Office, indicated that the government’s ability to automate transcription
is still limited.
Kaufman says that automated transcription of phone conversations is “super
hard,” because “there’s a lot of noise on the signal” and “it’s informal as
hell.”
“I would tell you we are not very good at that,” he said.
In an ideal environment like a news broadcast, he said, “we’re getting
pretty good at being able to do these types of translations.”
A 2008
document from the Snowden archive shows that transcribing news
broadcasts was already working well seven years ago, using a program called
Enhanced Video Text and Audio Processing:
(U//FOUO) EViTAP is a fully-automated news monitoring tool. The key feature
of this Intelink-SBU-hosted tool is that it analyzes news in six languages,
including Arabic, Mandarin Chinese, Russian, Spanish, English, and
Farsi/Persian. “How does it work?” you may ask. It integrates Automatic Speech
Recognition (ASR) which provides transcripts of the spoken audio. Next, machine
translation of the ASR transcript translates the native language transcript to
English. Voila! Technology is amazing.
A version of the system the NSA uses is now even available commercially.
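The flow the memo describes is a simple two-stage pipeline: ASR turns audio into a native-language transcript, and machine translation turns that transcript into English. Here is a toy Python sketch of that chain; both stages are invented stand-ins, since EViTAP’s actual components are not public.

    def asr_transcribe(audio: bytes, language: str) -> str:
        # Stand-in for a real ASR engine: audio in, native-language text out.
        return "hola mundo" if language == "es" else ""

    ES_EN = {"hola": "hello", "mundo": "world"}  # toy phrase table

    def translate_to_english(text: str) -> str:
        # Stand-in for a real MT engine: word-by-word dictionary lookup.
        return " ".join(ES_EN.get(word, word) for word in text.split())

    def monitor_broadcast(audio: bytes, language: str) -> str:
        # Chain the two stages, as the memo says EViTAP does.
        return translate_to_english(asr_transcribe(audio, language))

    print(monitor_broadcast(b"...", "es"))  # -> hello world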
Experts in speech recognition say that in the last decade or so, the pace of
technological improvement has been explosive. As information storage became
cheaper and more efficient, technology companies were able to store massive
amounts of voice data on their servers, allowing them to continually update and
improve the models. Enormous processing power, running “deep neural networks”
that detect patterns the way human brains do, produces much cleaner transcripts.
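To make the neural-network idea concrete, here is a toy Python sketch of the core computation: layers of learned weights turn one frame of audio features into a score for each phoneme. The sizes and random weights are purely illustrative; real systems learn their parameters from enormous archives of transcribed speech.

    import numpy as np

    rng = np.random.default_rng(0)
    N_FEATURES, N_HIDDEN, N_PHONEMES = 40, 128, 42  # illustrative sizes

    # Random weights stand in for parameters that real systems learn
    # from huge corpora of transcribed speech.
    W1 = rng.standard_normal((N_FEATURES, N_HIDDEN)) * 0.1
    W2 = rng.standard_normal((N_HIDDEN, N_PHONEMES)) * 0.1

    def phoneme_scores(frame):
        # One hidden layer with a ReLU nonlinearity, then a softmax over
        # phoneme classes -- the pattern-detection step described above.
        hidden = np.maximum(frame @ W1, 0.0)
        logits = hidden @ W2
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

    frame = rng.standard_normal(N_FEATURES)  # one short slice of audio features
    print(phoneme_scores(frame).argmax())    # index of the most likely phoneme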
And the Snowden documents show that the same kinds of leaps forward seen in
commercial speech-to-text products have also been happening in secret at the
NSA, fueled by the agency’s singular access to astronomical
processing power and its own vast data archives.
In fact, the NSA has been repeatedly releasing new and improved speech
recognition systems for more than a decade.
The first-generation tool, which made keyword-searching of vast amounts of
voice content possible, was rolled out in 2004 and code-named RHINEHART.
“Voice word search technology allows analysts to find and prioritize
intercept based on its intelligence content,” says an internal 2006 NSA memo
entitled “For Media Mining, the Future Is Now!”
The memo says that intelligence analysts involved in counterterrorism were
able to identify terms related to bomb-making materials, like “detonator” and
“hydrogen peroxide,” as well as place names like “Baghdad” or people like
“Musharaf.”
RHINEHART was “designed to support both real-time searches, in which
incoming data is automatically searched by a designated set of dictionaries,
and retrospective searches, in which analysts can repeatedly search over
months of past traffic,” the memo explains (emphasis
in original).
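A minimal Python sketch of those two modes might look like the following. The dictionary terms come from the memo’s own examples; the function names and data layout are invented for illustration, not RHINEHART’s actual design.

    # Terms from the memo's own examples; everything else is invented.
    DICTIONARY = {"detonator", "hydrogen peroxide", "baghdad"}
    archive = []  # transcripts retained for later queries

    def real_time_search(transcript):
        # Incoming traffic is checked against a designated dictionary
        # as it arrives, then retained for retrospective searching.
        archive.append(transcript)
        return {term for term in DICTIONARY if term in transcript.lower()}

    def retrospective_search(term):
        # Analysts can repeatedly query months of stored past traffic.
        return [t for t in archive if term.lower() in t.lower()]

    print(real_time_search("shipment of hydrogen peroxide to Baghdad"))
    print(retrospective_search("peroxide"))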
As of 2006, RHINEHART was operating “across a wide variety of missions and
languages” and was “used throughout the NSA/CSS [Central Security Service]
Enterprise.”
But even then, a newer, more sophisticated product was already being rolled
out by the NSA’s Human Language Technology (HLT) program office. The new
system, called VoiceRT, was first introduced in Baghdad, and “designed to index
and tag 1 million cuts per day.”
The goal, according to another 2006 memo, was to use voice processing
technology to be able to “index, tag and graph” all intercepted
communications. “Using HLT services, a single
analyst will be able to sort through millions of cuts per day and focus on only
the small percentage that is relevant,” the memo states.
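One plausible reading of “index and tag” is an inverted index that maps words in a rough transcript to the voice “cuts” containing them, so a query touches only the relevant sliver of a day’s traffic. The Python sketch below shows the idea in miniature; it is a guess at the concept, not VoiceRT’s architecture.

    from collections import defaultdict

    index = defaultdict(set)  # word -> IDs of voice cuts containing it

    def index_cut(cut_id, rough_transcript):
        # Tag each cut with the words in its machine-made transcript.
        for word in rough_transcript.lower().split():
            index[word].add(cut_id)

    def lookup(word):
        return index.get(word.lower(), set())

    index_cut("cut-0001", "meeting near the market tomorrow")
    index_cut("cut-0002", "call me about the shipment")
    print(lookup("shipment"))  # {'cut-0002'}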
A 2009 memo from the NSA’s British partner, GCHQ, describes how “NSA have had the BBN speech-to-text system
Byblos running at Fort Meade for at least 10 years. (Initially they also had
Dragon.) During this period they have invested heavily in producing their own
corpora of transcribed Sigint in both American English and an increasing range
of other languages.” (GCHQ also noted that it had its own small corpora of
transcribed voice communications, most of which happened to be “Northern Irish
accented speech.”)
VoiceRT, in turn, was surpassed a few years after its launch. According to
the intelligence community’s “Black Budget” for fiscal year 2013, VoiceRT was decommissioned and replaced in
2011 and 2012, so that by 2013, NSA could operationalize a new system. This
system, apparently called SPIRITFIRE,
could handle more data, faster. SPIRITFIRE would be “a more robust voice
processing capability based on speech-to-text keyword search and paired
dialogue transcription.”
Extensive Use Abroad
Voice communications can be collected by the NSA whether they are being sent
by regular phone lines, over cellular networks, or through voice-over-internet
services. Previously released documents
from the Snowden archive describe enormous efforts by the NSA during the last
decade to get access to voice-over-internet content like Skype calls, for
instance. And other documents in the archive chronicle the agency’s adjustment
to the fact that an increasingly large percentage of conversations, even those
that start
as landline or mobile calls, end
up as digitized packets flying through the same fiber-optic cables that the
NSA taps
so effectively for other data and voice communications.
The Snowden archive, as searched and analyzed by The Intercept,
documents extensive use of speech-to-text by the NSA to search through
international voice intercepts — particularly in Iraq and Afghanistan, as well
as Mexico and Latin America.
For example, speech-to-text was a key but previously unheralded element of
the sophisticated analytical program known as the Real Time Regional Gateway
(RTRG), which started in 2005 when newly appointed NSA chief Keith B.
Alexander, according to the Washington Post, “wanted
everything: Every Iraqi text message, phone call and e-mail that could be
vacuumed up by the agency’s powerful computers.”
The Real Time Regional Gateway was credited with playing a role in
“breaking up Iraqi insurgent networks and significantly reducing the monthly
death toll from improvised explosive devices.” The indexing and searching
of “voice cuts” was deployed to Iraq in 2006. By 2008, RTRG was operational in
Afghanistan as well.
A slide from a June 2006 NSA PowerPoint presentation described the role of VoiceRT.
Keyword spotting extended to Iranian intercepts as well. A 2006 memo
reported that RHINEHART had been used successfully by Persian-speaking
analysts who “searched for the words ‘negotiations’ or ‘America’ in their
traffic, and RHINEHART located a very important call that was transcribed
verbatim providing information on an important Iranian target’s discussion of
the formation of the new Iraqi government.”
According to a 2011 memo, “How Is Human Language Technology (HLT) Progressing?”, NSA that year deployed
“HLT Labs” to Afghanistan, NSA facilities in Texas and Georgia, and listening
posts in Latin America run by the Special
Collection Service, a joint NSA/CIA unit that operates out of embassies and
other locations.
“Spanish is the most mature of our speech-to-text analytics,” the memo says,
noting that the NSA and its Special Collection Service sites in Latin America
have had “great success searching for Spanish keywords.”
The memo offers an example from NSA Texas, where an analyst newly trained on
the system used a keyword search to find previously unreported information on a
target involved in drug-trafficking. In another case, an official at a Special
Collection Service site in Latin America “was able to find foreign intelligence
regarding a Cuban official in a fraction of the usual time.”
In a 2011 article, “Finding Nuggets — Quickly — in a Heap of Voice Collection, From Mexico to Afghanistan,”
an intelligence analysis technical director from NSA Texas described the “rare
life-changing instance” when he learned about human language technology, and
its ability to “find the exact traffic of interest within a mass of
collection.”
Analysts in Texas found the new technology a boon for spying. “From finding
tunnels in Tijuana, identifying bomb threats in the streets of Mexico City, or
shedding light on the shooting of US Customs officials in Potosi, Mexico, the
technology did what it advertised: It accelerated the process of
finding relevant intelligence when time was of the essence,” he wrote.
(Emphasis in original.)
The author of the memo was also part of a team that introduced the
technology to military leaders in Afghanistan. “From Kandahar to Kabul, we have
traveled the country explaining NSA leaders’ vision and introducing SIGINT
teams to what HLT analytics can do today and to what is still needed to make
this technology a game-changing success,” the memo reads.
Extent of Domestic Use Remains Unknown
What’s less clear from the archive is how extensively this capability is
used to transcribe or otherwise index and search voice conversations that
primarily involve what the NSA terms “U.S. persons.”
The NSA did not answer a series of detailed questions about automated speech
recognition, even though an NSA “classification guide” that is part of the
Snowden archive explicitly states that “The fact that NSA/CSS has created HLT
models” for speech-to-text processing, as well as gender, language and voice
recognition, is “UNCLASSIFIED.”
Also unclassified: The fact that the processing can sort and prioritize
audio files for human linguists, and that the statistical models are regularly
being improved and updated based on actual intercepts. By contrast, because
they’ve been tuned using actual intercepts, the specific parameters of the
systems are highly classified.
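That unclassified sorting capability can be pictured as a priority queue: machine-generated scores decide which audio files a human linguist hears first. In the Python sketch below, the scoring fields are invented; the NSA has not said how its prioritization actually works.

    import heapq

    queue = []  # min-heap; scores negated so highest priority pops first

    def enqueue(cut_id, keyword_hits, target_match):
        # Invented scoring: keyword hits plus a bonus for a known target.
        score = keyword_hits + (10 if target_match else 0)
        heapq.heappush(queue, (-score, cut_id))

    def next_for_linguist():
        return heapq.heappop(queue)[1]

    enqueue("cut-7", keyword_hits=0, target_match=False)
    enqueue("cut-9", keyword_hits=3, target_match=True)
    print(next_for_linguist())  # cut-9 is heard first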
“The National Security Agency employs a variety of technologies in the
course of its authorized foreign-intelligence mission,” spokesperson Vanee’
Vines wrote in an email to The Intercept. “These capabilities,
operated by NSA’s dedicated professionals and overseen by multiple internal and
external authorities, help to deter threats from international terrorists,
human traffickers, cyber criminals, and others who seek to harm our citizens
and allies.”
Vines did not respond to the specific questions about privacy protections in
place related to the processing of domestic or domestic-to-international voice
communications. But she wrote that “NSA always applies rigorous protections
designed to safeguard the privacy not only of U.S. persons, but also of
foreigners abroad, as directed by the President in January 2014.”
The presidentially appointed but independent Privacy and Civil Liberties
Oversight Board (PCLOB) didn’t mention speech-to-text technology in its
public reports.
“I’m not going to get into whether any program does or does not have that
capability,” PCLOB chairman David Medine told The Intercept.
His board’s reports, he said, contained only information that the
intelligence community agreed could be declassified.
“We went to the intelligence community and asked them to declassify a
significant amount of material,” he said. The “vast majority” of that material
was declassified, he said. But not all — including “facts that we thought could
be declassified without compromising national security.”
Hypothetically, Medine said, the ability to turn voice into text would raise
significant privacy concerns. And it would also raise questions about how the
intelligence agencies “minimize” the retention and dissemination of material —
particularly involving U.S. persons — that doesn’t include information they’re
explicitly allowed to keep.
“Obviously it increases the ability of the government to process information
from more calls,” Medine said. “It would also allow the government to listen in
on more calls, which would raise more of the kind of privacy issues that the
board has raised in the past.”
“I’m not saying the government does or doesn’t do it,” he said, “just that
these would be the consequences.”
A New Learning Curve
Speech recognition expert Bhiksha Raj likens the current era to the early
days of the Internet, when people didn’t fully realize how the things they
typed would last forever.
“When I started using the Internet in the 90s, I was just posting stuff,”
said Raj, an associate professor at Carnegie Mellon University’s Language Technologies Institute. “It
never struck me that 20 years later I could go Google myself and pull all this
up. Imagine if I posted something on alt.binaries.pictures.erotica or something
like that, and now that post is going to embarrass me forever.”
The same is increasingly becoming the case with voice communication, he
said. And the stakes are even higher, given that the majority of the world’s
communication has historically been conducted by voice, and it has
traditionally been considered a private mode of communication.
“People still aren’t realizing quite the magnitude that the problem could
get to,” Raj said. “And it’s not just surveillance,” he said. “People are using
voice services all the time. And where does the voice go? It’s sitting
somewhere. It’s going somewhere. You’re living on trust.” He added: “Right now
I don’t think you can trust anybody.”
The Need for New Rules
Kim Taipale, executive director of the Stilwell Center for Advanced Studies in
Science and Technology Policy, is one of several people who tried
a decade ago to get policymakers to recognize that existing surveillance
law doesn’t adequately deal with new global communication networks and advanced
technologies including speech recognition.
“Things aren’t ephemeral anymore,” Taipale told The Intercept. “We’re
living in a world where many things that were fleeting in the analog world are
now on the permanent record. The question then becomes: what are the
consequences of that and what are the rules going to be to deal with those
consequences?”
Realistically, Taipale said, “the ability of the government to search
voice communication in bulk is one of the things we may have to live
with under some circumstances going forward.” But there at least need to
be “clear public rules and effective oversight to make sure that the
information is only used for appropriate law-enforcement or national
security purposes consistent with Constitutional principles.”
Ultimately, Taipale said, a system where computers flag suspicious
voice communications could be less invasive than one where people do the
listening, given the potential for human abuse and misuse to lead
to privacy violations. “Automated analysis has different privacy
implications,” he said.
But to Jay Stanley, a senior policy analyst with the ACLU’s Speech,
Privacy and Technology Project, the distinction between a human
listening and a computer listening is irrelevant in terms of privacy, possible
consequences, and a chilling effect on speech.
“What people care about in the end, and what creates chilling effects in the
end, are consequences,” he said. “I think that over time, people would learn to
fear computerized eavesdropping just as much as they fear eavesdropping by
humans, because of the consequences that it could bring.”
Indeed, computer listening could raise new concerns. One of the internal
NSA memos from 2006 says an “important enhancement under development is the
ability for this HLT capability to predict what intercepted data might be of
interest to analysts based on the analysts’ past behavior.”
Citing Amazon’s ability to not just track but predict buyer preferences, the
memo says that an NSA system designed to flag interesting intercepts “offers
the promise of presenting analysts with highly enriched sorting of their
traffic.”
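The memo does not say how the prediction works, but the Amazon analogy suggests something like a recommender system: score each new intercept by its resemblance to the intercepts an analyst chose to read before. In the toy Python sketch below, simple word overlap stands in for whatever model the real enhancement used.

    def relevance(new_transcript, past_selections):
        # Jaccard word overlap with the analyst's past picks stands in
        # for whatever model the real enhancement used.
        new_words = set(new_transcript.lower().split())
        best = 0.0
        for old in past_selections:
            old_words = set(old.lower().split())
            if new_words or old_words:
                best = max(best, len(new_words & old_words) /
                                 len(new_words | old_words))
        return best

    history = ["discussion of border crossing", "crossing at night"]
    print(relevance("plans for a night crossing", history))  # enriched sorting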
To Phillip Rogaway, a professor of computer science at the University of
California, Davis, keyword-search is probably the “least of our problems.” In
an email to The Intercept, Rogaway warned that “When the NSA
identifies someone as ‘interesting’ based on contemporary NLP [Natural Language
Processing] methods, it might be that there is no human-understandable
explanation as to why beyond: ‘his corpus of discourse resembles those of
others whom we thought interesting’; or the conceptual opposite: ‘his discourse
looks or sounds different from most people’s.’”
If the algorithms NSA computers use to identify threats are too complex for
humans to understand, Rogaway wrote, “it will be impossible to understand the
contours of the surveillance apparatus by which one is judged. All that
people will be able to do is to try your best to behave just like everyone
else.”
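Rogaway’s scenario can be illustrated with even the crudest similarity measure. In the toy Python sketch below, each person’s “corpus of discourse” becomes a word-count vector, and anyone whose vector sits close to a known target’s gets flagged, with no human-readable reason attached. Real NLP models are far more sophisticated, but the opacity is the same.

    import math
    from collections import Counter

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    target = Counter("we should meet again to discuss the plan".split())
    speaker = Counter("let us meet and discuss the plan again".split())
    if cosine(target, speaker) > 0.5:
        # The flag comes with no human-readable reason -- Rogaway's point.
        print("flagged: discourse resembles a known target's")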
Next: The NSA’s best-kept open secret.
Readers with information or insight into these programs are encouraged to
get in touch, either by email, or anonymously via SecureDrop.
Documents published with this article:
RT10 Overview (June 2006)
For Media Mining, the Future Is Now! (August 1, 2006)
For Media Mining, the Future Is Now! (conclusion) (August 7, 2006)
Dealing With a ‘Tsunami’ of Intercept (August 29, 2006)
Coming Soon! A Tool that Enables Non-Linguists to Analyze Foreign-TV News Programs (October 23, 2008)
SIRDCC Speech Technology WG assessment of current STT technology (December 7, 2009)
Classification Guide for Human Language Technology (HLT) Models (May 18, 2011)
Finding Nuggets — Quickly — in a Heap of Voice Collection, From Mexico to Afghanistan (May 25, 2011)
How Is Human Language Technology (HLT) Progressing? (September 26, 2011)
“Black Budget” — FY 2013 Congressional Budget Justification/National Intelligence Program, p. 262 (February 2012)
“Black Budget” — FY 2013 Congressional Budget Justification/National Intelligence Program, pp. 360-364 (February 2012)
Research on the Snowden archive was conducted by Intercept researcher
Andrew Fishman.
Illustrations by Richard Mia for The Intercept.