Why Google, Microsoft and Amazon Love the Sound of Your Voice
Speech recognition must get much better if we are to
speak naturally to our gadgets. So the tech industry is vacuuming up all the
conversations it can.
by Jing Cao and Dina Bass December 13, 2016, 3:00 AM PST
Amazon's Echo has made tangible the promise of an
artificially intelligent personal assistant in every home. Those who own the
voice-activated gadget (known colloquially as Alexa, after its female interlocutor)
are prone to proselytizing "her" charms, applauding Alexa's ability
to call an Uber, order pizza or check a 10th-grader's math homework. The
company says more than 5,000 people a day profess their love for Alexa.
On the other hand, Alexa devotees also know that unless
you speak to her very clearly . . . and
. . . slowly, she's likely to say: Sorry, I don't have the answer to that
question. "I love her. I hate her, I love her," one customer wrote on
Amazon's website, while still awarding Alexa five stars. "You will very quickly learn how to talk
to her in a way that she will understand and it's not unlike speaking to a
small frustrating toddler."
Voice recognition has come a long way in the past few
years. But it's still not good enough to popularize the technology for everyday
use and usher in a new era of human-machine interaction, allowing us to talk
with all our gadgets—cars, washing machines, televisions. Despite advances in
speech recognition, most people continue to swipe, tap and click. And probably
will for the foreseeable future.
What's holding back progress? Partly the artificial
intelligence that powers the technology has room to improve. There's also a
serious deficit of data—specifically audio of human voices, speaking in
multiple languages, accents and dialects in often noisy circumstances that can
defeat the code.
So Amazon, Apple, Microsoft and China's Baidu have
embarked on a worldwide hunt for terabytes of human speech. Microsoft has set
up mock apartments in cities around the globe to record volunteers speaking in
a home setting. Every hour, Amazon uploads Alexa queries to a vast digital
warehouse. Baidu is busily collecting every dialect in China. Then they take
all that data and use it to teach their computers how to parse, understand and
respond to commands and queries.
The challenge is finding a way to capture natural,
real-world conversations. Even 95 percent accuracy isn't enough, says Adam
Coates, who runs Baidu's artificial intelligence lab in Sunnyvale, California.
"Our goal is to push the error rate down to 1 percent," he says. "That's where you can really trust the
device to understand what you're saying, and that will be transformative."
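The metric Coates is talking about is word error rate: the substituted, inserted and deleted words in a transcript, divided by the number of words actually spoken. A minimal sketch of that arithmetic (the standard formulation, not Baidu's evaluation code; the example sentences are invented) shows why "95 percent accuracy" still means roughly one word in twenty comes out wrong:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Substitutions + insertions + deletions needed to turn the hypothesis
    into the reference, divided by the number of words actually spoken."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic word-level edit-distance dynamic program.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deleted word
                             dist[i][j - 1] + 1,         # inserted word
                             dist[i - 1][j - 1] + cost)  # substituted word
    return dist[-1][-1] / len(ref)

# Twenty words spoken, one misheard: 95 percent "accuracy," 5 percent error rate.
print(word_error_rate("turn off the lights in the kitchen and play some jazz "
                      "then set a timer for twenty minutes starting now",
                      "turn off the lights in the kitchen and play some jazz "
                      "then set a timer for twenty minutes starting how"))  # 0.05
```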
Not so long ago, voice recognition was comically
rudimentary. An early version of Microsoft's technology running in Windows transcribed
"mom" as "aunt" during a 2006 demo before an auditorium of
analysts and investors. When Apple debuted Siri five years back, the personal
assistant's gaffes were widely mocked because it, too, routinely spat out
incorrect results or didn't hear the question correctly. When asked if Gillian
Anderson is British, Siri provided a list of English restaurants. Now Microsoft
says its speech engine makes no more errors than professional
transcribers, Siri is winning grudging respect, and Alexa has given us a
tantalizing glimpse of the future.
Much of that progress owes a debt to the magic of neural
networks, a form of artificial intelligence based loosely on the architecture
of the human brain. Neural networks learn without being explicitly programmed
but generally require an enormous breadth and diversity of data. The more a
speech recognition engine consumes, the better it gets at understanding
different voices and the closer it gets to the eventual goal of having a
natural conversation in many languages and situations.
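Strip away the jargon and the recipe is remarkably uniform. The toy sketch below (an illustration of the general training loop, not any company's production system, written with the open-source PyTorch library and made-up dimensions) shows the point: nothing about language is hand-coded, so the only way the model improves is by being shown more transcribed speech and nudging its weights to make fewer mistakes.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 40 acoustic features per audio frame, a 1,000-word vocabulary.
model = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 1000))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def training_step(features: torch.Tensor, word_labels: torch.Tensor) -> float:
    """One update from a batch of (audio features, transcribed word) pairs."""
    optimizer.zero_grad()
    loss = loss_fn(model(features), word_labels)  # how wrong was the guess?
    loss.backward()                               # trace the blame back through the network
    optimizer.step()                              # nudge every weight to be a little less wrong
    return loss.item()

# The loop is the same whether it is fed a thousand hours of speech or a million;
# only the breadth and variety of that audio changes how well the model generalizes.
```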
Hence the global scramble to capture a multitude of
voices. "The more data we shove in our systems the better it
performs," says Andrew Ng, Baidu's chief scientist. "This is why
speech is such a capital-intensive exercise; not a lot of organizations have
this much data."
When the industry began working seriously on voice
recognition in the 1990s, companies like Microsoft relied on publicly available
data from research institutes such as the Linguistics Data Consortium, a storehouse
of voice and text data founded in 1992 with backing from the U.S. government
and located at the University of Pennsylvania. Then tech companies started
collecting their own voice data, some of it garnered from volunteers who came
in to read and be recorded. Now, with the popularity of speech-controlled
software gaining ground, they harvest much of the data from their own products
and services.
When you tell your phone to search for something, play a
song or guide you to a destination, chances are a company is recording it.
(Apple, Google, Microsoft and Amazon emphasize that they anonymize user data to
protect customer privacy.) When you ask Alexa what the weather is or the latest
football score, the gadget uses the queries to improve its understanding of
natural language (although "she" isn't listening to your
conversations unless you say her name). "By design, Alexa gets smarter as
you use her," says Nikko Strom, senior principal scientist for the
program.
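The "unless you say her name" caveat reflects a common design for such devices: a small model runs locally, scoring a short rolling buffer of audio for the wake word, and only after it fires does audio start streaming to the cloud. The sketch below is schematic, not Amazon's code; the buffer size, threshold and helper names are all assumptions.

```python
from collections import deque

WINDOW_FRAMES = 100    # roughly one second of audio frames (assumed)
THRESHOLD = 0.9        # hypothetical confidence cutoff for the wake word

def wake_word_score(frames) -> float:
    """Stand-in for a small on-device keyword-spotting model."""
    return 0.0  # a real detector would return the model's confidence here

def listen(microphone, cloud) -> None:
    buffer = deque(maxlen=WINDOW_FRAMES)
    for frame in microphone:                 # runs continuously, entirely on the device
        buffer.append(frame)
        if len(buffer) == WINDOW_FRAMES and wake_word_score(buffer) > THRESHOLD:
            cloud.stream_query(microphone)   # only now does any audio leave the device
            buffer.clear()
```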
One of the key challenges is getting the technology
conversant with multiple languages, accents and dialects. Nowhere, perhaps, is
this more crucial than in China. Seeking to harvest dialects from all over the
country, Baidu launched a marketing campaign during Chinese New Year earlier
this year. Calling the push a "dialect conservation initiative," the
search giant promised people that if they contributed they would help usher in
a future when they would talk to Baidu using their dialect. In two weeks, the
company recorded more than 1,000 hours of speech to plug into its computers.
Many people did it for free simply because they were proud of their hometown
dialects. A high school teacher in Sichuan was so excited about the program that he
asked a class of students to record more than 1,000 ancient poems in
Sichuanese.
Another challenge: teaching voice recognition technology
to pick up commands over background noise—the clamor of happy hour, say, or the
cacophony of a sports stadium. Microsoft has deployed an Xbox app called Voice
Studio to harvest conversation over the din of users shooting villains or
watching movies. The company offered rewards including points and digital
apparel for avatars and lured hundreds of subjects willing to contribute their
game chatter to Microsoft's speech efforts. The program worked gangbusters in
Brazil, where the local subsidiary promoted the app heavily on the main Xbox
page. The data was used to create the Brazilian Portuguese version of Cortana,
released earlier this year.
Companies are also designing voice recognition systems
for specific situations. Microsoft has been testing technology that can answer
travelers' queries without being distracted by the constant barrage of flight
announcements at airports. The company's technology is also being used in an
automated ordering system for McDonald's drive-thrus. Trained to ignore
scratchy audio, screaming kids and "ums," it can spit out a
complicated order, getting even the condiments right. Amazon is conducting
tests in automobiles, challenging Alexa to work well with road noise and open
windows.
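Harvesting noisy audio is one approach; another common trick in the field (a general technique, not necessarily what Microsoft or Amazon use for these particular systems) is to manufacture it, mixing recorded background noise into clean training utterances at a chosen signal-to-noise ratio:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay noise onto a clean utterance at the requested signal-to-noise ratio."""
    noise = np.resize(noise, speech.shape)     # loop or trim the noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid dividing by zero on silence
    # Scale the noise so 10 * log10(speech_power / scaled_noise_power) equals snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. simulate a drive-thru lane: a clean order mixed with engine noise at 5 dB
# augmented = mix_at_snr(clean_order_audio, engine_noise_audio, snr_db=5.0)
```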
Even as companies scour the world for data, they're
figuring out ways to improve voice recognition with less of it. The technology
being tested at McDonald's is more accurate than other systems that use much
more data, says Xuedong Huang, Microsoft's chief speech scientist, who has been
working on voice recognition at the company for more than two decades.
"You can always have breakthroughs even without using the most data."
Google generally subscribes to a less-is-more philosophy,
deploying a piecemeal approach that builds words and phrases from small units of
sound that are unintelligible on their own. With its speech recognition system, the company aims to
solve multiple problems with just one change. For its data sets, Google strings
together tens of thousands of audio snippets that are typically two to five
seconds long. The process requires less computing power and can be more easily
tested and tweaked, says Google researcher Françoise Beaufays. For its part,
Baidu is working on more efficient algorithms where learning one language makes
it easier to learn the next twelve. That's particularly important for those
spoken by tens of thousands of people rather than millions, where there just
won't be huge swaths of data no matter what, says Ng, the company's chief
scientist.
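The intuition behind that kind of transfer is that most of a speech model doesn't need to be language-specific. One common way to express the idea (an illustration only, not Baidu's architecture; every dimension and language name below is invented) is a single shared acoustic encoder with a small output layer per language, so a dialect with scarce data inherits nearly everything learned from data-rich languages:

```python
import torch.nn as nn

class MultilingualRecognizer(nn.Module):
    def __init__(self, feature_dim: int = 40, hidden: int = 512, vocab_sizes: dict = None):
        super().__init__()
        # Shared encoder: trained on speech pooled from every language at once.
        self.encoder = nn.LSTM(feature_dim, hidden, num_layers=3, batch_first=True)
        # One lightweight head per language, e.g. {"mandarin": 6000, "sichuanese": 6000}.
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(hidden, size) for lang, size in (vocab_sizes or {}).items()}
        )

    def forward(self, features, lang):
        encoded, _ = self.encoder(features)
        return self.heads[lang](encoded)  # only this small layer is language-specific
```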
Ask researchers like Ng when it will be possible to speak
naturally to your digital assistant and they get wistful. No one really knows.
Neural networks remain mysterious even to those who understand them best. And
much of the work is trial and error; make a tweak here and you're never quite
sure what will happen there. Based on the current technology and methods, the
process will probably take years. But Ng, Huang, Beaufays and other scientists
say you never know when a breakthrough will arrive, catapulting research
forward and turning Alexa and Siri into true conversationalists.