Will Artificial Intelligence Win the Caption Contest?
Neural networks have mastered the ability to label things
in images, and now they’re learning to tell stories from a set of photos.
by Signe Brewster
April 27, 2016
When social-media users upload photographs and caption
them, they don’t just label their contents. They tell a story, which gives the
photos context and additional emotional meaning.
A paper published by Microsoft Research describes an
image captioning system that mimics humans’ unique style of visual
storytelling. Companies like Microsoft, Google, and Facebook have spent years
teaching computers to label the contents of images, but this new research takes
it a step further by teaching a neural-network-based system to infer a story
from several images. Someday it could be used to automatically generate
descriptions for sets of images, or to bring humanlike language to other
applications for artificial intelligence.
“Rather than giving bland or vanilla descriptions of
what’s happening in the images, we put those into a larger narrative context,”
says Frank Ferraro, a Johns Hopkins University PhD student who coauthored the
paper. “You can start making likely inferences of what might be happening.”
Consider an album of pictures depicting a group of
friends celebrating a birthday at a bar. Some of the early pictures show people
ordering beer and drinking it, while a later photo shows someone asleep on a
couch.
“A captioning system might just say, ‘A person lying on a
couch,’” Ferraro says. “But a storytelling system might be able to say, ‘Well,
given that I think these people were out partying or out eating and drinking,
then this person may be drunk.’”
One example listed in the paper includes a series of five
images. They show a family gathered around a table, a plate of shellfish, a
dog, and images from the beach. The neural network described them with a story
reading, “The family got together for a cookout. They had a lot of delicious
food. The dog was happy to be there. They had a great time on the beach. They
even had a swim in the water.”
The team, which was led by Microsoft researcher Margaret
Mitchell and included Microsoft interns like Ferraro and a researcher from
Facebook AI, turned what’s called a sequence-to-sequence recurrent neural
network into a storyteller by training it with images sourced from Flickr. They
had helpers write captions for individual images and for series of images in
specific sequences.
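The paper itself does not include code, but a minimal sketch can convey the shape of the architecture: one recurrent network reads a feature vector for each photo in the album, and a second recurrent network, conditioned on that summary, emits the story one word at a time. Everything below (PyTorch, the layer sizes, the class and variable names) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StorytellerSketch(nn.Module):
    """Illustrative sequence-to-sequence RNN: a GRU encoder reads one
    feature vector per photo; a GRU decoder, seeded with the encoder's
    final state, emits the story word by word. All dimensions here are
    hypothetical."""

    def __init__(self, img_feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(img_feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, story_tokens):
        # image_feats: (batch, n_photos, img_feat_dim) precomputed CNN features
        # story_tokens: (batch, story_len) word indices of the target story
        _, h = self.encoder(image_feats)      # summarize the whole album
        emb = self.embed(story_tokens)        # (batch, story_len, hidden_dim)
        dec_out, _ = self.decoder(emb, h)     # decode conditioned on the album
        return self.out(dec_out)              # per-step vocabulary logits

# Toy usage: a five-photo album and a 12-word story, batch size 1.
model = StorytellerSketch()
feats = torch.randn(1, 5, 2048)
tokens = torch.randint(0, 10000, (1, 12))
logits = model(feats, tokens)                 # shape: (1, 12, 10000)
```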
An approach similar to the one used to label the contents
of single photos produced stories that were too generic. To counter this, the
team developed a way for the network to choose words that were likely to be
visually salient. They also required that the system not repeat words.
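The article only names the two constraints, so the decoding sketch below is hypothetical: a greedy decoder that adds a per-word bonus for visually grounded vocabulary and forbids any word from appearing twice. The function names, the bonus vector, and the assumed step_logits_fn interface are all illustrative.

```python
import torch

def greedy_decode_no_repeat(step_logits_fn, start_token, end_token,
                            visual_bonus, max_len=30):
    """Hypothetical greedy decoder: at each step, boost words flagged as
    visually salient and block words already emitted, so the story does
    not repeat itself. step_logits_fn(prev) is assumed to return a 1-D
    tensor of vocabulary logits given the previous token."""
    used = set()
    prev, story = start_token, []
    for _ in range(max_len):
        logits = step_logits_fn(prev).clone()
        logits += visual_bonus                # reward visually grounded words
        for w in used:
            logits[w] = float("-inf")         # forbid exact repeats
        prev = int(torch.argmax(logits))
        if prev == end_token:
            break
        story.append(prev)
        used.add(prev)
    return story

# Toy usage with random logits over a 10,000-word vocabulary.
vocab = 10000
bonus = torch.zeros(vocab)
bonus[torch.randint(0, vocab, (50,))] = 2.0   # pretend 50 words are "visual"
fake_step = lambda prev: torch.randn(vocab)
print(greedy_decode_no_repeat(fake_step, 0, 1, bonus))
```

In practice a system like this would likely exempt function words such as "the" from the no-repeat rule; the blanket ban here simply mirrors the constraint as the article states it.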
Storytelling is an important part of being human, says
Stanford Vision Lab director Fei-Fei Li, who did not contribute to the
research. Technology that can imitate humans’ techniques for documenting
stories needs to be able to cross-reference objects and characters seen in multiple
pictures and infer relationships between people, objects, and places.
“The published paper is just the beginning toward this
kind of technology,” Li says. “But it is a good step forward to start tackling
such an ambitious project. I look forward to more follow-up work from these
authors and others.”