Stanford engineers make editing video as easy as editing text
A new algorithm allows video editors to modify talking-head videos as if they were editing text – copying, pasting, or adding and deleting words.
BY ANDREW MYERS | JUNE 5, 2019
In television and film, actors often flub small bits of
otherwise flawless performances. Other times they leave out a critical word.
For editors, the only options so far have been to accept the flaws or fix them with expensive reshoots.
A new algorithm makes it possible to perform text-based
editing of videos of “talking heads”; that is, speakers from the shoulders up.
Imagine, however, if an editor could modify video using
a text transcript. Much like word processing, the editor could easily add new
words, delete unwanted ones or completely rearrange the pieces by dragging and
dropping them as needed to assemble a finished video that looks almost flawless
to the untrained eye.
A team of researchers from Stanford University, Max Planck
Institute for Informatics, Princeton University and Adobe Research created such
an algorithm for editing talking-head videos – videos showing speakers from the
shoulders up.
The work could be a boon for video editors and producers
but does raise concerns as people increasingly question the validity of images
and videos online, the authors said. However, they propose some guidelines for
using these tools that would alert viewers and performers that the video has
been manipulated.
“Unfortunately, technologies like this will always
attract bad actors,” said Ohad Fried, a postdoctoral scholar at Stanford. “But
the struggle is worth it given the many creative video editing and content
creation applications this enables.”
Reading lips
The application uses the edited transcript to extract the needed speech motions from various pieces of the video and, using machine learning, converts them into a final video that appears natural to the viewer – lip-synched and all.
“Visually, it’s seamless. There’s no need to rerecord anything,” said Fried, who is first author of a paper about the research, published on the preprint website arXiv and forthcoming in the journal ACM Transactions on Graphics. Fried works in the lab of Maneesh Agrawala, the Forest Baskett Professor in the School of Engineering and senior author of the paper. The project began more than two years ago, when Fried was a graduate student working with computer scientist Adam Finkelstein at Princeton.
Should an actor or performer flub a word or misspeak, the
editor can simply edit the transcript and the application will assemble the
right word from various words or portions of words spoken elsewhere in the
video. It’s the equivalent of rewriting with video, much like a writer retypes
a misspelled or unfit word. The algorithm does require at least 40 minutes of
original video as input, however, so it won’t yet work with just any video
sequence.
As the transcript is edited, the algorithm selects
segments from elsewhere in the recorded video with motion that can be stitched
to produce the new material. In their raw form these video segments would have
jarring jump cuts and other visual flaws.
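In rough terms, that selection step can be pictured as a lookup over phoneme-labeled snippets of the recorded footage. The Python sketch below is only an illustrative guess at how such a step might work; the Segment structure, the phoneme index and the greedy matching are assumptions made for demonstration, not the researchers’ published method.

```python
# Illustrative sketch (not the paper's algorithm): pick recorded snippets
# whose phonemes match the edited transcript.
from dataclasses import dataclass

@dataclass
class Segment:
    phoneme: str   # e.g. "HH", "AY", "K"
    start: float   # seconds into the source recording
    end: float

def build_index(segments):
    """Group recorded segments by phoneme so edits can be matched quickly."""
    index = {}
    for seg in segments:
        index.setdefault(seg.phoneme, []).append(seg)
    return index

def select_segments(edited_phonemes, index):
    """Greedily pick one recorded snippet per phoneme in the edited transcript."""
    chosen = []
    for ph in edited_phonemes:
        candidates = index.get(ph)
        if not candidates:
            raise ValueError(f"No recorded footage contains phoneme {ph!r}")
        # A real system would score candidates for visual smoothness;
        # here we simply take the first match.
        chosen.append(candidates[0])
    return chosen

if __name__ == "__main__":
    recorded = [Segment("HH", 1.0, 1.1), Segment("AY", 1.1, 1.3), Segment("K", 4.2, 4.3)]
    index = build_index(recorded)
    print(select_segments(["HH", "AY"], index))
```

Stitched together naively like this, the chosen snippets would still show the jump cuts the next step is designed to hide.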
To make the video appear more natural, the algorithm
applies intelligent smoothing to the motion parameters and renders a 3D animated
version of the desired result. However, that rendered face is still far from
realistic. As a final step, a machine learning technique called Neural
Rendering converts the low-fidelity digital model into a photorealistic video
in perfect lip-synch.
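At a high level, those last two stages can be sketched as follows. The moving-average filter and the neural_render placeholder below are illustrative assumptions standing in for the paper’s “intelligent smoothing” and Neural Rendering network, which are far more sophisticated.

```python
# Illustrative sketch of the smoothing and rendering stages described above.
import numpy as np

def smooth_parameters(params: np.ndarray, window: int = 5) -> np.ndarray:
    """Moving-average smoothing over time (axis 0 = frames) to hide jump cuts."""
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda track: np.convolve(track, kernel, mode="same"), 0, params
    )

def render_lowfi_face(frame_params: np.ndarray) -> np.ndarray:
    """Placeholder: render a low-fidelity 3D face image from one frame's parameters."""
    return np.zeros((256, 256, 3))

def neural_render(lowfi_frames):
    """Placeholder for the learned network that turns low-fidelity renders
    into photorealistic video frames."""
    return list(lowfi_frames)

if __name__ == "__main__":
    # 100 frames of, say, 64 face-model parameters stitched from different takes.
    stitched = np.random.randn(100, 64)
    smoothed = smooth_parameters(stitched)
    lowfi = [render_lowfi_face(p) for p in smoothed]
    video = neural_render(lowfi)
    print(len(video), video[0].shape)
```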
To test the capabilities of their system, the researchers performed a series of complex edits, including adding, removing and changing words, translating into different languages, and even creating full sentences as if from whole cloth.
In a crowd-sourced study with 138 participants, the
team’s edits were rated as “real” almost 60 percent of the time. The visual quality comes very close to that of the original footage, but Fried said there’s plenty of room for improvement.
“The implications for movie post-production are big,”
said Ayush Tewari, a student at the Max Planck Institute for Informatics and
the paper’s second author. The work opens up, for the first time, the possibility of fixing filmed dialogue without reshoots.
Ethical concerns
Nonetheless, in an era of synthesized fake videos, such capabilities raise important ethical concerns, Fried added. There are valuable and justifiable reasons to want to edit video in this way: it spares the expense and effort of rerecording to repair flaws, and it allows existing audio-visual content to be customized for different audiences. Instructional videos might be fine-tuned to different languages or cultural backgrounds, for instance, or children’s stories could be adapted to different ages.
“This technology is really about better storytelling,”
Fried said.
Fried acknowledges concerns that such a technology might
be used for illicit purposes, but says the risk is worth taking. Photo-editing
software went through a similar reckoning, but in the end, people want to live in a world where such software is available.
As a remedy, Fried says there are several options. One is
to develop some sort of opt-in watermarking that would identify any content
that had been edited and provide a full ledger of the edits. Moreover,
researchers could develop better forensics such as digital or non-digital
fingerprinting techniques to determine whether a video had been manipulated for
ulterior purposes. In fact, this research and others like it also build the
essential insights that are needed to develop better manipulation detection.
None of the solutions can fix everything, so viewers must
remain skeptical and cautious, Fried said. Besides, he added, there are already
many other ways to manipulate video that are much easier to execute. He said
that perhaps the most pressing need is to improve public awareness and education about video manipulation, so people are better equipped to question and assess the veracity of synthetic content.
Additional co-authors include Michael Zollhöfer, a
visiting assistant professor at Stanford, and colleagues at the Max Planck
Institute for Informatics, Princeton University and Adobe Research.
The research was funded by the Brown Institute for Media
Innovation, the Max Planck Center for Visual Computing and Communications, a
European Research Council Consolidator Grant, Adobe Research and the Office of
the Dean for Research at Princeton University.