Automation in the Newsroom
How algorithms are helping reporters expand coverage,
engage audiences, and respond to breaking news
BY CELESTE LECOMPTE September 1, 2015
Philana Patterson, assistant business editor for the
Associated Press, has been covering business since the mid-1990s. Before
joining the AP, she worked as a business reporter for both local newspapers and
Dow Jones Newswires and as a producer at Bloomberg. “I’ve written thousands of
earnings stories, and I’ve edited even more,” she says. “I’m very familiar with
earnings.” Patterson manages more than a dozen staffers on the business news
desk, and her expertise landed her on an AP stylebook committee that sets the
guidelines for AP’s earnings stories. So last year, when the AP needed someone
to train its newest newsroom member on how to write an earnings story,
Patterson was an obvious choice.
The trainee, however, wasn’t a fresh-faced j-school graduate
responsible for covering a dozen companies a quarter. It was a piece
of software called Wordsmith, and by the end of its first year on the job, it
would write more stories than Patterson had in her entire career. Patterson’s
job was to get it up to speed.
Patterson’s task is becoming increasingly common in
newsrooms. Journalists at ProPublica, Forbes, The New York Times, Oregon Public
Broadcasting, Yahoo, and others are using algorithms to help them tell stories
about business and sports as well as education, inequality, public safety, and
more. For most organizations, automating parts of reporting and publishing
efforts is a way both to reduce reporters’ workloads and to take advantage of
new data resources. In the process, automation is raising new questions about
what it means to encode news judgment in algorithms, how to customize stories
to target specific audiences without making ethical missteps, and how to
communicate these new efforts to audiences.
Automation is also opening up new opportunities for journalists
to do what they do best: tell stories that matter. With new tools for
discovering and understanding massive amounts of information, journalists and
publishers alike are finding new ways to identify and report important, very
human tales embedded in big data.
ALGORITHMS AND AUTOMATION
Years of experience, industry standards, and the AP’s own
stylebook all help Patterson and her business desk colleagues know how to tell
an earnings story. But how does a computer know? It needs sets of rules, known
as algorithms, to help it.
An algorithm is designed to accomplish a particular task.
Google’s search algorithm orders your page of results. Facebook’s News Feed
determines which posts you see, and a navigation algorithm determines how
you’ll get to the beach. Wordsmith’s algorithms write stories.
In order to write a story, Wordsmith needs both data
about the specific task and guiding principles about the general one. Your GPS
needs to know where you are now and where you’re going; it also needs to know
that “giving directions” means showing the fastest route from point A to point
B, which depends on a variety of other data like whether streets are one way,
what the speed limits are, and if there’s traffic or construction. Similarly,
to write an earnings story, Wordsmith needs the specific data about a company’s
quarterly earnings, and it also needs to know how to tell an earnings story and
what information it needs to accomplish that goal.
To train Wordsmith, Patterson had to think about the
possible stories the data might tell and which metrics might be important. Did
a company report a profit or a loss? Did it meet, beat, or miss analyst
expectations? Did it do better or worse than it did in the previous quarter or
a year earlier? Deciding which metrics and data might matter was a
head-spinning task. “You have to think of as many variables as you can, and
even then you might not think of every variable,” she says.
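To make that concrete, here is a minimal sketch, in Python, of the kind of branching logic Patterson’s team had to spell out. The field names, thresholds, and storyline labels are invented for illustration; they are not the AP’s actual rules.

```python
# Hypothetical sketch of rule-based storyline selection for an earnings story.
# Field names and logic are assumptions, not the AP's real system.

def pick_storyline(earnings: dict) -> str:
    """Choose a top-level storyline from one quarter's earnings data."""
    eps, expected = earnings["eps"], earnings["expected_eps"]
    if eps < 0:
        storyline = "loss"
    elif eps > expected:
        storyline = "beat"
    elif eps < expected:
        storyline = "miss"
    else:
        storyline = "met"
    # Prior-period comparisons are one more variable the editors must anticipate.
    if eps > earnings["year_ago_eps"]:
        storyline += "_improved_yoy"
    return storyline

print(pick_storyline({"eps": 1.10, "expected_eps": 1.02, "year_ago_eps": 0.95}))
# -> "beat_improved_yoy"
```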
Working with other journalists on the business desk, she
settled on a handful of storylines, with all their accompanying variety. She
then worked with software developers at Automated Insights, the Durham, North
Carolina–based company behind Wordsmith, who translated those story models into
code the computer could run to create a unique story for each new earnings
release. Today, the AP produces about 3,500 stories per quarter using the
automated system, and that number is set to grow to more than 4,500 by the
year’s end. Automation is taking off, in large part because of the growing
volume of data available to newsrooms, including data about the areas they
cover and the audiences they serve.
THE RACE TO PUBLISH BUSINESS DATA QUICKLY DATES BACK TO
THE FIRST LLOYD’S LIST IN 1734
The history of the news business is, in some ways, a
history of data. The ability to collect and publish business-critical
information faster than others has been a key value proposition since Lloyd’s
List was first published in London in 1734. Companies like Bloomberg and
Thomson Reuters have built empires on their ability to provide market data to
business readers. But even outside the business media landscape, data has been
an important part of why customers have turned to news outlets: Box scores, weather,
election results, birth and death announcements, and poll results are all
classic elements of a newspaper.
Just as media have undergone a digital revolution, so
have the data that inform many elements of the news. Information of all types
is increasingly accessible in the form of “structured data”—predictably
organized information, like a spreadsheet, database, or filled-out form. This
makes it well suited for analysis and presentation using computers.
The growth of structured data is at the heart of increasing
automation efforts. Business and sports have long been data-intensive coverage
areas, so it’s no surprise that automation is being used in these areas first.
The sports and business agate was among the first content to exit the print
pages and find new homes online, in part because this kind of information is
easily handled by digital systems. But today, a growing volume of private and
public data is available in digital formats, and new tools make it easier to
pull data out of even non-digital formats.
An annotated AP earnings report
But data isn’t the same as information. Algorithmic
content creation isn’t just about turning a spreadsheet of numbers into a
string of descriptive sentences; it’s about summarizing that data for a
particular purpose.
The Associated Press’s data is provided by Zacks
Investment Research; that company uses human analysts who review Securities and
Exchange Commission data, stock pricing, and press releases to build a custom
feed of the numbers the AP has requested. That data is sent to Automated
Insights, and Wordsmith assembles the stories following the rules Patterson and
her colleagues helped set.
Translating even the simplest data means converting the
loose guidelines a human reporter might follow into concrete rules a computer
can follow. For example, a human reporter might have a general idea of when a
company’s performance was very different from analyst expectations, based on
their knowledge of the industry. But for the algorithm, the AP had to specify
exact ranges for which the spread between actual earnings and expectations is
considered large or small. Wordsmith uses such metrics to decide both which
words are used to describe the data and how the story is structured—for example,
whether the fact that a company missed analyst estimates should be mentioned in
the headline. The story-assembling algorithm uses a predetermined set of
vocabulary and phrases (known as a corpus) that follows the AP’s strict
stylebook rules.
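A hedged illustration of that translation step follows: turning a loose editorial judgment (“the company badly missed expectations”) into the exact numeric ranges an algorithm needs. The cutoffs and phrases below are invented; the AP’s real ranges and corpus are not public.

```python
# Map the gap between actual and expected earnings per share to an
# approved phrase from a controlled vocabulary. All values are hypothetical.

def describe_spread(actual: float, expected: float) -> str:
    """Return a stylebook-approved verb phrase for the earnings surprise."""
    spread = (actual - expected) / abs(expected)  # assumes expected != 0
    if spread <= -0.20:
        return "fell far short of"
    if spread < 0:
        return "missed"
    if spread < 0.20:
        return "beat"
    return "handily topped"

print(f"The company {describe_spread(0.70, 1.00)} Wall Street expectations.")
# -> "The company fell far short of Wall Street expectations."
```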
“It’s a lot!” Patterson says. “To come up with a system
to trigger the right type of story, we as reporters and editors and programmers
have to figure out this stuff ahead of time.”
You have to know “what it is you want your data to tell
you,” says Evan Kodra, a senior data scientist with Lux Research, a
Boston-based market research firm. The more targeted and specific the
questions, the better the results. “It still takes a lot of creativity to
define the problem.”
Editors say that’s one reason they’re incorporating
automation technologies into their workflow: It enables them to focus on the
fundamental work of being a reporter. “Isn’t that our whole job: understanding
the purpose of any kind of narrative before we do it?” asks Scott Klein, an
assistant managing editor at ProPublica. “In a way, our job is figuring out the
purpose of the story and figuring out a way of telling it.”
ProPublica’s first—and so far only—foray into automated
journalism was part of “The Opportunity Gap,” a data-driven analysis of which
states are (or aren’t) providing low-income high school students with the
coursework they need to attend and succeed in college.
Studies have shown that advanced high school coursework
can improve a student’s college outcomes, and in 2011, ProPublica released an
investigation into where low-income high school students have equal access to,
and enrollment in, advanced courses. The analysis was based on a new data set
from the U.S. Department of Education. ProPublica used the data to create an
interactive news app to accompany the story. Website visitors could explore the
data at the federal, state, district, and school levels.
Two years later, the team was preparing to update the app
with current data when Narrative Science, a Chicago-based competitor of
Automated Insights, approached them. The company’s platform, Quill, uses a
similar algorithmic method to produce stories from sets of data. ProPublica had
spent months analyzing, interpreting, reporting out, and correcting the
Department of Education’s data set. “The data were so well structured and we
understood it so well,” says Klein, meaning it was a good fit for automation. They
decided to use automation tools to provide a written narrative to accompany
each of the 52,000 schools in the database.
How an algorithm creates a story
Each of the profiles needed to provide a summary of the
data for an individual school, but it also needed to connect each school with
the broader story. To provide context, ProPublica decided to include both a
summary paragraph outlining the thesis of the broader investigative work and a
comparison with another school to show the local context. To produce the
narratives, ProPublica’s editors provided Narrative Science with their complete
data set as well as some sample write-ups. But the most important part was
selecting the right schools for comparison.
The editors wanted to prioritize comparisons that showed
differences with regard to opportunities, but it wasn’t appropriate to compare
a school in California to a school in Chicago because the economic and policy
conditions can vary widely across such geographic gaps. Based on their
reporting, Klein and data editor Jennifer LaFleur decided to first restrict the
comparison to schools within the same district or state, before highlighting
data that showed similarities or differences between the compared schools.
“Even though the data look the same,” says Kris Hammond, chief scientist and
co-founder of Narrative Science, “there are so many different environmental
conditions that are outside the scope of this data that the comparisons would
not fly and would, in fact, be making false analogies.” This kind of
journalistic insight is critical to fine-tuning the performance of algorithms.
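A rough sketch of the comparison rule Klein and LaFleur describe: restrict candidates to the same state, then pick the peer whose opportunity data differs most. The field names and the use of advanced-course enrollment as the difference metric are assumptions for illustration.

```python
# Hypothetical comparison-school selection in the spirit of ProPublica's rule:
# never compare across state lines, then highlight the starkest gap.

def pick_comparison(school: dict, candidates: list[dict]) -> dict:
    """Pick a same-state peer school that best highlights an opportunity gap."""
    peers = [c for c in candidates
             if c["state"] == school["state"] and c["id"] != school["id"]]
    # Maximize the difference in advanced-course enrollment rates.
    return max(peers, key=lambda c: abs(c["ap_rate"] - school["ap_rate"]))

school = {"id": 1, "state": "IL", "ap_rate": 0.12}
pool = [{"id": 2, "state": "IL", "ap_rate": 0.55},
        {"id": 3, "state": "CA", "ap_rate": 0.90}]
print(pick_comparison(school, pool))  # the Illinois school, never the Californian one
```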
Like any human reporter, robot journalists need editors.
But the challenge of editing automatically generated stories isn’t in
correcting individual stories; it’s in retraining the robot to avoid making the
same mistake.
In May 2015, The New York Times published an article about a
new study on how the place where a child grows up affects his or her economic
opportunities later in life. The study used tax records to track the fates of 5 million children who
moved among U.S. counties between 1996 and 2012. The study concluded, “The area
in which a child grows up has significant causal effects on her prospects for
upward mobility.” To accompany the article, the Upshot team produced an
interactive piece that highlights data for each of the 2,478 counties included
in the study.
But rather than just present a searchable database or
zoomable map, graphics editors wrote an article that adapts to the user, based
on their current location. The piece infers the reader’s county from their IP
address; key paragraphs then highlight local income statistics and compare them
to national averages, and the accompanying map automatically focuses on that
county and its neighbors.
Users can choose other locations, but rather than seeing an entirely separate
story, the same story gets new data and a new lede for the new location.
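A simplified sketch of that localization step: resolve the reader’s location to a county, then rewrite the lede around that county’s numbers. The county names and dollar figures below are fabricated stand-ins, and the Times’s actual pipeline is not public.

```python
# Hypothetical county lookup feeding a location-aware lede.

COUNTY_EFFECTS = {  # invented numbers standing in for the study's estimates
    "Alpha County": 2000,
    "Beta County": -1100,
}

def localized_lede(county: str) -> str:
    effect = COUNTY_EFFECTS[county]
    direction = "better" if effect > 0 else "worse"
    return (f"{county} is {direction} than average for poor children: growing up "
            f"there changes expected adult income by about ${abs(effect):,} a year.")

print(localized_lede("Alpha County"))
```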
When “The Best and Worst Places to Grow Up” was released,
many users didn’t notice that the text was assembled algorithmically. They just
arrived at the page and thought their version of the story was the only version
of the story. That seamless experience is partially the point, but it comes
with its own editorial demands. “Because people think this is edited by a human
editor, you have to have the same standards, accuracy, quality, and tone.
There’s a big danger in messing things up,” says Gregor Aisch, a graphics
editor for The New York Times.
FOR ITS BEST PLACES TO GROW UP REPORT, THE NEW YORK TIMES
USED ALGORITHMS TO ASSEMBLE STORIES THAT VARIED DEPENDING ON THE USER’S
LOCATION
The story uses pre-assigned blocks of text and follows
specific rules for how to assemble paragraphs based on the available data. In
some cases, it might be as simple as substituting a new number or county name.
For the best- and worst-performing counties, the story got additional sentences
that only appeared in those contexts. In addition to editing the pre-written
chunks of text, editors had to check for flow between sentences in multiple
possible arrangements.
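A toy version of those assembly rules: pre-written blocks with slots for numbers and names, plus extra sentences that fire only for the best- and worst-performing counties. All of the text and data here are invented.

```python
# Hypothetical block assembly with conditional sentences, in the spirit of
# the rules described above.

def assemble(county: str, rank: int, total: int, effect: float) -> str:
    parts = [f"{county} ranks {rank} of {total} counties studied."]
    if rank == 1:
        parts.append("It is among the best places in America to grow up poor.")
    elif rank == total:
        parts.append("Few places in the study fared worse.")
    parts.append(f"Each year of childhood there changes adult income "
                 f"by about ${effect:,.0f}.")
    # Editors must check that every possible arrangement still reads smoothly.
    return " ".join(parts)

print(assemble("Alpha County", 1, 2478, 2000))
```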
The challenge is even trickier for newsrooms using
systems like Quill or Wordsmith, because these systems use more “word
variables,” and they have more options for how to describe data. So, the same
data might be able to produce a dozen different variations of a story.
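A minimal sketch of a “word variable”: several approved ways to phrase the same fact, so identical data can yield many variants of a story. The corpus here is invented.

```python
# Hypothetical word variable: one fact, several approved phrasings.
import random

BEAT_PHRASES = ["topped", "beat", "came in ahead of"]  # invented corpus

def sentence(company: str) -> str:
    return f"{company} {random.choice(BEAT_PHRASES)} analyst expectations."

random.seed(0)
print(sentence("Acme Corp."))  # one of three variants of the same fact
```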
For now, the process for editing these stories is more or
less the same as for human writers: reviewing drafts. Klein says most of the
drafts that ProPublica received at first had errors; data appearing in the
wrong parts of the story was the most common mistake. Once editors mark up the
drafts, developers make the changes to the code to ensure it doesn’t happen
again. Over time, ProPublica felt confident it had a system that produced
accurate stories and used language with which its editors were comfortable.
The Associated Press also spent months reviewing drafts,
refining the story algorithms, and verifying the quality of the data supplied
by Zacks. The first quarter that the system was live, editors reviewed drafts
of every story before it was put out onto the wire, checking for errors in both
the data and the story. Now, the majority of stories go live on the wire
without a human editor’s review.
The AP says the only errors it still sees come from
errors in the data passed to the system. Some are simple typos or transposed
numbers, while others stem from more complicated human errors. Unless data is
gathered by a digital sensor, the process almost always starts with humans
doing data entry, which is often where problems are introduced. Since the
project’s inception, Patterson says, only two published errors have been traced
back to the algorithm.
In July, Netflix released second-quarter earnings at the
same time as its stock underwent a 7-for-1 split. But the data Wordsmith
received didn’t reflect the split, so Wordsmith initially reported that the
price of an individual share had fallen 71 percent and noted that the company had
missed analyst expectations for per-share earnings. Neither claim was true.
In fact, investors who owned the stock saw an increase in the value of their
portfolio; Netflix’s share price has more than doubled since the beginning of
the year. This was, in effect, a human error: The analyst data should have
reflected the stock split. But Wordsmith does not have an automated warning
that kicks in when something anomalous—like a 71 percent drop in share price
from a company like Netflix—appears. The lesson, for automated and
human-generated stories alike: Your data have to be bulletproof, and you need
some form of editorial monitoring to catch outliers.
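That lesson can be sketched as code: an automated sanity check that holds a story for human review when the data looks anomalous. The 50 percent threshold below is an arbitrary placeholder, not an AP rule.

```python
# Hypothetical outlier guard: hold anomalous stories for a human look
# instead of sending them straight to the wire.

def needs_review(prior_price: float, new_price: float,
                 max_move: float = 0.50) -> bool:
    """Flag share-price moves too large to publish without review."""
    change = abs(new_price - prior_price) / prior_price
    return change > max_move

# A 71 percent apparent drop would be held for review, not published.
print(needs_review(700.00, 203.00))  # True
```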
The story was updated with a correction, following the
same processes as for any human-generated story. But because of the way AP
stories are syndicated, uncorrected versions of the story persist online.
Patterson says it’s wrong to blame automation for that kind of error. “If the
data’s bad, you get a bad story,” she says.
Tom Kent, the AP’s standards editor, acknowledges that
mistakes are an issue that the AP takes seriously—but he also points out that
human-written stories aren’t error free, either. “The very stressful job for a
human of putting together figures and keeping data sets separate and not mixing
revenue and income and doing the calculations correctly was a prescription for
mistakes as well,” he points out. According to Patterson, who oversees all
corrections (human or otherwise) for the business desk, the error rate is lower
than it was before automation, though she declined to provide exact figures.
ALGORITHM-ASSISTED JOURNALISM
The imperative to avoid errors prompted the AP to keep
its automated stories simple, which makes them, well, somewhat lifeless. Other
Wordsmith users include descriptions of the major factors that drive the
overall trend in a data set, such as the highest-scoring players in fantasy
sports recaps or the top-performing stocks and categories in financial
portfolio summaries. The AP has opted to exclude these more analytical facts, often
included in human stories, from automation because of concerns about adding too
much complexity too quickly. “There are things we decided not to do quite yet
that were presented as possibilities,” Patterson acknowledges. “We chose not to
add them to the stories, because we were really committed to making sure that
the accuracy of the stories was intact.”
Instead, the Associated Press has human editors who add
context to many of its automated stories. At least 300 companies are still
watched closely by the AP’s business desk staff: 80 always get additional
reporting and context from Associated Press staff, and another 220 are reviewed
by editors, who may enhance the story with their own reporting
or context. That system has created significant efficiencies for the AP,
freeing up 20 percent of the staff’s time across the business desk, estimates
Lou Ferrara, the vice president and managing editor who oversaw the project.
And that doesn’t take into account the impact of the initiative on the AP’s
customers.
AT AP, AUTOMATED STORIES HAVE FREED UP 20 PERCENT OF THE
BUSINESS DESK’S TIME
One of the biggest impacts of the Associated Press’s
automated earnings project has been its expanded coverage of smaller companies
that are primarily of interest to local markets. The AP’s customers are,
largely, local outlets, and companies of interest to these clients had fallen
out of AP coverage during the cutbacks of the 2000s. For communities, this was
a potentially significant loss. “If there’s a big company, it’s employing
people in your family, your neighbors, people you go to church with,” says
Patterson. “There are a lot of people who are interested in the economic health
of that company.”
In Battle Creek, Michigan, for example, the Kellogg
Company is one of the region’s most important employers, and its fingerprints
are all over town—from thousands of monthly pay stubs at the bank to the names
on a school to the W.K. Kellogg Foundation’s cheery-looking headquarters downtown.
Pat Van Horn is among the locals who worked at Kellogg until she retired in
2010. She and her husband Lance still pay attention to what’s happening at the
company. They have friends who work there, and like many locals, they’ve got
Kellogg stock in their portfolio.
So each quarter, when the company releases its earnings
statements, the Van Horns glance at the Battle Creek Enquirer to see how things
are going. “You know, what was the earnings report this quarter, the dividends
are going to be X amount per share,” says Lance. “We follow them a little bit.”
The Enquirer uses the Associated Press’s earnings stories
as the foundation for its coverage of the company; local reporters add context,
digging deeper on issues that are likely to impact Battle Creek. That frees up
reporters and editors to do the work that the computers can’t do.
For the AP, content licensing is king, making up the vast
majority of the company’s revenue, and newspaper and online customers accounted
for 34 percent of 2014 revenue. Continuing to deliver content those customers
want is key to retaining their business. “It’s not like we’re going to be
growing revenue in the local markets in any particular way,” says Ferrara.
Instead, the AP sees the automated earnings as a way to retain customers,
particularly those that have been hard hit by job losses across the industry.
The AP is doubling down on that strategy; the company has
continued to expand its automation efforts, adding public companies with a
market capitalization above $75 million as well as select Canadian and European
firms. Many of these companies would never have been covered by the AP’s staff
writers. The same is true for other areas of coverage the AP is looking to
automate, including Division II and Division III college football and
basketball games.
By offloading the basic reporting work, the AP hopes it’s
making it easier for local papers to focus on the stories that matter to
community members, like the Van Horns. “We’re not here merely to be just
churning out numbers,” says Lisa Gibbs, the AP’s business desk editor. “We’re
really writing these stories for customers who are more likely to have shopped
at a Walmart than to own individual stock in Walmart.”
Gibbs was hired just after the introduction of the
automated earnings stories. She says it was an opportunity for the team to
rethink how the company was going to cover business. With automation, Gibbs
says her team has been able to focus on doing the kinds of medium-sized
enterprise stories that had been squeezed out before. She points to the example
of a piece by business reporter Matthew Perrone, who covers the U.S. Food and
Drug Administration, which reported on a lack of regulatory oversight for the
growing number of stem cell clinics. “We were able to take some time, send him
to travel to some of these clinics, and ultimately publish the story,” she
says.
The AP isn’t alone in using automation to support its
reporting efforts. One of the first places to adopt automation in the newsroom
was the Los Angeles Times. The paper’s Homicide Report maintains a database
with information about every homicide reported by the Los Angeles County
coroner’s office; each victim profile includes a brief automatically generated
write-up. It’s up to reporters to decide which stories deserve more in-depth
reporting.
The LA Times built on the lessons from that project with
the introduction in 2011 of Quakebot. Ken Schwencke, a digital editor on the LA
Times’s data desk at the time, used data from the USGS Earthquake Notification
Service to automatically generate short reports on earthquakes above the
“newsworthy” threshold of a 3.0 magnitude. LA Times reporters review the
stories, publish them, and update the story with additional information as it
becomes available. Quakebot’s big advantage is speed; a story can be posted
online in under five minutes. This kind of assistive role is one that many news
organizations insist is the foundation of their automation efforts.
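A Quakebot-style filter, sketched below: take earthquake events (the USGS publishes these as structured public feeds) and draft a short report for any quake at or above the newsworthy 3.0-magnitude threshold. The event format here is a simplification, not the USGS schema or the LA Times’s actual code.

```python
# Hypothetical quake-alert drafting in the spirit of Quakebot.

def draft_quake_story(event: dict, threshold: float = 3.0):
    """Draft a short post for any quake at or above the newsworthy cutoff."""
    if event["magnitude"] < threshold:
        return None  # below the threshold; no story
    return (f"A magnitude {event['magnitude']} earthquake struck "
            f"{event['place']} at {event['time']}, according to the USGS. "
            "This post will be updated as more information becomes available.")

print(draft_quake_story(
    {"magnitude": 3.4, "place": "5 miles from Westwood, Calif.",
     "time": "6:25 a.m. Pacific"}))
```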
The AP recently hired its first “news automation editor,”
Justin Myers. He sits on the editorial team and is represented by the News
Media Guild, just like his editorial colleagues. His job is to help figure out
how to streamline editorial processes and “give time back to the writers,
editors, and producers, who in a lot of cases are slogging through whatever
processes we’ve built up over the years, rather than focusing on doing
journalism.”
Myers has spent his first few months on the job mostly
interviewing reporters, editors, and producers to find out what work he can
help take off their plates. The number one question he’s asking: “How do you
spend your time?” If it’s possible to automate some of a staffer’s burdensome
tasks, Myers is happy to help. “Let’s have a computer do what a computer’s good
at, and let’s have a human do what a human’s good at,” he says.
Alexis Lloyd, creative director of The New York Times
R&D Lab, agrees. The general public’s thinking about automation hasn’t been
updated since the 1950s, she says. Typically, we imagine an all-or-nothing
scenario: either humans do everything or machines do. She says that’s wrong; across
all kinds of industries the approach to automation has changed to focus on more
assistive technologies. “We’ve been thinking that the future of computational
journalism and automation will—and should—be a collaborative one, where you
have machines and people working together in a very conversational way,” she
says.
Several news organizations are using automation to
support their reporters’ work behind the scenes, too. Lloyd mentioned Editor, a
new tool that integrates with the company’s content management system to help
reporters tag content by providing automated suggestions. Similar efforts are
under way at BBC News Labs, with a tool called Juicer.
These tools support news organizations in their push to
develop new storytelling formats that highlight the relationships between news
events and help provide readers with richer context. Most of these efforts
require large amounts of detailed metadata that can help link together stories
that have in common people, places, or ideas. Adding metadata is a frustrating
task for most reporters, who are typically more concerned with crafting their
story than dissecting it. Automation is a way to expand the use of
metadata—without putting an extra burden on reporters and editors.
Behind-the-scenes tools can also help reporters in more
proactive ways. For example, another tool from The New York Times R&D Lab
automatically tracks its stories on Reddit, looking for hot conversations, and
alerts journalists when there’s an active discussion of their work they might
be interested in monitoring or participating in.
This is one of the most promising areas for automation in
the newsroom, says Nick Diakopoulos, an assistant professor at the University of
Maryland’s Philip Merrill College of Journalism, who has been studying the use
of algorithms in media. By tracking social media or other public data sets,
automation tools can support newsgathering in a digital environment. Tools
like these can raise journalists’ awareness of issues, help them keep track of
important data sets, and let them listen in on conversations and react
more quickly, he says.
PERSONALIZATION AND REVENUE
Automation can also become a useful tool for connecting
with audiences more directly. In June, journalists at Oregon Public
Broadcasting (OPB) rolled out a news app to accompany a series on earthquake
preparedness in the state. The app, called Aftershock, provides a personalized
report about the likely impacts of a 9.0 magnitude earthquake on any user’s
location within the state, based on a combination of data sets.
The scenario isn’t just speculative; the region is widely
expected to face a massive quake of just this sort, known as the Cascadia
quake. “OPB has been doing a bunch of coverage on the Cascadia quake and how
people can prepare, but a lot of people don’t care about a topic until it
affects them directly,” says OPB’s Jason Bernert. “When you put them in the
center of the story, they take an interest.”
OREGON PUBLIC BROADCASTING’S AUTOMATED LOCALIZED
EARTHQUAKE PREPAREDNESS REPORTS WOKE UP LISTENERS TO THEIR VULNERABILITIES
Aftershock uses data sets on earthquake impacts that were
modeled by the Oregon Department of Geology and Mineral Industries and impact
zones defined by the Oregon Resilience Plan report. The data mixes and matches
ratings for things like shaking, soil liquefaction, landslide risk, and
tsunamis. In total, there are 384 possible combinations, and users see a
version of the story that’s relevant to the location they’ve selected. As with
The New York Times’s “Best and Worst Places to Grow Up” interactive, the news
app dynamically stitches together the various elements of the story—which the
OPB team calls “snuggets,” a portmanteau of “story nuggets”—based on the data
for each location.
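A sketch of the “snugget” approach: each hazard rating maps to a pre-written story nugget, and a location’s combination of ratings selects which nuggets get stitched together. The ratings and text below are invented; OPB’s real app draws on the state geology data described above.

```python
# Hypothetical snugget table keyed on (hazard, rating) pairs.

SNUGGETS = {
    ("shaking", "severe"): "Shaking in your area is expected to be among the worst.",
    ("shaking", "moderate"): "Your area should see moderate shaking.",
    ("liquefaction", "high"): "The soil beneath you is prone to liquefaction.",
    ("liquefaction", "low"): "Soil liquefaction is unlikely where you are.",
}

def build_report(ratings: dict) -> str:
    """Stitch together the snuggets matching one location's hazard ratings."""
    return " ".join(SNUGGETS[(hazard, level)] for hazard, level in ratings.items())

print(build_report({"shaking": "severe", "liquefaction": "high"}))
```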
Some of the data applies to broad regions of the state,
but other data sets have estimated impacts for regions as small as 500 meters.
Aftershock takes advantage of that granularity by showing users the expected
impacts for specific addresses. “There’s a big difference between ‘a 9.0
earthquake in Oregon’ and ‘your area is where shaking is going to be the
worst,’” says Bernert. “It has a different emotional response for people to
start different conversations.”
The editorial appeal of projects like this is clear—but
personalization has the potential to attract the interest of the business side
as well. Following the mid-July publication of The New Yorker’s in-depth
article about the Cascadia quake, Bernert says, Aftershock’s traffic soared.
For a few days afterward, the site was handling 300 times the usual number of
requests. Other OPB reporting on the Cascadia quake saw an increase in traffic,
but “the real social driver was Aftershock,” he says. On Facebook, users were
sharing Aftershock and saying, “This is what’s going to happen to me; I better
go out and get prepared” and encouraging others to check out how they would be
impacted as well.
Although none of the current implementations focus specifically on monetizing
personalized content, personalization is the aspect of automation that could
have the largest effect on potential news revenue.
Automation is already being used today to personalize
some news organizations’ homepages or to provide “recommended for you”
features. By further increasing engagement with users, automation that
personalizes content could have positive impacts on revenue from advertising
and subscriptions. This kind of personalization provokes anxiety among many
news professionals, who worry that personalization will limit readers’ exposure
to the stories editors might deem important in favor of things that are
frivolous. As Mark Zuckerberg said when describing the value of the News Feed,
“A squirrel dying in front of your house may be more relevant to your interests
right now than people dying in Africa.”
For now, most article personalization efforts focus on
types of users, much as Aftershock uses automation to match specific addresses
to general scenarios. It’s more like having a shirt with the right collar and
sleeve-length measurements than a handmade, custom-tailored one. “We haven’t
got it down to the person yet,” acknowledges Joe Procopio, chief product officer
for Automated Insights.
That’s true even of the most sophisticated algorithms
used by credit agencies, retailers, and personnel companies; vast
quantities of personal info are crunched to pigeonhole users as a “type” that
can be used to predict loan default risk, send perfect-for-you coupons, or
characterize your management style. Take the example of Crystal, an artificial
intelligence tool that helps you write better e-mails for specific individuals,
based on their online profiles. The program reviews things someone has written
online, such as their LinkedIn profile, and identifies them as one of 64 types.
Each type has associated communication tips about things like vocabulary to
avoid, how much detail to include, and how formal the language should be.
Automated Insights provides this kind of customization to
its commercial customers already—one car-sales website uses Wordsmith to show
users slightly different descriptions of the vehicles based on their profile,
Procopio says. A first-time car buyer might be shown a description that
emphasizes the car’s fuel performance, while a mother in the market for a
family vehicle might see descriptions that emphasize safety ratings. In both
cases, the information in the profiles is the same, but different features are
prioritized.
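The car-site example, sketched in code: the underlying facts are identical for every reader, but a profile type reorders which features lead the description. The profile labels and copy below are invented, not Automated Insights’ actual configuration.

```python
# Hypothetical profile-driven feature prioritization.

FEATURE_PRIORITY = {  # invented profile types and emphasis orderings
    "first_time_buyer": ["fuel_economy", "price", "safety"],
    "family_shopper": ["safety", "space", "fuel_economy"],
}

def describe(car: dict, profile: str) -> str:
    """Render the same car facts with profile-specific emphasis order."""
    ordered = FEATURE_PRIORITY[profile]
    return "; ".join(f"{f.replace('_', ' ')}: {car[f]}" for f in ordered)

car = {"fuel_economy": "38 mpg", "price": "$19,500",
       "safety": "5-star rating", "space": "seats 7"}
print(describe(car, "family_shopper"))  # safety leads for the family shopper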
Diakopoulos says lack of data is a significant barrier to
such personalization of stories. To push further into true personalization,
news organizations would need to collect a lot more information about their
users—and develop strategies for how to address stories to those different
types of users. “News organizations aren’t very good about even having user
models,” Diakopoulos points out. “They don’t really know who’s on their site.
That’s very different than having a robust user profile and an ability to adapt
the page based on the cookie profile and so on.”
Even if users were to agree to provide more detailed
information—by logging in with Facebook or LinkedIn, say—journalists would
still need to be at the helm of efforts to target those users with content in
specific ways. Automating a story around just one variable—geography—is already
complex; each additional variable makes the task exponentially more difficult. For newsrooms,
that presents significant challenges for editing, fact checking, and writing
multiple variations of the “snuggets” to be used in the stories. “There is
concern,” says Procopio, “that someone is going to read a story and not get all
the facts because it’s biased toward that person. I don’t think that concern is
warranted.”
As the technology improves, the potential value of
personalization from a revenue perspective will certainly become more
important. Frank Pasquale, a professor at the University of Maryland Francis
King Carey School of Law and author of a recent book on the pervasive power of
algorithms, “The Black Box Society,” argues that if stories can eventually be
customized for users based on factors like their income, where they live, or
any of the micro-categories (e.g., “cat lover,” “Walmart shopper,” or “STD sufferer”)
that data brokers collect from our online lives, newsrooms will almost
certainly face pressure to do so. “That’s going to be seen, eventually, as
revenue maximizing,” Pasquale says.
He suggests a question for newsrooms to consider as they
apply personalization: “To what extent is this ‘dead-squirrel’ personalization
and to what extent is this personalization that draws people creatively into
stories about other parts of the world?”
To focus on the latter, one option is to rely less on
broad personal data that sparks fears about who algorithms assume a user is and
instead focus on what’s relevant about a user’s relationship to a particular
story. For example, The New York Times’s Upshot team has recently published a
few stories that use in-story interactions to adapt a story to a user’s
existing knowledge or views on the subject.
In one case, users were asked to draw a line on a graph
they thought represented college enrollment rates across economic groups. Based
on the line drawn, users were shown one of 16 different versions of the story,
each of which explained the real data while comparing them to the user’s own
assumptions about the issue. It was a simple but very successful piece of
explanatory journalism because it focused the written article on information
most relevant to the reader, without changing the reported parts of the
story. Projects like these also have the advantage of built-in transparency
about what characteristics are being used to automate the story’s creation.
Users actively provide the information to the system in order to get the
information they want, and that input data is clearly linked to the story
itself.
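A schematic of that Upshot approach: classify the reader’s guess, then serve the matching pre-written version. The bucketing below (two 4-way splits yielding 16 slots) is a guess at the mechanics for illustration, not the Times’s code.

```python
# Hypothetical mapping from a drawn line to one of 16 story versions.

def pick_version(guessed_slope: float, guessed_level: float) -> int:
    """Map a reader's drawn line (values in 0..1) to a version index 0..15."""
    def bucket(x: float) -> int:  # quartile-style bucket, 0..3
        return min(3, max(0, int(x * 4)))
    return bucket(guessed_slope) * 4 + bucket(guessed_level)

print(pick_version(0.8, 0.2))  # -> 12, one of the 16 possible versions
```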
SMALL NEWS ORGANIZATIONS COULD, IN FACT, HAVE THE MOST TO
GAIN FROM AUTOMATED JOURNALISM. THEY ARE WELL-POSITIONED TO DRAW ON LOCAL DATA
TO WRITE STORIES
Transparency is one of the stickiest issues facing
automation systems, particularly as they intersect with personalization. Kent,
the AP’s standards editor, thinks concerns about algorithmic transparency are
overblown when it comes to automatically generating content. “Human journalism
isn’t all that transparent,” he says. “News organizations do not accompany
their articles with a whole description of what was on the journalist’s mind
that could have affected his thinking process, whether he had a head cold, had
just been hung up on by a customer service rep of the company he was writing
about, and so on.”
Because the rules governing how automated stories get
assembled are available for scrutiny, automated journalism may be more
transparent than stories written by humans, he argues. But for the majority of
projects, it’s hard to know what value readers might find in disclosures, even
if they were presented. Mike Dewar, a data scientist in The New York Times
R&D Lab, has written about the futility of publishing documentation if none
of the intended audience can read it. Instead of just publishing open source
data or documentation on algorithms, he argues, the community needs to adopt
common standards and procedures.
That kind of standardization could benefit non-technical
users, who would become more familiar with how such projects work and what to
expect. Standardization could also help smaller newsrooms experiment with
automation. At OPB, Aftershock was a hugely successful project, but it required
some heavy lifting from the small public media team. Bernert and his colleague
Anthony Schick built the app during a three-day build-a-thon sponsored by the
University of Oregon’s journalism school, with the pro-bono assistance of a
local interactive design firm, students, and academics. “There’s a lot of value
to this kind of work,” Bernert says. “But how do we make it sustainable for a
small public media newsroom?” A larger shared set of technologies and
methodologies would help.
Small news organizations could, in fact, have the most to
gain from using automation. While Wordsmith and Quill are focused on expanding
in big-dollar markets like financial information and insurance, they’ve
demonstrated their technology on a variety of local data, such as water quality
reports from public beaches and public bike-share station activity. Local news
organizations could be well positioned to take advantage of this kind of
structured data using automation, either by expanding their coverage or by
creating new products. Commercial providers once siphoned off some of news
organizations’ most important revenue streams by finding better ways to deliver
classified ads, job listings, home sales, and other information—much of which
is available in the form of structured data. Automation could be one way for
news organizations to recapture some of that revenue.
After all, automation is about putting narratives around
data, and news organizations have the skills and experience needed to do just
that.