Amazon's move off Oracle caused Prime Day outage in one of its biggest warehouses, internal report says
· Amazon's move off Oracle's database software was the main reason for an
  outage in one of its biggest warehouses on Prime Day, according to internal
  documents obtained by CNBC.
· The outage highlights the challenges Amazon could face as it looks to move
  completely off Oracle’s database by 2020.
· Amazon and Oracle have been in a heated battle of words in recent years over
  the performance of their database software and cloud tools.
By Eugene Kim October 23, 2018 CNBC.com
Amazon is learning how hard it can be to move off of
Oracle's database software.
On Prime Day, while the e-retailer was dealing with a
major website glitch that slowed sales, the company was also dealing with a
technical problem in Ohio at one of its biggest warehouses, leading to
thousands of delayed package deliveries, according to an internal report
obtained by CNBC.
The problem was in large part due to Amazon's migration
from Oracle's database to its own technology, the documents show. The outage
underscores the challenge Amazon faces as it looks to move completely off
Oracle's database by 2020, and how difficult it is to re-create that level of
reliability. It also shows that Oracle's database is more efficient in some
aspects than Amazon's rival software, a point that Oracle will likely emphasize
during this week's annual OpenWorld conference in San Francisco.
Following the Prime Day outage, Amazon engineers filled
out a 25-page report, which Amazon calls a correction of error. It's a standard
process that Amazon uses to try to understand why a major incident took place
and how to keep it from happening in the future.
The report shows that Amazon struggled to identify the
root cause of the Prime Day issue because of a feature it lost after the
database was moved over. It also failed to come up with a contingency plan in
case of an error in its newly installed database, called Aurora PostgreSQL, the
documents show.
In one question, engineers were asked why Amazon's
warehouse database didn't face the same problem "during the previous peak
when it was on Oracle." They responded by saying that "Oracle and
Aurora PostgreSQL are two different [database] technologies" that handle
"savepoints" differently.
Savepoints are an important database tool for tracking
and recovering individual transactions. On Prime Day, an excessive number of savepoints
was created, and Amazon's Aurora software wasn't able to handle the pressure,
slowing down the overall database performance, the report said.
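For readers unfamiliar with the mechanism, here is a minimal sketch of what a savepoint does inside a transaction. It is an illustration only and not taken from Amazon's report: it uses Python's built-in sqlite3 module rather than Oracle or Aurora PostgreSQL, and the table, column and function names (shipments, order_id, qty, pack_orders) are hypothetical. Each row gets its own savepoint, so a single bad row can be undone without throwing away the rest of the work. A loop like this creates one savepoint per row, which is roughly the kind of pattern that can pile up under heavy load.

```python
import sqlite3


def pack_orders(conn, orders):
    """Insert each order under its own savepoint (hypothetical helper).

    A row that violates a constraint is rolled back to its savepoint,
    while rows inserted earlier in the same transaction are kept.
    """
    skipped = []
    conn.execute("BEGIN")
    for i, (order_id, qty) in enumerate(orders):
        name = f"sp_{i}"  # one savepoint per row
        conn.execute(f"SAVEPOINT {name}")
        try:
            conn.execute(
                "INSERT INTO shipments (order_id, qty) VALUES (?, ?)",
                (order_id, qty),
            )
        except sqlite3.IntegrityError:
            # Undo only this row's work, then discard the savepoint.
            conn.execute(f"ROLLBACK TO SAVEPOINT {name}")
            skipped.append(order_id)
        conn.execute(f"RELEASE SAVEPOINT {name}")
    conn.execute("COMMIT")
    return skipped


# isolation_level=None lets the code issue BEGIN/COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE shipments ("
    "order_id TEXT PRIMARY KEY, qty INTEGER CHECK (qty > 0))"
)

# "A2" fails the CHECK constraint; the second "A1" is a duplicate key.
skipped = pack_orders(conn, [("A1", 2), ("A2", 0), ("A3", 5), ("A1", 1)])
rows = conn.execute(
    "SELECT order_id, qty FROM shipments ORDER BY order_id"
).fetchall()
print(skipped)  # ['A2', 'A1']
print(rows)     # [('A1', 2), ('A3', 5)]
```

How expensive each savepoint is depends on the database engine, which is the difference the engineers pointed to; this sketch shows only the semantics, not the performance behavior of either product.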
Could have happened anyway
"It's quite possible the outage would not have
occurred if Amazon had stuck with Oracle," said Matt Caesar, a computer
science professor at the University of Illinois at Urbana-Champaign, after CNBC
shared the details of the document. "Also, it appears they would have been
able to diagnose the problem sooner if they were using Oracle's database, which
could possibly have reduced the outage duration."
An Amazon spokesperson played down the issue in an
emailed statement and said there was no outage, even though the internal
document states that the database "degradation resulted in lags and
complete outages."
"It is important to point out that there was never
an outage at the facility, and the issue only resulted in delaying shipping of
about one percent of packages for a short period of time," the
spokesperson said. "This issue was quickly diagnosed and resolved."
The Ohio warehouse is the largest of the 13 warehouses
that moved their databases off Oracle prior to Prime Day. During the Prime Day
period, it handled over 1.1 million packages per day, the documents say. All
services and software that handle inventory and shipping data had been migrated
to Aurora in those warehouses.
The outage, which lasted for hours on Prime Day, resulted
in over 15,000 delayed packages and roughly $90,000 in wasted labor costs,
according to the report. Those costs don't include all the lost hours spent by
engineers troubleshooting and fixing the errors or any potential lost sales.
In a section titled "Lessons Learned," Amazon
engineers wrote that "Savepoint behaves differently in Aurora PostgreSQL
than in Oracle," suggesting Oracle's software would have handled the issue
more efficiently. It also says SQL statement data did not exist for analysis in
PostgreSQL, and having access to that data "would have helped
pinpoint" the root cause of the problem.
The outage may have been less severe had Amazon been more
prepared. In one part of the document, the company said it "took a long
time to mitigate" the problem because of a "lack of a reaction plan
when the underlying PostgreSQL DB experiences performance issues." The
document also said a "well-established reaction plan or runbook"
could have helped "mitigate the impact sooner."
"My guess is that they changed databases a while
ago, didn't test the exact load model that occurred during their Amazon Prime
Day and got surprised, badly," Henning Schulzrinne, a computer science
professor at Columbia University, said after reviewing the document.
Amazon and Oracle have been in a heated battle of words
in recent years, as Amazon has expanded its software offerings to more directly
compete with Oracle. CNBC reported in August that Amazon is working on moving
its entire database off Oracle by early 2020.
'It's really, really hard'
Oracle Chairman and co-founder Larry Ellison isn't buying
it. On the company's earnings call in December, Ellison said Amazon "is
not moving off of Oracle." He reiterated his point at an August event,
saying, "I don't think they can do it."
"They've had 10 years to get off Oracle, and they're
still on Oracle," he said. "And it's not going to be easy for them to
use their own technology. It's not going to be cost-effective. I mean, it's
really, really hard."
Patrick Moorhead, principal analyst at Moor Insights
& Strategy, said the incident shows how hard it is for older applications,
like those used in Amazon's warehouses, to move off Oracle, which has spent
decades working with the world's largest enterprises.
"AWS Aurora is designed for forward-looking
applications and Oracle for more legacy applications," he said.