Readers may be aware that on Tuesday of this week, Amazon Web Services (AWS) experienced an hours-long service outage at a major U.S. East Coast data center complex that affected large numbers of web users and a fair number of Cloud services customers. Reports of major system outages have gained higher visibility of late because today’s globally extended technology networks run their transactional, analytical and storage workloads virtually in the Cloud. The likely root cause of this week’s outage provides important learning for Cloud-based supply chain management technology providers.
According to published reports from The Wall Street Journal and other publications, an AWS employee, in an effort to speed up the company’s cloud-storage billing system, was attempting to take some storage servers offline but may have mistyped a system command. The command affected far more servers than intended. That reportedly led to a cascade of storage server failures that ultimately suspended other Amazon Cloud services. Amazon further acknowledged that restarting the affected servers took longer than expected.
The outage impacted the services of web sites such as Business Insider, Medium, Slack Technologies, Quora and others. Beyond that, the WSJ cites data from Cyence Inc. estimating that the outage may have cost companies in the S&P 500 upwards of $150 million, and data from website monitoring company Apica indicating that 54 of the web’s top 100 retailers saw website performance erode by 20 percent or more.
Responding to this week’s outage, Amazon indicates it will add safeguards to prevent server capacity from eroding too quickly or below a critical minimum level. We have no doubt that such actions will be taken swiftly, especially if CEO Jeff Bezos is involved.
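To illustrate what a safeguard of this kind might look like, here is a minimal sketch in Python. It is purely illustrative and is not Amazon’s actual implementation: the class name CapacityGuard, its parameters and the threshold values are all our own assumptions. The idea is simply that a command to take servers offline is rejected if it removes too many servers at once or would push active capacity below a critical minimum.

```python
from dataclasses import dataclass


@dataclass
class CapacityGuard:
    """Hypothetical guard that vets requests to take servers offline."""

    min_active: int            # assumed critical minimum of servers that must stay online
    max_removal_per_step: int  # assumed cap on servers removed by a single command

    def check(self, active: int, requested: int) -> int:
        """Return the approved removal count, or raise ValueError if unsafe."""
        if requested < 0:
            raise ValueError("requested removal must be non-negative")
        # Guard 1: do not let capacity erode too quickly.
        if requested > self.max_removal_per_step:
            raise ValueError(
                f"refusing to remove {requested} servers at once "
                f"(limit is {self.max_removal_per_step})"
            )
        # Guard 2: do not let capacity fall below the critical minimum.
        if active - requested < self.min_active:
            raise ValueError(
                f"removal would leave {active - requested} servers, "
                f"below the minimum of {self.min_active}"
            )
        return requested


if __name__ == "__main__":
    guard = CapacityGuard(min_active=80, max_removal_per_step=5)
    print(guard.check(active=100, requested=3))  # a small, safe removal is approved
    try:
        guard.check(active=100, requested=50)    # a mistyped, oversized request
    except ValueError as err:
        print("blocked:", err)
```

With a check like this in front of the command, a mistyped quantity would be rejected outright rather than silently applied.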
A published report from The Guardian framed this week’s outage as “a cautionary tale about the future of the Internet.” The Guardian report indicated: “when that (Cloud services) model works, it works brilliantly, providing low barrier to entry for small firms needing an online presence, economies of scale for larger companies wanting world-class hosting - and huge profits for Amazon itself.” Further noted was that the US-East region, where the outage was centered, contains some of AWS’s most visible customers because it is a hub for the largely East Coast-based U.S. publishing industry. Ironically, the outage also took down the downtime-checking site downforeveryoneorjustme.com, which allows users to ascertain whether a particular site is offline for everyone. What’s more, certain smart home users reported losing control of house functions such as locks and lights because of the cascading services failures.
The Guardian report does indicate that Amazon’s Cloud is far more stable than what most companies could build on their own. That is an important grounding since human error can occur on any system, regardless of location and management.
Supply Chain Matters would thus submit that this week’s outage provides a very important reminder: even with Amazon’s tech savvy and deep-pocket investment resources, a literal typo became a single point of failure that set off cascading failures. This should be another reminder to all other Cloud-based services providers that taking on business-critical technology support, such as the processes and decision-making surrounding supply chain management and customer fulfillment, comes not only with certain service-level, data-security and service uptime responsibilities, but also with the responsibility to ensure that any disruption, regardless of cause, can be mitigated by failover or back-up compute capacity.
In supply chain planning and customer fulfillment vernacular, we use the term “safety stock” to mean a contingency level of inventory or capacity held just in case, so that if an unexpected operational disruption occurs, an organization can continue to fulfill customer needs. That same analogy applies to the Cloud. Line-of-business and systems selection teams should ensure that any Cloud provider can articulate not only a stated uptime performance commitment but a recovery plan as well.
© Copyright 2016. The Ferrari Consulting and Research Group and the Supply Chain Matters® blog. All rights reserved.