Fastly conducts post-mortem after global Internet outage


US-based cloud computing services provider Fastly says the global Internet outage experienced yesterday was the result of an “undiscovered software bug”.

The provider of real-time content delivery network services suffered an outage early yesterday that reportedly affected numerous Web sites across the globe.

In a statement to ITWeb, a Fastly spokesperson says: “We identified a service configuration that triggered disruptions across our POPs [points of presence] globally and have disabled that configuration. Our global network is coming back online.”

Fastly describes its network as an edge cloud platform, which is designed to help developers extend their core cloud infrastructure to the edge of the network, closer to users.

The Fastly edge cloud platform includes its content delivery network, image optimisation, video and streaming, cloud security, and load-balancing services.

The company’s cloud security services include denial-of-service attack protection, bot mitigation, and a Web application firewall. Its Web application firewall uses the Open Web Application Security Project ModSecurity Core Rule Set alongside its own ruleset.
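To illustrate the kind of request inspection a rule set like this performs, here is a toy sketch in Python. The pattern and function are purely illustrative, not an actual ModSecurity CRS rule or Fastly code:

```python
import re

# Toy example of WAF-style pattern matching: flag query strings that
# resemble common SQL-injection probes. Illustrative only; real CRS
# rules are far more extensive and tuned to avoid false positives.
SQLI_PATTERN = re.compile(r"(union\s+select|or\s+1=1|--\s*$)", re.IGNORECASE)

def inspect_request(query_string: str) -> str:
    """Return 'block' if the query string matches a suspicious pattern."""
    if SQLI_PATTERN.search(query_string):
        return "block"
    return "allow"

print(inspect_request("id=42"))        # allow
print(inspect_request("id=1 OR 1=1"))  # block
```

In practice a WAF evaluates many such rules against headers, bodies, and paths, scoring requests rather than blocking on a single match.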

In a blog post, Nick Rockwell, senior vice-president of engineering and infrastructure at Fastly, says: “We experienced a global outage due to an undiscovered software bug that surfaced on 8 June when it was triggered by a valid customer configuration change.

“We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal.

“This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them.”

Describing what happened, Rockwell explains that on 12 May, Fastly began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.

Early on 8 June, he says, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of Fastly’s network to return errors.
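The failure mode Rockwell describes, a bug that lies dormant until a valid configuration exercises an untested code path, can be sketched as follows. The option names and logic here are entirely hypothetical, not Fastly's actual code:

```python
# Hypothetical sketch: each option is individually valid, but one valid
# combination reaches a dormant code path that no test ever exercised.

def apply_config(cfg: dict) -> str:
    """Validate and apply a (hypothetical) customer configuration."""
    if cfg.get("cache_ttl", 60) < 0:
        raise ValueError("cache_ttl must be non-negative")  # validation passes

    # Latent bug: when a shield POP is set, this path divides by the TTL.
    # A valid config with shield_pop set and cache_ttl == 0 crashes it.
    if cfg.get("shield_pop"):
        refresh_rate = 3600 / cfg.get("cache_ttl", 60)
        return f"shielded, refresh every {refresh_rate:.0f}s"
    return "applied"

print(apply_config({"cache_ttl": 300, "shield_pop": "ams"}))  # fine for weeks
try:
    apply_config({"cache_ttl": 0, "shield_pop": "lhr"})       # valid config...
except ZeroDivisionError:
    print("latent bug triggered")                             # ...trips the bug
```

The point of the sketch is that the triggering change is legitimate customer input; the defect is in the software's handling of a rare but valid combination, which is why it survived deployment for nearly a month.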

Timeline of events

According to Rockwell, here is a timeline of the day’s activity (all times are in UTC):

09:47 Initial onset of global disruption

09:48 Global disruption identified by Fastly monitoring

09:58 Status post is published

10:27 Fastly Engineering identified the customer configuration

10:36 Impacted services began to recover

11:00 Majority of services recovered

12:35 Incident mitigated

12:44 Status post resolved

17:25 Bug fix deployment began

“Once the immediate effects were mitigated, we turned our attention to fixing the bug and communicating with our customers. We created a permanent fix for the bug and began deploying it at 17:25,” says Rockwell.

He adds the company is deploying the bug fix across its network as quickly and safely as possible.
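Rolling a fix out "as quickly and safely as possible" typically means a staged (canary) deployment: ship to a small fraction of points of presence, watch error rates, then widen. A minimal sketch of that idea, with hypothetical POP names, stage fractions, and helper functions that are not Fastly's tooling:

```python
# Illustrative staged-rollout sketch. deploy_to(), error_rate() and
# rollback() are stand-ins for real deployment and monitoring systems.

POPS = ["ams", "lhr", "jfk", "nrt", "syd", "gru", "fra", "sin"]
deployed_pops = set()

def deploy_to(pop):
    deployed_pops.add(pop)              # stand-in for a real deploy step

def error_rate(pops):
    return 0.0                          # stand-in for real monitoring

def rollback(pops):
    deployed_pops.difference_update(pops)

def staged_rollout(pops, stages=(0.05, 0.25, 1.0), max_error_rate=0.01):
    """Deploy in widening stages, halting and rolling back on regressions."""
    deployed = 0
    for fraction in stages:
        target = max(1, int(len(pops) * fraction))
        for pop in pops[deployed:target]:
            deploy_to(pop)
        deployed = target
        if error_rate(pops[:deployed]) > max_error_rate:
            rollback(pops[:deployed])   # undo and stop if errors spike
            return False
    return True

print(staged_rollout(POPS))  # True: all POPs updated without regressions
```

The trade-off is exactly the one Rockwell names: each stage adds safety but also adds time before the whole network carries the fix.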

“We are conducting a complete post-mortem of the processes and practices we followed during this incident. We’ll figure out why we didn’t detect the bug during our software quality assurance and testing processes. We’ll evaluate ways to improve our remediation time.

“We have been – and will continue to be – innovating and investing in fundamental changes to the safety of our underlying platforms. Broadly, this means fully leveraging the isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up. We’ll continue to update our community as we make progress toward this goal.”

Concluding, Rockwell says even though there were specific conditions that triggered this outage, Fastly should have anticipated it.

“We provide mission-critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologise to our customers, and those who rely on them, for the outage and sincerely thank the community for its support.”
