Fastly conducts post-mortem after global Internet outage


US-based cloud computing services provider Fastly says the global Internet outage experienced yesterday was the result of an “undiscovered software bug”.

The provider of real-time content delivery network services suffered an outage early yesterday that reportedly affected numerous Web sites across the globe.

In a statement to ITWeb, a Fastly spokesperson says: “We identified a service configuration that triggered disruptions across our POPs [points of presence] globally and have disabled that configuration. Our global network is coming back online.”

Fastly describes its network as an edge cloud platform, which is designed to help developers extend their core cloud infrastructure to the edge of the network, closer to users.

The Fastly edge cloud platform includes its content delivery network, image optimisation, video and streaming, cloud security, and load-balancing services.

The company’s cloud security services include denial-of-service attack protection, bot mitigation, and a Web application firewall. Its Web application firewall uses the Open Web Application Security Project ModSecurity Core Rule Set alongside its own ruleset.
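To illustrate the kind of request inspection a rule set like this performs, here is a toy sketch in Python. The pattern and function are purely illustrative, not an actual ModSecurity CRS rule or Fastly code:

```python
import re

# Toy example of WAF-style pattern matching: flag query strings that
# resemble common SQL-injection probes. Illustrative only; real CRS
# rules are far more extensive and tuned to avoid false positives.
SQLI_PATTERN = re.compile(r"(union\s+select|or\s+1=1|--\s*$)", re.IGNORECASE)

def inspect_request(query_string: str) -> str:
    """Return 'block' if the query string matches a suspicious pattern."""
    if SQLI_PATTERN.search(query_string):
        return "block"
    return "allow"

print(inspect_request("id=42"))        # allow
print(inspect_request("id=1 OR 1=1"))  # block
```

In practice a WAF evaluates many such rules against headers, bodies, and paths, scoring requests rather than blocking on a single match.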

In a blog post, Nick Rockwell, senior vice-president of engineering and infrastructure at Fastly, says: “We experienced a global outage due to an undiscovered software bug that surfaced on 8 June when it was triggered by a valid customer configuration change.

“We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal.

“This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them.”

Describing what happened, Rockwell explains that on 12 May, Fastly began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.

Early on 8 June, he says, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of Fastly’s network to return errors.
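The failure mode Rockwell describes, a bug that lies dormant until a valid configuration exercises an untested code path, can be sketched as follows. The option names and logic here are entirely hypothetical, not Fastly's actual code:

```python
# Hypothetical sketch: each option is individually valid, but one valid
# combination reaches a dormant code path that no test ever exercised.

def apply_config(cfg: dict) -> str:
    """Validate and apply a (hypothetical) customer configuration."""
    if cfg.get("cache_ttl", 60) < 0:
        raise ValueError("cache_ttl must be non-negative")  # validation passes

    # Latent bug: when a shield POP is set, this path divides by the TTL.
    # A valid config with shield_pop set and cache_ttl == 0 crashes it.
    if cfg.get("shield_pop"):
        refresh_rate = 3600 / cfg.get("cache_ttl", 60)
        return f"shielded, refresh every {refresh_rate:.0f}s"
    return "applied"

print(apply_config({"cache_ttl": 300, "shield_pop": "ams"}))  # fine for weeks
try:
    apply_config({"cache_ttl": 0, "shield_pop": "lhr"})       # valid config...
except ZeroDivisionError:
    print("latent bug triggered")                             # ...trips the bug
```

The point of the sketch is that the triggering change is legitimate customer input; the defect is in the software's handling of a rare but valid combination, which is why it survived deployment for nearly a month.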

Timeline of events

According to Rockwell, here is a timeline of the day’s activity (all times are in UTC):

09:47 Initial onset of global disruption

09:48 Global disruption identified by Fastly monitoring

09:58 Status post is published

10:27 Fastly Engineering identified the customer configuration

10:36 Impacted services began to recover

11:00 Majority of services recovered

12:35 Incident mitigated

12:44 Status post resolved

17:25 Bug fix deployment began

“Once the immediate effects were mitigated, we turned our attention to fixing the bug and communicating with our customers. We created a permanent fix for the bug and began deploying it at 17:25,” says Rockwell.

He adds the company is deploying the bug fix across its network as quickly and safely as possible.
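Rolling a fix out "as quickly and safely as possible" typically means a staged (canary) deployment: ship to a small fraction of points of presence, watch error rates, then widen. A minimal sketch of that idea, with hypothetical POP names, stage fractions, and helper functions that are not Fastly's tooling:

```python
# Illustrative staged-rollout sketch. deploy_to(), error_rate() and
# rollback() are stand-ins for real deployment and monitoring systems.

POPS = ["ams", "lhr", "jfk", "nrt", "syd", "gru", "fra", "sin"]
deployed_pops = set()

def deploy_to(pop):
    deployed_pops.add(pop)              # stand-in for a real deploy step

def error_rate(pops):
    return 0.0                          # stand-in for real monitoring

def rollback(pops):
    deployed_pops.difference_update(pops)

def staged_rollout(pops, stages=(0.05, 0.25, 1.0), max_error_rate=0.01):
    """Deploy in widening stages, halting and rolling back on regressions."""
    deployed = 0
    for fraction in stages:
        target = max(1, int(len(pops) * fraction))
        for pop in pops[deployed:target]:
            deploy_to(pop)
        deployed = target
        if error_rate(pops[:deployed]) > max_error_rate:
            rollback(pops[:deployed])   # undo and stop if errors spike
            return False
    return True

print(staged_rollout(POPS))  # True: all POPs updated without regressions
```

The trade-off is exactly the one Rockwell names: each stage adds safety but also adds time before the whole network carries the fix.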

“We are conducting a complete post-mortem of the processes and practices we followed during this incident. We’ll figure out why we didn’t detect the bug during our software quality assurance and testing processes. We’ll evaluate ways to improve our remediation time.

“We have been – and will continue to be – innovating and investing in fundamental changes to the safety of our underlying platforms. Broadly, this means fully leveraging the isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up. We’ll continue to update our community as we make progress toward this goal.”

Concluding, Rockwell says even though there were specific conditions that triggered this outage, Fastly should have anticipated it.

“We provide mission-critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologise to our customers, and those who rely on them, for the outage and sincerely thank the community for its support.”
