IS explains MWeb Business outage

By Staff Writer, ITWeb
Johannesburg, 21 Jul 2015
An undetected error in the provisioning system occurred, meaning no action was taken, which resulted in a termination process being triggered.

ITWeb asked Internet Solutions (IS) to explain what caused the widespread virtual machine outages at its MWeb Business subsidiary over the course of the past few days.

IS spokesperson Michelle Atkins responds to our questions.

ITWeb: We've heard from some customers that a management tool called Solid was updated, with an error in a script causing VMs to be forward-dated past their expiry date, and therefore to be immediately de-provisioned and erased. Is that correct?

Atkins: It is partly correct. An undetected error in the provisioning system occurred. As it was undetected, no action was taken and this resulted in a termination process being triggered. This termination process was the de-provisioning of some of the MWeb Business virtual machines.

ITWeb: Is Solid an in-house tool or a third-party product?

Atkins: Solid is a third-party product.

ITWeb: Why did this affect all VMs, not just the ones being newly provisioned?

Atkins: The undetected error that occurred resulted in service termination across the virtual platform, as the platform was pulling incorrect information, due to the error.

ITWeb: Why wasn't sufficient testing conducted to catch an error of this magnitude? Bugs are normal, but something as radical as de-provisioning an entire VM fleet would trip even the most lenient unit test, surely?

Atkins: Testing is always conducted as a matter of process; however, actions taken in this instance, coupled with the undetected error, resulted in products and services being unavailable for a period of time, as well as the full de-provisioning of virtual machines.

ITWeb: Why are end-dated VMs instantly gunned, rather than triggering a notification and a grace period? Wouldn't that avoid loss through billing errors, user-config glitches, and indeed system-wide management disasters like this?

Atkins: The processes that are in place for what would be considered standard do limit full de-provisioning without necessary authentication. However, in this instance the undetected error and subsequent actions resulted in the automatic de-provisioning of the virtual machines. We have immediately instituted additional protection mechanisms and will be looking at further improvements to prevent an error like this from ever affecting our customers again.
