As we attempt to provide round-the-clock IT services without employing out-of-hours staff, our worst nightmare for an "incident" is one that affects the whole machine room and starts after everyone has left work on a Friday evening, leaving the whole weekend before normal work resumes on Monday morning. It would be even worse if this were to occur just before the start of a new academic year.
So guess what happened last weekend?
One of the power supplies in our main machine room caught fire as a result of an electrical fault. Although the fire was quickly contained, the emergency services shut off all power to the machine room as a precaution. Several of our major services became unavailable and staff had to be called in over the weekend to fix everything.
The good news was that the services designed to fail over automatically to the backup machine room did so. Also, the support team had all our top-priority services back up by six o'clock on the Saturday. Our disaster recovery plan aims to have them restored within 24 hours, so this was a good result. (Technically, this wasn't a "disaster" in the terms of our DR plan because we were able to use the main site once the power was restored, but it seems to me that the result still stands.)
Even though the overall result was not too shabby, there are a lot of things that our support teams will learn from this experience. I'll be interested to see the results of the post-event analysis.