Cause and Remedy for the Wednesday, March 13, One-Hour Internet Outage
March 19, 2013
On Wednesday, March 13, at about 9am, most of Utah County experienced a power glitch originating with Rocky Mountain Power, which passed through to the Eagle Mountain City power grid. Power surges like this can be very damaging to sensitive electronic equipment. Because of this, our central office, which houses several million dollars’ worth of very sensitive, state-of-the-art communications equipment, including a digital telephone switch, fiber electronics, and all kinds of internet switches, routers, and servers, is engineered to never touch the power grid directly.
Inside our switch room is a wall of batteries in an array, designed to back up power to the equipment and absorb any surges. Backing up the batteries are very large generators, which kick in after 3 seconds of a power outage. Last November, our switch manager began the planning and engineering process to replace the battery array, because the existing batteries had been in place since the days of Eagle Mountain Telecom. Two weeks ago, contractors began preparatory work to replace the batteries.
Unfortunately, this power failure came at a very bad time for us, just before our entire battery upgrade was due to take place. When the glitch came, the older batteries did not perform to spec, and because the outage was shorter than 3 seconds, the generator did not kick in. This caused the entire main phone switch to shut down and reboot, which is never a good thing for such a sophisticated piece of equipment: so many systems are integrated that the chances of everything rebooting seamlessly are slim. When the switch rebooted, several trunking systems stayed down, which is why there was a brief city-wide loss of dial tone. Our techs had to bring the trunks back online manually, and dial tone was reestablished in about 30 minutes. The telephone system was back to 100% capability within an hour.
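For our tech-savvy customers, here is a rough sketch of why a sub-3-second glitch with weak batteries is the worst case. This is an illustration only, not our actual power control logic; the function and parameter names are hypothetical.

```python
# Illustrative sketch (hypothetical, not our actual control system): a short
# grid outage must be bridged entirely by the batteries, because the
# generator only starts after the 3-second cutover delay.

GENERATOR_DELAY_S = 3.0  # generator starts only after 3 s of lost grid power


def equipment_stays_up(outage_duration_s: float, battery_holdup_s: float) -> bool:
    """Return True if the equipment rides through the outage.

    battery_holdup_s: how long the batteries can actually carry the load.
    """
    if outage_duration_s < GENERATOR_DELAY_S:
        # Generator never starts; batteries must bridge the whole outage.
        return battery_holdup_s >= outage_duration_s
    # Generator starts at 3 s; batteries only need to bridge that delay.
    return battery_holdup_s >= GENERATOR_DELAY_S


# Healthy batteries ride through a 2-second glitch:
assert equipment_stays_up(2.0, battery_holdup_s=600.0)
# Degraded batteries with almost no holdup do not -- the switch reboots:
assert not equipment_stays_up(2.0, battery_holdup_s=0.5)
```

In short: a longer outage would actually have been survivable, because the generator would have started; the short glitch left everything riding on batteries that were past their service life.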
After the phone system was back online, which is our priority in a crisis in accordance with public utility regulations, our engineers turned their attention to restoring internet service. The power surge and battery array failure had fried a core internet switch, which had to be replaced with a spare we keep on hand for just such an emergency. (When I say switch, I don’t mean the kind of consumer-grade switch you would pick up at RadioShack or Best Buy; this is carrier-grade equipment that costs tens of thousands of dollars per unit.) Internet service was restored to the greater network within about an hour of the power glitch, once the internet switch had been replaced and reconfigured, and our IT administrator, Darin, did a fantastic job getting that up again so quickly.
We were initially concerned that this power surge could have damaged some customer equipment, like what occurred a few months ago when the city did their power grid upgrade. That surge fried a few hundred power inverters on people’s homes, and we spent a couple of days replacing them. This time, we were pleased to find that only one power inverter in the entire customer base had been damaged, which underscores just how extreme and unusual that big city power surge a few months ago had been.
After internet service had been generally reestablished just after 10am, there were of course a few stragglers, which is our term for individual customers who did not come back online for various reasons. A couple of individual neighborhood fiber electronics cards were also damaged and needed to be replaced. Eagle Park, Autumn Ridge, and Hidden Canyon were affected by these card failures, and those neighborhoods took about 2½ hours to bring back online. We also had some intermittent speed issues through the day as we tested and validated various equipment and systems.
As individual customers called in, posted on Facebook, or alarms appeared in the network management system, we reset those customers’ connections remotely from our servers. If a customer could not be reset remotely, we immediately dispatched a field technician to go into the customer’s home to reset their modem or router. Altogether, we had to resolve lingering issues for about 119 individual customers, and by about 6pm that night we had taken care of all but 7 customers who needed further action.
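The triage above follows a simple rule: try the cheap remote fix first, and only roll a truck when it fails. A minimal sketch of that decision, with hypothetical function names (our actual provisioning system works differently under the hood):

```python
# Illustrative sketch of the triage described above. The callables passed in
# are hypothetical stand-ins for our real provisioning and dispatch systems.

def restore_customer(customer_id, remote_reset, dispatch_technician):
    """Try a remote reset first; fall back to sending a field technician."""
    if remote_reset(customer_id):
        return "restored remotely"
    dispatch_technician(customer_id)
    return "technician dispatched"


# Example: a customer whose connection cannot be reset remotely gets a visit.
result = restore_customer(
    "cust-042",
    remote_reset=lambda cid: False,       # remote reset failed
    dispatch_technician=lambda cid: None, # stand-in for scheduling a truck-roll
)
assert result == "technician dispatched"
```

This is why remote resets matter so much on a busy outage day: every customer fixed from our servers frees a technician for someone who genuinely needs a visit.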
One instruction we must stress to all customers: please do not factory reset your router after a power outage, because that will wipe the PPPoE username and password from the router, and you will not be able to reconnect to our servers. A simple power cycle (pull the power cord out of the back and plug it back in so the router reboots) is all you need to do to refresh your connection. A lot of the truck-rolls we had to make on Wednesday were simply because the customer took the very drastic and unnecessary step of factory resetting their modem or router. When we have to roll a truck simply to reprogram a router, it diverts resources we could be using to help a greater number of customers with real issues not caused by their own actions.
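The difference between the two actions can be sketched like this. This is a toy model of a router, not real firmware, and the credential values are made up; it just shows that a reboot keeps the saved PPPoE configuration while a factory reset erases it.

```python
# Illustrative sketch (hypothetical router model, not real firmware): why a
# power cycle is harmless but a factory reset breaks your connection.

FACTORY_DEFAULTS = {"pppoe_user": None, "pppoe_pass": None}


class Router:
    def __init__(self, config):
        # Config lives in non-volatile storage, so it survives reboots.
        self.config = dict(config)

    def power_cycle(self):
        # Reboot: running state in RAM is lost, but the saved config survives.
        pass

    def factory_reset(self):
        # Wipes the saved config back to defaults, including PPPoE credentials.
        self.config = dict(FACTORY_DEFAULTS)

    def can_connect(self):
        # PPPoE login needs both the username and password to be present.
        return bool(self.config["pppoe_user"] and self.config["pppoe_pass"])


r = Router({"pppoe_user": "customer@example", "pppoe_pass": "made-up-secret"})
r.power_cycle()
assert r.can_connect()       # still configured after a reboot
r.factory_reset()
assert not r.can_connect()   # credentials gone; the router must be reprogrammed
```

Once those credentials are gone, the router cannot authenticate to our servers at all, which is why a factory reset usually ends in a technician visit.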
So, the big question customers are asking is: what are you doing to prevent this from ever happening again if similar power outages occur?
The good news is that this very week, on March 18, we completed the process of replacing our entire old battery array. This was a several-hundred-thousand-dollar upgrade. Our new array consists of 28 brand-new, carrier-grade batteries, stacked in four rows of seven, which will power and protect our equipment properly. This will prevent future switch reboots and equipment failures due to power failures. Our equipment is safe again.
There were a lot of positives to come out of Wednesday’s events too. Last week we launched an online live chat and ticketing system. When our own office phones were down, remote staff who had internet service were still able to talk to customers and keep them updated, and customers with mobile data (which our surveys show most of our customers also pay for) could generate trouble tickets online automatically. Many customers used their mobile data to talk to us on Facebook and let us know they were down at home. We were able to use Facebook to give general updates and start trouble tickets. This was possible because a large portion of our customers are our friends on Facebook. If you haven’t liked our page on Facebook yet, please do so at http://www.facebook.com/directcom.eaglemtn
Several customers also asked us to start a Twitter feed for outage notifications, which we will look at launching.
We tried to keep customers updated as best we could with the tools we had on hand. We try to provide appropriate network information; sometimes giving too much detail, especially about sensitive network equipment, can raise more concerns for customers, but we know that many of our tech-savvy customers, and the many people working from home in Eagle Mountain, appreciate regular updates. Even in providing a report of this kind into the causes of an outage, and the steps we are taking to prevent it from ever happening again, we realize we open ourselves to a whole lot more questions from customers. For customers really interested in the inner workings of our central office, we will be hosting a tour of our facilities during the week of Pony Express Days; you can message us on Facebook to sign up for that tour.
David Wall posted to our Facebook page: “DC ,You guys are awesome even for posting this kind of info. No other Internet provider provides such quick personal info to things that take place. The rest of our family are all on Centurylink and they would never take the time to post why your lines are down or when they would come back up. Thanks DC!”