How not to handle a network outage
August 23, 1999
by Jason K. Krause
(IDG) -- One of the nation's largest telecommunications companies and its network-gear supplier suffered an embarrassing slipup when a frame-relay network crashed due to a software bug. Sounds a lot like MCI WorldCom's crash earlier this month, but in fact, AT&T went through it a year and a half ago. The way the two companies handled their respective outages made all the difference.
For a backbone provider like MCI WorldCom, bandwidth is a commodity. To undersell the other bandwidth providers in the industry, companies like MCI have taken to farming out engineering and customer service operations. "The debate going on internally in the telecom industry is 'Do we have strong networking groups inside, or do we simply outsource to Cisco, Lucent or Nortel?'" says Ford Cavallari, telecom consultant with Renaissance Worldwide. "MCI has chosen to outsource, which is fine when things work, but when a catastrophe hits, you're in trouble."
By outsourcing, companies can cut costs, but the move undermines their ability to launch a coherent and swift engineering response to an outage. That may explain why it took MCI 10 days to remedy the situation. Its failure to address the problem quickly, and its ultimate placement of blame on partner Lucent, hardly inspires confidence that the company will be able to ensure that such a disaster doesn't happen again.
The problems run deep. "MCI WorldCom is made up of so many large companies in their own right that they all have a bad habit of blaming a different unit," says Tim Chase, director of network operations for Alpha.net, a Milwaukee-based ISP. "When I have a problem and call, for example, the MFS unit, they tell me it's UUNet's fault. When I call UUNet, they send me back to MFS. I'm not surprised that they were so disingenuous with this problem."
During the crisis, company executives were conspicuously unavailable. "The longer the problem went on, fewer VPs seemed to be around," says Chase. "Only the occasional technician would accidentally pick up the phone if you were lucky."
A quick comparison with AT&T's handling of its outage last year underlines everything MCI did wrong when its own network crashed earlier this month.
What Happened
In April of 1998, AT&T and Cisco suffered one of the biggest crashes in Internet history. While AT&T engineers were upgrading some software in a network switch, a computer bug brought down the entire AT&T frame-relay network, cutting service for millions of people.
Starting Aug. 5, a glitch in some Lucent software intermittently interrupted service on MCI WorldCom's backbone. The telco anticipated an outage of 24 hours, but the problem wound up affecting some 3,000 business customers over 10 days.
How They Handled It
While there was plenty of blame to go around – AT&T could have simply blamed Cisco for giving it faulty software, and AT&T could have been criticized for not stopping the crash more quickly – no fingers were pointed. Instead, Frank Ianna, president of network services for AT&T, gave both the press and customers updates on the crash every couple of hours, detailing AT&T and Cisco's joint effort to fix the problem.
MCI WorldCom issued an alert to its sales force, which was given the option to deliver a notice to customers by e-mail, hand delivery or telephone – or not at all. After a deafening silence from company executives on the 10-day network outage, MCI WorldCom CEO Bernie Ebbers finally took the podium to discuss the situation. How did he explain the failure, and reassure customers that the network would not suffer such a failure in the future? He didn't. Instead, he blamed Lucent.
For AT&T customers, the network was out of commission for anywhere from six to 26 hours. AT&T decided to waive all charges for service until it completed an analysis of the root cause. That didn't happen until April 29, more than two weeks after the outage. The cause was identified as some faulty Cisco software, but rather than let Cisco take the fall, AT&T and Cisco engineers pledged to throw their full effort into safeguarding against such a crash in the future. By handling the situation aggressively – and publicly – AT&T actually enhanced its image as a robust networking company.
Customers affected by the MCI WorldCom failure have been offered two days of free service for each of the 10 days service was interrupted – not particularly generous, say some customers. A few, including the Chicago Board of Trade, have already threatened to take their business elsewhere.
© 2001 Cable News Network. All Rights Reserved.