The massively disruptive computer outage at the Federal Aviation Administration this week that caused thousands of cancelled or delayed flights has put Americans uncomfortably face-to-face with the technology behind US air travel — for at least the second time in a month.
As the country once again picks up the pieces, beleaguered air travelers may be wondering why flying suddenly seems so vulnerable to devastating IT problems.
The answer involves not just aging hardware and software, but also institutional failures that have made updating the technology more challenging, according to current and former industry officials, government reports and outside analysts.
Over the years — and in the face of exploding demand for air travel — bureaucratic snafus and deferred maintenance have contributed to an increasingly brittle system, even as it grows ever more sophisticated with far more points of failure than many consumers may realize.
Southwest Airlines’ recent days-long collapse of its entire system — in the middle of a winter storm and during the most critical travel period of the year, no less — and Wednesday’s widespread flight disruptions may have put many of these problems front and center for US passengers, but they are just the latest manifestation of a longstanding and enormously complicated issue.
Airline travelers get a month from Hell
The glitch at the center of this week’s headache was a corrupted database file in a pilots’ advisory system that issues warnings, known as NOTAMs, of various hazards that could affect a flight, ranging from notices of closed runways to the presence of nearby construction equipment. The damaged file was also present in the FAA’s backup system, a source familiar with the matter told CNN, which first reported the detail on Wednesday.
Officials moved to reboot the main NOTAM system early Wednesday morning, but it had not been fully restored by the time rush hour began on the East Coast, leading to the FAA ground stop. A senior US official told CNN Wednesday there was no evidence of foul play in the incident, a detail the FAA later publicly confirmed.
“The FAA is continuing a thorough review to determine the root cause of the Notice to Air Missions (NOTAM) system outage,” the agency said in a statement Wednesday evening. “Our preliminary work has traced the outage to a damaged database file. At this time, there is no evidence of a cyberattack. The FAA is working diligently to further pinpoint the causes of this issue and take all needed steps to prevent this kind of disruption from happening again.”
The FAA said Thursday evening that the data file “was damaged by personnel who failed to follow procedures.”
The NOTAM issue occurred just days after the FAA had said an “air traffic computer issue” was responsible for hours-long flight delays to Florida airports on Jan. 2. That system, known as ERAM, is responsible for tracking hundreds of flights at a time and is considered a critical component of the FAA’s efforts to modernize the US airspace.
In the case of Southwest, outdated scheduling systems that could not automatically adjust to disruptions caused by severe winter weather required painstaking manual intervention, which made the weather-related problems at that airline particularly pronounced.
‘Old, old systems’
Despite efforts to modernize their equipment, airlines and the US government may in some cases still rely on technology that is years or even decades old.
The FAA software that failed this week is 30 years old and at least six years away from being updated, a US government official told CNN on Thursday, though Transportation Secretary Pete Buttigieg has pushed to accelerate that timeline since the meltdown, the official said.
The notices issued by FAA’s NOTAM system are “Jurassic,” said Kathleen Bangs, a former airline pilot and aviation expert. “It’s a clumsy system that often over-burdens pilots with pages and pages of less-than-urgent notices, written in archaic code that sometimes buries that one, critical piece of safety information a pilot really needs.”
The FAA has acknowledged the NOTAM system’s age. In its most recent budget request to Congress, the agency called for money to help “eliminate the failing vintage hardware” behind it.
As early as 2012, the FAA decided it wanted to replace aging legacy voice switches used in air traffic control communications with new, internet-based communications technology. But thanks to a contracting dispute, the FAA now intends to keep using the old switches until at least 2030, according to a Transportation Department Inspector General report last year.
The ERAM air traffic system at the center of the disruptions on Jan. 2 is much younger, and only became fully operational in 2015. But according to a 2020 Inspector General report, the system was supposed to have been fully implemented five years prior, as a replacement for another system that had already been running for more than 40 years. The FAA is currently working to update ERAM’s hardware and software, following at least seven ERAM failures since 2014, a track record that has prompted congressional scrutiny. But it may not be until 2026 that the ERAM upgrade is complete, according to the 2020 report.
Meanwhile, many of the IT systems that airlines depend upon were custom-built long ago, with some running on legacy mainframe computers, and weren’t designed to handle enormous surges of incoming information, aviation experts said.
“This is not your standard Windows server or modern VMware architecture,” said Seth Miller, an IT consultant, aviation journalist and editor of the travel publication PaxExAero. “These are old, old systems.”
As a result, acute crises can easily overwhelm these fragile setups, according to an aviation industry official, speaking on condition of anonymity to discuss the issue more freely.
“These systems were built at a time when the airlines may have been smaller, and they weren’t necessarily built to handle so much data coming in at once,” the official said. “When you have something like the massive winter storm over the holidays, it cannot handle the volume of changes coming in at one time, because it’s on a system that wasn’t built to handle that large of a moving dataset.”
The technology’s age is not always a problem in itself, industry experts said. The concern is what that age implies: an inability to scale to meet new demand, and a lack of proper support as the rest of the world moves on. The use of custom-built technology, as opposed to off-the-shelf solutions, exacerbates the problem, Miller said, as maintaining it requires increasingly specialized parts and know-how.
Trying to integrate old systems with newer ones — always in real time, because the global aviation industry never sleeps — can also create its own opportunities for catastrophic mistakes.
Many ways to fail
While all flight delays and cancellations tend to result in a similar experience for the air traveler, the underlying source of an outage can vary wildly. Many more things can go wrong than you might expect, which highlights the sheer complexity of the aviation industry and underscores that there is no quick, easy fix for IT-related travel disruptions.
Getting a flight off the ground involves a complex stew of information, industry experts say, and disruptions in any part of that information supply chain can cause delays.
The vulnerabilities are magnified due to the tremendous number of companies involved in the ecosystem — not just the airlines, but their vendors, and their vendors’ vendors.
“There’s so many various systems speaking to each other,” said Ross Feinstein, a former spokesperson for American Airlines and the Transportation Security Administration.
For example, Feinstein said, the TSA vets airline manifests. “If TSA has an outage, it halts the vetting process for reservations, which means passengers can’t check in, and they can’t retrieve a boarding pass. It could be the weather company has a disruption, and pilots can’t retrieve the latest weather data for their departure, en route, or arrival.”
In 2019, computer issues at a third-party company whose flight-planning tools help airlines calculate weight and balance for their aircraft led to delays for multiple airlines nationwide.
In 2021, an outage at Sabre, one of the world’s largest airline reservation companies, caused disruptions globally.
The interconnected nature of the aviation sector, involving dozens of countries, companies, agencies and databases, creates multiple points of failure. Backups and redundancies can help, but it is still a massively complex system of systems.
Beneath the surface-level symptoms of the aviation sector’s IT problems are deeper, messier and more human challenges.
Take the FAA’s attempt to replace its air traffic voice switches. According to the Inspector General report, a major source of the breakdown came when the FAA and its potential vendor got into a dispute over the contract requirements. The dispute focused on possible software defects in the new switches, and whether the vendor could still deliver a good product on time.
The root of the issue was not, in itself, a technological problem. It was a procurement problem. But it has had lasting effects on FAA technology. The contract’s eventual termination means the FAA will need to spend more than $270 million through 2030 to keep using its aging legacy voice switches, the report said.
“Continued reliance on these switches creates the risk that communication will be disrupted,” the report concluded.
A similar dynamic has played out in the debate over 5G wireless technology near airports, which last year threatened to cause major disruptions. Bureaucratic divisions and years of deferred avionics upgrades led to a crisis where US aircraft were not equipped with technology that could handle potential 5G interference.
Meanwhile, the FAA continues to be led by an acting administrator, and lacks a Senate-confirmed chief. That has real-world consequences for IT upgrades and other projects, according to a person familiar with the agency, speaking on condition of anonymity to discuss the matter more freely.
“It’s really hard to set direction and vision when you don’t know if you’re going to be there for a week or you’re going to be there for 18 months,” the person said.
Much of the aviation industry’s unpaid technical debt, meanwhile, can be traced to a spate of mergers and bankruptcies in the wake of 9/11, when many airlines were more focused on finances than technological upgrades, said the industry official.
That bureaucratic myopia is itself a cause of today’s technological malaise in the aviation industry. In some situations, institutional inertia and commercial priorities have outranked investments in costly and boring infrastructure.
But the increasingly interconnected and digitized nature of the system now means that when things go wrong, they can do so in ever more disastrous ways.
Aviation experts say only more investment, and better planning, can meet the challenge.
“[The FAA] is doing more with less resources, and they need more funding to modernize,” Feinstein said. “In Washington, we’ll talk about it for the next 24 to 48 hours, forget about it, and it’ll be a fight again when the FAA reauthorization bill comes up.”
-- CNN’s Pete Muntean, Gregory Wallace and Marnie Hunter contributed to this report