During Super Storm Sandy, the FCC reported that at one point 25% of all cell sites in the affected area were out of service. After the storm, Deputy Chief Chuck Dowd of the NYPD Communications section told me that neither the NYPD nor the FDNY had any outages during the storm save for one site where the generator ran out of fuel, and that was corrected within a few hours. Meanwhile, wired phone and cable services were disrupted and the Internet was unavailable in many areas. These failures not only affected cell phones and wireless communications, they also involved land-based communications.
We really need to look at these numbers first because the FCC has very strict requirements for commercial wireless network operators for reporting outages so the 25% failure rate is considered fairly accurate. On the other hand, there are no such requirements for Public Safety or even utility companies’ networks so all we really have there is anecdotal information. So NYC says no outages but in Northern New Jersey and elsewhere, I have heard reports of Public Safety outages. However, as you will see below, there is a big difference between a commercial cell site and a Public Safety site being out of service.
In the wake of this super storm, the FCC is “looking into” what can be done to beef up the wireless systems during this type of disaster, and there are a lot of after-action reports floating around as well as Monday morning quarterbacks who seem to think that the problems are easy to identify and easy to prevent in the future. I disagree with such comments. No matter how well prepared we are or how many contingencies we plan for, we will face either a natural or man-made disaster for which there is no preparation. In the case of Sandy, both the commercial and Public Safety network operators had learned lessons from Katrina, Isaac, and other storms. But if you remember, last November, the East Coast was hit with an earthquake. Who was prepared for an earthquake on the East Coast?
Sorry FCC, but failures of all types of communications systems will continue to occur. We can only plan and prepare for what we know and believe can happen. Public Safety agencies such as NYPD can and do plan for what I call graceful degradation. Its command center, which was fully staffed during Sandy, has full generator back-up separate and apart from the rest of Police headquarters, fully redundant telephone lines coming into the building from two different central offices, fully redundant fiber that includes commercial fiber as well as the City’s fiber assets, and satellite feeds into the building. The systems held up during Sandy and worked as they were supposed to, providing command and control for many city departments, FEMA, and others. The cost of this much redundancy is justifiable for this type of emergency operations center, but almost nowhere else.
Multiple Points of Failure (Weak Links)
Let’s look at the points of failure within a wireless network system and what can be done to improve them. First, it is important to understand that unlike today’s Public Safety radio sites, a cell site that is not connected to the network via some type of connection—wire, fiber, or microwave—is simply a dumb site. The back-end of the network is the brains of the cellular architecture and cell sites do not function at all if they lose their connectivity. During Sandy, trees and debris knocked down hundreds if not thousands of telephone poles. On these poles are generally three or four different services. First of course are the power lines, next are the telephone companies’ wired and perhaps fiber connections, and then the cable operators’ wires are on the lowest part of the pole, so a pole that is down takes down all three communication services along with electricity. If the connectivity to a cell site is being carried on these poles and they are knocked out, the cell site is out of service regardless of whether it has back-up batteries and a generator and the site is still technically in operation. Without connectivity to and from the cell site, it is just plain down.
Let’s compare this to Public Safety sites, many of which are also connected to their network by wires, fiber, or microwave. Since these are, for the most part, voice systems that operate in relay mode (when they hear a signal on a channel they rebroadcast it on another channel), even if all of the connectivity is lost, the site will continue to be functional. The operations center may have to resort to radio control instead of wired or microwave, but it still works. Further, if the site is part of one of the more advanced Public Safety networks and it becomes disconnected from the network, there is a built-in fallback mode that will turn it into a standalone but functioning site. If the site fails for other reasons, e.g., a power outage, generator failure, or damage to the antennas, and is knocked out of service, Public Safety still has one level of fallback that is not available today over commercial wireless systems. Public Safety voice devices are designed to operate in what is called talk-around, simplex, or tactical mode. This means that units can talk to each other even where no radio site is up and operating. Simplex is the final fallback for Public Safety and it works. It worked during Sandy while cell phones people carried in areas where there were no operational cell sites were completely useless. Without a cell site, cell phones are reduced to small packages of electronic components and batteries that cannot talk to anything.
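The difference in fallback behavior described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the mode names, the two boolean inputs, and both functions are hypothetical, and the sketch collapses the Public Safety fallbacks into two steps (standalone repeater, then simplex) rather than modeling any real radio system.

```python
from enum import Enum


class Mode(Enum):
    NETWORKED = "networked"                       # site is linked to the wider system
    STANDALONE_REPEATER = "standalone repeater"   # site relays locally without backhaul
    SIMPLEX = "simplex (unit-to-unit)"            # no site; radios talk directly
    NO_SERVICE = "no service"


def public_safety_mode(site_up: bool, backhaul_up: bool) -> Mode:
    """Best available mode for an LMR voice radio (illustrative only)."""
    if site_up and backhaul_up:
        return Mode.NETWORKED
    if site_up:
        # The repeater keeps rebroadcasting even with no link to the network.
        return Mode.STANDALONE_REPEATER
    # With no working site, units in range of each other use talk-around.
    return Mode.SIMPLEX


def cellular_mode(site_up: bool, backhaul_up: bool) -> Mode:
    """A cell site needs both working equipment and backhaul to serve anyone."""
    if site_up and backhaul_up:
        return Mode.NETWORKED
    return Mode.NO_SERVICE


if __name__ == "__main__":
    for site_up, backhaul_up in [(True, True), (True, False), (False, False)]:
        print(
            f"site_up={site_up} backhaul_up={backhaul_up}: "
            f"LMR={public_safety_mode(site_up, backhaul_up).value}, "
            f"cellular={cellular_mode(site_up, backhaul_up).value}"
        )
```

In this toy model the LMR radio always lands on some usable mode, while the cellular device drops straight to no service as soon as either the site or its backhaul is lost.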
If the connectivity to the cell site is underground, perhaps it survived, although in New York and New Jersey many underground utilities failed. Vaults can become flooded, and while the cables might be waterproof, there are splices and junction boxes along the route. In many cases, having utilities underground does not make them immune from damage. Sandy was a perfect example of this.
If connectivity to the site remained in place, other points of failure might include the power lines to the site. Many sites have back-up batteries and generators, some only have batteries, and some don’t have any back-up power at all. A few years ago, the FCC tried to mandate back-up power and the cell industry said it could do its own planning—and it has. Some cell sites are more important than others, some can have generators easily installed, some cannot. Cell sites can be on buildings but it is not practical to put a generator and fuel on a rooftop. Batteries yes, but generators and fuel, generally no. So the site may only have battery backup. If there is a generator, it might be located on the ground next to the building, in the basement, or even on the first floor. If there is flooding, well, we all know, or should know that generators don’t run underwater.
But if the generator does run, the site’s connectivity is not impacted, the building or shelter that houses the cell site equipment is not damaged, and the tower or structure and antennas are not damaged, the site will remain in operation until the power is restored or the generator runs out of fuel. If the generator has fuel for five days of operation but no one can reach the site for a week or more, the generator will cease to run and the site will go down.
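The fuel arithmetic here is simple enough to sketch. The function names and every number below are hypothetical, and the sketch assumes the batteries and the generator run one after the other with a constant fuel burn rate, which is a simplification.

```python
def hours_until_site_down(battery_hours: float,
                          fuel_gallons: float,
                          burn_gallons_per_hour: float) -> float:
    """Hours a site can run with no commercial power and no refueling."""
    generator_hours = fuel_gallons / burn_gallons_per_hour
    return battery_hours + generator_hours


def outage_gap_hours(runtime_hours: float, access_delay_hours: float) -> float:
    """Hours the site sits dark before a crew can reach it (0 if it never goes dark)."""
    return max(0.0, access_delay_hours - runtime_hours)


if __name__ == "__main__":
    # Assumed figures: 120 gallons at 1 gallon per hour is five days of fuel.
    runtime = hours_until_site_down(battery_hours=0.0,
                                    fuel_gallons=120.0,
                                    burn_gallons_per_hour=1.0)
    # Assumed access delay of seven days before anyone can reach the site.
    gap = outage_gap_hours(runtime, access_delay_hours=7 * 24)
    print(f"runtime: {runtime:.0f} h, dark for: {gap:.0f} h")
```

With five days of fuel and a seven-day wait for access, the site in this example is down for the last two days no matter how much fuel is staged nearby.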
So the points of failure for a single cell site, excluding the network back-end systems, include connectivity failures, power failures, batteries running out of power, generators running out of fuel, tower or structure damage, damage to antennas, and damage to equipment in the shelter. That is seven possible points of failure per cell site. While the same number of failure points is also applicable to Public Safety systems, as mentioned, there are also at least three levels of communications fallback for the Public Safety voice systems as opposed to zero for commercial cell sites.
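To see how these failure points compound, here is a rough sketch. Every probability in it is made up for illustration, the helper names are hypothetical, and it assumes the failures are independent, which a single storm hitting everything at once clearly violates. The power items are grouped as commercial power backed up by batteries and a generator, since the site only loses power when both fail.

```python
from math import prod


def series(*p_ok: float) -> float:
    """Probability that ALL of the listed items survive (each one is required)."""
    return prod(p_ok)


def parallel(*p_ok: float) -> float:
    """Probability that AT LEAST ONE of the listed items survives (redundancy)."""
    return 1 - prod(1 - p for p in p_ok)


# Hypothetical survival probabilities for one site during one storm.
CONNECTIVITY = 0.85       # wire, fiber, or microwave backhaul
COMMERCIAL_POWER = 0.60   # power lines to the site
BACKUP_POWER = 0.90       # batteries plus generator with enough fuel
TOWER = 0.99              # tower or structure
ANTENNAS = 0.97
SHELTER_EQUIPMENT = 0.98

power_ok = parallel(COMMERCIAL_POWER, BACKUP_POWER)
site_ok = series(CONNECTIVITY, power_ok, TOWER, ANTENNAS, SHELTER_EQUIPMENT)

if __name__ == "__main__":
    print(f"power survives: {power_ok:.3f}")
    print(f"site survives:  {site_ok:.3f}")
```

Even with fairly generous numbers for each item, the product falls well below any single figure, and in this example the non-redundant connectivity term dominates the result.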
Now in the case of Sandy, both the commercial network operators and the Public Safety agencies had time to prepare for the storm. They had warning, unlike with an earthquake, a wildland fire, a major train wreck, an airplane crash, or a manmade disaster. Since they did have warning, network operators sent in technicians, engineers, parts, antennas, and cellular sites on wheels. They staged additional generators, fuel for generators, and all sorts of pieces and parts for them. They were ready and prepared as much as could be expected. Why then did it take so long to return the sites to normal operation? This brings us to the next part of the equation. If there is no connectivity to the cell site, the network operator has to wait for the phone or fiber company to bring the circuits back up. If there is microwave connectivity and the microwave antenna has been moved out of alignment by high winds or shaking, someone has to go to the site, climb the tower, and re-aim the antenna. If power is out, network operators have to try to keep their cell sites working on batteries and generators until power is restored. But if they cannot even access the site because local officials won’t permit entry while the area is still too dangerous, it does not matter how many people, how much equipment, how many generators, or how much fuel is staged and ready.
If, unlike a hurricane, there is a disaster with no warning, not only do network operators have to move people and equipment into the affected areas, they still must wait until they are granted access to the area. During the aftermath of Sandy, personnel on the ground included power company, telephone company, and wireless network employees, and all of them had to wait until the local officials deemed an area safe before they were permitted entry. The bottom line here is that cell sites that are down cannot be magically turned on again. People must gain access to them, determine the source of the problem, and then work with others in order to fix them. Meanwhile, Public Safety and amateur radio systems stay up and running, at least in the unit-to-unit mode, so some communication is available for first responders.
Now comes another wrinkle to all of this. If you drive around the Mid-West and the East Coast you will see many cell towers (but few on the West Coast where they prefer “towers” to be pine or palm trees). If you look at these towers you will usually see more than one set of antennas. In most cases there are four or five sets starting at the top of the tower and working their way down. These are shared sites. Many shared sites have all four or five of the commercial networks on them. In Delaware County, PA, for example, just west of Philadelphia, almost every tower supports all five network operators. Losing a single tower or a group of towers takes out all of the networks.

Now suppose that FirstNet decides to have its network added to each of these cell sites. It makes perfect sense. It will be less expensive, therefore the funding will go further, the sites are already in place, and they already have 2G, 3G, and in most cases, commercial 4G networks running on them. The equipment is housed in shelters at the base of the tower and there are usually generators at each site. If the Public Safety network is added to these sites, and if they fail during the next Sandy, not only would Public Safety lose its broadband network capabilities in that area, it would lose ALL of the potential back-up networks as well.
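The shared-tower risk can be put in rough numbers. The sketch below is hypothetical throughout: it borrows the 25% outage figure purely as an illustrative per-site failure probability, and it assumes that separately sited networks fail independently, which is optimistic because one storm hits them all.

```python
def p_all_networks_down(p_site_down: float, networks: int, shared_tower: bool) -> float:
    """Probability that every network serving an area is down at the same time."""
    if shared_tower:
        # One tower carries everything, so one failure takes out all networks.
        return p_site_down
    # Separate sites, assumed independent: all of them must fail together.
    return p_site_down ** networks


if __name__ == "__main__":
    p = 0.25  # illustrative only
    # Five commercial networks plus a Public Safety broadband network.
    print("separate sites:", p_all_networks_down(p, networks=6, shared_tower=False))
    print("shared tower:  ", p_all_networks_down(p, networks=6, shared_tower=True))
```

Under these assumptions, co-locating every network on one tower turns a very unlikely total loss into one that is exactly as likely as losing that single tower.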
Being Realistic
I think that the FCC, Congress, FirstNet, and Public Safety need to be realistic about several things when it comes to cellular networks. (Yes, the FirstNet Nationwide Public Safety Broadband Network will be a cellular network.)
First the FCC. Regardless of what the FCC mandates, there will continue to be network failures during natural and manmade disasters. It is not possible to prevent all of the failures; there are simply too many points of failure for wireless networks that are dependent on connectivity along with power from third parties. There are too many issues when it comes to back-up power, antennas and antenna structures, buildings, and equipment. There will always be some failures. We can stage people, equipment, and spares, but if people cannot safely access the sites, determine the problem, and then fix it, the site will stay offline until it is safe, until power has been restored or a fuel truck can reach the site, or until the connectivity is back in place.
Next to FirstNet and Public Safety. No matter what we do, the FirstNet network will resemble a commercial cellular-type network because it IS a cellular-type network. There will be too many sites, each with multiple points of failure. Perhaps in major cities, city personnel who are paid to take more risks than private contractors can gain access to sites, but if they cannot restore the connectivity and/or power, if they cannot take the fuel truck to the site, the site will stay down until they can.
The bottom line is that no cellular-type network that requires connectivity in order to function will be immune from failures when it is needed most. If you consider that Public Safety has $7 billion to build out this network, whereas AT&T is spending $14 billion on only its phase II build-out, there is no way the network will be as robust as our existing LMR networks. Maybe in the future we will have LTE devices that will work off-network, but today we do not have them nor do we have a standard for them.
Perhaps there is a better way to build out the first generation of the Public Safety Broadband Network. Maybe the already-hardened Public Safety sites should be used first and then augmented with sites used by commercial network operators. Even then this network will be prone to more failures than today’s Public Safety voice networks, but I do believe that there is a way to build the Public Safety network so that it is less prone to the seven points of failure than commercial networks are today.
But think about this, Public Safety: what if you put your voice, data, and video services on this one network? What if the FCC takes back the existing Public Safety LMR voice channels and ALL you have is the LTE nationwide network? When the next Sandy or earthquake hits, and 25% or more of the sites go down, how will you be able to continue to protect and serve? The Public Safety Broadband Network will change the way Public Safety operates. It will enable data and video services giving eyes to those in the field that today have only voice. Can we make it as robust? Can we add enough graceful degradation to make it stand up during the next Sandy, as the Public Safety voice networks did? I, for one, am skeptical, especially given the funding we have available.
Andrew M. Seybold
I have not heard of structural failure in towers from Sandy.
Any examples out there?
Most of the common points of failure (other than towers themselves) can be engineered for weather. It is interesting that the easier-to-fix problems that you highlighted (generators, access cabling, antennas) were the source of problems.
When we instrument towers, we use the current requirements of TIA/EIA-222-F for a fastest mile wind speed of 80 mph with no ice, and 28 mph with 3/4-inch ice thickness (in accordance with ASCE7 ice conditions).
When we’ve installed systems in India, I remember that merely seeing a generator on site was not enough.
We had to ask:
Do you own the generator?
Are you licensed to operate the generator?
Is it connected to the building load?
Is there fuel in the tank?
Have you tested the generator?
Is there an automatic transfer switch?
Have you tested the switch?
And then the next level of due diligence:
Have the spark plugs been stolen?
Have the filters been stolen?
Why does that look like water in the fuel tank….
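A checklist like the one above could be kept as a simple ordered list and walked from top to bottom, stopping at the first item that cannot be confirmed. This is only a sketch: the key names and the function are hypothetical, and the last three questions are rephrased so that "yes" always means the item is fine.

```python
from typing import Dict, Optional

# Ordered due-diligence questions; a True answer means the item checks out.
CHECKLIST = [
    ("owns_generator", "Do you own the generator?"),
    ("licensed", "Are you licensed to operate the generator?"),
    ("connected_to_load", "Is it connected to the building load?"),
    ("fuel_in_tank", "Is there fuel in the tank?"),
    ("generator_tested", "Have you tested the generator?"),
    ("has_transfer_switch", "Is there an automatic transfer switch?"),
    ("switch_tested", "Have you tested the switch?"),
    ("spark_plugs_present", "Are the spark plugs still there?"),
    ("filters_present", "Are the filters still there?"),
    ("fuel_free_of_water", "Is the fuel free of water?"),
]


def first_failed_check(answers: Dict[str, bool]) -> Optional[str]:
    """Return the first question not answered 'yes', or None if all pass."""
    for key, question in CHECKLIST:
        # A missing answer is treated the same as 'no'.
        if not answers.get(key, False):
            return question
    return None


if __name__ == "__main__":
    answers = {key: True for key, _ in CHECKLIST}
    answers["fuel_in_tank"] = False
    print(first_failed_check(answers))
```

Treating an unanswered question as a failure matches the spirit of the list: seeing a generator on site proves nothing until each item has been confirmed.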
Andy: In a word, NO. You can’t make the cellular-type broadband technology as reliable as the LMR networks. A basic safety and hazard analysis, with suitable calculation of risk of combined / cascaded components, should illuminate this.
You can warn Public Safety, and the FCC, and NTIA, and all the rest, but they’ve already hitched political wagons to this particular star, and they will not easily tolerate someone telling them that the star needs adjustment or replacement.
Kropper–I do not have any information regarding tower failures during Sandy, BUT that does not mean that they had power or connectivity. I do not believe that it is economically feasible or possible to provide fully redundant connectivity to most of the cell sites installed. For example, in our city the main City site, which also houses a large cell site, has ALL of its wired and fiber access running on a single pole to within 300 feet of the site, where it then runs underground. If the poles are taken out, the system is just plain down (the cellular system, that is; the Public Safety systems are all repeaters). One point on the generators: they normally exercise every week or so, and if the fuel levels are not checked they could easily run out of fuel only a few hours into the failure.
Arelight–I agree 100%
Andy
Your point about graceful degradation being a desirable trait is important. It’s tough to implement. Everyone in the organization has to accept criticism of their pet system architecture in order to work toward an architecture that trends toward optimal.
You mention the multiple redundancies of NYPD’s Command Center. These are good features. In my opinion, agencies should be continually tuning to find a sweet spot where the tension between system complexity, and resilience to outages, is ideal. You’re never finished looking at this issue. If you’re paying attention, systems will teach you what they need. For example, you might discover the Command Center’s HVAC system is shared with the rest of the facility. A fire in the other parts of the building fills the Command Center with smoke or turns off air handlers that cool critical servers. Suddenly, you need to modify the HVAC system in order to isolate the airflows in each portion of the building.
To expand on your skeptical side, I think there’s an evolution toward acceptance of very complex Public Safety systems that are unstable and have major hidden costs. The government archetype of using technology long beyond its design life is impossible because everything requires short-lived software and chipsets are only available for a few years. The culture has shifted from Bell System toll grade audio and provisioning with capacity headroom to systems that have intermittent and difficult-to-troubleshoot problems with bit error rate, multipath, and interference. Some systems work well but many look too complex to manage. When your vendor declares a system component no longer supported, your agency has to buy new equipment. You become concerned equipment support is pulled to prop up the vendor’s quarterly earnings on the back of your budget. For self-maintained users, there are difficult training issues. Good configuration management is very difficult, especially where systems are so complex that no one person sees the whole picture. There are continual firmware and software revisions and my entire radio fleet has to be flashed with compatible versions. Agencies with vendor-maintained systems may not have in-house expertise to understand and resolve nagging system problems. I don’t think digital radio systems will go the way of fire engines with pump panels on the passenger side, but it seems like every current digital modulation scheme is an expensive, passing fad.
Thanks to all of you who make these systems work.