Fire at Cyxtera Causes Oracle NetSuite Cloud Outage
Around noon on February 14, 2023, a fire at a Cyxtera data center in Boston caused an outage for NetSuite that lasted a full day and included a 30-minute loss of data for some customers. When firefighters arrived on the scene, they shut off the power, which likely complicated the restoration efforts and limited Oracle’s ability to prevent the data loss.
Cloud Outages and Reliability
The February NetSuite outage was not the first for the company and will not be the last. Microsoft Azure, Amazon Web Services, IBM, Google and Meta have all suffered outages in 2022 or 2023. The extent, duration and damage vary from incident to incident, but outages themselves are inevitable.
- The bigger the provider, the more companies are impacted.
- Outages can last minutes, hours or days.
- Downtime is caused by human error, acts of God, cyber attacks, failed maintenance activities, upgrades and other unplanned events.
Looking at the number and frequency of outages, you might think that cloud services are unreliable. On the contrary, most cloud service providers report 99.99+% uptime. Although these numbers can be misleading depending upon how they are calculated, cloud services are typically more secure and reliable than on-premise servers.
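To see why an uptime figure depends on how it is calculated, it helps to turn the percentage into minutes. The sketch below is only an illustration: it treats the "full day" NetSuite outage as exactly 24 hours, and the function names are our own. It does not reflect how any particular provider measures or reports uptime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Downtime budget implied by an uptime percentage over a period."""
    return period_minutes * (1 - uptime_percent / 100)


def uptime_percent(downtime_minutes: float,
                   period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Uptime percentage implied by a given amount of downtime over a period."""
    return 100 * (1 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    one_day = 24 * 60        # assumption: the outage lasted exactly 24 hours
    thirty_days = 30 * 24 * 60

    print(f"99.99% uptime allows {allowed_downtime_minutes(99.99):.1f} minutes of downtime per year")
    print(f"A 24-hour outage measured over a year: {uptime_percent(one_day):.2f}%")
    print(f"A 24-hour outage measured over 30 days: {uptime_percent(one_day, thirty_days):.2f}%")
```

A 99.99% target leaves roughly 52.6 minutes of downtime per year. The same one-day outage works out to about 99.73% when measured over a year but only about 96.67% when measured over a 30-day month, which is why the measurement window matters when comparing uptime claims.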
In the case of the recent NetSuite outage, power was cut by firefighters. Cyxtera and other top data centers have multiple redundant power supplies, onsite generators and fuel to last days if not weeks in case of utility outages. In this case all of those backup systems were overridden by the need to put out a fire. Data centers have fire detection and suppression systems that typically can isolate electrical fires before flames break out, and chemical systems that keep a fire from taking hold on the data center floor, often down to the level of a single cabinet or cage. In this instance, the smoke was reported to have come from electrical equipment in a power room and not from the data center floor.
Cloud service providers and their data centers plan for more circumstances than most companies can even imagine. Cloud computing is secure and reliable.
Preparing for a Cloud Outage
The question is not if, but when. Therefore, preparing for the outage is a business imperative. Disaster recovery planning is required whether a company runs on-premise, private cloud, public cloud or hybrid cloud systems. Just as cloud service providers prepare for as many scenarios or incidents as they can think of, companies must also plan comprehensively and update their disaster recovery (DR) plans on a regular basis.
Incident response planning is not for systems, but for business continuity. Although we are increasingly reliant on our technology, companies must identify how to operate for periods without systems. Can your store sell physical goods, your production lines manufacture products or your service providers assist customers without one or more systems? What happens if your plant with its computers burns down? Can you still fulfill orders from your warehouse? If not, how long would it take to replace systems, and do you have good backups?
NetSuite and all other serious cloud service providers have plans to restore from backups or migrate systems and/or data to alternate data centers. In this unfortunate circumstance, the migrations could not happen without power to move the data. A 30-minute recovery point is actually quite impressive for a major operator like NetSuite. Unless you work for a Fortune 100 company, I doubt your IT department has this capability, and many smaller companies don’t even have a Recovery Time Objective or Recovery Point Objective in their DR plan (if the company even has a DR plan).
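If your DR plan does not yet state a Recovery Point Objective (RPO, how much data you can afford to lose) and a Recovery Time Objective (RTO, how long you can afford to be down), writing them down as numbers makes them testable. The following is a minimal sketch: the class and function names are our own, and the one-hour and eight-hour targets are hypothetical values for a small company, not NetSuite's actual objectives. The incident figures are the ones reported above, with "a full day" taken as 24 hours.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # Recovery Point Objective: maximum tolerable data loss
    rto: timedelta  # Recovery Time Objective: maximum tolerable downtime


def evaluate_incident(objectives: RecoveryObjectives,
                      data_loss: timedelta,
                      downtime: timedelta) -> dict[str, bool]:
    """Report whether an incident stayed within the stated objectives."""
    return {
        "rpo_met": data_loss <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


# Hypothetical targets for illustration only.
objectives = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=8))

# Reported incident: 30 minutes of data lost, roughly one day of downtime.
result = evaluate_incident(
    objectives,
    data_loss=timedelta(minutes=30),
    downtime=timedelta(hours=24),
)
print(result)
```

Against these example targets, the incident would have met the RPO but missed the RTO. The value of the exercise is less in the code than in forcing the business to agree on the two numbers before an outage.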
Disaster Recovery Plan
Whether you are on-premise or in the cloud, your DR plan should:
- Identify and assess disaster risks.
- Create a team and identify who does what and how, based upon incident type.
- Determine critical processes and define manual procedures.
- Identify sources of alternative computing power.
- Secure backups.
Review and test your DR plan at least annually. When was the last time you tested a restore of your backups? Do you have a development environment which could become production in the case of computer failure?
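A restore test can start very small. The sketch below shows one way to check that a backup file restores to an exact copy of the original by comparing SHA-256 checksums. It is an illustration under loud assumptions: the file names are made up, and a plain file copy stands in for whatever restore procedure your backup tool or cloud provider actually offers.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, backup: Path) -> bool:
    """Restore the backup into a scratch directory and compare it to the source."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / source.name
        # Assumption: a file copy stands in for your real restore procedure.
        shutil.copy2(backup, restored)
        return sha256_of(restored) == sha256_of(source)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical data file and backup, created only for this demo.
        source = Path(workdir) / "orders.csv"
        source.write_text("order_id,total\n1001,49.95\n")
        backup = Path(workdir) / "orders.csv.bak"
        shutil.copy2(source, backup)
        print("restore verified:", restore_matches_source(source, backup))
```

In practice the live data keeps changing after a backup is taken, so a more realistic test compares the restored file against a checksum recorded at backup time rather than against the current source.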
What to Do During an Outage
Three primary actions during an outage are:
- Activate your DR Team as quickly as possible
- Triage the situation before making it worse!
- Remain calm and communicate
Activate DR Team
A DR team includes leaders and department managers who understand your business processes and how they impact customers. Understanding where data comes from and how it is processed is critical for fixing broken components. The DR team also includes your information technology professionals who understand the system components and their failover and recovery processes, whether internal or cloud.
Triage the Situation
Triage is the process of assessment. The last thing you want to do is make the situation worse by implementing a hasty quick fix. I have seen numerous occasions where functional managers, business owners and IT professionals restored individual system components and created incompatibilities in the process. Implement changes one at a time and document any and all system changes during an outage.
Remain Calm and Communicate
Regardless of whether you are the IT department or the customer service representative who has to answer to customers, remain calm and communicate. The worst form of communication is no communication. You will likely not know the answer to everything the customer asks, and they may become frustrated.
If you are the customer, please do not take out your frustration on the support or customer service representative. The front line individuals are working under intense pressure and high stress, and they were not responsible for what is happening to you. Show compassion, grace and courtesy and they will remember you for your kindness.
Action Plan to Prepare for a Cloud Outage
First and foremost, never panic! If you have been affected by the NetSuite outage, that doesn’t mean it is time to leave the cloud or switch systems. A provider saying they have never had issues is like an established business saying they have never lost a customer. Outages will happen. A cloud outage does not negate the benefits of cloud, nor does it make cloud less reliable than on-premise systems.
The time to plan for disaster is before it happens! If you need help creating a DR plan, one of our fractional CxO executives can help you develop one or review yours. Schedule a free initial CxO Explore to discuss your situation or concerns.