On the surface, cloud computing would seem made for disaster recovery: the breadth and robustness of cloud resources suggest a “set it and forget it” proposition.
However, the concept isn’t cut and dried. While redundancy and data protection are the core elements of maintaining uptime and recovering from disasters, the best cloud operational results come from paying attention to the individual trees in the forest.
Amitabh Sinha, co-founder and CEO of Workspot; Ofer Maor, co-founder and chief technology officer at Mitiga; and Or Aspir, cloud security research team leader at Mitiga, shared advice on cloud disaster recovery best practices with TechRepublic.
Amitabh Sinha: The number one challenge is the level of availability the cloud provides. Today, the major public clouds (AWS, Google and Azure) offer 99.9% availability, which allows for more than eight hours of downtime a year, a level that significantly hinders operations for most mission-critical workloads and can cost organizations millions of dollars in lost productivity.
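As a quick back-of-the-envelope check of that figure, a 99.9% availability commitment translates to roughly 8.76 hours of permitted downtime per year. A minimal illustration (the numbers here are generic, not any specific provider's SLA terms):

```python
# Rough downtime budget implied by an availability figure (illustrative only).
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def downtime_hours_per_year(availability: float) -> float:
    """Hours of downtime per year allowed by a given availability fraction."""
    return HOURS_PER_YEAR * (1 - availability)

print(downtime_hours_per_year(0.999))   # ~8.76 hours at 99.9% ("three nines")
print(downtime_hours_per_year(0.9999))  # ~0.88 hours at 99.99% ("four nines")
```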
The second major challenge is about cloud capacity. An organization might try to optimize cloud costs by shutting down some of their virtual machines when not in use, but what happens when you need to bring them back up? Even if the cloud is available, there may not be capacity in that cloud region or cloud to accommodate bringing those machines back up again, and that has another chilling effect on productivity.
In a disaster recovery scenario, capacity constraints are an even greater risk if you can’t get the capacity you need to get your business back up and running.
SEE: Disaster recovery and business continuity plan
Ofer Maor: The notion of the cloud and its shared responsibility model is that responsibility for the maintenance and availability of the environment lies with the cloud vendor. The reality is more complex.
The cloud vendor does not commit to 100% availability, only close to it, and while the environments are up most of the time, we have seen multiple outages across various cloud vendors over the last couple of years.
Furthermore, other aspects of availability revolve around the specific applications and utilization of resources, which are already the responsibility of the user and not the cloud vendor.
Finally, as attacks move to the cloud, security breaches can often lead to disruption of service through various means, from denial-of-service (DoS) attacks to abuse of resources and ransomware.
Or Aspir: Moving to the cloud requires organizations to acquire new skills, adapt existing processes and familiarize themselves with the intricacies of cloud infrastructure and services. This learning curve can slow down deployment, configuration and troubleshooting processes, potentially impacting uptime as teams navigate the complexities of cloud technologies.
Despite the availability of multi-zone or multi-region redundancies provided by cloud providers, many companies opt for centralized regions/zones due to compliance and cost considerations. However, this centralized approach makes them susceptible to power outages, network disruptions and physical damage within a specific zone, posing risks to their uptime and service availability.
Amitabh Sinha: Particularly for end-user computing (EUC), a multi-cloud and multi-region approach is critical. Running EUC workloads across cloud regions and across major clouds can drastically reduce the amount of downtime businesses experience.
Information technology leaders should expect capabilities that enable automatic failover, for example, from a primary virtual desktop to a secondary desktop — whether the secondary desktop is in another cloud region or an alternative cloud — in a way that is completely transparent to the end user. This always-available virtual desktop is now a reality. Virtual desktop deployment should be spread across multiple regions and clouds to ensure uptime.
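As a rough sketch of what such failover logic can look like under the hood (the pool names, health endpoints and functions below are hypothetical illustrations, not Workspot's actual implementation), the core idea is to probe candidate regions and route the user to the first healthy one:

```python
import urllib.request

# Hypothetical ordered list of desktop pools across regions and clouds.
CANDIDATE_POOLS = [
    {"name": "azure-eastus", "health_url": "https://eastus.example.com/healthz"},
    {"name": "azure-westus", "health_url": "https://westus.example.com/healthz"},
    {"name": "gcp-us-central1", "health_url": "https://uscentral.example.com/healthz"},
]

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Treat a pool as healthy if its health endpoint answers 200 quickly."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_desktop_pool() -> str:
    """Return the first healthy pool so the client can fail over transparently."""
    for pool in CANDIDATE_POOLS:
        if is_healthy(pool["health_url"]):
            return pool["name"]
    raise RuntimeError("No healthy desktop pool available in any region or cloud")
```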
Or Aspir: Effective monitoring and incident response mechanisms are essential for identifying and addressing issues promptly. Use proactive planning to understand your company’s recovery time objective (RTO) and recovery point objective (RPO).
Explore cloud providers’ offerings for ensuring uptime and implementing effective disaster recovery strategies. One good example is AWS’s series of disaster recovery blog posts.
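One simple way to make RTO and RPO concrete during planning is to compare the targets against how the environment actually behaves today. A minimal sketch, with all figures assumed purely for illustration:

```python
# Illustrative figures only; substitute your organization's measured values.
rpo_target_hours = 4      # maximum tolerable data loss
rto_target_hours = 1      # maximum tolerable time to restore service

backup_interval_hours = 6     # how often backups or snapshots are taken
measured_restore_hours = 2.5  # last tested end-to-end restore time

if backup_interval_hours > rpo_target_hours:
    print("RPO gap: back up more frequently or use continuous replication.")
if measured_restore_hours > rto_target_hours:
    print("RTO gap: automate failover or keep a warm standby environment.")
```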
Amitabh Sinha: RTO is the metric everyone considers in a DR context. How long will it take you to get your business back up and running after a disruption? In the legacy, on-premises data center world, RTO was typically measured in days — with potentially catastrophic consequences for the business.
Both dimensions we talked about earlier come into play here: cloud availability and cloud capacity. In a DR context, as well as in day-to-day operations, the organization must have the agility to recover from a business disruption, whether a cloud outage, a weather event or a ransomware attack, within a few minutes. An RTO of days is no longer acceptable. Instead, the multi-cloud approach anticipates the cloud availability and cloud capacity constraints and solves them proactively.
Ofer Maor: Disaster recovery is a crucial aspect of this. While some uptime issues are the result of a temporary event, such as an outage of a CSP region (in which case not much DR is needed; the region will come back on its own), other cases may involve the destruction of cloud environments and, in more extreme cases, of the data itself, requiring disaster recovery measures to take place.
Naturally, backups are a crucial piece of the puzzle, and they must be done by the cloud (and SaaS) customers, who cannot rely on the cloud vendor to do them (at least under most shared responsibility models). One of the areas where most organizations are still lagging is SaaS backup and recovery; if an organization is breached and its entire SharePoint or Google Drive is held for ransom by an attacker, the vendor may not be able to help.
How cloud disaster recovery compares to on-premises
Amitabh Sinha: With on-prem, it can take days or weeks to get back up and running again; it is a costly endeavor and very time-consuming for teams. In a cloud DR scenario, companies can be up and running in minutes if they have chosen the right solutions.
How weather events factor in and related recommendations
Or Aspir: Severe weather conditions like hurricanes, floods, or storms can disrupt data centers within a specific availability zone in the cloud. These disruptions can cause power outages, network disruptions or physical damage, resulting in service interruptions and affecting the availability of cloud resources within that zone. An example of such a case is the outage of multiple Google Cloud services in Europe on April 25, 2023. This outage occurred due to a combination of a flood and fire incident.
Our recommendation is to verify cloud services’ availability zone redundancy for resilience against severe weather conditions.
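For teams on AWS, one quick sanity check is to list the distinct availability zones a workload's subnets actually span. This is a minimal sketch assuming boto3 with configured credentials and a hypothetical VPC ID; it is not a complete redundancy audit:

```python
import boto3

def subnet_availability_zones(vpc_id: str) -> set:
    """Return the distinct availability zones covered by a VPC's subnets."""
    ec2 = boto3.client("ec2")
    resp = ec2.describe_subnets(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
    return {subnet["AvailabilityZone"] for subnet in resp["Subnets"]}

zones = subnet_availability_zones("vpc-0123456789abcdef0")  # hypothetical VPC ID
if len(zones) < 2:
    print(f"Warning: subnets span only {sorted(zones)}; add capacity in another zone.")
else:
    print(f"Subnets span {len(zones)} availability zones: {sorted(zones)}")
```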
How do more eyes on the end user decrease the costly downtime of outages?
Amitabh Sinha: Getting real-time visibility into the end user is crucial to mitigating any downtime. End-user observability allows IT teams to understand the problems users are having. By leveraging that data, teams can understand the scope of the problem, from trouble accessing a single desktop or app to the performance of those resources.
They can figure out if there is a more significant problem, such as a trend tied to a specific location, whether it is impacting only a subset of end users or whether it has the potential to become a widespread issue. They can determine if it is a network issue or if a pattern is emerging in cloud availability and access that could affect productivity, and then they can take action in real time to resolve the problem.
In data center environments, IT teams only have control and visibility inside that data center itself. These legacy systems do not have the levels of end-user visibility that cloud environments do. By running cloud end-user observability tools, IT teams can take real-time action to quickly identify and resolve any existing issues.
What else do you recommend IT professionals focus on here?
Amitabh Sinha: Create direct, in-product end-user feedback mechanisms for all end user applications (e.g., surveys at the end of a Teams or Zoom session).
Leverage workload-specific, cloud-native observability tools, like Datadog for server workloads, and Workspot and ControlUp for end-user computing workloads.
Define people and processes to act on insights derived from the observability tools so problems are rapidly solved.
Or Aspir: Expanding the focus beyond natural disasters or malfunctions is crucial to addressing the potential impact of security incidents on disaster recovery. It is important to understand that under the shared responsibility model, customers are responsible for the security of their own cloud or SaaS instance; any breach resulting from a misconfiguration or a compromised user is their responsibility, and they will have to deal with the repercussions of such an event.
This includes scenarios where compromised identities possess permissions not only on production systems but also on backup systems. By recognizing and preparing for such security-related disasters, organizations can enhance their overall disaster recovery strategies and mitigate the risks associated with unauthorized access and compromised identities.
Having a robust incident response plan, which may include collaboration with third-party entities, can significantly aid in addressing disaster recovery in the event of security incidents.
Read next: Your organization needs regional disaster recovery: Here’s how to build it on Kubernetes