A recent data centre outage in Sydney affected multiple cloud service providers and businesses, including Bank of Queensland and Jetstar. After this and other incidents, cloud customers might well ask whether the same thing could happen to them, and how to reduce the risk before it does.
Can the risks of physical data centres be managed?
TechRepublic spoke with Nam Je Cho, director of solutions architecture for AWS Australia and New Zealand, and Guy Danskine, managing director at data centre provider Equinix Australia, who are both in box seats to witness the region’s embrace of the cloud.
Cho and Danskine recommended a range of strategies, including embracing geographic diversity, building in redundancy, seeking out data centre management best practices and weighing the risk benefits of hybrid multicloud infrastructure.
In August 2023, Australian organisations were reminded that cloud computing risks do exist. A lightning strike on electrical infrastructure 18 miles (about 29 kilometres) from a Sydney data centre caused a utility voltage sag, tripping a subset of the facility’s cooling system chiller units offline.
According to a post-incident report from Microsoft Azure, one of the affected cloud service providers, temperatures in the data centre rose above operational thresholds while technicians were working to fix the problem. A subset of compute and storage scale units then had to be powered down to lower temperatures and prevent hardware damage.
The incident impacted cloud customers, beginning around 10:30 UTC and lasting until 22:40 UTC. For some of this time, Bank of Queensland customers experienced trouble with the bank’s app, and banking transactions were not being reflected correctly in customer accounts. Jetstar customers, meanwhile, had trouble logging in, managing bookings and checking in for flights.
Azure was not the only service affected. Because the data centre was a shared facility, Oracle Cloud and NetSuite services also suffered outages.
There are other data centre outages on the minds of local cloud customers. In 2021, only a month or so after the launch of its brand new Melbourne region, Google Cloud services in australia-southeast2 went down for 1 hour and 30 minutes due to transient voltage issues.
In a statement on the incident at the time, Google said “the root cause of the issue was transient voltage at the feeder to the network equipment, causing the equipment to reboot. In order to mitigate the issue, traffic within the australia-southeast2 region was redirected temporarily.”
Forrester’s recent State of Cloud in Australia and New Zealand report suggested incidents like this outage, as well as environmental uncertainties, were encouraging organisations to consider revisiting their risk mitigation strategies.
“Between the global pandemic, cloud outages in 2021 such as the Google outage in Melbourne, fires and floods in Australia and earthquakes in New Zealand, enterprises are prioritising resilience,” Forrester reported.
Forrester said risk mitigation may include “building greater risk awareness, leveraging multiple AZs (availability zones) for high-priority workloads, mitigating supplier risk through multi-cloud skill sets or scenario-building against potential risks.”
AWS serves hundreds of thousands of businesses across Australia and New Zealand, including Atlassian, NAB, and public sector agencies like the Australian Bureau of Statistics and Western Australia’s Department of Education. Equinix, likewise, is trusted by customers in critical industries, including healthcare, financial services and government.
Customers of this calibre need cloud services around the clock, without disruption.
Equinix Australia’s Danskine said organisations understand that data centres and the cloud are playing a foundational role in supporting their businesses. Danskine added that the scalability, reliability and cost-efficiency of cloud technologies and infrastructure are what enable organisations to operate effectively in an increasingly digital economy.
“Robust digital infrastructure is fundamental,” Danskine said. “It enables organisations to connect users, customers and employees, enhances data security and allows them to adapt to evolving market demands.”
This demand is propelling Equinix’s growth. It has 51 data centres in the APAC region, including 22 in Australia, located in Sydney, Melbourne, Brisbane, Canberra, Perth and Adelaide.
It is also investing over AU $1 billion (US $645 million) in 13 projects that will see new data centres built in Australia, India, Japan and Korea as well as expanded facilities in Indonesia and Malaysia.
“We’re always looking for the right opportunity to expand, in line with customer and market demand, to ensure we can best support current and future requirements,” Danskine said.
Meanwhile, AWS is investing AU $13.2 billion (US $8.44 billion) into infrastructure from 2023 to 2027 across Australia, and is building a new region in Auckland with three availability zones.
Investments like those of AWS and Equinix are underpinning what Forrester has called “a new scale of public cloud usage” in Australasia. Organisations currently migrating to the cloud expect an average of 46% of workloads to be in the cloud within the next two years.
As digital transformation continues to be a high priority, Danskine said that businesses are trusting data centres and the cloud to provide the infrastructure needed to fuel innovation, support high levels of availability and “drive growth in a data-driven world”.
Despite strong levels of trust, Danskine said the market was not risk free.
“After the pandemic, many organisations are operating with fewer staff onsite, so the possibility of a system failure, even with automated remote monitoring and preventive maintenance, has increased,” Danskine said.
One way to combat this risk is for organisations to ensure they have power redundancy to lower the impact of a system failure.
“At Equinix, we provide fully redundant electrical and mechanical infrastructure as standard to our global data centre customers,” Danskine said.
Risk mitigation is a central design feature for cloud and data centre providers. For example, AWS, like other cloud providers, offers multiple availability zones within all of its regions. This means an application can be partitioned across physically separate locations.
“AZs are physically separated by a meaningful distance, many kilometres, although all are within 100 kilometres (60 miles) of each other,” AWS’s Cho said. “Each AZ has independent power, cooling and physical security and is connected via redundant, ultra-low-latency networks.
“If an application is partitioned across AZs, companies are better isolated and protected from issues such as power outages, natural weather events and more.”
Multi-AZ managed services like the Amazon Relational Database Service and Amazon Elastic Kubernetes Service allow AWS customers to select which AZs they deploy across.
“If there is an infrastructure event in a single AZ, there is managed and automatic failover to a second AZ and failback as appropriate, with little to no service disruptions,” Cho said. “Our customers are running mission-critical workloads by deploying workloads with multi-AZs and/or multi-regions architectures to achieve high availability.”
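The failover and failback behaviour Cho describes can be illustrated with a small Python sketch. It is a toy model only and does not call any AWS API: the `Zone` class, the `pick_active_zone` function and the zone names are hypothetical, and a real managed service would decide zone health through its own monitoring.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str             # hypothetical label, not read from any cloud API
    healthy: bool = True  # in practice this would come from health checks


def pick_active_zone(zones: list[Zone], preferred: str) -> Zone:
    """Return the preferred zone if it is healthy, else the first healthy one."""
    for zone in zones:
        if zone.name == preferred and zone.healthy:
            return zone
    for zone in zones:
        if zone.healthy:
            return zone
    raise RuntimeError("no healthy zone available")


zones = [Zone("zone-a"), Zone("zone-b")]
print(pick_active_zone(zones, "zone-a").name)  # normal operation

zones[0].healthy = False  # simulate an infrastructure event in zone-a
print(pick_active_zone(zones, "zone-a").name)  # failover to the other zone

zones[0].healthy = True   # zone-a recovers
print(pick_active_zone(zones, "zone-a").name)  # failback to the preferred zone
```

The point of the sketch is that the workload only keeps running if a second, independently healthy zone exists to fail over to, which is why the application has to be deployed across more than one AZ in the first place.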
Equinix is continuing to explore ways of increasing the operational integrity and safety of its data centres (Figure A). One example is that critical maintenance always takes place with a minimum of two qualified engineers present to double-check each other’s work.
Figure A
When customers choose to use its software-defined interconnection platform, Equinix Fabric, to connect to their cloud, SaaS and network providers, Danskine said the company always recommends configuring two physical ports.
“Companies can rely on these for additional resiliency when connecting to thousands of global end points or their own IT infrastructure on Platform Equinix,” Danskine said. “Companies can create interconnected business continuity and disaster recovery scenarios that meet their needs.”
Cloud and data centre uptimes are close to 100%. Equinix has a worldwide uptime of >99.9999% across 250 data centres, while AWS enables 99.999% availability. But there are ways customers can mitigate the risk of a data centre outage outside of depending on their providers’ uptime.
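To put those percentages in perspective, an availability figure can be converted into the downtime it allows each year. The Python sketch below is illustrative arithmetic only: the function name and the 365-day year are assumptions, and it says nothing about how Equinix or AWS measure or guarantee their own figures.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the downtime per year, in seconds, implied by an availability figure."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


# The two availability figures quoted in the article.
for label, percent in [("99.999%", 99.999), ("99.9999%", 99.9999)]:
    seconds = allowed_downtime_seconds(percent)
    print(f"{label}: about {seconds:.0f} seconds ({seconds / 60:.1f} minutes) per year")
```

On this arithmetic, 99.999% availability leaves room for a little over five minutes of downtime a year, and 99.9999% for roughly half a minute.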
Geographical diversity is a foundational design feature of modern cloud services and should be considered important for all critical infrastructure. As with the multiple availability zones on offer within AWS regions, geographic risk can be spread by using multiple data centres, mapping an application to multiple cloud regions or deploying the workload via containers.
A redundant network can support full operations during a service disruption and enable in-flight maintenance. Equinix said businesses should ensure redundancies in individual network components complement each other and the overall design, so that if an outage does occur, it would cause minimal impact while recovery efforts are underway.
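The value of redundancy can be shown with a simple calculation. The Python sketch below estimates the combined availability of several redundant copies of a component; the function name and the 99% starting figure are hypothetical, and the formula assumes the copies fail independently, which a shared dependency such as a common power feed or cooling system can undermine.

```python
def combined_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1 - (1 - component_availability) ** copies


# Hypothetical component that is available 99% of the time.
for copies in (1, 2, 3):
    print(copies, f"{combined_availability(0.99, copies):.6f}")
```

Under these assumptions, a second copy lifts the hypothetical 99% component to 99.99%, and a third to 99.9999%. That gain holds only when the copies do not share a single point of failure, which is why Equinix stresses that redundancies should complement the overall design.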
Equinix argues regular testing is critical. It tests critical systems every two weeks under maximum load and performs an annual “dark site test,” where it intentionally disconnects sites from main power to ensure backup systems come up and perform as expected. Forrester also recommends revisiting the risk and continuity elements of cloud strategies.
Increasingly, organisations are pursuing cloud-agnostic digital infrastructure to achieve advantages like innovation, cost-efficiency and resilience. Pairing multiple clouds with cloud-adjacent on-premises environments can provide businesses with important security and business continuity benefits.
AWS offers a number of managed services that allow organisations to operate within and across the region without needing to architect for multi-AZ characteristics themselves. With AWS taking care of this by default, if there’s an issue within a particular AZ, it will be handled on the customer’s behalf as part of a shared responsibility model.
Recent data centre outages will not slow cloud strategies. Danskine argues hybrid multicloud is becoming the architecture of choice for many because it is a versatile infrastructure strategy. And according to the fifth annual Nutanix Enterprise Cloud Index, respondents in Australia expect to increase their use of this model more than fivefold to 43% penetration by 2026.
“This approach provides the flexibility to choose between public and private clouds, optimising performance and cost-efficiency,” Danskine said. “It also enhances resilience through redundancy and disaster recovery capabilities and enables compliance with regulatory requirements in the host country, ensuring data security and sovereignty.”
Nam Je Cho from AWS said there is no doubt the region is “in the middle of a tectonic shift to the cloud. The number one reason that our customers are moving and innovating on the cloud is the agility and speed with which they can change their customer experience.”