Leverage Redundancy to Improve System Uptime

In the business world just a couple of decades ago, a certain amount of downtime was almost expected in business systems. It wasn’t uncommon for email systems, web servers, and file/application servers to need occasional reboots, fall victim to memory leaks, succumb to internet outages, or crash altogether. Avoiding unplanned downtime was possible, but a truly redundant solution tended to be very costly. This limited the highly coveted four and five nines (99.99% and 99.999%) of uptime to large enterprise environments that could afford this level of redundancy.
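To put the "nines" in perspective, here is a minimal Python sketch that converts an uptime percentage into a yearly downtime budget. The function name and the 365-day year are my own simplifications for illustration, not anything taken from an SLA document.

```python
# Minutes in a non-leap year (an assumption made to keep the math simple).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    minutes = allowed_downtime_minutes(pct)
    print(f"{label} ({pct}%): about {minutes:.1f} minutes of downtime per year")
```

By this arithmetic, four nines leaves roughly 53 minutes of downtime per year, and five nines leaves a little over 5 minutes.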

Since then, even the smallest businesses have become less and less accepting of unplanned downtime in their production environments. After all, internet connections have become affordable enough to allow redundant links, power protection is the norm, and operating systems have become much more reliable (though at times it doesn’t feel that way).

SaaS (software as a service) and PaaS (platform as a service) solutions like Microsoft 365, Azure SQL services, Azure Virtual Desktop, Azure Front Door, and others often build in redundancies or make them easily deployable. But what if you are still running virtual machines in the Azure cloud or in your on-premises environment?

It also goes without saying that taking the appropriate posture on cybersecurity and employing a good data backup solution are critical, but for the purpose of this discussion, I will stick to redundancy options.

Protecting On-Prem Virtual Machines

Whether you’ve invested in Hyper-V or VMware for your virtualization hypervisor platform, there are a few things to consider.

  • Redundant networks – Multiple physical host adapters for management and VM traffic, preferably all connecting to different network switches.
  • Redundant power – Multiple power supplies in each virtualization host, each connected to a different UPS (uninterruptible power supply). Having a backup generator on top of this is a plus for any power outage beyond a few minutes.
  • Scale-out file server / SAN – Storage used by the hypervisors should be well thought out, allowing for resiliency not just within disk sets, but between disk arrays. Don’t just plan on individual disks failing. Disk controllers and even entire arrays can have critical issues. 
  • VM (virtual machine) clustering – Just like everything else, virtualization hosts are not immune to having issues. Failover clustering allows a VM to automatically (or sometimes manually) start up on a different host when its primary host is either down unexpectedly or needs maintenance. Clustering in the VMware world is pretty simple with vSphere. In Hyper-V, clustering is a little more complicated, but Microsoft has a lot of great resources to help you along the way. There are also some really great third-party tools to manage failover and make failback a cinch.
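To see why duplicating a component pays off, here is a rough Python sketch of the textbook formula for redundant components where any one copy can carry the load. The 99% figure is a made-up number for illustration, and the formula assumes failures are independent, which real hardware only approximates (two hosts on the same UPS can fail together).

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Estimate availability when any one of `copies` identical components suffices.

    Assumes independent failures, so treat the result as an upper bound.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every copy is down at the same time.
    return 1 - (1 - component_availability) ** copies


single_host = 0.99  # hypothetical availability of one host, not a measured value
for copies in (1, 2, 3):
    estimate = parallel_availability(single_host, copies)
    print(f"{copies} host(s): {estimate:.4%}")
```

Under these assumptions, two hosts at 99% each work out to 99.99% combined, which is why removing single points of failure is so effective.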

Storage Redundancy and SLAs

Storage in Azure has multiple resiliency options depending on your uptime requirements. Here are a few of Microsoft’s SLA (service level agreement) guaranteed uptimes for Azure Storage:

  • At least 99.99% (99.9% for Cool and Archive* Access Tiers) of the time, we will successfully process requests to read data from Read Access-Geo Redundant Storage (RA-GRS) accounts, provided that failed attempts to read data from the primary region are retried on the secondary region. Rehydration is not supported in the secondary region.
  • At least 99.9% (99% for Cool and Archive* Access Tiers) of the time, we will successfully process requests to read data from Locally Redundant Storage (LRS), Zone Redundant Storage (ZRS), and Geo Redundant Storage (GRS) accounts.
  • At least 99.9% (99% for Cool and Archive* Access Tiers) of the time, we will successfully process requests to write data to LRS, ZRS, GRS accounts, and RA-GRS accounts.
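Note that the RA-GRS figure depends on failed reads from the primary region being retried against the secondary region. The Python sketch below shows that pattern in a generic way. The function names and the stand-in readers are hypothetical; the Azure Storage SDKs have their own retry options, which this does not attempt to reproduce.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def read_with_fallback(
    read_primary: Callable[[], T],
    read_secondary: Callable[[], T],
    primary_attempts: int = 2,
) -> T:
    """Try the primary reader a few times, then fall back to the secondary."""
    last_error: Optional[OSError] = None
    for _ in range(primary_attempts):
        try:
            return read_primary()
        except OSError as error:  # covers ConnectionError and TimeoutError
            last_error = error
    try:
        return read_secondary()
    except OSError as error:
        # Both regions failed; keep the primary failure attached for context.
        raise error from last_error


# Stand-in readers for illustration only; real code would call a storage client.
def failing_primary() -> bytes:
    raise ConnectionError("primary region unavailable")


def healthy_secondary() -> bytes:
    return b"blob contents"


print(read_with_fallback(failing_primary, healthy_secondary))
```

Only reads fall back here. The secondary region of an RA-GRS account is read-only, so writes still depend on the primary region.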

Protecting Azure Virtual Machines

A simple, single-instance VM in Azure carries an uptime guarantee of at least 95% from Microsoft without any additional work. If 95% uptime is not quite good enough, there are plenty of options to improve this score considerably.

For example, you can expect a 99.9% SLA simply by using Premium SSD, Ultra Disk, or Premium SSD v2 for all Operating System Disks and Data Disks.

Figure: Availability Zones in Azure. Source: Microsoft

An Availability Set is a logical grouping of two or more VMs deployed across different Fault Domains to avoid a single point of failure. When deploying two or more VM instances in the same Availability Set or in the same Dedicated Host Group, you can expect an SLA of up to 99.95%.

Availability Zones are fault-isolated areas within an Azure region, providing redundant power, cooling, and networking.  They can make reliability even better with an SLA of up to 99.99% when two or more instances are deployed across two or more Availability Zones in the same Azure region.

Availability Sets and Availability Zones can be leveraged for VMs and also with Azure Virtual Desktop (AVD) to ensure a significant reduction in any single point of failure.
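Keep in mind that each SLA covers a single service. When a workload needs several services at once, a common rule of thumb is to multiply their availabilities to estimate the combined figure. The Python sketch below does that for a VM tier and a storage account; the function name is my own, and the result is a rough estimate under an independence assumption, not a number Microsoft guarantees.

```python
from math import prod


def composite_availability(*availabilities: float) -> float:
    """Estimate availability of a workload that needs every listed service.

    Assumes the services fail independently of one another.
    """
    return prod(availabilities)


vm_tier = 0.9999  # two or more instances across Availability Zones
storage = 0.999   # LRS/ZRS/GRS read requests on the Hot tier
combined = composite_availability(vm_tier, storage)
print(f"Estimated combined availability: {combined:.4%}")
```

In this example the combined estimate lands just under 99.9%, so the chain is limited by its weakest link.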

If you are planning an AVD deployment, then also consider On-Demand Capacity Reservations, since they guarantee that you will receive compute capacity up to the reserved quantity of VMs at least 99.9% of the time.

For more, here is a list of Microsoft’s SLAs for online services: https://azure.microsoft.com/en-us/support/legal/sla/

Permanent link to this article: https://www.robertborges.us/2022/12/cloud-computing/leverage-redundancy-to-improve-system-uptime/