High Availability On Azure

If your app is hosted on a single virtual machine, that virtual machine is a single point of failure: anything that goes wrong on it (a hardware failure, a networking blip, etc.) can take your app down completely. To avoid this, Azure offers components that let you build a highly available infrastructure that is resilient to the failure of any single machine.

Highly Available Building Blocks

There are three building blocks in Azure that let you build highly available applications:

1) Availability Sets

When you place your virtual machines in the same availability set, they are automatically spread across update domains and fault domains. Machines in different update domains within the same availability set won’t undergo platform maintenance at the same time, and machines in different fault domains within the same availability set are spread across different physical hardware. For more information, see the Azure documentation: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/manage-availability.

Note, however, that no assumptions can be made about update domains and fault domains across availability sets. For example, say we have machine 1 in update domain 1 of availability set 1, and we have machine 2 in update domain 2 of availability set 2. These machines might undergo platform maintenance at the same time, or they might not. Since they are in different availability sets, we cannot assume either way.
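To make the rule above concrete, here is a minimal Python sketch of the reasoning. It is a toy model, not an Azure API: the Placement class, its field names, and both helper functions are hypothetical names introduced only for illustration. Each function answers "could these two machines be affected together?" and answers yes whenever Azure gives no guarantee either way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical record of where a VM landed (illustrative, not an Azure type)."""
    availability_set: str
    update_domain: int
    fault_domain: int


def may_share_maintenance(a: Placement, b: Placement) -> bool:
    """True if the two VMs might undergo platform maintenance at the same time."""
    if a.availability_set != b.availability_set:
        # Different availability sets: no assumption can be made, so assume the worst.
        return True
    # Same availability set: only VMs in the same update domain are updated together.
    return a.update_domain == b.update_domain


def may_share_hardware(a: Placement, b: Placement) -> bool:
    """True if the two VMs might sit on the same physical hardware."""
    if a.availability_set != b.availability_set:
        return True
    return a.fault_domain == b.fault_domain


machine_1 = Placement("availability-set-1", update_domain=1, fault_domain=0)
machine_2 = Placement("availability-set-2", update_domain=2, fault_domain=1)
machine_3 = Placement("availability-set-1", update_domain=2, fault_domain=1)

print(may_share_maintenance(machine_1, machine_2))  # True: different sets, no guarantee
print(may_share_maintenance(machine_1, machine_3))  # False: same set, different update domains
```

Note that machine 1 and machine 2 are in different update domains by number, yet the model still reports a possible overlap, because the domain numbers are only meaningful within one availability set.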

2) Placement Groups/Scale Sets

Scale sets offer greater scale than availability sets by spreading machines across “placement groups”, where each placement group behaves like an availability set. Within each placement group, machines are spread across update domains and fault domains, but no assumptions can be made across placement groups. This behavior is the default for regional scale sets (scale sets that don’t use availability zones), and is called “static 5 fault domain spreading”.

However, scale sets also allow you to choose “max spreading” instead of “static 5 fault domain spreading”. Max spreading means that within each placement group, the scale set will spread machines across as many fault domains as possible instead of exactly 5. More information can be found in the Azure documentation: https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-use-availability-zones.
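The practical difference between the two modes is how many machines a single fault domain failure can take out. The following Python sketch illustrates the arithmetic for one placement group. It assumes an even, round-robin assignment of machines to fault domains, which is a simplification and not a statement of how Azure actually places machines; the function names and the figure of 10 fault domains used for max spreading are hypothetical.

```python
from collections import Counter


def assign_fault_domains(vm_count: int, fault_domain_count: int) -> list[int]:
    """Assign VMs to fault domains round-robin (an assumed, simplified policy)."""
    return [vm_index % fault_domain_count for vm_index in range(vm_count)]


def worst_case_loss(vm_count: int, fault_domain_count: int) -> int:
    """Largest number of VMs that share one fault domain, i.e. the most
    VMs that a single fault domain failure could take down."""
    per_domain = Counter(assign_fault_domains(vm_count, fault_domain_count))
    return max(per_domain.values(), default=0)


# 20 VMs in one placement group.
print(worst_case_loss(20, 5))   # 4 with static 5 fault domain spreading
print(worst_case_loss(20, 10))  # 2 if max spreading found 10 fault domains (assumed number)
```

Under these assumptions, the more fault domains the machines are spread over, the smaller the share of the placement group that any one hardware failure can affect.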

3) Availability Zones

While availability sets and placement groups provide local redundancy, availability zones provide redundancy at the datacenter level. You can deploy machines into one of multiple zones, so if one datacenter goes down, only the machines in that zone are affected. In the case of scale sets, machines are automatically distributed across placement groups in the zones you choose to spread across. More information can be found in the Azure documentation: https://docs.microsoft.com/en-us/azure/availability-zones/az-overview.
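As a rough illustration of why zones help, the Python sketch below models a zone outage. It is a toy model rather than Azure code: the machine names, the zone labels, and the survivors_after_zone_outage function are all hypothetical.

```python
def survivors_after_zone_outage(vm_zones: dict[str, str], failed_zone: str) -> list[str]:
    """Return the names of the VMs that are not in the failed zone, sorted by name."""
    return sorted(name for name, zone in vm_zones.items() if zone != failed_zone)


# Machines spread across three zones: losing one zone leaves two machines running.
spread_out = {"web-1": "zone-1", "web-2": "zone-2", "web-3": "zone-3"}
print(survivors_after_zone_outage(spread_out, "zone-2"))  # ['web-1', 'web-3']

# All machines in one zone: losing that zone takes everything down.
single_zone = {"web-1": "zone-1", "web-2": "zone-1", "web-3": "zone-1"}
print(survivors_after_zone_outage(single_zone, "zone-1"))  # []
```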
