Building a System That Never Goes Down

Picture this! You’ve built a beautiful website. It’s fast, it’s functional, and you’re proud of it. Then one day, the server crashes. Your site goes dark. Users can’t access it. Panic sets in. This happened to me (in a test environment, thankfully), and it taught me something crucial: One server is a single point of failure. Today, I learned how to fix that.

What Is High Availability?

Think of high availability like having a backup generator. When the main power goes out, the backup kicks in automatically. No one even notices there was a problem.

In cloud terms, this means:

  • Multiple servers running your application
  • A traffic director (load balancer) that sends users to healthy servers
  • Automatic failover when something breaks

What We’re Building Today

I created a system with:

  1. Two virtual machines (think: two computers in the cloud)
  2. An Availability Set (Azure’s way of keeping those VMs on different physical hardware)
  3. A Load Balancer (the traffic cop that distributes visitors between VMs)
  4. A Health Probe (constantly checks if each VM is alive and responding)

Why this matters: if VM1 crashes, VM2 keeps running. Users get routed to VM2 automatically. Zero downtime.

Getting to It

Step 1: Setting Up the Foundation

Create your project folder:

mkdir day7-high-availability
cd day7-high-availability

Create these files:
day7-high-availability/
├── main.tf           # Main infrastructure
├── variables.tf      # Configurable values
├── outputs.tf        # Display information
├── terraform.tfvars  # Your personal settings
└── providers.tf      # Azure connection

Step 2: Configuration Files
providers.tf
terraform {
  
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

provider "azurerm" {
  features {}
}

What this does: Tells Terraform to use Azure and which version of the Azure provider to download.

variables.tf
variable "resource_group_name" {
  description = "Name of the resource group"
  type        = string
  default     = "rg-ha-lab"
}

variable "location" {
  description = "Azure region"
  type        = string
  default     = "East US"
}

variable "admin_username" {
  description = "Admin username for VMs"
  type        = string
  default     = "azureuser"
}

variable "admin_password" {
  description = "Admin password for VMs"
  type        = string
  sensitive   = true
}

variable "vm_count" {
  description = "Number of VMs to create"
  type        = number
  default     = 2
}

What this does: Defines what can be customized without editing the main code.
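
You can also override any of these defaults at plan or apply time without touching the files. A minimal sketch (the values here are just illustrations):

# Override variables on the command line
terraform plan -var="vm_count=3" -var="location=West US 2"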


terraform.tfvars
resource_group_name = "rg-ha-lab"
location            = "East US"
admin_username      = "azureuser"
admin_password      = "P@ssw0rd123!SecurePass"  # Change this!
vm_count            = 2

Security Note: In production, NEVER store passwords in plain text. Use Azure Key Vault or SSH keys instead. For this learning lab, we’re using passwords for simplicity.

Better approach (SSH key):

# Generate SSH key pair
ssh-keygen -t rsa -b 4096 -f ~/.ssh/azure_vm_key
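
If you keep the password for this lab, at least keep it out of terraform.tfvars. Terraform reads any environment variable named TF_VAR_<variable_name>, so a safer sketch (the password shown is obviously just a placeholder) looks like this:

# Pass the password via the environment instead of a file
export TF_VAR_admin_password='a-strong-throwaway-password'

# Make sure local settings files never land in version control
echo "terraform.tfvars" >> .gitignore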

main.tf

I’ll break this into logical sections:

Part 1: Resource Group
# Resource Group - Container for all resources
resource "azurerm_resource_group" "rg" {
  name     = var.resource_group_name
  location = var.location

  tags = {
    Environment = "Learning"
    Project     = "High-Availability"
    ManagedBy   = "Terraform"
  }
}

What this creates: A folder in Azure to organize all your resources.
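
Once it exists, you can confirm it from the Azure CLI (assuming you are already logged in with az login):

# Confirm the resource group was created
az group show --name rg-ha-lab --output table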

Part 2: Virtual Network (VNet)
# Virtual Network - Private network for VMs
resource "azurerm_virtual_network" "vnet" {
  name                = "ha-vnet"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name

  tags = {
    Purpose = "VM-Network"
  }
}

# Subnet - Section of the network for VMs
resource "azurerm_subnet" "subnet" {
  name                 = "vm-subnet"
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.0.1.0/24"]
}

What this creates:

  • VNet: A private network (like your home WiFi network)
  • Subnet: A segment of that network (like devices on 192.168.1.x)

IP addresses explained:

  • 10.0.0.0/16 = VNet can have IPs from 10.0.0.0 to 10.0.255.255 (65,536 addresses)
  • 10.0.1.0/24 = Subnet uses 10.0.1.0 to 10.0.1.255 (256 addresses)
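
After deployment you can double-check these ranges from the CLI (resource names match the ones defined above):

# Inspect the VNet address space and the subnet prefix
az network vnet show --resource-group rg-ha-lab --name ha-vnet --query "addressSpace.addressPrefixes" --output tsv
az network vnet subnet show --resource-group rg-ha-lab --vnet-name ha-vnet --name vm-subnet --query "addressPrefix" --output tsv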

Part 3: Network Security Group (Firewall)
# Network Security Group - Firewall rules
resource "azurerm_network_security_group" "nsg" {
  name                = "ha-nsg"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name

  # Allow HTTP traffic (port 80)
  security_rule {
    name                       = "AllowHTTP"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "80"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  # Allow SSH (port 22) - for management
  security_rule {
    name                       = "AllowSSH"
    priority                   = 110
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "22"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }

  # Allow Load Balancer health probes
  security_rule {
    name                       = "AllowAzureLoadBalancer"
    priority                   = 120
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "AzureLoadBalancer"
    destination_address_prefix = "*"
  }

  tags = {
    Purpose = "VM-Security"
  }
}

# Associate NSG with Subnet
resource "azurerm_subnet_network_security_group_association" "nsg_association" {
  subnet_id                 = azurerm_subnet.subnet.id
  network_security_group_id = azurerm_network_security_group.nsg.id
}

What this creates: Firewall rules that control what traffic can reach your VMs.

Rules explained:

  1. AllowHTTP: Anyone can access port 80 (web traffic)
  2. AllowSSH: Anyone can SSH to VMs (in production, restrict to your IP only)
  3. AllowAzureLoadBalancer: Critical! Allows health probes from Load Balancer

Priority numbers: a lower number means higher priority (rule 100 is evaluated before rule 110).
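
After deployment you can list the rules and their priorities, and tighten the SSH rule to your own IP, straight from the CLI (a sketch; the placeholder IP is yours to fill in):

# List the NSG rules and their priorities
az network nsg rule list --resource-group rg-ha-lab --nsg-name ha-nsg --output table

# In production, restrict SSH to your own public IP instead of "*"
az network nsg rule update --resource-group rg-ha-lab --nsg-name ha-nsg --name AllowSSH --source-address-prefixes "<your-public-ip>/32"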

Part 4: Availability Set
# Availability Set - Ensures VMs are on different hardware
resource "azurerm_availability_set" "avset" {
  name                         = "ha-avset"
  location                     = azurerm_resource_group.rg.location
  resource_group_name          = azurerm_resource_group.rg.name
  platform_fault_domain_count  = 2  # VMs on 2 different racks
  platform_update_domain_count = 2  # Updates happen to 1 VM at a time
  managed                      = true  # Use managed disks

  tags = {
    Purpose = "High-Availability"
  }
}

What this creates: The magic that spreads your VMs across different physical hardware.

Configuration:

  • fault_domain_count = 2: VMs will be on 2 separate racks
  • update_domain_count = 2: Azure updates 1 VM at a time (never both simultaneously)
  • managed = true: Use Azure-managed disks (recommended)
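
Once the VMs are up, you can confirm they really landed on separate fault domains; the VM instance view exposes the placement (a sketch using this lab's resource names):

# Show which fault domain and update domain each VM was placed in
az vm get-instance-view --resource-group rg-ha-lab --name ha-vm-1 --query "instanceView.{faultDomain:platformFaultDomain, updateDomain:platformUpdateDomain}" --output table
az vm get-instance-view --resource-group rg-ha-lab --name ha-vm-2 --query "instanceView.{faultDomain:platformFaultDomain, updateDomain:platformUpdateDomain}" --output table
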
Part 5: Public IP for Load Balancer
# Public IP - The address users connect to
resource "azurerm_public_ip" "lb_public_ip" {
  name                = "ha-lb-public-ip"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  allocation_method   = "Static"  # IP never changes
  sku                 = "Standard"  # Required for Standard Load Balancer

  tags = {
    Purpose = "LoadBalancer-Frontend"
  }
}

What this creates: The public IP address users will connect to (like 20.30.40.50).

Static vs Dynamic:

  • Static: IP never changes (good for production)
  • Dynamic: IP can change when the resource is deallocated (cheaper, but bad for users; Standard SKU public IPs must use Static anyway)
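
After apply, this static address is the one you hand out to users, and you can look it up again at any time:

# Retrieve the load balancer's public IP
az network public-ip show --resource-group rg-ha-lab --name ha-lb-public-ip --query ipAddress --output tsv
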
Part 6: Load Balancer
# Load Balancer - Traffic distributor
resource "azurerm_lb" "lb" {
  name                = "ha-load-balancer"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  sku                 = "Standard"

  frontend_ip_configuration {
    name                 = "PublicFrontend"
    public_ip_address_id = azurerm_public_ip.lb_public_ip.id
  }

  tags = {
    Purpose = "Traffic-Distribution"
  }
}

# Backend Pool - Collection of VMs behind the LB
resource "azurerm_lb_backend_address_pool" "backend_pool" {
  name            = "vm-backend-pool"
  loadbalancer_id = azurerm_lb.lb.id
}

# Health Probe - Checks if VMs are responsive
resource "azurerm_lb_probe" "health_probe" {
  name                = "http-probe"
  loadbalancer_id     = azurerm_lb.lb.id
  protocol            = "Tcp"
  port                = 80
  interval_in_seconds = 5   # Check every 5 seconds
  number_of_probes    = 2   # 2 failures = unhealthy
}

# Load Balancing Rule - How to distribute traffic
resource "azurerm_lb_rule" "lb_rule" {
  name                           = "http-rule"
  loadbalancer_id                = azurerm_lb.lb.id
  protocol                       = "Tcp"
  frontend_port                  = 80
  backend_port                   = 80
  frontend_ip_configuration_name = "PublicFrontend"
  backend_address_pool_ids       = [azurerm_lb_backend_address_pool.backend_pool.id]
  probe_id                       = azurerm_lb_probe.health_probe.id
  disable_outbound_snat          = true
  tcp_reset_enabled              = true
}

# Outbound Rule - Allow VMs to access internet
resource "azurerm_lb_outbound_rule" "outbound_rule" {
  name                    = "outbound-internet"
  loadbalancer_id         = azurerm_lb.lb.id
  protocol                = "All"
  backend_address_pool_id = azurerm_lb_backend_address_pool.backend_pool.id

  frontend_ip_configuration {
    name = "PublicFrontend"
  }
}

What this creates: The complete load balancing setup.

Components:

  1. Load Balancer: Main resource
  2. Backend Pool: List of VMs to distribute traffic to
  3. Health Probe: Checks VM health every 5 seconds
  4. Load Balancing Rule: Port 80 → VMs on port 80
  5. Outbound Rule: Lets VMs reach internet for updates
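
Once deployed, a quick CLI check confirms the probe and rule are attached to the right load balancer (names as defined above):

# Inspect the health probe and the load-balancing rule
az network lb probe list --resource-group rg-ha-lab --lb-name ha-load-balancer --output table
az network lb rule list --resource-group rg-ha-lab --lb-name ha-load-balancer --output table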

Part 7: Network Interfaces for VMs
# Network Interfaces - Virtual network cards for VMs
resource "azurerm_network_interface" "nic" {
  count               = var.vm_count
  name                = "ha-nic-${count.index + 1}"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.subnet.id
    private_ip_address_allocation = "Dynamic"
  }

  tags = {
    VM = "vm-${count.index + 1}"
  }
}

# Associate NICs with Backend Pool
resource "azurerm_network_interface_backend_address_pool_association" "nic_backend" {
  count                   = var.vm_count
  network_interface_id    = azurerm_network_interface.nic[count.index].id
  ip_configuration_name   = "internal"
  backend_address_pool_id = azurerm_lb_backend_address_pool.backend_pool.id
}

What this creates: Virtual network cards for each VM, connected to the backend pool.

The count parameter: Creates 2 NICs (one per VM) automatically.
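
After apply, those counted resources show up as indexed instances in the Terraform state, which is an easy way to see what the loop produced:

# count creates indexed instances: nic[0] and nic[1]
terraform state list | grep azurerm_network_interface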

Part 8: Virtual Machines
# Virtual Machines
resource "azurerm_linux_virtual_machine" "vm" {
  count                 = var.vm_count
  name                  = "ha-vm-${count.index + 1}"
  location              = azurerm_resource_group.rg.location
  resource_group_name   = azurerm_resource_group.rg.name
  network_interface_ids = [azurerm_network_interface.nic[count.index].id]
  size                  = "Standard_B1s"  # 1 vCPU, 1GB RAM (cheapest)
  availability_set_id   = azurerm_availability_set.avset.id

  # OS disk configuration
  os_disk {
    name                 = "ha-vm-osdisk-${count.index + 1}"
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  # Ubuntu 20.04 image
  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-focal"
    sku       = "20_04-lts"
    version   = "latest"
  }

  # Admin credentials
  admin_username                  = var.admin_username
  admin_password                  = var.admin_password
  disable_password_authentication = false

  # Startup script - Install Nginx
  custom_data = base64encode(<<-EOF
    #!/bin/bash
    apt-get update
    apt-get install -y nginx
    
    # Create custom web page showing VM name
    echo "<h1>Hello from VM ${count.index + 1}</h1>" > /var/www/html/index.html
    echo "<p>This is VM ${count.index + 1} responding to your request</p>" >> /var/www/html/index.html
    
    systemctl restart nginx
  EOF
  )

  tags = {
    Name        = "vm-${count.index + 1}"
    Environment = "HA-Lab"
  }
}

What this creates: 2 Ubuntu VMs with Nginx automatically installed.

VM size breakdown:

  • Standard_B1s: 1 vCPU, 1GB RAM
  • Cost: ~$7.50/month per VM
  • Good for: Testing, small websites, learning

The custom_data script: This runs automatically when VM first starts:

  1. Updates package list
  2. Installs Nginx web server
  3. Creates a custom homepage showing which VM is responding
  4. Starts Nginx

Result: Each VM will show “Hello from VM 1” or “Hello from VM 2”
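
The VMs only have private IPs, so you can't SSH to them directly from the internet, but you can still confirm Nginx is running through the Azure VM agent (a sketch using this lab's resource names):

# Check Nginx on each VM without exposing SSH
az vm run-command invoke --resource-group rg-ha-lab --name ha-vm-1 --command-id RunShellScript --scripts "systemctl is-active nginx"
az vm run-command invoke --resource-group rg-ha-lab --name ha-vm-2 --command-id RunShellScript --scripts "systemctl is-active nginx"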

outputs.tf
output "load_balancer_ip" {
  description = "Public IP of the Load Balancer"
  value       = azurerm_public_ip.lb_public_ip.ip_address
}

output "load_balancer_url" {
  description = "URL to access your application"
  value       = "http://${azurerm_public_ip.lb_public_ip.ip_address}"
}

output "vm_names" {
  description = "Names of the VMs"
  value       = azurerm_linux_virtual_machine.vm[*].name
}

output "availability_set_name" {
  description = "Name of the availability set"
  value       = azurerm_availability_set.avset.name
}

What this does: After deployment, shows you important information like the public IP to test.
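
You can re-read any of these values later without re-running apply; the -raw flag (Terraform 0.14+) is handy for scripting:

# Re-read outputs at any time
terraform output
terraform output -raw load_balancer_url

# Feed the URL straight into a quick smoke test
curl -s "$(terraform output -raw load_balancer_url)"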

Step 3: Deploy the Infrastructure

Now let’s build everything!

# Step 1: Initialize Terraform
terraform init

# Step 2: Preview what will be created
terraform plan

# Step 3: Deploy (confirm with 'yes')
terraform apply

What happens:

  1. terraform init (~30 seconds)
    • Downloads the Azure provider plugin
    • Prepares the backend (local state for this lab)

  2. terraform plan (~15 seconds)
    • Shows exactly what will be created
    • Lists every resource that will be added, with a summary count at the end

  3. terraform apply (~5-7 minutes)
    • Creates the resource group
    • Creates the network infrastructure
    • Creates the availability set
    • Creates the load balancer
    • Creates 2 VMs
    • Installs Nginx on both VMs

Expected output:

Apply complete! Resources: 23 added, 0 changed, 0 destroyed.

Outputs:

load_balancer_ip = "20.30.40.50"
load_balancer_url = "http://20.30.40.50"
vm_names = [
  "ha-vm-1",
  "ha-vm-2",
]
availability_set_name = "ha-avset"

Copy the load_balancer_url — you’ll use it to test!

Step 4: Verify in Azure Portal (Continued)

Check 2: Availability Set
  1. In your resource group, click on “ha-avset”
  2. You should see:
    • Virtual machines: 2 (ha-vm-1, ha-vm-2)
    • Fault domains: 2
    • Update domains: 2
  3. Click on “Virtual machines” in the left menu to confirm both VMs are listed

Test 1: Open the Site in a Browser

Paste the load_balancer_url into your browser and keep refreshing: the page will alternate between “Hello from VM 1” and “Hello from VM 2”.

What’s happening: the Load Balancer spreads incoming requests across every healthy VM in the backend pool. (Azure distributes connections using a hash-based algorithm rather than strict round-robin, so the exact pattern can vary, but both VMs end up sharing the traffic.)

Test 2: Continuous Traffic (See Distribution)

Let’s generate continuous traffic to see the pattern:

On Windows (PowerShell):

# Run this 10 times (replace 20.30.40.50 with your load balancer IP)
for ($i=1; $i -le 10; $i++) {
    $response = Invoke-WebRequest -Uri "http://20.30.40.50"
    Write-Host "Request $i - $($response.Content)"
}

On Mac/Linux (Terminal):

# Run this 10 times (replace 20.30.40.50 with your load balancer IP)
for i in {1..10}; do
    echo "Request $i:"
    curl -s http://20.30.40.50 | grep "Hello from"
    echo ""
done

Expected output (the exact order may vary):

Request 1: Hello from VM 1
Request 2: Hello from VM 2
Request 3: Hello from VM 1
Request 4: Hello from VM 2
Request 5: Hello from VM 1
Request 6: Hello from VM 2
...

Even distribution! Each VM handles roughly half of the requests.

Conclusion

Today, I didn’t just build infrastructure; I built confidence in my design, my automation, and my understanding of why high availability matters.

“Uptime isn’t an accident; it’s a decision you make at the architecture level.”

As I continue my Azure DevSecOps challenge, this project reminded me that reliability is as important as scalability.


I’m Adedeji

I am a Microsoft MVP. Welcome to my blog, where I share my knowledge, experience, and career journey. I hope you enjoy it.

Let’s connect