Updated July 3, 2026

Runbook: Load Balancer VIP Availability Alert

Alert Details

  • Metric: VIP Availability (Azure Monitor metric VipAvailability, shown in the portal as Data Path Availability)
  • Threshold: <75% for 15 minutes
  • Severity: 1 (Error)
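
To confirm the alert against raw data, the metric can be pulled with the Azure CLI and compared to the threshold. The following is a minimal sketch, not the alert rule itself: it assumes the rule averages the metric (ID VipAvailability) over the 15-minute window, and the function names and the load balancer resource ID argument are hypothetical.

```shell
# Prints per-minute averages of VipAvailability, one per line (CLI default range: last hour).
# Usage: fetch_vip_availability <load-balancer-resource-id>
fetch_vip_availability() {
  az monitor metrics list \
    --resource "$1" \
    --metric VipAvailability \
    --aggregation Average \
    --interval PT1M \
    --query "value[0].timeseries[0].data[].average" \
    --output tsv
}

# Reads numeric samples on stdin and averages the last <count> of them.
# Exit status: 0 if the mean is below <threshold>, 1 if not, 2 if no samples.
# Lines that are empty or read "None" (missing samples) are skipped.
# Usage: ... | below_threshold <threshold> <count>
below_threshold() {
  tail -n "$2" | awk -v t="$1" '
    $1 != "" && $1 != "None" { sum += $1; n++ }
    END { if (n == 0) exit 2; exit !((sum / n) < t) }'
}
```

For example, `fetch_vip_availability "$LB_ID" | below_threshold 75 15` succeeds only if the last 15 one-minute samples average below 75, where `LB_ID` is a placeholder for the full resource ID.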

Impact

Frontend service disruption. External users may receive 502/503 errors or connection timeouts.

Investigation Steps

1. Check Load Balancer Health Probes

  • Azure Portal → Load Balancers → [LB Name] → Health Probes
  • Verify probe configuration (port, protocol, interval)
  • Check probe status (Success vs. Failed)

2. Review Backend Pool Health

  • Load Balancer → Backend Pools
  • Verify healthy backend instance count
  • Check which instances are failing health probes
  • Minimum healthy: 2 instances for production

3. Investigate Backend VM Issues

For each unhealthy backend:

  • Check VM availability (is it running?)

  • Check CPU/memory/disk metrics

  • Verify application service is running

  • Test health probe endpoint manually:

    curl -v http://<backend-ip>:<probe-port>/health
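
When several backends need checking, the same test can be wrapped in a small helper. This is a sketch under stated assumptions: the probe is an HTTP probe against the /health path used above, the 5-second timeout is an arbitrary choice, and the function names are hypothetical. Azure Load Balancer HTTP probes treat only a 200 response as healthy, which is what the check mirrors.

```shell
# Succeeds only for HTTP 200, the one status an Azure LB HTTP probe counts as healthy.
is_probe_success() {
  [ "$1" = "200" ]
}

# Requests the health endpoint once and reports the result.
# Usage: check_backend <backend-ip> <probe-port>
check_backend() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://$1:$2/health")
  if is_probe_success "$code"; then
    echo "$1 healthy ($code)"
  else
    # curl reports 000 when no HTTP response arrived (timeout, refused, unreachable)
    echo "$1 UNHEALTHY ($code)"
    return 1
  fi
}

# Example loop over placeholder backend IPs and port:
# for ip in 10.0.1.4 10.0.1.5; do check_backend "$ip" 8080; done
```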
    

4. Verify Network Connectivity

  • Check that NSG rules allow health probe traffic
  • Verify firewall rules allow the AzureLoadBalancer service tag as source (probes originate from 168.63.129.16)
  • Confirm backend VMs are in the correct subnet/VNet

5. Review Recent Changes

  • Recent backend VM deployments
  • Application version changes
  • NSG/firewall rule modifications
  • Load balancer configuration changes
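
Control-plane changes such as NSG edits, load balancer configuration updates, and VM deployments are recorded in the Azure Activity Log. A minimal sketch for listing them follows; the function name is hypothetical, and it assumes the load balancer and backends share one resource group. Application version changes are not captured here and need to be checked in the deployment pipeline.

```shell
# Lists Activity Log entries for a resource group over a look-back window.
# Usage: list_recent_changes <resource-group> [offset]   (offset defaults to 24h)
list_recent_changes() {
  az monitor activity-log list \
    --resource-group "$1" \
    --offset "${2:-24h}" \
    --query "[].{time:eventTimestamp, operation:operationName.localizedValue, caller:caller, status:status.value}" \
    --output table
}
```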

Remediation

WARNING (Infrastructure as Code Policy): All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes.

Investigation Actions

  1. Identify which backends are unhealthy (use investigation steps above)
  2. Test health probe endpoint manually from a healthy VM
  3. Check NSG rules and firewall configurations
  4. Document findings in ServiceNow incident

Short-Term Resolution

Open ServiceNow Incident:

  • If all backends unhealthy → Epic_Azure_Infrastructure_Ops:
    • Tier 3 Support will investigate a common cause (NSG change, port binding)
    • Application service restart coordinated with Epic - Azure (National West)
    • Network configuration review
  • If some backends unhealthy → Epic_Azure_Infrastructure_Ops:
    • Backend pool member adjustment via Terraform
    • Unhealthy instance investigation
    • Health probe configuration review
  • If health endpoint failing → Epic - Azure (National West):
    • Application team will fix health endpoint
    • Port binding verification
    • Application configuration correction

Long-Term Resolution

Create GitHub Issue: Epic on Azure Ops Issues

  • Engineering Team will implement permanent solutions:
    • Auto-scaling for backend pool via Terraform
    • Backend redundancy increase (minimum 3 instances)
    • Health check retry logic in application code
    • Application Gateway with WAF configuration
    • Blue/green deployment implementation
  • All solutions implemented through CI/CD pipeline
  • Load balancer changes via Terraform, application fixes via standard deployment

Load Balancer vs. Backend Differentiation

Load Balancer Issue

  • All backends report unhealthy simultaneously
  • Health probe configuration incorrect
  • NSG blocking health probe traffic

Backend Issue

  • Individual backends failing over time
  • Application crashes or hangs
  • Resource exhaustion (CPU/memory/disk)
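
The first distinguishing signal above (all backends unhealthy at once versus only some) can be expressed as a simple triage heuristic. The sketch below is illustrative only: the function name and output labels are hypothetical, and a result of "load-balancer" means the load balancer, probe, or NSG path should be checked first, not that the cause is confirmed.

```shell
# Classifies the likely fault domain from backend health counts.
# Usage: classify_scope <unhealthy-count> <total-count>
classify_scope() {
  cs_unhealthy=$1
  cs_total=$2
  if [ "$cs_total" -le 0 ]; then
    echo "unknown"            # empty pool: nothing to classify
  elif [ "$cs_unhealthy" -eq 0 ]; then
    echo "none"               # every backend passes its probe
  elif [ "$cs_unhealthy" -eq "$cs_total" ]; then
    echo "load-balancer"      # all backends unhealthy simultaneously
  else
    echo "backend"            # only some backends failing
  fi
}
```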

Escalation

  • Network Team: Open ServiceNow incident with Network assignment group for load balancer config issues
  • Epic_Azure_Infrastructure_Ops: Open ServiceNow incident for backend VM issues or persistent health probe failures
  • Epic - Azure (National West): Open ServiceNow incident if health endpoint failing

Related Alerts

  • DIP Availability (removed/disabled): Was generating duplicate alerts
  • Backend VM alerts (CPU, memory, availability)

Historical Context

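  • The previous threshold of <90% VIP availability was too sensitive
  • The new threshold of <75% allows for backend rotation without alerting
  • 75% availability = 3.75 min of downtime in a 15-min window (25% of 15 min); the alert fires only when downtime exceeds this, which leaves room for normal health probe cycles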
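
The downtime budget implied by a threshold is the unavailable fraction multiplied by the window length. A minimal sketch of that arithmetic follows, assuming availability is averaged across the whole window; the function name is hypothetical.

```shell
# Downtime (in seconds) tolerated before availability drops below the threshold.
# Usage: downtime_budget_seconds <threshold-percent> <window-minutes>
downtime_budget_seconds() {
  # (100 - threshold)% of the window, converted to seconds
  echo $(( (100 - $1) * $2 * 60 / 100 ))
}

downtime_budget_seconds 75 15   # prints 225 (3.75 minutes)
downtime_budget_seconds 90 15   # prints 90  (1.5 minutes)
```

Under this reading, the old 90% threshold alerted after 1.5 minutes of downtime in a window, while the 75% threshold allows 3.75 minutes.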