MonitoringUpdated July 3, 2026

Runbook: Load Balancer VIP Availability Alert

runbookazure-monitoralertsload-balancernetworkinghealth-probestroubleshootingincident-responseservicenowinfrastructure-as-code

Runbook: Load Balancer VIP Availability Alert

Alert Details

Metric: VIP Availability
Threshold: <75% for 15 minutes
Severity: 1 (Error)

Impact

Frontend service disruption. External users may receive 502/503 errors or connection timeouts.

Investigation Steps

1. Check Load Balancer Health Probes

Azure Portal → Load Balancers → [LB Name] → Health Probes
Verify probe configuration (port, protocol, interval)
Check probe status (Success vs. Failed)

2. Review Backend Pool Health

Load Balancer → Backend Pools
Verify healthy backend instance count
Check which instances are failing health probes
Minimum healthy: 2 instances for production

3. Investigate Backend VM Issues

For each unhealthy backend:

Check VM availability (is it running?)
Check CPU/memory/disk metrics
Verify application service is running

Test health probe endpoint manually:

curl -v http://<backend-ip>:<probe-port>/health

4. Verify Network Connectivity

Check NSG rules allow health probe traffic
Verify firewall rules (source: Azure Load Balancer tag)
Confirm backend VMs in correct subnet/VNET

5. Review Recent Changes

Recent backend VM deployments
Application version changes
NSG/firewall rule modifications
Load balancer configuration changes

Remediation

[!WARNING] Infrastructure as Code Policy All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes.

Investigation Actions

Identify which backends are unhealthy (use investigation steps above)
Test health probe endpoint manually from healthy VM
Check NSG rules and firewall configurations
Document findings in ServiceNow incident

Short-Term Resolution

Open ServiceNow Incident:

If all backends unhealthy → Epic_Azure_Infrastructure_Ops:
- Tier 3 Support will investigate common issue (NSG change, port binding)
- Application service restart coordinated with Epic - Azure (National West)
- Network configuration review
If some backends unhealthy → Epic_Azure_Infrastructure_Ops:
- Backend pool member adjustment via Terraform
- Unhealthy instance investigation
- Health probe configuration review
If health endpoint failing → Epic - Azure (National West):
- Application team will fix health endpoint
- Port binding verification
- Application configuration correction

Long-Term Resolution

Create GitHub Issue: Epic on Azure Ops Issues

Engineering Team will implement permanent solutions:
- Auto-scaling for backend pool via Terraform
- Backend redundancy increase (minimum 3 instances)
- Health check retry logic in application code
- Application Gateway with WAF configuration
- Blue/green deployment implementation
All solutions implemented through CI/CD pipeline
Load balancer changes via Terraform, application fixes via standard deployment

Load Balancer vs. Backend Differentiation

Load Balancer Issue

All backends report unhealthy simultaneously
Health probe configuration incorrect
NSG blocking health probe traffic

Backend Issue

Individual backends failing over time
Application crashes or hangs
Resource exhaustion (CPU/memory/disk)

Escalation

Network Team: Open ServiceNow incident with Network assignment group for load balancer config issues
Epic_Azure_Infrastructure_Ops: Open ServiceNow incident for backend VM issues or persistent health probe failures
Epic - Azure (National West): Open ServiceNow incident if health endpoint failing

Related Alerts

DIP Availability (removed/disabled): Was generating duplicate alerts
Backend VM alerts (CPU, memory, availability)

Historical Context

VIP availability <90% threshold was too sensitive
New threshold <75% allows for backend rotation without alerting
<75% = <3.75 min downtime in 15-min window (acceptable for health probe cycles)