Monitoring | Updated July 3, 2026
Runbook: Virtual Machine Availability Alert
Tags: runbook, azure-monitor, alerts, vm-availability, disaster-recovery, troubleshooting, incident-response, servicenow, infrastructure-as-code
Alert Details
- Metric: VM Availability
- Threshold: <50% for 30 minutes
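As a rough illustration of the threshold above, the sketch below checks whether every availability sample in the trailing 30 minutes is below 50%. The sample format (a list of timestamp and percentage pairs) and the "all samples below the threshold" rule are assumptions made for illustration. The real aggregation is whatever the alert rule in Azure Monitor is configured to use, and it may differ.

```python
from datetime import datetime, timedelta

THRESHOLD_PERCENT = 50.0        # from the alert definition
WINDOW = timedelta(minutes=30)  # from the alert definition


def breaches_threshold(samples, now):
    """Return True if all samples in the trailing window are below the threshold.

    `samples` is a hypothetical list of (timestamp, availability_percent) pairs.
    An empty window returns False: missing data is not treated as a breach here.
    """
    recent = [pct for ts, pct in samples if now - WINDOW <= ts <= now]
    return bool(recent) and all(pct < THRESHOLD_PERCENT for pct in recent)
```

A single healthy sample inside the window is enough to keep this check from reporting a breach, which is one reason short blips do not match the 30-minute condition.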
Impact
Service disruption: users cannot access the application. Data may be lost if the VM crashed.
Investigation Steps
1. Check VM Status
- Azure Portal → Virtual Machines → [VM Name] → Overview
- Check current status: Running, Stopped, Deallocated, Failed
- Review "Resource Health" section
2. Review Activity Log
- Azure Portal → VM → Activity Log
- Filter last 2 hours
- Look for:
  - Planned maintenance events
  - User-initiated reboots/shutdowns
  - Azure platform events (host failures)
  - Auto-shutdown policies
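If the Activity Log has been exported for offline review, the filtering described above can be sketched as follows. This is a minimal sketch, not an Azure SDK call: the event dictionaries with `timestamp` and `operation` keys are a hypothetical simplified shape, and the keyword-to-category mapping is illustrative rather than exhaustive.

```python
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(hours=2)  # "Filter last 2 hours"

# Illustrative keyword -> category hints (assumed, not an official mapping).
KEYWORDS = [
    ("maintenance", "planned maintenance"),
    ("schedule", "auto-shutdown policy"),
    ("restart", "user-initiated reboot/shutdown"),
    ("poweroff", "user-initiated reboot/shutdown"),
    ("deallocate", "user-initiated reboot/shutdown"),
    ("health", "platform event"),
]


def categorize(operation):
    """Map an operation name to a runbook category using keyword hints."""
    op = operation.lower()
    for keyword, category in KEYWORDS:
        if keyword in op:
            return category
    return "other"


def recent_events_by_category(events, now):
    """Group events from the last two hours by category."""
    grouped = {}
    for event in events:
        if now - LOOKBACK <= event["timestamp"] <= now:
            grouped.setdefault(categorize(event["operation"]), []).append(event)
    return grouped
```

Events that land in "other" still need a manual look, since the keyword hints cannot distinguish every operation type.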
3. Check Boot Diagnostics
- Azure Portal → VM → Boot Diagnostics
- Review serial console output
- Look for OS-level boot failures
- Check for kernel panics (Linux) or BSOD (Windows)
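When the serial console output is long, a quick text scan can point to the relevant lines. The sketch below assumes the log has been saved as plain text. The patterns are illustrative examples only; real console messages vary by OS and version, so an empty result does not rule out a boot failure.

```python
import re

# Illustrative patterns only (assumed); extend them for your OS images.
FAILURE_PATTERNS = {
    "kernel panic (Linux)": re.compile(r"kernel panic", re.IGNORECASE),
    "out of memory (Linux)": re.compile(r"out of memory", re.IGNORECASE),
    "disk full": re.compile(r"no space left on device", re.IGNORECASE),
    "stop error (Windows)": re.compile(r"\bbugcheck\b|\bstop error\b", re.IGNORECASE),
}


def find_boot_failures(console_text):
    """Return (line_number, label, line) tuples for lines matching a pattern."""
    hits = []
    for number, line in enumerate(console_text.splitlines(), start=1):
        for label, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
    return hits
```

The matched lines and their line numbers are the kind of detail worth copying into the ServiceNow incident.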
4. Review Resource Health
- Azure Portal → VM → Resource Health
- Check for known Azure platform issues
- Review historical availability data
5. Correlate with Other Alerts
- Check if memory alert fired before availability drop (OOM crash)
- Check if disk alert fired (disk exhaustion crash)
- Review network alerts (connectivity loss)
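The correlation step can be sketched as a simple time-window lookup. In this sketch the alert tuples, the alert type names (`memory`, `disk`, `network`), and the 60-minute lookback are all assumptions for illustration; the runbook does not define a lookback, so adjust it to match how far ahead of a crash these alerts typically fire.

```python
from datetime import datetime, timedelta

# Likely cause per alert type, taken from the correlation checks above.
LIKELY_CAUSE = {
    "memory": "OOM crash",
    "disk": "disk exhaustion crash",
    "network": "connectivity loss",
}


def preceding_alerts(alerts, availability_drop, lookback=timedelta(minutes=60)):
    """Return alerts that fired within `lookback` before the availability drop.

    `alerts` is a hypothetical list of (alert_type, fired_at) pairs.
    The result is a list of (fired_at, alert_type, likely_cause), oldest first.
    """
    matches = [
        (fired_at, alert_type, LIKELY_CAUSE[alert_type])
        for alert_type, fired_at in alerts
        if alert_type in LIKELY_CAUSE
        and availability_drop - lookback <= fired_at <= availability_drop
    ]
    return sorted(matches)
```

An alert that fired shortly before the drop is a lead, not proof of the cause; confirm it against boot diagnostics and application logs.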
Remediation
[!WARNING] Infrastructure as Code Policy: All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes.
Investigation Actions
- Check boot diagnostics logs for root cause
- Review application logs from before crash
- Verify VM resource sufficiency (CPU/memory/disk)
- Document VM state and crash details in ServiceNow incident
Short-Term Resolution
Open a ServiceNow incident with Epic_Azure_Infrastructure_Ops:
- Tier 3 Support will review and implement the resolution:
  - VM restart (if stopped)
  - Azure Support ticket (if platform failure)
  - Snapshot creation (if recurring crash)
  - Resource adjustment (if undersized)
- For platform events: monitor Azure Service Health and wait for Azure resolution
- All changes are coordinated with application teams
Long-Term Resolution
Create a GitHub issue: Epic on Azure Ops Issues
- The Engineering Team will implement permanent solutions:
  - Auto-healing for scale sets via Terraform
  - Multiple instance deployment with load balancer
  - Application crash root cause fix
  - Availability Zones configuration for higher SLA
  - Azure Site Recovery for disaster recovery
- All solutions are implemented through the CI/CD pipeline
- Infrastructure changes go through Terraform; application fixes go through standard deployment
Differentiate Planned vs. Unplanned
Planned Maintenance (Ignore)
- User-initiated reboot
- Azure planned maintenance window
- Patching windows (Tuesday 2-6 AM, Weekend 2-4 AM)
- Auto-shutdown schedules
Unplanned Downtime (Investigate)
- Crash/hang without user action
- Azure platform host failure
- Application-induced crash (OOM, disk full)
- Network isolation
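To check whether a downtime start falls inside one of the patching windows listed above, a small helper like the one below can be used. It is a sketch under stated assumptions: "Weekend" is read as both Saturday and Sunday, times are local to wherever the windows are defined (the runbook does not name a time zone), and the end of each window is treated as exclusive.

```python
from datetime import datetime, time

# datetime.weekday(): Monday=0 ... Sunday=6
PATCH_WINDOWS = {
    1: (time(2, 0), time(6, 0)),  # Tuesday 2-6 AM
    5: (time(2, 0), time(4, 0)),  # Saturday 2-4 AM (assumed part of "Weekend")
    6: (time(2, 0), time(4, 0)),  # Sunday 2-4 AM (assumed part of "Weekend")
}


def in_patch_window(moment):
    """Return True if `moment` (a naive local datetime) is inside a patch window."""
    window = PATCH_WINDOWS.get(moment.weekday())
    if window is None:
        return False
    start, end = window
    return start <= moment.time() < end
```

Downtime that starts inside a window but continues well past its end still deserves investigation, since a failed patch can leave the VM unable to boot.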
Escalation
- Epic_Azure_Infrastructure_Ops: Open ServiceNow incident for persistent availability issues or recurring crashes
- Azure Support: For platform-level failures
- Epic - Azure (National West): Open ServiceNow incident if the crash was caused by application code
Related Alerts
- Low Memory (the OOM killer terminating critical processes can hang or crash the VM)
- Disk IOPS/Space (a full disk can crash the OS or application)
- CPU Spike (may precede hang/crash)
Historical Context
Common causes in the OHEMR Epic environment:
- Care Everywhere VMs: memory exhaustion crashes
- Planned patching windows (alerts should be suppressed during these windows)
- Epic background process crashes