Updated July 3, 2026

Postmortem: West Training Servers Unexpected Shutdown - March 10, 2025

Tags: postmortem, vm-shutdown, terraform, azure, disk-controller, monitoring, alerts, training-environment, incident-response

Critical Issue - Root Cause Analysis - West Training Servers shut down unexpectedly on 3/10

1. Summary

16 West Training VMs were shut down unexpectedly by two failed updates that set the VM disk controller type to NVMe, which is not supported on Gen1 VM images.

  • Two Terraform updates, run by two different SPNs, affected 8 VMs each:
  • Update 1 caused 8 VMs to stop, deallocate and remain off at 4:29 AM
  • Update 2 caused 8 VMs to stop, deallocate and remain off at 6:50 AM
  • All VMs were in the Epic Non-Prod subscription
  • NOTE: Citrix also had a provisioning update 2 hrs later that failed because the maximum CPU allocation was reached; this is not related to the shutdowns above

1.1. Initial Findings

  • A planned change to a VM, with a successful TF plan (test), triggered a state file update on multiple VMs deployed as a set
  • The TF plan did not indicate that changes to related deployments would affect the running status of the VMs upon failure
  • Monitoring alerts did fire, but emails were NOT sent due to a known issue with the current email distribution list (DL)
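
Because the plan succeeded without flagging the risk, one possible safeguard is to scan the machine-readable plan for VM attributes that are known to be dangerous to change. The sketch below is illustrative only: it assumes the VMs are managed as azurerm_linux_virtual_machine or azurerm_windows_virtual_machine resources and that the attribute is named disk_controller_type, neither of which is confirmed by this postmortem. The sample plan and resource addresses are hypothetical.

```python
import json

# Assumptions (not confirmed by the postmortem): resource types and attribute name.
VM_TYPES = {"azurerm_linux_virtual_machine", "azurerm_windows_virtual_machine"}
WATCHED_ATTRS = ("disk_controller_type",)


def risky_vm_changes(plan):
    """Return VM changes in a Terraform JSON plan that touch a watched attribute."""
    findings = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") not in VM_TYPES:
            continue
        change = rc.get("change") or {}
        # "before" is null on create and "after" is null on delete.
        before = change.get("before") or {}
        after = change.get("after") or {}
        for attr in WATCHED_ATTRS:
            if before.get(attr) != after.get(attr):
                findings.append({
                    "address": rc.get("address"),
                    "attribute": attr,
                    "before": before.get(attr),
                    "after": after.get(attr),
                })
    return findings


# Hypothetical plan fragment in the shape produced by `terraform show -json`.
SAMPLE_PLAN = json.loads("""
{"resource_changes": [
  {"address": "azurerm_windows_virtual_machine.train[0]",
   "type": "azurerm_windows_virtual_machine",
   "change": {"actions": ["update"],
              "before": {"disk_controller_type": "SCSI"},
              "after": {"disk_controller_type": "NVMe"}}},
  {"address": "azurerm_windows_virtual_machine.train[1]",
   "type": "azurerm_windows_virtual_machine",
   "change": {"actions": ["no-op"],
              "before": {"disk_controller_type": "SCSI"},
              "after": {"disk_controller_type": "SCSI"}}}
]}
""")

for finding in risky_vm_changes(SAMPLE_PLAN):
    print(f"{finding['address']}: {finding['attribute']} "
          f"{finding['before']} -> {finding['after']}")
```

In a pipeline, the input would come from a saved plan exported with `terraform show -json`, and any finding could block the apply until someone has reviewed it.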

1.2. Current State

  • The VMs had been off for approximately 2 hours when Epic admins logged in
  • The VMs were turned back on at ~11 AM EST

2. Alert Monitoring

  • All 16 VM shutdowns were detected by Alert Monitoring
  • The first set of alerts (8) was raised within 2 min, at 4:31 AM
  • The second set of alerts (8) was raised within 2 min, at 6:51 AM
  • After reboot, the status of 8 alerts was changed to Resolved

2.1. Initial Findings: Email Notification – failed

  • Monitoring alerts did fire, but emails were NOT sent due to a known issue with the current email distribution list (DL)
  • Need to determine why 8 alerts did NOT auto-resolve

2.2. Resolution

  • A new email notification group has been created to receive service notifications from Azure:
  • ohemrcloudalerts
  • Enabling Resource Health alerts to detect platform-related issues is planned in this PI

3. Terraform Apply Failed

  • The deployment to change the VM storage type from standard to NVMe-supported failed with the error:
Disk Controller Type property 'NVMe' is not supported by the OS image or disk specified for the VM. Disk Controller types supported by the OS are 'SCSI'.

3.1. Findings

  • Changes to VM disk controllers and storage type should be tested before deployment to an upper environment. This type of change caused the VM deployment to fail, and the Azure API put the VMs in a deallocated state
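
A pre-flight check along these lines could reject the change before Terraform ever calls the Azure API. This is a minimal sketch, assuming the VM generation is available as the Azure hyperVGeneration value ("V1" or "V2"). The supported-controller table is an assumption based on the error above, which reports SCSI as the only type supported by the Gen1 image. A Gen2 image is treated here as a necessary condition for NVMe, not a sufficient one, since the image and VM size must also support it. The VM names are hypothetical.

```python
# Assumed mapping of Hyper-V generation to disk controller types it can use.
# Gen1 -> SCSI only matches the error reported in this incident.
SUPPORTED_CONTROLLERS = {
    "V1": {"SCSI"},
    "V2": {"SCSI", "NVMe"},
}


def check_disk_controller(vm_name, hyper_v_generation, requested):
    """Return None if the requested controller looks safe, else a reason string."""
    supported = SUPPORTED_CONTROLLERS.get(hyper_v_generation, set())
    if requested in supported:
        return None
    allowed = ", ".join(sorted(supported)) or "unknown"
    return (f"{vm_name}: disk controller '{requested}' is not supported on "
            f"generation {hyper_v_generation} (supported: {allowed})")


# Hypothetical inventory: (VM name, generation, requested controller type).
requests = [
    ("west-train-01", "V1", "NVMe"),
    ("west-train-02", "V2", "NVMe"),
]

for name, generation, controller in requests:
    problem = check_disk_controller(name, generation, controller)
    print(problem or f"{name}: OK")
```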

3.2. Resolution

  • Update the code to ignore changes to Gen1 VMs, then test and provide recommendations to reduce the risk of resource state changes
  • Validate the servers in the workspace after deployment to confirm their health and state
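
The post-deployment validation could be as simple as checking the power state of every VM in the workspace once the apply finishes. The sketch below assumes the input is the JSON from `az vm list --show-details`, which includes a powerState field such as "VM running" or "VM deallocated". The VM names in the sample are hypothetical.

```python
import json


def vms_not_running(vm_details):
    """Return (name, powerState) for every VM that is not reported as running."""
    return sorted(
        (vm.get("name", "<unnamed>"), vm.get("powerState", "unknown"))
        for vm in vm_details
        if vm.get("powerState") != "VM running"
    )


# Hypothetical sample in the shape returned by `az vm list --show-details -o json`.
SAMPLE = json.loads("""
[
  {"name": "west-train-01", "powerState": "VM deallocated"},
  {"name": "west-train-02", "powerState": "VM running"},
  {"name": "west-train-03", "powerState": "VM stopped"}
]
""")

for name, state in vms_not_running(SAMPLE):
    print(f"{name}: {state}")
```

Run as the final pipeline step, a non-empty result would fail the deployment and surface the problem immediately, without relying on alert emails being delivered.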

4. Further Recommendations

  • Terraform core services changes should be done in CloudTest prior to moving to higher environments
  • Architecture changes to existing patterns (VM configs, etc.) should be tested, validated and then reviewed in the ARB
  • To prevent unintended deletes or adverse changes, Resource Locks for critical shared resources should be deployed as planned in the LLD:
Target | Level | Lock
Any core/shared networking | Resource Group | Delete
Virtual Networks | Resource Group | Delete
VNet Peerings | Resource Group | Delete
Routing Tables | Resource Group | Read
Network Security Groups | Resource Group | Read
Application Security Groups | Resource Group | Read
Virtual Appliances (NGFW, WAF, SD-WAN) | Resource Group | Delete
Domain controllers | Resource Group | Delete
Public IPs | Resource Group | Delete
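
As an illustration of how the lock table could be turned into deployable commands, the sketch below generates `az lock create` invocations for resource-group locks. It assumes that "Delete" in the table corresponds to the Azure lock type CanNotDelete and "Read" to ReadOnly; the LLD may define this differently. The resource group names and the lock naming scheme are hypothetical, and only two table rows are shown.

```python
import re
import shlex

# Assumed mapping from the LLD lock column to Azure lock types.
LOCK_TYPES = {"Delete": "CanNotDelete", "Read": "ReadOnly"}

# (target, lock, resource group); resource group names are hypothetical.
LLD_LOCKS = [
    ("Virtual Networks", "Delete", "rg-network-example"),
    ("Routing Tables", "Read", "rg-network-example"),
]


def lock_command(target, lock, resource_group):
    """Build an `az lock create` command for a resource-group-level lock."""
    slug = re.sub(r"[^a-z0-9]+", "-", target.lower()).strip("-")
    args = [
        "az", "lock", "create",
        "--name", f"{slug}-{lock.lower()}-lock",
        "--lock-type", LOCK_TYPES[lock],
        "--resource-group", resource_group,
    ]
    return shlex.join(args)


for target, lock, rg in LLD_LOCKS:
    print(lock_command(target, lock, rg))
```

Printing the commands instead of running them keeps the step reviewable; the same table could also drive lock resources in Terraform.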