Monitoring · Updated July 3, 2026

Runbook: Windows Virtual Machine Low Disk Space

Tags: runbook, azure-monitor, alerts, vm-performance, disk-space, storage, troubleshooting, incident-response, servicenow, infrastructure-as-code

Alert Details

  • Metric: LogicalDisk % Free Space
  • Metric Namespace: Azure.VM.Windows.GuestMetrics
  • Critical Threshold: <10% free space for 15 minutes
  • Warning Threshold: <15% free space for 15 minutes
  • Resource Type: Windows Virtual Machines (requires Azure Monitor Agent)
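
The thresholds above can be sketched as a small classifier. This is a minimal illustration, not the alert rule itself: it assumes the rule compares the average of the samples in the 15-minute window against each threshold with a strict "less than" test. The function name `Get-DiskAlertSeverity` and its parameters are hypothetical.

```powershell
# Hypothetical helper: classify a window of "% Free Space" samples.
# Assumes the alert uses the window average and a strict LessThan comparison.
function Get-DiskAlertSeverity {
    param(
        [Parameter(Mandatory)]
        [double[]]$FreePercentSamples,
        [double]$CriticalThreshold = 10,
        [double]$WarningThreshold = 15
    )
    $average = ($FreePercentSamples | Measure-Object -Average).Average
    if ($average -lt $CriticalThreshold) { return 'critical' }
    if ($average -lt $WarningThreshold) { return 'warning' }
    return 'ok'
}

# Three samples from one 15-minute window, all below 10% free
Get-DiskAlertSeverity -FreePercentSamples 9.5, 9.8, 9.1
```

Because the comparison is strict, a window averaging exactly 10% free is classified as a warning rather than critical.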

Impact

Low disk space can cause:

  • Application failures and crashes
  • Database corruption
  • Log file write failures
  • Service interruptions
  • System instability
  • Inability to process new data or transactions

Investigation Steps

1. Check Disk Space Metrics in Azure Portal

  • Navigate to Azure Portal → Virtual Machines → [VM Name] → Metrics
  • Select "LogicalDisk % Free Space" metric
  • Adjust time range to last 24 hours
  • Identify which logical disk (C:, D:, E:, etc.) triggered the alert
  • Look for patterns: gradual growth vs. sudden spike
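
One way to tell gradual growth from a sudden spike is to compare the largest single drop between consecutive samples with the total drop over the period. The sketch below assumes the free-space percentages have been copied out of the metrics chart, oldest first. The function name and the 50% cut-off (`-SuddenShare`) are illustrative assumptions, not values from this runbook.

```powershell
# Hypothetical helper: label a free-space series as gradual or sudden.
# A series counts as "sudden" when one step accounts for at least
# $SuddenShare of the total drop (0.5 is an assumed default).
function Get-FreeSpacePattern {
    param(
        [Parameter(Mandatory)]
        [double[]]$FreePercentSamples,
        [double]$SuddenShare = 0.5
    )
    if ($FreePercentSamples.Count -lt 2) { return 'insufficient-data' }

    $totalDrop = $FreePercentSamples[0] - $FreePercentSamples[-1]
    if ($totalDrop -le 0) { return 'stable-or-recovering' }

    $largestStep = 0.0
    for ($i = 1; $i -lt $FreePercentSamples.Count; $i++) {
        $step = $FreePercentSamples[$i - 1] - $FreePercentSamples[$i]
        if ($step -gt $largestStep) { $largestStep = $step }
    }

    if ($largestStep -ge ($SuddenShare * $totalDrop)) { return 'sudden' }
    return 'gradual'
}

# Hourly samples, oldest first
Get-FreeSpacePattern -FreePercentSamples 30, 29, 28, 27, 26, 25
Get-FreeSpacePattern -FreePercentSamples 30, 30, 30, 12, 12
```

With only two samples any drop is reported as sudden, so use as many data points as the chart provides.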

2. Identify Disk Usage and Large Files

[!NOTE] VM Connection Methods

  • Azure Portal: VM → Connect → Choose connection method (Bastion, RDP, Serial Console)
  • Access Requirements: Contributor or Virtual Machine Contributor role on the VM or its resource group
  • Serial Console: Requires boot diagnostics enabled (Azure Portal → VM → Boot diagnostics)
  • Network Access: Bastion provides browser-based access without public IP requirements

Windows:

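# Connect via RDP (Azure Portal → VM → Connect → RDP)
# OR via Azure Serial Console (VM → Serial Console); on Windows this opens the SAC prompt,
# so start a CMD channel there and launch powershell before running the commands below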

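# Check disk space on all drives (FreePercent corresponds to the alert metric)
# Drives that report no size (for example an empty optical drive) are skipped to avoid dividing by zero
Get-PSDrive -PSProvider FileSystem | Where-Object { ($_.Used + $_.Free) -gt 0 } |
  Select-Object Name, Used, Free, @{Name="UsedPercent";Expression={[math]::Round(($_.Used / ($_.Used + $_.Free)) * 100, 2)}}, @{Name="FreePercent";Expression={[math]::Round(($_.Free / ($_.Used + $_.Free)) * 100, 2)}}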

# Find largest top-level folders on C: drive (top 20); hidden and system items are skipped unless -Force is added
Get-ChildItem C:\ -Directory -ErrorAction SilentlyContinue |
  ForEach-Object {
    $size = (Get-ChildItem $_.FullName -Recurse -ErrorAction SilentlyContinue |
      Measure-Object -Property Length -Sum).Sum
    [PSCustomObject]@{
      Path = $_.FullName
      SizeGB = [math]::Round($size / 1GB, 2)
    }
  } | Sort-Object SizeGB -Descending | Select-Object -First 20

# Find largest files (top 20)
Get-ChildItem C:\ -File -Recurse -ErrorAction SilentlyContinue |
  Sort-Object Length -Descending |
  Select-Object -First 20 FullName, @{Name="SizeMB";Expression={[math]::Round($_.Length / 1MB, 2)}}

# Check log file sizes
Get-ChildItem "C:\Windows\Logs" -Recurse -ErrorAction SilentlyContinue |
  Measure-Object -Property Length -Sum |
  Select-Object @{Name="TotalSizeGB";Expression={[math]::Round($_.Sum / 1GB, 2)}}

# Check temp folder sizes ($env:TEMP is the current user's temp folder only; repeat for C:\Windows\Temp)
Get-ChildItem $env:TEMP -Recurse -ErrorAction SilentlyContinue |
  Measure-Object -Property Length -Sum |
  Select-Object @{Name="TotalSizeGB";Expression={[math]::Round($_.Sum / 1GB, 2)}}

3. Check Application Logs

  • Review application logs for disk-related errors or warnings
  • Check Epic application logs (if applicable):
    • Epic Cache logs
    • Interconnect logs
    • Print Spool directories
  • Look for failed log rotation or archiving processes
  • Check database logs for growth patterns
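
A single very large log file is often a sign that rotation or archiving has stopped. The following sketch lists log files above a size limit under a given directory. The function name, the `*.log` filter, and the 500 MB default are assumptions for illustration. Substitute the application's actual log directory and naming pattern.

```powershell
# Hypothetical helper: list log files at or above a size limit, largest first.
function Find-OversizedLog {
    param(
        [Parameter(Mandatory)][string]$Path,
        [double]$MinSizeMB = 500,
        [string]$Filter = '*.log'
    )
    Get-ChildItem -Path $Path -Filter $Filter -File -Recurse -ErrorAction SilentlyContinue |
        Where-Object { $_.Length -ge ($MinSizeMB * 1MB) } |
        Sort-Object Length -Descending |
        Select-Object FullName, LastWriteTime,
            @{ Name = 'SizeMB'; Expression = { [math]::Round($_.Length / 1MB, 2) } }
}

# Read-only check of the IIS log directory
Find-OversizedLog -Path 'C:\inetpub\logs' -MinSizeMB 500
```

The function only reads file metadata, so it is safe to run during investigation.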

4. Review Disk Growth Trends

  • Azure Portal → VM → Metrics → "LogicalDisk % Free Space"
  • Analyze historical data to understand growth rate
  • Determine if this is gradual growth or sudden consumption
  • Correlate with application deployments or batch job schedules
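
The growth rate and a rough time-to-full can be estimated from two observations of free space. This is a minimal sketch that assumes roughly linear growth between the two points, so a sudden spike will make the estimate misleading. The function and parameter names are hypothetical.

```powershell
# Hypothetical helper: linear estimate from two free-space observations.
function Get-DiskFullEstimate {
    param(
        [Parameter(Mandatory)][datetime]$StartTime,
        [Parameter(Mandatory)][double]$StartFreeGB,
        [Parameter(Mandatory)][datetime]$EndTime,
        [Parameter(Mandatory)][double]$EndFreeGB
    )
    $days = ($EndTime - $StartTime).TotalDays
    if ($days -le 0) { throw 'EndTime must be later than StartTime.' }

    # Positive value means free space is shrinking
    $growthPerDay = ($StartFreeGB - $EndFreeGB) / $days

    # No estimate when free space is flat or increasing
    $daysUntilFull = $null
    if ($growthPerDay -gt 0) {
        $daysUntilFull = [math]::Round($EndFreeGB / $growthPerDay, 1)
    }

    [PSCustomObject]@{
        GrowthGBPerDay = [math]::Round($growthPerDay, 2)
        DaysUntilFull  = $daysUntilFull
    }
}

# 50 GB free a week ago, 36 GB free now
Get-DiskFullEstimate -StartTime '2026-01-01' -StartFreeGB 50 -EndTime '2026-01-08' -EndFreeGB 36
```

In this example the disk loses 2 GB per day, so the remaining 36 GB would last about 18 days at the same rate.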

5. Check for Common Space Consumers

Common Windows locations to check:

  • C:\Windows\Logs - Windows system logs
  • C:\Windows\Temp - Windows temp files
  • C:\Users\*\AppData\Local\Temp - User temp files
  • C:\inetpub\logs - IIS logs
  • SQL Server log files
  • Application-specific log directories
  • Database backup files
  • Windows Update cache
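
The fixed paths in the list above can be sized in one pass. The sketch below accepts wildcard patterns such as `C:\Users\*\AppData\Local\Temp` and reports one row per matching folder. The function name is hypothetical, and locations that vary by server (SQL Server logs, application logs, backups) must be added by hand.

```powershell
# Hypothetical helper: report the size of each folder matching the given patterns.
function Get-LocationUsage {
    param(
        [Parameter(Mandatory)][string[]]$Path
    )
    foreach ($pattern in $Path) {
        # Resolve-Path expands wildcards; missing paths are skipped silently
        $resolved = Resolve-Path -Path $pattern -ErrorAction SilentlyContinue
        foreach ($item in $resolved) {
            $bytes = (Get-ChildItem -LiteralPath $item.ProviderPath -File -Recurse -ErrorAction SilentlyContinue |
                Measure-Object -Property Length -Sum).Sum
            [PSCustomObject]@{
                Path   = $item.ProviderPath
                # An empty folder yields $null, which converts to 0
                SizeGB = [math]::Round(([double]$bytes) / 1GB, 2)
            }
        }
    }
}

$commonPaths = @(
    'C:\Windows\Logs'
    'C:\Windows\Temp'
    'C:\Users\*\AppData\Local\Temp'
    'C:\inetpub\logs'
)
Get-LocationUsage -Path $commonPaths | Sort-Object SizeGB -Descending
```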

Remediation

[!WARNING] Infrastructure as Code Policy

All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes to infrastructure.

Investigation Actions

  1. Identify the logical disk with low space (use investigation steps above)
  2. Determine the top space-consuming folders and files
  3. Review disk space growth trends to estimate when disk will be full
  4. Document findings in ServiceNow incident with:
    • Affected disk (C:, D:, etc.)
    • Current free space percentage
    • Top 5 space-consuming folders/files
    • Growth rate (GB per day/week)
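
The findings listed above can be assembled into a consistent text block for the incident notes. This is only a formatting sketch: the function name and the layout are assumptions, and it does not call any ServiceNow API.

```powershell
# Hypothetical helper: format investigation findings for pasting into an incident.
function New-DiskIncidentSummary {
    param(
        [Parameter(Mandatory)][string]$Disk,
        [Parameter(Mandatory)][double]$FreePercent,
        [Parameter(Mandatory)][string[]]$TopConsumers,
        [Parameter(Mandatory)][double]$GrowthGBPerDay
    )
    $lines = @(
        "Affected disk: $Disk"
        "Current free space: $FreePercent%"
        "Growth rate: $GrowthGBPerDay GB/day"
        'Top space consumers:'
    )
    # Keep only the first five entries, as the runbook asks for the top 5
    $lines += $TopConsumers | Select-Object -First 5 | ForEach-Object { "  - $_" }
    $lines -join "`n"
}

New-DiskIncidentSummary -Disk 'C:' -FreePercent 8.4 -GrowthGBPerDay 1.5 `
    -TopConsumers 'C:\inetpub\logs (42 GB)', 'C:\Windows\Temp (11 GB)'
```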

Short-Term Resolution

Open ServiceNow Incident with Epic_Azure_Infrastructure_Ops:

  • Tier 3 Support will review and implement changes via incident or change request:
    • Disk expansion (must be done through Terraform/IaC)
    • Safe cleanup of temporary files and logs
    • Log rotation configuration
    • Move old data to Azure Blob Storage (archive tier)
    • Database log file shrinking (if safe and appropriate)
  • All changes implemented through Terraform/IaC
  • No manual Azure Portal disk resizing

Safe Temporary Cleanup (coordinate with application teams):

  • Clear Windows temp files (use Disk Cleanup utility)
  • Archive old application logs
  • Clear IIS logs older than retention period
  • Remove old Windows Update files
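
For log files past their retention period, a dry run shows what would be removed before anything is deleted. The sketch below uses PowerShell's standard `-WhatIf` support. The function name, the `*.log` filter, and the 30-day value in the example are assumptions, so confirm the retention period with the application team first.

```powershell
# Hypothetical helper: remove log files older than the retention period.
# Supports -WhatIf (list only) and -Confirm through SupportsShouldProcess.
function Remove-ExpiredLog {
    [CmdletBinding(SupportsShouldProcess, ConfirmImpact = 'High')]
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][int]$RetentionDays,
        [string]$Filter = '*.log'
    )
    $cutoff = (Get-Date).AddDays(-$RetentionDays)
    Get-ChildItem -Path $Path -Filter $Filter -File -Recurse -ErrorAction SilentlyContinue |
        Where-Object { $_.LastWriteTime -lt $cutoff } |
        ForEach-Object {
            if ($PSCmdlet.ShouldProcess($_.FullName, 'Remove expired log')) {
                Remove-Item -LiteralPath $_.FullName -Force
            }
        }
}

# Dry run: prints the files that would be removed and deletes nothing
Remove-ExpiredLog -Path 'C:\inetpub\logs' -RetentionDays 30 -WhatIf
```

Running the same command without `-WhatIf` prompts for confirmation on each file because the function declares a high confirm impact.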

Long-Term Resolution

Create GitHub Issue: Epic on Azure Ops Issues

  • Engineering Team will implement permanent solutions:
    • Proper disk sizing based on workload analysis
    • Automated log rotation and archiving
    • Log forwarding to centralized logging (Splunk)
    • Database maintenance plans for log management
    • Storage tiering strategy (hot data on VM, cold data on Blob Storage)
    • Monitoring and alerting for log file growth
    • Disk auto-expansion policies via Terraform
  • All solutions implemented through CI/CD pipeline
  • Changes tracked via GitHub issue → PR → deployment workflow

Terraform Configuration Example

module "metric_alerts_disk_space" {
  source                   = "terraform.uhg.com/uhg-customer-modules/private-registry-metric-alerts/epic"
  version                  = "1.7.5"
  short_name               = "DISK"
  explanation              = "Low disk space detected. Investigation: Check disk usage trends, identify large files/folders, review application logs for disk space issues. Remediation: Clean up old logs, expand disk size, move data to alternate storage."
  target_resource_location = local.resource_location
  target_resource_type     = "Microsoft.Compute/virtualMachines"
  target_name              = local.target_name
  metric_name              = "LogicalDisk % Free Space"
  metric_namespace         = "Azure.VM.Windows.GuestMetrics"
  resource_group_name      = azurerm_resource_group.ohemr-rg.name
  scopes                   = concat(local.alert_scopes_app_rg_ids, local.alert_scopes_odb_rg_ids)

  email_recipients = {
    prod_action_group = {
      name                    = "prod_action_group"
      email_address           = local.recipients
      use_common_alert_schema = true
    }
  }

  event_hub = {
    event_hub_npd = {
      name                    = "As per region"
      event_hub_namespace     = "As per region"
      event_hub_name          = "diagnostic-logs"
      use_common_alert_schema = false
    }
  }

  alerts = {
    critical = {
      threshold      = 10 # Less than 10% free space
      severity       = 0
      aggregation    = "Average"
      operator       = "LessThan"
      severity_name  = "critical"
      metric_details = "Disk Free Space Less than 10% ${local.environment} ${local.resource_location}"
      frequency      = "PT5M"
      window_size    = "PT15M"
    }

    warning = {
      threshold      = 15 # Less than 15% free space
      severity       = 2
      aggregation    = "Average"
      operator       = "LessThan"
      severity_name  = "warning"
      metric_details = "Disk Free Space Less than 15% ${local.environment} ${local.resource_location}"
      frequency      = "PT5M"
      window_size    = "PT15M"
    }
  }
}
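
The `frequency` and `window_size` values are ISO 8601 durations. Assuming the module passes them through to an Azure Monitor metric alert unchanged, the rule is evaluated every `frequency` over the preceding `window_size`. The sketch below converts the two values to minutes to make that relationship explicit. The function name is hypothetical.

```powershell
# Hypothetical helper: interpret the ISO 8601 durations used in the alert blocks.
function Get-AlertEvaluationInfo {
    param(
        [Parameter(Mandatory)][string]$Frequency,
        [Parameter(Mandatory)][string]$WindowSize
    )
    # XmlConvert parses ISO 8601 durations such as PT5M into a TimeSpan
    $frequencySpan = [System.Xml.XmlConvert]::ToTimeSpan($Frequency)
    $windowSpan    = [System.Xml.XmlConvert]::ToTimeSpan($WindowSize)
    [PSCustomObject]@{
        EvaluateEveryMinutes = $frequencySpan.TotalMinutes
        LookBackMinutes      = $windowSpan.TotalMinutes
        EvaluationsPerWindow = $windowSpan.TotalMinutes / $frequencySpan.TotalMinutes
    }
}

Get-AlertEvaluationInfo -Frequency 'PT5M' -WindowSize 'PT15M'
```

With the values in this configuration, each 15-minute window is covered by three consecutive 5-minute evaluations, so a short dip can influence up to three evaluations.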

Escalation

  • Epic_Azure_Infrastructure_Ops: Open a ServiceNow incident if free disk space falls below 5% or the disk requires immediate expansion
  • Epic - Azure (National West): Open a ServiceNow incident if the disk space issue is caused by application behavior or database growth, such as:
    • Application log file explosion
    • Database transaction log growth
    • Application data retention issues

Related Alerts

  • High CPU Usage (disk I/O operations can cause CPU spikes)
  • Application Performance Issues (low disk space can cause application errors)
  • Database Performance (low disk space affects database operations)

Historical Context

Common causes in the OHEMR Epic environment:

  • IIS log files not rotated properly
  • Epic Print Spool directory growth
  • SQL Server transaction log growth
  • Windows Update cache accumulation
  • Epic Cache local storage growth
  • Temp file accumulation from batch jobs