Monitoring
Updated July 3, 2026

Runbook: Windows Virtual Machine Low Disk Space

Tags: runbook, azure-monitor, alerts, vm-performance, disk-space, storage, troubleshooting, incident-response, servicenow, infrastructure-as-code

Alert Details

Metric: LogicalDisk % Free Space
Metric Namespace: Azure.VM.Windows.GuestMetrics
Critical Threshold: <10% free space for 15 minutes
Warning Threshold: <15% free space for 15 minutes
Resource Type: Windows Virtual Machines (requires Azure Monitor Agent)
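
For a quick check from inside the guest, the thresholds above can be applied to the current free-space percentage of each volume. This is a minimal sketch: `Get-DiskAlertSeverity` is a hypothetical helper name, and it evaluates a single point-in-time reading rather than the 15-minute window the alert rule uses, so it may disagree with the alert state for short-lived dips.

```powershell
# Hypothetical helper: maps a free-space percentage to the alert levels in this runbook.
# Default thresholds mirror the alert definition (critical < 10%, warning < 15%).
function Get-DiskAlertSeverity {
    param(
        [Parameter(Mandatory)][double]$FreePercent,
        [double]$CriticalThreshold = 10,
        [double]$WarningThreshold  = 15
    )
    if ($FreePercent -lt $CriticalThreshold) { return 'critical' }
    if ($FreePercent -lt $WarningThreshold)  { return 'warning' }
    return 'ok'
}

# On the VM: one row per local fixed disk (DriveType 3) with its current level.
if ($env:OS -eq 'Windows_NT') {
    Get-CimInstance Win32_LogicalDisk -Filter "DriveType=3" | ForEach-Object {
        $freePct = [math]::Round(($_.FreeSpace / $_.Size) * 100, 2)
        [PSCustomObject]@{
            Drive       = $_.DeviceID
            FreePercent = $freePct
            Severity    = Get-DiskAlertSeverity -FreePercent $freePct
        }
    }
}
```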

Impact

Low disk space can cause:

Application failures and crashes
Database corruption
Log file write failures
Service interruptions
System instability
Inability to process new data or transactions

Investigation Steps

1. Check Disk Space Metrics in Azure Portal

Navigate to Azure Portal → Virtual Machines → [VM Name] → Metrics
Select the "Azure.VM.Windows.GuestMetrics" metric namespace, then the "LogicalDisk % Free Space" metric
Adjust time range to last 24 hours
Identify which logical disk (C:, D:, E:, etc.) triggered the alert
Look for patterns: gradual growth vs. sudden spike

2. Identify Disk Usage and Large Files

[!NOTE] VM Connection Methods

Azure Portal: VM → Connect → Choose connection method (Bastion, RDP, Serial Console)

Access Requirements: Contributor or Virtual Machine Contributor role on the VM or its resource group

Serial Console: Requires boot diagnostics enabled (Azure Portal → VM → Boot diagnostics)

Network Access: Bastion provides browser-based access without public IP requirements

Windows:

# Connect via RDP (Azure Portal → VM → Connect → RDP)
# OR via Azure Serial Console (VM → Serial Console)

# Check disk space on all drives
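# Drives with no media (e.g. an empty DVD drive) report 0 used and 0 free; skip them to avoid a divide-by-zero error
Get-PSDrive -PSProvider FileSystem |
  Where-Object { ($_.Used + $_.Free) -gt 0 } |
  Select-Object Name,
    @{Name="UsedGB";Expression={[math]::Round($_.Used / 1GB, 2)}},
    @{Name="FreeGB";Expression={[math]::Round($_.Free / 1GB, 2)}},
    @{Name="FreePercent";Expression={[math]::Round(($_.Free / ($_.Used + $_.Free)) * 100, 2)}}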

# Find largest folders on C: drive (top 20)
Get-ChildItem C:\ -Directory -Force -ErrorAction SilentlyContinue |
  ForEach-Object {
    # -File skips directory objects (they have no Length); -Force includes hidden and system items such as C:\ProgramData
    $size = (Get-ChildItem $_.FullName -File -Recurse -Force -ErrorAction SilentlyContinue |
      Measure-Object -Property Length -Sum).Sum
    [PSCustomObject]@{
      Path = $_.FullName
      SizeGB = [math]::Round($size / 1GB, 2)
    }
  } | Sort-Object SizeGB -Descending | Select-Object -First 20

# Find largest files (top 20)
# -Force includes hidden and system files such as pagefile.sys
Get-ChildItem C:\ -File -Recurse -Force -ErrorAction SilentlyContinue |
  Sort-Object Length -Descending |
  Select-Object -First 20 FullName, @{Name="SizeMB";Expression={[math]::Round($_.Length / 1MB, 2)}}

# Check log file sizes
Get-ChildItem "C:\Windows\Logs" -Recurse -ErrorAction SilentlyContinue |
  Measure-Object -Property Length -Sum |
  Select-Object @{Name="TotalSizeGB";Expression={[math]::Round($_.Sum / 1GB, 2)}}

# Check temp folder sizes
Get-ChildItem $env:TEMP -Recurse -ErrorAction SilentlyContinue |
  Measure-Object -Property Length -Sum |
  Select-Object @{Name="TotalSizeGB";Expression={[math]::Round($_.Sum / 1GB, 2)}}

3. Check Application Logs

Review application logs for disk-related errors or warnings
Check Epic application logs (if applicable):
- Epic Cache logs
- Interconnect logs
- Print Spool directories
Look for failed log rotation or archiving processes
Check database logs for growth patterns

4. Review Disk Growth Trends

Azure Portal → VM → Metrics → "LogicalDisk % Free Space"
Analyze historical data to understand growth rate
Determine if this is gradual growth or sudden consumption
Correlate with application deployments or batch job schedules
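
The growth-rate analysis reduces to simple arithmetic on two free-space readings taken from the metrics chart. The following is a minimal sketch that assumes consumption stays linear between the readings; `Get-DiskFullEstimate` and its parameter names are hypothetical, and the sample figures are illustrative only.

```powershell
# Hypothetical helper: estimates growth rate and days until the disk is full,
# assuming consumption stays linear between two free-space readings.
function Get-DiskFullEstimate {
    param(
        [Parameter(Mandatory)][double]$EarlierFreeGB,   # free space at the start of the window
        [Parameter(Mandatory)][double]$CurrentFreeGB,   # free space now
        [Parameter(Mandatory)][double]$DaysBetween      # length of the window in days
    )
    if ($DaysBetween -le 0) { throw 'DaysBetween must be greater than zero.' }

    $growthPerDay = ($EarlierFreeGB - $CurrentFreeGB) / $DaysBetween

    # If free space is flat or increasing, there is no projected exhaustion date.
    $daysUntilFull = if ($growthPerDay -gt 0) {
        [math]::Round($CurrentFreeGB / $growthPerDay, 1)
    } else { $null }

    [PSCustomObject]@{
        GrowthGBPerDay = [math]::Round($growthPerDay, 2)
        DaysUntilFull  = $daysUntilFull
    }
}

# Illustrative readings: 40 GB free a week ago, 26 GB free now
Get-DiskFullEstimate -EarlierFreeGB 40 -CurrentFreeGB 26 -DaysBetween 7
```

With these sample readings the disk is losing 2 GB per day and would fill in about 13 days. A sudden spike breaks the linear assumption, so treat the estimate as a rough guide for gradual growth only.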

5. Check for Common Space Consumers

Common Windows locations to check:

C:\Windows\Logs - Windows system logs
C:\Windows\Temp - Windows temp files
C:\Users\*\AppData\Local\Temp - User temp files
C:\inetpub\logs - IIS logs
SQL Server log files
Application-specific log directories
Database backup files
Windows Update cache
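
The locations above can be sized in one pass. A minimal sketch: `Get-FolderSizeGB` is a hypothetical helper, and `C:\Windows\SoftwareDistribution\Download` is assumed here as the Windows Update cache location (the usual default). SQL Server, database backup, and application-specific directories vary per server, so add them to the list as needed.

```powershell
# Hypothetical helper: total size of all files under a folder, in GB.
# Returns 0 when the folder is missing, empty, or unreadable.
function Get-FolderSizeGB {
    param([Parameter(Mandatory)][string]$Path)
    if (-not (Test-Path -LiteralPath $Path)) { return 0 }
    $bytes = (Get-ChildItem -LiteralPath $Path -File -Recurse -Force -ErrorAction SilentlyContinue |
        Measure-Object -Property Length -Sum).Sum
    [math]::Round([double]$bytes / 1GB, 2)
}

if ($env:OS -eq 'Windows_NT') {
    $locations = @(
        'C:\Windows\Logs',
        'C:\Windows\Temp',
        'C:\inetpub\logs',
        'C:\Windows\SoftwareDistribution\Download'   # assumed Windows Update cache path
    )
    # Expand the per-user temp folders from the wildcard path
    $locations += @(Resolve-Path 'C:\Users\*\AppData\Local\Temp' -ErrorAction SilentlyContinue |
        Select-Object -ExpandProperty Path)

    $locations | ForEach-Object {
        [PSCustomObject]@{ Path = $_; SizeGB = Get-FolderSizeGB -Path $_ }
    } | Sort-Object SizeGB -Descending
}
```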

Remediation

[!WARNING] Infrastructure as Code Policy

All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes to infrastructure.

Investigation Actions

Identify the logical disk with low space (use investigation steps above)
Determine the top space-consuming folders and files
Review disk space growth trends to estimate when disk will be full
Document findings in ServiceNow incident with:
- Affected disk (C:, D:, etc.)
- Current free space percentage
- Top 5 space-consuming folders/files
- Growth rate (GB per day/week)

Short-Term Resolution

Open ServiceNow Incident with Epic_Azure_Infrastructure_Ops:

Tier 3 Support will review and implement changes via incident or change request:
- Disk expansion (must be done through Terraform/IaC)
- Safe cleanup of temporary files and logs
- Log rotation configuration
- Move old data to Azure Blob Storage (archive tier)
- Database log file shrinking (if safe and appropriate)
All changes implemented through Terraform/IaC
No manual Azure Portal disk resizing

Safe Temporary Cleanup (coordinate with application teams):

Clear Windows temp files (use Disk Cleanup utility)
Archive old application logs
Clear IIS logs older than retention period
Remove old Windows Update files
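
Before any log cleanup is approved, it helps to know how much space it would reclaim. The sketch below only reports candidates and previews the deletion with `-WhatIf`; nothing is removed. `Get-StaleLogFile` is a hypothetical helper, the 30-day retention value is an assumption to be replaced with the period agreed with the application team, and `C:\inetpub\logs` is the IIS log location listed earlier in this runbook.

```powershell
# Hypothetical helper: lists log files older than a retention period.
# It only reports candidates; nothing is deleted.
function Get-StaleLogFile {
    param(
        [Parameter(Mandatory)][string]$Path,
        [int]$RetentionDays = 30,      # assumed value; use the agreed retention period
        [string]$Filter = '*.log'
    )
    $cutoff = (Get-Date).AddDays(-$RetentionDays)
    Get-ChildItem -LiteralPath $Path -Filter $Filter -File -Recurse -ErrorAction SilentlyContinue |
        Where-Object { $_.LastWriteTime -lt $cutoff }
}

if (Test-Path 'C:\inetpub\logs') {
    $stale = @(Get-StaleLogFile -Path 'C:\inetpub\logs' -RetentionDays 30)
    $reclaimGB = [math]::Round([double]($stale | Measure-Object -Property Length -Sum).Sum / 1GB, 2)
    "{0} files older than 30 days, {1} GB reclaimable" -f $stale.Count, $reclaimGB

    # Preview only: -WhatIf prints what would be removed without deleting anything
    $stale | Remove-Item -WhatIf
}
```

The file count and reclaimable size can be attached to the ServiceNow incident so the cleanup is reviewed before `-WhatIf` is ever dropped.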

Long-Term Resolution

Create GitHub Issue: Epic on Azure Ops Issues

Engineering Team will implement permanent solutions:
- Proper disk sizing based on workload analysis
- Automated log rotation and archiving
- Log forwarding to centralized logging (Splunk)
- Database maintenance plans for log management
- Storage tiering strategy (hot data on VM, cold data on Blob Storage)
- Monitoring and alerting for log file growth
- Disk auto-expansion policies via Terraform
All solutions implemented through CI/CD pipeline
Changes tracked via GitHub issue → PR → deployment workflow

Terraform Configuration Example

module "metric_alerts_disk_space" {
  source                   = "terraform.uhg.com/uhg-customer-modules/private-registry-metric-alerts/epic"
  version                  = "1.7.5"
  short_name               = "DISK"
  explanation              = "Low disk space detected. Investigation: Check disk usage trends, identify large files/folders, review application logs for disk space issues. Remediation: Clean up old logs, expand disk size, move data to alternate storage."
  target_resource_location = local.resource_location
  target_resource_type     = "Microsoft.Compute/virtualMachines"
  target_name              = local.target_name
  metric_name              = "LogicalDisk % Free Space"
  metric_namespace         = "Azure.VM.Windows.GuestMetrics"
  resource_group_name      = azurerm_resource_group.ohemr-rg.name
  scopes                   = concat(local.alert_scopes_app_rg_ids, local.alert_scopes_odb_rg_ids)

  email_recipients = {
    prod_action_group = {
      name                    = "prod_action_group"
      email_address           = local.recipients
      use_common_alert_schema = true
    }
  }

  event_hub = {
    event_hub_npd = {
      name                    = "As per region"
      event_hub_namespace     = "As per region"
      event_hub_name          = "diagnostic-logs"
      use_common_alert_schema = false
    }
  }

  alerts = {
    critical = {
      threshold      = 10 # Less than 10% free space
      severity       = 0
      aggregation    = "Average"
      operator       = "LessThan"
      severity_name  = "critical"
      metric_details = "Disk Free Space Less than 10% ${local.environment} ${local.resource_location}"
      frequency      = "PT5M"
      window_size    = "PT15M"
    }

    warning = {
      threshold      = 15 # Less than 15% free space
      severity       = 2
      aggregation    = "Average"
      operator       = "LessThan"
      severity_name  = "warning"
      metric_details = "Disk Free Space Less than 15% ${local.environment} ${local.resource_location}"
      frequency      = "PT5M"
      window_size    = "PT15M"
    }
  }
}

Escalation

Epic_Azure_Infrastructure_Ops: Open ServiceNow incident if free disk space is below 5% or the disk requires immediate expansion
Epic - Azure (National West): Open ServiceNow incident if disk space issue is caused by application behavior or database growth
- Application log file explosion
- Database transaction log growth
- Application data retention issues

Related Alerts

High CPU Usage (disk I/O operations can cause CPU spikes)
Application Performance Issues (low disk space can cause application errors)
Database Performance (low disk space degrades database operations)

Historical Context

Common causes in OHEMR Epic environment:

IIS log files not rotated properly
Epic Print Spool directory growth
SQL Server transaction log growth
Windows Update cache accumulation
Epic Cache local storage growth
Temp file accumulation from batch jobs