Azure Metric Alert Terraform Code - Reference
Azure Metric Alert Terraform Code Documentation
Local variables used on all alerts
Local Terraform variables used:
- environment: Used for naming conventions and tagging.
- resource_location: The Azure region where a resource will be deployed.
- target_name: The name of the resource the alert targets. It is used for virtual machine metric alerts; for other resource types, use the resource's own name instead.
- recipients: The email recipients that metric alerts are sent to.
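As a minimal sketch, these locals could be declared as follows. The values are placeholders rather than the real settings, and recipients is assumed to be a single string because the examples in this document interpolate it into email_address.

```hcl
locals {
  environment       = "prod"                    # placeholder; used in names and tags
  resource_location = "westus3"                 # Azure region the alerts deploy to
  target_name       = "example-vm-001"          # placeholder virtual machine name
  recipients        = "epic-alerts@example.com" # placeholder email address
}
```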
Alert Monitoring and Troubleshooting
All metric alerts, when fired, are automatically forwarded to Splunk for centralized monitoring and analysis. This provides a unified view of all alert activity across the Epic infrastructure.
Splunk Integration
When any metric alert triggers (critical, warning, or informational), the alert data is sent to Splunk through configured Event Hubs. This allows for:
- Centralized Alert Monitoring: View all Epic infrastructure alerts in one location
- Historical Analysis: Track alert patterns and trends over time
- Correlation Analysis: Identify relationships between different alerts and services
- Operational Dashboards: Create custom dashboards for monitoring specific services or environments
Finding Alerts in Splunk
To search for metric alerts in Splunk, use the following search query:
index=cloud_epic_azure_nw "data.schemaId"=AzureMonitorMetricAlert
This search will return all Azure Monitor metric alerts that have been triggered across the Epic infrastructure.
Common Splunk Search Refinements
You can refine your search to focus on specific aspects:
# Search for critical alerts only
index=cloud_epic_azure_nw "data.schemaId"=AzureMonitorMetricAlert "data.data.status"=Activated "data.data.context.severity"=0
# Search for alerts from a specific resource type (e.g., NetApp)
index=cloud_epic_azure_nw "data.schemaId"=AzureMonitorMetricAlert "data.data.context.resourceType"="Microsoft.NetApp/netAppAccounts/capacityPools/volumes"
# Search for alerts within a specific time range
index=cloud_epic_azure_nw "data.schemaId"=AzureMonitorMetricAlert earliest=-24h latest=now
# Search for alerts from a specific environment or region
index=cloud_epic_azure_nw "data.schemaId"=AzureMonitorMetricAlert reg=westus3 "epicpro"
Alert Data Structure in Splunk
Each alert entry in Splunk contains detailed information about the triggered alert, including:
- Alert Details: Alert name, severity, status (Activated/Resolved)
- Resource Information: Resource ID, resource type, resource group
- Metric Data: Metric name, threshold values, actual values
- Timing Information: When the alert fired and evaluation windows
- Environment Context: Region, environment tags, and other metadata
This comprehensive logging enables effective monitoring, troubleshooting, and operational insights across the entire Epic infrastructure.
Metric Alert Baselines
Example: Virtual Machine CPU alert. To use this baseline for other resource types, such as load balancers, ExpressRoute circuits, or other services, set the scope variable to the resource's ID (data.resource.id) and the target_name variable to the resource's name (data.resource.name).
module "metric_alerts_cpu" { # rename for other resource types, for example metric_alerts_loadbalancer or metric_alerts_express_route
source = "terraform.uhg.com/uhg-customer-modules/private-registry-metric-alerts/epic" # which Terraform workspace URL you are using
version = "1.4.0" # which Terraform workspace version you are using; keeping this pinned is highly recommended as a best practice
short_name = "VM" # the nickname of the metric alert
# Optional: scope = [data.azurerm_lb.loadbalancer.id] # the resource the alert monitors. Many examples use virtual machines by default, but you can target other resource types such as load balancers, application gateways, or storage accounts.
explanation = "CPU usage is high" # explains the alert's function; shown in the alert name in the Azure UI
target_resource_location = local.resource_location # lets Terraform deploy to the correct region; for non-virtual-machine alerts you must use data.resource.location
target_resource_type = "Microsoft.Compute/virtualMachines" # the API name of the resource type you are targeting
target_name = local.target_name # defaults to a virtual machine; for other resource types, use data.resource.name (the resource name) so the alert points to it
metric_name = "Percentage CPU" # the API name of the metric evaluated against the thresholds
resource_group_name = azurerm_resource_group.resource_group.name # which resource group you want to use
email_recipients = {
prod_action_group = { # map of email recipients
name = "prod_action_group" # name of the email receiver
email_address = "${local.recipients}" # uses local.recipients
use_common_alert_schema = true # whether to use the common alert schema (true or false)
}
}
event_hub = {
event_hub_cloudtest = { # map of Event Hubs
name = "lp-cl-westus3-eventhub-cc751735" # name of the Event Hub receiver
event_hub_namespace = "lp-cl-westus3-eventhub-cc751735" # name of the Event Hub namespace
event_hub_name = "diagnostic-logs" # name of the Event Hub
use_common_alert_schema = false # whether to use the common alert schema (true or false)
}
}
alerts = {
critical = {
threshold = 95 # the value the metric must meet or exceed to trigger the alert
severity = 0 # severity of the alert, from 0 (critical) to 4 (verbose)
aggregation = "Average" # the aggregation type used to evaluate the metric
operator = "GreaterThanOrEqual" # comparison operator
severity_name = "critical" # the name of the severity: critical, warning, verbose, etc.
metric_details = "Percentage CPU Greater than 95 ${local.environment} ${local.resource_location}" # adds a description of the alert for the Azure UI
frequency = "PT1M" # how often the alert rule is evaluated
window_size = "PT5M" # the time range over which the metric is evaluated
}
}
}
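To illustrate the substitution described above for a resource that is not a virtual machine, the sketch below looks up a load balancer with the azurerm_lb data source and exposes the three values the baseline needs. The load balancer name, resource group name, and local names are hypothetical.

```hcl
# Hypothetical lookup; replace the name and resource group with real values.
data "azurerm_lb" "loadbalancer" {
  name                = "example-lb"
  resource_group_name = "example-rg"
}

locals {
  lb_alert_scope    = [data.azurerm_lb.loadbalancer.id]     # for scope
  lb_alert_name     = data.azurerm_lb.loadbalancer.name     # for target_name
  lb_alert_location = data.azurerm_lb.loadbalancer.location # for target_resource_location
}
```

These locals would then be passed to the module in place of the virtual machine values, with target_resource_type set to "Microsoft.Network/loadBalancers" and metric_name set to a metric that the load balancer actually emits.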
Alert Silence Rule
There are several types of alert silence rule. A silencing rule can be based on resource group, resource type, or metric alert name. Review the layout of the Terraform code below to determine which option fits best.
module "alert_processing_rule_silenced_alert_rule" {
source = "terraform.uhg.com/uhg-customer-modules/private-registry-alert-processing-rule-suppression/epic" # which Terraform workspace URL you are using
version = "1.5.0" # which Terraform workspace version you are using; keeping this pinned is highly recommended as a best practice
alert_processing_rule_suppression_name = "Alert-Processing-Rule-Suppression" # name of the alert processing rule as shown in the Azure UI
schedule_enabled = true # whether the rule is temporary (scheduled) or permanent; you may have to enable it anyway to customize recurrence
# optional schedule_enabled values:
effective_from = "2025-03-06T01:00:00"
effective_until = "2225-03-06T01:00:00"
recurrence_daily_enabled = true # set to true to run the rule every day
# optional recurrence_daily_enabled values:
start_time = "23:59:59" # the time the rule starts each day
end_time = "00:00:00" # the time the rule ends each day
recurrence_weekly_enabled = true # set to true to run the rule weekly
# optional recurrence_weekly_enabled values (start_time and end_time are repeated here only to show the weekly option; set each argument once):
days_of_week = ["Monday","Tuesday"] # the days the rule is active
start_time = "23:59:59" # the time the rule starts on each selected day
end_time = "00:00:00" # the time the rule ends on each selected day
time_zone = "Central Standard Time" # the time zone used for dates and times
resource_group_name = azurerm_resource_group.resource_group.name # the resource group the rule is deployed to
short_name = "ohemr APRS" # short name of the rule
explanation = "notifications have been silenced due to a maintenance window." # explains what the rule does
condition = true # set to true to apply a condition filter; the rule can silence by resource group, resource type, or alert name. See the Terraform code to understand how to switch between the options.
# Optional values: switch between silencing by alert rule, resource type, or resource group.
# silence alert rule option:
alert_rule_name_enabled = true # whether this option is enabled
alert_rule_name_enabled_operator = "Equals" # comparison operator
alert_rule_name_enabled_values = ["Metric alert to silence"] # the name of the alert rule you are silencing
# silence resource type option:
target_resource_type_enabled = true # whether this option is enabled
target_resource_type_operator = "Equals" # comparison operator
target_resource_type_values = ["Microsoft.Compute/virtualMachines", "Microsoft.Network/loadBalancers"] # the resource types you are silencing
# silence resource group option:
target_resource_group_enabled = true # whether this option is enabled
target_resource_group_operator = "Contains" # comparison operator
target_resource_group_values = "resource_group_name" # the resource group you are silencing
}
Windows Disk Space Metric Alerts
The disk space monitoring alerts track available disk space percentages across Windows virtual machine logical disks. These alerts help prevent storage-related outages by providing early warning when disk capacity is running low.
Important: This alert is specifically designed for Windows Virtual Machines only. It uses the Azure.VM.Windows.GuestMetrics namespace, which requires the Azure Monitor Agent to be installed on Windows VMs. For Linux VMs, a separate disk space monitoring solution is required.
Key Characteristics
- Metric Name: LogicalDisk % Free Space
- Metric Namespace: Azure.VM.Windows.GuestMetrics (requires Azure Monitor Agent)
- Resource Type: Microsoft.Compute/virtualMachines
- Alert Type: Percentage-based threshold monitoring
Why This Alert Is Important
Low disk space can cause:
- Application failures and crashes
- Database corruption
- Log file write failures
- Service interruptions
- System instability
Alert Configuration
module "metric_alerts_disk_space" {
source = "terraform.uhg.com/uhg-customer-modules/private-registry-metric-alerts/epic"
version = "1.7.5"
short_name = "DISK"
explanation = "Low disk space detected. Investigation: Check disk usage trends, identify large files/folders, review application logs for disk space issues. Remediation: Clean up old logs, expand disk size, move data to alternate storage."
target_resource_location = local.resource_location
target_resource_type = "Microsoft.Compute/virtualMachines"
target_name = local.target_name
metric_name = "LogicalDisk % Free Space"
metric_namespace = "Azure.VM.Windows.GuestMetrics"
resource_group_name = azurerm_resource_group.ohemr-rg.name
scopes = concat(local.alert_scopes_app_rg_ids, local.alert_scopes_odb_rg_ids)
email_recipients = {
prod_action_group = {
name = "prod_action_group"
email_address = "${local.recipients}"
use_common_alert_schema = true
}
}
event_hub = {
event_hub_npd = {
name = "As per region"
event_hub_namespace = "As per region"
event_hub_name = "diagnostic-logs"
use_common_alert_schema = false
}
}
alerts = {
critical = {
threshold = 10 # Less than 10% free space
severity = 0
aggregation = "Average"
operator = "LessThan"
severity_name = "critical"
metric_details = "Disk Free Space Less than 10% ${local.environment} ${local.resource_location}"
frequency = "PT5M"
window_size = "PT15M"
}
warning = {
threshold = 15 # Less than 15% free space
severity = 2
aggregation = "Average"
operator = "LessThan"
severity_name = "warning"
metric_details = "Disk Free Space Less than 15% ${local.environment} ${local.resource_location}"
frequency = "PT5M"
window_size = "PT15M"
}
}
}
Investigation Steps
When this alert fires:
- Identify the disk: Check which logical disk (C:, D:, etc.) triggered the alert
- Review disk usage trends: Look at historical data to understand growth patterns
- Find large files/folders:
  - Use TreeSize or PowerShell to identify space consumers
  - Check application log directories
  - Review temp folders and user profiles
- Check application logs: Look for disk-related errors or warnings
Alert Processing Rule
This alert processing rule is a stopgap for backup vault alerts. Microsoft does not currently offer updated alerts for backup vaults, so this rule redirects Microsoft classic alerts to Event Hub and email through an action group. Continue using this processing rule until a modern implementation is established.
module "alert_processing_rule" {
source = "terraform.uhg.com/uhg-customer-modules/private-registry-alert-processing-rule/epic" # which Terraform workspace URL you are using
version = "1.1.1" # which Terraform workspace version you are using; keeping this pinned is highly recommended as a best practice
alert_processing_rule_name = "Ohemr-Alert-Processing-Rule-Backup" # name of the alert processing rule
short_name = "Ohemr APR" # nickname of the alert processing rule
explanation = "is experiencing issues affecting backup and restore operations" # explanation of how the alert processing rule works
alert_processing_rule_operator = "Equals" # comparison operator
alert_processing_rule_value = ["Azure Backup"] # the value the rule matches on
resource_group_name = azurerm_resource_group.resource_group.name # the resource group the rule is deployed to
metric_name = "Backup Vault" # name of the metric
email_recipients = {
prod_action_group = { # map of email recipients
name = "prod_action_group" # name of the email receiver
email_address = "${local.recipients}" # uses local.recipients
use_common_alert_schema = true # whether to use the common alert schema (true or false)
}
}
event_hub = {
event_hub_cloudtest = { # map of Event Hubs
name = "lp-cl-westus3-eventhub-cc751735" # name of the Event Hub receiver
event_hub_namespace = "lp-cl-westus3-eventhub-cc751735" # name of the Event Hub namespace
event_hub_name = "diagnostic-logs" # name of the Event Hub
use_common_alert_schema = false # whether to use the common alert schema (true or false)
}
}
}
Service Health and Administrative Alerts
When entering values for Service Health alerts, check the values in the Azure UI first. Service Health uses its own values rather than the usual API names. The best example is the service_health_locations variable, which takes Service Health location values (for example, "West US 3") instead of the API location names (for example, westus3).
If you are using an administrative alert or a similar category, you can simply change the category value. However, you must also supply some sort of filter, such as resource types or an operator name.
module "activity_log_alert_rule_service_health" { # for administrative alerts, name it activity_log_alert_rule_administrative
source = "terraform.uhg.com/uhg-customer-modules/registry-activity-log-alert/private" # which Terraform workspace URL you are using
version = "1.4.0" # which Terraform workspace version you are using; keeping this pinned is highly recommended as a best practice
activity_log_alert_name = "Activity Log Service Health" # activity log alert name
resource_group_name = azurerm_resource_group.ohemr-rg.name # the resource group the alert is deployed to
resource_group_location = "global" # the location the alert is deployed to; Service Health alerts must use global
description = "Service Health of express route, load balancer, and virtual machines" # description of what the activity log alert rule does
category = "ServiceHealth" # the activity log alert category to implement, for example ServiceHealth, Administrative, or ResourceHealth
metric_name = "Activity Log Service Health" # the name of the metric you are using
short_name = "ohemr LGA" # nickname of the alert
action_group_details = "Priority Resources" # an additional description of your action group, shown in the Azure UI
# Optional (administrative alerts): operator name = "Microsoft.Sql/servers/firewallRules/write" # the operation performed on a resource
service_health = { # optional
service_health_priority_services = { # map of Service Health settings
service_health_locations = ["West US 3"] # the locations Service Health will monitor
services = ["Load Balancer"] # the Azure services Service Health will monitor
events = ["Incident", "Security", "Maintenance", "ActionRequired"] # the types of notification it will monitor
}
}
email_recipients = {
prod_action_group = { # map of email recipients
name = "prod_action_group" # name of the email receiver
email_address = "${local.recipients}" # uses local.recipients
use_common_alert_schema = true # whether to use the common alert schema (true or false)
}
}
event_hub = {
event_hub_cloudtest = { # map of Event Hubs
name = "lp-cl-westus3-eventhub-cc751735" # name of the Event Hub receiver
event_hub_namespace = "lp-cl-westus3-eventhub-cc751735" # name of the Event Hub namespace
event_hub_name = "diagnostic-logs" # name of the Event Hub
use_common_alert_schema = false # whether to use the common alert schema (true or false)
}
}
}
Log search alert
This is the log search alert for patching failures. The alert counts the number of patching failures that occurred and sends the result to the patching team through email and Splunk.
module "log-search-fail-patch-jobs" {
source = "terraform.uhg.com/uhg-customer-modules/private-registry-log-search-alerts/epic" # which terraform workspace URL you are using
version = "1.1.3" # which Terraform workspace version you are using; keeping this pinned is highly recommended as a best practice
resource_group_name = azurerm_resource_group.ohemr-rg.name # which resource group you want to use
resource_group_location = local.resource_location # the location the alert is deployed to
metric_name = "log search" # name of the metric
short_name = "ohemr lgs" # nickname of the alert
action_group_details = "failed patch jobs" # an additional description of your action group, shown in the Azure UI
identity_type = "SystemAssigned" # the identity type for the alert; to use your own identity, you must modify the private registry module
log_search_alerts = {
log_search_alert_1 = {
metric_details = "patch failures" # adds a description of the alert for the Azure UI
evaluation_frequency = "PT10M" # how often the alert rule is evaluated
window_duration = "PT10M" # the time range over which the query is evaluated
severity = 0 # severity of the alert
auto_mitigation_enabled = false # whether the alert resolves itself automatically when the alert condition is no longer met
workspace_alerts_storage_enabled = false # whether alert data (such as triggered alerts, alert history, or alert metadata) is stored in a Log Analytics workspace
description = "detects failed patching jobs" # adds a description of the alert for the Azure UI
display_name = "Ohemr Alert Failed Patch Jobs" # the display name of the alert
enabled = true # whether the alert rule is active or disabled
query_time_range_override = null # overrides the default time range used when executing the Kusto query for the alert
skip_query_validation = true # whether to skip validation of the Kusto query during deployment
time_aggregation_method = "Count" # how values are aggregated over the time window before being compared to the threshold
threshold = 0 # the value the result must exceed to trigger the alert
operator = "GreaterThan" # comparison operator, such as GreaterThan or LessThan
metric_measure_column = null # the column from your Kusto query that contains the numeric values to evaluate
}
}
dimensions = {
dimension1 = {
name = "vmResourceId" # name of the dimension
operator = "Include" # comparison operator, either Include or Exclude
values = ["*"] # the values to include; * is recommended so the results are broken down one value at a time
}
}
# put your KQL query between <<-QUERY and QUERY
log_search_query = <<-QUERY
QUERY
email_recipients = {
prod_action_group = { # map of email recipients
name = "prod_action_group" # name of the email receiver
email_address = "${local.recipients}" # uses local.recipients
use_common_alert_schema = true # whether to use the common alert schema (true or false)
}
}
event_hub = {
event_hub_cloudtest = { # map of Event Hubs
name = "lp-cl-westus3-eventhub-cc751735" # name of the Event Hub receiver
event_hub_namespace = "lp-cl-westus3-eventhub-cc751735" # name of the Event Hub namespace
event_hub_name = "diagnostic-logs" # name of the Event Hub
use_common_alert_schema = false # whether to use the common alert schema (true or false)
}
}
}
Azure NetApp Files (ANF) Metric Alerts
Azure NetApp Files monitoring uses a sophisticated template-based approach to generate multiple metric alerts for volumes and capacity pools. Unlike traditional metric alerts that target individual resources, ANF alerts are dynamically generated based on configuration variables and templates.
ANF Alert Architecture
The ANF monitoring system consists of three main components:
- Variables (variable.tf): Define metric configurations, thresholds, timing parameters, and specify which volumes and pools to monitor
- Local Templates (locals.tf): Create metric templates and calculate dynamic thresholds
- Resource Configuration: Volumes and pools are defined as variables in variable.tf with default values
Key Components
Volume Metric Templates
Each volume can monitor up to 19 different metrics:
- Storage Metrics: volume_consumed_size, percentage_consumed_size, snapshot_size, inode
- Performance Metrics: read_iops, write_iops, total_iops, other_iops
- Latency Metrics: read_latency, write_latency
- Throughput Metrics: read_throughput, write_throughput, total_throughput, other_throughput
- Replication Metrics: replication_lag, replication_status, replication_transferring
- Backup Metrics: backup_enabled, backup_operation_complete
Capacity Pool Metrics
Each capacity pool monitors:
- Pool Storage: pool_consumed_size (monitors pool utilization against allocated size)
Dynamic Threshold Calculation
ANF alerts use service-level performance calculations to set appropriate thresholds:
# Service level performance maps
netapp_service_level_throughput_per_tib = { Standard = 16, Premium = 64, Ultra = 128 }
netapp_service_level_iops_per_tib = { Standard = 1024, Premium = 4096, Ultra = 8192 }
# Per-volume calculations based on allocated size and service level
netapp_volume_calculations = {
for k, v in local.netapp_volume_inputs : k => {
allocated_tib = v.allocated_bytes > 0 ? v.allocated_bytes / local.bytes_per_tib : 0
max_throughput_mibps = lookup(local.netapp_service_level_throughput_per_tib, v.service_level, 0) * (v.allocated_bytes > 0 ? v.allocated_bytes / local.bytes_per_tib : 0)
max_iops = lookup(local.netapp_service_level_iops_per_tib, v.service_level, 0) * (v.allocated_bytes > 0 ? v.allocated_bytes / local.bytes_per_tib : 0)
service_level = v.service_level
}
}
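As a worked example of the calculation above, the sketch below evaluates the limits for a single hypothetical 2 TiB Premium volume. The example_ local names and the definition of bytes per TiB as 1024^4 are assumptions made for illustration; the per-TiB rates are the ones from the maps above.

```hcl
locals {
  example_bytes_per_tib = pow(1024, 4) # 1099511627776 bytes (assumed definition)

  example_throughput_per_tib = { Standard = 16, Premium = 64, Ultra = 128 }
  example_iops_per_tib       = { Standard = 1024, Premium = 4096, Ultra = 8192 }

  # Hypothetical volume used only for this worked example.
  example_volume = {
    allocated_bytes = 2199023255552 # 2 TiB
    service_level   = "Premium"
  }

  example_allocated_tib = local.example_volume.allocated_bytes / local.example_bytes_per_tib

  example_max_throughput_mibps = (
    lookup(local.example_throughput_per_tib, local.example_volume.service_level, 0) * local.example_allocated_tib
  )
  example_max_iops = (
    lookup(local.example_iops_per_tib, local.example_volume.service_level, 0) * local.example_allocated_tib
  )
}

output "example_volume_limits" {
  value = {
    allocated_tib        = local.example_allocated_tib        # 2
    max_throughput_mibps = local.example_max_throughput_mibps # 64 * 2 = 128
    max_iops             = local.example_max_iops             # 4096 * 2 = 8192
  }
}
```

An unrecognized service level falls back to 0 through the lookup default, so both limits evaluate to 0 instead of raising an error.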
ANF Alert Configuration Example
# Example of how ANF alerts are generated from templates
module "netapp_metric_alerts" {
source = "terraform.uhg.com/uhg-customer-modules/private-registry-metric-alerts/epic"
version = "1.4.0"
for_each = local.all_netapp_metrics
short_name = each.value.short_name
scope = [each.value.target_scope]
explanation = each.value.explanation
target_resource_location = local.resource_location
target_resource_type = each.value.target_resource_type
target_name = each.value.target_name
metric_name = each.value.metric_name
resource_group_name = azurerm_resource_group.resource_group.name
email_recipients = {
prod_action_group = {
name = "prod_action_group"
email_address = local.recipients
use_common_alert_schema = true
}
}
event_hub = {
event_hub_cloudtest = {
name = "lp-cl-westus3-eventhub-cc751735"
event_hub_namespace = "lp-cl-westus3-eventhub-cc751735"
event_hub_name = "diagnostic-logs"
use_common_alert_schema = false
}
}
alerts = each.value.alerts
}
Adding New ANF Volumes/Pools to Monitoring
To add new ANF volumes or capacity pools to an existing monitoring workspace, follow these steps:
Step 1: Update variable.tf
Add your new volumes and pools to the variable definitions in variable.tf:
# NetApp Capacity Pools configuration
variable "netapp_capacity_pools" {
type = map(object({
allocated_size_bytes = number
service_level = string
azure_name = string
account_name = string
resource_group_name = string
}))
default = {
existing_pool = {
allocated_size_bytes = 1099511627776 # 1 TiB
service_level = "Standard"
azure_name = "Standard"
account_name = "ohemr-anf-epic-shared-cus-001"
resource_group_name = "ohemr-rg-west-epic-netapp-shared-cus-001"
}
new_pool = {
allocated_size_bytes = 2199023255552 # 2 TiB
service_level = "Premium"
azure_name = "Premium"
account_name = "ohemr-anf-epic-shared-cus-001"
resource_group_name = "ohemr-rg-west-epic-netapp-shared-cus-001"
}
}
description = "Map of NetApp capacity pools and their properties"
}
# NetApp Volumes configuration
variable "netapp_volumes" {
type = map(object({
max_allocated_bytes = number
capacity_pool_name = string
azure_name = string
account_name = string
resource_group_name = string
}))
default = {
existing_volume = {
max_allocated_bytes = 107374182400 # 100 GiB
capacity_pool_name = "existing_pool"
azure_name = "epic-shared-volume-001"
account_name = "ohemr-anf-epic-shared-cus-001"
resource_group_name = "ohemr-rg-west-epic-netapp-shared-cus-001"
}
new_volume = {
max_allocated_bytes = 214748364800 # 200 GiB
capacity_pool_name = "new_pool"
azure_name = "epic-shared-volume-002"
account_name = "ohemr-anf-epic-shared-cus-001"
resource_group_name = "ohemr-rg-west-epic-netapp-shared-cus-001"
}
}
description = "Map of NetApp volumes and their properties"
}
Step 2: Verify Resource Names and Static Configuration
Ensure the Azure resource names match exactly what exists in your Azure subscription:
- azure_name: The actual NetApp volume/pool name in Azure
- account_name: The NetApp account containing the resources
- resource_group_name: The resource group containing the NetApp account
- capacity_pool_name: Must reference a key from the netapp_capacity_pools map
Important: The allocated_size_bytes, max_allocated_bytes, and service_level values in the monitoring workspace are
static configurations. If volume sizes or service levels are changed directly in Azure outside of the monitoring workspace,
these changes will not be automatically reflected in the alert thresholds. The monitoring workspace must be manually updated
to reflect any changes made to the actual Azure NetApp Files resources.
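Some of these checks can be automated. The sketch below assumes the variables from Step 1 and Terraform 1.5 or later, which introduced check blocks. It is not part of the existing workspace, and the local and check names are hypothetical.

```hcl
locals {
  # Volume keys whose capacity_pool_name is not a key in var.netapp_capacity_pools.
  netapp_volumes_with_unknown_pool = [
    for k, v in var.netapp_volumes : k
    if !contains(keys(var.netapp_capacity_pools), v.capacity_pool_name)
  ]

  # Pool keys whose service_level is not one of the supported spellings.
  netapp_pools_with_unknown_service_level = [
    for k, v in var.netapp_capacity_pools : k
    if !contains(["Standard", "Premium", "Ultra"], v.service_level)
  ]
}

check "netapp_static_configuration" {
  assert {
    condition     = length(local.netapp_volumes_with_unknown_pool) == 0
    error_message = "Volumes reference unknown capacity pools: ${join(", ", local.netapp_volumes_with_unknown_pool)}"
  }

  assert {
    condition     = length(local.netapp_pools_with_unknown_service_level) == 0
    error_message = "Pools with an unrecognized service level: ${join(", ", local.netapp_pools_with_unknown_service_level)}"
  }
}
```

A failed check assertion is reported as a warning during plan and apply and does not block the run. It also cannot detect drift between these static values and the actual Azure resources, so the manual update described above is still required.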
Step 3: Validate Configuration
Run Terraform validation to ensure your configuration is correct:
terraform init
terraform validate
terraform plan
Step 4: Deploy Changes via TFE
Deploy the changes through the standard Terraform Enterprise (TFE) workflow:
- Create Pull Request: Submit your changes via a pull request
- Code Review: Wait for pull request review and approval
- TFE Deployment: Once approved and merged, TFE will automatically apply the changes to create new metric alerts
Note: Direct terraform apply commands are not used. All deployments go through TFE after the proper review and approval process.
Step 5: Verify Alert Creation
After deployment, verify that alerts were created for your new resources:
- Check Azure Portal: Navigate to Monitor > Alerts to see the new metric alerts
- Expected Alerts Per Volume: Each volume will generate up to 19 metric alerts (one for each enabled metric)
- Expected Alerts Per Pool: Each capacity pool will generate 1 metric alert for pool consumed size
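To sanity-check the number of alerts to expect, a rough upper bound can be computed from the configuration. The sketch below assumes the variables from Step 1 and that the enabled metric list is available as local.enabled_metrics. The other names are hypothetical.

```hcl
locals {
  # Upper bound: one alert per enabled metric per volume, plus one per pool.
  expected_volume_alerts = length(var.netapp_volumes) * length(local.enabled_metrics)
  expected_pool_alerts   = length(var.netapp_capacity_pools)
}

output "expected_netapp_alert_count" {
  value = local.expected_volume_alerts + local.expected_pool_alerts
}
```

With two volumes, two pools, and a 16-entry enabled_metrics list, this evaluates to 2 * 16 + 2 = 34. The real count can be lower, because volumes with max_allocated_bytes = 0 are excluded from monitoring.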
Important Considerations
Resource Sizing Requirements
- Minimum Volume Size: Volumes with max_allocated_bytes = 0 will be excluded from monitoring
- Service Level Impact: Different service levels (Standard/Premium/Ultra) have different IOPS and throughput limits that affect alert thresholds
- Static Configuration: Alert thresholds are calculated based on the static values defined in variable.tf. Changes to actual Azure NetApp Files resources (size increases, service level changes) will not automatically update alert thresholds until the monitoring workspace variables are manually updated
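The exclusion of zero-sized volumes can be expressed as a filtered for expression. This is a minimal sketch with hypothetical example data, not the workspace's actual locals.tf code.

```hcl
locals {
  # Hypothetical volumes used only for this illustration.
  example_volumes = {
    sized_volume   = { max_allocated_bytes = 107374182400 } # 100 GiB
    unsized_volume = { max_allocated_bytes = 0 }
  }

  # Keep only volumes with a positive allocated size.
  example_monitored_volumes = {
    for k, v in local.example_volumes : k => v
    if v.max_allocated_bytes > 0
  }
  # Only sized_volume remains in example_monitored_volumes.
}
```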
Alert Naming Convention
ANF alerts follow this naming pattern:
- Format: {volume_key}_{metric_key} or {pool_key}_{metric_key}
- Example: epic_shared_volume_001_read_iops_critical
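One way such keys could be assembled is with setproduct and string interpolation. This sketch uses hypothetical inputs and covers only the {volume_key}_{metric_key} part of the pattern. The severity suffix in the example name above is not reproduced here.

```hcl
locals {
  # Hypothetical inputs for illustration.
  example_volume_keys = ["epic_shared_volume_001"]
  example_metric_keys = ["read_iops", "write_iops"]

  # Every volume key paired with every metric key, joined by an underscore.
  example_alert_keys = [
    for pair in setproduct(local.example_volume_keys, local.example_metric_keys) :
    "${pair[0]}_${pair[1]}"
  ]
  # => ["epic_shared_volume_001_read_iops", "epic_shared_volume_001_write_iops"]
}
```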
Metric Names (Azure API)
ANF uses specific Azure metric names:
- VolumeLogicalSize (volume consumed size)
- VolumeSnapshotSize (snapshot size)
- ReadIops, WriteIops, TotalIops, OtherIops
- AverageReadLatency, AverageWriteLatency (legacy metric names for Azure compatibility)
- ReadThroughput, WriteThroughput, TotalThroughput, OtherThroughput
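A lookup map is one way to keep the template keys and the Azure metric names together. In the sketch below, the pairing of each key with its API name is inferred from the naming and is an assumption, as is the local name. Verify the actual mapping in locals.tf.

```hcl
locals {
  # Template key => Azure metric name (pairing inferred from the names in this document).
  example_netapp_metric_api_names = {
    volume_consumed_size = "VolumeLogicalSize"
    snapshot_size        = "VolumeSnapshotSize"
    read_iops            = "ReadIops"
    write_iops           = "WriteIops"
    total_iops           = "TotalIops"
    other_iops           = "OtherIops"
    read_latency         = "AverageReadLatency"
    write_latency        = "AverageWriteLatency"
    read_throughput      = "ReadThroughput"
    write_throughput     = "WriteThroughput"
    total_throughput     = "TotalThroughput"
    other_throughput     = "OtherThroughput"
  }

  # Example lookup; returns null for a key that has no entry.
  example_read_iops_api_name = lookup(local.example_netapp_metric_api_names, "read_iops", null)
}
```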
Troubleshooting Common Issues
- Volume Not Monitored: Check that max_allocated_bytes > 0
- Missing Alerts: Verify resource names match exactly in Azure
- Threshold Errors: Ensure service levels are spelled correctly (Standard, Premium, Ultra)
- Validation Errors: Check that capacity_pool_name references an existing pool key
Advanced Configuration
Custom Thresholds
You can customize alert thresholds by modifying the variables in variable.tf:
variable "netapp_read_iops_warning_percent" {
type = number
default = 80 # 80% of maximum IOPS for warning
}
variable "netapp_read_iops_critical_percent" {
type = number
default = 95 # 95% of maximum IOPS for critical
}
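To show how these percentages translate into absolute thresholds, the sketch below applies them to a hypothetical maximum of 8192 IOPS (for example, a 2 TiB Premium volume at 4096 IOPS per TiB). How locals.tf actually combines the percentage with the calculated maximum is an assumption here, and the local names are illustrative.

```hcl
locals {
  threshold_example_max_iops         = 8192 # hypothetical: 2 TiB at 4096 IOPS per TiB
  threshold_example_warning_percent  = 80
  threshold_example_critical_percent = 95

  # Absolute thresholds derived from the percentage of maximum IOPS.
  threshold_example_read_iops_warning  = local.threshold_example_max_iops * local.threshold_example_warning_percent / 100  # 6553.6
  threshold_example_read_iops_critical = local.threshold_example_max_iops * local.threshold_example_critical_percent / 100 # 7782.4
}
```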
Metric Selection
To disable specific metrics for all volumes, remove them from the enabled_metrics list in locals.tf:
enabled_metrics = [
"volume_consumed_size", "inode", "percentage_consumed_size", "snapshot_size",
"read_iops", "write_iops", "other_iops", "total_iops",
"read_latency", "write_latency",
# "replication_lag", "replication_status", "replication_transferring", # Disabled
"backup_enabled", "backup_operation_complete",
"read_throughput", "write_throughput", "total_throughput", "other_throughput"
]
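The effect of removing an entry from the list can be pictured as a filter over the metric templates. The sketch below uses hypothetical template data and local names. The real template structure in locals.tf may differ.

```hcl
locals {
  example_enabled_metrics = ["read_iops", "write_iops"]

  # Hypothetical template map keyed by metric name.
  example_metric_templates = {
    read_iops  = { metric_name = "ReadIops" }
    write_iops = { metric_name = "WriteIops" }
    total_iops = { metric_name = "TotalIops" }
  }

  # Keep only templates whose key appears in the enabled list.
  example_active_templates = {
    for k, v in local.example_metric_templates : k => v
    if contains(local.example_enabled_metrics, k)
  }
  # total_iops is dropped because it is not in example_enabled_metrics.
}
```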