ANF - DR Testing, Cutover, and Failback Guide
Operational runbook for testing, cutting over to DR, and failing back ANF volumes.
Overview
This runbook provides step-by-step procedures for Disaster Recovery (DR) testing, manual cutover, and failback for Azure NetApp Files (ANF) volumes supporting Epic environments. It ensures that DR events are executed with minimal risk and downtime, maintaining data integrity and compliance.
Strategic Benefits:
- Operational Resilience: Ensures Epic data and services remain available during regional outages or planned DR events.
- Regulatory Compliance: Supports DR testing evidence for audits (e.g., HIPAA, SOX).
- Controlled Failover/Failback: Mitigates split-brain risk and preserves data consistency.
Process Classification
| Phase | Scope | Purpose | Governance Level |
|---|---|---|---|
| DR Test Preparation | All Epic ANF Volumes | Ensure readiness for DR drill | Mandatory |
| DR Cutover | Target (DR) Region | Enable Epic services from DR site | Controlled |
| DR Failback | Source (Primary) Region | Restore Epic services to original region | Controlled |
Step-by-Step Procedures
1. Prepare For DR Testing
ANF Account / Volume: ohemr-anf-west-epic-pro-wus3-001
1.1 Check Replication Status in Azure Portal
- Navigate to Azure NetApp Files > Volumes.
- Select the volume and review the Replication tab.
- Confirm that the relationship shows as Healthy and that the last sync completed within the expected replication interval.
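The check in 1.1 can also be scripted once the replication status has been exported as JSON. The following is a minimal sketch, not a definitive implementation. It assumes the status object carries `healthy` and `mirrorState` fields (as the ANF replication status API is understood to return them; verify against your CLI or SDK version). It also assumes the replication lag in seconds is obtained separately, for example from a monitoring metric. The function name and the one-hour default threshold are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ReadinessResult:
    ready: bool
    reasons: list = field(default_factory=list)


def check_replication_ready(status, lag_seconds, max_lag_seconds=3600):
    """Evaluate a replication status dict before a DR test.

    `status` is assumed to contain `healthy` (bool) and `mirrorState` (str).
    `max_lag_seconds` is an illustrative threshold, not a documented default.
    """
    reasons = []
    if not status.get("healthy", False):
        reasons.append("relationship is not reported as healthy")
    if status.get("mirrorState") != "Mirrored":
        reasons.append(f"mirror state is {status.get('mirrorState')!r}, expected 'Mirrored'")
    if lag_seconds > max_lag_seconds:
        reasons.append(f"replication lag {lag_seconds}s exceeds {max_lag_seconds}s")
    return ReadinessResult(ready=not reasons, reasons=reasons)
```

Any entry in `reasons` is a signal to stop and investigate in the portal before continuing with the drill.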
1.2 Notify Stakeholders About Downtime Impact
- Identify all applications, teams, and business owners impacted.
- Define the DR test window, scope, and expected user impact.
1.3 Identify DNS Records That Need Updating
- List client-facing hostnames used by the share/mount path.
- Locate associated CNAME (preferred) or A records in DNS.
- Document current values and TTL; lower TTL (e.g., 60s) for the test window.
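The DNS inventory from 1.3 can be captured in a small structure so that the original values and TTLs are available for rollback. This is a sketch under stated assumptions: the record names and values in the usage note are hypothetical placeholders, and the code only builds a plan, it does not query or change DNS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsRecord:
    name: str         # client-facing hostname
    record_type: str  # "CNAME" or "A"
    value: str        # current target
    ttl: int          # current TTL in seconds


def plan_ttl_change(records, test_ttl=60):
    """Return one plan entry per record, keeping the original TTL for rollback."""
    plan = []
    for record in records:
        plan.append({
            "name": record.name,
            "type": record.record_type,
            "value": record.value,
            "original_ttl": record.ttl,
            # Never raise a TTL that is already lower than the test value.
            "test_ttl": min(record.ttl, test_ttl),
        })
    return plan
```

For example, `plan_ttl_change([DnsRecord("epic-files.example.internal", "CNAME", "anf-primary.example.internal", 3600)])` yields an entry with `original_ttl` 3600 and `test_ttl` 60, which can be attached to the change ticket.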
2. Begin DR Manual Cutover
ANF Account / Volume: ohemr-anf-epic-pro-cus-001
2.1 Break Replication in DR Region (Destination Volume)
- Go to the DR (secondary) volume in the target region.
- Open the Replication tab and select Break Replication (aka Break Peering).
- Wait for the status to show Replication Broken and volume as Online (writable).
Note: Breaking replication makes the secondary volume writable for DR operations.
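If the break is scripted rather than done in the portal, the command can be assembled and reviewed before anyone runs it. The sketch below only builds the argument list and does not execute it. It assumes that `az netappfiles volume replication suspend` is the Azure CLI counterpart of Break Peering and that it accepts the flags shown; confirm with `az netappfiles volume replication suspend --help` for your CLI version. The resource group, account, pool, and volume values are placeholders supplied by the caller.

```python
def build_break_replication_cmd(resource_group, account, pool, volume):
    """Build (but do not run) the CLI call that breaks replication.

    Must target the destination (DR) volume. Flag names are an assumption
    to verify against the installed Azure CLI.
    """
    return [
        "az", "netappfiles", "volume", "replication", "suspend",
        "--resource-group", resource_group,
        "--account-name", account,
        "--pool-name", pool,
        "--name", volume,
    ]
```

Printing the list with `" ".join(...)` gives a reviewable command line for the change record; passing it to `subprocess.run` would execute it.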
2.2 Reconfigure Protocol Access
- CIFS/SMB: Ensure Active Directory (AD) is configured for the DR region.
- Verify Share Permissions and NTFS ACLs (should be preserved by CRR).
- Test access from a client in the DR network.
2.3 DNS Updates to Point to DR ANF
- Redirect clients to DR endpoint; avoid manual mount path changes.
2.4 Validate DR Access
- Test mounting/accessing the DR volume from multiple clients.
- Verify data consistency at last sync point.
- Ensure application services are functioning from DR site.
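Before mounting from several clients, a quick reachability probe of the DR endpoint can rule out network or firewall problems. This is a minimal sketch using only the standard library. It checks that a TCP connection can be opened (SMB normally listens on port 445 and NFS on port 2049) and nothing more: it does not verify authentication, permissions, or application behavior, so it complements the checks above rather than replacing them.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A typical use would be `port_reachable(dr_hostname, 445)` from a client in the DR network, where `dr_hostname` is the client-facing name identified in step 1.3.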
2.5 Post-DR Operations
- Monitor DR volume capacity and performance.
- Keep primary volume read-only or offline to prevent split-brain writes.
- Plan for reverse replication if primary is restored.
3. Begin Failback Manually
ANF Account / Volume: ohemr-anf-west-epic-pro-wus3-001
3.1 Reverse Resync to Reactivate Source Volume
- Select the source volume in Azure.
- Open Replication and select Reverse Resync.
- Confirm prompt and monitor health status until stable.
3.2 Reestablish Source-to-Destination Replication
- On the destination volume, open Replication.
- Confirm Mirror State is Mirrored and Relationship Status is Idle.
- Select Break Peering and confirm.
- Remount the source volume for client access if necessary.
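The two conditions in 3.2 can be expressed as a guard that a script evaluates before it breaks peering. This is a sketch, assuming the replication status is available as a dict with `mirrorState` and `relationshipStatus` fields; the function name is illustrative.

```python
def can_break_peering(status):
    """Return (allowed, reason) for breaking peering on the destination volume."""
    mirror_state = status.get("mirrorState")
    relationship = status.get("relationshipStatus")
    if mirror_state != "Mirrored":
        return False, f"Mirror State is {mirror_state!r}, expected 'Mirrored'"
    if relationship != "Idle":
        return False, f"Relationship Status is {relationship!r}, expected 'Idle'"
    return True, "ok"
```

If the guard returns `False`, wait and re-check rather than forcing the break; a status of `Transferring`, for instance, means a sync is still in flight.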
3.3 Resync the Source Volume with the Destination Volume
- On the destination volume, select Reverse Resync to restore normal replication direction.
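Because the failback steps in section 3 must run in order, a small tracker can stop an operator or script from skipping ahead. The sketch below mirrors steps 3.1 to 3.3 as described in this runbook; the step identifiers and class name are illustrative, and the tracker records progress only, it performs no Azure operations.

```python
FAILBACK_STEPS = (
    "reverse_resync_on_source",       # 3.1
    "break_peering_on_destination",   # 3.2
    "reverse_resync_on_destination",  # 3.3
)


class FailbackTracker:
    """Record completed failback steps and reject out-of-order completion."""

    def __init__(self):
        self.completed = []

    def next_step(self):
        """Return the next expected step, or None when failback is complete."""
        if len(self.completed) < len(FAILBACK_STEPS):
            return FAILBACK_STEPS[len(self.completed)]
        return None

    def complete(self, step):
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed.append(step)
```

Calling `complete("break_peering_on_destination")` on a fresh tracker raises `ValueError`, since the reverse resync on the source has not yet been recorded.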
Healthcare-Specific Considerations
- PHI Data: Ensure DR/Failback processes do not expose PHI to unauthorized environments.
- Audit Trail: Retain logs of DR events for compliance (HIPAA, SOX).
- Testing Frequency: Schedule DR tests per regulatory and internal policy.
Implementation Guidelines
Azure Portal Operations
- Use the Replication tab for all ANF volume replication actions.
- Confirm volume status after each step before proceeding.
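When steps are scripted, "confirm status before proceeding" usually becomes a bounded polling loop. The following is a generic sketch: `predicate` is any caller-supplied function that returns `True` once the volume has reached the expected state, and the default timeout and interval are illustrative values, not recommendations from Azure.

```python
import time


def wait_until(predicate, timeout_s=600.0, interval_s=15.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or the timeout elapses.

    Returns True on success and False on timeout. `sleep` and `clock`
    are injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A `False` result should halt the runbook at that step rather than letting the script continue to the next operation.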
DNS Management
- Lower DNS TTL before DR events for faster client redirection.
- Document changes and revert TTL to standard value after event.
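How far in advance to lower the TTL follows from how resolver caching works: a resolver that cached the record just before the change may keep serving it for up to the original TTL. The sketch below computes the latest safe time to make the change. The function name is illustrative, and the date in the usage note is an arbitrary example.

```python
from datetime import datetime, timedelta


def latest_ttl_lowering_time(window_start, original_ttl_s):
    """Latest time to lower the TTL so caches have expired by window_start."""
    return window_start - timedelta(seconds=original_ttl_s)
```

For a window starting at `datetime(2025, 1, 1, 12, 0)` and an original TTL of 3600 seconds, the TTL should be lowered no later than 11:00 that day; lowering it earlier adds margin.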
Access Validation
- Test both Windows (CIFS/SMB) and Linux (NFS) clients as applicable.
- Validate application-level access, not just share mounts.
Monitoring & Reporting
- Replication Health: Use Azure Portal or CLI to monitor replication status.
- Capacity/Performance: Monitor DR volumes for IOPS, latency, and space utilization during DR event.
- Event Logging: Track all steps and changes for audit purposes.
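One simple way to track steps for audit is an append-only log with one JSON object per line. This is a minimal sketch, assuming a local file is an acceptable interim store before entries are archived; the field names and file layout are illustrative, and entries should contain no PHI.

```python
import json
from datetime import datetime, timezone


def audit_entry(step, operator, outcome, detail=""):
    """Build one JSON-encoded audit record with a UTC timestamp."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "operator": operator,
        "outcome": outcome,
        "detail": detail,
    }
    return json.dumps(record, sort_keys=True)


def append_audit(path, entry):
    """Append one entry as a single line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
```

For example, `append_audit("dr-test.log", audit_entry("2.1 Break Replication", "operator-id", "success"))` records the step with the time it was completed.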
Compliance Validation
- DR Drill Evidence: Archive runbook execution logs and notifications.
- PHI Handling: Ensure no DR operations violate HIPAA or internal data handling policies.
- Change Control: All DNS and storage changes must follow established change management procedures.
Troubleshooting Guide
Common Issues
| Problem | Diagnosis | Resolution |
|---|---|---|
| Replication fails to break | Volume busy or Azure API issue | Retry; check portal for active connections or errors |
| Access fails in DR | AD not configured or permissions not synced | Validate AD integration, review ACLs |
| DNS redirection delayed | TTL not lowered or cached values | Lower TTL in advance; flush client DNS cache |
| Split-brain risk | Primary is still writable | Set primary to read-only/offline before DR cutover |
Related Documentation
- Epic Architecture Requirements: Storage and DR architecture details
- Operational Procedures: Standard operating procedures for Epic on Azure
- Security Baseline: Security and compliance controls for DR operations
Support & Contacts
| Domain | Contact | Responsibility |
|---|---|---|
| ANF/DR Operations | [email protected] | Storage operations and runbook execution |
| DNS Management | [email protected] | DNS changes and troubleshooting |
| Compliance Audit | [email protected] | DR test evidence and regulatory reporting |
| Epic Application | [email protected] | Application validation in DR/failback |
DR Excellence: Reliable DR and failback processes minimize downtime, ensure data integrity, and support regulatory compliance for Epic healthcare infrastructure.