Updated July 3, 2026

Solving Linux Boot Problems in Azure

Purpose

This document outlines the process for solving boot problems when a VM is stuck in GRUB or some other state. Because we do not allow console login and there is no root password, there is no way to boot to single-user mode and fix a boot issue directly. In some cases we have been able to adjust the settings on the disk and repair the system. In others, the better choice is to destroy the VM and re-create it.

Last update: 7/12/2025 - Randy Olinger

Overview

The basic process is to create a new copy of the boot disk (so as not to risk destroying the original) and mount that copy on a different running server so repairs can be done. The repairs may be as simple as commenting out lines in /etc/fstab or as complex as fixing grub.

In some cases it is just easier to destroy the server and have it recreated from Terraform. However, if the server is a highly customized snowflake, rebuilding costs more and the trade-off is harder to judge. In short, the admin will have to make the call on whether fixing the VM is worth it.

The admin following this guide is expected to have basic Linux skills for things like managing disks with LVM, editing files with VIM (or something else), and watching boot logs in the Azure portal. In addition to Linux skills, certain Azure skills such as creating and attaching disks, building and destroying VMs, and setting tags are necessary.

Snapshots and Clones and Backups

OK, if you are not scared away, the first step is to make a copy of the broken boot disk. For the examples we will use a server named serfoo and a resource group named rgbar. All of the steps in the Azure portal can also be done from the command line, so feel free to do so. I will use either in this guide, leaning more toward the command line.

If the problem you are trying to solve happened in the last 7 days or so, you can probably recover from a backup of the OS disk (<servername>_os). If you are unsure when the problem happened, you can clone the existing disk to a new managed disk and make the necessary changes to that.

To perform these tasks, you must activate your JIT with Contributor role for the subscription.

A note about cleaning up. During this process you will be creating many different Azure resources (servers, disks, snapshots, etc.). It is imperative that you clean up afterward and remove anything that was not there before you started. This is work outside of IAAC, so be mindful that any garbage created will not be cleaned up by automation. We have logs so we will see who is not being tidy...

Restore from Backup

Navigate to the server in the Azure portal. In the left navigation menu, find the "Backup + disaster recovery" menu and choose "Restore point".

(Screenshot: Backup menu)

You should see a list of restore points, pick one and click on it to see a list of disks that have been backed up. To the right of the OS disk there is a link named "Create disk from a restore point". Click on that for the OS Disk Only.

(Screenshot: Restore points)

From here, fill in the following fields:

Resource Group (make sure to change this to where you want it, typically the same as the original disk. Do not take the default.)
Disk Name. Make something up and put the word "restore" in it, like zwclodb302-restore-20250702
Advance to the Networking tab and be sure to choose "Disable public and private access"
Advance to the end and create the disk. The restore takes about 60 seconds.

Congratulations! You now have a copy of the boot disk to mount on an alternate server and make changes without risking the original.

Cloning a Disk

We are doing this from command line... Much faster...

First, set some variables to make your typing easier, in particular the original server name, resource group, and subscription. Then log in to Azure using your favorite command line method. Make sure to change the examples below to match your situation.

server="<servername>"   #Example zwplodbew301
location="westus3"      #Make sure to set correctly
rg="<resourceGroup>"     #Example ohemr-rg-west_epic_odb-pro-wus3-001
sub="<subscription>"    #Example OHEMR-SUB-EPIC-PRO-001
today=`date +%Y%m%d`
az login --use-device-code
aideId=UHGMR0000
environment=dev
serviceTier=p3
  # Find the NIC, then the subnet it is attached to.
  # jq -r prints the raw string, and quoting the filter keeps the shell from globbing [0].
NICID=$(az vm show -n $server -g $rg --subscription $sub \
     | jq -r '.networkProfile.networkInterfaces[0].id')
subnet=$(az network nic show --ids $NICID  \
     | jq -r '.ipConfigurations[0].subnet.id')

## Get the SKU of the OS disk
SKU=$(az disk show --name ${server}_os -g $rg --subscription $sub  \
     | jq -r '.sku.name')

### May need to adjust image name over time
image="/subscriptions/0ce4d167-b6f8-429d-8c0d-7637d25fd659/resourceGroups/\
prod_golden_image_azu_gallery/providers/Microsoft.Compute/galleries/\
prod_golden_image_azu_gallery/images/RHEL_9_G2/versions/latest"
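Every later command depends on these variables, and jq prints the literal string null when a field is missing, so a quick check before continuing can save a confusing failure further down. The helper below is a minimal sketch; the name require_vars is hypothetical and not part of the original procedure.

```shell
# Hypothetical helper: fail if any named variable is empty or the string "null".
require_vars() {
  local name value missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ] || [ "${value}" = "null" ]; then
      echo "not set: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example call after the variable block has been pasted:
# require_vars server location rg sub today NICID subnet SKU image || echo "fix the variables first"
```

The function only reads the variables, so it is safe to run as often as needed.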

Enter your favorite SSH public key. Don't forget the quotes... This will be used later when logging into the surrogate server.

pub="$(cat ~/.ssh/id_rsa.pub)"

Shut down the server so we can get a consistent snapshot.

az vm deallocate --subscription $sub -g $rg -n $server
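The snapshot is only consistent if the VM has really reached the deallocated state. As a sketch, assuming the same variables as above, the power state can be read back with az vm show and its --show-details flag; the function names here are hypothetical.

```shell
# Print the power state Azure reports for a VM, for example "VM deallocated".
# $1 = VM name, $2 = resource group, $3 = subscription
vm_power_state() {
  az vm show --show-details --name "$1" --resource-group "$2" \
     --subscription "$3" --query powerState --output tsv
}

# Succeed only when the VM is fully deallocated.
vm_is_deallocated() {
  [ "$(vm_power_state "$@")" = "VM deallocated" ]
}

# Example:
# vm_is_deallocated "$server" "$rg" "$sub" && echo "safe to snapshot"
```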

Run the command to create a snapshot of the OS disk. (You will need the name of the disk, which is normally <servername>_os.)

original=${server}_os
az snapshot create -g ${rg} -n ${original}_restoreSnap_${today}  \
--source ${original} --sku $SKU \
--subscription ${sub}

When this finishes, we have a safe snapshot. Let's create a disk from the snapshot.

az disk create -g ${rg} --subscription ${sub} \
-n ${original}_restoreDisk_${today} \
--source ${original}_restoreSnap_${today} \
--sku $SKU \
--zone 1

Congratulations! You now have a copy of the boot disk to mount on an alternate server and make changes without risking the original.

Mount to Donor Machine

For this guide, we create a VM to use for manipulating the copy of the boot disk, which will eventually become the boot disk of the broken machine. A storage account is needed in order to see boot diagnostics.

Create a storage account

OPTUM_PublicIP="168.183.0.0/16 149.111.0.0/16 128.35.0.0/16 161.249.0.0/16 \
198.203.174.0/23 198.203.176.0/22 198.203.180.0/23"

az storage account create --subscription ${sub} \
--location $location  \
--name  ${server}${today}rest   --resource-group ${rg} \
--min-tls-version TLS1_2 --allow-shared-key-access "true" \
--public-network-access "Enabled" \
--access-tier Hot \
--allow-blob-public-access "false" \
--https-only "true" \
--bypass AzureServices \
--default-action Deny  \
--tags aide-id=$aideId environment=$environment service-tier=$serviceTier

for i in ${OPTUM_PublicIP}
do
  az storage account network-rule add --ip-address ${i} \
--account-name ${server}${today}rest \
--subscription ${sub}
  echo ${i} added
done
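The storage account name above is built from ${server}${today}rest, and Azure only accepts storage account names of 3 to 24 characters made of lowercase letters and digits. With a 12-character server name the result is exactly 24 characters, so a longer server name will be rejected. A small pre-check, sketched here with a hypothetical function name, can be run before the create command:

```shell
# Succeed if $1 is a legal Azure storage account name:
# 3 to 24 characters, lowercase letters and digits only.
valid_storage_account_name() {
  local LC_ALL=C          # make the a-z range mean plain ASCII lowercase
  case "$1" in
    *[!a-z0-9]*) return 1 ;;
  esac
  [ "${#1}" -ge 3 ] && [ "${#1}" -le 24 ]
}

# Example:
# valid_storage_account_name "${server}${today}rest" || echo "pick a shorter name"
```

If the check fails, shortening the suffix (or the date) is one option, as long as the same name is then used in every later command, including the cleanup.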

The first command below creates the recovery virtual machine, which is used to make modifications to the boot disk of the broken server. The second command attaches the cloned disk to it as a data disk.

az vm create -n ${server}_myRestoreMachine -g ${rg} \
--subscription ${sub} --image ${image} \
--admin-username locadm --ssh-key-values "${pub}" \
--license-type RHEL_EUS  \
--size Standard_D4s_v5 --authentication-type ssh \
--nic-delete-option Delete \
--public-ip-address '' \
--subnet ${subnet}  \
--nsg-rule NONE \
--tags aide-id=$aideId service-tier=$serviceTier environment=$environment \
--boot-diagnostics-storage ${server}${today}rest

az vm disk attach --vm-name ${server}_myRestoreMachine \
--resource-group ${rg} --subscription ${sub} \
--name ${original}_restoreDisk_${today}

Fixing the boot disk (or other areas)

You should now have a running VM with the cloned disk attached. The disk is not yet usable because its LVM names and UUIDs conflict with those of the recovery VM's own root disk. Here are the next steps to access the files on the cloned disk.

Use your CloudSAW workstation to log in with your key. The IP address of the recovery VM is in the output of the az vm create command above. The username is locadm.

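Now you need to figure out which disk to work on. The command below shows the block device tree. The cloned disk usually appears as /dev/sdb or /dev/sdc, and the partition that holds rootvg is usually partition 4 (for example /dev/sdb4), but you must verify. The commands that follow assume /dev/sdb; substitute the device you actually see.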

sudo lsblk
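To narrow the lsblk output down to candidates, it can help to list only the partitions that carry an LVM physical volume. The sketch below uses standard lsblk options (-r raw, -n no header, -p full paths, -o column list); the function name is hypothetical. Expect two results, one for the recovery VM's own root disk and one for the clone, and pick the one that is not backing the running system.

```shell
# Print the full device path of every partition whose filesystem type is LVM2_member.
list_lvm_partitions() {
  lsblk -rnpo NAME,FSTYPE | awk '$2 == "LVM2_member" { print $1 }'
}

# Example:
# list_lvm_partitions
```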

Next, we need to fix the PV so that its UUID does not conflict with the PV of the recovery VM's own image.

sudo vgimportclone  /dev/sdb4

This will change the UUID on the PV and also rename the rootvg to rootvg1. (This will need to be undone later.)

Now that we have a functional PV and VG, let's activate the LVs from the cloned disk.

sudo vgchange -ay rootvg1

ls -l /dev/rootvg1/*

You should see some logical volumes. Look in /dev/mapper for homelv, rootlv, tmplv, usrlv, and varlv.

Next, mount the clone's rootlv on the recovery system using the following commands:

sudo mkdir /mnt/{otherroot,otherboot}
sudo mount -o nouuid /dev/mapper/rootvg1-rootlv /mnt/otherroot
sudo mount -o nouuid /dev/mapper/rootvg1-varlv /mnt/otherroot/var
sudo mount -o nouuid /dev/mapper/rootvg1-usrlv /mnt/otherroot/usr
sudo mount -o nouuid /dev/sdb2 /mnt/otherboot
sudo mount /dev/sdb1 /mnt/otherboot/efi

cd /mnt/otherroot

Now you can see these files and make changes. Some things you may have to do...

clean up /mnt/otherroot/var/log/audit
make a change to /mnt/otherroot/etc/audit/auditd.conf, set max_log_file_action = rotate
fix the grub problem by copying the necessary file from the restoration system's /boot/efi/EFI/redhat to the clone's EFI partition, which the commands above mount at /mnt/otherboot/efi (This is under development and has only been done once)
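For the auditd.conf change in the list above, a scripted edit avoids typos. This is a minimal sketch with a hypothetical function name: it keeps a .bak copy next to the file and rewrites the original in place, so the file's ownership and permissions are untouched. Run it with sudo on the mounted clone, and remove the .bak file once the result has been checked.

```shell
# Set max_log_file_action = rotate in the given auditd.conf.
# $1 = path to auditd.conf, e.g. /mnt/otherroot/etc/audit/auditd.conf
set_audit_rotate() {
  cp -p "$1" "$1.bak" &&
  sed 's/^[[:space:]]*max_log_file_action[[:space:]]*=.*/max_log_file_action = rotate/' \
      "$1.bak" > "$1" &&
  grep '^max_log_file_action' "$1"   # show the resulting line
}

# Example:
# set_audit_rotate /mnt/otherroot/etc/audit/auditd.conf
```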

Detach and migrate the disk back

First unmount all the drives that you mounted

cd ${HOME}

sudo umount /mnt/otherroot/var
sudo umount /mnt/otherroot/usr
sudo umount /mnt/otherroot

sudo umount /mnt/otherboot/efi
sudo umount /mnt/otherboot


### logout
exit
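One check worth running on the recovery VM before that final exit: confirm that nothing from the clone is still mounted. The sketch below reads the kernel mount table; the function name is hypothetical. Deactivating the volume group afterward with vgchange -an is a common extra precaution before a disk is detached, not a step from the original procedure.

```shell
# Print every mount point that starts with the given prefix.
# $1 = path prefix, $2 = mount table to read (defaults to /proc/mounts)
mounts_under() {
  awk -v p="$1" 'index($2, p) == 1 { print $2 }' "${2:-/proc/mounts}"
}

# Example: deactivate the clone's VG only when nothing is left mounted.
# [ -z "$(mounts_under /mnt/other)" ] && sudo vgchange -an rootvg1
```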

Switch Back to the Azure Commands

Now detach the disk from the recovery VM

az vm disk detach \
-g ${rg} --subscription ${sub} \
--vm-name ${server}_myRestoreMachine \
--name ${original}_restoreDisk_${today}

Now swap the restoreDisk to be the boot disk of the broken VM

  # Make sure the broken VM is deallocated before swapping its OS disk

az vm deallocate   --resource-group ${rg} \
--name ${server} \
--subscription ${sub}

  # Swap the disk to be the boot disk
az vm update \
--resource-group ${rg} \
--name ${server} \
--os-disk ${original}_restoreDisk_${today} \
--subscription ${sub}

Boot the machine and look at the serial log

az vm start -g ${rg} --subscription ${sub} -n ${server}

The boot won't work at first, because we renamed the root volume group. After a long wait it should give you a root prompt in the Serial Console, at the dracut prompt. This is where we rename the VG back to its original name so the server can reboot.

dracut> lvm vgrename rootvg1 rootvg
dracut> reboot

If the boot works, shut down and clone the disk to the standard name <servername>_os. This is a lot of steps, but they repeat the operations used above to clone the original disk. One final thing to check: make sure the SKU for the new disk matches the original disk, which is usually PremiumV2_LRS, but not always. If it does not match, the VM may be destroyed by Terraform the next time the workspace is run.

  # Stop both broken VM and the restore VM in case it is running
az vm deallocate  -g ${rg} --subscription ${sub} -n ${server}

az vm deallocate -g ${rg} --subscription ${sub} -n ${server}_myRestoreMachine

## delete the original OS disk
az disk delete  --name ${original} --yes \
--resource-group ${rg} --subscription ${sub}

  # Create a snap of the working boot disk

az snapshot create -g ${rg} -n ${original}_finalSnap_${today}  \
--source ${original}_restoreDisk_${today} \
--subscription ${sub} \
--sku $SKU

  # Create a disk from the working snap

az disk create -g ${rg} --subscription ${sub} \
-n ${original} \
--source ${original}_finalSnap_${today}   \
--sku $SKU \
--zone 1

  # Swap this disk to the original broken VM.

az vm update \
--resource-group ${rg} \
--name ${server} \
--os-disk ${original} \
--subscription ${sub}

az vm start   --resource-group ${rg} \
--name ${server} \
--subscription ${sub}
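Because a SKU mismatch can lead Terraform to destroy the VM, it is worth reading the SKU of the final disk back and comparing it with the value captured in $SKU at the start. This is a sketch under the same variable assumptions as the rest of the guide; the function names are hypothetical.

```shell
# Print the SKU name of a managed disk, for example PremiumV2_LRS.
# $1 = disk name, $2 = resource group, $3 = subscription
disk_sku() {
  az disk show --name "$1" --resource-group "$2" --subscription "$3" \
     --query sku.name --output tsv
}

# Succeed only when the disk's SKU equals the expected value in $4.
sku_matches() {
  [ "$(disk_sku "$1" "$2" "$3")" = "$4" ]
}

# Example:
# sku_matches "${original}" "$rg" "$sub" "$SKU" || echo "SKU differs from the original disk"
```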

Clean up and delete all the resources

This is just a long string of deletes. Probably best to do them one at a time until you are sure they all work correctly.

  # Snapshots
for snap in ${original}_restoreSnap_${today} \
            ${original}_finalSnap_${today}
  do
    az snapshot delete -g ${rg} -n ${snap}  \
--subscription ${sub} --no-wait
  done

  # Disks
for disk in ${original}_restoreDisk_$today
  do
    az disk delete -g ${rg} -n ${disk}  \
--subscription ${sub} --yes --no-wait
  done

  # Virtual Machine
az vm delete -y \
-n ${server}_myRestoreMachine -g ${rg} \
--subscription ${sub} --no-wait

  # Storage Account
az storage account delete --yes \
--subscription ${sub} \
--name  ${server}${today}rest   \
--resource-group ${rg}

  # NSG for VM
az network nsg delete \
--name ${server}_myRestoreMachineNSG \
--resource-group ${rg} \
--subscription ${sub} --no-wait

Double Check

Make sure nothing was missed in cleanup. These commands can run without creating errors and still fail. There may be a boot disk from the recovery machine left around...
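One way to double check is to list what is left in the resource group and filter for the names this guide creates. The sketch below assumes the naming used above (restoreSnap, finalSnap, restoreDisk, myRestoreMachine, and the ${server}${today}rest storage account); the function name is hypothetical. It prints name and type for each match and returns non-zero when nothing is found.

```shell
# List leftover recovery resources in a resource group.
# $1 = server name, $2 = resource group, $3 = subscription, $4 = date suffix used in names
list_leftovers() {
  az resource list --resource-group "$2" --subscription "$3" \
     --query "[?contains(name, '$1')].[name, type]" --output tsv \
    | grep -i -e 'restore' -e 'finalsnap' -e "${1}${4}rest"
}

# Example:
# list_leftovers "$server" "$rg" "$sub" "$today" || echo "nothing left behind"
```

Anything it prints, such as the recovery machine's own OS disk or NIC, still needs to be deleted by hand.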

Boot Surgery

This section is for a more esoteric problem where we need to find a way to fix the boot volume in other ways. This is a quick way to fix the "Snapshot Volume won't boot" problem, normally identified by error messages in the console and a non-functioning network.

In essence, it is the following steps:

Reboot the machine into recovery mode
Mount the core file systems
chroot into the mounted root file system (/sysroot)
edit the fstab with vi, save
exit chroot and reboot

Reboot into recovery mode (Break Shell)

This is tricky, and will only work if the server already has recovery enabled or at least has grub authentication disabled. (To disable it, truncate the file /boot/grub2/user.cfg and run grub2-mkconfig. Of course, this has to be in place before the crisis...)

If the server is powered on but you cannot log in, the best way to get to the grub menu is to go to the serial console and use the menu to do a "Hard Reset" which is equivalent to a power off with no shutdown. Then press either the up/down arrows or the ESC key if you see the server counting down from 10. This can be tricky but if you catch it, you will see a screen like this...

(Screenshot: grub menu)

Arrow down to select "pre-mount break shell" and press enter. The system should do a very quick boot and stop at a shell prompt. Copy and paste the following commands to the prompt:

mount /dev/mapper/rootvg-rootlv /sysroot
for i in var home usr
  do mount /dev/mapper/rootvg-${i}lv /sysroot/$i
done

for m in dev sys run proc
  do mount -o bind /$m /sysroot/$m
done
chroot /sysroot

This will give you a shell prompt in a chrooted environment. You can now edit and save the fstab:

vi /etc/fstab

Save the file, exit the chroot with 'exit' and reboot. This clears up the broken fstab left by the snapshot script.
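If the offending fstab entry is already known, the edit can also be scripted instead of done in vi. The sketch below comments out the entry for one mount point and keeps a .bak copy; the function name and the /data mount point in the example are hypothetical, so substitute the entry that is actually blocking the boot.

```shell
# Comment out the fstab entry for one mount point.
# $1 = fstab path, $2 = mount point to disable
comment_out_mount() {
  cp -p "$1" "$1.bak" &&
  awk -v mp="$2" '
    /^[[:space:]]*#/ { print; next }          # leave existing comments alone
    $2 == mp         { print "#" $0; next }   # disable the matching entry
                     { print }                # keep every other line unchanged
  ' "$1.bak" > "$1"
}

# Example, from inside the chroot:
# comment_out_mount /etc/fstab /data
```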

Reboot into recovery mode (Without Break Shell)

This procedure requires that the grub authentication has been disabled on your server previously.

If your grub menu does not have a line for "Break Shell", you can still break by arrowing to your kernel line and pressing 'e' to enter edit mode. Then arrow to the "linux" line, press Ctrl-e to go to the end of the line and add the option rd.break, then press Ctrl-x to boot. The system will stop at a root prompt, as above.

(Screenshot: grub edit)