Switching to a new hard drive without a RAID controller and without a reboot

Suppose you have a Linux server without a hardware RAID controller that you never want to shut down, and one of its disks starts to fail. If, at a minimum, the following conditions are met:

  • All data on the dying disk can still be read, even if slowly;
  • The hardware supports “hot” plugging of SATA or SAS drives (or whatever interface is in use);
  • There is a free slot for connecting a disk;
  • The disk layout uses LVM,

then you can replace the disk “on the fly”, without rebooting the server.

How can this be done?

This is exactly the story that happened to one of the servers in my care. What follows is the whole process of “open-heart surgery” on that server. At one point, one of the installed drives began to crumble, as they say:

HDD1 - SATA 500GB - SMART status BAD - needs to be replaced 
HDD2 - SATA 1000GB - SMART status GOOD
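
Incidentally, the same verdict can be cross-checked on the host itself with smartctl from the smartmontools package (the device name here matches our case; install the package if it is missing):

root@localhost~# smartctl -H /dev/sda
root@localhost~# smartctl -A /dev/sda | grep -i -e Reallocated -e Pending

The first command prints the overall health verdict; the second filters out the reallocated and pending sector counters, which are usually the first to grow on a crumbling disk.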

Naturally, we needed a healthy replacement disk no smaller than the one being replaced – a larger one was fine too.

First, the current partition geometry was examined:

root@localhost~# lsblk
   NAME                                    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
   sda                                      8:0    0  500G  0 disk 
   ├─sda1                                   8:1    0    2M  0 part 
   ├─sda2                                   8:2    0    1G  0 part /boot
   ├─sda3                                   8:3    0  498G  0 part 
   │ ├─centos-root                        253:0    0  120G  0 lvm  /
   │ └─centos-swap                        253:1    0   32G  0 lvm  [SWAP]

We can see that the 500G device is named ‘sda’, and it is the one that needs to be replaced. Next, we looked at the state of the volume group:

root@localhost~# vgdisplay 

  --- Volume group ---
  VG Name               centos
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               498,00 GiB
  PE Size               4,00 MiB
  Total PE              127488
  Alloc PE / Size       127488 / 498,00 GiB
  Free  PE / Size       0 / 0
  ---------------//------------------------------
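
Alongside vgdisplay, the terser pvs command shows the picture per physical volume, which makes it obvious which device backs the group (output abridged to what we already know from above):

root@localhost~# pvs
  PV         VG     Fmt  Attr PSize   PFree
  /dev/sda3  centos lvm2 a--  498,00g    0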

It can be seen that the volume group is called ‘centos’, along with other statistics. A replacement drive was connected; it received the name ‘sdc’ (it could just as well have been ‘sdd’, depending on which letter was next free). The “dying” disk has three partitions (sda1, sda2 and sda3).
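
A quick way to find out which name the kernel assigned to the freshly plugged drive is to glance at the kernel log and the block device list right after connecting it:

root@localhost~# dmesg | tail
root@localhost~# lsblk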

To replace a drive given over entirely to LVM, two steps are enough:

1. ‘vgextend centos /dev/sdc’ – add the working disk to the volume group; the operation takes no more than a few seconds.

2. ‘pvmove /dev/sda’ followed by ‘vgreduce centos /dev/sda’ – migrate the data off the failed disk and then remove it from the volume group; this takes much more time.

In fact, the LVM subsystem transfers all allocated extents to the new disk while everything stays online; nothing needs to be unmounted or disconnected. Monitoring the I/O subsystem will show almost 100% load on both disks for the entire copy period (a way to watch this is sketched after the list below). In our case, there were also two non-LVM partitions (sda1 and sda2), so we additionally had to:

3. create them and copy their contents manually, and also

4. reinstall the bootloader onto the new disk – after all, sooner or later the server will have to be rebooted, at least after a kernel update.
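
As for the disk load mentioned above: the migration can be watched from another terminal, for example with iostat from the sysstat package (the device names and the two-second interval are just an illustration):

root@localhost~# iostat -x sda sdc 2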

Next, we looked at the current partition table using the sfdisk command:

root@localhost~# sfdisk -d /dev/sda
# partition table of /dev/sda
unit: sectors
/dev/sda1 : start=     2048, size=      4096, Id=83
/dev/sda2 : start=     6144, size=   2097152, Id=83, bootable
/dev/sda3 : start=  2103296, size=1044381696, Id=8e
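
As a side note: when the new disk is at least as large as the old one, this dump can simply be replayed onto it with sfdisk instead of partitioning by hand; the last partition then keeps the dumped size rather than growing to fill the disk:

root@localhost~# sfdisk -d /dev/sda | sfdisk /dev/sdc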

In our case, three partitions were created on the new disk with fdisk: 2 MB, 1 GB, and all the remaining free space (fdisk, unlike parted, works out the size of the last partition by itself). Then the /boot partition was unmounted:

root@localhost~# umount /boot/

Next, the contents of partitions 1 and 2 were copied with the dd utility:

root@localhost~# dd if=/dev/sda1 of=/dev/sdc1 status=progress
root@localhost~# dd if=/dev/sda2 of=/dev/sdc2 status=progress
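
Before switching over, it does not hurt to verify the copies; if the new partitions were created with exactly the same sizes, cmp on the block devices should print nothing at all:

root@localhost~# cmp /dev/sda1 /dev/sdc1
root@localhost~# cmp /dev/sda2 /dev/sdc2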

The /boot partition was then mounted from the new disk:

root@localhost~# mount /dev/sdc2 /boot/
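
A note in passing: since dd copies the filesystem byte for byte, its UUID is preserved, so a UUID-based /etc/fstab entry for /boot keeps working; an entry written as /dev/sda2 would have to be edited by hand. Both things are easy to check:

root@localhost~# grep boot /etc/fstab
root@localhost~# blkid /dev/sdc2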

Then the bootloader was reinstalled onto the new disk:

root@localhost~# grub-install /dev/sdc
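
A caveat on the command name: on CentOS 7, which the default “centos” volume group name suggests, GRUB 2 ships with a grub2- prefix, so the equivalent call would be:

root@localhost~# grub2-install /dev/sdc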

The third partition was then initialized as an LVM physical volume:

root@localhost~# pvcreate /dev/sdc3

The new physical volume was added to the volume group:

root@localhost~# vgextend centos /dev/sdc3

Don’t forget to check what the volume group looks like now:

root@localhost~# vgdisplay 

  --- Volume group ---
  VG Name               centos
  System ID             
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               996,00 GiB
  PE Size               4,00 MiB
  Total PE              254976
  Alloc PE / Size       127488 / 498,00 GiB
  Free  PE / Size       127488 / 498,00 GiB

Finally, the data was migrated off our “dying” drive and the drive was removed from the group. A plain vgreduce refuses to release a physical volume that still holds allocated extents, so the extents are first relocated with pvmove (the Free PE line above shows exactly enough room for them):

root@localhost~# pvmove /dev/sda3
root@localhost~# vgreduce centos /dev/sda3

The pvmove step lasts the longest, since it physically copies every allocated extent to the new disk. Once it was done, the check was run again:

root@localhost~# vgdisplay 

  --- Volume group ---
  VG Name               centos
  ----------//----------------------
  VG Size               498,00 GiB
  PE Size               4,00 MiB
  Total PE              127488
  Alloc PE / Size       127488 / 498,00 GiB
  Free  PE / Size       0 / 0

The old disk is removed from the volume group and can be disconnected.
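
Optionally, before physically pulling the drive, the LVM label can be wiped from it and the kernel asked to detach the device (the sysfs path below is the standard one for SCSI/SATA disks; triple-check the device letter first):

root@localhost~# pvremove /dev/sda3
root@localhost~# echo 1 > /sys/block/sda/device/delete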