
    In this article, we'll show you how to identify a defective hard disk on a Linux Dedicated Server with software RAID and prepare the server for the replacement of the defective disk.

    Please Note

    This article assumes you have basic knowledge of server administration with Linux. If you have any questions regarding the replacement of a defective hard disk or need assistance, please contact IONOS Customer Service.

    In order to ensure the highest possible reliability, it is necessary that you monitor the software RAID of your Dedicated Server. If you discover that a hard disk is defective or you receive a notification email about a defective hard disk, you must contact IONOS Customer Service to arrange for the hard disk to be replaced. This requires that you identify the defective hard disk and prepare the server to replace the defective disk.

    Attention

    RAID systems allow for greater fail-safety and/or speed. However, they are not a substitute for regular backups. To avoid data loss, we recommend that you back up regularly. Also, be sure to back up before performing the steps below to ensure the safety of your data.

    Checking the Status of the Software RAID

    To check the status of the software RAID, enter the following command in the shell:

    [root@host ~]# cat /proc/mdstat

    If both disks are present and mounted correctly, the following message is displayed:

    [root@localhost ~]# cat /proc/mdstat

    Personalities : [raid1]
    read_ahead 1024 sectors
    md2 : active raid1 sda3[1] sdb3[0]
    262016 blocks [2/2] [UU]

    md1 : active raid1 sda2[1] sdb2[0]
    119684160 blocks [2/2] [UU]

    md0 : active raid1 sda1[1] sdb1[0]
    102208 blocks [2/2] [UU]

    unused devices: <none>

    The above example shows three multiple devices or logical drives (md0, md1, md2). For each of these logical drives, it is indicated which partitions they are composed of and on which drives these partitions are located.

    Example: The logical drive md0 is composed of the partitions sda1 and sdb1.

    In the line below each logical drive, the state of the individual partitions is shown in square brackets at the end of the line. A U means that the respective partition is active (up) in the RAID. An underscore (_) means that the partition is missing or has failed.

    In the following example, each logical drive has only one active partition, which is located on the hard disk sda. The corresponding partition on the second hard disk sdb is no longer part of the RAID, which you can also recognize by the entry [U_]. The missing partitions of the hard disk sdb indicate an error or a defect with this hard disk.

    [root@localhost ~]# cat /proc/mdstat

    Personalities : [raid1]
    read_ahead 1024 sectors
    md0 : active raid1 sda1[1]
    102208 blocks [2/1] [U_]

    md1 : active raid1 sda2[1]
    119684160 blocks [2/1] [U_]

    md2 : active raid1 sda3[1]
    262016 blocks [2/1] [U_]

    unused devices: <none>
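    The check described above can also be scripted. The following is a minimal sketch, not an official IONOS tool: the function name degraded_arrays is hypothetical, and it assumes the /proc/mdstat layout shown in the examples, where the status brackets ([UU], [U_]) appear on the line after the array name.

```shell
#!/bin/sh
# Print the name of every md array whose status brackets contain "_",
# i.e. at least one member is missing or failed.
# Reads /proc/mdstat by default; another file can be passed for testing.
degraded_arrays() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sda1[1]"
        $1 ~ /^md[0-9]+$/ && $2 == ":" { name = $1 }
        # Status line, e.g. "102208 blocks [2/1] [U_]"
        /\[[U_]+\]/ && name != "" {
            if ($NF ~ /_/) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}

# Usage (prints nothing if all arrays are healthy):
#   degraded_arrays
```

    For the degraded example above, this sketch would list md0, md1 and md2.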


    In the following example, a defective disk is still mounted in the RAID:

    [root@localhost ~]# cat /proc/mdstat

    Personalities : [raid1]
    md3 : active raid1 sda3[0] sdb3[2](F)
    439553856 blocks super 1.0 [2/1] [U_]
    bitmap: 1/4 pages [4KB], 65536KB chunk

    md1 : active raid1 sdb1[2](F) sda1[0]
    19529600 blocks super 1.0 [2/1] [U_]

    unused devices: <none>

    The entry (F) in this example shows that the partition is marked as faulty.
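    To list the partitions flagged with (F) without reading the output by eye, a small filter can help. This is a sketch under the assumption that member devices appear on the array header line in the form name[index](F), as in the example above. The function name faulty_members is hypothetical.

```shell
#!/bin/sh
# Print "ARRAY PARTITION" for every RAID member marked as faulty (F).
# Reads /proc/mdstat by default; another file can be passed for testing.
faulty_members() {
    awk '
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[.*$/, "", dev)   # strip "[2](F)" to keep the partition name
                    print $1, dev
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}

# Usage:
#   faulty_members
```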

    Error Diagnosis and Finding the Necessary Data for Hard Disk Replacement

    To detect hard disk errors, we recommend that you do the following:

    Install Smartctl, a command-line program for monitoring disks using SMART (Self-Monitoring, Analysis and Reporting Technology). With this program, you can check whether a disk is defective. Smartctl is part of Smartmontools, which are available as packages for many Linux distributions.

    Please Note

    In some cases, a hard disk defect may not be detected by means of the SMART values. Therefore, we recommend that you also analyze the /var/log/messages log file.
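    One possible way to search the log file is sketched below. The search patterns are an assumption: they match kernel messages that contain "I/O error" together with "dev" and the disk name, which is a common but not guaranteed format. The function name disk_io_errors is hypothetical. On systems that do not write /var/log/messages (Ubuntu, for example, uses /var/log/syslog by default), pass the log file as the second argument.

```shell
#!/bin/sh
# Print log lines that report an I/O error for the given disk (e.g. sdb)
# or one of its partitions (e.g. sdb1).
# Arguments: DISK [LOGFILE]; LOGFILE defaults to /var/log/messages.
disk_io_errors() {
    disk="$1"
    logfile="${2:-/var/log/messages}"
    grep "I/O error" "$logfile" | grep -E "dev ${disk}[0-9]*[, ]"
}

# Usage:
#   disk_io_errors sdb
#   disk_io_errors sdb /var/log/syslog
```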

    Install Smartctl

    To install Smartctl, enter the following command:

    CentOS

    yum install smartmontools

    Ubuntu

    sudo apt-get install smartmontools

    Get information about the hard disk

    To access a list of disks, enter the following command:

    smartctl --scan

    Example:

    [root@8E8885C ~]# smartctl --scan

    /dev/sda -d scsi # /dev/sda, SCSI device
    /dev/sdb -d scsi # /dev/sdb, SCSI device
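    If you want to process the scan result in a script, the device paths can be extracted from the first column. The following sketch assumes the output format shown above. The function name scan_devices is hypothetical.

```shell
#!/bin/sh
# Read "smartctl --scan" output on stdin and print one device path per line.
scan_devices() {
    awk '$1 ~ /^\/dev\// { print $1 }'
}

# Usage (run as root): show the health status of every detected disk.
#   smartctl --scan | scan_devices | while read -r dev; do
#       echo "== $dev"
#       smartctl -H "$dev"
#   done
```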

    To access detailed information for error diagnostics, enter the following command:

    smartctl -iHAl error [DISK_NAME]

    Please Note

    Device interfaces must be specified in the following format:

    SCSI / SATA devices:

    smartctl -iHAl error /dev/sd[a-z]

    Example:

    [root@localhost ~]# smartctl -iHAl error /dev/sda

    After entering the command, the following information is displayed, for example:

    [root@8E8885C ~]# smartctl -iHAl error /dev/sda
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-862.14.4.el7.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Device Model:     HGST HUS722T1TALA604
    Serial Number:    WMC6N0K2RW66
    LU WWN Device Id: 5 0014ee 004722db0
    Firmware Version: RAGNWA07
    User Capacity:    1,000,204,886,016 bytes [1.00 TB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   ACS-3 T13/2161-D revision 5
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Fri May  3 07:45:14 2019 UTC
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED     WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always      0
      3 Spin_Up_Time            0x0027   183   183   021    Pre-fail  Always      3833
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always      9
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always      0
      7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always      0
      9 Power_On_Hours          0x0032   097   097   000    Old_age   Always      2560
     10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always      0
     11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always      0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always      9
     16 Unknown_Attribute       0x0022   000   200   000    Old_age   Always      26802171994
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always      0
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always      4
    193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always      67
    194 Temperature_Celsius     0x0022   116   111   000    Old_age   Always      31
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always      0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always      0
    198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline     0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always      0
    200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline     0
    
    SMART Error Log Version: 1
    No Errors Logged

    Interpretation of Parameters and Fault Diagnosis

    Analyze the detailed information that you retrieved with the command smartctl -iHAl error [DISK_NAME]. The first section lists information that you can use to identify the hard disk:

    === START OF INFORMATION SECTION ===
    Device Model:     HGST HUS722T1TALA604
    Serial Number:    WMC6N0K2RW66
    LU WWN Device Id: 5 0014ee 004722db0
    Firmware Version: RAGNWA07
    User Capacity:    1,000,204,886,016 bytes [1.00 TB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   ACS-3 T13/2161-D revision 5
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Fri May  3 07:45:14 2019 UTC
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    This section displays, among other things, the device model and serial number of the checked hard disk.

    In the second section, Smartctl assesses the current state of the hard disk. If the value "PASSED" is not displayed but, for example, the value "FAILED" or "UNKNOWN", you should arrange for the hard disk in question to be replaced as soon as possible.

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
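    The overall result can also be evaluated in a script. The sketch below assumes the result line has the wording shown above. The function names smart_health and smart_health_ok are hypothetical.

```shell
#!/bin/sh
# Read smartctl output on stdin and print the overall health result
# (for example PASSED).
smart_health() {
    sed -n 's/^.*self-assessment test result: *//p'
}

# Succeed only if the overall health result is exactly PASSED.
smart_health_ok() {
    [ "$(smart_health)" = "PASSED" ]
}

# Usage (run as root):
#   if ! smartctl -H /dev/sda | smart_health_ok; then
#       echo "/dev/sda did not report PASSED"
#   fi
```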

    In the third section, the measured SMART values are listed in detail. Next to each current normalized value (VALUE), the worst value ever measured (WORST) and the respective limit value (THRESH) are listed. If the current value (VALUE) or the worst value ever measured (WORST) falls to or below the limit value (THRESH), a SMART warning is displayed in the WHEN_FAILED column (e.g. FAILING_NOW).

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED     WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always      0
      3 Spin_Up_Time            0x0027   183   183   021    Pre-fail  Always      3833
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always      9
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always      0
      7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always      0
      9 Power_On_Hours          0x0032   097   097   000    Old_age   Always      2560
     10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always      0
     11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always      0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always      9
     16 Unknown_Attribute       0x0022   000   200   000    Old_age   Always      26802171994
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always      0
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always      4
    193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always      67
    194 Temperature_Celsius     0x0022   116   111   000    Old_age   Always      31
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always      0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always      0
    198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline     0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always      0
    200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline     0
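    The comparison of VALUE and WORST against THRESH can be sketched as a filter over the attribute table. This is an illustration, not a replacement for the WHEN_FAILED column: it assumes the column order shown above and treats a THRESH of 000 as "no limit defined". The function name smart_failed_thresholds is hypothetical.

```shell
#!/bin/sh
# Read the SMART attribute table on stdin and print every attribute whose
# VALUE or WORST has dropped to or below a non-zero THRESH.
# Columns: 1=ID 2=ATTRIBUTE_NAME 3=FLAG 4=VALUE 5=WORST 6=THRESH
smart_failed_thresholds() {
    awk '
        $1 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ {
            value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
            if (thresh > 0 && (value <= thresh || worst <= thresh))
                print $2, "VALUE=" $4, "WORST=" $5, "THRESH=" $6
        }
    '
}

# Usage (run as root):
#   smartctl -A /dev/sda | smart_failed_thresholds
```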

    The following parameters can indicate an impending hard disk failure before a SMART warning is displayed:

    Reallocated_Sector_Ct: Indicates the number of sectors that have been reallocated due to read errors. If a sector can no longer be read, written to or checked correctly, a replacement sector is automatically allocated to it. The faulty sector is permanently marked as unreadable. This is a clear warning sign of incipient surface problems. If this value is not zero, a hard disk failure is often imminent. This value is the most important indicator for a hard disk replacement.

    Current_Pending_Sector: Indicates the number of unstable sectors waiting to be remapped. If a sector cannot be read or written to correctly, it initially receives the status Current Pending Sector. The sector is not reallocated in this state because the data located on the sector is unknown. Only after several unsuccessful read or write attempts is a replacement sector allocated and the faulty sector permanently marked as unreadable. The Current_Pending_Sector value is an important indicator for a hard disk replacement. If this value is not zero, a hard disk failure is often imminent.

    Offline_Uncorrectable: Indicates the number of uncorrectable errors during read and write access to sectors.
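    The three parameters above can be checked with a short filter. This is a sketch: it assumes the raw value is the last column of each attribute line, as in the example output, and the function name smart_warning_attrs is hypothetical.

```shell
#!/bin/sh
# Read the SMART attribute table on stdin and print the early-warning
# attributes whose raw value (last column) is greater than zero.
smart_warning_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if ($NF + 0 > 0) print $2, $NF
        }
    '
}

# Usage (run as root; no output means all three raw values are zero):
#   smartctl -A /dev/sda | smart_warning_attrs
```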

    The last section shows the internal error log of the hard disk. Errors are recorded here if requests from the server to the hard disk were not processed properly. If the number of logged errors has at least two digits, you should arrange for the hard disk to be replaced as soon as possible.

    SMART Error Log Version: 1
    No Errors Logged

    Required Information for Hard Disk Replacement

    The following information is required to initiate the replacement of the defective hard disk:

    • Designation of the hard disk in the RAID (e.g. sda)

    • Serial number

    • Model

    • Log file (optional)
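    The model and serial number can be pulled from the information section with a short filter. The sketch below assumes the field labels "Device Model:" and "Serial Number:" shown in the example output. The function name disk_identity is hypothetical.

```shell
#!/bin/sh
# Read "smartctl -i" output on stdin and print the model and serial number.
disk_identity() {
    awk -F': *' '
        /Device Model:/  { print "Model: " $2 }
        /Serial Number:/ { print "Serial: " $2 }
    '
}

# Usage (run as root):
#   smartctl -i /dev/sdb | disk_identity
```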

    Creating a SMART Log

    To create a full SMART log, enter the following command:

    smartctl -x [DISK_NAME]

    Example:

    [root@localhost ~]# smartctl -x /dev/sda

    If the hard disk can no longer be accessed using Smartctl, you can use the hdparm program to retrieve the necessary information. To install hdparm, enter the following command:

    CentOS

    yum -y install hdparm

    Ubuntu/Debian

    sudo apt-get update
    sudo apt-get install hdparm

    Then enter the following command to retrieve the information required for disk replacement:

    hdparm -i /dev/sda
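    The model and serial number can be extracted from the hdparm output in the same way. This sketch assumes that hdparm -i prints them on one comma-separated line in the form Model=..., FwRev=..., SerialNo=... The function name hdparm_identity is hypothetical.

```shell
#!/bin/sh
# Read "hdparm -i" output on stdin and print the model and serial number.
hdparm_identity() {
    # Put each comma-separated KEY=VALUE pair on its own line, then keep
    # only the two fields needed for the replacement request.
    tr ',' '\n' | sed -n -e 's/^ *Model=/Model: /p' -e 's/^ *SerialNo=/Serial: /p'
}

# Usage (run as root):
#   hdparm -i /dev/sda | hdparm_identity
```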

    Notes
    • If the SMART log was created as described above, this is sufficient information. You can then arrange for the defective hard disk to be replaced. Please contact IONOS Customer Service for this.

    • If you cannot retrieve the serial number of the defective hard disk using Smartctl, you can alternatively provide the serial number of the working hard disk(s) to IONOS Customer Service.

    Preparing a Server for Hard Disk Replacement

    The following example assumes that the second hard disk (sdb) is to be replaced. For example, the following status of the software RAID is displayed during the status check:

    [root@host ~]# cat /proc/mdstat

    Personalities : [raid1]
    md3 : active raid1 sda3[0] sdb3[2]
    439553856 blocks super 1.0 [2/2] [UU]

    md1 : active raid1 sdb1[2] sda1[0]
    19529600 blocks super 1.0 [2/2] [UU]

    unused devices: <none>

    The second hard disk (sdb) is still mounted in the RAID in this example and is therefore still in use.

    Manually mark RAID device as "faulty" to remove it from the RAID

    To mark the defective disk as "faulty" so that it can be removed from RAID, enter the following command:

    [root@host ~]# mdadm PATH_OF_RAID_ARRAY -f PATH_OF_PARTITION

    In the examples below, the partitions sdb3 and sdb1 are marked as faulty:

    [root@host ~]# mdadm /dev/md3 -f /dev/sdb3
    mdadm: set /dev/sdb3 faulty in /dev/md3

    [root@host ~]# mdadm /dev/md1 -f /dev/sdb1
    mdadm: set /dev/sdb1 faulty in /dev/md1

    After entering the command, the RAID has the following status:

    [root@host ~]# cat /proc/mdstat

    Personalities : [raid1]
    md3 : active raid1 sda3[0] sdb3[2](F)
    439553856 blocks super 1.0 [2/1] [U_]

    md1 : active raid1 sdb1[2](F) sda1[0]
    19529600 blocks super 1.0 [2/1] [U_]

    unused devices: <none>
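    If a disk has partitions in several arrays, the required commands can be generated from /proc/mdstat. The following sketch only prints the commands so that you can review them before running them. It assumes sdX-style device names with numeric partition suffixes, and the function name print_fail_commands is hypothetical.

```shell
#!/bin/sh
# Print one "mdadm ARRAY -f PARTITION" command for every RAID member that
# lives on the given disk (e.g. sdb). Nothing is executed.
# Arguments: DISK [MDSTAT_FILE]; MDSTAT_FILE defaults to /proc/mdstat.
print_fail_commands() {
    disk="$1"
    awk -v disk="$disk" '
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                dev = $i
                sub(/\[.*$/, "", dev)   # "sdb3[2]" -> "sdb3"
                # Match the disk name followed only by a partition number.
                if (index(dev, disk) == 1 && substr(dev, length(disk) + 1) ~ /^[0-9]+$/)
                    print "mdadm /dev/" $1 " -f /dev/" dev
            }
        }
    ' "${2:-/proc/mdstat}"
}

# Usage:
#   print_fail_commands sdb
```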

    Remove partition from the Multiple Device

    To remove a partition from the Multiple Device, issue the following command:

    [root@host ~]# mdadm -r PATH_OF_RAID_ARRAY PATH_OF_PARTITION

    In the examples below, the partitions sdb3 and sdb1 are removed from the multiple devices md3 and md1, respectively:

    [root@host ~]# mdadm -r /dev/md3 /dev/sdb3
    mdadm: hot removed /dev/sdb3 from /dev/md3

    [root@host ~]# mdadm -r /dev/md1 /dev/sdb1
    mdadm: hot removed /dev/sdb1 from /dev/md1

    Then check the status of the RAID. In this example, the RAID that was prepared for disk replacement has the following final state:

    [root@host ~]# cat /proc/mdstat

    Personalities : [raid1]
    md3 : active raid1 sda3[0]
    439553856 blocks super 1.0 [2/1] [U_]

    md1 : active raid1 sda1[0]
    19529600 blocks super 1.0 [2/1] [U_]

    unused devices: <none>
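    The final state can also be verified in a script before you request the replacement. This sketch reports whether any partition of a given disk is still listed as a RAID member. It assumes sdX-style device names, and the function name disk_still_in_raid is hypothetical.

```shell
#!/bin/sh
# Succeed (exit status 0) if any partition of the given disk is still
# listed as a member of an md array, otherwise fail (exit status 1).
# Arguments: DISK [MDSTAT_FILE]; MDSTAT_FILE defaults to /proc/mdstat.
disk_still_in_raid() {
    disk="$1"
    awk -v disk="$disk" '
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                dev = $i
                sub(/\[.*$/, "", dev)
                if (index(dev, disk) == 1 && substr(dev, length(disk) + 1) ~ /^[0-9]+$/)
                    found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    ' "${2:-/proc/mdstat}"
}

# Usage:
#   if disk_still_in_raid sdb; then
#       echo "sdb is still part of a RAID array"
#   fi
```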

    Check which swap partitions are used

    Check which swap partitions are used by the operating system. To do this, type the following command:

    [root@host ~]# cat /proc/swaps

    Filename Type Size Used Priority
    /dev/sda2 partition 9765884 0 -1
    /dev/sdb2 partition 9765884 0 -2

    Alternatively, you can check which swap partitions are defined in fstab by entering the following command:

    [root@host ~]# grep swap /etc/fstab
    /dev/sda2 none swap sw
    /dev/sdb2 none swap sw

    Disable swap partition on the defective device

    Disable the swap partition on the defective disk so that the disk can be replaced. To do this, type the following command:

    [root@host ~]# swapoff PATH_OF_SWAP_PARTITION

    Example:

    [root@host ~]# swapoff /dev/sdb2
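    To find the active swap partitions on the defective disk automatically, /proc/swaps can be filtered by device name. The following sketch assumes the /proc/swaps layout shown above, with a header line followed by one device per line. The function name swap_on_disk is hypothetical.

```shell
#!/bin/sh
# Print every active swap partition located on the given disk (e.g. sdb).
# Arguments: DISK [SWAPS_FILE]; SWAPS_FILE defaults to /proc/swaps.
swap_on_disk() {
    disk="$1"
    # Skip the header line, then keep entries whose path starts with /dev/DISK.
    awk -v prefix="/dev/$disk" 'NR > 1 && index($1, prefix) == 1 { print $1 }' \
        "${2:-/proc/swaps}"
}

# Usage (run as root): disable all swap partitions on sdb.
#   for part in $(swap_on_disk sdb); do
#       swapoff "$part"
#   done
```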

    Please Note

    If the swap partition on the defective disk is not deactivated before the disk is replaced, the swap partition is shown with the status "deleted" in /proc/swaps.

    Arranging for Hard Disk Replacement

    Now the replacement of the defective hard disk can be arranged. For this purpose please contact IONOS Customer Service.

    Required Steps After Replacing the Hard Disk

    After the defective hard disk has been replaced, you must rebuild the software RAID. For more information about rebuilding a software RAID, see the following article:

    Rebuild Software RAID (Linux)