Diagnosing and repairing a RAID 1 failure
Latest revision as of 11:59, 14 May 2022
Determine what tests are supported
Normal drive:
root@mono:~# smartctl -c /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
--- START OF READ SMART DATA SECTION ---
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 482) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
Failing drive (no usable SMART response):
root@mono:~# smartctl -c /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Short test of working drive:
root@mono:~# smartctl -t short /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
--- START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ---
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Sun Mar 29 09:20:46 2020
Use smartctl -X to abort test.

root@mono:~# smartctl -H /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
--- START OF READ SMART DATA SECTION ---
SMART overall-health self-assessment test result: PASSED
Short test of failing drive:
root@mono:~# smartctl -t short /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

root@mono:~# smartctl -t short -T permissive /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Short INQUIRY response, skip product id
Short Background Self Test has begun
Use smartctl -X to abort test

root@mono:~# smartctl -H /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

root@mono:~# smartctl -H -T permissive /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Short INQUIRY response, skip product id
--- START OF READ SMART DATA SECTION ---
SMART Health Status: OK
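The transcripts above only capture smartctl's messages. smartctl also signals what went wrong through its exit status, which is a bit mask. A minimal sketch for decoding it, assuming the bit meanings listed in the EXIT STATUS section of smartctl(8); the function name decode_smartctl_status is made up for this example:

```shell
#!/bin/sh
# Decode the bit mask that smartctl returns as its exit status.
# Bit meanings are paraphrased from the EXIT STATUS section of smartctl(8).
# decode_smartctl_status is a hypothetical helper, not part of smartmontools.
decode_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed" \
        "a SMART or ATA command failed" \
        "SMART status is DISK FAILING" \
        "prefail attribute at or below threshold" \
        "attribute was at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        # Report every bit that is set in the status value.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}

# Possible use on a live system (not run here):
#   smartctl -H /dev/sdb; decode_smartctl_status $?
```

The failed queries on /dev/sdb probably set bit 2, but the transcript does not record the exit status, so that is a guess.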
NEXT: Check hardware attachments and re-test /dev/sdb
root@mono:~# smartctl -t short -T permissive /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
--- START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ---
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Sun Mar 29 11:28:07 2020
Use smartctl -X to abort test.

root@mono:~# smartctl -H -T permissive /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
--- START OF READ SMART DATA SECTION ---
SMART overall-health self-assessment test result: PASSED
After the hardware check, /dev/sdb and its partitions are visible again:
root@mono:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.3G     0  7.3G   0% /dev
tmpfs           1.5G   12M  1.5G   1% /run
/dev/md2         28G   15G   12G  57% /
tmpfs           7.4G     0  7.4G   0% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs           7.4G     0  7.4G   0% /sys/fs/cgroup
/dev/md1        237M  131M   94M  59% /boot
/dev/md4        3.6T  717G  2.7T  21% /home
tmpfs           1.5G   16K  1.5G   1% /run/user/118
tmpfs           1.5G     0  1.5G   0% /run/user/0

root@mono:~# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda         8:0    0   3.7T  0 disk
├─sda1      8:1    0     1M  0 part
├─sda2      8:2    0   244M  0 part
│ └─md1     9:1    0 243.8M  0 raid1 /boot
├─sda3      8:3    0    28G  0 part
│ └─md2     9:2    0  27.9G  0 raid1 /
├─sda4      8:4    0  14.9G  0 part
│ └─md3     9:3    0  14.9G  0 raid1 [SWAP]
└─sda5      8:5    0   3.6T  0 part
  └─md4     9:4    0   3.6T  0 raid1 /home
sdb         8:16   0   3.7T  0 disk
├─sdb1      8:17   0     1M  0 part
├─sdb2      8:18   0   244M  0 part
├─sdb3      8:19   0    28G  0 part
├─sdb4      8:20   0  14.9G  0 part
└─sdb5      8:21   0   3.6T  0 part
But the array needs rebuilding:
root@mono:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 sda5[0]
      3861712896 blocks super 1.2 [2/1] [U_]
      bitmap: 9/29 pages [36KB], 65536KB chunk
md3 : active (auto-read-only) raid1 sda4[0]
      15617024 blocks super 1.2 [2/1] [U_]
md1 : active raid1 sda2[3]
      249664 blocks super 1.2 [2/1] [U_]
md2 : active raid1 sda3[3]
      29280256 blocks super 1.2 [2/1] [U_]
unused devices: <none>
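In this output, [2/1] [U_] means only one of the two members is active. A rough way to pull the degraded arrays out of /proc/mdstat with awk, assuming array names start with "md" and the usual mdstat layout; the function name list_degraded_arrays is hypothetical:

```shell
#!/bin/sh
# Print the name of every md array whose member map (for example [U_])
# contains "_", which marks a missing or failed member.
# Reads /proc/mdstat-style text on standard input.
list_degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1 }            # remember the current array
        match($0, /\[[U_]+\]/) {              # find the member map on this line
            if (index(substr($0, RSTART, RLENGTH), "_") > 0)
                print name
        }
    '
}

# Possible use on a live system (not run here):
#   list_degraded_arrays < /proc/mdstat
```

Fed the mdstat output above, this should list all four arrays, since each one shows [U_].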
root@mono:~# mdadm --add /dev/md1 /dev/sdb2
mdadm: added /dev/sdb2

root@mono:~# mdadm --add /dev/md2 /dev/sdb3
mdadm: added /dev/sdb3

root@mono:~# mdadm --add /dev/md3 /dev/sdb4
mdadm: added /dev/sdb4

root@mono:~# mdadm --add /dev/md4 /dev/sdb5
mdadm: re-added /dev/sdb5
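The four commands follow one pattern, so a loop can generate them. This sketch only prints the commands for review and does not run them; the function name print_add_commands and the array:partition pair syntax are invented for this example:

```shell
#!/bin/sh
# Print (do not run) one "mdadm --add" command per array:partition pair.
# print_add_commands and the pair syntax are hypothetical conveniences.
print_add_commands() {
    for pair in "$@"; do
        array=${pair%%:*}     # text before the first colon, e.g. md1
        part=${pair#*:}       # text after the first colon, e.g. sdb2
        echo "mdadm --add /dev/$array /dev/$part"
    done
}

# Pairs taken from the transcript above:
#   print_add_commands md1:sdb2 md2:sdb3 md3:sdb4 md4:sdb5
```

Check the printed commands against lsblk before running any of them, because adding the wrong partition to an array overwrites its contents during the rebuild.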
root@mono:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 sdb5[2] sda5[0]
      3861712896 blocks super 1.2 [2/1] [U_]
      resync=DELAYED
      bitmap: 9/29 pages [36KB], 65536KB chunk
md3 : active raid1 sdb4[2] sda4[0]
      15617024 blocks super 1.2 [2/1] [U_]
      resync=DELAYED
md1 : active raid1 sdb2[2] sda2[3]
      249664 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sdb3[2] sda3[3]
      29280256 blocks super 1.2 [2/1] [U_]
      [=========>...........]  recovery = 45.2% (13258112/29280256) finish=1.4min speed=180514K/sec
unused devices: <none>

root@mono:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 sdb5[2] sda5[0]
      3861712896 blocks super 1.2 [2/1] [U_]
      [===========>.........]  recovery = 56.6% (2187364800/3861712896) finish=179.4min speed=155464K/sec
      bitmap: 5/29 pages [20KB], 65536KB chunk
md3 : active raid1 sdb4[2] sda4[0]
      15617024 blocks super 1.2 [2/1] [U_]
      resync=DELAYED
md1 : active raid1 sdb2[2] sda2[3]
      249664 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sdb3[2] sda3[3]
      29280256 blocks super 1.2 [2/2] [UU]
unused devices: <none>

root@mono:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 sdb5[2] sda5[0]
      3861712896 blocks super 1.2 [2/2] [UU]
      bitmap: 0/29 pages [0KB], 65536KB chunk
md3 : active raid1 sdb4[2] sda4[0]
      15617024 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sdb2[2] sda2[3]
      249664 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sdb3[2] sda3[3]
      29280256 blocks super 1.2 [2/2] [UU]
(Next time, use “watch” ... watch cat /proc/mdstat)
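For a log-friendly alternative to watch, the per-array progress can be extracted from /proc/mdstat. A small sketch, assuming the "recovery = NN.N%" and "resync=DELAYED" line formats seen in the output above; the function name rebuild_progress is hypothetical:

```shell
#!/bin/sh
# Summarise rebuild activity from /proc/mdstat-style text on standard input.
# Prints "<array> <percent>" for a running recovery or resync and
# "<array> delayed" for an array queued behind another rebuild.
rebuild_progress() {
    awk '
        /^md[^ ]* :/ { name = $1 }                 # start of an array stanza
        /=DELAYED/   { print name, "delayed" }     # waiting for its turn
        /(recovery|resync) = / {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print name, $i      # the percentage field
        }
    '
}

# Possible polling loop on a live system (not run here):
#   while grep -Eq 'recovery|resync' /proc/mdstat; do
#       rebuild_progress < /proc/mdstat
#       sleep 60
#   done
```

mdadm also has a --wait option that blocks until resync or recovery on the named arrays finishes (for example mdadm --wait /dev/md4), which may be simpler when no progress output is needed.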
root@mono:~# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda         8:0    0   3.7T  0 disk
├─sda1      8:1    0     1M  0 part
├─sda2      8:2    0   244M  0 part
│ └─md1     9:1    0 243.8M  0 raid1 /boot
├─sda3      8:3    0    28G  0 part
│ └─md2     9:2    0  27.9G  0 raid1 /
├─sda4      8:4    0  14.9G  0 part
│ └─md3     9:3    0  14.9G  0 raid1 [SWAP]
└─sda5      8:5    0   3.6T  0 part
  └─md4     9:4    0   3.6T  0 raid1 /home
sdb         8:16   0   3.7T  0 disk
├─sdb1      8:17   0     1M  0 part
├─sdb2      8:18   0   244M  0 part
│ └─md1     9:1    0 243.8M  0 raid1 /boot
├─sdb3      8:19   0    28G  0 part
│ └─md2     9:2    0  27.9G  0 raid1 /
├─sdb4      8:20   0  14.9G  0 part
│ └─md3     9:3    0  14.9G  0 raid1 [SWAP]
└─sdb5      8:21   0   3.6T  0 part
  └─md4     9:4    0   3.6T  0 raid1 /home
Post-restore testing
root@mono:~# smartctl -c /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
--- START OF READ SMART DATA SECTION ---
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 483) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
root@mono:~# smartctl -t long /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
--- START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ---
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 483 minutes for test to complete.
Test will complete after Sun Mar 29 21:14:35 2020
Use smartctl -X to abort test.

root@mono:~# date
Sun 29 Mar 2020 01:11:47 PM PDT

root@mono:~# smartctl -H /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
--- START OF READ SMART DATA SECTION ---
SMART overall-health self-assessment test result: PASSED
root@mono:~# smartctl -c /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
--- START OF READ SMART DATA SECTION ---
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 483) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
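smartctl -H reports the drive's overall health assessment, not the result of the self-test that was just started. The "Self-test execution status" value in the smartctl -c output is what tracks a test. A sketch for reading it, assuming the ATA encoding in which the high four bits of the value give the state and the low four bits give the remaining work in tenths; the function name selftest_state is hypothetical:

```shell
#!/bin/sh
# Interpret the "Self-test execution status" value from "smartctl -c" output
# supplied on standard input.
# Assumed encoding: high nibble = state (0 done or never run, 15 in progress),
# low nibble = remaining work in tens of percent.
selftest_state() {
    code=$(sed -n 's/^Self-test execution status:[^(]*( *\([0-9][0-9]*\)).*/\1/p' | head -n 1)
    if [ -z "$code" ]; then
        echo "unknown"
        return 1
    fi
    state=$((code >> 4))
    remaining=$(( (code & 15) * 10 ))
    case $state in
        0)  echo "completed without error, or never run" ;;
        15) echo "in progress, ${remaining}% remaining" ;;
        *)  echo "ended with status $state (see smartctl -l selftest)" ;;
    esac
}

# Possible use on a live system (not run here):
#   smartctl -c /dev/sdb | selftest_state
```

Once the extended test has had time to finish, smartctl -l selftest /dev/sdb shows the self-test log, including the outcome of each recorded test.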