Diagnosing and repairing a RAID 1 failure

From One-Eyed Man Wiki

Latest revision as of 11:59, 14 May 2022

Determine what tests are supported


Working drive (/dev/sda):

root@mono:~# smartctl -c /dev/sda

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


--- START OF READ SMART DATA SECTION ---

General SMART Values:

Offline data collection status: (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status: ( 0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: ( 120) seconds.

Offline data collection

capabilities: (0x5b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

No Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 2) minutes.

Extended self-test routine

recommended polling time: ( 482) minutes.

SCT capabilities: (0x003d) SCT Status supported.

SCT Error Recovery Control supported.

SCT Feature Control supported.

SCT Data Table supported.


Failing drive (/dev/sdb), which does not answer SMART queries:

root@mono:~# smartctl -c /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


Short INQUIRY response, skip product id

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
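The two transcripts above differ in one easily detected way: the working drive reaches the "START OF READ SMART DATA SECTION" banner, while the failing one stops at "A mandatory SMART command failed". A minimal sketch of a check built on that difference follows. The function name `smart_responds` is hypothetical, and the matched strings are simply copied from the output above; other smartctl versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: classify captured `smartctl -c` output read from stdin.
# The strings matched here are taken from the transcripts above.
smart_responds() {
    output=$(cat)
    case $output in
        *"A mandatory SMART command failed"*) echo "unresponsive" ;;
        *"START OF READ SMART DATA SECTION"*) echo "responding" ;;
        *) echo "unknown" ;;
    esac
}

# Example use (needs root and smartmontools):
#   for dev in /dev/sda /dev/sdb; do
#       printf '%s: ' "$dev"
#       smartctl -c "$dev" 2>&1 | smart_responds
#   done
```

Checking for the failure message first is deliberate: a drive that fails a mandatory command should be reported as unresponsive even if some other banner text also appears.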


Short test of working drive:

root@mono:~# smartctl -t short /dev/sda

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


--- START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ---

Sending command: "Execute SMART Short self-test routine immediately in off-line mode".

Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.

Testing has begun.

Please wait 2 minutes for test to complete.

Test will complete after Sun Mar 29 09:20:46 2020


Use smartctl -X to abort test.

root@mono:~# smartctl -H /dev/sda

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

--- START OF READ SMART DATA SECTION ---

SMART overall-health self-assessment test result: PASSED
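Note that `smartctl -H` reports the drive's overall health assessment, and the short test needs its recommended polling time (2 minutes here) to finish before its result is available. As a rough sketch, the wait can be read from the `smartctl -c` output instead of being hard-coded. The function name `short_test_minutes` is hypothetical, and the parsing assumes the "recommended polling time" layout shown earlier on this page.

```shell
#!/bin/sh
# Hypothetical helper: print the recommended polling time, in minutes, for the
# short self-test, parsed from `smartctl -c` output on stdin.
short_test_minutes() {
    awk '
        /Short self-test routine/ { in_short = 1 }
        in_short && /recommended polling time/ {
            # The value sits inside parentheses, e.g. "(   2) minutes."
            if (match($0, /\([ \t]*[0-9]+\)/)) {
                value = substr($0, RSTART + 1, RLENGTH - 2)
                gsub(/[ \t]/, "", value)
                print value
            }
            exit
        }
    '
}

# Example use (needs root and smartmontools):
#   minutes=$(smartctl -c /dev/sda | short_test_minutes)
#   smartctl -t short /dev/sda
#   sleep $((minutes * 60))
#   smartctl -l selftest /dev/sda    # self-test log with the test result
```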


Short test of failing drive:

root@mono:~# smartctl -t short /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


Short INQUIRY response, skip product id

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
root@mono:~# smartctl -t short -T permissive /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


Short INQUIRY response, skip product id

Short Background Self Test has begun

Use smartctl -X to abort test
root@mono:~# smartctl -H /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


Short INQUIRY response, skip product id

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
root@mono:~# smartctl -H -T permissive /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


Short INQUIRY response, skip product id

--- START OF READ SMART DATA SECTION ---

SMART Health Status: OK


Next, check the drive's hardware connections and re-test /dev/sdb:

root@mono:~# smartctl -t short -T permissive /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


--- START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ---

Sending command: "Execute SMART Short self-test routine immediately in off-line mode".

Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.

Testing has begun.

Please wait 2 minutes for test to complete.

Test will complete after Sun Mar 29 11:28:07 2020


Use smartctl -X to abort test.


root@mono:~# smartctl -H -T permissive /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


--- START OF READ SMART DATA SECTION ---

SMART overall-health self-assessment test result: PASSED


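The drive is back. /dev/sdb and its partitions are visible again, though none of them belongs to an array yet: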

root@mono:~# df -h

Filesystem Size Used Avail Use% Mounted on

udev 7.3G 0 7.3G 0% /dev

tmpfs 1.5G 12M 1.5G 1% /run

/dev/md2 28G 15G 12G 57% /

tmpfs 7.4G 0 7.4G 0% /dev/shm

tmpfs 5.0M 4.0K 5.0M 1% /run/lock

tmpfs 7.4G 0 7.4G 0% /sys/fs/cgroup

/dev/md1 237M 131M 94M 59% /boot

/dev/md4 3.6T 717G 2.7T 21% /home

tmpfs 1.5G 16K 1.5G 1% /run/user/118

tmpfs 1.5G 0 1.5G 0% /run/user/0

root@mono:~# lsblk

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT

sda 8:0 0 3.7T 0 disk

├─sda1 8:1 0 1M 0 part

├─sda2 8:2 0 244M 0 part

│ └─md1 9:1 0 243.8M 0 raid1 /boot

├─sda3 8:3 0 28G 0 part

│ └─md2 9:2 0 27.9G 0 raid1 /

├─sda4 8:4 0 14.9G 0 part

│ └─md3 9:3 0 14.9G 0 raid1 [SWAP]

└─sda5 8:5 0 3.6T 0 part

  └─md4 9:4 0 3.6T 0 raid1 /home

sdb 8:16 0 3.7T 0 disk

├─sdb1 8:17 0 1M 0 part

├─sdb2 8:18 0 244M 0 part

├─sdb3 8:19 0 28G 0 part

├─sdb4 8:20 0 14.9G 0 part

└─sdb5 8:21 0 3.6T 0 part


But the arrays need rebuilding. Each one is running degraded on its sda member alone ([2/1] [U_]), so add the sdb partitions back:

root@mono:~# cat /proc/mdstat

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]

md4 : active raid1 sda5[0]

3861712896 blocks super 1.2 [2/1] [U_]

bitmap: 9/29 pages [36KB], 65536KB chunk


md3 : active (auto-read-only) raid1 sda4[0]

15617024 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[3]

249664 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[3]

29280256 blocks super 1.2 [2/1] [U_]

unused devices: <none>
root@mono:~# mdadm --add /dev/md1 /dev/sdb2

mdadm: added /dev/sdb2
root@mono:~# mdadm --add /dev/md2 /dev/sdb3

mdadm: added /dev/sdb3
root@mono:~# mdadm --add /dev/md3 /dev/sdb4

mdadm: added /dev/sdb4
root@mono:~# mdadm --add /dev/md4 /dev/sdb5

mdadm: re-added /dev/sdb5
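The four `mdadm --add` commands above follow a pattern: each degraded array gets the sdb partition with the same number as its surviving sda member. As an illustration only, that pairing can be derived from /proc/mdstat. The function name `suggest_add` is hypothetical, and the sketch assumes both disks share one partition layout (as the lsblk output shows here) and simple names such as sda5. It only prints commands and never runs them.

```shell
#!/bin/sh
# Hypothetical helper: for every degraded array in an mdstat file, print the
# `mdadm --add` command that would add the matching partition of disk $1.
# Usage: suggest_add DISK [MDSTAT_FILE]   (MDSTAT_FILE defaults to /proc/mdstat)
suggest_add() {
    awk -v disk="$1" '
        /^md[0-9]+ :/ {
            name = $1
            member = $NF                       # last listed member, e.g. "sda5[0]"
            sub(/\[[0-9]+\]$/, "", member)     # drop the role suffix -> "sda5"
        }
        # A status field such as [U_] has one "_" per missing member.
        /\[[U_]*_[U_]*\]/ && name != "" {
            part = member
            sub(/^[a-z]+/, "", part)           # keep the partition number -> "5"
            print "mdadm --add /dev/" name " /dev/" disk part
            name = ""
        }
    ' "${2:-/proc/mdstat}"
}

# Example use: review the printed commands before running any of them.
#   suggest_add sdb
```

Printing rather than executing keeps a person in the loop, which matters because adding the wrong partition to an array overwrites it. The sketch does not handle members marked faulty with an "(F)" suffix.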
root@mono:~# cat /proc/mdstat

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]

md4 : active raid1 sdb5[2] sda5[0]

3861712896 blocks super 1.2 [2/1] [U_]

resync=DELAYED

bitmap: 9/29 pages [36KB], 65536KB chunk


md3 : active raid1 sdb4[2] sda4[0]

15617024 blocks super 1.2 [2/1] [U_]

resync=DELAYED

md1 : active raid1 sdb2[2] sda2[3]

249664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[3]

29280256 blocks super 1.2 [2/1] [U_]

[=========>...........] recovery = 45.2% (13258112/29280256) finish=1.4min speed=180514K/sec

unused devices: <none>
root@mono:~# cat /proc/mdstat

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]

md4 : active raid1 sdb5[2] sda5[0]

3861712896 blocks super 1.2 [2/1] [U_]

[===========>.........] recovery = 56.6% (2187364800/3861712896) finish=179.4min speed=155464K/sec

bitmap: 5/29 pages [20KB], 65536KB chunk


md3 : active raid1 sdb4[2] sda4[0]

15617024 blocks super 1.2 [2/1] [U_]

resync=DELAYED

md1 : active raid1 sdb2[2] sda2[3]

249664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[3]

29280256 blocks super 1.2 [2/2] [UU]

unused devices: <none>
root@mono:~# cat /proc/mdstat

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]

md4 : active raid1 sdb5[2] sda5[0]

3861712896 blocks super 1.2 [2/2] [UU]

bitmap: 0/29 pages [0KB], 65536KB chunk


md3 : active raid1 sdb4[2] sda4[0]

15617024 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[3]

249664 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[3]

29280256 blocks super 1.2 [2/2] [UU]


(Next time, use watch to follow the rebuild instead of re-running the command by hand: watch cat /proc/mdstat)
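For unattended use, the same information can drive a simple wait loop. The sketch below treats any "recovery =", "resync =" or "resync=DELAYED" line in /proc/mdstat, as seen in the output above, as a sign that work remains. The function name `rebuild_pending` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) while any array in the given mdstat
# file is still rebuilding or queued for a rebuild. Defaults to /proc/mdstat.
rebuild_pending() {
    grep -Eq '(recovery|resync) *=' "${1:-/proc/mdstat}"
}

# Example use: poll once a minute until every array is back in sync.
#   while rebuild_pending; do sleep 60; done
#   cat /proc/mdstat
```

mdadm also has a `--wait` option that blocks until resync or recovery on the named arrays has finished, which may be simpler when only specific arrays matter.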

root@mono:~# lsblk

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT

sda 8:0 0 3.7T 0 disk

├─sda1 8:1 0 1M 0 part

├─sda2 8:2 0 244M 0 part

│ └─md1 9:1 0 243.8M 0 raid1 /boot

├─sda3 8:3 0 28G 0 part

│ └─md2 9:2 0 27.9G 0 raid1 /

├─sda4 8:4 0 14.9G 0 part

│ └─md3 9:3 0 14.9G 0 raid1 [SWAP]

└─sda5 8:5 0 3.6T 0 part

  └─md4 9:4 0 3.6T 0 raid1 /home

sdb 8:16 0 3.7T 0 disk

├─sdb1 8:17 0 1M 0 part

├─sdb2 8:18 0 244M 0 part

│ └─md1 9:1 0 243.8M 0 raid1 /boot

├─sdb3 8:19 0 28G 0 part

│ └─md2 9:2 0 27.9G 0 raid1 /

├─sdb4 8:20 0 14.9G 0 part

│ └─md3 9:3 0 14.9G 0 raid1 [SWAP]

└─sdb5 8:21 0 3.6T 0 part

  └─md4 9:4 0 3.6T 0 raid1 /home

Post-restore testing

root@mono:~# smartctl -c /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


--- START OF READ SMART DATA SECTION ---

General SMART Values:

Offline data collection status: (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status: ( 0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: ( 120) seconds.

Offline data collection

capabilities: (0x5b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

No Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 2) minutes.

Extended self-test routine

recommended polling time: ( 483) minutes.

SCT capabilities: (0x003d) SCT Status supported.

SCT Error Recovery Control supported.

SCT Feature Control supported.

SCT Data Table supported.
root@mono:~# smartctl -t long /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


--- START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ---

Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".

Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.

Testing has begun.

Please wait 483 minutes for test to complete.

Test will complete after Sun Mar 29 21:14:35 2020

Use smartctl -X to abort test.

root@mono:~# date

Sun 29 Mar 2020 01:11:47 PM PDT
root@mono:~# smartctl -H /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


--- START OF READ SMART DATA SECTION ---

SMART overall-health self-assessment test result: PASSED
root@mono:~# smartctl -c /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org


--- START OF READ SMART DATA SECTION ---

General SMART Values:

Offline data collection status: (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status: ( 0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: ( 120) seconds.

Offline data collection

capabilities: (0x5b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

No Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 2) minutes.

Extended self-test routine

recommended polling time: ( 483) minutes.

SCT capabilities: (0x003d) SCT Status supported.

SCT Error Recovery Control supported.

SCT Feature Control supported.

SCT Data Table supported.
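In the final `smartctl -c` output, the line to look at is "Self-test execution status", where code 0 goes with the "completed without error or no self-test has ever been run" message. A small sketch for pulling out that code follows. The function name `selftest_status_code` is hypothetical, and the pattern assumes the layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric self-test execution status code found
# in `smartctl -c` output on stdin, e.g. "0" for the transcript above.
selftest_status_code() {
    sed -n 's/.*Self-test execution status:[^(]*( *\([0-9][0-9]*\)).*/\1/p' | head -n 1
}

# Example use (needs root and smartmontools):
#   code=$(smartctl -c /dev/sdb | selftest_status_code)
#   if [ "$code" = "0" ]; then
#       echo "no self-test error reported"
#   else
#       smartctl -l selftest /dev/sdb    # inspect the self-test log
#   fi
```

Because code 0 also covers the case where no test has ever been run, the self-test log (`smartctl -l selftest`) is the more direct confirmation that the extended test itself finished cleanly.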