Replacing a failed disk in a ZFS mirror


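Ever since I started using ZFS I have wondered how to replace a disk once it fails. Today I finally got the chance.

My PVE is installed on ZFS, using a mirror. For a while one of the disks kept reporting READ errors, and I never took it seriously. I just cleared them with zpool clear each time: they were only READ errors, never WRITE errors, and the count stayed at around 10.

That lasted until today, when a pile of errors showed up all at once. I knew this disk could not be saved. Both disks were hand-me-downs that someone else had retired anyway (too broke, I could cry).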

root@pve:~# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 6.81M in 02:41:15 with 0 errors on Sun Mar 13 03:05:17 2022
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 DEGRADED     0     0     0
          mirror-0                                            DEGRADED     0     0     0
            ata-WDC_WD5000BPKT-75PK4T0_WD-WX11A43M7165-part3  DEGRADED   857     0   367  too many errors
            ata-WDC_WD5000BPKT-75PK4T0_WD-WX11A43M7336-part3  ONLINE       0     0     0

errors: No known data errors
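Since the failing disk announced itself through the READ and CKSUM columns, the counters can also be watched from a script rather than by eye. The following is a minimal sketch, assuming the column layout shown in the output above; the function name zpool_error_devices is made up for this example and is not part of ZFS.

```shell
# Print every pool member whose READ, WRITE or CKSUM counter is not zero.
# Reads `zpool status` text on stdin, so it can be tried on saved output.
zpool_error_devices() {
    awk '
        # the header row marks the start of the config table
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # the table ends before the "errors:" summary line
        /^errors:/ { in_config = 0 }
        in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

# Usage on a live system:
#   zpool status rpool | zpool_error_devices
```

Run against the status output above, this would list only the WX11A43M7165 partition, since the other rows have all three counters at 0.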

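Although SATA supports hot-plugging, I shut the machine down to be safe. I forgot to offline the bad disk before shutting down, but that turned out not to be a problem.

Remove the old disk, install the new one, and boot.

At this point ZFS already reports the old disk as UNAVAIL, so take it OFFLINE: zpool offline rpool 10409275789507660143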

root@pve:~# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 6.81M in 02:41:15 with 0 errors on Sun Mar 13 03:05:17 2022
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 DEGRADED     0     0     0
          mirror-0                                            DEGRADED     0     0     0
            10409275789507660143                              OFFLINE      0     0     0  was /dev/disk/by-id/ata-WDC_WD5000BPKT-75PK4T0_WD-WX11A43M7165-part3
            ata-WDC_WD5000BPKT-75PK4T0_WD-WX11A43M7336-part3  ONLINE       0     0     0

errors: No known data errors

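Because the replacement disk had been used before, it has to be wiped first.

In the PVE web interface, select pve -> Disks, find the new disk, and choose Wipe Disk.

A better approach is to copy the partition table from the healthy disk: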

sgdisk /dev/sdb -R /dev/sdc ## copy the partition table of sdb (healthy) onto sdc (new)
sgdisk -G /dev/sdc ## randomize the disk and partition GUIDs on sdc
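The argument order matters here: the disk named first is the source, and the disk after -R is overwritten. As a sketch of one way to guard against a slip, the hypothetical wrapper below (clone_partition_table and the DRY_RUN variable are names invented for this example) refuses identical arguments and by default only prints the two commands so they can be checked before running them for real.

```shell
# Copy the GPT layout from a healthy disk to a new one, then randomize GUIDs.
# With DRY_RUN=1 (the default in this sketch) the commands are only printed.
clone_partition_table() {
    src=$1   # healthy disk that is still in the pool
    dst=$2   # new disk, its partition table will be overwritten
    if [ -z "$src" ] || [ -z "$dst" ] || [ "$src" = "$dst" ]; then
        echo "usage: clone_partition_table <healthy-disk> <new-disk>" >&2
        return 1
    fi
    run=""
    [ "${DRY_RUN:-1}" = "1" ] && run="echo"
    $run sgdisk "$src" -R "$dst" || return 1   # replicate src's partition table onto dst
    $run sgdisk -G "$dst"                      # give dst its own disk and partition GUIDs
}

# Preview:  clone_partition_table /dev/sdb /dev/sdc
# Execute:  DRY_RUN=0 clone_partition_table /dev/sdb /dev/sdc
```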

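Referring to disks by names like /dev/sda is not very safe, because the names can change when cables are moved to different ports. Using the by-id name is more reliable.

Open the PVE shell and run ls /dev/disk/by-id to find the id of the new disk.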

root@pve:/dev/disk/by-id# ls
ata-ST500LX012-1LM162-SSHD_W3N179MS             ata-WDC_WD5000BPKT-75PK4T0_WD-WX11A43M7336        wwn-0x50014ee00410d1af        wwn-0x50014ee658bf9463-part1
ata-WDC_WD30PURX-64P6ZY0_WD-WMC4N0L041D9        ata-WDC_WD5000BPKT-75PK4T0_WD-WX11A43M7336-part1  wwn-0x50014ee00410d1af-part1  wwn-0x50014ee658bf9463-part2
ata-WDC_WD30PURX-64P6ZY0_WD-WMC4N0L041D9-part1  ata-WDC_WD5000BPKT-75PK4T0_WD-WX11A43M7336-part2  wwn-0x50014ee00410d1af-part2  wwn-0x50014ee658bf9463-part3
ata-WDC_WD30PURX-64P6ZY0_WD-WMC4N0L041D9-part2  ata-WDC_WD5000BPKT-75PK4T0_WD-WX11A43M7336-part3  wwn-0x50014ee00410d1af-part3
ata-WDC_WD30PURX-64P6ZY0_WD-WMC4N0L041D9-part3  wwn-0x5000c50082b85d27                            wwn-0x50014ee658bf9463
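The listing above does not show which /dev/sdX name each id currently points to. A small sketch that prints the mapping for whole disks follows; list_disk_ids is a name invented here, and the directory argument exists only so the function can be tried on a test tree instead of the real /dev/disk/by-id.

```shell
# List whole-disk by-id names next to the kernel device they currently point to.
list_disk_ids() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/ata-*; do
        [ -e "$link" ] || continue                      # nothing matched, or a dangling link
        case $link in *-part[0-9]*) continue ;; esac    # skip partition entries
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}

# Usage:
#   list_disk_ids
```

Only the ata-* names are listed; the wwn-* entries refer to the same disks under a different naming scheme.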

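zpool replace -f rpool 10409275789507660143 ata-ST500LX012-1LM162-SSHD_W3N179MS ## old device first, then the new one

The old device is the failed member, here the OFFLINE entry identified by its GUID, not the healthy disk. If the new disk was partitioned by copying the partition table, pass its third partition (the name ending in -part3) as the new device, since that is the ZFS partition. To add a disk rather than replace one, use attach instead of replace, followed by the partition that remains in the pool and then the new partition.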

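-f – force the use of the new device even if it appears to be in use

The command above prints nothing, so check the status again: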

root@pve:/dev/disk/by-id# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Mar 13 23:33:50 2022
        88.1G scanned at 534M/s, 2.38G issued at 14.4M/s, 187G total
        2.45G resilvered, 1.28% done, 03:37:59 to go
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 DEGRADED     0     0     0
          mirror-0                                            DEGRADED     0     0     0
            replacing-0                                       DEGRADED     0     0     0
              10409275789507660143                            OFFLINE      0     0     0  was /dev/disk/by-id/ata-WDC_WD5000BPKT-75PK4T0_WD-WX11A43M7165-part3
              ata-ST500LX012-1LM162-SSHD_W3N179MS             ONLINE       0     0     0  (resilvering)
            ata-WDC_WD5000BPKT-75PK4T0_WD-WX11A43M7336-part3  ONLINE       0     0     0

errors: No known data errors
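Rather than rerunning zpool status by hand, the progress line can be pulled out and polled. This is only a sketch under the assumption that the output keeps the "resilver in progress" and "% done" wording shown above; both function names, the default pool name, and the 60 second interval are choices made for this example.

```shell
# Extract the "N% done" figure from `zpool status` text on stdin.
# Prints nothing when no progress line is present.
resilver_percent() {
    sed -n 's/.* \([0-9][0-9.]*\)% done.*/\1/p'
}

# Poll until the pool no longer reports a resilver in progress.
wait_for_resilver() {
    pool=${1:-rpool}
    while zpool status "$pool" | grep -q 'resilver in progress'; do
        echo "resilver: $(zpool status "$pool" | resilver_percent)% done"
        sleep 60
    done
    echo "resilver finished (or none running) on $pool"
}

# Usage:
#   wait_for_resilver rpool
```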

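The mirror is now being rebuilt (resilvered). This will probably take a few hours, depending on the size of the disk and how much data the pool holds.

Once the resilver finishes, the new disk should come ONLINE automatically. If it does not, bring it online manually with zpool online.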

root@pve:~# zpool status
  pool: rpool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: resilvered 190G in 02:59:24 with 0 errors on Mon Mar 14 02:33:14 2022
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 ONLINE       0     0     0
          mirror-0                                            ONLINE       0     0     0
            ata-ST500LX012-1LM162-SSHD_W3N179MS               ONLINE       0     0     0
            ata-WDC_WD5000BPKT-75PK4T0_WD-WX11A43M7336-part3  ONLINE       0     0     0

errors: No known data errors

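Avoid rebooting until the new disk has finished resilvering. Then set up the EFI system partition (partition 2) on the new disk so that the machine can also boot from it: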

proxmox-boot-tool format /dev/sdc2
proxmox-boot-tool init /dev/sdc2 ## if this fails, run apt install systemd-boot
proxmox-boot-tool refresh

If the replacement disk is larger than the old one, the ZFS partition and the pool need to be resized as well:

# resize partition 3 of sdb so that it ends at 50% of the disk (partition 3 is the ZFS partition)
parted /dev/sdb resizepart 3 50%

# expand ZFS on that disk to use the entire resized partition
zpool online -e rpool ata-HGST_HUS726T4TALE6L4_V1G7DWXC-part3

# resize partition 3 of sdc so that it ends at 50% of the disk (partition 3 is the ZFS partition)
parted /dev/sdc resizepart 3 50%

# expand ZFS on that disk to use the entire resized partition
zpool online -e rpool ata-ST4000NM0115-1YZ107_ZC123Y73-part3

-e – expand the device to use all available space
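To see whether a pool still has space it could grow into, zpool list can report the expandsz column. The sketch below filters that output; pools_with_expandable_space is a name invented for this example, and it assumes the tab-separated scripted output produced by the -H flag.

```shell
# Print pools that still have unclaimed space, reading the output of
# `zpool list -H -o name,size,expandsz` (tab-separated) on stdin.
pools_with_expandable_space() {
    awk -F '\t' '$3 != "-" && $3 != "" { print $1, "size=" $2, "expandable=" $3 }'
}

# Usage:
#   zpool list -H -o name,size,expandsz | pools_with_expandable_space
```

A pool that has already been expanded shows "-" in that column and is not printed.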

To check the SMART health data of a disk: smartctl -a /dev/sda
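The full smartctl report is long, and for a disk that throws read errors a few attributes are usually the interesting ones. As a rough sketch, assuming the usual ATA attribute table in which the raw value is the tenth column, the function below (smart_bad_sectors, a name made up here) prints the sector-related attributes whose raw value is above zero. Which attributes a drive reports varies by vendor, so the list is an assumption.

```shell
# Print sector-related SMART attributes with a non-zero raw value.
# Reads `smartctl -A` (or `smartctl -a`) text on stdin.
smart_bad_sectors() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if ($10 + 0 > 0) print $2 "=" $10   # column 10 is RAW_VALUE
        }
    '
}

# Usage:
#   smartctl -A /dev/sda | smart_bad_sectors
```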