vdev probe to slow disk can stall mmp write checker #15839
Conversation
Rebased to address merge conflicts
Allow zpool clear for an MMP-suspended pool if there is no evidence of remote host activity
Thanks for adding the ability to resume a multihost-suspended pool. You'll also want to update this bit from the zpool clear man page:

> Pools with multihost enabled which have been suspended cannot be resumed. While the pool was suspended, it may have been imported on another host, and resuming I/O could result in pool damage.
@ofaaland I was looking for a review of the latest changes. In particular the zpool clear changes. Thanks

@ofaaland When do you think you'll have time to review this?
Hi @allanjude I'll review it tomorrow. Sorry for the long delay. |
LGTM
module/zfs/spa.c (Outdated)

```diff
 	int interations = 0;
 	while ((now = gethrtime()) < import_expire) {
-		if (interations % 30 == 0) {
+		if (importing && interations % 30 == 0) {
```
nit: The original code has a typo here, "interations" should be "iterations"
I think I just ran into this in #16078 on a large HDD zpool, so it would be fantastic if this could make it into a 2.2.4 release. Note, this occurred shortly after upgrading from 2.1.15 to 2.2.3: is this more likely to happen with 2.2.x than 2.1.x, or was I just unlucky?

@behlendorf I'll take another look at the re-enabling-after-suspend path
I was able to clear (un-suspend) a pool (raidz2, 8 disks) that was suspended from MMP write checks. There was no other host involved. I would like to understand Brian's failure case.
Simplify vdev probes in the zio_vdev_io_done context to avoid holding the spa config lock for long durations. Also allow zpool clear if no evidence of another host is using the pool.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
```c
	/*
	 * confirm that the best hostid matches our hostid
	 */
	if (nvlist_exists(best_label, ZPOOL_CONFIG_HOSTID) &&
```
If the best_label we found does not have an nvpair for hostid, should we exit with B_TRUE?
Yes, I think so. This state should be impossible, but if the best_label somehow doesn't have a `ZPOOL_CONFIG_HOSTID` then we want to err on the side of caution and not attempt to resume the pool.
Simplify vdev probes in the zio_vdev_io_done context to avoid holding the spa config lock for a long duration. Also allow zpool clear if no evidence of another host is using the pool.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #15839
Motivation and Context
MMP Updates
When MMP is enabled on a pool, the `mmp_thread()` will update uberblocks for the leaf vdevs at a regular cadence. It also checks that these updates completed within the appropriate time and, if not, it will suspend the pool. The uberblock updates occur at a rate that ensures that even if one disk is slow to respond, a subsequent update to a different disk will complete in time. My initial investigation had one drive with response times greater than the write-check threshold, but that slow update is effectively ignored and does not trigger a pool suspend.

Before choosing a vdev to update, the `mmp_write_uberblock()` code acquires a reader lock on the spa config. Here the mmp thread assumes that acquiring this lock never takes so long that it cannot complete the uberblock writes at a regular cadence (on my config this threshold was < 10 seconds).

VDEV Probes
A vdev probe issues some read and write requests to the pad area of the ZFS labels, and in the associated zio done it will post a zevent for `FM_EREPORT_ZFS_PROBE_FAILURE` if it was not able to successfully complete at least one read and one write. The caller of `vdev_probe()` can take action, and in the case of a `vdev_open()`, it will change the vdev state to faulted if the probe returns an error (`ENXIO`).

When an unexpected error is encountered during `zio_vdev_io_done()`, it will initiate a `vdev_probe()` operation. The `vdev_probe()` function will probe the disk and additionally request a `spa_async_request(spa, SPA_ASYNC_PROBE)`, which will perform a future `vdev_reopen()`, which will call `vdev_probe()` again, but this time take action if the probe fails by faulting the disk.

This `spa_async_request` acquires the spa config lock as a writer for the entirety of the `vdev_reopen()`, which includes a second set of probe I/O (3 reads, 3 writes). When a slow disk is involved, this means that the spa config lock can be held as writer for a very long duration.

In summary, a single disk that is throwing errors and also has long response times (multiple seconds) will cause an MMP-induced pool suspension due to a long vdev probe operation.
Description
The crux of the problem is that a `vdev_probe()` that is triggered by unexpected errors holds the spa config lock across multiple IOs issued to a suspect disk.

One solution would be to downgrade the lock while waiting for the issued I/O to complete. Given the complexity of the spa config lock, this seems like it would present some challenges.
Another solution is to avoid calling `vdev_probe()` twice. Instead, if the initial probe I/O did not successfully read or write, take the same action seen in `vdev_open()` and change the vdev state to faulted. We can use a `spa_async_request` to make the change in syncing context (like we were doing with the `vdev_reopen()`).

The net change here is that we are not reopening the device for the probe but strictly performing the same I/O tests, and then, if we notice a failure, using a `spa_async_request` to change the vdev state.

Question for reviewers

Does a reopen provide some benefit aside from the probe I/O? For example, is the vdev label validation, or some other aspect of an open, required or beneficial somehow?
How Has This Been Tested?
Added a new ZTS test, `functional/mmp/mmp_write_slow_disk`, that induces disk errors that are also slow to respond and triggers an MMP pool suspend on an unpatched ZFS. With the fix in place the suspend no longer occurs.

Also ran the existing `functional/mmp` tests.

Manually confirmed that a `vdev_probe()` from `zio_vdev_io_done()` will continue to fault the vdev if the probe IOs fail.