
vdev probe to slow disk can stall mmp write checker #15839

Merged 1 commit on Apr 29, 2024

Conversation

don-brady
Contributor

Motivation and Context

MMP Updates
When MMP is enabled on a pool, the mmp_thread() will update uberblocks for the leaf VDEVs at a regular cadence. It also checks that these updates completed within the expected time and, if not, it suspends the pool. These uberblock updates occur at a rate that ensures that even if one disk is slow to respond, a subsequent update to a different disk will complete in time. In my initial investigation, one drive had response times greater than the write-check threshold, but that slow update is effectively ignored and does not trigger a pool suspend.
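For illustration, the write check boils down to something like the following (a simplified sketch, not the actual mmp_thread() code; zfs_multihost_interval and zfs_multihost_fail_intervals are the real module tunables, but the helper and its last_write argument are only stand-ins for the bookkeeping in mmp.c):

    #include <sys/spa.h>
    #include <sys/zio.h>
    #include <sys/mmp.h>

    /*
     * Sketch: suspend the pool if no leaf uberblock write has completed
     * within the allowed window (fail_intervals * interval milliseconds).
     */
    static void
    mmp_check_last_write(spa_t *spa, hrtime_t last_write)
    {
            hrtime_t fail_ns = MSEC2NSEC(zfs_multihost_fail_intervals *
                zfs_multihost_interval);

            if (zfs_multihost_fail_intervals > 0 &&
                gethrtime() > last_write + fail_ns) {
                    /* No MMP write landed in time: suspend the pool. */
                    zio_suspend(spa, NULL, ZIO_SUSPEND_MMP);
            }
    }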

Before choosing a VDEV to update, mmp_write_uberblock() acquires a reader lock on the spa config. Here the MMP thread is assuming that acquiring this lock never takes so long that it cannot complete the uberblock writes at a regular cadence (on my config this threshold was < 10 seconds).
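Paraphrasing the relevant part of mmp_write_uberblock() (the lock level and tag shown here are illustrative, not a quote of the source):

    static void
    mmp_pick_leaf_and_write(spa_t *spa)
    {
            /*
             * The MMP thread takes the spa config lock as READER before
             * choosing a leaf vdev.  If another thread holds this lock as
             * WRITER across slow probe I/O, this call blocks and no
             * uberblock write is issued at all, even though the other
             * disks in the pool are healthy.
             */
            spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);

            /* ... choose a healthy leaf vdev and write the uberblock ... */

            spa_config_exit(spa, SCL_STATE, FTAG);
    }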

VDEV Probes
A vdev probe issues some read and write requests to the pad area of the ZFS labels, and in the associated zio done callback it posts a zevent for FM_EREPORT_ZFS_PROBE_FAILURE if it did not successfully complete at least one read and one write. The caller of vdev_probe() can take action, and in the case of a vdev_open(), it will change the vdev state to faulted if the probe returns an error (ENXIO).
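The completion logic is roughly the following (a paraphrase of the vdev_probe_done() check with argument lists abbreviated; vps refers to the probe-stats structure and its readable/writeable flags):

    /* Paraphrase of the probe completion check (not the verbatim source). */
    if (vps->vps_readable && vps->vps_writeable) {
            zio->io_error = 0;      /* at least one read and one write passed */
    } else {
            /* Report the failure and return ENXIO to the caller. */
            (void) zfs_ereport_post(FM_EREPORT_ZFS_PROBE_FAILURE,
                spa, vd, NULL, NULL, 0);
            zio->io_error = SET_ERROR(ENXIO);
    }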

When an unexpected error is encountered during zio_vdev_io_done(), it will initiate a vdev_probe() operation. The vdev_probe() function probes the disk and additionally issues a spa_async_request(spa, SPA_ASYNC_PROBE), which will perform a future vdev_reopen(); that reopen calls vdev_probe() again, but this time takes action if the probe fails by faulting the disk.

This spa_async_request acquires the spa config lock as a writer for the entirety of the vdev_reopen(), which includes a second set of probe I/O (3 reads, 3 writes). When a slow disk is involved, this means the spa config lock can be held as writer for a very long duration.
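Putting the two pieces together, the pre-patch sequence looks roughly like this (a condensed paraphrase of the zio_vdev_io_done() / spa_async_thread() interaction, not an exact excerpt):

    /* In the probe path: ask the async thread to re-probe via a reopen. */
    spa_async_request(spa, SPA_ASYNC_PROBE);

    /*
     * Later, in spa_async_thread(): the spa config lock is taken as WRITER
     * for the entire reopen, which issues a second round of probe I/O
     * (3 reads, 3 writes) to the same suspect disk.
     */
    spa_vdev_state_enter(spa, SCL_NONE);
    vdev_reopen(vd);                /* can block for seconds on a slow disk */
    (void) spa_vdev_state_exit(spa, NULL, 0);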

In summary, a single disk that is throwing errors and also has long response times (multiple seconds) will cause an MMP-induced pool suspension due to a long vdev probe operation.

Description

The crux of the problem is that a vdev_probe() triggered by unexpected errors holds the spa config lock across multiple I/Os issued to a suspect disk.

One solution would be to downgrade the lock while waiting for the issued I/O to complete. Given the complexity of the spa config lock, this seems like it would present some challenges.

Another solution is to avoid calling vdev_probe() twice. Instead, if the initial probe I/O did not successfully read or write, then take the same action seen in vdev_open() and change the vdev state to faulted. We can use a spa_async_request to make the change in syncing context (like we were doing with the vdev_reopen()).

The net change here is that we are not reopening the device for the probe; we strictly perform the same I/O tests, and if we notice a failure we use a spa_async_request to change the vdev state.
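Sketched as code, the patched behavior looks something like the following (my paraphrase of the approach, not the committed diff; the vdev_fault_wanted flag and SPA_ASYNC_FAULT_VDEV task names are illustrative and may not match the final identifiers):

    /* In the probe done callback: a failed probe no longer schedules a reopen. */
    if (!vps->vps_readable || !vps->vps_writeable) {
            vd->vdev_fault_wanted = B_TRUE;                 /* illustrative flag */
            spa_async_request(spa, SPA_ASYNC_FAULT_VDEV);   /* illustrative task */
    }

    /*
     * In the async thread: change the state in syncing context without
     * issuing any more I/O to the suspect disk.
     */
    vdev_set_state(vd, B_TRUE, VDEV_STATE_FAULTED, VDEV_AUX_ERR_EXCEEDED);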

Question for reviewers
Does a reopen provide some other benefit aside from the probe I/O? For example, is the VDEV label validation, or some other aspect of an open, required or beneficial somehow?

How Has This Been Tested?

Added a new ZTS test, functional/mmp/mmp_write_slow_disk, that injects disk errors together with slow response times and triggers an MMP pool suspension on unpatched ZFS. With the fix in place the suspension no longer occurs.

Also ran the existing functional/mmp tests.

Manually confirmed that a vdev_probe() from zio_vdev_io_done() will continue to fault the vdev if the probe I/Os fail.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@don-brady
Contributor Author

Rebased to address merge conflicts

@don-brady
Contributor Author

Allow zpool clear for a mmp suspended pool if there is no evidence of remote host activity

@behlendorf behlendorf (Contributor) left a comment

Thanks for adding the ability to resume a multihost-suspended pool. You'll also want to update this bit from the zpool clear man page:

Pools with multihost enabled which have been suspended cannot be resumed. While the pool was suspended, it may have been imported on another host, and resuming I/O could result in pool damage

@don-brady don-brady requested review from ofaaland and removed request for ofaaland March 4, 2024 16:27
@ofaaland
Contributor

ofaaland commented Mar 4, 2024 via email

@don-brady
Contributor Author

@ofaaland I was looking for a review of the latest changes, in particular the zpool clear changes. Thanks

@allanjude
Contributor

@ofaaland When do you think you'll have time to review this?

@ofaaland
Contributor

ofaaland commented Apr 5, 2024

@ofaaland When do you think you'll have time to review this?

Hi @allanjude I'll review it tomorrow. Sorry for the long delay.

@ofaaland ofaaland (Contributor) left a comment

LGTM

module/zfs/spa.c (Outdated)

    int interations = 0;
    while ((now = gethrtime()) < import_expire) {
-       if (interations % 30 == 0) {
+       if (importing && interations % 30 == 0) {

nit: The original code has a typo here, "interations" should be "iterations"

@stuartthebruce

stuartthebruce commented Apr 11, 2024

I think I just ran into this in #16078 on a large HDD zpool, so it would be fantastic if this can make it into a 2.2.4 release.

Note, this occurred shortly after upgrading from 2.1.15 to 2.2.3: is this more likely to happen with 2.2.x than 2.1.x, or was I just unlucky?

@don-brady
Contributor Author

@behlendorf I'll take another look at the re-enabling-after-suspend path

@don-brady
Contributor Author

I was able to clear (un-suspend) a pool (raidz2, 8 disks) that was suspended by the MMP write checks. There was no other host involved. I would like to understand Brian's failure case.

Simplify vdev probes in the zio_vdev_io_done context to
avoid holding the spa config lock for long durations.

Also allow zpool clear if no evidence of another host
is using the pool.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Don Brady <[email protected]>
@behlendorf behlendorf removed the Status: Code Review Needed Ready for review and testing label Apr 29, 2024
@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Apr 29, 2024
    /*
     * confirm that the best hostid matches our hostid
     */
    if (nvlist_exists(best_label, ZPOOL_CONFIG_HOSTID) &&

If the best_label we found does not have an nvpair for hostid, should we exit with B_TRUE?


Yes, I think so. This state should be impossible, but if the best_label somehow doesn't have a ZPOOL_CONFIG_HOSTID then we want to err on the side of caution and not attempt to resume the pool.
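To make the agreed behavior concrete, a minimal sketch of this check could look like the following (the helper name and plumbing are hypothetical, not the committed code; it covers only the hostid comparison quoted above):

    /*
     * Hypothetical sketch: report remote activity (and therefore refuse to
     * resume) unless the newest readable label carries our own hostid.
     * A label with no hostid at all is treated as remote activity,
     * erring on the side of caution.
     */
    static boolean_t
    mmp_label_indicates_remote_host(spa_t *spa, nvlist_t *best_label)
    {
            uint64_t hostid = 0;

            if (nvlist_lookup_uint64(best_label, ZPOOL_CONFIG_HOSTID,
                &hostid) != 0)
                    return (B_TRUE);        /* missing hostid: do not resume */

            return (hostid != spa_get_hostid(spa));
    }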

@behlendorf behlendorf merged commit c3f2f1a into openzfs:master Apr 29, 2024
22 of 26 checks passed
tonyhutter pushed a commit that referenced this pull request May 2, 2024
Simplify vdev probes in the zio_vdev_io_done context to
avoid holding the spa config lock for a long duration.

Also allow zpool clear if no evidence of another host
is using the pool.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #15839
allanjude pushed a commit to KlaraSystems/zfs that referenced this pull request May 21, 2024 (same commit message as above; Closes openzfs#15839)

lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024 (same commit message as above; Closes openzfs#15839)