vdev probe to slow disk can stall mmp write checker #15839
Conversation
Rebased to address merge conflicts
Allow zpool clear for an MMP-suspended pool if there is no evidence of remote host activity
Thanks for adding the ability to resume a multihost-suspended pool. You'll also want to update this bit from the zpool clear man page:

> Pools with multihost enabled which have been suspended cannot be resumed. While the pool was suspended, it may have been imported on another host, and resuming I/O could result in pool damage.
@ofaaland I was looking for a review of the latest changes. In particular the zpool clear changes. Thanks

@ofaaland When do you think you'll have time to review this?
Hi @allanjude I'll review it tomorrow. Sorry for the long delay. |
LGTM
module/zfs/spa.c (Outdated)

```diff
 	int interations = 0;
 	while ((now = gethrtime()) < import_expire) {
-		if (interations % 30 == 0) {
+		if (importing && interations % 30 == 0) {
```
nit: The original code has a typo here, "interations" should be "iterations"
I think I just ran into this in #16078 on a large HDD zpool, so it would be fantastic if this could make it into a 2.2.4 release. Note, this occurred shortly after upgrading from 2.1.15 to 2.2.3: is this more likely to happen with 2.2.x than 2.1.x, or was I just unlucky?

@behlendorf I'll take another look at the re-enabling-after-suspend path
I was able to clear (un-suspend) a pool (raidz2, 8 disks) that was suspended from MMP write checks. There was no other host involved. I would like to understand Brian's failure case.
Simplify vdev probes in the zio_vdev_io_done context to avoid holding the spa config lock for long durations. Also allow zpool clear if no evidence of another host is using the pool.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
```c
	/*
	 * confirm that the best hostid matches our hostid
	 */
	if (nvlist_exists(best_label, ZPOOL_CONFIG_HOSTID) &&
```
If the best_label we found does not have an nvpair for hostid, should we exit with B_TRUE?
Yes, I think so. This state should be impossible, but if the best_label somehow doesn't have a `ZPOOL_CONFIG_HOSTID` then we want to err on the side of caution and not attempt to resume the pool.
Simplify vdev probes in the zio_vdev_io_done context to avoid holding the spa config lock for a long duration. Also allow zpool clear if no evidence of another host is using the pool.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Olaf Faaland <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #15839
Motivation and Context
MMP Updates
When MMP is enabled on a pool, the `mmp_thread()` will update uberblocks for the leaf vdevs at a regular cadence. It also checks that these updates completed within the appropriate time and, if not, it will suspend the pool. The uberblock updates occur at a rate that ensures that even if one disk is slow to respond, a subsequent update to a different disk will complete in time. My initial investigation had one drive with response times greater than the write-check threshold, but that slow update is effectively ignored and does not trigger a pool suspend.

Before choosing a vdev to update, the `mmp_write_uberblock()` code acquires a reader lock on the spa config. Here the mmp thread assumes that acquiring this lock never takes so long that it cannot complete the uberblock writes at a regular cadence (on my config this threshold was < 10 seconds).

VDEV Probes
A vdev probe issues some read and write requests to the pad area of the ZFS labels, and in the associated zio done it will post a zevent for `FM_EREPORT_ZFS_PROBE_FAILURE` if it was not able to successfully complete at least one read and one write. The caller of `vdev_probe()` can take action, and in the case of a `vdev_open()`, it will change the vdev state to faulted if the probe returns an error (`ENXIO`).

When an unexpected error is encountered during `zio_vdev_io_done()`, it will initiate a `vdev_probe()` operation. The `vdev_probe()` function will probe the disk and additionally request a `spa_async_request(spa, SPA_ASYNC_PROBE)`, which will perform a future `vdev_reopen()`, which will call `vdev_probe()` again, but this time take action if the probe fails by faulting the disk.

This `spa_async_request` acquires the spa config lock as a writer for the entirety of the `vdev_reopen()`, which includes a second set of probe I/O (3 reads, 3 writes). When a slow disk is involved, this means that the spa config lock can be held as writer for a very long duration.

In summary, a single disk that is throwing errors and also has long response times (multiple seconds) will cause an MMP-induced pool suspension due to a long vdev probe operation.
Description
The crux of the problem is that a `vdev_probe()` that is triggered by unexpected errors holds the spa config lock across multiple IOs issued to a suspect disk.

One solution would be to downgrade the lock while waiting for the issued I/O to complete. Given the complexity of the spa config lock, this seems like it would present some challenges.
Another solution is to avoid calling `vdev_probe()` twice. Instead, if the initial probe I/O did not successfully read or write, take the same action seen in `vdev_open()` and change the vdev state to faulted. We can use a `spa_async_request` to make the change in syncing context (like we were doing with the `vdev_reopen()`).

The net change here is that we are not reopening the device for the probe but strictly performing the same I/O tests, and then, if we notice a failure, using a `spa_async_request` to change the vdev state.

Question for reviewers

Does a reopen provide some benefit aside from the probe I/O? For example, is the vdev label validation, or some other aspect of an open, required or beneficial somehow?
How Has This Been Tested?
Added a new ZTS test, `functional/mmp/mmp_write_slow_disk`, that induces disk errors that are also slow to respond and triggers an MMP pool suspend on an unpatched ZFS. With the fix in place the suspend no longer occurs.

Also ran the existing `functional/mmp` tests.

Manually confirmed that a `vdev_probe()` from `zio_vdev_io_done()` will continue to fault the vdev if the probe IOs fail.