zio: add separate pipeline stages for logical IO #17388

Open: robn wants to merge 1 commit into master from zio-logical-io

Conversation

robn (Member) commented May 28, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

The "logical" IO responsible for farming work out to the vdevs goes through the VDEV_IO stages, even though it does no IO itself, does not have a vdev set, and is not a "vdev" child IO.

This means the VDEV_IO stages need special handling for this particular kind of IO, some of it totally irrelevant to real vdev IO (eg config lock, retry, etc). It also leads to some confusing asymmetries, eg the DVA throttle is held in the logical, and then released in pieces in the children.

(I can elaborate on what I'm doing if more justification is needed, but I'm hopeful this stands on its own as a good cleanup).

Description

This commit adds two new stages to the pipeline, ZIO_LOGICAL_IO_START and ZIO_LOGICAL_IO_DONE, to handle this IO. This allows a clean separation between logical and vdev IO: vdev IO always has io_vd set and an io_child_type of ZIO_CHILD_VDEV, while logical IO is the inverse. Logical IO only ever goes through the LOGICAL_IO pipeline, and vdev IO through VDEV_IO.

This separation presents a new problem, in that previously the logical IO would call into the mirror vdev ops to issue the vdev IO, which is now not possible because non-vdev IOs can't use vdev operations. To keep the overall pipeline tidy, we press the root vdev into service. zio_logical_io_start() creates a child IO against spa_root_vdev, which then delegates to the mirror vdev ops to do its work.
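For illustration, the new start stage ends up roughly like the sketch below, built around the zio_vdev_child_io() call quoted in the review thread; the stage-function signature and the assertions here are assumptions based on existing zio.c conventions, not the literal patch.

```c
/*
 * Sketch of the new logical start stage. The logical IO does no IO
 * of its own: it issues a child IO against the root vdev, whose
 * (mirror) vdev ops then fan the work out to the top-level vdevs
 * holding the block's DVAs.
 */
static zio_t *
zio_logical_io_start(zio_t *zio)
{
	spa_t *spa = zio->io_spa;

	/* Logical IO never has a vdev set and is never a vdev child. */
	ASSERT3P(zio->io_vd, ==, NULL);
	ASSERT3U(zio->io_child_type, !=, ZIO_CHILD_VDEV);

	zio_nowait(zio_vdev_child_io(zio, zio->io_bp, spa->spa_root_vdev,
	    zio->io_offset, zio->io_abd, zio->io_size, zio->io_type,
	    zio->io_priority, 0, zio_logical_io_child_done, zio));

	return (zio);
}
```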

How Has This Been Tested?

Successful ZTS run against Linux 6.1.137 and FreeBSD 14.2-p1.

Throughput appears to be similar in light performance tests, though I have not pushed it very hard.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

robn force-pushed the zio-logical-io branch from e324f6b to 3f23f00 on May 28, 2025 07:28
amotin (Member) commented May 28, 2025

It also leads to some confusing asymmetries, eg the DVA throttle is held in the logical, and then released in pieces in the children.

It might not look great, but it is done that way intentionally. If we are writing several copies of a block to several different top-level vdevs, we want each of them to be throttled individually; otherwise a single slow vdev would throttle the others, and avoiding that is the point of the allocation throttling mechanism.

Previously we even tracked dirty data at individual leaf vdev granularity, but I had to remove that some time ago because the processing overhead was too high.
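To make that concrete, here's a toy sketch of the shape described above, assuming ZFS types are in scope; throttle_reserve()/throttle_release() are hypothetical stand-ins, not the real metaslab throttle API:

```c
/* Hypothetical stand-ins for the real allocation-throttle calls. */
extern void throttle_reserve(spa_t *spa, uint32_t allocator);
extern void throttle_release(spa_t *spa, uint32_t allocator);

/*
 * The logical write reserves one slot per DVA copy up front, then
 * issues one child IO per top-level vdev.
 */
static void
logical_write_issue(zio_t *zio, int copies)
{
	for (int c = 0; c < copies; c++)
		throttle_reserve(zio->io_spa, zio->io_allocator);
	/* ... allocate DVAs, issue per-vdev children ... */
}

/*
 * Each top-level vdev child releases only its own slot as it
 * completes, so a slow vdev ties up one slot rather than gating
 * the release of all of them.
 */
static void
vdev_child_done(zio_t *cio)
{
	throttle_release(cio->io_spa, cio->io_allocator);
}
```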

robn (Member Author) commented May 29, 2025

It might not look great, but it is done that way intentionally. If we are writing several copies of a block to several different top-level vdevs, we want each of them to be throttled individually; otherwise a single slow vdev would throttle the others, and avoiding that is the point of the allocation throttling mechanism.

Yeah that's fair. I knew why, -ish, but hadn't quite joined up all the dots.

I've put it back the way it was, adjusted for the new model. Tests will run overnight; if it comes up ok I'll push it. Thanks.

robn force-pushed the zio-logical-io branch 2 times, most recently from 098bb06 to f0117bd on May 30, 2025 09:05
robn (Member Author) commented May 30, 2025

Rebased, and updated to move the unthrottle back to when the top vdev IO finishes. resilver_restart_002 still fails; pretty sure it'll be something around where errors are being accounted. I'll get back to it properly next week.

robn force-pushed the zio-logical-io branch from f0117bd to aac3042 on June 5, 2025 08:27
robn (Member Author) commented Jun 5, 2025

resilver_restart_002 still fails; pretty sure it'll be something around where errors are being accounted. I'll get back to it properly next week.

Sorta-kinda. It was actually around where fault injections targeting a DVA/bookmark/object type are applied. I had sorta blindly moved them to zio_logical_io_done() because those are logical concepts; however, the injections still have to be done at the vdev level, (a) because they may target a specific vdev, and (b) because we want to record the vdev as the "source" of the error.

As it turns out, there is exactly one test that uses this (zinject -b), and that's resilver_restart_002. So that's working now.
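For reference, the vdev-level hook looks roughly like this; the sketch uses the real zio_handle_fault_injection() entry point, but the wrapper function is invented for illustration:

```c
/*
 * Illustrative wrapper (not the literal patch): the injection check
 * stays in the vdev done path because the handler may match on a
 * specific vdev, and because zio->io_vd is what gets recorded as
 * the "source" of the injected error.
 */
static void
vdev_done_check_injection(zio_t *zio)
{
	if (zio_injection_enabled && zio->io_error == 0)
		zio->io_error = zio_handle_fault_injection(zio, EIO);
}
```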

robn force-pushed the zio-logical-io branch 3 times, most recently from 7b750a7 to f4480ee on June 12, 2025 00:16
robn (Member Author) commented Jun 12, 2025

I did some semi-scientific test runs this morning. Production (non-debug) build at the given commit. Each run is on a freshly-created pool of 14 disks, with 100 threads: either 73 writing + 27 reading, or 100 writing, with or without fsync after each file write. There's various customer tuning applied, but it's the same on both sides, so I don't claim these numbers are "good" or "bad" in general; over the years, though, comparison timings like these have proven very useful, and I believe they are here too.

14x RAIDz3 (all values MB/s):

| | 3300 w73/r27 write | 3300 w73/r27 read | 3301 w73/r27+fsync write | 3301 w73/r27+fsync read | 3302 w100 write | 3303 w100+fsync write |
|---|---|---|---|---|---|---|
| master @ e0ef4d2 [PROD] | 1134 | 2246 | 387 | 10684 | 1601 | 473 |
| zio-logical-io @ f4480ee [PROD] | 1139 | 2284 | 386 | 10638 | 1601 | 458 |
| delta | +0.44% | +1.69% | -0.26% | -0.43% | 0.00% | -3.17% |

7x2 mirror (all values MB/s):

| | 3300 w73/r27 write | 3300 w73/r27 read | 3301 w73/r27+fsync write | 3301 w73/r27+fsync read | 3302 w100 write | 3303 w100+fsync write |
|---|---|---|---|---|---|---|
| master @ e0ef4d2 [PROD] | 1064 | 3722 | 354 | 10755 | 1302 | 431 |
| zio-logical-io @ f4480ee [PROD] | 1065 | 3731 | 359 | 10688 | 1309 | 430 |
| delta | +0.09% | +0.24% | +1.41% | -0.62% | +0.54% | -0.23% |

So performance difference seems negligible, and the tests are passing. Would appreciate a proper review on this please!

robn force-pushed the zio-logical-io branch from f4480ee to 13b70b1 on June 13, 2025 08:56
behlendorf self-requested a review on June 13, 2025 21:36
robn force-pushed the zio-logical-io branch from 13b70b1 to 1bb3ac9 on June 15, 2025 11:14
Comment on lines +4533 to +4535
```c
zio_nowait(zio_vdev_child_io(zio, zio->io_bp, spa->spa_root_vdev,
    zio->io_offset, zio->io_abd, zio->io_size, zio->io_type,
    zio->io_priority, 0, zio_logical_io_child_done, zio));
```
Member commented:

Considering that zio_t is more than 1KB in size, aside from the time to allocate and handle it, the additional ZIO layer will also increase memory pressure. It would be most obvious on small I/Os, like BRT/DDT, etc., so your benchmarks would be better run at some very small record sizes; otherwise they may not be informative.

robn (Member Author) replied:

Yeah, I've been thinking a lot about that, and I'm sort of glad you pointed it out.

I'll run some small-blocksize tests soon, see what the difference is.

Maybe it's time for me to start looking more seriously at splitting up zio_t. The observation is mostly that for any given instance, most of it is just dead space, so it could be a union or a separate per-type allocation. I've been thinking about it pretty often since #16722, and I know we have to do it eventually, but it's such a massive and invasive undertaking...
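As a rough idea of the union approach, a minimal sketch with an entirely hypothetical layout and invented names, just to show the shape:

```c
/*
 * Hypothetical sketch only: keep the hot common fields in every
 * zio, and push type-specific state into a union (or a separately
 * allocated per-type struct), so each instance stops carrying dead
 * space for the types it isn't.
 */
typedef struct zio_sketch {
	int	io_error;	/* common: error, pipeline state, bp, abd, ... */
	union {
		struct {
			uint64_t	throttle_slots;	/* logical-only state */
		} l;
		struct {
			struct vdev	*vd;	/* vdev-only: io_vd, queue links, ... */
		} v;
	} io_u;
} zio_sketch_t;
```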

```c
zio_logical_io_child_done(zio_t *zio)
{
	zio_t *pio = zio->io_private;
	pio->io_error = zio->io_error;
```
Member commented:

Maybe out of scope for this PR, but while I see similar code in various vdev type implementations, it seems to duplicate the zio_inherit_child_errors() calls in zio_done() and code in zio_notify_parent().

robn (Member Author) replied:

Yeah. It could be achieved by not setting ZIO_FLAG_DONT_PROPAGATE, but I haven't looked hard into the side-effects of that.
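For reference, the propagation that ZIO_FLAG_DONT_PROPAGATE suppresses looks roughly like the sketch below, paraphrased from the parent-notification logic in zio.c; the wrapper function is illustrative and the exact form may differ:

```c
/*
 * Without ZIO_FLAG_DONT_PROPAGATE, a child's error is folded into
 * the parent's per-child-type error, which zio_done() later picks
 * up via zio_inherit_child_errors(), so a manual copy in an io_done
 * callback wouldn't be needed.
 */
static void
propagate_child_error(zio_t *pio, zio_t *zio)
{
	if (zio->io_error && !(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE))
		pio->io_child_error[zio->io_child_type] = zio_worst_error(
		    pio->io_child_error[zio->io_child_type], zio->io_error);
}
```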

If nothing else, use of an io_done callback "inside" the pipeline smells a bit off to me.

I do have in my notes "why don't vdev child IOs propagate errors by default?". I think I would like to revisit that sometime.

But yeah, maybe another time.

robn force-pushed the zio-logical-io branch from 1bb3ac9 to 39e8e26 on June 20, 2025 02:29
The "logical" IO responsible for farming work out to the vdevs goes
through the VDEV_IO stages, even though it does no IO itself, does not
have a vdev set, and is not a "vdev" child IO.

This means the VDEV_IO stages need special handling for this particular
kind of IO, some of it totally irrelevant to real vdev IO (eg config
lock, retry, etc). It also leads to some confusing asymmetries, eg the
DVA throttle is held in the logical, and then released in pieces in the
children.

All this makes the code harder to read and understand, and hard to
extend to limit behaviours to only logical or only vdev IO.

This commit adds two new stages to the pipeline, ZIO_LOGICAL_IO_START
and ZIO_LOGICAL_IO_DONE, to handle this IO. This allows a clean
separation between logical and vdev IO: vdev IO always has io_vd set
and an io_child_type of ZIO_CHILD_VDEV, while logical IO is the
inverse. Logical IO only ever goes through the LOGICAL_IO pipeline,
and vdev IO through VDEV_IO.

This separation presents a new problem, in that previously the logical
IO would call into the mirror vdev ops to issue the vdev IO, which is
now not possible because non-vdev IOs can't use vdev operations. To keep
the overall pipeline tidy, we press the root vdev into service.
zio_logical_io_start() creates a child IO against spa_root_vdev, which
then delegates to the mirror vdev ops to do its work.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
robn force-pushed the zio-logical-io branch from 39e8e26 to 694fba7 on June 20, 2025 02:40