zio: add separate pipeline stages for logical IO #17388

Open: robn wants to merge 1 commit into master from zio-logical-io

Conversation

robn (Member) commented May 28, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

The "logical" IO responsible for farming work out to the vdevs goes through the VDEV_IO stages, even though it does no IO itself, does not have a vdev set, and is not a "vdev" child IO.

This means the VDEV_IO stages need special handling for this particular kind of IO, some of it totally irrelevant to real vdev IO (eg config lock, retry, etc). It also leads to some confusing asymmetries, eg the DVA throttle is held in the logical, and then released in pieces in the children.

(I can elaborate on what I'm doing if more justification is needed, but I'm hopeful this stands on its own as a good cleanup).

Description

This commit adds two new stages to the pipeline, ZIO_LOGICAL_IO_START and ZIO_LOGICAL_IO_DONE, to handle this IO. This allows a clean separation between logical and vdev IO: vdev IO always has io_vd set and an io_child_type of ZIO_CHILD_VDEV, while logical IO is the inverse. Logical IO only ever goes through the LOGICAL_IO pipeline, and vdev IO through VDEV_IO.

This separation presents a new problem, in that previously the logical IO would call into the mirror vdev ops to issue the vdev IO, which is now not possible because non-vdev IOs can't use vdev operations. To keep the overall pipeline tidy, we press the root vdev into service. zio_logical_io_start() creates a child IO against spa_root_vdev, which then delegates to the mirror vdev ops to do its work.
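For illustration, the new start stage ends up roughly like the sketch below, built around the zio_vdev_child_io() call quoted in the review thread; the stage-function signature and the assertions here are assumptions based on existing zio.c conventions, not the literal patch.

```c
/*
 * Sketch of the new logical start stage. The logical IO does no IO
 * of its own: it issues a child IO against the root vdev, whose
 * (mirror) vdev ops then fan the work out to the top-level vdevs
 * holding the block's DVAs.
 */
static zio_t *
zio_logical_io_start(zio_t *zio)
{
	spa_t *spa = zio->io_spa;

	/* Logical IO never has a vdev set and is never a vdev child. */
	ASSERT3P(zio->io_vd, ==, NULL);
	ASSERT3U(zio->io_child_type, !=, ZIO_CHILD_VDEV);

	zio_nowait(zio_vdev_child_io(zio, zio->io_bp, spa->spa_root_vdev,
	    zio->io_offset, zio->io_abd, zio->io_size, zio->io_type,
	    zio->io_priority, 0, zio_logical_io_child_done, zio));

	return (zio);
}
```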

How Has This Been Tested?

Successful ZTS run against Linux 6.1.137 and FreeBSD 14.2-p1.

Throughput appears to be similar in light performance tests, though I have not pushed it very hard.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

robn force-pushed the zio-logical-io branch from e324f6b to 3f23f00 on May 28, 2025 07:28
amotin (Member) commented May 28, 2025

It also leads to some confusing asymmetries, eg the DVA throttle is held in the logical, and then released in pieces in the children.

It might not look great, but it is done that way intentionally. If we are writing several copies of a block to several different top-level vdevs, we want each of them to be throttled individually; otherwise a single slow vdev would throttle the others, and avoiding that is the point of the allocation throttling mechanism.

Previously we even tracked dirty data at individual leaf vdev granularity, but I had to remove that some time ago because the processing overhead was too high.
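To make that concrete, here's a toy sketch of the shape described above, assuming ZFS types are in scope; throttle_reserve()/throttle_release() are hypothetical stand-ins, not the real metaslab throttle API:

```c
/* Hypothetical stand-ins for the real allocation-throttle calls. */
extern void throttle_reserve(spa_t *spa, uint32_t allocator);
extern void throttle_release(spa_t *spa, uint32_t allocator);

/*
 * The logical write reserves one slot per DVA copy up front, then
 * issues one child IO per top-level vdev.
 */
static void
logical_write_issue(zio_t *zio, int copies)
{
	for (int c = 0; c < copies; c++)
		throttle_reserve(zio->io_spa, zio->io_allocator);
	/* ... allocate DVAs, issue per-vdev children ... */
}

/*
 * Each top-level vdev child releases only its own slot as it
 * completes, so a slow vdev ties up one slot rather than gating
 * the release of all of them.
 */
static void
vdev_child_done(zio_t *cio)
{
	throttle_release(cio->io_spa, cio->io_allocator);
}
```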

robn (Member Author) commented May 29, 2025

It might not look great, but it is done that way intentionally. If we are writing several copies of a block to several different top-level vdevs, we want each of them to be throttled individually; otherwise a single slow vdev would throttle the others, and avoiding that is the point of the allocation throttling mechanism.

Yeah that's fair. I knew why, -ish, but hadn't quite joined up all the dots.

I've put it back the way it was, adjusted for the new model. Tests will run overnight; if it comes up ok I'll push it. Thanks.

robn force-pushed the zio-logical-io branch 2 times, most recently from 098bb06 to f0117bd on May 30, 2025 09:05
robn (Member Author) commented May 30, 2025

Rebased, and updated to move the unthrottle back to when the top vdev IO finishes. resilver_restart_002 still fails; pretty sure it'll be something around where errors are being accounted. I'll get back to it properly next week.

robn force-pushed the zio-logical-io branch from f0117bd to aac3042 on June 5, 2025 08:27
robn (Member Author) commented Jun 5, 2025

resilver_restart_002 still fails; pretty sure it'll be something around where errors are being accounted. I'll get back to it properly next week.

Sorta-kinda. It was actually around where fault injections targeting a DVA/bookmark/object type are applied. I had sorta blindly moved them to zio_logical_io_done() because those are logical concepts; however, the injections still have to be done at the vdev level, (a) because they may target a specific vdev, and (b) because we want to record the vdev as the "source" of the error.

As it turns out, there is exactly one test that uses this (zinject -b), and that's resilver_restart_002. So that's working now.
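For reference, the vdev-level hook looks roughly like this; the sketch uses the real zio_handle_fault_injection() entry point, but the wrapper function is invented for illustration:

```c
/*
 * Illustrative wrapper (not the literal patch): the injection check
 * stays in the vdev done path because the handler may match on a
 * specific vdev, and because zio->io_vd is what gets recorded as
 * the "source" of the injected error.
 */
static void
vdev_done_check_injection(zio_t *zio)
{
	if (zio_injection_enabled && zio->io_error == 0)
		zio->io_error = zio_handle_fault_injection(zio, EIO);
}
```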

robn force-pushed the zio-logical-io branch 3 times, most recently from 7b750a7 to f4480ee on June 12, 2025 00:16
robn (Member Author) commented Jun 12, 2025

I did some semi-scientific test runs this morning. Production (non-debug) build at the given commit. Each run is on a freshly-created pool of 14 disks, with 100 threads: either 73 writing + 27 reading, or 100 writing, with or without fsync after each file write. There's various customer tuning applied, but it's the same on both sides, so I don't claim these numbers are "good" or "bad" in general; over the years, though, comparison timings like these have proven very useful, and I believe they are here too.

14x RAIDz3 (all values MB/s):

| | 3300 w73/r27 write | 3300 w73/r27 read | 3301 w73/r27+fsync write | 3301 w73/r27+fsync read | 3302 w100 write | 3303 w100+fsync write |
|---|---|---|---|---|---|---|
| master @ e0ef4d2 [PROD] | 1134 | 2246 | 387 | 10684 | 1601 | 473 |
| zio-logical-io @ f4480ee [PROD] | 1139 | 2284 | 386 | 10638 | 1601 | 458 |
| delta | +0.44% | +1.69% | -0.26% | -0.43% | 0.00% | -3.17% |

7x2 mirror (all values MB/s):

| | 3300 w73/r27 write | 3300 w73/r27 read | 3301 w73/r27+fsync write | 3301 w73/r27+fsync read | 3302 w100 write | 3303 w100+fsync write |
|---|---|---|---|---|---|---|
| master @ e0ef4d2 [PROD] | 1064 | 3722 | 354 | 10755 | 1302 | 431 |
| zio-logical-io @ f4480ee [PROD] | 1065 | 3731 | 359 | 10688 | 1309 | 430 |
| delta | +0.09% | +0.24% | +1.41% | -0.62% | +0.54% | -0.23% |

So performance difference seems negligible, and the tests are passing. Would appreciate a proper review on this please!

robn force-pushed the zio-logical-io branch from f4480ee to 13b70b1 on June 13, 2025 08:56
behlendorf self-requested a review on June 13, 2025 21:36
robn force-pushed the zio-logical-io branch from 13b70b1 to 1bb3ac9 on June 15, 2025 11:14
Comment on lines +4533 to +4535
```c
zio_nowait(zio_vdev_child_io(zio, zio->io_bp, spa->spa_root_vdev,
    zio->io_offset, zio->io_abd, zio->io_size, zio->io_type,
    zio->io_priority, 0, zio_logical_io_child_done, zio));
```
Member commented:

Considering that zio_t is more than 1KB in size, aside from the time to allocate and handle it, the additional ZIO layer will also increase memory pressure. It would be most obvious on small I/Os, like BRT/DDT, etc., so your benchmarks would be better run at some very small record sizes; otherwise they may not be informative.

robn (Member Author) replied:

Yeah, I've been thinking a lot about that, and I'm sort of glad you pointed it out.

I'll run some small-blocksize tests soon, see what the difference is.

Maybe it's time for me to start looking more seriously at splitting up zio_t. The observation is mostly that for any given instance, most of it is just dead space, so it could be a union or a separate per-type allocation. I've been thinking about it pretty often since #16722, and I know we have to do it eventually, but it's such a massive and invasive undertaking...
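As a rough idea of the union approach, a minimal sketch with an entirely hypothetical layout and invented names, just to show the shape:

```c
/*
 * Hypothetical sketch only: keep the hot common fields in every
 * zio, and push type-specific state into a union (or a separately
 * allocated per-type struct), so each instance stops carrying dead
 * space for the types it isn't.
 */
typedef struct zio_sketch {
	int	io_error;	/* common: error, pipeline state, bp, abd, ... */
	union {
		struct {
			uint64_t	throttle_slots;	/* logical-only state */
		} l;
		struct {
			struct vdev	*vd;	/* vdev-only: io_vd, queue links, ... */
		} v;
	} io_u;
} zio_sketch_t;
```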

```c
zio_logical_io_child_done(zio_t *zio)
{
	zio_t *pio = zio->io_private;
	pio->io_error = zio->io_error;
```
Member commented:

Maybe out of scope for this PR, but while I see similar code in various vdev type implementations, it seems to duplicate the zio_inherit_child_errors() calls in zio_done() and code in zio_notify_parent().

robn (Member Author) replied:

Yeah. It could be achieved by not setting ZIO_FLAG_DONT_PROPAGATE, but I haven't looked hard into the side-effects of that.
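For reference, the propagation that ZIO_FLAG_DONT_PROPAGATE suppresses looks roughly like the sketch below, paraphrased from the parent-notification logic in zio.c; the wrapper function is illustrative and the exact form may differ:

```c
/*
 * Without ZIO_FLAG_DONT_PROPAGATE, a child's error is folded into
 * the parent's per-child-type error, which zio_done() later picks
 * up via zio_inherit_child_errors(), so a manual copy in an io_done
 * callback wouldn't be needed.
 */
static void
propagate_child_error(zio_t *pio, zio_t *zio)
{
	if (zio->io_error && !(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE))
		pio->io_child_error[zio->io_child_type] = zio_worst_error(
		    pio->io_child_error[zio->io_child_type], zio->io_error);
}
```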

If nothing else, use of an io_done callback "inside" the pipeline smells a bit off to me.

I do have in my notes "why don't vdev child IOs propagate errors by default?". I think I would like to revisit that sometime.

But yeah, maybe another time.

robn force-pushed the zio-logical-io branch from 1bb3ac9 to 39e8e26 on June 20, 2025 02:29
The "logical" IO responsible for farming work out to the vdevs goes
through the VDEV_IO stages, even though it does no IO itself, does not
have a vdev set, and is not a "vdev" child IO.

This means the VDEV_IO stages need special handling for this particular
kind of IO, some of it totally irrelevant to real vdev IO (eg config
lock, retry, etc). It also leads to some confusing asymmetries, eg the
DVA throttle is held in the logical, and then released in pieces in the
children.

All this makes the code harder to read and understand, and hard to
extend to limit behaviours to only logical or only vdev IO.

This commit adds two new stages to the pipeline, ZIO_LOGICAL_IO_START
and ZIO_LOGICAL_IO_DONE, to handle this IO. This allows a clean
separation between logical and vdev IO: vdev IO always has io_vd set
and an io_child_type of ZIO_CHILD_VDEV, while logical IO is the
inverse. Logical IO only ever goes through the LOGICAL_IO pipeline,
and vdev IO through VDEV_IO.

This separation presents a new problem, in that previously the logical
IO would call into the mirror vdev ops to issue the vdev IO, which is
now not possible because non-vdev IOs can't use vdev operations. To keep
the overall pipeline tidy, we press the root vdev into service.
zio_logical_io_start() creates a child IO against spa_root_vdev, which
then delegates to the mirror vdev ops to do its work.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
robn force-pushed the zio-logical-io branch from 39e8e26 to 694fba7 on June 20, 2025 02:40