zio: add separate pipeline stages for logical IO #17388
base: master
Conversation
It might not look great, but it is done that way intentionally. If we are writing several copies of a block to several different top-level vdevs, we want each of them to be throttled individually. Otherwise a single slow vdev would throttle the others, and avoiding that is the point of the allocation throttling mechanism. Previously we even tracked dirty data at individual leaf vdev granularity, but I had to remove that some time ago due to too high processing overhead.
Yeah, that's fair. I knew why, -ish, but hadn't quite joined up all the dots. I've put it back the way it was, adjusted for the new model. Tests will run overnight; if it comes up ok I'll push it. Thanks.
Force-pushed from 098bb06 to f0117bd.
Rebased, and updated to move the unthrottle back to when the top vdev IO finishes.
Sorta-kinda. It was actually around where fault injections targeting a DVA/bookmark/object type were injected. I had sorta blindly moved them to [...]. As it turns out, there is exactly one test that uses this ([...]).
Force-pushed from 7b750a7 to f4480ee.
I did some semi-scientific test runs this morning: production (non-debug) builds at the given commit. Each run is on a freshly created pool of 14 disks, with 100 threads, either 73 writing + 27 reading or 100 writing, with or without fsync after each file write. There's various customer tuning applied, but it's the same in both cases, so I don't claim the results are "good" or "bad" in general; over the years, though, these comparison timings have proven very useful, and I believe they are here too.
So the performance difference seems negligible, and the tests are passing. Would appreciate a proper review on this, please!
zio_nowait(zio_vdev_child_io(zio, zio->io_bp, spa->spa_root_vdev,
    zio->io_offset, zio->io_abd, zio->io_size, zio->io_type,
    zio->io_priority, 0, zio_logical_io_child_done, zio));
Considering that zio_t is more than 1KB in size, aside from the time to allocate and handle it, the additional ZIO layer will also increase memory pressure. It would be most obvious on small I/Os, like BRT/DDT, etc., so your benchmarks would be better run on some very small record sizes, otherwise they may not be informative.
Yeah, I've been thinking a lot about that, and I'm sort of glad you pointed it out.
I'll run some small-blocksize tests soon, see what the difference is.
Maybe it's time for me to start looking more seriously at splitting up zio_t. The observation is mostly that for any given instance, most of it is just dead space, so it could be a union or a separate per-type allocation. I've been thinking about it pretty often since #16722, and I know we have to do it eventually, but it's such a massive and invasive undertaking...
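For illustration only, here is a minimal sketch of the kind of split being floated, with entirely made-up field names; the real zio_t carries far more state, and this is not a proposal for its actual layout:

```c
#include <stdint.h>

/*
 * Hypothetical sketch only -- not the real zio_t. The idea is that state
 * used by only one kind of IO moves into a union (or a separate per-type
 * allocation), so every instance stops paying for fields it never touches.
 * All names below are invented for illustration.
 */
typedef struct example_zio {
	int		eio_type;		/* read, write, free, ... */
	int		eio_child_type;		/* logical vs vdev child */
	int		eio_error;
	uint64_t	eio_size;
	union {
		struct {			/* logical IO only */
			uint64_t	txg;
			void		*throttle_tag;
		} logical;
		struct {			/* vdev child IO only */
			void		*vd;
			uint64_t	offset;
		} vdev_child;
	} eio_u;
} example_zio_t;
```

Even a rough split like this shrinks the common case, at the cost of every access to type-specific fields having to know which arm of the union is live.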
zio_logical_io_child_done(zio_t *zio)
{
	zio_t *pio = zio->io_private;
	pio->io_error = zio->io_error;
May be out of this PR's scope, but while I see similar code in various vdev type implementations, it seems to duplicate the zio_inherit_child_errors() calls in zio_done() and the code in zio_notify_parent().
Yeah. It could be achieved by not setting ZIO_FLAG_DONT_PROPAGATE, but I haven't looked hard into the side-effects of that. If nothing else, use of an io_done callback "inside" the pipeline smells a bit off to me.
I do have in my notes "why don't vdev child IOs propagate errors by default?". I think I would like to revisit that sometime.
But yeah, maybe another time.
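To make the contrast concrete, here is a toy illustration (not OpenZFS source; the types and functions are invented) of the two approaches being compared: an explicit done callback copying the error by hand, versus the framework folding the child error into the parent when propagation is allowed:

```c
#include <stdbool.h>

/* Toy types for illustration only; this is not the zio framework. */
struct toy_io {
	int	error;
	bool	dont_propagate;
};

/* What an explicit done callback (like the one quoted above) does by hand. */
static void
toy_child_done(struct toy_io *parent, struct toy_io *child)
{
	parent->error = child->error;
}

/* What automatic propagation would do if the child allowed it. */
static void
toy_notify_parent(struct toy_io *parent, struct toy_io *child)
{
	if (child->error != 0 && !child->dont_propagate && parent->error == 0)
		parent->error = child->error;
}
```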
The "logical" IO responsible for farming work out to the vdevs goes through the VDEV_IO stages, even though it does no IO itself, does not have a vdev set, and is not a "vdev" child IO. This means the VDEV_IO stages need special handling for this particular kind of IO, some of it totally irrelevant to real vdev IO (eg config lock, retry, etc). It also leads to some confusing asymmetries, eg the DVA throttle is held in the logical, and then released in pieces in the children. All this makes the code harder to read and understand, and hard to extend to limit behaviours to only logical or only vdev IO. This commit adds two new stages to the pipeline, ZIO_LOGICAL_IO_START and ZIO_LOGICAL_IO_DONE to handle this IO. This allows a clean separation between logical and vdev IO: vdev IO always has io_vd set, an io_child_type of ZIO_CHILD_VDEV, while logical IO is the inverse. Logical IO only ever goes throught through the LOGICAL_IO pipeline, and vdev IO through VDEV_IO. This separation presents a new problem, in that previously the logical IO would call into the mirror vdev ops to issue the vdev IO, which is now not possible because non-vdev IOs can't use vdev operations. To keep the overall pipeline tidy, we press the root vdev into service. zio_logical_io_start() creates a child IO against spa_root_vdev, which then delegates to the mirror vdev ops to do its work. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
The "logical" IO responsible for farming work out to the vdevs goes through the
VDEV_IO
stages, even though it does no IO itself, does not have a vdev set, and is not a "vdev" child IO.This means the
VDEV_IO
stages need special handling for this particular kind of IO, some of it totally irrelevant to real vdev IO (eg config lock, retry, etc). It also leads to some confusing asymmetries, eg the DVA throttle is held in the logical, and then released in pieces in the children.(I can elaborate on what I'm doing if more justification is needed, but I'm hopeful this stands on its own as a good cleanup).
Description
This commit adds two new stages to the pipeline, ZIO_LOGICAL_IO_START and ZIO_LOGICAL_IO_DONE, to handle this IO. This allows a clean separation between logical and vdev IO: vdev IO always has io_vd set and an io_child_type of ZIO_CHILD_VDEV, while logical IO is the inverse. Logical IO only ever goes through the LOGICAL_IO pipeline, and vdev IO through VDEV_IO.

This separation presents a new problem, in that previously the logical IO would call into the mirror vdev ops to issue the vdev IO, which is now not possible because non-vdev IOs can't use vdev operations. To keep the overall pipeline tidy, we press the root vdev into service: zio_logical_io_start() creates a child IO against spa_root_vdev, which then delegates to the mirror vdev ops to do its work.
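As a rough sketch only, reconstructed from the hunk quoted in the review above and assuming the usual pipeline-stage signature (the actual code in the PR may differ in flags and error handling), the new start stage amounts to:

```c
/*
 * Sketch, not the exact PR code: the logical start stage does no device
 * work itself. It issues a single vdev child IO against the root vdev,
 * whose ops then fan the work out to the top-level vdevs for each DVA.
 */
static zio_t *
zio_logical_io_start(zio_t *zio)
{
	spa_t *spa = zio->io_spa;

	zio_nowait(zio_vdev_child_io(zio, zio->io_bp, spa->spa_root_vdev,
	    zio->io_offset, zio->io_abd, zio->io_size, zio->io_type,
	    zio->io_priority, 0, zio_logical_io_child_done, zio));

	return (zio);
}
```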
How Has This Been Tested?
Successful ZTS run against Linux 6.1.137 and FreeBSD 14.2-p1.
Throughput appears to be similar in light performance tests, though I have not pushed it very hard.