zio: use a separate allocation for io_bp_copy and io_bp_orig
#17637
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
256 bytes of zio_t are devoted to two blkptr_ts that aren't used by many zio types. Let's make them pointers instead, and only allocate them when we need them.

This PR is intended as a kind of "polished prototype" for discussion. Maybe no one has strong opinions and it sails through, but idk, I have some mild unease about this one, so let's chat :)
Description
The short version: add a new cache, zio_bp_cache. In zio_create(), only allocate from it when the IO type demands it: for io_bp_copy, when we need our own copy because the caller will free theirs, and for io_bp_orig, when we will modify the bp (ie most writes) and need the original around in case of a rollback.

The three commits are very straightforward and should be squashed in the end; the top one without the signoff is a dumb stats addition just to make it easier to understand what's happening while we're testing and playing around (those would probably be better in the cache stats really, but that's another change for another day).
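As a rough illustration of the allocate-only-when-needed idea described above, here is a minimal userspace sketch. It is not the code in this PR: the `sketch_*` names are hypothetical, `malloc()` stands in for an allocation from zio_bp_cache, and the two boolean flags stand in for whatever conditions zio_create() actually checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for blkptr_t (128 bytes, so two of them are 256 bytes). */
typedef struct sketch_bp {
	uint64_t words[16];
} sketch_bp_t;

/* Hypothetical, heavily trimmed stand-in for zio_t. */
typedef struct sketch_zio {
	sketch_bp_t *io_bp;		/* the bp this IO operates on */
	sketch_bp_t *io_bp_copy;	/* private copy, or NULL if unused */
	sketch_bp_t *io_bp_orig;	/* pre-modification copy, or NULL */
} sketch_zio_t;

/* Assumption: malloc() plays the role of the dedicated bp cache. */
static sketch_bp_t *
sketch_bp_dup(const sketch_bp_t *src)
{
	sketch_bp_t *bp = malloc(sizeof (*bp));
	assert(bp != NULL);
	memcpy(bp, src, sizeof (*bp));
	return (bp);
}

sketch_zio_t *
sketch_zio_create(sketch_bp_t *bp, bool caller_frees_bp, bool will_modify_bp)
{
	sketch_zio_t *zio = calloc(1, sizeof (*zio));
	assert(zio != NULL);

	zio->io_bp = bp;
	if (bp != NULL && caller_frees_bp) {
		/* The caller's bp may go away, so keep our own copy. */
		zio->io_bp_copy = sketch_bp_dup(bp);
		zio->io_bp = zio->io_bp_copy;
	}
	if (bp != NULL && will_modify_bp) {
		/* Keep the original around in case of a rollback. */
		zio->io_bp_orig = sketch_bp_dup(bp);
	}
	return (zio);
}

void
sketch_zio_destroy(sketch_zio_t *zio)
{
	free(zio->io_bp_copy);	/* free(NULL) is a no-op */
	free(zio->io_bp_orig);
	free(zio);
}
```

The consequence for callers is that either pointer may now be NULL, so any code that previously read the embedded structs unconditionally needs a check first.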
But now let's talk about the things I don't like :)
I don't like the separate allocation, but it's kind of what it has to be. It should be efficient enough once the cache is warm, though there are never too many out at once. The gain on the memory side seems worth it though. In my unscientific tests, some straightforward reading & writing to a 4xZ1 suggests we only allocate from zio_bp_cache for about 10% of all zio_cache allocations. A full ZTS run did about 40%, which may be more representative over time, but still seems worth a look.

This is a fairly naive conversion which took me all of a day, including testing runs, $dayjob and a big sleep. We could likely do better if we spent some time considering how we pass BPs around and where we could be a bit smarter. Could we share BPs that won't change more through the IO tree? Reach into our parent or io_logical to grab one if we're downstream? Some refcounting and/or copy-on-write arrangement?

(at least, we can probably easily fix that L2ARC thing by just passing it to arc_read_done() instead of smuggling it inside the zio).

Here's the size change. As you see, the new one has a pretty hefty hole in the middle of it. It's easy enough to reorder some things to fill it, but it doesn't materially change anything because of the cacheline alignment; we just get extra padding at the end. That feels like something for the next PR.
pahole before (1152 bytes)
pahole after (960 bytes)
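The two sizes above are consistent with simply swapping two embedded 128-byte blkptr_ts for two pointers and rounding up. The sketch below checks that arithmetic under stated assumptions: 8-byte pointers, a 64-byte cacheline alignment on the struct, and no other layout changes. It ignores internal holes entirely, and the `sketch_*` names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define	SKETCH_CACHELINE	64	/* assumed struct alignment */
#define	SKETCH_BLKPTR_SIZE	128	/* two of these are the 256 bytes */
#define	SKETCH_PTR_SIZE		8	/* assumed 64-bit pointers */

/* Round n up to the next multiple of align. */
static size_t
sketch_roundup(size_t n, size_t align)
{
	assert(align != 0);
	return ((n + align - 1) / align * align);
}

/* Estimated struct size after replacing two embedded bps with pointers. */
size_t
sketch_zio_size_after(size_t before)
{
	size_t raw = before - 2 * SKETCH_BLKPTR_SIZE + 2 * SKETCH_PTR_SIZE;
	return (sketch_roundup(raw, SKETCH_CACHELINE));
}
```

With these assumptions, 1152 - 256 + 16 = 912 bytes, which rounds up to 960. That is also why filling the hole by reordering fields would not shrink the struct unless it freed enough to drop a whole cacheline.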
The other thing I'm unsure about with this is whether we generally want to look towards this "componentised" kind of model of zio_t, where it only has the pieces it needs. The fields can be grouped into a bunch of different "uses", eg, there's a handful of items specific to queue operations, which aren't needed outside of read/write/trim to leaf devices. I can see arguments for and against; it does let us get the memory usage down to just what we need, but C doesn't make it easy to work with that sort of structure, and it's also gotta be done carefully to avoid fragmenting memory too much. Maybe it doesn't matter if we don't see this PR as a precedent, or we just consider this a step in a direction.

How Has This Been Tested?
Full ZTS run completed successfully against Linux 6.12. I would not, however, be surprised to find I've introduced some very subtle NULL deref in some niche situation.