zio: use a separate allocation for io_bp_copy and io_bp_orig #17637


Open
wants to merge 4 commits into master

Conversation

robn (Member) commented Aug 14, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

256 bytes of zio_t are devoted to two blkptr_ts that aren't used by many zio types. Let's make them pointers instead, and only allocate them when we need them.

This PR is intended as a kind of "polished prototype" for discussion. Maybe no one has strong opinions and it sails through, but I have some mild unease about this one, so let's chat :)

Description

The short version: add a new cache, zio_bp_cache. In zio_create(), only allocate these when the IO type demands it: io_bp_copy when we need our own copy because the caller will free theirs, and io_bp_orig when we will modify the bp (i.e. most writes) and need the original around in case of a rollback.
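The selection logic might look roughly like this userspace model, where malloc() stands in for kmem_cache_alloc() from the new zio_bp_cache. The flag parameters and the zio_attach_bp() helper are illustrative only, not the PR's actual code:

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the real blkptr_t, which is 128 bytes. */
typedef struct { unsigned char b[128]; } blkptr_t;

typedef struct zio {
	blkptr_t *io_bp;       /* the BP this zio works on */
	blkptr_t *io_bp_copy;  /* private copy; only when the caller frees theirs */
	blkptr_t *io_bp_orig;  /* pristine copy; only when we may modify io_bp */
} zio_t;

/* malloc() here models kmem_cache_alloc(zio_bp_cache, KM_SLEEP). */
static blkptr_t *
bp_dup(const blkptr_t *bp)
{
	blkptr_t *nbp = malloc(sizeof (*nbp));
	memcpy(nbp, bp, sizeof (*nbp));
	return (nbp);
}

/* Illustrative: decide which separate allocations this zio actually needs. */
static void
zio_attach_bp(zio_t *zio, blkptr_t *bp, int caller_frees_bp, int will_modify_bp)
{
	if (bp == NULL)
		return;				/* many zio types carry no BP */
	if (caller_frees_bp) {
		zio->io_bp_copy = bp_dup(bp);	/* must outlive the caller's */
		zio->io_bp = zio->io_bp_copy;
	} else {
		zio->io_bp = bp;		/* caller's BP outlives us */
	}
	if (will_modify_bp)
		zio->io_bp_orig = bp_dup(zio->io_bp); /* keep for rollback */
}
```

A plain read would take neither allocation; a rewriting write would take both.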

The three commits are very straightforward and should be squashed in the end; the top one, without the signoff, is a dumb stats addition just to make it easier to see what's happening while we're testing and playing around (those would probably be better in the cache stats, but that's another change for another day).

But now let's talk about the things I don't like :)

I don't like the separate allocation, but it's kind of what it has to be. It should be efficient enough once the cache is warm, and there are never that many out at once. The gain on the memory side seems worth it, though. In my unscientific tests, some straightforward reading & writing to a 4xZ1 suggests we only allocate from zio_bp_cache for about 10% of all zio_cache allocations. A full ZTS run did about 40%, which may be more representative over time, but still seems worthwhile.

This is a fairly naive conversion which took me all of a day, including testing runs, $dayjob and a big sleep. We could likely do better if we spent some time considering how we pass BPs around and where we could be a bit smarter. Could we share BPs that won't change more widely through the IO tree? Reach into our parent or io_logical to grab one if we're downstream? Some refcounting and/or copy-on-write arrangement?

(At least we can probably easily fix that L2ARC thing by just passing it to arc_read_done() instead of smuggling it inside the zio.)

Here's the size change. As you can see, the new one has a pretty hefty hole in the middle of it. It's easy enough to reorder some things to fill it, but it doesn't materially change anything because of the cacheline alignment; we'd just get extra padding at the end. That feels like something for the next PR.

pahole before (1152 bytes)
struct zio {
	zbookmark_phys_t           io_bookmark;          /*     0    32 */
	zio_prop_t                 io_prop;              /*    32    52 */
	/* --- cacheline 1 boundary (64 bytes) was 20 bytes ago --- */
	zio_type_t                 io_type;              /*    84     4 */
	enum zio_child             io_child_type;        /*    88     4 */
	enum trim_flag             io_trim_flags;        /*    92     4 */
	zio_priority_t             io_priority;          /*    96     4 */
	uint8_t                    io_post;              /*   100     1 */
	uint8_t                    io_state[2];          /*   101     2 */

	/* XXX 1 byte hole, try to pack */

	uint64_t                   io_txg;               /*   104     8 */
	spa_t *                    io_spa;               /*   112     8 */
	blkptr_t *                 io_bp;                /*   120     8 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	blkptr_t *                 io_bp_override;       /*   128     8 */
	blkptr_t                   io_bp_copy;           /*   136   128 */
	/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
	list_t                     io_parent_list;       /*   264    24 */
	list_t                     io_child_list;        /*   288    24 */
	zio_t *                    io_logical;           /*   312     8 */
	/* --- cacheline 5 boundary (320 bytes) --- */
	zio_transform_t *          io_transform_stack;   /*   320     8 */
	zio_done_func_t *          io_ready;             /*   328     8 */
	zio_done_func_t *          io_children_ready;    /*   336     8 */
	zio_done_func_t *          io_done;              /*   344     8 */
	void *                     io_private;           /*   352     8 */
	int64_t                    io_prev_space_delta;  /*   360     8 */
	blkptr_t                   io_bp_orig;           /*   368   128 */
	/* --- cacheline 7 boundary (448 bytes) was 48 bytes ago --- */
	uint64_t                   io_lsize;             /*   496     8 */
	struct abd *               io_abd;               /*   504     8 */
	/* --- cacheline 8 boundary (512 bytes) --- */
	struct abd *               io_orig_abd;          /*   512     8 */
	uint64_t                   io_size;              /*   520     8 */
	uint64_t                   io_orig_size;         /*   528     8 */
	vdev_t *                   io_vd;                /*   536     8 */
	void *                     io_vsd;               /*   544     8 */
	const zio_vsd_ops_t  *     io_vsd_ops;           /*   552     8 */
	metaslab_class_t *         io_metaslab_class;    /*   560     8 */
	enum zio_qstate            io_queue_state;       /*   568     4 */

	/* XXX 4 bytes hole, try to pack */

	/* --- cacheline 9 boundary (576 bytes) --- */
	union {
		list_node_t        l;                    /*   576    16 */
		avl_node_t         a;                    /*   576    24 */
	} io_queue_node __attribute__((__aligned__(64))); /*   576    24 */
	avl_node_t                 io_offset_node;       /*   600    24 */
	uint64_t                   io_offset;            /*   624     8 */
	hrtime_t                   io_timestamp;         /*   632     8 */
	/* --- cacheline 10 boundary (640 bytes) --- */
	hrtime_t                   io_queued_timestamp;  /*   640     8 */
	hrtime_t                   io_target_timestamp;  /*   648     8 */
	hrtime_t                   io_delta;             /*   656     8 */
	hrtime_t                   io_delay;             /*   664     8 */
	zio_alloc_list_t           io_alloc_list;        /*   672    32 */
	/* --- cacheline 11 boundary (704 bytes) --- */
	zio_flag_t                 io_flags;             /*   704     8 */
	enum zio_stage             io_stage;             /*   712     4 */
	enum zio_stage             io_pipeline;          /*   716     4 */
	zio_flag_t                 io_orig_flags;        /*   720     8 */
	enum zio_stage             io_orig_stage;        /*   728     4 */
	enum zio_stage             io_orig_pipeline;     /*   732     4 */
	enum zio_stage             io_pipeline_trace;    /*   736     4 */
	int                        io_error;             /*   740     4 */
	int                        io_child_error[4];    /*   744    16 */
	uint64_t                   io_children[4][2];    /*   760    64 */
	/* --- cacheline 12 boundary (768 bytes) was 56 bytes ago --- */
	uint64_t *                 io_stall;             /*   824     8 */
	/* --- cacheline 13 boundary (832 bytes) --- */
	zio_t *                    io_gang_leader;       /*   832     8 */
	zio_gang_node_t *          io_gang_tree;         /*   840     8 */
	void *                     io_executor;          /*   848     8 */
	void *                     io_waiter;            /*   856     8 */
	void *                     io_bio;               /*   864     8 */
	kmutex_t                   io_lock;              /*   872    48 */
	/* --- cacheline 14 boundary (896 bytes) was 24 bytes ago --- */
	kcondvar_t                 io_cv;                /*   920    72 */
	/* --- cacheline 15 boundary (960 bytes) was 32 bytes ago --- */
	int                        io_allocator;         /*   992     4 */

	/* XXX 4 bytes hole, try to pack */

	zio_cksum_report_t *       io_cksum_report;      /*  1000     8 */
	uint64_t                   io_ena;               /*  1008     8 */
	taskq_ent_t                io_tqent;             /*  1016   136 */

	/* size: 1152, cachelines: 18, members: 64 */
	/* sum members: 1143, holes: 3, sum holes: 9 */
	/* forced alignments: 1, forced holes: 1, sum forced holes: 4 */
} __attribute__((__aligned__(64)));
pahole after (960 bytes)
struct zio {
	zbookmark_phys_t           io_bookmark;          /*     0    32 */
	zio_prop_t                 io_prop;              /*    32    52 */
	/* --- cacheline 1 boundary (64 bytes) was 20 bytes ago --- */
	zio_type_t                 io_type;              /*    84     4 */
	enum zio_child             io_child_type;        /*    88     4 */
	enum trim_flag             io_trim_flags;        /*    92     4 */
	zio_priority_t             io_priority;          /*    96     4 */
	uint8_t                    io_post;              /*   100     1 */
	uint8_t                    io_state[2];          /*   101     2 */

	/* XXX 1 byte hole, try to pack */

	uint64_t                   io_txg;               /*   104     8 */
	spa_t *                    io_spa;               /*   112     8 */
	blkptr_t *                 io_bp;                /*   120     8 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	blkptr_t *                 io_bp_override;       /*   128     8 */
	blkptr_t *                 io_bp_copy;           /*   136     8 */
	list_t                     io_parent_list;       /*   144    24 */
	list_t                     io_child_list;        /*   168    24 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	zio_t *                    io_logical;           /*   192     8 */
	zio_transform_t *          io_transform_stack;   /*   200     8 */
	zio_done_func_t *          io_ready;             /*   208     8 */
	zio_done_func_t *          io_children_ready;    /*   216     8 */
	zio_done_func_t *          io_done;              /*   224     8 */
	void *                     io_private;           /*   232     8 */
	int64_t                    io_prev_space_delta;  /*   240     8 */
	blkptr_t *                 io_bp_orig;           /*   248     8 */
	/* --- cacheline 4 boundary (256 bytes) --- */
	uint64_t                   io_lsize;             /*   256     8 */
	struct abd *               io_abd;               /*   264     8 */
	struct abd *               io_orig_abd;          /*   272     8 */
	uint64_t                   io_size;              /*   280     8 */
	uint64_t                   io_orig_size;         /*   288     8 */
	vdev_t *                   io_vd;                /*   296     8 */
	void *                     io_vsd;               /*   304     8 */
	const zio_vsd_ops_t  *     io_vsd_ops;           /*   312     8 */
	/* --- cacheline 5 boundary (320 bytes) --- */
	metaslab_class_t *         io_metaslab_class;    /*   320     8 */
	enum zio_qstate            io_queue_state;       /*   328     4 */

	/* XXX 52 bytes hole, try to pack */

	/* --- cacheline 6 boundary (384 bytes) --- */
	union {
		list_node_t        l;                    /*   384    16 */
		avl_node_t         a;                    /*   384    24 */
	} io_queue_node __attribute__((__aligned__(64))); /*   384    24 */
	avl_node_t                 io_offset_node;       /*   408    24 */
	uint64_t                   io_offset;            /*   432     8 */
	hrtime_t                   io_timestamp;         /*   440     8 */
	/* --- cacheline 7 boundary (448 bytes) --- */
	hrtime_t                   io_queued_timestamp;  /*   448     8 */
	hrtime_t                   io_target_timestamp;  /*   456     8 */
	hrtime_t                   io_delta;             /*   464     8 */
	hrtime_t                   io_delay;             /*   472     8 */
	zio_alloc_list_t           io_alloc_list;        /*   480    32 */
	/* --- cacheline 8 boundary (512 bytes) --- */
	zio_flag_t                 io_flags;             /*   512     8 */
	enum zio_stage             io_stage;             /*   520     4 */
	enum zio_stage             io_pipeline;          /*   524     4 */
	zio_flag_t                 io_orig_flags;        /*   528     8 */
	enum zio_stage             io_orig_stage;        /*   536     4 */
	enum zio_stage             io_orig_pipeline;     /*   540     4 */
	enum zio_stage             io_pipeline_trace;    /*   544     4 */
	int                        io_error;             /*   548     4 */
	int                        io_child_error[4];    /*   552    16 */
	uint64_t                   io_children[4][2];    /*   568    64 */
	/* --- cacheline 9 boundary (576 bytes) was 56 bytes ago --- */
	uint64_t *                 io_stall;             /*   632     8 */
	/* --- cacheline 10 boundary (640 bytes) --- */
	zio_t *                    io_gang_leader;       /*   640     8 */
	zio_gang_node_t *          io_gang_tree;         /*   648     8 */
	void *                     io_executor;          /*   656     8 */
	void *                     io_waiter;            /*   664     8 */
	void *                     io_bio;               /*   672     8 */
	kmutex_t                   io_lock;              /*   680    48 */
	/* --- cacheline 11 boundary (704 bytes) was 24 bytes ago --- */
	kcondvar_t                 io_cv;                /*   728    72 */
	/* --- cacheline 12 boundary (768 bytes) was 32 bytes ago --- */
	int                        io_allocator;         /*   800     4 */

	/* XXX 4 bytes hole, try to pack */

	zio_cksum_report_t *       io_cksum_report;      /*   808     8 */
	uint64_t                   io_ena;               /*   816     8 */
	taskq_ent_t                io_tqent;             /*   824   136 */

	/* size: 960, cachelines: 15, members: 64 */
	/* sum members: 903, holes: 3, sum holes: 57 */
	/* forced alignments: 1, forced holes: 1, sum forced holes: 52 */
} __attribute__((__aligned__(64)));

The other thing I'm unsure about is whether we generally want to move towards this "componentised" kind of model of zio_t, where it only has the pieces it needs. The fields can be grouped into a bunch of different "uses"; e.g., there's a handful of items specific to queue operations, which aren't needed outside of read/write/trim to leaf devices. I can see arguments for and against: it does let us get the memory usage down to just what we need, but C doesn't make it easy to work with that sort of structure, and it's also got to be done carefully to avoid fragmenting memory too much. Maybe it doesn't matter if we don't see this PR as a precedent, or we just consider this a step in a direction.
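For concreteness, the componentised shape could look something like this sketch. The grouping, the zio_queue_part_t name, and the fields pulled into it are hypothetical, just to illustrate the trade-off; calloc() stands in for a dedicated kmem cache:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical component: fields only needed while queued to a leaf vdev. */
typedef struct zio_queue_part {
	uint64_t io_offset;
	int64_t  io_timestamp;
	int64_t  io_queued_timestamp;
	int64_t  io_target_timestamp;
	/* ... io_queue_node, io_offset_node, etc ... */
} zio_queue_part_t;

/* Componentised zio: hot common fields inline, optional pieces behind
 * pointers, each ideally from its own cache to limit fragmentation. */
typedef struct zio_sketch {
	uint64_t io_txg;
	void	*io_spa;
	zio_queue_part_t *io_queue;	/* NULL except leaf read/write/trim */
} zio_sketch_t;

/* Allocated only on the paths that need it (kmem_cache_alloc in-kernel). */
static void
zio_sketch_enqueue(zio_sketch_t *zio)
{
	if (zio->io_queue == NULL)
		zio->io_queue = calloc(1, sizeof (*zio->io_queue));
}
```

The cost is a pointer chase and a NULL check on every access to the component, which is exactly the kind of thing C makes awkward to hide.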

How Has This Been Tested?

A full ZTS run completed successfully against Linux 6.12. I would not, however, be surprised to find I've introduced some very subtle NULL deref in some niche situation.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

robn added 4 commits August 14, 2025 12:58
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
We only need our own copy of the BP if the caller provided one that they
would free before we're done with it.

Additionally, L2ARC currently abuses io_bp_copy to smuggle the original
BP back to arc_read_done() inside the L2ARC read ZIO. A small helper
function is provided to handle that allocation when necessary.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
We only need a copy if we would modify the original and might need to
revert, which only applies to non-vdev writes.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Just a rough eyeball comparison to get a feel for what we're saving.
@robn robn changed the title zio: use a separate allocation for io_bp_copy and io_bp_orig. zio: use a separate allocation for io_bp_copy and io_bp_orig Aug 14, 2025
zio->io_bp_orig =
    kmem_cache_alloc(zio_bp_cache, KM_SLEEP);
ZIOSTAT_BUMP(ziostat_zio_bp_total);
*zio->io_bp_orig = *bp;
Contributor
Should we maybe have a subroutine here, that does the alloc, stat bump, and memcpy?
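One possible shape for that subroutine, sketched in userspace with malloc() and a plain counter standing in for kmem_cache_alloc() and ZIOSTAT_BUMP(); the zio_bp_dup() name is made up here:

```c
#include <stdlib.h>
#include <string.h>

typedef struct { unsigned char b[128]; } blkptr_t;  /* 128-byte stand-in */

static unsigned long ziostat_zio_bp_total;  /* models the ZIOSTAT counter */

/* Hypothetical helper: allocate (from zio_bp_cache in-kernel), bump the
 * stat, and copy the source BP, so call sites collapse to one line. */
static blkptr_t *
zio_bp_dup(const blkptr_t *bp)
{
	blkptr_t *nbp = malloc(sizeof (*nbp));
		/* in-kernel: kmem_cache_alloc(zio_bp_cache, KM_SLEEP) */
	ziostat_zio_bp_total++;
		/* in-kernel: ZIOSTAT_BUMP(ziostat_zio_bp_total) */
	memcpy(nbp, bp, sizeof (*nbp));
	return (nbp);
}
```

The quoted hunk would then become just `zio->io_bp_orig = zio_bp_dup(bp);`.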
