
romio: daos: performance problem with long list of tiny requests list-io #7283

Open
roblatham00 opened this issue Jan 29, 2025 · 7 comments

@roblatham00
Contributor

Bringing an email discussion out to a little more public place.

The LAMMPS code can write files with parallel-netcdf. pnetcdf in turn will use MPI-IO (ROMIO) to write out data.

The clients in this workload all write highly noncontiguous data, but none of it overlaps with other processes. As a result, ROMIO's interleaving heuristic steers it toward the independent I/O path instead of two-phase collective buffering.
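For reference, a simplified sketch of the kind of interleaving check the generic collective path performs -- this is my paraphrase, not the actual MPICH code; `st_offsets` and `end_offsets` stand for the gathered per-rank first/last file offsets:

```c
#include "adio.h"   /* ADIO_Offset */

/* Paraphrase of ROMIO's interleaving heuristic, not the actual code:
 * after gathering each rank's first and last file offset, the access
 * counts as interleaved if any rank's range begins before the previous
 * rank's range ends. With no interleaving (and romio_cb_write=automatic),
 * ROMIO skips two-phase collective buffering and each process does
 * independent strided I/O. */
static int access_is_interleaved(const ADIO_Offset *st_offsets,
                                 const ADIO_Offset *end_offsets,
                                 int nprocs)
{
    for (int i = 1; i < nprocs; i++)
        if (st_offsets[i] < end_offsets[i - 1])
            return 1;   /* overlapping ranges: use two-phase */
    return 0;           /* disjoint, ordered ranges: independent path */
}
```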

For regular Unix file systems, ROMIO's data sieving optimization kicks in, but if we use DAOS, that driver uses a different "list io" optimization. https://github.com/pmodels/mpich/blob/main/src/mpi/romio/adio/ad_daos/ad_daos_io_str.c#L66

This data is contiguous in memory but noncontiguous in the file. The request ends up being 175k 4-byte requests from each MPI process.

The way this translates into DAOS is that the driver creates an RPC with an IO descriptor matching that iovec and sends it to the server.
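Roughly, the mapping onto the DAOS array API looks like the sketch below. This is an illustration under my assumptions, not the actual ad_daos code; `write_listio` and its arguments are hypothetical:

```c
#include <stdlib.h>
#include <daos.h>
#include <daos_array.h>

/* Hypothetical sketch: the flattened ROMIO (offset, length) list becomes
 * one daos_array_iod_t whose ranges mirror the file-side iovec, while the
 * contiguous memory buffer is a single d_iov_t. For this workload that
 * means ~175k 4-byte entries in arr_rgs. */
static int write_listio(daos_handle_t array_oh, char *buf, daos_size_t buf_len,
                        const daos_off_t *offsets, const daos_size_t *lens,
                        uint64_t n)
{
    daos_array_iod_t iod;
    daos_range_t *rgs = malloc(n * sizeof(*rgs));
    d_sg_list_t sgl;
    d_iov_t iov;
    int rc;

    for (uint64_t i = 0; i < n; i++) {
        rgs[i].rg_idx = offsets[i];   /* file offset of this segment */
        rgs[i].rg_len = lens[i];      /* 4 bytes each in this workload */
    }
    iod.arr_nr  = n;
    iod.arr_rgs = rgs;

    d_iov_set(&iov, buf, buf_len);    /* memory side is contiguous */
    sgl.sg_nr   = 1;
    sgl.sg_iovs = &iov;

    rc = daos_array_write(array_oh, DAOS_TX_NONE, &iod, &sgl, NULL);
    free(rgs);
    return rc;
}
```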

The array API combines the IO requests that fall within a single chunk/stripe (default 1 MiB) into a single RPC/task and sends that RPC. Given the file size, I don't expect many RPC requests to be created per process here, so that is unlikely to be the problem. No, there will not be 175k RPC requests from each client process.

The issue is on the server, not the client, as far as I can tell. Each of those iovec entries (4 bytes), 175k of them × the number of client ranks, is an independent update in the VOS tree and in pmem. After that, the background aggregation process coalesces all those 4-byte extents into large chunks where possible. That is the expensive part that causes the bad performance, not the client processing or the RPC/network requests.

A couple thoughts about how we could handle this:

  • We have an ADIOI_NOLOCK_WriteStrided -- that's not exactly what we want here, though it does have a useful "data packing" optimization for the noncontiguous-in-memory, contiguous-in-file case.
  • A contributor a few years back made the locking routines driver-specific -- see ADIOI_GEN_SetLock64 and the ADIOI_xxx_SetLock function pointer in the ADIOI_Fns_struct.
  • Since we are calling this code from the collective path, and have already determined there is no interleaving among processes, could we patch up the ADIOI_Fns_struct to no-op locks, do data sieving in a risky no-lock way, then patch the proper locking routine back into the Fns struct? (This might be too clever... definitely risky... see the sketch after this list.)
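To make that last idea concrete, here is a very rough, untested sketch. It assumes the function-pointer member is spelled ADIOI_xxx_SetLock as above, the exact ADIOI_GEN_WriteStrided signature varies across MPICH versions, and the whole thing is only plausible because the collective path has already established there is no interleaving:

```c
#include "adio.h"

/* No-op lock routine to stand in for the driver's real SetLock.
 * Safe only because the collective path proved no interleaving. */
static int ADIOI_NoOp_SetLock(ADIO_File fd, int cmd, int type,
                              ADIO_Offset offset, int whence,
                              ADIO_Offset len)
{
    return MPI_SUCCESS;   /* skip locking entirely */
}

static void daos_sieve_nolock(ADIO_File fd, const void *buf, MPI_Aint count,
                              MPI_Datatype datatype, int file_ptr_type,
                              ADIO_Offset offset, ADIO_Status *status,
                              int *error_code)
{
    /* save the driver's real lock routine and install the no-op */
    int (*saved)(ADIO_File, int, int, ADIO_Offset, int, ADIO_Offset);
    saved = fd->fns->ADIOI_xxx_SetLock;
    fd->fns->ADIOI_xxx_SetLock = ADIOI_NoOp_SetLock;

    /* generic strided write performs read-modify-write data sieving */
    ADIOI_GEN_WriteStrided(fd, buf, count, datatype, file_ptr_type,
                           offset, status, error_code);

    /* patch the proper locking routine back in */
    fd->fns->ADIOI_xxx_SetLock = saved;
}
```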

@mchaarawi @pkcoff -- did I miss anything? I'll also bring this up on the DAOS Jira

@mchaarawi
Contributor

nice summary! looks accurate to me.

@roblatham00
Contributor Author

Also tracking this in https://daosio.atlassian.net/browse/DAOS-17008

@roblatham00
Contributor Author

More brainstorming:

  • chunk up the list into smaller sections: instead of sending hundreds of thousands of entries at once, send 100 at a time (see the sketch after this list)
  • look at the lengths of the requests, the total size, and the average request size (not sure exactly what information we have at that level) and skip the list-io path in some cases
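A sketch of the first idea, reusing the hypothetical write_listio() helper from the issue description above; BATCH is an arbitrary placeholder, and the memory buffer is assumed contiguous as in this workload:

```c
#define BATCH 100   /* arbitrary placeholder batch size */

/* Hypothetical sketch: issue the flattened (offset, length) list in
 * fixed-size batches instead of one huge list-io call, so the server
 * sees bounded-size IO descriptors per RPC. */
static int write_in_batches(daos_handle_t array_oh, char *buf,
                            const daos_off_t *offs, const daos_size_t *lens,
                            uint64_t n)
{
    uint64_t done = 0;

    while (done < n) {
        uint64_t cnt = (n - done < BATCH) ? n - done : BATCH;
        daos_size_t bytes = 0;
        int rc;

        for (uint64_t i = 0; i < cnt; i++)
            bytes += lens[done + i];   /* bytes consumed by this batch */

        rc = write_listio(array_oh, buf, bytes, offs + done, lens + done, cnt);
        if (rc != 0)
            return rc;

        buf  += bytes;   /* memory is contiguous: advance past this batch */
        done += cnt;
    }
    return 0;
}
```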

@pkcoff
Contributor

pkcoff commented Jan 29, 2025

I think you covered all the angles, Rob.

@wkliao
Contributor

wkliao commented Jan 29, 2025

If the large number of clients makes the performance worse, then maybe PnetCDF's
new feature, called intra-node write aggregation, added in the latest release, 1.14.0,
can help. It aggregates write requests from processes running on the same compute
node onto a subset of MPI ranks, named intra-node aggregators. Then only the
intra-node aggregators make non-zero write requests to ROMIO. The aggregated
requests are coalesced at each aggregator, when possible.
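For example, a minimal sketch of enabling this from an application. The hint name nc_num_aggrs_per_node is my reading of the 1.14.0 release notes and worth double-checking; the value is the number of aggregators per compute node:

```c
#include <mpi.h>
#include <pnetcdf.h>

/* Sketch: pass the intra-node aggregation hint (assumed name:
 * nc_num_aggrs_per_node) when creating the file. */
static int create_with_aggregation(const char *path, int *ncid)
{
    MPI_Info info;
    int err;

    MPI_Info_create(&info);
    MPI_Info_set(info, "nc_num_aggrs_per_node", "1");  /* 1 aggregator/node */

    err = ncmpi_create(MPI_COMM_WORLD, path,
                       NC_CLOBBER | NC_64BIT_DATA, info, ncid);
    MPI_Info_free(&info);
    return err;
}
```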

My experimental results using the E3SM-IO F case on Perlmutter, which also writes a large
number of small, noncontiguous requests, show a good improvement.

@pkcoff
Contributor

pkcoff commented Jan 30, 2025

Thanks for the suggestion @wkliao. For the LAMMPS PnetCDF case we are only running 12 ppn, and the data is a 3D decomposition, rank ordered, so we would get some benefit from the aggregation, but 1/12 of 175k noncontiguous writes is still a lot, and I think we need to solve for the general case. If I have time I will look at the E3SM-IO PnetCDF implementation, apply it to LAMMPS PnetCDF, and see....

@wkliao
Contributor

wkliao commented Jan 30, 2025

If the aggregation can coalesce the requests into a smaller list, then
the number may become much less than 1/12 of 175k.
