
romio: daos: performance problem with long list of tiny requests list-io #7283

Open
roblatham00 opened this issue Jan 29, 2025 · 7 comments

@roblatham00
Contributor

Bringing an email discussion out to a little more public place.

The LAMMPS code can write files with parallel-netcdf. pnetcdf in turn will use MPI-IO (ROMIO) to write out data.

The clients in this workload all write highly noncontiguous data, but none of it overlaps with other processes. As a result, ROMIO's interleaving heuristic steers it toward the independent I/O path instead of two-phase collective buffering.
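For reference, a simplified sketch of the kind of interleaving check the generic collective path performs -- this is my paraphrase, not the actual MPICH code; `st_offsets` and `end_offsets` stand for the gathered per-rank first/last file offsets:

```c
#include "adio.h"   /* ADIO_Offset */

/* Paraphrase of ROMIO's interleaving heuristic, not the actual code:
 * after gathering each rank's first and last file offset, the access
 * counts as interleaved if any rank's range begins before the previous
 * rank's range ends. With no interleaving (and romio_cb_write=automatic),
 * ROMIO skips two-phase collective buffering and each process does
 * independent strided I/O. */
static int access_is_interleaved(const ADIO_Offset *st_offsets,
                                 const ADIO_Offset *end_offsets,
                                 int nprocs)
{
    for (int i = 1; i < nprocs; i++)
        if (st_offsets[i] < end_offsets[i - 1])
            return 1;   /* overlapping ranges: use two-phase */
    return 0;           /* disjoint, ordered ranges: independent path */
}
```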

For regular Unix file systems, ROMIO's data sieving optimization kicks in, but if we use DAOS, that driver uses a different "list io" optimization. https://github.com/pmodels/mpich/blob/main/src/mpi/romio/adio/ad_daos/ad_daos_io_str.c#L66

This data is contiguous in memory but noncontiguous in the file. The request ends up being 175k 4-byte requests from each MPI process.

The way this translates into DAOS is that the driver creates an RPC with an IO descriptor matching that iovec and sends it to the server.
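Roughly, the mapping onto the DAOS array API looks like the sketch below. This is an illustration under my assumptions, not the actual ad_daos code; `write_listio` and its arguments are hypothetical:

```c
#include <stdlib.h>
#include <daos.h>
#include <daos_array.h>

/* Hypothetical sketch: the flattened ROMIO (offset, length) list becomes
 * one daos_array_iod_t whose ranges mirror the file-side iovec, while the
 * contiguous memory buffer is a single d_iov_t. For this workload that
 * means ~175k 4-byte entries in arr_rgs. */
static int write_listio(daos_handle_t array_oh, char *buf, daos_size_t buf_len,
                        const daos_off_t *offsets, const daos_size_t *lens,
                        uint64_t n)
{
    daos_array_iod_t iod;
    daos_range_t *rgs = malloc(n * sizeof(*rgs));
    d_sg_list_t sgl;
    d_iov_t iov;
    int rc;

    for (uint64_t i = 0; i < n; i++) {
        rgs[i].rg_idx = offsets[i];   /* file offset of this segment */
        rgs[i].rg_len = lens[i];      /* 4 bytes each in this workload */
    }
    iod.arr_nr  = n;
    iod.arr_rgs = rgs;

    d_iov_set(&iov, buf, buf_len);    /* memory side is contiguous */
    sgl.sg_nr   = 1;
    sgl.sg_iovs = &iov;

    rc = daos_array_write(array_oh, DAOS_TX_NONE, &iod, &sgl, NULL);
    free(rgs);
    return rc;
}
```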

The array API combines the IO requests that fall within a single chunk/stripe (default 1 MiB) into a single RPC/task and sends that RPC. Given the file size, I don't expect many RPC requests to be created per process here, so that is unlikely to be the problem. No, there will not be 175k RPC requests from each client process.

The issue is on the server, not the client, as far as I can tell. Each of those iovec entries (4 bytes), 175k of them × the number of client ranks, is an independent update in the VOS tree and in pmem. After that, the background aggregation process coalesces all those 4-byte extents into large chunks where possible. That is the expensive part that causes the bad performance, not the client processing or the RPC/network requests.

A couple thoughts about how we could handle this:

  • We have an ADIOI_NOLOCK_WriteStrided -- that's not exactly what we want here, though it does have a useful "data packing" optimization for the noncontiguous-in-memory, contiguous-in-file case.
  • A contributor a few years back made the locking routines driver-specific -- see ADIOI_GEN_SetLock64 and the ADIOI_xxx_SetLock function pointer in the ADIOI_Fns_struct.
  • Since we are calling this code from the collective path, and have already determined there is no interleaving among processes, could we patch up the ADIOI_Fns_struct to no-op locks, do data sieving in a risky no-lock way, then patch the proper locking routine back into the Fns struct? (This might be too clever... definitely risky... see the sketch after this list.)
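To make that last idea concrete, here is a very rough, untested sketch. It assumes the function-pointer member is spelled ADIOI_xxx_SetLock as above, the exact ADIOI_GEN_WriteStrided signature varies across MPICH versions, and the whole thing is only plausible because the collective path has already established there is no interleaving:

```c
#include "adio.h"

/* No-op lock routine to stand in for the driver's real SetLock.
 * Safe only because the collective path proved no interleaving. */
static int ADIOI_NoOp_SetLock(ADIO_File fd, int cmd, int type,
                              ADIO_Offset offset, int whence,
                              ADIO_Offset len)
{
    return MPI_SUCCESS;   /* skip locking entirely */
}

static void daos_sieve_nolock(ADIO_File fd, const void *buf, MPI_Aint count,
                              MPI_Datatype datatype, int file_ptr_type,
                              ADIO_Offset offset, ADIO_Status *status,
                              int *error_code)
{
    /* save the driver's real lock routine and install the no-op */
    int (*saved)(ADIO_File, int, int, ADIO_Offset, int, ADIO_Offset);
    saved = fd->fns->ADIOI_xxx_SetLock;
    fd->fns->ADIOI_xxx_SetLock = ADIOI_NoOp_SetLock;

    /* generic strided write performs read-modify-write data sieving */
    ADIOI_GEN_WriteStrided(fd, buf, count, datatype, file_ptr_type,
                           offset, status, error_code);

    /* patch the proper locking routine back in */
    fd->fns->ADIOI_xxx_SetLock = saved;
}
```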

@mchaarawi @pkcoff -- did I miss anything? I'll also bring this up on the DAOS Jira

@mchaarawi
Contributor

nice summary! looks accurate to me.

@roblatham00
Contributor Author

Also tracking this in https://daosio.atlassian.net/browse/DAOS-17008

@roblatham00
Contributor Author

More brainstorming:

  • chunk up the list into smaller sections: instead of sending hundreds of thousands of entries at once, send 100 at a time (see the sketch after this list)
  • look at the lengths of the requests, the total size, and the average request size (not sure exactly what information we have at that level) and skip the list-io path in some cases
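A sketch of the first idea, reusing the hypothetical write_listio() helper from the issue description above; BATCH is an arbitrary placeholder, and the memory buffer is assumed contiguous as in this workload:

```c
#define BATCH 100   /* arbitrary placeholder batch size */

/* Hypothetical sketch: issue the flattened (offset, length) list in
 * fixed-size batches instead of one huge list-io call, so the server
 * sees bounded-size IO descriptors per RPC. */
static int write_in_batches(daos_handle_t array_oh, char *buf,
                            const daos_off_t *offs, const daos_size_t *lens,
                            uint64_t n)
{
    uint64_t done = 0;

    while (done < n) {
        uint64_t cnt = (n - done < BATCH) ? n - done : BATCH;
        daos_size_t bytes = 0;
        int rc;

        for (uint64_t i = 0; i < cnt; i++)
            bytes += lens[done + i];   /* bytes consumed by this batch */

        rc = write_listio(array_oh, buf, bytes, offs + done, lens + done, cnt);
        if (rc != 0)
            return rc;

        buf  += bytes;   /* memory is contiguous: advance past this batch */
        done += cnt;
    }
    return 0;
}
```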

@pkcoff
Contributor

pkcoff commented Jan 29, 2025

I think you covered all the angles, Rob.

@wkliao
Contributor

wkliao commented Jan 29, 2025

If the large number of clients makes the performance worse, then maybe PnetCDF's
new feature, called intra-node write aggregation, added in the latest release, 1.14.0,
can help. It aggregates write requests from processes running on the same compute
node onto a subset of MPI ranks, named intra-node aggregators. Then only the
intra-node aggregators make non-zero write requests to ROMIO. The aggregated
requests are coalesced at each aggregator, when possible.
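For example, a minimal sketch of enabling this from an application. The hint name nc_num_aggrs_per_node is my reading of the 1.14.0 release notes and worth double-checking; the value is the number of aggregators per compute node:

```c
#include <mpi.h>
#include <pnetcdf.h>

/* Sketch: pass the intra-node aggregation hint (assumed name:
 * nc_num_aggrs_per_node) when creating the file. */
static int create_with_aggregation(const char *path, int *ncid)
{
    MPI_Info info;
    int err;

    MPI_Info_create(&info);
    MPI_Info_set(info, "nc_num_aggrs_per_node", "1");  /* 1 aggregator/node */

    err = ncmpi_create(MPI_COMM_WORLD, path,
                       NC_CLOBBER | NC_64BIT_DATA, info, ncid);
    MPI_Info_free(&info);
    return err;
}
```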

My experimental results using the E3SM-IO F case on Perlmutter, which also writes a large
number of small, noncontiguous requests, show a good improvement.

@pkcoff
Contributor

pkcoff commented Jan 30, 2025

Thanks for the suggestion @wkliao. For the LAMMPS PnetCDF case we are only running 12 ppn, and the data is a 3D decomposition, rank ordered, so we would get some benefit from the aggregation, but 1/12 of 175k noncontiguous writes is still a lot, and I think we need to solve for the general case. If I have time I will look at the E3SM-IO PnetCDF implementation, apply it to LAMMPS PnetCDF, and see....

@wkliao
Contributor

wkliao commented Jan 30, 2025

If the aggregation can coalesce the requests into a smaller list, then
the number may become much less than 1/12 of 175k.
