romio: daos: performance problem with long list of tiny requests list-io #7283
nice summary! looks accurate to me.
Also tracking this in https://daosio.atlassian.net/browse/DAOS-17008
More brainstorming:
I think you covered all the angles, Rob.
If the large number of clients makes the performance worse, then maybe the PnetCDF's ... My experimental results using the E3SM-IO F case on Perlmutter, which also writes a large ...
Thanks for the suggestion @wkliao. For the LAMMPS PnetCDF case we are only running 12 ppn, and the data is a 3D composition, rank ordered, so we would get some benefit from the aggregation, but 1/12 of 175k noncontiguous writes is still a lot and I think we need to solve for the general case. If I have time I will look at the E3SM-IO PnetCDF implementation, apply it to the LAMMPS PnetCDF case, and see.
If the aggregation can coalesce the requests into a smaller list, then ...
Bringing an email discussion out to a little more public place.
The LAMMPS code can write files with parallel-netcdf. pnetcdf in turn will use MPI-IO (ROMIO) to write out data.
The clients in this workload all write highly noncontiguous data, but none of it overlaps with other processes. As a result, ROMIO's interleaving heuristic guides it to take the independent I/O path instead of two-phase collective buffering.
For regular Unix file systems, ROMIO's data sieving optimization kicks in, but if we use DAOS, that driver uses a different "list io" optimization. https://github.com/pmodels/mpich/blob/main/src/mpi/romio/adio/ad_daos/ad_daos_io_str.c#L66
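Since it is the no-interleaving heuristic that sends this workload down the independent path, one cheap experiment is to override it with ROMIO hints and force two-phase collective buffering. A minimal sketch, assuming the standard ROMIO hint names are passed through to (and honored by) the DAOS driver; `open_with_cb` is just an illustrative helper:

```c
/* Force ROMIO's two-phase collective buffering regardless of the
 * interleaving heuristic by setting hints on the info object passed
 * to MPI_File_open.  "romio_cb_write" and "cb_buffer_size" are
 * standard ROMIO hints; whether the DAOS ADIO driver honors them for
 * this workload is an assumption worth verifying. */
#include <mpi.h>

MPI_File open_with_cb(MPI_Comm comm, const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");   /* always use collective buffering */
    MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MiB aggregation buffers */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}
```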
This data is contiguous in memory, non-contiguous in file. The request ends up being 175k 4-byte requests from each MPI process.
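For concreteness, here is a minimal sketch of that access pattern in plain MPI-IO, using a contiguous buffer and an indexed file type; the real datatype is built inside PnetCDF from the variable's start/count arguments, and the stride and offsets below are purely illustrative:

```c
/* Contiguous in memory, ~175k scattered 4-byte pieces in the file. */
#include <mpi.h>
#include <stdlib.h>

#define NREQ   175000   /* noncontiguous 4-byte pieces per rank */
#define STRIDE 12       /* hypothetical gap, in floats, between pieces in the file */

void write_pattern(MPI_File fh, int rank)
{
    float *buf   = malloc(NREQ * sizeof(float)); /* contiguous in memory */
    int   *disps = malloc(NREQ * sizeof(int));

    /* One 4-byte element every STRIDE elements in the file. */
    for (int i = 0; i < NREQ; i++)
        disps[i] = i * STRIDE;

    MPI_Datatype filetype;
    MPI_Type_create_indexed_block(NREQ, 1, disps, MPI_FLOAT, &filetype);
    MPI_Type_commit(&filetype);

    /* Each rank writes a disjoint region, so accesses never interleave
     * across ranks and the heuristic picks the independent path. */
    MPI_Offset base = (MPI_Offset)rank * NREQ * STRIDE * sizeof(float);
    MPI_File_set_view(fh, base, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, NREQ, MPI_FLOAT, MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
    free(disps);
    free(buf);
}
```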
The way this translates into DAOS is that it creates an RPC with an IO descriptor that matches that iovec and sends it to the server.
The array API combines the IO requests that fall within a single chunk/stripe (default 1 MiB) into a single RPC/task, and sends that RPC. I expect that, given the file size, there are not a lot of RPC requests created here from each process that could cause issues. So no, there are not going to be 175k RPC requests from each client process.
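To make the client side concrete, here is a rough sketch of what that flattened iovec looks like at the daos_array level, using the public `daos_array_write()` API as I understand it; the helper name and its parameters are made up for illustration, and the actual translation lives in the ad_daos driver linked above:

```c
/* One contiguous memory buffer described by a single SGL entry, and an
 * I/O descriptor carrying one file range per tiny piece (175k of them
 * here).  The array API groups ranges that fall in the same chunk into
 * one RPC, so the client-side RPC count stays modest. */
#include <stdlib.h>
#include <daos.h>
#include <daos_array.h>

int write_list_io(daos_handle_t oh, char *buf, size_t buf_len,
                  daos_off_t *offs, daos_size_t *lens, daos_size_t nr)
{
    daos_array_iod_t iod = {0};
    daos_range_t    *rgs = malloc(nr * sizeof(*rgs));
    d_sg_list_t      sgl = {0};
    d_iov_t          iov;

    /* One range per noncontiguous file piece. */
    for (daos_size_t i = 0; i < nr; i++) {
        rgs[i].rg_idx = offs[i];   /* offset in the array/file */
        rgs[i].rg_len = lens[i];   /* 4 bytes in this workload */
    }
    iod.arr_nr  = nr;
    iod.arr_rgs = rgs;

    /* The memory side is contiguous, so a single iovec entry suffices. */
    d_iov_set(&iov, buf, buf_len);
    sgl.sg_nr   = 1;
    sgl.sg_iovs = &iov;

    int rc = daos_array_write(oh, DAOS_TX_NONE, &iod, &sgl, NULL);
    free(rgs);
    return rc;
}
```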
The issue is on the server, not the client, as far as I can tell. Each of those iovec entries (4 bytes each), 175k of them × the number of client ranks, is an independent update in the VOS tree and the pmem. After that, there is the background aggregation process that aggregates all those 4-byte extents into large chunks where possible. This is the expensive part that causes the bad performance; not the client processing or the RPC/network request.
A couple of thoughts about how we could handle this:
@mchaarawi @pkcoff -- did I miss anything? I'll also bring this up on the DAOS Jira