README.md: trivial change to force mpi4py testing #4

Open
wants to merge 4 commits into base: main

Conversation

jsquyres
Owner

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

jsquyres and others added 3 commits March 22, 2021 10:29
Many thanks to Lisandro Dalcin for 99% of this script (inspired by
https://raw.githubusercontent.com/mpi4py/mpi4py-testing/master/.github/workflows/openmpi.yml).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
@jsquyres
Owner Author

Hey @dalcinl -- I'm testing your mpi4py scripts. Per this PR (on my personal repo, where I'm testing -- this won't be the real/final PR onto the main Open MPI repo), I think it would be great to run these tests on every PR.

I'm getting failures in some MPI_Comm_spawn tests, which may well be expected. ☹️

Is there a way to tell mpi4py that it's OK for some tests to fail?

@dalcinl

dalcinl commented Mar 22, 2021

@jsquyres The problem with failures is that sometimes they lead to a deadlock or an abort. If spawn is giving you problems, you can pass -e spawn -e dynproc to test/runtests.py to skip all of those tests. If any of the mpi4py.futures tests are not working, just comment out that whole step for now.
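
For reference, a minimal sketch of what that invocation could look like in a CI step. Only the -e spawn -e dynproc exclusions come from the suggestion above; the mpiexec launcher and the -n 2 process count are illustrative assumptions:

# Run the mpi4py test suite under MPI, excluding the spawn and
# dynamic-process tests via the -e flags suggested above.
# (mpiexec and -n 2 are assumptions for illustration only.)
mpiexec -n 2 python test/runtests.py -e spawn -e dynproc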

@dalcinl

dalcinl commented Mar 22, 2021

@jsquyres You do not really need to commit to trigger GitHub Actions; note the workflow_dispatch event in the YAML file. Google the relevant docs: you can trigger a build manually from your browser using GitHub's UI. Oh, but for this to work, the workflow YAML file already needs to be in master.
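
As an aside, the same manual trigger can also be fired from a terminal, assuming the GitHub CLI (gh) is installed; the workflow file name below is a placeholder, not the actual file in this repository:

# Trigger a workflow_dispatch run of the named workflow on the default branch.
# "openmpi.yml" is a placeholder name.
gh workflow run openmpi.yml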

@jsquyres
Owner Author

Yeah, I've been playing with GHAs on a few different projects for the last month or two now. If we can get a stable set of tests, I'd be happy to include them so they run on every PR.

Is there a stable release tag I should use for mpi4py, or is master HEAD considered to be stable?

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@dalcinl

dalcinl commented Mar 22, 2021

For the time being I recommend using master HEAD. I consider master HEAD very stable; the CI infrastructure is large and covers MPICH/Open MPI on Linux/macOS and MSMPI on Windows.

However, before you go ahead with this PR, perhaps you should wait until I finish mine? That way we can push any required changes/fixes in both mpi4py/mpi4py and open-mpi/ompi to a point where everything is OK.

@jsquyres
Owner Author

However, before you go ahead with this PR, perhaps you should wait until I finish mine? That way we can push any required changes/fixes in both mpi4py/mpi4py and open-mpi/ompi to a point where everything is OK.

Works for me. This is just a testing PR on my fork.

@dalcinl

dalcinl commented Mar 22, 2021

Jeff, a quick question. I'm configuring with --disable-mpi-fortran. However, the Fortran compiler wrappers are still installed. Is there any way to completely disable Fortran? FYI, my only interest in this is testing: making sure that the mpi4py test suite can run with MPI implementations (or custom builds of them) lacking Fortran support.

With --disable-mpi-fortran, the calls MPI_Type_create_f90_{integer|real}() seem to return valid datatypes. However, MPI_Type_create_f90_complex() seems to return a non-null handle of size zero (MPI_Type_size() reports zero). Any idea whether this is intentional or just an oversight? I believe that if you can define an F90 real without a Fortran compiler, you can define an F90 complex the same way, right?

@jsquyres
Owner Author

jsquyres commented Mar 23, 2021

Sorry for the delay; it took a little time to go reproduce this. There are two issues here:

  1. It looks like --disable-mpi-fortran does still install mpifort and friends. On the surface, this feels like a bug, if for no other reason than they're trying to -lmpi_mpifh, and that library does not exist in the installed location. Offhand, I can't think of a reason why we'd be installing mpifort (and friends) if there are no Fortran bindings. I'll file an issue about this. EDIT: Filed "Configuring with --disable-mpi-fortran still installs mpifort(1)" as open-mpi/ompi#8682.
  2. Even with --disable-mpi-fortran, we do still return valid datatypes from MPI_Type_create_f90_*(). I need to dig into the MPI standard a bit more, but I think that this is correct behavior.

@dalcinl

dalcinl commented Mar 23, 2021

About my second comment: I was not clear enough.

I'm not complaining that MPI_Type_create_f90_{integer|real}() work without a Fortran compiler; quite the opposite! My complaint is that MPI_Type_create_f90_complex() seems to be broken, returning a datatype of size zero. If you support the first two (integer|real), I fail to see why you wouldn't support the third (complex). It looks like this is a trivial bug in handling a HAVE_FORTRAN preprocessor guard.

See below; note the zero size in the last line:

In [1]: from mpi4py import MPI

In [2]: t = MPI.Datatype.Create_f90_integer(9)
In [3]: t.size, t.name
Out[3]: (4, 'COMBINER MPI_INT')

In [4]: t = MPI.Datatype.Create_f90_real(9, MPI.UNDEFINED)
In [5]: t.size, t.name
Out[5]: (8, 'COMBINER MPI_DOUBLE')

In [6]: t = MPI.Datatype.Create_f90_complex(9, MPI.UNDEFINED)
In [7]: t.size, t.name
Out[7]: (0, 'COMBINER MPI_DOUBLE_COMPLEX')

@dalcinl

dalcinl commented Mar 23, 2021

Any idea what's going on with Spawn?
https://github.com/mpi4py/mpi4py-testing/runs/2178162342?check_suite_focus=true

I thought it could be related to oversubscription, but I enabled it this way:
https://github.com/mpi4py/mpi4py-testing/actions/runs/680630523/workflow#L63

@jsquyres
Owner Author

Spawn: ugh. Frankly, our COMM_SPAWN support has never been rock solid. That being said, v5.0 just had a pretty major overhaul of its runtime system. As I'm typing, Ralph is closing out a lot of Open MPI runtime bugs as "This has been verified fixed for v5.0..." We're (extremely) unlikely to backport most of those fixes to existing release branches.

As for MPI_Type_create_f90_complex -- I tracked this down to the C code and talked to @bosilca about it. You are correct: we are returning a valid datatype (i.e., not MPI_DATATYPE_NULL) that has an extent of 0. The MPI spec does not mandate that MPI_DATATYPE_NULL should be returned if there is no compatible datatype. Indeed, MPI-3.1 says that it is erroneous to make such a request -- but I recognize that this is a bit of a chicken-and-egg situation.

All that boils down to: at least as of now, this is expected Open MPI behavior.

@dalcinl

dalcinl commented Mar 25, 2021

Jeff, about my Spawn question and the test failures I linked, let me clarify that the issues are with OMPI master. I understand you do not have the time to backport all the fixes, and I'm not requesting that. I'm just looking forward to a state where all mpi4py tests pass cleanly with OMPI master.

About MPI_Type_create_f90_complex: sorry for being so insistent, but I have not managed to make my point clear, and you still do not see the inconsistency in the implementation of the f90 integer/real/complex trio.

Please look again at the implementation and compare the code for f90 complex with the code for f90 integer/real. There is an inconsistency in the complex case! And that inconsistency is what makes MPI_Type_create_f90_complex behave differently (unavailable, zero-size datatype) when building without a Fortran compiler.

A few lines of code are worth a hundred words. IMHO, you have to apply the following fix:

diff --git a/ompi/mpi/c/type_create_f90_complex.c b/ompi/mpi/c/type_create_f90_complex.c
index 638afc2f26..cb86277a89 100644
--- a/ompi/mpi/c/type_create_f90_complex.c
+++ b/ompi/mpi/c/type_create_f90_complex.c
@@ -84,10 +84,10 @@ int MPI_Type_create_f90_complex(int p, int r, MPI_Datatype *newtype)
      */
 
     if     ( (LDBL_DIG < p) || (LDBL_MAX_10_EXP < r) || (-LDBL_MIN_10_EXP < r) ) *newtype = &ompi_mpi_datatype_null.dt;
-    else if( (DBL_DIG  < p) || (DBL_MAX_10_EXP  < r) || (-DBL_MIN_10_EXP  < r) ) *newtype = &ompi_mpi_ldblcplex.dt;
-    else if( (FLT_DIG  < p) || (FLT_MAX_10_EXP  < r) || (-FLT_MIN_10_EXP  < r) ) *newtype = &ompi_mpi_dblcplex.dt;
+    else if( (DBL_DIG  < p) || (DBL_MAX_10_EXP  < r) || (-DBL_MIN_10_EXP  < r) ) *newtype = &ompi_mpi_c_long_double_complex.dt;
+    else if( (FLT_DIG  < p) || (FLT_MAX_10_EXP  < r) || (-FLT_MIN_10_EXP  < r) ) *newtype = &ompi_mpi_c_double_complex.dt;
     else if( ! OMPI_HAVE_FORTRAN_COMPLEX4 ||
-             (sflt_dig < p) || (sflt_max_10_exp < r) || (-sflt_min_10_exp < r) ) *newtype = &ompi_mpi_cplex.dt;
+             (sflt_dig < p) || (sflt_max_10_exp < r) || (-sflt_min_10_exp < r) ) *newtype = &ompi_mpi_c_complex.dt;
     else                                                                         *newtype = &ompi_mpi_complex4.dt;

Note that this change implements MPI_Type_create_f90_complex in terms of C types, not Fortran types. And that's EXACTLY how things are done in MPI_Type_create_f90_{integer|real}. Do you understand my point now? With my patch, all of the f90 integer/real/complex trio gets implemented in terms of C types. A foolish consistency is the hobgoblin of little minds.

@bosilca

bosilca commented Mar 25, 2021

@dalcinl your patch assumes that the Fortran and C complex types are identical, which unfortunately is not a correct assumption. I know we are abusing this for real and integer by making the same assumption, but it "holds" because, unlike for complex, up to now we haven't found any compiler where it was not the case. I do not recall which compiler triggered that issue; maybe @jsquyres does.

@jsquyres
Owner Author

jsquyres commented Apr 5, 2021

@dalcinl Let me know when you have a final recipe ready and I'll incorporate it into my nightly MTT testing.

@dalcinl

dalcinl commented Apr 5, 2021

@jsquyres I'm waiting for open-mpi#8756 to provide a fix for open-mpi#8677. I'm also waiting for an already-approved PR to be merged in MPICH. Then I can merge some required changes to the mpi4py test suite (mpi4py/mpi4py#37). After that, I have to reassess the Spawn situation, possibly disabling failing tests under ompi@master.

@awlauria

awlauria commented Apr 6, 2021

@dalcinl I'll split out the fix for open-mpi#8677 since it isn't relevant to the main PR, and that may not go in for a few weeks.

@dalcinl

dalcinl commented Apr 6, 2021

@dalcinl I'll split out the fix

Getting that fix in first will allow Jeff and me to move forward with this one. Many thanks!

@awlauria

awlauria commented Apr 6, 2021

@dalcinl I opened open-mpi#8779 to fix it.

@dalcinl

dalcinl commented Apr 7, 2021

@jsquyres Ideally, issue open-mpi#4135 should also be addressed; see my last comment there. For the time being, mpi4py is skipping reduction tests with zero-size buffers.

@dalcinl

dalcinl commented Apr 22, 2021

@jsquyres With current master, is there a way to tell mpiexec that you want to oversubscribe? Before, I could do it via mca-params.conf or an environment variable, but now it looks like the only way to do it is to pass --map-by :oversubscribe.
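
To contrast the two approaches being discussed, a sketch (rmaps_base_oversubscribe is the pre-v5 MCA parameter name, and the command line below is illustrative, not taken from the actual workflow):

# Pre-v5 releases: enable oversubscription via an MCA parameter,
# either in mca-params.conf or as an environment variable.
export OMPI_MCA_rmaps_base_oversubscribe=true

# Current master: pass the mapping qualifier directly to mpiexec.
mpiexec --map-by :oversubscribe -n 5 python test/runtests.py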

@dalcinl

dalcinl commented May 13, 2021

@jsquyres We are almost there.

@dalcinl

dalcinl commented Jun 3, 2021

@jsquyres This is finally ready from my side. There is no point in delaying things for the one minor issue left (external32 for long double). We now have all the spawn tests passing thanks to the recent oversubscription fixes. Please use the updated YAML file. Any tips on additional MCA parameters I could set to increase the level of debugging checks?

PS: Some day we could use a build matrix, for example to test builds with UCX. I would need your help to properly set up MCA parameters such that we trigger interesting code paths within Open MPI. For example, look at the MPICH builds: I'm testing eight different configurations.

@opoplawski

ping?
