Ucc integration #591

Merged
merged 38 commits into from
Jul 14, 2022

Conversation

kaiyingshan
Collaborator

No description provided.

@nirandaperera
Collaborator

@kaiyingshan I'm getting the following seg fault. I can't figure out where and why it is coming from. I'm using latest UCC master branch.

(cylon_dev) niranda@aurora-r10:~/git/cylon/build$ ./bin/ucc_allgather_example 
[1656170916.449172] [aurora-r10:49843:0]          ucc_cl.c:57   UCC  ERROR no TLs are selected for CL_BASIC
[1656170916.449190] [aurora-r10:49843:0]         ucc_lib.c:127  UCC  ERROR lib_init failed for component: basic
[aurora-r10:49843:0:49843] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x164)
==== backtrace (tid:  49843) ====
 0  /home/niranda/miniconda3/envs/cylon_dev/lib/libucs.so.0(ucs_handle_error+0x2fd) [0x7f3eff20778d]
 1  /home/niranda/miniconda3/envs/cylon_dev/lib/libucs.so.0(+0x2b994) [0x7f3eff207994]
 2  /home/niranda/miniconda3/envs/cylon_dev/lib/libucs.so.0(+0x2bb5a) [0x7f3eff207b5a]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7f3f00071090]
 4  /home/niranda/git/ucc/install/lib/libucc.so.1(ucc_collective_init+0x1e0) [0x7f3eff09bf30]
 5  /home/niranda/git/cylon/install/lib/libcylon.so.0.5.0(_ZNK5cylon3ucc21UccTableAllgatherImpl20AllgatherBufferSizesEPKiiPi+0x8e) [0x7f3f005f2b9e]
 6  /home/niranda/git/cylon/install/lib/libcylon.so.0.5.0(_ZN5cylon3net18TableAllgatherImpl7ExecuteERKSt10shared_ptrINS_15TableSerializerEERKS2_INS_9AllocatorEEiPSt6vectorIiSaIiEEPSB_IS2_INS_6BufferEESaISG_EEPSB_ISD_SaISD_EE+0xfc) [0x7f3f008d44dc]
 7  /home/niranda/git/cylon/install/lib/libcylon.so.0.5.0(_ZN5cylon3net18TableAllgatherImpl7ExecuteERKSt10shared_ptrINS_5TableEEPSt6vectorIS4_SaIS4_EE+0x1d1) [0x7f3f008d4ea1]
 8  /home/niranda/git/cylon/install/lib/libcylon.so.0.5.0(_ZNK5cylon3net15UCXCommunicator9AllGatherERKSt10shared_ptrINS_5TableEEPSt6vectorIS4_SaIS4_EE+0x5c) [0x7f3f005f192c]
 9  ./bin/ucc_allgather_example(+0x5f39) [0x55fbb51b8f39]
10  ./bin/ucc_allgather_example(+0x5c69) [0x55fbb51b8c69]
11  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f3f00052083]
12  ./bin/ucc_allgather_example(+0x5e1e) [0x55fbb51b8e1e]
=================================
Segmentation fault (core dumped)

@kaiyingshan
Collaborator Author

I don't know if this is the same issue that I experienced, which is due to the ucc team size. When I run the example this way, it gives a warning "ucc_team.c:114 UCC WARN minimal size of UCC team is 2, provided 1", and it is caused by ucc_collective_init; when I run with mpirun -n it won't result in segfault.

std::cout<<std::endl;

/* Cleanup UCC */
UCC_CHECK(ucc_team_destroy(team));

Team destroy is a nonblocking operation and might return UCC_INPROGRESS, so this should be something like:

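    ucc_status_t status;
    // Poll until the nonblocking destroy stops reporting UCC_INPROGRESS.
    while (UCC_INPROGRESS == (status = ucc_team_destroy(team.team))) {
    }
    // Check the final status only after polling has finished.
    if (UCC_OK != status) {
        std::cerr << "ucc_team_destroy failed\n";
    }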
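
The polling pattern in this suggestion can be sketched in isolation. This is a minimal sketch, not UCC code: `Status` is a stand-in for `ucc_status_t`, and `poll_until_done` is a hypothetical helper that does not exist in UCC or Cylon. The point it illustrates is that the final status is inspected only after the loop exits, because inside the loop the status is always "in progress".

```cpp
#include <utility>

// Stand-in for ucc_status_t (assumption): only three outcomes are modeled.
enum class Status { Ok, InProgress, Error };

// Hypothetical helper: calls `op` repeatedly while it reports InProgress,
// then returns the first final status (Ok or Error).
template <typename Op>
Status poll_until_done(Op&& op) {
    Status s;
    while ((s = std::forward<Op>(op)()) == Status::InProgress) {
        // A real caller might progress the context or yield here.
    }
    return s;
}
```

A caller would then check the returned value once, for example `if (poll_until_done(destroy_op) != Status::Ok) { /* report failure */ }`, where `destroy_op` is whatever callable wraps the nonblocking call.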


RETURN_CYLON_STATUS_IF_UCC_FAILED(ucc_context_config_read(lib, nullptr, &ctx_config));
RETURN_CYLON_STATUS_IF_UCC_FAILED(ucc_context_create(lib, &ctx_params, ctx_config, &uccContext));
while (UCC_OK != ucc_context_progress(uccContext)) {

ucc_context_create is blocking, no need to call ucc_context_progress
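
Since the create call is blocking, one status check per call is enough, which is what the `RETURN_CYLON_STATUS_IF_UCC_FAILED` wrapper in the reviewed lines appears to do. The actual Cylon macro is not shown in this thread, so the following is only a sketch of how such an early-return macro could look. `FakeUccStatus`, `FakeCylonStatus`, `RETURN_IF_FAKE_UCC_FAILED`, and `init_sequence` are all hypothetical names made up for illustration.

```cpp
#include <string>

// Stand-ins (assumptions): the real ucc_status_t and cylon::Status are not reproduced.
enum FakeUccStatus { FAKE_UCC_OK = 0, FAKE_UCC_ERR = -1 };

struct FakeCylonStatus {
    bool ok;
    std::string msg;
};

// Evaluate `expr` exactly once; on failure, return from the enclosing function.
#define RETURN_IF_FAKE_UCC_FAILED(expr)                         \
    do {                                                        \
        const FakeUccStatus st_ = (expr);                       \
        if (st_ != FAKE_UCC_OK) {                               \
            return FakeCylonStatus{false, #expr " failed"};     \
        }                                                       \
    } while (0)

// Records that a step ran and passes its (simulated) result through.
inline FakeUccStatus run_step(FakeUccStatus result, int* steps_run) {
    ++*steps_run;
    return result;
}

// Hypothetical two-step init: the second step runs only if the first succeeded.
FakeCylonStatus init_sequence(FakeUccStatus first, FakeUccStatus second, int* steps_run) {
    *steps_run = 0;
    RETURN_IF_FAKE_UCC_FAILED(run_step(first, steps_run));
    RETURN_IF_FAKE_UCC_FAILED(run_step(second, steps_run));
    return FakeCylonStatus{true, ""};
}
```

With this shape, a failing first call stops the sequence before any later call can touch a half-initialized handle, and no progress loop is needed between the two steps.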

@Sergei-Lebedev

> I don't know if this is the same issue that I experienced, which is due to the ucc team size. When I run the example this way, it gives a warning "ucc_team.c:114 UCC WARN minimal size of UCC team is 2, provided 1", and it is caused by ucc_collective_init; when I run with mpirun -n it won't result in segfault.

recently we added support for team size 1 (openucx/ucc#511)

@nirandaperera
Collaborator

nirandaperera commented Jul 12, 2022

> > I don't know if this is the same issue that I experienced, which is due to the ucc team size. When I run the example this way, it gives a warning "ucc_team.c:114 UCC WARN minimal size of UCC team is 2, provided 1", and it is caused by ucc_collective_init; when I run with mpirun -n it won't result in segfault.
>
> recently we added support for team size 1 (openucx/ucc#511)

@Sergei-Lebedev I'm still getting the following error with team size 1.
I opened an issue in UCC regarding this: openucx/ucc#567

(cylon_dev) niranda@aurora-r10:~/git/cylon$ ./build/bin/ucc_example 
[1657643132.702496] [aurora-r10:188948:0]   cl_basic_team.c:131  CL_BASIC ERROR no tl teams were created
[1657643132.702508] [aurora-r10:188948:0]        ucc_team.c:294  UCC  ERROR No CL teams were created
failed to create ucc team
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
(cylon_dev) niranda@aurora-r10:~/git/cylon$ mpirun -n 1 ./build/bin/ucc_example 
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[1657643157.590587] [aurora-r10:189016:0]   cl_basic_team.c:131  CL_BASIC ERROR no tl teams were created
[1657643157.590601] [aurora-r10:189016:0]        ucc_team.c:294  UCC  ERROR No CL teams were created
failed to create ucc team

@nirandaperera nirandaperera marked this pull request as ready for review July 13, 2022 15:23
@nirandaperera
Collaborator

@kaiyingshan I reviewed your code and made some changes myself in this commit 9a7e9aa

Could you please check that?

@nirandaperera nirandaperera merged commit 4dd359f into main Jul 14, 2022
@nirandaperera
Collaborator

@kaiyingshan thank you for doing this

@nirandaperera nirandaperera mentioned this pull request Jul 15, 2022
5 tasks
3 participants