Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[Question] COnfiguration issues with mlcommon benchmarking #421

Open
raghavendrachari08 opened this issue Sep 22, 2023 · 2 comments
Open
Labels
question Further information is requested

Comments

@raghavendrachari08
Copy link

Hi,
I Am trying to bringup the setup for multinode GPU Hugectr training benchmark using the code https://github.com/mlcommons/training_results_v3.0/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr

For single node am able to run the benchmark test , but while am executing the multinode (say 2 node) am facing issue shown below , could you please help me resolving this issue??

[HCTR][17:28:15.456][WARNING][RK0][main]: The model name is not specified when creating the solver.
[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1
Traceback (most recent call last):
File "/dev/shm/data/hugectl/train.py", line 344, in
model = hugectr.Model(solver, reader, optimizer)
RuntimeError: Runtime error: MPI_ERR_OTHER: known error not in list
MPI_Bcast(&seed, 1, (static_cast<MPI_Datatype> (static_cast<void *> (&(ompi_mpi_unsigned_long_long)))), 0, (static_cast<MPI_Comm> (static_cast<void *> (&(ompi_mpi_comm_world))))) at create (/workspace/dlrm/hugectr/HugeCTR/src/resource_managers/resource_manager_ext.cpp:39)

@raghavendrachari08 raghavendrachari08 added the question Further information is requested label Sep 22, 2023
@shijieliu
Copy link
Collaborator

Hi @raghavendrachari08

[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1

Looks to me the error is related with multinode MPI setting. Could you check your multinode MPI setting by running some demo code?

@Abatpool
Copy link

Abatpool commented Aug 5, 2024

hello @raghavendrachari08 Did your dlrm mlcommon training on single node, go through successfully. I am coming from perspective where i am facing issues of related to the error #445 mentioned in previous link. How did you solve it. I am doing it on a single Nvidia DGX H100 node. Any help will be highly appreciated.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants