comment out PSM2 dependency in recent libfabric easyconfigs, since it pulls in CUDA as dependency #20794

Merged

boegel
Member

@boegel boegel commented Jun 10, 2024

(created using eb --new-pr)

This undoes what was done in #20501 (and tweaked in #20585), because PSM2 pulls in CUDA as a dependency, even on non-GPU systems.

We should figure out a way to make specific dependencies opt-in, rather than making this opt-out by forcing people to add PSM2 to the filter-deps configuration option...

@boegel boegel added the change label Jun 10, 2024
@boegel boegel added this to the 4.9.2 milestone Jun 10, 2024
@boegel boegel changed the title from "comment out PSM2 dependency in recent libfabric easyconfigs" to "comment out PSM2 dependency in recent libfabric easyconfigs, since it pulls in CUDA as dependency" Jun 10, 2024
@boegel
Member Author

boegel commented Jun 10, 2024

@boegelbot please test @ generoso

@boegel
Member Author

boegel commented Jun 10, 2024

@jfgrimm Would love to get your feedback on this, since this effectively undoes what you contributed in #20501.

The impact for people on a non-Omnipath system (which is the vast majority of the EasyBuild community) is a bit too high though...

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=20794 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20794 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13706

Test results coming soon (I hope)...

- notification for comment with ID 2158766229 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member Author

boegel commented Jun 10, 2024

@boegelbot please test @ jsc-zen3

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20794 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20794 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4355

Test results coming soon (I hope)...

- notification for comment with ID 2158816548 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 9 out of 9 (9 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/ba0bd375cdf2de9be1c3bf9dbbfe7f4b for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 9 out of 9 (9 easyconfigs in total)
cns3 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/4761a7f6ddedd3cf30f1aa264e77742f for a full test report.

@boegel
Member Author

boegel commented Jun 10, 2024

Test report by @boegel
SUCCESS
Build succeeded for 9 out of 9 (9 easyconfigs in total)
node3132.skitty.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/boegel/7c097ab49a6205ae267480ed17b7b087 for a full test report.

@ocaisa ocaisa merged commit 4914367 into easybuilders:develop Jun 10, 2024
9 checks passed
@boegel boegel deleted the 20240610180214_new_pr_libfabric1121 branch June 10, 2024 19:27
@jfgrimm
Member

jfgrimm commented Jun 11, 2024

> @jfgrimm Would love to get your feedback on this, since this effectively undoes what you contributed in #20501.
>
> The impact for people on a non-Omnipath system (which is the vast majority of the EasyBuild community) is a bit too high though...

yeah, I get that
I maintain that we should have an easier way to support omnipath than requiring people to use hooks. Perhaps introducing an EB configuration option, something like EASYBUILD_INTERCONNECT=(infiniband|omnipath), could help: we would adjust builds based on that (defaulting to infiniband if not set)?

@bartoldeman
Contributor

It may be possible to do something similar for PSM2 as we do for OpenMPI, i.e. include some minimal CUDA prototypes, since IIRC PSM2 dlopen()s libcuda.so.1 and so needs very little from CUDA.

Note also that OpenMPI can directly use PSM2 without libfabric, which is what we've been doing for many years on our omnipath cluster Cedar.
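
To illustrate why a dlopen()-based consumer needs so little from CUDA at build time, here is a minimal sketch of the general pattern. It is not PSM2's actual code: the helper name probe_cuda_driver is hypothetical, and the only things assumed are the POSIX dlopen/dlsym/dlclose calls and the driver entry point cuInit, declared here through a hand-written prototype instead of NVIDIA's cuda.h.

```c
/* Sketch only: load the CUDA driver at run time rather than linking against it.
   Assumption: probe_cuda_driver is a hypothetical helper, not part of PSM2. */
#include <dlfcn.h>
#include <stddef.h>

/* Minimal prototype: cuInit takes a flags word and returns 0 (CUDA_SUCCESS) on success. */
typedef int (*cu_init_fn)(unsigned int flags);

/* Returns  1 if libcuda.so.1 was loaded and cuInit succeeded,
            0 if the driver library or the symbol is not available,
           -1 if cuInit returned an error code. */
int probe_cuda_driver(void)
{
    void *handle = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL) {
        return 0;   /* no driver installed, e.g. a CPU-only node */
    }

    /* POSIX allows converting the result of dlsym() to a function pointer. */
    cu_init_fn cu_init = (cu_init_fn)dlsym(handle, "cuInit");
    if (cu_init == NULL) {
        dlclose(handle);
        return 0;
    }

    /* The flags argument of cuInit must be 0. The handle is deliberately left
       open afterwards: a real caller would keep using the driver. */
    return cu_init(0) == 0 ? 1 : -1;
}
```

Nothing here requires the CUDA toolkit to compile; on older glibc versions the program has to be linked with -ldl. On a node without an NVIDIA driver the function simply returns 0.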

@jfgrimm
Member

jfgrimm commented Jun 11, 2024

@bartoldeman indeed, although going forwards we'll need libfabric anyway for opx

@boegel
Member Author

boegel commented Jun 11, 2024

@jfgrimm Please open a framework issue where we can try to figure out how we can support this better.
Perhaps through opt-in hooks that are part of EasyBuild framework, which can be enabled selectively?

@bartoldeman
Contributor

The following cuda.h (plus an empty driver_types.h), put in the include directory of PSM2's sources, makes it build successfully without the CUDA dependency:

/* This header provides minimal parts of the CUDA Driver API, without having to
   rely on the proprietary CUDA toolkit.

   References (to avoid copying from NVidia's proprietary cuda.h):
   https://github.com/gcc-mirror/gcc/blob/master/include/cuda/cuda.h
   https://github.com/Theano/libgpuarray/blob/master/src/loaders/libcuda.h
   https://github.com/CPFL/gdev/blob/master/cuda/driver/cuda.h
   https://github.com/CudaWrangler/cuew/blob/master/include/cuew.h
*/

#ifndef PSM2_CUDA_H
#define PSM2_CUDA_H

#include <stddef.h>

#define CUDA_VERSION 8000

typedef void *CUcontext;
typedef int CUdevice;
#if defined(__LP64__) || defined(_WIN64)
typedef unsigned long long CUdeviceptr;
#else
typedef unsigned CUdeviceptr;
#endif
typedef void *CUevent;
typedef void *CUstream;

typedef enum {
  CUDA_SUCCESS = 0,
  CUDA_ERROR_ALREADY_MAPPED = 208,
  CUDA_ERROR_NOT_READY = 600,
} CUresult;

enum {
  CU_EVENT_DEFAULT = 0x0,
};

enum {
  CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS = 0x1,
};

typedef enum {
  CU_POINTER_ATTRIBUTE_MEMORY_TYPE = 2,
  CU_POINTER_ATTRIBUTE_SYNC_MEMOPS = 6,
  CU_POINTER_ATTRIBUTE_IS_MANAGED = 8,
} CUpointer_attribute;


typedef enum {
  CU_MEMORYTYPE_HOST = 0x01,
  CU_MEMORYTYPE_DEVICE = 0x02,
} CUmemorytype;

#define CU_IPC_HANDLE_SIZE 64

typedef struct CUipcMemHandle_st {
    char reserved[CU_IPC_HANDLE_SIZE];
} CUipcMemHandle;

typedef enum {
  CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
  CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75,
} CUdevice_attribute;

enum {
  CU_STREAM_NON_BLOCKING = 1
};

enum {
  CU_MEMHOSTALLOC_PORTABLE = 0x01,
};

#endif

Now it just needs to go into a patch.
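
A hand-written stub like the header above has to stay layout-compatible with the real driver that gets loaded at run time, so it may be worth shipping a small compile-time check alongside the patch. The sketch below repeats two of the type definitions from the header and asserts their sizes. The function name stub_layout_ok is hypothetical, and the check that CUdeviceptr is pointer-sized is an assumption derived from the #if in the header, not something confirmed against NVIDIA's cuda.h.

```c
/* Sketch only: sanity checks for the minimal CUDA stub types.
   The typedefs are copied from the stub header; stub_layout_ok is a
   hypothetical helper name. */
#include <stddef.h>

#if defined(__LP64__) || defined(_WIN64)
typedef unsigned long long CUdeviceptr;
#else
typedef unsigned CUdeviceptr;
#endif

#define CU_IPC_HANDLE_SIZE 64

typedef struct CUipcMemHandle_st {
    char reserved[CU_IPC_HANDLE_SIZE];
} CUipcMemHandle;

/* Fail the build early if the stub types do not have the expected sizes (C11). */
_Static_assert(sizeof(CUipcMemHandle) == CU_IPC_HANDLE_SIZE,
               "IPC memory handle must be exactly CU_IPC_HANDLE_SIZE bytes");
_Static_assert(sizeof(CUdeviceptr) == sizeof(void *),
               "assumed: device pointers are as wide as host pointers");

/* Run-time variant of the same checks; returns 1 when both hold. */
int stub_layout_ok(void)
{
    return sizeof(CUipcMemHandle) == CU_IPC_HANDLE_SIZE
        && sizeof(CUdeviceptr) == sizeof(void *);
}
```

The static assertions cost nothing at run time and would catch an accidental edit to the stub, such as changing the handle size, before it turns into a hard-to-debug failure on a GPU node.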
