Ansible-based playbooks for the deployment and orchestration of the Conda Compute Cluster.
The Conda Compute Cluster (CCC) has been developed by ViCoS UL, FRI to enable deep learning researchers to migrate seamlessly between different GPU servers when working on specific projects. The main features of the Conda Compute Cluster are:
- Running multiple docker containers on different hosts simultaneously.
- Seamless transition from one host to another.
- SSH access to containers through reverse proxy (FRP proxy).
- Designed for running Conda environments on NVIDIA GPUs for deep learning research.
Containers are based on Conda Compute Containers that enable seamless transition from one host to another due to:
- Home folder mounted on common shared storage.
- Forbidden modification of non-home files.
- Users can modify certain properties from within the container:
  - can modify the container image (must be based on `vicos/ccc-base:latest`)
  - can modify apt packages, repositories and sources installed at container boot
  - can modify on which hosts to deploy containers
- Pre-installed Miniconda in `/home/USER/Conda`.
Cluster management is done through a single Ansible script and enables deployment of the following features:
- Automatic deployment of containers upon change of config.
- Using NFS with FS-cache for shared storage.
- Management of local disk with ZFS.
- Hardware monitoring and management:
  - automatic management of system fans based on GPU temperature when using Supermicro servers (Superfans GPU Controller)
  - monitoring of GPU and CPU reported as Prometheus metrics
  - monitoring of GPU usage for automatic reservation using patroller
Two playbooks are available to deploy the Conda Compute Cluster and its containers:
- `cluster-deploy.yml`: deployment of the cluster infrastructure (network, docker, FRP client, ZFS, NFS, FS-Cache, HW monitoring, GPU fan controllers, etc.)
- `containers-deploy.yml`: deployment of compute containers based on Conda Compute Container (CCC) images
Run the following command to deploy the infrastructure:
ansible-playbook cluster-deploy.yml -i <path-to-inventory> \
--vault-password-file <path-to-secret> -e vars_file=<path-to-secret-vars-dir> \
-e machines=<node-or-group-pattern> \
-e only_roles=<list of roles>
You can specify the cluster definition in the supplied inventory folder. See `sample-inventory` for an example. Tasks are deployed on the nodes selected by `-e machines=<node-or-group-pattern>`.
By default all roles are executed in the order specified below. Deployment can be limited to specific roles by supplying `-e only_roles=<list of roles>`, a comma-separated list of role names:
- `netplan`: network interface definition using netplan
- `docker`: docker with pre-defined docker networks, repository logins and portainer agent for GUI management
- `frp-client`: FRP client for access to containers through the proxy server
- `zfs`: ZFS pools for local storage
- `cachefilesd`: FS-Cache for caching of the NFS storage into local scratch disks
- `nfs-storage`: NFS storage for shared storage (needed for a shared `/home/user` over all compute nodes)
- `superfan-gpu`: Superfans GPU controller for regulating system fans based on GPU temperature
- `monitoring-agent`: HW monitoring for providing Prometheus metrics of CPUs and GPUs
- `compute-container-nightwatch`: CCC nightwatch for providing automatic updates of the compute container upon changes to the Ansible config or user-supplied config
- `patroller`: GPU Patroller for an automatic GPU reservation system based on https://github.com/vicoslab/patroller
- `sshd-hostkey`: not an actual role but a minor task to deploy ssh-daemon keys for CCC containers
An example of how to provide the cluster configuration is in the `sample-inventory` folder, which includes:
- hosts definitions: `your-cluster.yml` with `ccc-cluster` as the main group of your cluster nodes
- cluster settings: `group_vars/ccc-cluster/cluster-vars.yml`
- cluster secrets: `vault_vars/cluster-secrets.yml` (requires `--vault-password-file` to unlock)
- host-specific settings: `sample-inventory/host_vars`
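For orientation, a minimal hosts file could look roughly like the sketch below; host names and addresses are placeholders, and the `sample-inventory` folder remains the authoritative reference:

```yaml
# Hypothetical sample-inventory/your-cluster.yml -- illustrative only.
# The ccc-cluster group collects all compute nodes of the cluster.
ccc-cluster:
  hosts:
    gpu-node-01:
      ansible_host: 192.168.10.11
    gpu-node-02:
      ansible_host: 192.168.10.12
```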
Cluster-wide settings contain the principal configuration of the whole cluster and are sectioned into settings for individual roles. These settings are used by both the `cluster-deploy.yml` and `containers-deploy.yml` playbooks.
Cluster secrets are stored in a separate `vault_vars` folder and should not be present in `group_vars`, so that `containers-deploy.yml` can run without the vault secret. Secrets can instead be loaded for cluster deployment using `-e vars_file=<path-to-secret-vars-dir>`, which loads the vars only for the `cluster-deploy.yml` playbook.
Run the following command to deploy compute containers:
ansible-playbook containers-deploy.yml -i <path-to-inventory> \
-e machines=<node-or-group-pattern> \
-e containers=<list of STACK_NAME> \
-e users=<list of USER_EMAIL>
By default all containers are deployed!!
To limit the deployment to specific containers, two additional filters can be used. For both filters, the provided value must be a comma-separated list in string format:
- `-e containers=<list of STACK_NAME>`: filters based on the containers' `STACK_NAME` value
- `-e users=<list of USER_EMAIL>`: filters based on the containers' `USER_EMAIL` value
The list of containers for deployment and the list of users need to be set in the inventory configuration:
- yaml variable `deployment_containers`: list of containers for deployment (e.g., see `group_vars/ccc-cluster/user-containers.yml`)
- yaml variable `deployment_users`: list of users for deployment (e.g., see `group_vars/ccc-cluster/user-list.yml`)
- yaml variable `deployment_types`: list of user types (e.g., see `group_vars/ccc-cluster/user-list.yml`)
An example of how to provide these configurations is in the `sample-inventory` folder, which includes:
- list of containers for deployment as the `deployment_containers` var in `group_vars/ccc-cluster/user-containers.yml`
- list of users for deployment as the `deployment_users` var in `group_vars/ccc-cluster/user-list.yml`
- list of user types as the `deployment_types` var in `group_vars/ccc-cluster/user-list.yml`
Each container for deployment must be provided in the `deployment_containers` variable as an array/list of dictionaries with the following keys for each container:
- `STACK_NAME`: name of the compute container
- `CONTAINER_IMAGE`: container image that will be deployed (e.g., "registry.vicos.si/ccc-juypter:ubuntu18.04-cuda10.1")
- `USER_EMAIL`: user's email
- `INSTALL_PACKAGES`: additional apt packages that are installed at startup (registry.vicos.si/ccc-base:<...> images do not provide sudo access by default!)
- `INSTALL_REPOSITORY_KEYS`: comma-separated list of links to fingerprint keys for installed repository sources (added using `apt-key add`)
- `INSTALL_REPOSITORY_SOURCES`: comma-separated list of repository sources (`deb ...` sources or `ppa` links that can be added using `add-apt-repository`)
- `SHM_SIZE`: shared memory settings
- `FRP_PORTS`: `dict()` with `TCP` and `HTTP` keys describing the ports forwarded to the FRP server:
  - `TCP`: a list of tcp ports as string values
  - `HTTP`: a list of http ports as `dict()` objects with `port`, `subdomain`, `pass` (optional), `health_check` (optional) and `subdomain_hostname_prefix` (optional, bool) keys
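For illustration, a single `deployment_containers` entry might look roughly like the sketch below; all values (stack name, email, packages, ports) are placeholders, and only the key names come from the list above:

```yaml
# Hypothetical entry in group_vars/ccc-cluster/user-containers.yml -- values are illustrative only.
deployment_containers:
  - STACK_NAME: "john-experiments"
    CONTAINER_IMAGE: "registry.vicos.si/ccc-juypter:ubuntu18.04-cuda10.1"
    USER_EMAIL: "john.doe@example.com"
    INSTALL_PACKAGES: "htop tmux"   # placeholder; exact list format depends on the role
    SHM_SIZE: "8gb"
    FRP_PORTS:
      TCP: ["22"]                   # tcp ports as string values
      HTTP:
        - port: 8888
          subdomain: "john-notebook"
```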
User information can be centralized in a separate file for quick reuse. Containers and users are matched based on their emails. The following user information must be present within the `deployment_users[<USER_EMAIL>]` dictionary:
- `USER_FULLNAME`: user's first and last name
- `USER_MENTOR`: user's mentor (optional)
- `USER_NAME`: username for the OS
- `USER_PUBKEY`: SSH public key for access to the compute container
- `USER_TYPE`: user group/type that restricts network, nodes and GPU devices (groups/types are defined in the `deployment_types` key)
- `ADDITIONAL_DEVICE_GROUPS`: allowed additional device groups besides the ones defined by `USER_TYPE`
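Assuming users are stored as a mapping keyed by email (as the `deployment_users[<USER_EMAIL>]` notation above suggests), a user entry might look roughly like this sketch; all values are placeholders and the exact nesting inside `user-list.yml` may differ:

```yaml
# Hypothetical user entry in group_vars/ccc-cluster/user-list.yml -- illustrative only.
deployment_users:
  "john.doe@example.com":
    USER_FULLNAME: "John Doe"
    USER_NAME: "john"
    USER_PUBKEY: "ssh-ed25519 AAAA... john.doe@example.com"
    USER_TYPE: "lab"            # should match one of the groups defined in deployment_types
    USER_MENTOR: "Jane Mentor"  # optional
```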
- setting docker repository login from config
- encrypted data for authentication settings
- can deploy a compute-container only to a specific group of nodes (student or lab nodes) or to a specific node
- can control the deployment of compute-containers through config
- support for NVIDIA GPU driver installation
- performance-tuned NFS mount settings with FS-cache
- custom ZFS storage mounting
- IPMI fan controller using NVIDIA GPU temperatures (designed for Supermicro servers)
- centralized storage of users (with their names, email and PUBKEY) in a single file
- loading of SSH pubkey from GitHub
- prometheus export for monitoring of the HW (for CPU and GPU - GPU utilization, temperature, etc)
- users can provide custom settings inside the containers by editing the `~/.containers/<STACK_NAME>.yml` file (see the sketch after this list)
- compute-container-nightwatch that monitors `~/.containers/<STACK_NAME>.yml` files and redeploys containers using ansible-pull
- constraining to specific GPUs based on device groups and user group
- enabling redirection of container logging output to the user
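As a rough sketch of the user-editable override mentioned above, and assuming the file reuses the same keys as the `deployment_containers` entries (the actual schema is defined by the compute-container-nightwatch role), `~/.containers/<STACK_NAME>.yml` might contain something like:

```yaml
# Hypothetical ~/.containers/<STACK_NAME>.yml -- key names assume the deployment_containers
# schema; check the compute-container-nightwatch role for the authoritative format.
CONTAINER_IMAGE: "registry.vicos.si/ccc-juypter:ubuntu18.04-cuda10.1"  # must be based on vicos/ccc-base
INSTALL_PACKAGES: "graphviz ffmpeg"
```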