Description

This module creates a Slurm controller node via the SchedMD/slurm-gcp controller module.

More information about Slurm on GCP can be found at the project's GitHub page and in the Slurm on Google Cloud User Guide.

The user guide provides detailed instructions on customizing and enhancing the Slurm on GCP cluster as well as recommendations on configuring the controller for optimal performance at different scales.

Example

- source: community/modules/scheduler/SchedMD-slurm-on-gcp-controller
  kind: terraform
  id: slurm_controller
  use:
  - network1
  - homefs
  - compute_partition
  settings:
    login_node_count: 1

This creates a controller node connected to the primary subnetwork, with one login node (defined elsewhere). The controller also mounts the homefs file system and manages one partition, both declared in the use field. For more context, see the hpc-cluster-small example.
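Other inputs from the Inputs table can be set in the same settings block. The sketch below is illustrative only: the specific values (a larger machine type, a longer suspend_time, the label) are assumptions chosen for the example, not recommendations from the module authors. For sizing guidance, see the user guide.

```yaml
- source: community/modules/scheduler/SchedMD-slurm-on-gcp-controller
  kind: terraform
  id: slurm_controller
  use:
  - network1
  - homefs
  - compute_partition
  settings:
    login_node_count: 1
    # Illustrative override; the module default is c2-standard-4.
    controller_machine_type: c2-standard-8
    # Idle time in seconds before nodes are removed (default 300).
    suspend_time: 600
    # Hypothetical label, shown only to illustrate the key/value shape.
    labels:
      team: hpc
```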

Support

The HPC Toolkit team maintains the wrapper around the slurm-on-gcp Terraform modules. For support with the underlying modules, see the instructions in the slurm-gcp README.

License

Copyright 2022 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Requirements

| Name | Version |
| --- | --- |
| terraform | >= 0.14.0 |
| google | >= 3.83 |

Providers

| Name | Version |
| --- | --- |
| google | >= 3.83 |

Modules

| Name | Source | Version |
| --- | --- | --- |
| slurm_cluster_controller | github.com/SchedMD/slurm-gcp//tf/modules/controller/ | v4.1.8 |

Resources

| Name | Type |
| --- | --- |
| google_compute_image.compute_image | data source |

Inputs

| Name | Description | Type | Default | Required |
| --- | --- | --- | --- | --- |
| boot_disk_size | Size of boot disk to create for the cluster controller node | number | 50 | no |
| boot_disk_type | Type of boot disk to create for the cluster controller node. Choose from: pd-ssd, pd-standard, pd-balanced, pd-extreme. pd-ssd is recommended if the controller is hosting the SlurmDB and NFS share. If SlurmDB and NFS share are not running on the controller, pd-standard is recommended. See "Controller configuration recommendations" in the Slurm on Google Cloud User Guide for more information: https://goo.gle/slurm-gcp-user-guide | string | "pd-ssd" | no |
| cloudsql | Define an existing CloudSQL instance to use instead of instance-local MySQL | object({ server_ip = string, user = string, password = string, db_name = string }) | null | no |
| cluster_name | Name of the cluster | string | null | no |
| compute_node_scopes | Scopes to apply to compute nodes. | list(string) | [ "https://www.googleapis.com/auth/monitoring.write", "https://www.googleapis.com/auth/logging.write", "https://www.googleapis.com/auth/devstorage.read_only" ] | no |
| compute_node_service_account | Service Account for compute nodes. | string | null | no |
| compute_startup_script | Custom startup script to run on the compute nodes | string | null | no |
| controller_instance_template | Instance template to use to create controller instance | string | null | no |
| controller_machine_type | Compute Platform machine type to use in controller node creation. c2-standard-4 is recommended for clusters up to 50 nodes; for larger clusters, see "Controller configuration recommendations" in the Slurm on Google Cloud User Guide: https://goo.gle/slurm-gcp-user-guide | string | "c2-standard-4" | no |
| controller_scopes | Scopes to apply to the controller | list(string) | [ "https://www.googleapis.com/auth/cloud-platform", "https://www.googleapis.com/auth/devstorage.read_only" ] | no |
| controller_secondary_disk | Create secondary disk mounted to controller node | bool | false | no |
| controller_secondary_disk_size | Size of disk for the secondary disk | number | 100 | no |
| controller_secondary_disk_type | Disk type (pd-ssd or pd-standard) for secondary disk | string | "pd-ssd" | no |
| controller_service_account | Service Account for the controller | string | null | no |
| controller_startup_script | Custom startup script to run on the controller | string | null | no |
| deployment_name | Name of the deployment | string | n/a | yes |
| disable_compute_public_ips | If set to true, create Cloud NAT gateway and enable IAP FW rules | bool | true | no |
| disable_controller_public_ips | If set to true, create Cloud NAT gateway and enable IAP FW rules | bool | false | no |
| instance_image | Slurm image to use for the controller instance | object({ family = string, project = string }) | { "family": "schedmd-slurm-21-08-8-hpc-centos-7", "project": "schedmd-slurm-public" } | no |
| intel_select_solution | Configure the cluster to meet the performance requirement of the Intel Select Solution | string | null | no |
| jwt_key | Specific libjwt key to use | any | null | no |
| labels | Labels to add to controller instance. List of key, value pairs. | any | {} | no |
| login_node_count | Number of login nodes in the cluster | number | 0 | no |
| munge_key | Specific munge key to use | any | null | no |
| network_storage | An array of network attached storage mounts to be configured on all instances. | list(object({ server_ip = string, remote_mount = string, local_mount = string, fs_type = string, mount_options = string })) | [] | no |
| partition | An array of configurations for specifying multiple machine types residing in their own Slurm partitions. | list(object({ name = string, machine_type = string, max_node_count = number, zone = string, image = string, image_hyperthreads = bool, compute_disk_type = string, compute_disk_size_gb = number, compute_labels = any, cpu_platform = string, gpu_type = string, gpu_count = number, network_storage = list(object({ server_ip = string, remote_mount = string, local_mount = string, fs_type = string, mount_options = string })), preemptible_bursting = string, vpc_subnet = string, exclusive = bool, enable_placement = bool, regional_capacity = bool, regional_policy = any, instance_template = string, static_node_count = number })) | n/a | yes |
| project_id | Compute Platform project that will host the Slurm cluster | string | n/a | yes |
| region | Compute Platform region where the Slurm cluster will be located | string | n/a | yes |
| shared_vpc_host_project | Host project of shared VPC | string | null | no |
| subnetwork_name | The name of the pre-defined VPC subnet you want the nodes to attach to based on Region. | string | null | no |
| suspend_time | Idle time (in sec) to wait before nodes go away | number | 300 | no |
| zone | Compute Platform zone where the servers will be located | string | n/a | yes |
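
The object-typed inputs can be easier to read as a blueprint fragment. The following is a minimal sketch of how cloudsql and network_storage might be written in a settings block, using the field names from the types above. Every value (IP addresses, user, password, database name, mount paths) is a placeholder assumed for illustration. Because SlurmDB and the NFS share are not hosted on the controller in this sketch, boot_disk_type is set to pd-standard, following the guidance in the boot_disk_type description.

```yaml
- source: community/modules/scheduler/SchedMD-slurm-on-gcp-controller
  kind: terraform
  id: slurm_controller
  use:
  - network1
  - compute_partition
  settings:
    # pd-standard is suggested when SlurmDB and NFS run off the controller.
    boot_disk_type: pd-standard
    # Existing CloudSQL instance used instead of instance-local MySQL.
    # All values here are placeholders.
    cloudsql:
      server_ip: 10.0.0.5
      user: slurm
      password: CHANGE_ME
      db_name: slurm_accounting
    # One network attached storage mount; all values are placeholders.
    network_storage:
    - server_ip: 10.0.0.10
      remote_mount: /exports/apps
      local_mount: /apps
      fs_type: nfs
      mount_options: defaults
```

In practice the password would likely come from a secret or variable rather than being written inline; how that is wired depends on the deployment and is not covered here.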

Outputs

| Name | Description |
| --- | --- |
| controller_name | Name of the controller node |
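
As a sketch of how this output might be consumed, a login node module can list the controller in its use field so that it picks up controller_name. The module path and id below are assumptions based on the toolkit's naming convention and do not come from this document; check the hpc-cluster-small example for the exact definition.

```yaml
# Assumed login node module; the path and id are illustrative, not confirmed here.
- source: community/modules/scheduler/SchedMD-slurm-on-gcp-login-node
  kind: terraform
  id: slurm_login
  use:
  - network1
  - homefs
  - slurm_controller   # provides controller_name to the login node
```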