
Adds local storage support for slurm #83

Merged
merged 1 commit into main from 68-add-local-storage-support-for-slurm on May 5, 2023

Conversation

steven-safeai (Contributor)

Creates the playbooks that assign quotas and set up the local-storage directory allocated to each user on each job run.

@steven-safeai added the enhancement (New feature or request) and High priority (This needs to be addressed ASAP) labels on May 4, 2023
@steven-safeai (Contributor, Author)

Testing I did:

sudo vim slurm.conf 

JobContainerType=job_container/tmpfs
sudo vim job_container.conf

AutoBasePath=true
BasePath=/mnt/localdisk/slurm_tmp
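
With job_container/tmpfs enabled, Slurm gives each job a private /tmp backed by a per-job directory under BasePath. A quick sanity check after reconfiguring (a sketch, assuming scontrol is on PATH):

scontrol show config | grep -i jobcontainer
# Expected: JobContainerType = job_container/tmpfs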
sudo vim /opt/oci-hpc/playbooks/roles/cais-compute/tasks/ol-7.yml

- name: Create a project
  shell: echo "101:/mnt/localdisk/slurm_tmp" >> /etc/projects
  run_once: true

- name: Create project id
  shell: echo "slurm_tmp:101" >> /etc/projid
  run_once: true

# I think it's redundant to make a project called /mnt/localdisk/slurm_tmp so feel free to rewrite.
- name: Add the project to xfs_quota
  become: true
  shell: xfs_quota -x -c 'project /mnt/localdisk/slurm_tmp' /mnt/localdisk/slurm_tmp
  run_once: true

# Switched to 4 GB for testing
- name: Create project quotas
  become: true
  shell: xfs_quota -x -c 'limit -p bsoft=2g bhard=4g slurm_tmp' /mnt/localdisk/slurm_tmp

- name: Create local storage location for all compute nodes
  become: true
  file: 
    path: /mnt/localdisk/slurm_tmp
    state: directory
    owner: root
    group: slurm
    mode: '0770'
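
To verify that the project and its limits registered, xfs_quota can print a per-project report on a compute node (a sketch; flags are standard xfsprogs):

sudo xfs_quota -x -c 'report -p -h' /mnt/localdisk/slurm_tmp
# The slurm_tmp project should show soft/hard limits of 2G/4G.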

Run playbook

cd /opt/oci-hpc/playbooks
ansible-playbook cais_compute.yml
TASK [cais-compute : Create a project] *****************************************
changed: [inst-botq7-dev-cpu-cluster]

TASK [cais-compute : Create project id] ****************************************
changed: [inst-botq7-dev-cpu-cluster]

TASK [cais-compute : Add the project to xfs_quota] *****************************
changed: [inst-botq7-dev-cpu-cluster]

TASK [cais-compute : Create project quotas] ************************************
changed: [inst-botq7-dev-cpu-cluster]

TASK [cais-compute : Create local storage location for all compute nodes] ******
changed: [inst-botq7-dev-cpu-cluster]
changed: [inst-ib9zm-dev-cpu-cluster]

TASK [cais-compute : include] **************************************************
skipping: [inst-botq7-dev-cpu-cluster]
skipping: [inst-ib9zm-dev-cpu-cluster]

TASK [cais-compute : include] **************************************************
skipping: [inst-botq7-dev-cpu-cluster]
skipping: [inst-ib9zm-dev-cpu-cluster]

TASK [cais-compute : include] **************************************************
skipping: [inst-botq7-dev-cpu-cluster]
skipping: [inst-ib9zm-dev-cpu-cluster]

PLAY RECAP *********************************************************************
inst-botq7-dev-cpu-cluster : ok=6    changed=5    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0   
inst-ib9zm-dev-cpu-cluster : ok=2    changed=1    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0   

[opc@dev-cpu-cluster-bastion playbooks]$
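
One caveat: the echo ... >> tasks append on every playbook run, so reruns can leave duplicate entries in /etc/projects and /etc/projid (Ansible's lineinfile module would make this idempotent). A quick duplicate check, as a sketch:

sort /etc/projects | uniq -d   # prints nothing when there are no duplicates
sort /etc/projid | uniq -d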

Testing it out officially

sudo scontrol reconfigure
sudo srun --pty bash

cd /tmp
[root@compute-permanent-node-384 tmp]# ls
[root@compute-permanent-node-384 tmp]#
# Good, it's empty

[root@compute-permanent-node-384 tmp]# xfs_mkfile 2g 2Gigfile
[root@compute-permanent-node-384 tmp]# xfs_mkfile 3g 3Gigfile
[root@compute-permanent-node-384 tmp]# ll -h
total 5.0G
-rw------- 1 root root 2.0G May  5 00:17 2Gigfile
-rw------- 1 root root 3.0G May  5 00:17 3Gigfile
[root@compute-permanent-node-384 tmp]# exit

Check to make sure it's deleted:

[opc@dev-cpu-cluster-bastion ~]$ ssh compute-permanent-node-384
Last login: Fri May  5 01:20:06 2023 from dev-cpu-cluster-bastion.public.cluster.oraclevcn.com
[opc@compute-permanent-node-384 ~]$ ls /mnt/localdisk/slurm_tmp/
[opc@compute-permanent-node-384 ~]$

And it's gone!
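
This matches how job_container/tmpfs works: Slurm creates a per-job directory under BasePath, bind-mounts it over /tmp inside the job's mount namespace, and removes it when the job finishes. While a job is still running, the per-job directory should be visible from outside the job (the job ID shown is illustrative):

[opc@compute-permanent-node-384 ~]$ ls /mnt/localdisk/slurm_tmp/
12345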

@steven-safeai force-pushed the 68-add-local-storage-support-for-slurm branch from bf2ae73 to 2db4c81 on May 5, 2023 01:51
@andriy-safe-ai (Contributor) left a comment:

I left two questions.

@steven-safeai force-pushed the 68-add-local-storage-support-for-slurm branch from 16eabbd to 31ac220 on May 5, 2023 14:51
@andriy-safe-ai (Contributor) left a comment:

LGTM

@steven-safeai merged commit 51a76b8 into main on May 5, 2023
@steven-safeai deleted the 68-add-local-storage-support-for-slurm branch on May 5, 2023 14:53