Running NVIDIA NeMo 2.0 with NeMo-Run on Slurm

This project provides a step-by-step guide to deploying NVIDIA NeMo 2.0 on a Slurm cluster. It covers preparing the environment and running NeMo jobs for large-scale AI training.

1. Prerequisites

This guide assumes that you have the following:

  • A functional Slurm cluster on AWS. This test case also assumes that the cluster nodes use an Ubuntu-based OS.
  • Docker installed, along with Pyxis and Enroot so that Slurm can run containerized jobs.
  • An FSx for Lustre filesystem mounted on /fsx on all Slurm nodes. This test case also assumes that the home directory is a shared directory.

It is recommended that you use the templates in the architectures directory to set up Amazon SageMaker HyperPod or AWS ParallelCluster.

Make sure that your current directory is under a shared filesystem such as /fsx.
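Before proceeding, you can quickly check that the prerequisites are in place. This is an optional sanity check, assuming Pyxis is enabled as a Slurm SPANK plugin and /fsx is your shared mount point:

sinfo                                  # the cluster and its partitions are up
mount | grep /fsx                      # the FSx for Lustre filesystem is mounted
enroot version                         # Enroot is installed
srun --help | grep container-image     # Pyxis adds the --container-image option to srun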

3. Clone this repo

cd ~
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/22.nemo-run/slurm

4. Build and Configure the NeMo Job Container

Before running NeMo jobs, build a custom container image optimized for EFA, then convert the Docker image to an Enroot squash file stored under your home directory.

Build the image and convert it to a squash file:

docker build --progress=plain -t aws-nemo:24.12 -f ../Dockerfile ..
enroot import -o ~/aws-nemo-24-12.sqsh dockerd://aws-nemo:24.12
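Optionally, run a quick smoke test to confirm that Slurm can launch the container through Pyxis. The dev partition name below is an assumption; substitute a partition from your cluster:

srun --partition dev --container-image ~/aws-nemo-24-12.sqsh nvidia-smi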

5. Install Dependencies and Prepare NeMo 2.0 Environment

Install Python 3.10 and NeMo-Run

Update the package index and make sure the correct Python version is installed:

sudo apt update
sudo apt install software-properties-common -y
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install -y python3.10 python3.10-venv python3.10-dev
python3.10 --version

Create a virtual environment and install NeMo-Run and other dependencies:

python3.10 -m venv nemo-env
source nemo-env/bin/activate
bash venv.sh
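Once the script finishes, you can confirm that NeMo-Run is importable from the virtual environment. This is a minimal check, assuming venv.sh installs the nemo_run package:

python -c "import nemo_run; print('NeMo-Run import OK')"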

Download Tokenizer Files

We will run pre-training of the Llama 8B model in offline mode, so we first download the vocabulary and merges files for the tokenizer:

mkdir -p ~/megatron
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O ~/megatron/megatron-gpt-345m_vocab
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O ~/megatron/megatron-gpt-345m_merges
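You can confirm that both files landed in the expected location:

ls -lh ~/megatron/megatron-gpt-345m_vocab ~/megatron/megatron-gpt-345m_merges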

6. Launch Pretraining Job with NeMo-Run

In NeMo-Run, you can build and configure everything in Python, eliminating the need to juggle multiple tools to manage your experiments. The only exception is setting up the environment for remote execution, where we rely on Docker. The NVIDIA documentation describes how to write your own Python script for NeMo-Run. In this example, we run the following script to start the Llama 8B pretraining job:

python run.py --container_image ~/aws-nemo-24-12.sqsh --nodes 2 --partition dev --env_vars_file env_vars.json --max_steps 1000
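After run.py submits the experiment, you can track it with standard Slurm commands. The exact log location depends on how NeMo-Run stages the experiment, so adjust paths to your setup:

squeue -u $USER                 # confirm the pretraining job is pending or running
scontrol show job <jobid>       # inspect node allocation and job details; replace <jobid> with your job ID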

7. References