This guide outlines the steps to configure the environment required to run benchmark recipes on a Google Kubernetes Engine (GKE) cluster with A3 Mega node pools.
Before you begin, complete the following:
- Create a Google Cloud project with billing enabled.
  a. To create a project, see Creating and managing projects.
  b. To enable billing, see Verify the billing status of your projects.
- Enable the APIs required for this setup (see the sketch after this list).
- Request enough GPU quota. Each `a3-megagpu-8g` machine has 8 H100 80GB GPUs attached.
  - To view quotas, see View the quotas for your project. In the Filter field, select Dimensions (e.g., location) and specify `gpu_family:NVIDIA_H100_MEGA`.
  - If you don't have enough quota, request a higher quota.
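At a minimum, the services used throughout this guide (GKE, Artifact Registry, and Cloud Storage) need their APIs enabled. A minimal sketch, assuming these are the services your recipes require (the Parallelstore API is enabled in a later step):

```bash
# Enable the core APIs used in this guide; add others as your recipes require.
gcloud services enable \
    container.googleapis.com \
    artifactregistry.googleapis.com \
    storage.googleapis.com
```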
The environment comprises the following components:
- Client workstation: used to prepare, submit, and monitor ML workloads.
- Google Cloud Storage (GCS) Bucket: used for logs.
- Artifact Registry: serves as a private container registry for storing and managing Docker images used in the deployment.
- Google Kubernetes Engine (GKE) Cluster with A3 Mega Node Pools: provides a managed Kubernetes environment to run benchmark recipes.
You have two options: a local machine or Google Cloud Shell.
We recommend using Google Cloud Shell as it comes with all necessary components pre-installed.
If you prefer to use your local machine, ensure your local machine has the following components installed.
- Google Cloud SDK. To install, see Install the gcloud CLI.
- kubectl. To install, see the Kubernetes documentation.
- Helm. To install, see the Helm documentation.
- Docker. To install, see the Docker documentation.
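If you use your local machine, a quick way to confirm that each component is installed and on your PATH:

```bash
# Print the version of each prerequisite to confirm it is installed.
gcloud version
kubectl version --client
helm version
docker --version
```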
The recipes use a Google Cloud Storage bucket to maintain workload logs.
```bash
gcloud storage buckets create gs://<BUCKET_NAME> --location=<BUCKET_LOCATION> --no-public-access-prevention
```
Replace the following:
- `<BUCKET_NAME>`: the name of your bucket. The name must comply with the Cloud Storage bucket naming conventions.
- `<BUCKET_LOCATION>`: the location of your bucket. The bucket must be located in the same region as the GKE cluster.
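For example, a hypothetical invocation with placeholder values (the bucket name and region below are illustrative; pick a region that matches your cluster):

```bash
# Illustrative values only: replace the bucket name and region with your own.
gcloud storage buckets create gs://my-benchmark-logs --location=us-central1 --no-public-access-prevention
```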
- If you use Cloud KMS for repository encryption, create your artifact registry by using the instructions here.
- If you don't use Cloud KMS, you can create your repository by using the following command:

```bash
gcloud artifacts repositories create <REPOSITORY> \
    --repository-format=docker \
    --location=<LOCATION> \
    --description="<DESCRIPTION>"
```
Replace the following:
- `<REPOSITORY>`: the name of the repository. For each repository location in a project, repository names must be unique.
- `<LOCATION>`: the regional or multi-regional location for the repository. You can omit this flag if you set a default region.
- `<DESCRIPTION>`: a description of the repository. Don't include sensitive data because repository descriptions are not encrypted.
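Before pushing images, configure Docker to authenticate to the Artifact Registry host for your repository's location (shown here for a hypothetical us-central1 repository):

```bash
# Register gcloud as a Docker credential helper for the repository's regional host.
gcloud auth configure-docker us-central1-docker.pkg.dev
```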
Follow this guide for detailed instructions to create a GKE cluster with A3 Mega node pools.
The documentation uses Cluster Toolkit to create your GKE cluster quickly while incorporating best practices:
- Creation of the necessary VPC networks and subnets.
- Creation of a GKE cluster with multi-networking enabled.
- Creation of an A3 Mega node pool with NVIDIA H100 GPUs.
- Installation of the required components for the GPUDirect-TCPXO networking stack.
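After the cluster is created, fetch credentials so that kubectl and Helm on your client workstation can reach it; the cluster name and region below are placeholders:

```bash
# Fetch kubeconfig credentials for the cluster, then confirm its nodes are visible.
gcloud container clusters get-credentials <CLUSTER_NAME> --region=<CLUSTER_REGION>
kubectl get nodes
```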
Some recipes require Google Cloud Storage buckets with hierarchical namespace enabled to manage data and checkpoints.
You can create a bucket with hierarchical namespace enabled using the following command:
```bash
gcloud storage buckets create gs://<BUCKET_NAME> --location=<BUCKET_LOCATION> \
    --no-public-access-prevention \
    --uniform-bucket-level-access \
    --enable-hierarchical-namespace
```
Replace the following:
- `<BUCKET_NAME>`: the name of your bucket. The name must comply with the Cloud Storage bucket naming conventions.
- `<BUCKET_LOCATION>`: the location of your bucket. The bucket must be located in the same region as the GKE cluster.
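To review the bucket's configuration after creation (the bucket name here is hypothetical):

```bash
# Print the bucket's metadata to verify its settings.
gcloud storage buckets describe gs://my-recipe-data
```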
Some recipes require a Google Cloud Parallelstore instance for data and checkpointing. You can create and configure the instance using the following steps.
You must be granted the following roles:
- `roles/parallelstore.admin`
- `roles/compute.networkAdmin` or `roles/servicenetworking.networksAdmin`
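If these roles haven't been granted yet, a project administrator can grant them; a sketch with a placeholder identity:

```bash
# Grant the Parallelstore admin role; replace the member with your own identity.
gcloud projects add-iam-policy-binding <PROJECT_ID> \
    --member="user:you@example.com" \
    --role="roles/parallelstore.admin"
```

Then enable the Parallelstore API: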
```bash
gcloud services enable parallelstore.googleapis.com --project=<PROJECT_ID>
```
Replace the following:
- `<PROJECT_ID>`: the project ID of your project.
Parallelstore runs within a Virtual Private Cloud (VPC), which provides networking functionality to Compute Engine virtual machine (VM) instances, Google Kubernetes Engine (GKE) clusters, and serverless workloads.
You must use the same VPC network when creating the Parallelstore instance that you used for your Google Kubernetes Engine cluster.
You must also configure private services access within this VPC.
```bash
gcloud services enable servicenetworking.googleapis.com

NETWORK_NAME=$(
  gcloud container clusters describe <CLUSTER_NAME> \
    --location <CLUSTER_REGION> \
    --format="value(network)"
)
```
Replace the following:
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<CLUSTER_REGION>`: the region of your GKE cluster.
Private services access requires a prefix length of at least /24 (256 addresses). Parallelstore reserves 64 addresses per instance, so a single /24 range can serve up to four Parallelstore instances (4 × 64 = 256), and you can reuse this IP range with other services or other Parallelstore instances if needed.
```bash
IP_RANGE_NAME=<IP_RANGE_NAME>

gcloud compute addresses create $IP_RANGE_NAME \
    --global \
    --purpose=VPC_PEERING \
    --prefix-length=24 \
    --description="Parallelstore VPC Peering" \
    --network=$NETWORK_NAME
```
Replace the following:
- `<IP_RANGE_NAME>`: the name of the IP range. You can use any name that hasn't already been used.
```bash
CIDR_RANGE=$(
  gcloud compute addresses describe $IP_RANGE_NAME \
    --global \
    --format="value[separator=/](address, prefixLength)"
)

FIREWALL_RULE_NAME=<FIREWALL_RULE_NAME>

gcloud compute firewall-rules create $FIREWALL_RULE_NAME \
    --allow=tcp \
    --network=$NETWORK_NAME \
    --source-ranges=$CIDR_RANGE
```
Replace the following:
- `<FIREWALL_RULE_NAME>`: the name of the firewall rule. You can use any name that hasn't already been used.
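To confirm the rule was created as intended:

```bash
# Show the new rule; sourceRanges should match $CIDR_RANGE.
gcloud compute firewall-rules describe $FIREWALL_RULE_NAME
```

Then connect the reserved range to the Service Networking service: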
```bash
gcloud services vpc-peerings connect \
    --network=$NETWORK_NAME \
    --ranges=$IP_RANGE_NAME \
    --service=servicenetworking.googleapis.com
```
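You can verify that the peering is in place by listing the private service connections on the network:

```bash
# List VPC peerings on the network; the reserved range should appear in the output.
gcloud services vpc-peerings list --network=$NETWORK_NAME
```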
After the VPC network used by your GKE cluster is configured, you can create a Parallelstore instance.
When creating a Parallelstore instance, you must define the following properties:
- The instance's name.
- The storage capacity. Capacity can range from 12 TiB (tebibytes) to 100 TiB, in multiples of 4 TiB. For example, 16 TiB, 20 TiB, 24 TiB.
- The location.
- File and directory striping settings.
Your instance must be located in the same zone as your cluster's A3 Mega node pool. The storage capacity and file and directory striping settings depend on the recipe and will be specified in the recipe's instructions.
```bash
gcloud beta parallelstore instances create <INSTANCE_ID> \
    --capacity-gib=<CAPACITY_GIB> \
    --location=<LOCATION> \
    --network=<NETWORK_NAME> \
    --project=<PROJECT_ID> \
    --directory-stripe-level=<DIRECTORY_STRIPE_LEVEL> \
    --file-stripe-level=<FILE_STRIPE_LEVEL>
```
Replace the following:
- `<INSTANCE_ID>`: the name of your Parallelstore instance.
- `<PROJECT_ID>`: the name of your project.
- `<LOCATION>`: the zone where your cluster's A3 Mega node pool is located.
- `<NETWORK_NAME>`: the name of your cluster's VPC.
- `<CAPACITY_GIB>`: the storage capacity of the instance in gibibytes (GiB). Allowed values are from 12000 to 100000, in multiples of 4000.
- `<DIRECTORY_STRIPE_LEVEL>`: the striping level for directories. Allowed values are:
  - `directory-stripe-level-balanced`
  - `directory-stripe-level-max`
  - `directory-stripe-level-min`
- `<FILE_STRIPE_LEVEL>`: the striping level for files. Allowed values are:
  - `file-stripe-level-balanced`
  - `file-stripe-level-max`
  - `file-stripe-level-min`
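For example, a minimal 12 TiB instance with balanced striping (all names below are placeholders; take the capacity and striping values from your recipe's instructions):

```bash
# Illustrative values only: replace the instance, zone, network, and project names.
gcloud beta parallelstore instances create my-parallelstore \
    --capacity-gib=12000 \
    --location=us-central1-a \
    --network=my-cluster-vpc \
    --project=my-project \
    --directory-stripe-level=directory-stripe-level-balanced \
    --file-stripe-level=file-stripe-level-balanced
```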
Once you've set up your GKE cluster with A3 Mega node pools, you can proceed to deploy and run your benchmark recipes.
If you encounter any issues or have questions about this setup, use one of the following resources:
- Consult the official GKE documentation.
- Check the issues section of this repository for known problems and solutions.
- Reach out to Google Cloud support.