RHOAI Install and Testing
Refer to the Red Hat docs here for more detail: https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.12/html-single/installing_and_uninstalling_openshift_ai_self-managed/index
Table of Contents
- 0. Prerequisites
- 1. Install the Red Hat OpenShift AI Operator
- 2. Monitor DSCI
- 3. Install the Red Hat OpenShift AI components via DSC
- 4. Check that everything is running
- 5. Configure your Kueue minimum requirements:
- 6. Testing
- 7. Cleanup
0. Prerequisites
0.1 An OpenShift cluster up and running. (I've been using OpenShift 4.14.17.)
0.2 Logged into the OpenShift UI. Note: I install everything from the command line, but I need the UI's "Copy login command" option to get the oc login token.
0.3 Also logged into the terminal with oc login. For example:
oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443
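If you want to confirm the login took effect before going further, a quick sanity check:

```shell
# Verify who you're logged in as and which API server you're pointed at
oc whoami
oc whoami --show-server
```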
0.4 You'll also need the GPU prerequisites from here: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html
1. Install the Red Hat OpenShift AI Operator
1.1 Create a namespace:
cat << EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-operator
EOF
1.2 Create an OperatorGroup
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
EOF
1.3 Install the Service Mesh operator
Note: if you are installing in production, you probably want installPlanApproval: Manual
so that you're not surprised by operator updates before you've had a chance to verify them on a dev/stage server first.
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: servicemeshoperator.v2.6.2
EOF
and make sure it works:
watch oc get pods -n openshift-operators
and it should look something like this:
NAME READY STATUS RESTARTS AGE
istio-operator-6c99f6bf7b-rrh2j 1/1 Running 0 13m
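You can also confirm that the operator's install finished by checking its ClusterServiceVersion, which should reach phase Succeeded:

```shell
# The Subscription creates a CSV; PHASE should read Succeeded once installed
oc get csv -n openshift-operators | grep servicemesh
```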
1.4 Create a subscription (Recommend changing installPlanApproval to Manual in production)
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  name: rhods-operator
  channel: fast
  installPlanApproval: Automatic
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
And watch that it starts:
watch oc get pods -n redhat-ods-operator
2. Monitor DSCI
Watch the DSCI until it's complete:
watch oc get dsci
and it'll finish up like this:
NAME AGE PHASE CREATED AT
default-dsci 16m Ready 2024-07-02T19:56:18Z
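If you'd rather not sit on watch, oc wait can block until the DSCI reports Ready (the resource name here is taken from the output above):

```shell
# Block until the DSCI phase becomes Ready, or give up after 10 minutes
oc wait dsci/default-dsci --for=jsonpath='{.status.phase}'=Ready --timeout=10m
```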
3. Install the Red Hat OpenShift AI components via DSC
cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      defaultDeploymentMode: RawDeployment
      serving:
        ingressGateway:
          certificate:
            secretName: knative-serving-cert
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Removed
    workbenches:
      managementState: Removed
    trainingoperator:
      managementState: Managed
EOF
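The DSC exposes a phase in its status just like the DSCI does, so the same wait pattern should work here as well:

```shell
# Block until the DataScienceCluster reports Ready
oc wait dsc/default-dsc --for=jsonpath='{.status.phase}'=Ready --timeout=10m
```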
4. Check that everything is running
4.1 Check that your operators are running:
oc get pods -n redhat-ods-operator
Will return:
NAME READY STATUS RESTARTS AGE
rhods-operator-7c54d9d6b5-j97mv 1/1 Running 0 22h
4.2 Check that the service mesh operator is running:
oc get pods -n openshift-operators
Will return:
NAME READY STATUS RESTARTS AGE
istio-cni-node-v2-5-9qkw7 1/1 Running 0 84s
istio-cni-node-v2-5-dbtz5 1/1 Running 0 84s
istio-cni-node-v2-5-drc9l 1/1 Running 0 84s
istio-cni-node-v2-5-k4x4t 1/1 Running 0 84s
istio-cni-node-v2-5-pbltn 1/1 Running 0 84s
istio-cni-node-v2-5-xbmz5 1/1 Running 0 84s
istio-operator-6c99f6bf7b-4ckdx 1/1 Running 1 (2m39s ago) 2m56s
4.3 Check that the DSC components are running:
watch oc get pods -n redhat-ods-applications
Will return:
NAME READY STATUS RESTARTS AGE
kubeflow-training-operator-77b578788c-bbgfk 1/1 Running 0 22h
kueue-controller-manager-6b44689c95-r6qnq 1/1 Running 0 22h
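Another useful check is that the CRDs the trainingoperator and kueue components register actually exist; the PyTorchJob and queue objects used below depend on them:

```shell
# All three CRDs should be listed; an error here means a component didn't deploy
oc get crd pytorchjobs.kubeflow.org clusterqueues.kueue.x-k8s.io localqueues.kueue.x-k8s.io
```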
5. Configure your Kueue minimum requirements:
cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "cpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 5
      - name: "memory"
        nominalQuota: 20Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 5
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: default
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
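A quick way to confirm the queue objects were created (ResourceFlavor and ClusterQueue are cluster-scoped; LocalQueue lives in the namespace):

```shell
# List the cluster-scoped Kueue objects
oc get resourceflavors,clusterqueues
# And the namespaced LocalQueue
oc get localqueues -n default
```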
Note: Here's an alternative Kueue requirements file with GPU that you could use as a guide...
cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "non-gpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "gpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "pods", "nvidia.com/gpu"]
    flavors:
    - name: "non-gpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 10
      - name: "memory"
        nominalQuota: 50Gi
      - name: "pods"
        nominalQuota: 10
      - name: "nvidia.com/gpu"
        nominalQuota: 0
    - name: "gpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 10
      - name: "memory"
        nominalQuota: 50Gi
      - name: "pods"
        nominalQuota: 10
      - name: "nvidia.com/gpu"
        nominalQuota: 2
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: default
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
6. Testing
6.1 I've been using Ted's script, changing the image tag depending on the fms-hf-tuning image we want to use.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
  namespace: default
data:
  config.json: |
    {
      "accelerate_launch_args": {
        "num_machines": 1,
        "num_processes": 2
      },
      "model_name_or_path": "bigscience/bloom-560m",
      "training_data_path": "/etc/config/twitter_complaints_small.json",
      "output_dir": "/tmp/out",
      "num_train_epochs": 1.0,
      "per_device_train_batch_size": 4,
      "per_device_eval_batch_size": 4,
      "gradient_accumulation_steps": 4,
      "eval_strategy": "no",
      "save_strategy": "epoch",
      "learning_rate": 1e-5,
      "weight_decay": 0.0,
      "lr_scheduler_type": "cosine",
      "logging_steps": 1.0,
      "packing": false,
      "include_tokens_per_second": true,
      "response_template": "\n### Label:",
      "dataset_text_field": "output",
      "use_flash_attn": false,
      "torch_dtype": "float32",
      "peft_method": "pt",
      "tokenizer_name_or_path": "bigscience/bloom"
    }
  twitter_complaints_small.json: |
    {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}
    {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}
    {"Tweet text":"If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService","ID":2,"Label":1,"text_label":"complaint","output":"### Text: If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService\n\n### Label: complaint"}
    {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"}
    {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"}
    {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"}
    {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"}
    {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"}
    {"Tweet text":"Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora","ID":8,"Label":1,"text_label":"complaint","output":"### Text: Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora\n\n### Label: complaint"}
    {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd\n\n### Label: no complaint"}
---
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: ted-kfto-sft
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: lq-trainer
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never # Do not restart the pod on failure. If you do set it to OnFailure, be sure to also set backoffLimit
      template:
        spec:
          containers:
          - name: pytorch
            # This is the temp location until the image is officially released
            #image: image-registry.openshift-image-registry.svc:5000/opendatahub/fms-hf-tuning:0.0.1rc7
            #image: quay.io/jbusche/fms-hf-tuning:issue758-1
            #image: quay.io/modh/fms-hf-tuning:01b3824c9aba22d9d0695399681e6f0507840e7f
            #image: quay.io/modh/fms-hf-tuning:a130d1c890501a4fac1d9522f1198b6273ade2d4
            image: quay.io/modh/fms-hf-tuning:release
            imagePullPolicy: IfNotPresent
            command:
            - "python"
            - "/app/accelerate_launch.py"
            env:
            - name: SFT_TRAINER_CONFIG_JSON_PATH
              value: /etc/config/config.json
            volumeMounts:
            - name: config-volume
              mountPath: /etc/config
          volumes:
          - name: config-volume
            configMap:
              name: my-config
              items:
              - key: config.json
                path: config.json
              - key: twitter_complaints_small.json
                path: twitter_complaints_small.json
EOF
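Because the job carries the kueue.x-k8s.io/queue-name label, Kueue wraps it in a Workload object and only admits it when the ClusterQueue has quota. If the job seems stuck, that's the first place to look:

```shell
# ADMITTED should flip to True once cq-small has room for the job
oc get workloads -n default
# Shows current quota usage and any pending workloads
oc describe clusterqueue cq-small
```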
6.2 And then, in a perfect world, it'll start up a PyTorchJob that runs to completion:
watch oc get pytorchjobs,pods -n default
and it'll look like this:
Every 2.0s: oc get pytorchjobs,pods -n default api.ted414.cp.fyre.ibm.com: Wed Apr 24 18:34:49 2024
NAME STATE AGE
pytorchjob.kubeflow.org/ted-kfto-sft Succeeded 58m
NAME READY STATUS RESTARTS AGE
pod/ted-kfto-sft-master-0 0/1 Completed 0 58m
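Once the master pod shows Completed, you can pull its logs to confirm the tuning run actually trained (loss values, tokens/sec, etc.):

```shell
# Tail the end of the training output from the finished master pod
oc logs ted-kfto-sft-master-0 -n default | tail -20
```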
7. Cleanup
7.1 Cleanup of your PyTorchJob and ConfigMap:
oc delete pytorchjob ted-kfto-sft -n default
oc delete cm my-config -n default
7.2 Cleanup of your Kueue resources, if you want that:
oc delete flavor gpu-flavor non-gpu-flavor cpu-flavor
oc delete cq cq-small
oc delete lq lq-trainer
oc delete WorkloadPriorityClass p1 p2
7.3 Cleanup of dsc items (if you want that)
oc delete dsc default-dsc
7.4 Cleanup of DSCI (if you want that)
oc delete dsci default-dsci
7.5 Cleanup of the Operators (if you want that)
oc delete sub rhods-operator -n redhat-ods-operator
oc delete csv rhods-operator.2.12.0 -n redhat-ods-operator
7.6 Cleanup of the operatorgroup
oc delete OperatorGroup rhods-operator -n redhat-ods-operator