RHOAI Install and Testing

Refer to the Red Hat docs here for more detail: https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.12/html-single/installing_and_uninstalling_openshift_ai_self-managed/index

0. Prerequisites

0.1 An OpenShift cluster up and running. (I've been using OpenShift 4.14.17)

0.2 Logged in to the OpenShift web console. Note: I install everything from the command line, but I need the console's "Copy login command" to get the oc login token.

0.3 Also logged in from the terminal with oc login. For example:

oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443
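
You can confirm the login worked with:

oc whoami --show-server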

Note: If you have a GPU cluster:

0.4 You'll also need the GPU prerequisites from here: https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html
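
Once the NVIDIA GPU Operator stack is installed, a quick sanity check that your GPU nodes are labeled (this assumes the nvidia.com/gpu.present label applied by NVIDIA's GPU Feature Discovery):

oc get nodes -l nvidia.com/gpu.present=true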

1. Install the Red Hat OpenShift AI Operator

1.1 Create a namespace:

cat << EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-operator 
EOF

1.2 Create an OperatorGroup

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
EOF

1.3 Install the Service Mesh operator

Note: if you are installing in production, you probably want installPlanApproval: Manual so that you aren't surprised by operator updates before you've had a chance to verify them on a dev/stage server first. (See the approval sketch at the end of this step.)

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: servicemeshoperator.v2.6.2
EOF

and make sure it works:

watch oc get pods -n openshift-operators

and it should look something like this:

NAME                              READY   STATUS    RESTARTS   AGE
istio-operator-6c99f6bf7b-rrh2j   1/1     Running   0          13m
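
If you did choose installPlanApproval: Manual, the install waits for you to approve the InstallPlan. A minimal sketch of approving it (install-abcde is a placeholder; look up the generated name first):

oc get installplan -n openshift-operators
oc patch installplan install-abcde -n openshift-operators --type merge -p '{"spec":{"approved":true}}'   # install-abcde is a placeholder name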

1.4 Create a Subscription for the RHOAI operator (again, consider changing installPlanApproval to Manual in production)

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator 
spec:
  name: rhods-operator
  channel: fast
  installPlanApproval: Automatic 
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

And watch that it starts:

watch oc get pods -n redhat-ods-operator
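
You can also check that the operator's CSV reaches the Succeeded phase:

oc get csv -n redhat-ods-operator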

2. Monitor DSCI

Watch the DSCInitialization (dsci) that the operator creates until it's Ready:

watch oc get dsci

and it'll finish up like this:

NAME           AGE   PHASE   CREATED AT
default-dsci   16m   Ready   2024-07-02T19:56:18Z
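
If it sticks in another phase, the status conditions usually explain why:

oc describe dsci default-dsci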

3. Install the Red Hat OpenShift AI components via DSC

cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      defaultDeploymentMode: RawDeployment
      serving:
        ingressGateway:
          certificate:
            secretName: knative-serving-cert
            type: SelfSigned
        managementState: Managed
        name: knative-serving 
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Removed
    workbenches:
      managementState: Removed
    trainingoperator:
      managementState: Managed
EOF
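
As with the DSCI, you can watch the DSC until it reports Ready (I'd expect output much like the dsci above, though the exact columns can vary by version):

watch oc get dsc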

4. Check that everything is running

4.1 Check that your operators are running:

oc get pods -n redhat-ods-operator

Will return:

NAME                              READY   STATUS    RESTARTS   AGE
rhods-operator-7c54d9d6b5-j97mv   1/1     Running   0          22h

4.2 Check that the service mesh operator is running:

oc get pods -n openshift-operators 

Will return:

NAME                              READY   STATUS    RESTARTS        AGE
istio-cni-node-v2-5-9qkw7         1/1     Running   0               84s
istio-cni-node-v2-5-dbtz5         1/1     Running   0               84s
istio-cni-node-v2-5-drc9l         1/1     Running   0               84s
istio-cni-node-v2-5-k4x4t         1/1     Running   0               84s
istio-cni-node-v2-5-pbltn         1/1     Running   0               84s
istio-cni-node-v2-5-xbmz5         1/1     Running   0               84s
istio-operator-6c99f6bf7b-4ckdx   1/1     Running   1 (2m39s ago)   2m56s

4.3 Check that the DSC components are running:

watch oc get pods -n redhat-ods-applications

Will return:

NAME                                          READY   STATUS    RESTARTS   AGE
kubeflow-training-operator-77b578788c-bbgfk   1/1     Running   0          22h
kueue-controller-manager-6b44689c95-r6qnq     1/1     Running   0          22h
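
You can also confirm the CRDs those controllers serve were installed:

oc get crd pytorchjobs.kubeflow.org clusterqueues.kueue.x-k8s.io localqueues.kueue.x-k8s.io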

5. Configure your Kueue minimum requirements:

cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "cpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 5
      - name: "memory"
        nominalQuota: 20Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 5
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: default
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
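
Then verify the Kueue objects were created (the LocalQueue is namespaced; the others are cluster-scoped):

oc get resourceflavor,clusterqueue,workloadpriorityclass
oc get localqueue -n default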

Note: Here's an alternative Kueue requirements file with GPU that you could use as a guide...

cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "non-gpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "gpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "pods", "nvidia.com/gpu"]
    flavors:
    - name: "non-gpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 10
      - name: "memory"
        nominalQuota: 50Gi
      - name: "pods"
        nominalQuota: 10
      - name: "nvidia.com/gpu"
        nominalQuota: 0
    - name: "gpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 10
      - name: "memory"
        nominalQuota: 50Gi
      - name: "pods"
        nominalQuota: 10
      - name: "nvidia.com/gpu"
        nominalQuota: 2
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: default
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
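
Note that neither file above actually pins gpu-flavor to GPU nodes. A minimal sketch of doing that with the ResourceFlavor's nodeLabels field, assuming the nvidia.com/gpu.present label that NVIDIA GPU Feature Discovery applies:

cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "gpu-flavor"
spec:
  nodeLabels:
    nvidia.com/gpu.present: "true"   # label assumed from NVIDIA GPU Feature Discovery
EOF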

6. Testing

6.1 I've been using Ted's script, changing the image tag to match whichever fms-hf-tuning image we want to test.

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
  namespace: default
data:
  config.json: |
    {
      "accelerate_launch_args": {
        "num_machines": 1,
        "num_processes": 2
      },
      "model_name_or_path": "bigscience/bloom-560m",
      "training_data_path": "/etc/config/twitter_complaints_small.json",
      "output_dir": "/tmp/out",
      "num_train_epochs": 1.0,
      "per_device_train_batch_size": 4,
      "per_device_eval_batch_size": 4,
      "gradient_accumulation_steps": 4,
      "eval_strategy": "no",
      "save_strategy": "epoch",
      "learning_rate": 1e-5,
      "weight_decay": 0.0,
      "lr_scheduler_type": "cosine",
      "logging_steps": 1.0,
      "packing": false,
      "include_tokens_per_second": true,
      "response_template": "\n### Label:",
      "dataset_text_field": "output",
      "use_flash_attn": false,
      "torch_dtype": "float32",
      "peft_method": "pt",
      "tokenizer_name_or_path": "bigscience/bloom"
    }
  twitter_complaints_small.json: |
    {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}
    {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}
    {"Tweet text":"If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService","ID":2,"Label":1,"text_label":"complaint","output":"### Text: If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService\n\n### Label: complaint"}
    {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"}
    {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"}
    {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"}
    {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"}
    {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"}
    {"Tweet text":"Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora","ID":8,"Label":1,"text_label":"complaint","output":"### Text: Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora\n\n### Label: complaint"}
    {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd\n\n### Label: no complaint"}
---
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: ted-kfto-sft
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: lq-trainer
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never # Do not restart the pod on failure. If you do set it to OnFailure, be sure to also set backoffLimit
      template:
        spec:
          containers:
            - name: pytorch
# This is the temp location until the image is officially released
              #image: image-registry.openshift-image-registry.svc:5000/opendatahub/fms-hf-tuning:0.0.1rc7
              #image: quay.io/jbusche/fms-hf-tuning:issue758-1
              #image: quay.io/modh/fms-hf-tuning:01b3824c9aba22d9d0695399681e6f0507840e7f
              #image: quay.io/modh/fms-hf-tuning:a130d1c890501a4fac1d9522f1198b6273ade2d4
              image: quay.io/modh/fms-hf-tuning:release
              imagePullPolicy: IfNotPresent
              command:
                - "python"
                - "/app/accelerate_launch.py"
              env:
                - name: SFT_TRAINER_CONFIG_JSON_PATH
                  value: /etc/config/config.json
              volumeMounts:
              - name: config-volume
                mountPath: /etc/config
          volumes:
          - name: config-volume
            configMap:
              name: my-config
              items:
              - key: config.json
                path: config.json
              - key: twitter_complaints_small.json
                path: twitter_complaints_small.json
EOF
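
Because the PyTorchJob carries the kueue.x-k8s.io/queue-name label, Kueue should create a Workload object for it and admit it against cq-small. You can check that with:

oc get workloads -n default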

6.2 And then in a perfect world, it'll start up a pytorchjob and run to completion:

watch oc get pytorchjobs,pods -n default

and it'll look like this:

Every 2.0s: oc get pytorchjobs,pods -n default                                api.ted414.cp.fyre.ibm.com: Wed Apr 24 18:34:49 2024

NAME                                   STATE       AGE
pytorchjob.kubeflow.org/ted-kfto-sft   Succeeded   58m

NAME                        READY   STATUS      RESTARTS   AGE
pod/ted-kfto-sft-master-0   0/1     Completed   0          58m
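
To see the training output itself, pull the logs from the master pod:

oc logs ted-kfto-sft-master-0 -n default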

7. Cleanup

7.1 Cleanup of your PyTorchJob and ConfigMap:

oc delete pytorchjob ted-kfto-sft -n default
oc delete cm my-config -n default

7.2 Cleanup of your Kueue resources, if you want that:

oc delete flavor gpu-flavor non-gpu-flavor cpu-flavor
oc delete cq cq-small
oc delete lq lq-trainer
oc delete WorkloadPriorityClass p1 p2

7.3 Cleanup of dsc items (if you want that)

oc delete dsc default-dsc

7.4 Cleanup of DSCI (if you want that)

oc delete dsci default-dsci

7.5 Cleanup of the Operators (if you want that)

oc delete sub rhods-operator -n redhat-ods-operator
oc delete csv rhods-operator.2.12.0 -n redhat-ods-operator
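
Note: the CSV name embeds the installed version, so if rhods-operator.2.12.0 isn't what's on your cluster, list the actual name first:

oc get csv -n redhat-ods-operator -o name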

7.6 Cleanup of the OperatorGroup

oc delete OperatorGroup rhods-operator -n redhat-ods-operator