
Installing and Testing OpenShift fms-hf-tuning Stack

0. Prerequisites

0.1 An OpenShift cluster

0.2 Logged into the OpenShift console (UI)

0.3 Also logged into the cluster from a terminal with oc login

0.4 An opendatahub namespace created

1. Install ODH with Fast Channel

From the terminal where you're logged in with oc login, run this command:

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/opendatahub-operator.openshift-operators: ""
  name: opendatahub-operator
  namespace: openshift-operators
spec:
  channel: fast
  installPlanApproval: Automatic
  name: opendatahub-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: opendatahub-operator.v2.11.0
EOF

You can check that it started with:

oc get pods -n openshift-operators
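
If you want a more specific check, you can also watch the operator's ClusterServiceVersion until its PHASE reports Succeeded (the CSV name below matches the startingCSV in the Subscription above):

# Watch the ODH operator CSV until it reaches Succeeded
oc get csv opendatahub-operator.v2.11.0 -n openshift-operators -w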

2. Install the DSCI prerequisite Operators

2.1 Install Service Mesh Operator

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: servicemeshoperator.v2.5.0
EOF

And then check it with:

oc get pods -n openshift-operators
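
If you'd rather block until the operator is up instead of polling, something like this should work (the deployment name istio-operator is what the Service Mesh operator creates in current releases, but verify it on your cluster):

# Wait for the Service Mesh operator deployment to become Available
oc wait deployment/istio-operator -n openshift-operators --for=condition=Available --timeout=300s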

2.2 Install Authorino Operator

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/authorino-operator.openshift-operators: ""
  name: authorino-operator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: authorino-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: authorino-operator.v0.11.1
EOF

And then check it with:

oc get pods -n openshift-operators
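
Since openshift-operators now holds several operators, it can help to filter for the Authorino pod specifically:

# Show only the Authorino operator pod
oc get pods -n openshift-operators | grep authorino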

3. Install DSCI

cat << EOF | oc apply -f -
kind: DSCInitialization
apiVersion: dscinitialization.opendatahub.io/v1
metadata:
  name: default-dsci
spec:
  applicationsNamespace: opendatahub
  monitoring:
    managementState: Managed
    namespace: opendatahub
  serviceMesh:
    auth:
      audiences:
      - https://kubernetes.default.svc
    controlPlane:
      metricsCollection: Istio
      name: data-science-smcp
      namespace: istio-system
    managementState: Managed
  trustedCABundle:
    customCABundle: ""
    managementState: Managed
EOF

Then check it; it should reach the "Ready" state after a minute or so:

oc get dsci
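
If you'd rather wait on it programmatically, a recent oc/kubectl supports waiting on a JSONPath condition (this assumes the DSCI reports its state in .status.phase, as it does in current ODH releases):

# Block until the DSCI reports Ready, or time out after 5 minutes
oc wait dsci/default-dsci --for=jsonpath='{.status.phase}'=Ready --timeout=300s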

Also note that the Istio control plane will start up in the istio-system namespace (as named in the DSCI above):

oc get pods -n istio-system

4. Install the DSC

cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Removed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    modelregistry:
      managementState: Removed
    ray:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      managementState: Removed
    workbenches:
      managementState: Removed
EOF

Check that the pods are running:

oc get pods -n opendatahub

You should see these pods:

oc get pods -n opendatahub
NAME                                         READY   STATUS    RESTARTS   AGE
kubeflow-training-operator-dc9cf9bb5-595xx   1/1     Running   0          4h50m
kueue-controller-manager-66768ccc94-4xq4v    1/1     Running   0          4h51m
odh-dashboard-5969fd7b5b-gd6rt               2/2     Running   0          4h51m
odh-dashboard-5969fd7b5b-xd7qj               2/2     Running   0          4h51m
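
You can also confirm the DataScienceCluster itself reports Ready (like the DSCI, it exposes a phase in its status):

# Check the overall DSC phase
oc get dsc default-dsc -o jsonpath='{.status.phase}{"\n"}'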

Note: if you're having image pull issues from docker.io, you can switch the deployment to pull from quay.io instead:

oc set image deployment kubeflow-training-operator training-operator=quay.io/jbusche/training-operator:v1-855e096 -n opendatahub

Note: the initContainer automatically pulls from docker.io/alpine:3.10, which causes trouble on clusters that are rate-limited by docker.io. To get around this, you can run the following command to patch the training-operator to use a different repo for the initContainer image:

oc patch deployment kubeflow-training-operator -n opendatahub --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["/manager",  "--pytorch-init-container-image=quay.io/jbusche/alpine:3.10"]}]'
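
After either change, you can confirm the rollout finished and the patched command took effect:

# Wait for the new pods to roll out, then print the patched container command
oc rollout status deployment/kubeflow-training-operator -n opendatahub
oc get deployment kubeflow-training-operator -n opendatahub -o jsonpath='{.spec.template.spec.containers[0].command}{"\n"}'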

Now configure your Kueue minimum requirements:

cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "cpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 5
      - name: "memory"
        nominalQuota: 20Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: default
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
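
You can verify the queueing objects were created before submitting any jobs (the ResourceFlavor, ClusterQueue, and WorkloadPriorityClass are cluster-scoped; the LocalQueue lives in default):

# Confirm the Kueue objects exist
oc get resourceflavor,clusterqueue,workloadpriorityclass
oc get localqueue -n default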

Testing

I've been using Ted's script, changing the image tag to whichever fms-hf-tuning image we want to test.

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
data:
  config.json: |
    {
      "accelerate_launch_args": {
        "num_machines": 1,
        "num_processes": 2
      },
      "model_name_or_path": "bigscience/bloom-560m",
      "training_data_path": "/etc/config/twitter_complaints_small.json",
      "output_dir": "/tmp/out",
      "num_train_epochs": 1.0,
      "per_device_train_batch_size": 4,
      "per_device_eval_batch_size": 4,
      "gradient_accumulation_steps": 4,
      "evaluation_strategy": "no",
      "save_strategy": "epoch",
      "learning_rate": 1e-5,
      "weight_decay": 0.0,
      "lr_scheduler_type": "cosine",
      "logging_steps": 1.0,
      "packing": false,
      "include_tokens_per_second": true,
      "response_template": "\n### Label:",
      "dataset_text_field": "output",
      "use_flash_attn": false,
      "torch_dtype": "float32",
      "peft_method": "pt",
      "tokenizer_name_or_path": "bigscience/bloom"
    }
  twitter_complaints_small.json: |
    {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}
    {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}
    {"Tweet text":"If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService","ID":2,"Label":1,"text_label":"complaint","output":"### Text: If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService\n\n### Label: complaint"}
    {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"}
    {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"}
    {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"}
    {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"}
    {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"}
    {"Tweet text":"Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora","ID":8,"Label":1,"text_label":"complaint","output":"### Text: Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora\n\n### Label: complaint"}
    {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd\n\n### Label: no complaint"}
---
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: ted-kfto-sft
  labels:
    kueue.x-k8s.io/queue-name: lq-trainer
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never # Do not restart the pod on failure. If you do set it to OnFailure, be sure to also set backoffLimit
      template:
        spec:
          containers:
            - name: pytorch
              # This is the temporary location until the image is officially released
              #image: image-registry.openshift-image-registry.svc:5000/opendatahub/fms-hf-tuning:0.0.1rc7
              #image: quay.io/jbusche/fms-hf-tuning:issue758-1
              image: quay.io/modh/fms-hf-tuning:01b3824c9aba22d9d0695399681e6f0507840e7f
              imagePullPolicy: IfNotPresent
              command:
                - "python"
                - "/app/accelerate_launch.py"
              env:
                - name: SFT_TRAINER_CONFIG_JSON_PATH
                  value: /etc/config/config.json
              volumeMounts:
              - name: config-volume
                mountPath: /etc/config
          volumes:
          - name: config-volume
            configMap:
              name: my-config
              items:
              - key: config.json
                path: config.json
              - key: twitter_complaints_small.json
                path: twitter_complaints_small.json
EOF
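
Once applied, Kueue should admit the job through lq-trainer and the master pod will start. Assuming you created the job in the default namespace (where lq-trainer lives), you can follow along with:

# Check that Kueue admitted the workload
oc get workloads.kueue.x-k8s.io -n default

# Follow the training logs once the master pod is running
oc logs -f ted-kfto-sft-master-0 -n default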

And then, in a perfect world, it'll start up a PyTorchJob and run to completion:

watch oc get pytorchjobs,pods

and it'll look like this:

Every 2.0s: oc get pytorchjobs,pods                                                           api.ted414.cp.fyre.ibm.com: Wed Apr 24 18:34:49 2024

NAME                                   STATE       AGE
pytorchjob.kubeflow.org/ted-kfto-sft   Succeeded   58m

NAME                        READY   STATUS      RESTARTS   AGE
pod/ted-kfto-sft-master-0   0/1     Completed   0          58m
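
When you're done, you can clean up the test job and its ConfigMap so the run can be repeated:

# Remove the PyTorchJob and ConfigMap (adjust -n if you used a different namespace)
oc delete pytorchjob ted-kfto-sft -n default
oc delete configmap my-config -n default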