ML Batch Testing on OpenShift with PyTorchJob

James Busche edited this page Jul 2, 2024 · 2 revisions

I'm trying to install with these instructions: https://github.com/project-codeflare/mlbatch/blob/main/SETUP.md

Prerequisites

0.1 An OpenShift cluster up and running. (I've been using OpenShift 4.14.17.)

0.2 Logged in to the OpenShift UI. Note: I install everything from the command line, but I need the UI's "Copy login command" to get the oc login token.

0.3 Logged in from the terminal with oc login. For example:

oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443

0.4 The project cloned into a local folder. For example:

git clone --recursive https://github.com/project-codeflare/mlbatch.git
cd mlbatch
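
Before moving on, prerequisites 0.3 and 0.4 can be sanity-checked from the terminal. The following is a minimal sketch, not part of SETUP.md: `check_prereqs` is a hypothetical helper name, and it assumes you are sitting in the top level of the mlbatch clone (where SETUP.md lives).

```shell
# Hypothetical helper: confirm oc is installed, the session is logged in,
# and the current directory looks like the mlbatch clone.
check_prereqs() {
  command -v oc >/dev/null 2>&1 || { echo "oc not found on PATH" >&2; return 1; }
  # oc whoami fails when there is no valid login token.
  user="$(oc whoami 2>/dev/null)" || { echo "not logged in; run oc login first" >&2; return 1; }
  # SETUP.md sits at the top level of the mlbatch repository.
  [ -f SETUP.md ] || { echo "SETUP.md not found; cd into the mlbatch clone" >&2; return 1; }
  echo "logged in as ${user} on $(oc whoami --show-server)"
}
```

Running `check_prereqs` prints the current user and API server on success, and returns a non-zero status with a short message otherwise.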

Installation

I followed SETUP.md very closely and it worked. The only difference, I think, is changing KubeRay in the DataScienceCluster (dsc) to managementState: Removed.
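
One way to make the KubeRay change is a merge patch against the DataScienceCluster. This is a sketch under two assumptions that are not stated above: that the KubeRay component key in the dsc spec is `ray`, and that you look up the dsc name yourself with `oc get dsc`. The function names are hypothetical helpers.

```shell
# Print the JSON merge patch that sets one dsc component to Removed.
# Assumption: the component key for KubeRay is "ray".
dsc_removed_patch() {
  printf '{"spec":{"components":{"%s":{"managementState":"Removed"}}}}' "$1"
}

# Apply that patch to the named DataScienceCluster.
# $1 = dsc name (from `oc get dsc`), $2 = component key.
set_dsc_component_removed() {
  oc patch datasciencecluster "$1" --type merge -p "$(dsc_removed_patch "$2")"
}
```

For example, `set_dsc_component_removed <your-dsc-name> ray`. Editing the dsc YAML before applying it, as SETUP.md does for the other components, should have the same effect.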

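Run the PyTorchJob

I applied a ConfigMap holding the tuning configuration and a small training dataset, together with an AppWrapper that wraps the PyTorchJob: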

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
  namespace: team1
data:
  config.json: |
    {
      "accelerate_launch_args": {
        "num_machines": 1,
        "num_processes": 2
      },
      "model_name_or_path": "bigscience/bloom-560m",
      "training_data_path": "/etc/config/twitter_complaints_small.json",
      "output_dir": "/tmp/out",
      "num_train_epochs": 1.0,
      "per_device_train_batch_size": 4,
      "per_device_eval_batch_size": 4,
      "gradient_accumulation_steps": 4,
      "eval_strategy": "no",
      "save_strategy": "epoch",
      "learning_rate": 1e-5,
      "weight_decay": 0.0,
      "lr_scheduler_type": "cosine",
      "logging_steps": 1.0,
      "packing": false,
      "include_tokens_per_second": true,
      "response_template": "\n### Label:",
      "dataset_text_field": "output",
      "use_flash_attn": false,
      "torch_dtype": "float32",
      "peft_method": "pt",
      "tokenizer_name_or_path": "bigscience/bloom"
    }
  twitter_complaints_small.json: |
    {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}
    {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}
    {"Tweet text":"If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService","ID":2,"Label":1,"text_label":"complaint","output":"### Text: If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService\n\n### Label: complaint"}
    {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"}
    {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"}
    {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"}
    {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"}
    {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"}
    {"Tweet text":"Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora","ID":8,"Label":1,"text_label":"complaint","output":"### Text: Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora\n\n### Label: complaint"}
    {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd\n\n### Label: no complaint"}
---
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-aw-pytorchjob
  namespace: team1
spec:
  components:
  - template:
      # job specification
      apiVersion: "kubeflow.org/v1"
      kind: PyTorchJob
      metadata:
        name: ted-kfto-sft
        namespace: team1
        labels:
          kueue.x-k8s.io/queue-name: default-queue
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            #restartPolicy: Never # Do not restart the pod on failure. If you do set it to OnFailure, be sure to also set backoffLimit
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    # This is the temporary location until the image is officially released
                    image: quay.io/modh/fms-hf-tuning:release
                    imagePullPolicy: IfNotPresent
                    command:
                      - "python"
                      - "/app/accelerate_launch.py"
                    env:
                      - name: SFT_TRAINER_CONFIG_JSON_PATH
                        value: /etc/config/config.json
                    volumeMounts:
                    - name: config-volume
                      mountPath: /etc/config
                volumes:
                - name: config-volume
                  configMap:
                    name: my-config
                    items:
                    - key: config.json
                      path: config.json
                    - key: twitter_complaints_small.json
                      path: twitter_complaints_small.json
EOF

Then I watched it run:

watch oc get appwrapper,pytorchjobs,pods

For example, after the job completed:

NAME                                                     STATUS      QUOTA RESERVED   RESOURCES DEPLOYED   UNHEALTHY
appwrapper.workload.codeflare.dev/sample-aw-pytorchjob   Succeeded   False            True                 False

NAME                                   STATE       AGE
pytorchjob.kubeflow.org/ted-kfto-sft   Succeeded   42m

NAME                        READY   STATUS      RESTARTS   AGE
pod/ted-kfto-sft-master-0   0/1     Completed   0          42m
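
A few follow-up commands can be useful once the job has been submitted. This is a sketch using the resource names from the manifest above, and it assumes everything lives in the team1 namespace. The function names are hypothetical helpers, and the 60m default timeout is an arbitrary choice.

```shell
# Block until the PyTorchJob reports the Succeeded condition, as a
# non-interactive alternative to watch. If the job fails instead, this
# keeps waiting until the timeout expires.
wait_for_job() {
  oc wait --for=condition=Succeeded pytorchjob/ted-kfto-sft -n team1 --timeout="${1:-60m}"
}

# Print the training log from the master pod.
show_job_logs() {
  oc logs pod/ted-kfto-sft-master-0 -n team1
}

# Delete the AppWrapper first, then the ConfigMap that held the config and data.
cleanup_job() {
  oc delete appwrapper sample-aw-pytorchjob -n team1
  oc delete configmap my-config -n team1
}
```

Deleting the AppWrapper should also remove the PyTorchJob it wraps, so the job does not need a separate delete.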