ML Batch Testing on OpenShift with PyTorchJob
James Busche edited this page Jul 2, 2024
I'm installing with these instructions: https://github.com/project-codeflare/mlbatch/blob/main/SETUP.md
0. Prerequisites
0.1 An OpenShift cluster up and running. (I've been using OpenShift 4.14.17.)
0.2 Logged in to the OpenShift UI. Note: I install everything from the command line, but I need the UI's "Copy login command" to get the oc login token.
0.3 Also logged in at the terminal with oc login. For example:
oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443
0.4 The project pulled down into a folder, for example:
git clone --recursive https://github.com/project-codeflare/mlbatch.git
cd mlbatch
I followed SETUP.md very closely and it worked. I think the only difference would be to change KubeRay in the dsc (DataScienceCluster) to managementState: Removed.
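For reference, that KubeRay change might look like the sketch below. It assumes the DataScienceCluster exposes KubeRay as the `ray` component and that the DSC is named `mlbatch-dsc`. Both are assumptions rather than something stated in these notes, so check with `oc get datasciencecluster` first. The `disable_kuberay` helper name is made up for illustration.

```shell
# Merge patch that marks the (assumed) ray component as Removed.
KUBERAY_PATCH='{"spec":{"components":{"ray":{"managementState":"Removed"}}}}'

# Hypothetical helper: apply the patch to the named DataScienceCluster.
disable_kuberay() {
  oc patch datasciencecluster "$1" --type merge -p "$KUBERAY_PATCH"
}

# Usage (the DSC name is an assumption; list yours with `oc get datasciencecluster`):
# disable_kuberay mlbatch-dsc
```

Editing the DSC YAML by hand and re-applying it should have the same effect; the patch just avoids touching the other components.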
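Then I created a ConfigMap holding the training config and a small dataset, plus an AppWrapper that wraps a PyTorchJob: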
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
  namespace: team1
data:
  config.json: |
    {
      "accelerate_launch_args": {
        "num_machines": 1,
        "num_processes": 2
      },
      "model_name_or_path": "bigscience/bloom-560m",
      "training_data_path": "/etc/config/twitter_complaints_small.json",
      "output_dir": "/tmp/out",
      "num_train_epochs": 1.0,
      "per_device_train_batch_size": 4,
      "per_device_eval_batch_size": 4,
      "gradient_accumulation_steps": 4,
      "eval_strategy": "no",
      "save_strategy": "epoch",
      "learning_rate": 1e-5,
      "weight_decay": 0.0,
      "lr_scheduler_type": "cosine",
      "logging_steps": 1.0,
      "packing": false,
      "include_tokens_per_second": true,
      "response_template": "\n### Label:",
      "dataset_text_field": "output",
      "use_flash_attn": false,
      "torch_dtype": "float32",
      "peft_method": "pt",
      "tokenizer_name_or_path": "bigscience/bloom"
    }
  twitter_complaints_small.json: |
    {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}
    {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}
    {"Tweet text":"If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService","ID":2,"Label":1,"text_label":"complaint","output":"### Text: If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService\n\n### Label: complaint"}
    {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"}
    {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"}
    {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"}
    {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"}
    {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"}
    {"Tweet text":"Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora","ID":8,"Label":1,"text_label":"complaint","output":"### Text: Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora\n\n### Label: complaint"}
    {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd\n\n### Label: no complaint"}
---
apiVersion: workload.codeflare.dev/v1beta2
kind: AppWrapper
metadata:
  name: sample-aw-pytorchjob
spec:
  components:
  - template:
      # job specification
      apiVersion: "kubeflow.org/v1"
      kind: PyTorchJob
      metadata:
        name: ted-kfto-sft
        namespace: team1
        labels:
          kueue.x-k8s.io/queue-name: default-queue
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            #restartPolicy: Never # Do not restart the pod on failure. If you do set it to OnFailure, be sure to also set backoffLimit
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    # This is the temporary location until the image is officially released
                    image: quay.io/modh/fms-hf-tuning:release
                    imagePullPolicy: IfNotPresent
                    command:
                      - "python"
                      - "/app/accelerate_launch.py"
                    env:
                      - name: SFT_TRAINER_CONFIG_JSON_PATH
                        value: /etc/config/config.json
                    volumeMounts:
                      - name: config-volume
                        mountPath: /etc/config
                volumes:
                  - name: config-volume
                    configMap:
                      name: my-config
                      items:
                        - key: config.json
                          path: config.json
                        - key: twitter_complaints_small.json
                          path: twitter_complaints_small.json
EOF
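One thing to watch when pasting a manifest through a shell heredoc: with an unquoted delimiter (`<<EOF`) the shell still expands `$` sequences in the body, and one of the tweets above contains `$$`, which would be replaced by the shell's process ID. Quoting the delimiter (`<<'EOF'`) passes the body through untouched. A small sketch of the difference, using a shortened piece of that dataset line:

```shell
# Unquoted delimiter: the shell expands $$ (its PID) inside the heredoc body.
expanded=$(cat <<EOF
take my $$ to @Sephora
EOF
)

# Quoted delimiter: the body is passed through literally.
literal=$(cat <<'EOF'
take my $$ to @Sephora
EOF
)

printf 'unquoted: %s\nquoted:   %s\n' "$expanded" "$literal"
```

Sequences like `\n` and `\/` in the JSON are left alone either way; only `$` and backtick expansions (and a backslash directly in front of them) are affected.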
Then I watched it run:
watch oc get appwrapper,pytorchjobs,pods
For example:
NAME                                                     STATUS      QUOTA RESERVED   RESOURCES DEPLOYED   UNHEALTHY
appwrapper.workload.codeflare.dev/sample-aw-pytorchjob   Succeeded   False            True                 False

NAME                                   STATE       AGE
pytorchjob.kubeflow.org/ted-kfto-sft   Succeeded   42m

NAME                        READY   STATUS      RESTARTS   AGE
pod/ted-kfto-sft-master-0   0/1     Completed   0          42m
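If you would rather script the wait than sit on `watch`, a polling loop along these lines might do. It is only a sketch: the `wait_for_pytorchjob` name is made up, and it assumes the PyTorchJob reports its state through `.status.conditions` entries of type `Succeeded` or `Failed` with status `"True"`.

```shell
# Hypothetical helper: poll a PyTorchJob until it reports Succeeded or Failed.
# Args: job name, namespace, max attempts (default 60), seconds between polls (default 10).
wait_for_pytorchjob() {
  wfp_tries="${3:-60}"
  wfp_delay="${4:-10}"
  wfp_i=0
  while [ "$wfp_i" -lt "$wfp_tries" ]; do
    # Types of all conditions whose status is "True", space-separated.
    wfp_state=$(oc get pytorchjob "$1" -n "$2" \
      -o jsonpath='{.status.conditions[?(@.status=="True")].type}' 2>/dev/null)
    case " $wfp_state " in
      *" Succeeded "*) echo "Succeeded"; return 0 ;;
      *" Failed "*)    echo "Failed"; return 1 ;;
    esac
    wfp_i=$((wfp_i + 1))
    sleep "$wfp_delay"
  done
  echo "TimedOut"
  return 2
}

# Usage, with the names from the manifest above:
# wait_for_pytorchjob ted-kfto-sft team1 && oc logs ted-kfto-sft-master-0 -n team1
```

The return code separates success (0), job failure (1), and timeout (2), so the helper can gate a follow-up step such as pulling the pod logs.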