Installing and Testing OpenShift fms-hf-tuning Stack
0.1 An OpenShift cluster
0.2 Logged in to the OpenShift web console
0.3 Logged in to the terminal with oc login
0.4 An opendatahub namespace created
1. Install the OpenDataHub Operator
From the terminal where you're logged in with oc login, issue this command:
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/opendatahub-operator.openshift-operators: ""
  name: opendatahub-operator
  namespace: openshift-operators
spec:
  channel: fast
  installPlanApproval: Automatic
  name: opendatahub-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: opendatahub-operator.v2.11.0
EOF
You can check it started with:
oc get pods -n openshift-operators
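Rather than polling the pod list, you can block until the operator's CSV reports success. This is a sketch; the CSV name below is an assumption based on the startingCSV in the Subscription above:

```shell
# Wait (up to 5 minutes) for the opendatahub-operator CSV to report Succeeded
oc wait --for=jsonpath='{.status.phase}'=Succeeded \
  csv/opendatahub-operator.v2.11.0 -n openshift-operators --timeout=300s
```

Note that --for=jsonpath requires a reasonably recent oc/kubectl (1.23+).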
2. Install dependent operators
2.1 Install the Service Mesh operator
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: servicemeshoperator.v2.5.0
EOF
And then check it with:
oc get pods -n openshift-operators
2.2 Install Authorino Operator
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/authorino-operator.openshift-operators: ""
  name: authorino-operator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: authorino-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: authorino-operator.v0.11.1
EOF
And then check it with:
oc get pods -n openshift-operators
3. Create the DSCInitialization
cat << EOF | oc apply -f -
kind: DSCInitialization
apiVersion: dscinitialization.opendatahub.io/v1
metadata:
  name: default-dsci
spec:
  applicationsNamespace: opendatahub
  monitoring:
    managementState: Managed
    namespace: opendatahub
  serviceMesh:
    auth:
      audiences:
      - https://kubernetes.default.svc
    controlPlane:
      metricsCollection: Istio
      name: data-science-smcp
      namespace: istio-system
    managementState: Managed
  trustedCABundle:
    customCABundle: ""
    managementState: Managed
EOF
Then check it; it should reach the "Ready" state after about a minute or so:
oc get dsci
Note that you'll also see the Istio control plane starting up in the istio-system namespace:
oc get pods -n istio-system
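If you'd rather wait for the DSCInitialization programmatically, something like this should work (assumes the DSCI exposes its state in .status.phase, and an oc/kubectl new enough for --for=jsonpath):

```shell
# Block (up to 10 minutes) until the default-dsci reports Ready
oc wait --for=jsonpath='{.status.phase}'=Ready dsci/default-dsci --timeout=600s
```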
4. Create the DataScienceCluster
cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Removed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    modelregistry:
      managementState: Removed
    ray:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      managementState: Removed
    workbenches:
      managementState: Removed
EOF
Check that the pods are running:
oc get pods -n opendatahub
You should see pods similar to these:
NAME READY STATUS RESTARTS AGE
kubeflow-training-operator-dc9cf9bb5-595xx 1/1 Running 0 4h50m
kueue-controller-manager-66768ccc94-4xq4v 1/1 Running 0 4h51m
odh-dashboard-5969fd7b5b-gd6rt 2/2 Running 0 4h51m
odh-dashboard-5969fd7b5b-xd7qj 2/2 Running 0 4h51m
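To wait on these instead of eyeballing the pod list, you can check deployment availability. The deployment names below are taken from the pod names above, so adjust if your cluster differs:

```shell
# Wait for the opendatahub deployments to become Available
oc wait --for=condition=Available deployment/kubeflow-training-operator -n opendatahub --timeout=300s
oc wait --for=condition=Available deployment/kueue-controller-manager -n opendatahub --timeout=300s
oc wait --for=condition=Available deployment/odh-dashboard -n opendatahub --timeout=300s
```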
Note: if you're having pull issues from docker.io, you can change your deployment to pull from quay.io instead with this:
oc set image deployment kubeflow-training-operator training-operator=quay.io/jbusche/training-operator:v1-855e096 -n opendatahub
Note: the initContainer automatically pulls from docker.io/alpine:3.10, which causes trouble on clusters that are rate-limited by docker.io. To get around this, you can run the following command to patch the training-operator to use a different repo for the initContainer:
oc patch deployment kubeflow-training-operator -n opendatahub --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["/manager", "--pytorch-init-container-image=quay.io/jbusche/alpine:3.10"]}]'
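To confirm the patch took effect, you can read back the manager's command line (this jsonpath assumes the manager is the first container in the pod spec, as in the patch above):

```shell
# Should print the /manager command including the --pytorch-init-container-image flag
oc get deployment kubeflow-training-operator -n opendatahub \
  -o jsonpath='{.spec.template.spec.containers[0].command}'
```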
5. Create Kueue resources
cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "cpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 5
      - name: "memory"
        nominalQuota: 20Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: default
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
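After applying, you can verify all four Kueue object types were created:

```shell
# Each command should list the object created above
oc get resourceflavor cpu-flavor
oc get clusterqueue cq-small
oc get localqueue lq-trainer -n default
oc get workloadpriorityclass p1 p2
```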
6. Run a sample fms-hf-tuning PyTorchJob
The manifest below is based on Ted's script; change the image tag to match the fms-hf-tuning image you want to use.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
data:
  config.json: |
    {
      "accelerate_launch_args": {
        "num_machines": 1,
        "num_processes": 2
      },
      "model_name_or_path": "bigscience/bloom-560m",
      "training_data_path": "/etc/config/twitter_complaints_small.json",
      "output_dir": "/tmp/out",
      "num_train_epochs": 1.0,
      "per_device_train_batch_size": 4,
      "per_device_eval_batch_size": 4,
      "gradient_accumulation_steps": 4,
      "evaluation_strategy": "no",
      "save_strategy": "epoch",
      "learning_rate": 1e-5,
      "weight_decay": 0.0,
      "lr_scheduler_type": "cosine",
      "logging_steps": 1.0,
      "packing": false,
      "include_tokens_per_second": true,
      "response_template": "\n### Label:",
      "dataset_text_field": "output",
      "use_flash_attn": false,
      "torch_dtype": "float32",
      "peft_method": "pt",
      "tokenizer_name_or_path": "bigscience/bloom"
    }
  twitter_complaints_small.json: |
    {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}
    {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}
    {"Tweet text":"If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService","ID":2,"Label":1,"text_label":"complaint","output":"### Text: If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService\n\n### Label: complaint"}
    {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"}
    {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"}
    {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"}
    {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"}
    {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"}
    {"Tweet text":"Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora","ID":8,"Label":1,"text_label":"complaint","output":"### Text: Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora\n\n### Label: complaint"}
    {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd\n\n### Label: no complaint"}
---
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: ted-kfto-sft
  labels:
    kueue.x-k8s.io/queue-name: lq-trainer
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never # Do not restart the pod on failure. If you do set it to OnFailure, be sure to also set backoffLimit
      template:
        spec:
          containers:
          - name: pytorch
            # This is the temp location until the image is officially released
            #image: image-registry.openshift-image-registry.svc:5000/opendatahub/fms-hf-tuning:0.0.1rc7
            #image: quay.io/jbusche/fms-hf-tuning:issue758-1
            image: quay.io/modh/fms-hf-tuning:01b3824c9aba22d9d0695399681e6f0507840e7f
            imagePullPolicy: IfNotPresent
            command:
            - "python"
            - "/app/accelerate_launch.py"
            env:
            - name: SFT_TRAINER_CONFIG_JSON_PATH
              value: /etc/config/config.json
            volumeMounts:
            - name: config-volume
              mountPath: /etc/config
          volumes:
          - name: config-volume
            configMap:
              name: my-config
              items:
              - key: config.json
                path: config.json
              - key: twitter_complaints_small.json
                path: twitter_complaints_small.json
EOF
If all goes well, the PyTorchJob will start up and run to completion:
watch oc get pytorchjobs,pods
and the output will look like this:
Every 2.0s: oc get pytorchjobs,pods api.ted414.cp.fyre.ibm.com: Wed Apr 24 18:34:49 2024
NAME STATE AGE
pytorchjob.kubeflow.org/ted-kfto-sft Succeeded 58m
NAME READY STATUS RESTARTS AGE
pod/ted-kfto-sft-master-0 0/1 Completed 0 58m
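To dig into the result, you can pull the training logs from the master pod and confirm that Kueue admitted the job (the Workload object name is generated, so list rather than get by name):

```shell
# Training output from the completed master pod
oc logs ted-kfto-sft-master-0
# The Kueue Workload created for the PyTorchJob; ADMITTED should be True
oc get workloads -n default
```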