Installing and Testing OpenShift fms‐hf‐tuning Stack
Note: the steps below were written by jbusche@us.ibm.com based on his experience while working on the project. Since then, official instructions have been published here: https://opendatahub.io/docs/installing-open-data-hub/#installing-odh-v2_installv2
with customization for Kueue and running a tuning job here: https://opendatahub.io/docs/working-with-distributed-workloads/
0. Prerequisites
0.1 An OpenShift cluster up and running. (I've been using OpenShift 4.14.17.)
0.2 Logged onto the OpenShift UI. Note: I install everything from the command line, but I need the UI's "Copy login command" to get the oc login token.
0.3 Also logged into the terminal with oc login. For example:
oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443
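If you want to sanity-check the login before going further, something like this should do it:
oc whoami
oc whoami --show-server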
1. Install the OpenDataHub Operator
Using the terminal where you're logged in with oc login, issue this command:
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/opendatahub-operator.openshift-operators: ""
  name: opendatahub-operator
  namespace: openshift-operators
spec:
  channel: fast
  installPlanApproval: Automatic
  name: opendatahub-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: opendatahub-operator.v2.17.0
EOF
You can check it started with:
watch oc get pods -n openshift-operators
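You can also confirm the operator's ClusterServiceVersion reaches the Succeeded phase with something like:
oc get csv -n openshift-operators | grep opendatahub-operator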
2. Install the dependent operators
2.1 Install the Service Mesh operator
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: servicemeshoperator.v2.6.1
EOF
And then check it with:
watch oc get pods -n openshift-operators
2.2 Install Authorino Operator
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/authorino-operator.openshift-operators: ""
  name: authorino-operator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: authorino-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: authorino-operator.v0.11.1
EOF
And then check it with:
watch oc get pods -n openshift-operators
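At this point all three operators should be installed. One quick way to sanity-check them together is to list the CSVs and confirm each one shows the Succeeded phase:
oc get csv -n openshift-operators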
3. Create the DSCInitialization
cat << EOF | oc apply -f -
kind: DSCInitialization
apiVersion: dscinitialization.opendatahub.io/v1
metadata:
  name: default-dsci
spec:
  applicationsNamespace: opendatahub
  monitoring:
    managementState: Managed
    namespace: opendatahub
  serviceMesh:
    auth:
      audiences:
        - https://kubernetes.default.svc
    controlPlane:
      metricsCollection: Istio
      name: data-science-smcp
      namespace: istio-system
    managementState: Managed
  trustedCABundle:
    customCABundle: ""
    managementState: Managed
EOF
And then check it: (It should go into "Ready" state after about a minute or so)
watch oc get dsci
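If you'd rather not sit on the watch, a one-shot wait along these lines should also work (this assumes the DSCI reports Ready in .status.phase, which is what the PHASE column in oc get dsci reads from):
oc wait --for=jsonpath='{.status.phase}'=Ready dsci/default-dsci --timeout=10m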
Also note that you'll see the Istio control plane start up as well, in the istio-system namespace:
oc get pods -n istio-system
4. Create the DataScienceCluster
cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Removed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    modelregistry:
      managementState: Removed
    ray:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      managementState: Removed
    workbenches:
      managementState: Removed
EOF
Check that the pods are running:
watch oc get pods -n opendatahub
You should see these pods:
oc get pods -n opendatahub
NAME READY STATUS RESTARTS AGE
kubeflow-training-operator-dc9cf9bb5-595xx 1/1 Running 0 4h50m
kueue-controller-manager-66768ccc94-4xq4v 1/1 Running 0 4h51m
odh-dashboard-5969fd7b5b-gd6rt 2/2 Running 0 4h51m
odh-dashboard-5969fd7b5b-xd7qj 2/2 Running 0 4h51m
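Instead of watching, you can also block until the deployments report Available; the deployment names here are taken from the pod listing above:
oc wait --for=condition=Available deployment/kubeflow-training-operator deployment/kueue-controller-manager deployment/odh-dashboard -n opendatahub --timeout=10m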
Note: if you're having image pull issues from docker.io, you can change the deployment to pull the training operator from quay.io instead with this:
oc set image deployment kubeflow-training-operator training-operator=quay.io/jbusche/training-operator:v1-855e096 -n opendatahub
Note: the initContainer pulls from docker.io/alpine:3.10 automatically, which causes trouble on clusters that are rate-limited by docker.io. To get around this, you can run the following command to patch the training operator to use a different repo for the initContainer:
oc patch deployment kubeflow-training-operator -n opendatahub --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["/manager", "--pytorch-init-container-image=quay.io/jbusche/alpine:3.10"]}]'
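After the patch, you can confirm the deployment rolls out cleanly with:
oc rollout status deployment/kubeflow-training-operator -n opendatahub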
5. Create the Kueue resources
cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "cpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 5
      - name: "memory"
        nominalQuota: 20Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: default
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
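You can double-check the Kueue objects were created; the ResourceFlavor, ClusterQueue, and WorkloadPriorityClass are cluster-scoped, while the LocalQueue lives in the default namespace:
oc get resourceflavors,clusterqueues,workloadpriorityclasses
oc get localqueues -n default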
6. Testing
I've been using Ted's script, changing the image tag depending on the fms-hf-tuning image we want to use.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
  namespace: default
data:
  config.json: |
    {
      "accelerate_launch_args": {
        "num_machines": 1,
        "num_processes": 2
      },
      "model_name_or_path": "bigscience/bloom-560m",
      "training_data_path": "/etc/config/twitter_complaints_small.json",
      "output_dir": "/tmp/out",
      "num_train_epochs": 1.0,
      "per_device_train_batch_size": 4,
      "per_device_eval_batch_size": 4,
      "gradient_accumulation_steps": 4,
      "eval_strategy": "no",
      "save_strategy": "epoch",
      "learning_rate": 1e-5,
      "weight_decay": 0.0,
      "lr_scheduler_type": "cosine",
      "logging_steps": 1.0,
      "packing": false,
      "include_tokens_per_second": true,
      "response_template": "\n### Label:",
      "dataset_text_field": "output",
      "use_flash_attn": false,
      "torch_dtype": "float32",
      "peft_method": "pt",
      "tokenizer_name_or_path": "bigscience/bloom"
    }
  twitter_complaints_small.json: |
    {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}
    {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}
    {"Tweet text":"If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService","ID":2,"Label":1,"text_label":"complaint","output":"### Text: If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService\n\n### Label: complaint"}
    {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"}
    {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"}
    {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"}
    {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"}
    {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"}
    {"Tweet text":"Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora","ID":8,"Label":1,"text_label":"complaint","output":"### Text: Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora\n\n### Label: complaint"}
    {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd\n\n### Label: no complaint"}
---
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
name: ted-kfto-sft
namespace: default
labels:
kueue.x-k8s.io/queue-name: lq-trainer
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: Never # Do not restart the pod on failure. If you do set it to OnFailure, be sure to also set backoffLimit
template:
spec:
containers:
- name: pytorch
# This is the temp location util image is officially released
#image: image-registry.openshift-image-registry.svc:5000/opendatahub/fms-hf-tuning:0.0.1rc7
#image: quay.io/jbusche/fms-hf-tuning:issue758-1
#image: quay.io/modh/fms-hf-tuning:01b3824c9aba22d9d0695399681e6f0507840e7f
#image: quay.io/modh/fms-hf-tuning:a130d1c890501a4fac1d9522f1198b6273ade2d4
image: quay.io/modh/fms-hf-tuning:release
imagePullPolicy: IfNotPresent
command:
- "python"
- "/app/accelerate_launch.py"
env:
- name: SFT_TRAINER_CONFIG_JSON_PATH
value: /etc/config/config.json
volumeMounts:
- name: config-volume
mountPath: /etc/config
volumes:
- name: config-volume
configMap:
name: my-config
items:
- key: config.json
path: config.json
- key: twitter_complaints_small.json
path: twitter_complaints_small.json
EOF
And then in a perfect world, it'll start up a pytorchjob and run to completion:
watch oc get pytorchjobs,pods -n default
and it'll look like this:
Every 2.0s: oc get pytorchjobs,pods -n default api.ted414.cp.fyre.ibm.com: Wed Apr 24 18:34:49 2024
NAME STATE AGE
pytorchjob.kubeflow.org/ted-kfto-sft Succeeded 58m
NAME READY STATUS RESTARTS AGE
pod/ted-kfto-sft-master-0 0/1 Completed 0 58m
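If you want to follow the training output or confirm that Kueue admitted the job, something like this works (the pod name matches the listing above):
oc logs -f ted-kfto-sft-master-0 -n default
oc get workloads.kueue.x-k8s.io -n default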
7. Building and testing your own fms-hf-tuning image
1. First you need to be logged into the OpenShift cluster with oc. For example:
oc login --token=sha256~eNI_S6ah... --server=https://api.jimfips.cp.fyre.ibm.com:6443
2. Expose the internal image registry's default route with this step, then wait a few minutes; you'll know it's ready when step 3 below succeeds.
oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
3. Using podman, log in to your local OpenShift registry:
podman login -u kubeadmin -p $(oc whoami -t) $(oc registry info) --tls-verify=false
4. Build a new fms-hf-tuning image from main and/or your branch using these steps.
4.1 Download the repo from main:
git clone https://github.com/jbusche/fms-hf-tuning.git
cd fms-hf-tuning
4.2 Alternatively, you could download from your repo and use your PR branch like this:
git clone https://github.com/jbusche/fms-hf-tuning.git -b jb-828-python-cves
cd fms-hf-tuning
4.3 Build the image locally, naming it as you'd like (I used today's date):
docker build --progress=plain -t fms-hf-tuning:jim-0509-fixed . -f build/Dockerfile
5. Log in with podman (if you haven't already), then tag and push the image to your local registry:
podman login -u kubeadmin -p $(oc whoami -t) $(oc registry info) --tls-verify=false
podman tag localhost/fms-hf-tuning:jim-0509-fixed $(oc registry info)/opendatahub/fms-hf-tuning:jim-0509-fixed
podman push --tls-verify=false $(oc registry info)/opendatahub/fms-hf-tuning:jim-0509-fixed
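The push should create an image stream in the opendatahub namespace; you can sanity-check it with this (the tag name matches the example above):
oc get imagestream fms-hf-tuning -n opendatahub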
6. Run the test step from https://github.com/foundation-model-stack/fms-hf-tuning/wiki/Installing-and-Testing-OpenShift-fms%E2%80%90hf%E2%80%90tuning-Stack#6-testing, only substituting in your image name. For example:
Change:
image: quay.io/modh/fms-hf-tuning:a130d1c890501a4fac1d9522f1198b6273ade2d4
to
image: image-registry.openshift-image-registry.svc:5000/opendatahub/fms-hf-tuning:jim-0509-fixed
7. And then in a perfect world, it'll start up a pytorchjob and run to completion:
watch oc get pytorchjobs,pods -n default
and it'll look like this:
Every 2.0s: oc get pytorchjobs,pods -n default api.ted414.cp.fyre.ibm.com: Wed Apr 24 18:34:49 2024
NAME STATE AGE
pytorchjob.kubeflow.org/ted-kfto-sft Succeeded 58m
NAME READY STATUS RESTARTS AGE
pod/ted-kfto-sft-master-0 0/1 Completed 0 58m
8. Cleanup
Cleanup of your pytorchjob and cm:
oc delete pytorchjob ted-kfto-sft -n default
oc delete cm my-config -n default
Cleanup of your Kueue resources, if you want that:
cat <<EOF | kubectl delete -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "cpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 5
      - name: "memory"
        nominalQuota: 20Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: default
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
Cleanup of the DSC (if you want that)
oc delete dsc default-dsc
Cleanup of DSCI (if you want that)
oc delete dsci default-dsci
Cleanup of ODH operators (if you want that)
oc delete sub authorino-operator opendatahub-operator servicemeshoperator -n openshift-operators
oc delete csv authorino-operator.v0.11.1 opendatahub-operator.v2.17.0 servicemeshoperator.v2.6.1 -n openshift-operators
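If you want to confirm everything is gone, the subscriptions and CSVs should no longer show up:
oc get sub,csv -n openshift-operators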