Add mlx and datashim deployment with openshift #41

Merged: 14 commits merged into IBM:master on Jun 11, 2022
Conversation

@Tomcli commented Jun 9, 2022

Which issue is resolved by this Pull Request:
Resolves #

Description of your changes:

Checklist:

  • Unit tests pass:
    Make sure you have kustomize == 3.2.1 installed, then run (see the sketch after this list):
    1. make generate-changed-only
    2. make test
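
A minimal sketch of the checklist as shell commands, run from the repository root; it assumes kustomize 3.2.1 is already on the PATH (the install step is left out, and the version string shown is only indicative):

    # confirm the pinned kustomize version is available
    kustomize version        # expect something like Version:kustomize/v3.2.1

    # regenerate only the changed manifests, then run the unit tests
    make generate-changed-only
    make test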

@yhwang merged commit 2dc4436 into IBM:master on Jun 11, 2022
@ckadner commented Jun 13, 2022

Thank you @Tomcli for updating the manifests for MLX!

On KIND and OpenShift on Fyre we are still seeing the kfp-csi-s3 pod get stuck in ContainerCreating.

You mentioned it last Friday, but I wanted to keep a record of it. Should I create an issue in this repo or in the MLX repo to track it?

[IBM_manifests] (v1.5-branch=)$ kubectl get pod -n kubeflow kfp-csi-s3-4x6dl

NAME               READY   STATUS              RESTARTS   AGE
kfp-csi-s3-4x6dl   0/2     ContainerCreating   0          28m


[IBM_manifests] (v1.5-branch=)$ kubectl describe pod -n kubeflow kfp-csi-s3-4x6dl

Name:           kfp-csi-s3-4x6dl
Namespace:      kubeflow
Priority:       0
Node:           mlx-control-plane/172.18.0.2
Start Time:     Mon, 13 Jun 2022 12:32:31 -0700
Labels:         app=kfp-csi-s3
                app.kubernetes.io/name=kubeflow
                application-crd-id=kubeflow-pipelines
                controller-revision-hash=6d8fbd86d7
                pod-template-generation=1
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  DaemonSet/kfp-csi-s3
Containers:
  driver-registrar:
    Container ID:
    Image:         quay.io/k8scsi/csi-node-driver-registrar:v1.2.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --csi-address=/csi/csi.sock
      --kubelet-registration-path=/var/data/kubelet/plugins/kfp-csi-s3/csi.sock
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      KUBE_NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /csi from socket-dir (rw)
      /registration from registration-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8gbd8 (ro)
  kfp-csi-s3:
    Container ID:
    Image:         quay.io/datashim/csi-s3:latest-amd64
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --endpoint=$(CSI_ENDPOINT)
      --nodeid=$(KUBE_NODE_NAME)
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      CSI_ENDPOINT:    unix:///csi/csi.sock
      KUBE_NODE_NAME:   (v1:spec.nodeName)
      cheap:           off
    Mounts:
      /csi from socket-dir (rw)
      /dev from dev-dir (rw)
      /var/data/kubelet/pods from mountpoint-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8gbd8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  socket-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/data/kubelet/plugins/kfp-csi-s3
    HostPathType:  DirectoryOrCreate
  mountpoint-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/data/kubelet/pods
    HostPathType:  DirectoryOrCreate
  registration-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/data/kubelet/plugins_registry
    HostPathType:  Directory
  dev-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  Directory
  kube-api-access-8gbd8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    28m                  default-scheduler  Successfully assigned kubeflow/kfp-csi-s3-4x6dl to mlx-control-plane
  Warning  FailedMount  28m (x2 over 28m)    kubelet            MountVolume.SetUp failed for volume "kube-api-access-8gbd8" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  17m (x3 over 21m)    kubelet            Unable to attach or mount volumes: unmounted volumes=[registration-dir], unattached volumes=[socket-dir registration-dir kube-api-access-8gbd8 mountpoint-dir dev-dir]: timed out waiting for the condition
  Warning  FailedMount  14m (x2 over 23m)    kubelet            Unable to attach or mount volumes: unmounted volumes=[registration-dir], unattached volumes=[kube-api-access-8gbd8 mountpoint-dir dev-dir socket-dir registration-dir]: timed out waiting for the condition
  Warning  FailedMount  12m                  kubelet            Unable to attach or mount volumes: unmounted volumes=[registration-dir], unattached volumes=[dev-dir socket-dir registration-dir kube-api-access-8gbd8 mountpoint-dir]: timed out waiting for the condition
  Warning  FailedMount  8m3s (x3 over 26m)   kubelet            Unable to attach or mount volumes: unmounted volumes=[registration-dir], unattached volumes=[registration-dir kube-api-access-8gbd8 mountpoint-dir dev-dir socket-dir]: timed out waiting for the condition
  Warning  FailedMount  100s (x21 over 28m)  kubelet            MountVolume.SetUp failed for volume "registration-dir" : hostPath type check failed: /var/data/kubelet/plugins_registry is not a directory

@yhwang commented Jun 14, 2022

Someone also hit a similar issue on AWS ORKS. I guess the key is getting the correct path for /var/data/kubelet/plugins_registry; I am not sure how to determine the correct plugins_registry path on KIND and OpenShift on Fyre.

The correct kubelet path for KIND is /var/lib/kubelet/plugins_registry.

Update: after replacing /var/data with /var/lib, the kfp-csi-s3 storage class works on my KIND cluster.
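
As an illustration only (not the exact diff from this PR), the fix amounts to switching the kubelet prefix in the kfp-csi-s3 DaemonSet from /var/data to /var/lib. The volume names, image, and arguments below are taken from the pod description above; everything else is abbreviated:

    # abbreviated sketch of the kfp-csi-s3 DaemonSet with the /var/lib kubelet prefix
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: kfp-csi-s3
      namespace: kubeflow
    spec:
      template:
        spec:
          containers:
            - name: driver-registrar
              image: quay.io/k8scsi/csi-node-driver-registrar:v1.2.0
              args:
                - --v=5
                - --csi-address=/csi/csi.sock
                - --kubelet-registration-path=/var/lib/kubelet/plugins/kfp-csi-s3/csi.sock  # was /var/data/...
          volumes:
            - name: registration-dir
              hostPath:
                path: /var/lib/kubelet/plugins_registry      # was /var/data/kubelet/plugins_registry
                type: Directory
            - name: socket-dir
              hostPath:
                path: /var/lib/kubelet/plugins/kfp-csi-s3    # was /var/data/kubelet/plugins/kfp-csi-s3
                type: DirectoryOrCreate
            - name: mountpoint-dir
              hostPath:
                path: /var/lib/kubelet/pods                  # was /var/data/kubelet/pods
                type: DirectoryOrCreate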

@yhwang commented Jun 14, 2022

@jbusche found that the path for OpenShift on Fyre is /var/lib too.

@yhwang commented Jun 14, 2022

I found that Tommy created a datashim layer for KIND that uses /var/lib: https://github.com/IBM/manifests/blob/v1.5-branch/contrib/datashim/kind/datashim.yaml#L784-L865
Both mlx-single-fyre-openshift and mlx-single-kind are using it now.
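
For anyone re-checking this locally, a quick way to confirm the fix (using the app=kfp-csi-s3 label from the pod description above) is that the DaemonSet pod leaves ContainerCreating and reports 2/2 Ready; the name suffix and timings below are placeholders:

    kubectl get pod -n kubeflow -l app=kfp-csi-s3
    # NAME               READY   STATUS    RESTARTS   AGE
    # kfp-csi-s3-xxxxx   2/2     Running   0          2m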

yhwang pushed a commit that referenced this pull request Aug 10, 2023
* add mlx and datashim deployment

* remove datashim conflicted spec

* add empty-dir for build dir

* update scc and fix typo

* add proc volume

* remove proc volume

* add kfp-csi-s3 to no auth kfp-tekton applications

* remove kfp-csi-s3 to no auth kfp-tekton applications

* remove kfp-csi-s3 to no auth kfp-tekton applications

* add datashim scc

* update dlf scc

* patch latest dlf oc yaml

* update kind datashim yaml

* add fyre openshift manifest