
GKE GPU nodes: nvidia-smi not found, likely missing env PATH and LD_LIBRARY_PATH #176

Open
MeCode4Food opened this issue Jan 7, 2025 · 1 comment


MeCode4Food commented Jan 7, 2025

I am trying to run a VS Code notebook on my GKE cluster's Kubeflow platform. On the node(s), the NVIDIA drivers are already installed, as this test pod shows:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["nvidia-smi; while true; do sleep 600; done;"]
    resources:
      limits:
       nvidia.com/gpu: 1
❯ kubectl logs -f my-gpu-pod
Tue Jan  7 02:42:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

However, inside the notebook pod this is not the case:

(base) jovyan@ck-test-vscode-notebook-gpu-ok-0:~$ nvidia-smi
bash: nvidia-smi: command not found
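One way to confirm that only the environment variables are missing (and not the driver mount itself) is to list the directory where GKE surfaces the host driver install. The `/usr/local/nvidia` path here is an assumption taken from the workaround later in this issue; adjust it if your node image differs:

```shell
# If binaries and libraries are listed, the driver mount is present and only
# PATH / LD_LIBRARY_PATH are missing; otherwise the mount itself is absent.
ls /usr/local/nvidia/bin /usr/local/nvidia/lib64 2>/dev/null || echo "driver mount not found"
```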
❯ kubectl get pods -n kubeflow-user-example-com ck-test-vscode-notebook-gpu-ok-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    istio.io/rev: default
    kubectl.kubernetes.io/default-container: ck-test-vscode-notebook-gpu-ok
    kubectl.kubernetes.io/default-logs-container: ck-test-vscode-notebook-gpu-ok
    poddefault.admission.kubeflow.org/poddefault-access-ml-pipeline: "43967"
    prometheus.io/path: /stats/prometheus
    prometheus.io/port: "15020"
    prometheus.io/scrape: "true"
    sidecar.istio.io/status: '{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["workload-socket","credential-socket","workload-certs","istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null,"revision":"default"}'
  creationTimestamp: "2025-01-06T14:25:37Z"
  generateName: ck-test-vscode-notebook-gpu-ok-
  labels:
    access-ml-pipeline: "true"
    app: ck-test-vscode-notebook-gpu-ok
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: ck-test-vscode-notebook-gpu-ok-68cf86c894
    notebook-name: ck-test-vscode-notebook-gpu-ok
    security.istio.io/tlsMode: istio
    service.istio.io/canonical-name: ck-test-vscode-notebook-gpu-ok
    service.istio.io/canonical-revision: latest
    statefulset: ck-test-vscode-notebook-gpu-ok
    statefulset.kubernetes.io/pod-name: ck-test-vscode-notebook-gpu-ok-0
  name: ck-test-vscode-notebook-gpu-ok-0
  namespace: kubeflow-user-example-com
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: ck-test-vscode-notebook-gpu-ok
    uid: -
  resourceVersion: "25035037"
  uid: -
spec:
  containers:
  - env:
    - name: NB_PREFIX
      value: /notebook/kubeflow-user-example-com/ck-test-vscode-notebook-gpu-ok
    image: kubeflownotebookswg/codeserver-python:v1.8.0
    imagePullPolicy: IfNotPresent
    name: ck-test-vscode-notebook-gpu-ok
    ports:
    - containerPort: 8888
      name: notebook-port
      protocol: TCP
    resources:
      limits:
        cpu: 600m
        memory: 1288490188800m
        nvidia.com/gpu: "1"
      requests:
        cpu: 500m
        memory: 1Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /home/jovyan
      name: ck-test-vscode-notebook-gpu-ok-workspace
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-fzmnf
      readOnly: true
    - mountPath: /var/run/secrets/kubeflow/pipelines
      name: volume-kf-pipeline-token
      readOnly: true
    workingDir: /home/jovyan
  - args:
    - proxy
    - sidecar
    - --domain
    - $(POD_NAMESPACE).svc.cluster.local
    - --proxyLogLevel=warning
    - --proxyComponentLogLevel=misc:error
    - --log_output_level=default:info
    env:
    - name: JWT_POLICY
      value: third-party-jwt
    - name: PILOT_CERT_PROVIDER
      value: istiod
    - name: CA_ADDR
      value: istiod.istio-system.svc:15012
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: INSTANCE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: SERVICE_ACCOUNT
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.serviceAccountName
    - name: HOST_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: ISTIO_CPU_LIMIT
      valueFrom:
        resourceFieldRef:
          divisor: "0"
          resource: limits.cpu
    - name: PROXY_CONFIG
      value: |
        {}
    - name: ISTIO_META_POD_PORTS
      value: |-
        [
            {"name":"notebook-port","containerPort":8888,"protocol":"TCP"}
        ]
    - name: ISTIO_META_APP_CONTAINERS
      value: ck-test-vscode-notebook-gpu-ok
    - name: GOMEMLIMIT
      valueFrom:
        resourceFieldRef:
          divisor: "0"
          resource: limits.memory
    - name: GOMAXPROCS
      valueFrom:
        resourceFieldRef:
          divisor: "0"
          resource: limits.cpu
    - name: ISTIO_META_CLUSTER_ID
      value: Kubernetes
    - name: ISTIO_META_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: ISTIO_META_INTERCEPTION_MODE
      value: REDIRECT
    - name: ISTIO_META_WORKLOAD_NAME
      value: ck-test-vscode-notebook-gpu-ok
    - name: ISTIO_META_OWNER
      value: kubernetes://apis/apps/v1/namespaces/kubeflow-user-example-com/statefulsets/ck-test-vscode-notebook-gpu-ok
    - name: ISTIO_META_MESH_ID
      value: cluster.local
    - name: TRUST_DOMAIN
      value: cluster.local
    image: docker.io/istio/proxyv2:1.20.2
    imagePullPolicy: IfNotPresent
    name: istio-proxy
    ports:
    - containerPort: 15090
      name: http-envoy-prom
      protocol: TCP
    readinessProbe:
      failureThreshold: 4
      httpGet:
        path: /healthz/ready
        port: 15021
        scheme: HTTP
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1337
      runAsNonRoot: true
      runAsUser: 1337
    startupProbe:
      failureThreshold: 600
      httpGet:
        path: /healthz/ready
        port: 15021
        scheme: HTTP
      periodSeconds: 1
      successThreshold: 1
      timeoutSeconds: 3
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/workload-spiffe-uds
      name: workload-socket
    - mountPath: /var/run/secrets/credential-uds
      name: credential-socket
    - mountPath: /var/run/secrets/workload-spiffe-credentials
      name: workload-certs
    - mountPath: /var/run/secrets/istio
      name: istiod-ca-cert
    - mountPath: /var/lib/istio/data
      name: istio-data
    - mountPath: /etc/istio/proxy
      name: istio-envoy
    - mountPath: /var/run/secrets/tokens
      name: istio-token
    - mountPath: /etc/istio/pod
      name: istio-podinfo
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-fzmnf
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: ck-test-vscode-notebook-gpu-ok-0
  initContainers:
  - args:
    - istio-iptables
    - -p
    - "15001"
    - -z
    - "15006"
    - -u
    - "1337"
    - -m
    - REDIRECT
    - -i
    - '*'
    - -x
    - ""
    - -b
    - '*'
    - -d
    - 15090,15021,15020
    - --log_output_level=default:info
    image: docker.io/istio/proxyv2:1.20.2
    imagePullPolicy: IfNotPresent
    name: istio-init
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_ADMIN
        - NET_RAW
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: false
      runAsGroup: 0
      runAsNonRoot: false
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-fzmnf
      readOnly: true
  nodeName: gke-ml-sg-gpu-pool-3-x-y
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 100
  serviceAccount: default-editor
  serviceAccountName: default-editor
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  volumes:
  - emptyDir: {}
    name: workload-socket
  - emptyDir: {}
    name: credential-socket
  - emptyDir: {}
    name: workload-certs
  - emptyDir:
      medium: Memory
    name: istio-envoy
  - emptyDir: {}
    name: istio-data
  - downwardAPI:
      defaultMode: 420
      items:
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.labels
        path: labels
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.annotations
        path: annotations
    name: istio-podinfo
  - name: istio-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: istio-ca
          expirationSeconds: 43200
          path: istio-token
  - configMap:
      defaultMode: 420
      name: istio-ca-root-cert
    name: istiod-ca-cert
  - emptyDir:
      medium: Memory
    name: dshm
  - name: ck-test-vscode-notebook-gpu-ok-workspace
    persistentVolumeClaim:
      claimName: ck-test-vscode-notebook-gpu-ok-workspace
  - name: kube-api-access-fzmnf
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
  - name: volume-kf-pipeline-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: pipelines.kubeflow.org
          expirationSeconds: 7200
          path: token
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T14:25:49Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T14:25:50Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T14:25:52Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T14:25:52Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-01-06T14:25:41Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://y
    image: docker.io/kubeflownotebookswg/codeserver-python:v1.8.0
    imageID: docker.io/kubeflownotebookswg/codeserver-python@sha256:bf91bc4c205a8674f4dfe9dd92ed1e63ca2ebd74026e54dc39107c95087962ba
    lastState: {}
    name: ck-test-vscode-notebook-gpu-ok
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-01-06T14:25:50Z"
  - containerID: containerd://z
    image: docker.io/istio/proxyv2:1.20.2
    imageID: docker.io/istio/proxyv2@sha256:5786e72bf56c4cdf58e88dad39579a24875d05e213aa9a7bba3c59206f84ab6c
    lastState: {}
    name: istio-proxy
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-01-06T14:25:50Z"
  hostIP: x
  hostIPs:
  - ip: x
  initContainerStatuses:
  - containerID: containerd://x
    image: docker.io/istio/proxyv2:1.20.2
    imageID: docker.io/istio/proxyv2@sha256:5786e72bf56c4cdf58e88dad39579a24875d05e213aa9a7bba3c59206f84ab6c
    lastState: {}
    name: istio-init
    ready: true
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://x
        exitCode: 0
        finishedAt: "2025-01-06T14:25:49Z"
        reason: Completed
        startedAt: "2025-01-06T14:25:49Z"
  phase: Running
  podIP: y
  podIPs:
  - ip: y
  qosClass: Burstable
  startTime: "2025-01-06T14:25:41Z"

This is, however, remedied by adjusting PATH and LD_LIBRARY_PATH:

(base) jovyan@ck-test-vscode-notebook-gpu-ok-0:~$ nvidia-smi
bash: nvidia-smi: command not found
(base) jovyan@ck-test-vscode-notebook-gpu-ok-0:~$ export PATH=$PATH:/usr/local/nvidia/bin
(base) jovyan@ck-test-vscode-notebook-gpu-ok-0:~$ nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
(base) jovyan@ck-test-vscode-notebook-gpu-ok-0:~$ export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
(base) jovyan@ck-test-vscode-notebook-gpu-ok-0:~$ nvidia-smi
Tue Jan  7 02:48:50 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:05.0 Off |                    0 |
| N/A   37C    P8             10W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                        
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
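Since `/home/jovyan` is backed by a PVC, one stop-gap is to persist the two exports in the notebook's `~/.bashrc` so they survive pod restarts. A sketch using the paths observed above (assumption: they match your GKE node image):

```shell
# Append the NVIDIA paths to ~/.bashrc; idempotence is not handled here,
# so running this twice adds duplicate (harmless) entries.
cat >> ~/.bashrc <<'EOF'
export PATH="${PATH}:/usr/local/nvidia/bin"
export LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
EOF
```

New terminals pick this up automatically; existing shells need `source ~/.bashrc`. Note this only helps processes started through a login shell, not arbitrary processes spawned by the notebook server.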

Note: this issue is also present in the jupyter-pytorch-cuda-full image.

@github-project-automation github-project-automation bot moved this to Needs Triage in Kubeflow Notebooks Jan 7, 2025
@MeCode4Food MeCode4Food changed the title GKE GPU nodes: nvidia-smi not found, likely missingPATH and LD_LIBRARY_PATH GKE GPU nodes: nvidia-smi not found, likely missing env PATH and LD_LIBRARY_PATH Jan 7, 2025
@MeCode4Food (Author) commented:

LD_LIBRARY_PATH can easily be added via the PodDefaults CRD, but the PATH variable is not easily extended that way. Building a new image that extends PATH resolves the issue, but I wonder whether this could be added to the base image(s), unless there is a way to resolve it without a custom image.
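For reference, a PodDefault along these lines can inject LD_LIBRARY_PATH into matching notebook pods. This is a sketch: the `gpu-env` label is made up for illustration, the paths are those observed on GKE in this issue, and PATH indeed cannot usefully be composed this way, because `$(PATH)` in an env value only expands variables defined earlier in the pod spec, not ones baked into the image:

```yaml
# Sketch of a PodDefault that injects LD_LIBRARY_PATH into notebooks carrying
# the (hypothetical) label gpu-env: "true".
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: gpu-env
  namespace: kubeflow-user-example-com
spec:
  selector:
    matchLabels:
      gpu-env: "true"
  desc: Add NVIDIA driver libraries to LD_LIBRARY_PATH
  env:
  - name: LD_LIBRARY_PATH
    value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
```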
