
Failed to find device path /dev/xvdac. no device path for device \"/dev/xvdac\" volume \"vol-02d38d88d79844a04\" found #2062

Closed
balusarakesh opened this issue Jun 12, 2024 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@balusarakesh

/kind bug

What happened?
Restarted the pod and now the EBS volume is not getting mounted on the node. As you can see from the screenshot, the device name /dev/xvdac does not exist and the ebs-csi-node pod is stuck, unable to mount the volume.

What you expected to happen?
A pod restart should not break the EBS volume mount.

How to reproduce it (as minimally and precisely as possible)?
Not really sure

Anything else we need to know?:
E0612 21:49:29.699613 1 driver.go:107] "GRPC error" err="rpc error: code = Internal desc = Failed to find device path /dev/xvdac. no device path for device \"/dev/xvdac\" volume \"vol-02d38d88d79844a04\" found"

Environment

  • Kubernetes version (use kubectl version):
Client Version: v1.28.9
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.9-eks-036c24b
  • Driver version: v1.30.0
    [Screenshot: AWS console, 2024-06-12 at 2:55 PM]
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 12, 2024
@balusarakesh
Author

For anyone looking for a workaround: manually mount the volume onto the same node using a different device path; that seems to have fixed the problem.

@ConnorJC3
Contributor

Hi, if you can find a way to reproduce this issue please let us know, but as written this report doesn't provide enough information to diagnose the problem. Restarting the pod using a volume does not normally prevent the volume from attaching.

Note that this issue may occur if the volume is manually mounted/unmounted. The EBS CSI Driver does not support and strongly discourages manual mounting of volumes. If this issue occurs in the future, we recommend deleting (and allowing a ReplicaSet to recreate) the broken pod to reschedule it onto another node.

@balusarakesh
Author

balusarakesh commented Jun 14, 2024

@ConnorJC3

  • unfortunately this happens randomly and there is no way to reproduce the error with 100% certainty
  • the volume was never mounted manually (it was only mounted manually as a workaround after the above error was observed)
  • this is a StatefulSet, so no ReplicaSet is involved here
  • the issue seems to be that the CSI driver is looking for a device ID that does not exist for the EBS volume
  • the issue only happens for 1 or 2 pods in a cluster with hundreds of pods
  • I understand that without a way to reproduce this error it's hard to fix, but the error message should be adequate to debug it; I think the EBS CSI driver should pick device IDs that actually EXIST on the volume

@AndrewSirenko
Contributor

@balusarakesh this might be a red herring, but next time you run into this issue can you confirm that the mount permissions on the /dev directory are read/write (rw), NOT read-only (ro)?

We have seen similar symptoms on another customer's cluster where some other process was making /dev read-only, which meant the device name would not show up in /dev after the volume was attached AND the customer therefore could not mount it.

When running sudo mount from the node, you should see something like devtmpfs on /dev type devtmpfs (rw ...), NOT devtmpfs on /dev type devtmpfs (ro ...).

Manually running sudo mount -o remount,rw /dev would reset the mount with rw permissions.
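
Putting those two checks together, a minimal sequence assuming shell access to the node (the grep pattern is illustrative):

sudo mount | grep ' /dev '
# Expect: devtmpfs on /dev type devtmpfs (rw,...). If it shows (ro,...), remount read-write:
sudo mount -o remount,rw /dev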

but the error message should be adequate enough to debug this

What error message would be more helpful here? I believe Failed to find device path /dev/xvdac. no device path for device \"/dev/xvdac\" volume \"vol-02d38d88d79844a04\" found summarizes the root cause of why the EBS CSI node service cannot mount (it cannot find the device). Perhaps we could add the full path to the volume as well, instead of just the volume ID, or an instruction to confirm that these paths exist? But I'm afraid that more prescriptive advice might mislead users in other edge cases.

@balusarakesh
Author

@AndrewSirenko I'll try to check the permissions next time it happens (thanks for the commands)

as you can see from the AWS screenshot, the device ID does not exist on the volume for the given instance; doesn't that mean the ebs-csi-driver should pick a different device ID that is available?

I manually mounted the volume on the node with a different device ID that was available and it worked, so it seems like the aws-ebs-csi-driver is picking non-existent device IDs.

@AndrewSirenko
Contributor

AndrewSirenko commented Jun 14, 2024

For context of anyone who stumbles on this issue, it is important to remember that the EBS CSI Driver is split into two components: the EBS CSI controller pod (responsible for talking to EC2 to make sure volumes are created and attached to the right node) and the EBS CSI node pod (which runs on each node and is responsible for formatting and mounting an already-attached volume to the proper node mount point and then the pod mount point).

as you can see from the AWS screenshot, the device id does not exist on the volume for the given instance, doesn't that mean the ebs-csi-driver should pick a different device id that is available?

@balusarakesh By the time the mount operation happens (NodeStageVolume, see the csi-spec), the volume has already been attached by the controller pod at an already-chosen device name. If the volume is not attached at the time of the mount operation, or the EBS CSI node pod checks the wrong device path, something is very wrong (because the Kubernetes state of the world and the EC2 state of the world are out of sync).

This device name can be seen on the VolumeAttachment object in Kubernetes, or from the EC2 API if you manually run aws ec2 describe-volumes --volume-ids <your-vol-id>.

See below for an example of an attached volume with device /dev/xvdaa:

{
    "Volumes": [
        {
            "Attachments": [
                {
                    "AttachTime": "2024-06-14T19:25:46+00:00",
                    "Device": "/dev/xvdaa",
                    "InstanceId": "i-0ea137a9b7ee4684a",
                    "State": "attached",
                    "VolumeId": "vol-01da7da10d37d277b",
                    "DeleteOnTermination": false
                }
            ],
            ...
            "State": "in-use",
            "VolumeId": "vol-01da7da10d37d277b",
            ...
        }
    ]
}

The EBS CSI node pod does not choose the device name at the time of mounting. If the CSI node service fails to find that path, it tries, as a backup, to find the device via the volume ID under /dev/disk/by-id/. The error you are seeing shows that, for some reason, the volume cannot be seen by the node pod at the already-chosen device name (in this case /dev/xvdac) AND at the typical NVMe volume-ID lookup /dev/disk/by-id/vol-xyz.


Next time you run into this issue, can you:

  1. Provide proof that the relevant EBS volume is attached to the underlying EC2 instance (via EC2 DescribeVolumes)
  2. Post the deviceName found when calling EC2 DescribeVolumes, and the device on the VolumeAttachment object (the VolumeAttachment object will have a field with the PV name)
  3. Provide logs of the CSI node pod with loglevel=7 (see the sketch after this list)
  4. Try to manually SSH into the node (if possible) and check /dev/ for the existence of the device name AND /dev/disk/by-id/ for the existence of a path with the volume's ID. If you can see those paths while SSH'd in, then the EBS CSI node pod may have insufficient permissions. If you can't see the paths there, then something is wrong with the volume attachment.
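
A minimal sketch for steps 3 and 4, assuming the default install names (the ebs-csi-node DaemonSet and its ebs-plugin container in kube-system) and reusing the volume ID from this issue; how you raise the verbosity to 7 depends on your install, so only the log fetch is shown, and the /dev/disk/by-id/ filename format is assumed to be the usual NVMe one, where the hyphen in the volume ID is dropped:

# Step 3: fetch node-pod logs (default install names assumed)
kubectl logs -n kube-system daemonset/ebs-csi-node -c ebs-plugin

# Step 4: from a shell on the node, check both lookup paths
ls -l /dev/xvdac
ls -l /dev/disk/by-id/ | grep vol02d38d88d79844a04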

One customer has reported that if /dev is read-only, then even if an EC2 AttachVolume call succeeds and the volume is in state attached, the path to the device might still not exist.


Hope some of ^^ was helpful. I do not know why manually mounting a volume with a different available device ID would work, unless step 2 of the above instructions reveals a mismatch between what EC2 DescribeVolumes says and what the VolumeAttachment object says the device path is.

@balusarakesh
Author

The above message is super helpful, thank you very much for taking the time to write it.
I'll follow the steps if this happens next time and let you know.

I'll leave it up to you to close or keep the issue open

Thanks again

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 12, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 13, 2024
@AndrewSirenko
Contributor

Closing based off of your previous comment, as it seems this issue has not recurred. Please re-open if you run into it again.

/close

@k8s-ci-robot
Contributor

@AndrewSirenko: Closing this issue.

In response to this:

Closing based off of your previous comment, as it seems this issue has not recurred. Please re-open if you run into it again.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
