This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Device attach fails on volume creation due to duplicate bios.uuid #1373

Closed
govint opened this issue Jun 8, 2017 · 9 comments · Fixed by #1526

Comments

@govint
Contributor

govint commented Jun 8, 2017

Copy-paste from email:

    Hi,
    
    We are having some trouble getting the docker-volume-vsphere plugin to work. Do you see anything that we may be missing?
    
    The details and logs are below.
    
    Thanks!
    XXX

    Environment Details: 1 VM on RHEL 7.3, ESXi 6.0U1, Docker 17.05.0-ce, vDVS 0.14
    
    Steps to Reproduce:
    docker plugin install --grant-all-permissions --alias vsphere vmware/docker-volume-vsphere:0.14
    docker volume create --driver=vsphere --name=test
    
    Actual Result:
    Error response from daemon: create test: VolumeDriver.Create: Failed to create filesystem on /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0: exit status 1. Output = mke2fs 1.43.3 (04-Sep-2016)
    The file /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0 does not exist and no size was specified.
    
    Client Logs:
    2017-06-07 19:32:35.514944609 +0000 UTC [INFO] Starting plugin driver=vsphere log_level=info config="/etc/docker-volume-vsphere.conf"
    2017-06-07 19:32:35.515383164 +0000 UTC [INFO] Plugin options - port=1019
    2017-06-07 19:32:35.515411358 +0000 UTC [INFO] Getting volume data from unix:///var/run/docker.sock
    2017-06-07 19:32:35.539957187 +0000 UTC [INFO] Found 6 running or paused containers
    2017-06-07 19:32:35.546933529 +0000 UTC [INFO] Discovered 0 volumes in use.
    2017-06-07 19:32:35.546966645 +0000 UTC [INFO] Refcounting successfully completed
    2017-06-07 19:32:35.546986052 +0000 UTC [INFO] Docker VMDK plugin started version="vSphere Volume Driver v0.4" port=1019 mock_esx=false
    2017-06-07 19:32:35.54727681 +0000 UTC [INFO] Going into ServeUnix - Listening on Unix socket address="/run/docker/plugins/vsphere.sock"
    2017-06-07 19:07:25.347595558 +0000 UTC [DEBUG] vmdkOps.Get name=test
    2017-06-07 19:07:25.34765686 +0000 UTC [DEBUG] Run get request: version=
    2017-06-07 19:07:25.563177483 +0000 UTC [DEBUG] vmdkOp.Create name=test
    2017-06-07 19:07:25.563228464 +0000 UTC [DEBUG] Run get request: version=
    2017-06-07 19:07:25.816647424 +0000 UTC [INFO] Attaching volume and creating filesystem name=test fstype=ext4
    2017-06-07 19:07:25.818644055 +0000 UTC [DEBUG] vmdkOps.Attach name=test
    2017-06-07 19:07:25.818684762 +0000 UTC [DEBUG] Run get request: version=
    2017-06-07 19:07:44.342576236 +0000 UTC [WARNING] Exceeded timeout while waiting for device attach to completetimeout=10s device="/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0"
    2017-06-07 19:07:44.344666589 +0000 UTC [ERROR] Create filesystem failed, removing the volume name=test error="Failed to create filesystem on /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0: exit status 1. Output = mke2fs 1.43.3 (04-Sep-2016)\nThe file /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0 does not exist and no size was specified.\n"
    2017-06-07 19:07:44.344716477 +0000 UTC [DEBUG] vmdkOps.Detach name=test
    2017-06-07 19:07:44.344730957 +0000 UTC [DEBUG] Run get request: version=
    2017-06-07 19:07:47.108149631 +0000 UTC [DEBUG] vmdkOps.Remove name=test
    2017-06-07 19:07:47.10819686 +0000 UTC [DEBUG] Run get request: version=
    
    ESX Host Logs:
    06/07/17 23:07:39 1420822 [Thread-11] [INFO   ] cmd get with opts {} on tenant_uuid 11111111-1111-1111-1111-111111111111 datastore VNX01-T2-NFS-IT2 is allowed to execute
    06/07/17 23:07:39 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] executeRequest 'get' completed with ret={u'Error': 'Volume test not found (file: /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk)'}
    06/07/17 23:07:39 1420822 [Thread-12] [WARNING] Volume size not specified
    06/07/17 23:07:39 1420822 [Thread-12] [WARNING] Volume size not specified
    06/07/17 23:07:39 1420822 [Thread-12] [INFO   ] cmd create with opts {u'fstype': u'ext4'} on tenant_uuid 11111111-1111-1111-1111-111111111111 datastore VNX01-T2-NFS-IT2 is allowed to execute
    06/07/17 23:07:39 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] *** createVMDK: /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk opts = {u'fstype': u'ext4'}
    06/07/17 23:07:40 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [WARNING] Volume size not specified
    06/07/17 23:07:40 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] executeRequest 'create' completed with ret=None
    06/07/17 23:07:40 1420822 [Thread-13] [INFO   ] cmd attach with opts {} on tenant_uuid 11111111-1111-1111-1111-111111111111 datastore VNX01-T2-NFS-IT2 is allowed to execute
    06/07/17 23:07:40 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] *** attachVMDK: /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk to env-a-4vm1 (d66143b7-8ce0-4771-8bbe-a0feb6f8028d) VM uuid = 420a8c6b-e833-dbf8-f04c-390c5332d91f
    06/07/17 23:07:40 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] Attaching /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk as independent_persistent
    06/07/17 23:07:45 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] Disk /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk successfully attached. controller pci_slot_number=160, disk_slot=1
    06/07/17 23:07:45 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] executeRequest 'attach' completed with ret={'ControllerPciSlotNumber': '160', 'Unit': '1'}
    06/07/17 23:07:46 1420822 [Thread-14] [INFO   ] cmd detach with opts {} on tenant_uuid 11111111-1111-1111-1111-111111111111 datastore VNX01-T2-NFS-IT2 is allowed to execute
    06/07/17 23:07:46 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] *** detachVMDK: /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk from env-a-4vm1 (d66143b7-8ce0-4771-8bbe-a0feb6f8028d) VM uuid = 420a8c6b-e833-dbf8-f04c-390c5332d91f
    06/07/17 23:07:48 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] Disk detached /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk
    06/07/17 23:07:48 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] executeRequest 'detach' completed with ret=None
    06/07/17 23:07:48 1420822 [Thread-15] [INFO   ] cmd remove with opts {} on tenant_uuid 11111111-1111-1111-1111-111111111111 datastore VNX01-T2-NFS-IT2 is allowed to execute
    06/07/17 23:07:48 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] *** removeVMDK: /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk
    06/07/17 23:07:48 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] *** cleanVMDK: /vmfs/volumes/VNX01-T2-NFS-IT2/dockvols/11111111-1111-1111-1111-111111111111/test.vmdk
    06/07/17 23:07:48 1420822 [env-dsvm1 (45d97822-3f84-43c5-8585-0d0356ac0ed6)-[VNX01-T2-NFS-IT2] dockvols/11111111-1111-1111-1111-111111111111/test.vmdk] [INFO   ] executeRequest 'remove' completed with ret=None
    06/07/17 23:08:20 1421847 [MainThread] [INFO   ] Connecting to the local Service Instance as 'dcui'
    
@pdhamdhere
Contributor

Timing in the logs seems suspicious:

The plugin issued attach (19:07:25), timed out after 10 seconds, and issued detach after 12 seconds (19:07:44).
The server received attach (23:07:40) and detach after 6 seconds (23:07:46).

@shuklanirdesh82 shuklanirdesh82 added this to the 0.16 milestone Jun 9, 2017
@govint
Contributor Author

govint commented Jun 9, 2017

Tried this out multiple times and the attach-wait times are anywhere between 6 s and 10 s. From debugging, the plugin starts its wait after the Docker host kernel has completed attaching the device. There are no obvious delays in the kernel logs:

Docker host kernel logs:
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.069868] vmw_pvscsi: msg type: 0x0 - MSG RING: 64/63 (5)
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.069873] vmw_pvscsi: msg: device added at scsi0:1:0
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.070082] scsi 2:0:1:0: Direct-Access VMware Virtual disk 1.0 PQ: 0 ANSI: 2
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.074260] sd 2:0:1:0: Attached scsi generic sg4 type 0
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.075813] sd 2:0:1:0: [sdd] 204800 512-byte logical blocks: (104 MB/100 MiB)
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.075840] sd 2:0:1:0: [sdd] Write Protect is off
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.075844] sd 2:0:1:0: [sdd] Mode Sense: 61 00 00 00
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.075869] sd 2:0:1:0: [sdd] Cache data unavailable
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.075871] sd 2:0:1:0: [sdd] Assuming drive cache: write through
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.076083] sd 2:0:1:0: [sdd] Cache data unavailable
Jun 8 21:01:42 hte-1s-eng-dhcp98 kernel: [60580.076086] sd 2:0:1:0: [sdd] Assuming drive cache: write through
Jun 8 21:01:43 hte-1s-eng-dhcp98 kernel: [60580.710503] sdd: unknown partition table
Jun 8 21:01:43 hte-1s-eng-dhcp98 kernel: [60580.710716] sd 2:0:1:0: [sdd] Cache data unavailable
Jun 8 21:01:43 hte-1s-eng-dhcp98 kernel: [60580.710720] sd 2:0:1:0: [sdd] Assuming drive cache: write through
Jun 8 21:01:43 hte-1s-eng-dhcp98 kernel: [60580.710743] sd 2:0:1:0: [sdd] Attached SCSI disk

Plugin logs
2017-06-08 21:00:52.856006649 -0700 PDT [DEBUG] vmdkOp.Create name=y_5
2017-06-08 21:00:52.856329133 -0700 PDT [DEBUG] Run get request: version=
2017-06-08 21:01:24.248953919 -0700 PDT [INFO] Attaching volume and creating filesystem name="y_5" fstype=ext4
2017-06-08 21:01:24.249156484 -0700 PDT [DEBUG] vmdkOps.Attach name=y_5
2017-06-08 21:01:24.249226682 -0700 PDT [DEBUG] Run get request: version=
2017-06-08 21:01:56.483686652 -0700 PDT [ERROR] Starting wait for attachname="y_5" error=
2017-06-08 21:02:02.581062994 -0700 PDT [DEBUG] event: "/dev/disk/by-path/pci-0000:13:00.0-scsi-0:0:1:0": 0x100 == IN_CREATE
2017-06-08 21:02:02.581194605 -0700 PDT [INFO] Scan complete device="/dev/disk/by-path/pci-0000:13:00.0-scsi-0:0:1:0" event="/dev/disk/by-path/pci-0000:13:00.0-scsi-0:0:1:0": 0x100 == IN_CREATE

As seen above, it's taking around 6 s for the device to be detected by the plugin.

The issue reproduces intermittently, but from all the logs available it doesn't appear to be an issue on the ESX side.
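For reference, the plugin's device wait is a bounded wait for the device node to appear under /dev/disk/by-path; when the node shows up after the deadline, mkfs runs against a nonexistent path and fails exactly as in the error above. A minimal Python sketch of that bounded-wait pattern (illustrative only; the real plugin is written in Go and uses an fsnotify/inotify watcher rather than polling, and `wait_for_device` is a hypothetical name):

```python
import os
import time

def wait_for_device(path, timeout=10.0, poll_interval=0.1):
    """Wait until the device node at `path` exists, or the timeout expires.

    Models the plugin's behavior that produces the "Exceeded timeout
    while waiting for device attach" warning: the attach has succeeded
    on the ESX side, but the guest has not surfaced the node in time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_interval)
    # One last check in case the node appeared right at the deadline.
    return os.path.exists(path)
```

Whether implemented with polling or inotify, the failure mode is the same: any fixed timeout loses if the guest kernel or udev is slower than expected to create the by-path symlink.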

@govint
Contributor Author

govint commented Jun 12, 2017

Summarizing, this doesn't seem to be a host (ESX) or VM issue. None of the logs point to any latencies in the host or the VM; perhaps it's the Go watcher itself that's causing this behavior.

@govint govint self-assigned this Jun 13, 2017
@govint
Contributor Author

govint commented Jun 13, 2017

@pdhamdhere, in local tests I was able to repro this issue a few times, and thereafter it didn't recur. I don't believe it's a plugin or ESX service issue, as there are no delays there that point to an obvious problem. I'll have an NFS datastore added and we'll see if this reproduces in CI; else I will close this issue.

@shuklanirdesh82, can you get an NFS datastore added to the CI so we can figure out whether this can happen there?

@govint
Contributor Author

govint commented Jun 20, 2017

Local tests don't repro this issue, at least not for the same reasons reported. The issue is being addressed over email; closing.

@govint govint closed this as completed Jun 20, 2017
@pdhamdhere
Contributor

Let's keep the issue open until we root-cause it in this unique setup.

@pdhamdhere pdhamdhere reopened this Jun 20, 2017
@pdhamdhere pdhamdhere assigned msterin and unassigned govint Jun 20, 2017
@msterin msterin added the P1 label Jun 20, 2017
@msterin
Contributor

msterin commented Jun 20, 2017

@govint writes

Summarizing, this doesn't seem to be a host (ESX) or VM issue.
None of the logs point to any latencies in the host or the VM, except perhaps the Go watcher itself that's causing this behavior.

The issue still exists, and we are trying to get additional info from the setup it was reported from, even though that setup is unique so far. Until we know the root cause, it represents a significant risk, as it's a blocker for whoever steps into it.

@msterin
Contributor

msterin commented Jun 21, 2017

More info. Customer writes

Here is the information from the ESX side of things:

ESX: VMware ESXi 6.0.0 build-3380124
Storage: NFS

Another potential point of interest: we are deploying the Docker VMs using a vApp template via the vCloud Director API, and the VM is actually managed by our vCloud Director solution. This might be why the process is timing out when trying to attach the disk to the VM, since vCloud owns the entity.

We may want to deploy a Docker VM using our template but not via vCloud Director, to determine whether we experience the same behavior.

and later he writes the following (hostnames and vm names redacted)

Line 20 of vmdk_ops.log:
*** attachVMDK: vmdk_name_removed to vm1 (d66143b7-8ce0-4771-8bbe-a0feb6f8028d) VM uuid = 420a8c6b-e833-dbf8-f04c-390c5332d91f

It looks like it’s attaching the volume to vm-a-1, but the docker volume create command was on vm-1

So the theory is that we are stepping into VM UUID uniqueness issues.
With vCloud Director, finding a unique UUID is not fun.

We can do the following:

  • use vc.uuid instead of uuid.bios. This may present a problem if no VC is around

I'll take care of this
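The proposed fix direction can be sketched as a lookup that prefers the VC-assigned instance UUID and falls back to the BIOS UUID only when no VC UUID matches. Below is a hypothetical pure-Python model of just the selection logic; the real fix in vmdk_ops.py resolves a VM ManagedObject via pyVmomi, and `find_vm` plus the dict-based inventory here are illustrative assumptions, not the actual API:

```python
def find_vm(inventory, vc_uuid, bios_uuid):
    """Locate a VM record, preferring the VC-assigned instance UUID.

    `inventory` is a hypothetical list of dicts with 'vc_uuid',
    'bios_uuid', and 'name' keys. bios.uuid can be duplicated when a
    .vmx file is copied (e.g. by vCloud Director), so matching on it
    alone can pick the wrong VM -- the misattach seen in this issue.
    """
    if vc_uuid:
        for vm in inventory:
            if vm["vc_uuid"] == vc_uuid:
                return vm
    # Fall back to BIOS UUID (e.g. when the host is not managed by a VC).
    matches = [vm for vm in inventory if vm["bios_uuid"] == bios_uuid]
    if len(matches) > 1:
        raise LookupError("duplicate bios.uuid; cannot identify VM safely")
    return matches[0] if matches else None
```

The key design point is that ambiguity on the fallback path is surfaced as an error rather than silently picking the first BIOS-UUID match, which is exactly how the wrong VM got the disk here.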

@tusharnt
Contributor

tusharnt commented Jun 28, 2017

@msterin Are we still targeting this for 0.16 release? We've just 2 more days to go!

@msterin msterin changed the title Device attach fails on volume creation Device attach fails on volume creation due to duplicte bios.uuid Jun 29, 2017
@msterin msterin changed the title Device attach fails on volume creation due to duplicte bios.uuid Device attach fails on volume creation due to dupliicate bios.uuid Jun 29, 2017
@msterin msterin changed the title Device attach fails on volume creation due to dupliicate bios.uuid Device attach fails on volume creation due to duplicate bios.uuid Jun 29, 2017
msterin pushed a commit that referenced this issue Jun 30, 2017
When communication over the VMCI vSocket is established in vmdk_ops.py, we find out the VM ID from the socket and then locate the VM ManagedObject by this ID.

There are two IDs: the VC UUID and the BIOS UUID. When a .vmx file is copied (e.g. by VCD or other products), the BIOS UUID can be duplicated, causing failures like #1373.

However, the VC UUID is unique on creation of a VM, and stays unique as long as the ESXi host is part of a VC. So it is a much better candidate for use as a unique ID.

This change tries the VC UUID first, and falls back to the BIOS UUID. It also checks that the VM name (for a VM found by ID) matches the one we see from the vSocket, and adds a few other minor log prints.
shuklanirdesh82 pushed a commit that referenced this issue Jun 30, 2017
* Added proper return on protocol version mismatch

* Using VC UUID  for attach/detach where possible

When communication over the VMCI vSocket is established, we find out the VM ID from the socket. There are two IDs: the VC UUID and the BIOS UUID. When a .vmx file is copied (e.g. by VCD or other products), the BIOS UUID can be duplicated. However, the VC UUID is unique on creation of a VM, and stays unique as long as the ESXi host is part of a VC. So it is a much better candidate for use as a unique ID.

This change tries the VC UUID first, and falls back to the BIOS UUID. It also checks that the VM name (for a VM found by ID) matches the one we see from the vSocket, and adds a few other minor log prints.

* Try to use VC UUID first for locating VMs, then fall back to BIOS UUID

When communication over the VMCI vSocket is established in vmdk_ops.py, we find out the VM ID from the socket and then locate the VM ManagedObject by this ID.

There are two IDs: the VC UUID and the BIOS UUID. When a .vmx file is copied (e.g. by VCD or other products), the BIOS UUID can be duplicated, causing failures like #1373.

However, the VC UUID is unique on creation of a VM, and stays unique as long as the ESXi host is part of a VC. So it is a much better candidate for use as a unique ID.

This change tries the VC UUID first, and falls back to the BIOS UUID. It also checks that the VM name (for a VM found by ID) matches the one we see from the vSocket, and adds a few other minor log prints.

* Unbundled detachVMDK and attachVMDK to address review comments

* fixed the VM instance ID