This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Fix the plugin upgrade flow when disable -f is used (live upgrade). #1779

Closed
ashahi1 opened this issue Aug 18, 2017 · 8 comments

@ashahi1
Contributor

ashahi1 commented Aug 18, 2017

Container loses access to the data after plugin upgrade.

Steps:

1. Installed the 0.15 plugin.
2. Ran the container with the volume mounted and wrote some data to the volume.
3. While the container was running, upgraded from 0.15 to 0.16.
4. After the upgrade finished successfully, tried to read the data from the still-running container - fails.

The container was running throughout the upgrade process; I never stopped it.

Logs have been attached at the end.


Detailed steps and their output are as follows:

  1. Installed the plugin:
root@sc-rdops-vm02-dhcp-52-237:~# docker plugin install --grant-all-permissions --alias vsphere_0.15 vmware/docker-volume-vsphere:0.15
0.15: Pulling from vmware/docker-volume-vsphere
967f49dfadaf: Download complete
Digest: sha256:64a50830c2fa9befbede6efca3f28d2e2d8b4e595703f1b5984470579656e935
Status: Downloaded newer image for vmware/docker-volume-vsphere:0.15
Installed plugin vmware/docker-volume-vsphere:0.15
  2. Ran the container with volume mounted and wrote some data to the volume
root@sc-rdops-vm02-dhcp-52-237:~#
root@sc-rdops-vm02-dhcp-52-237:~# docker volume ls
DRIVER                VOLUME NAME
vsphere_0.15:latest   TestAAA@shared    Vmfs   -   0
vsphere_0.15:latest   VolXYZ1@shared    Vmfs   -   0
root@sc-rdops-vm02-dhcp-52-237:~#
root@sc-rdops-vm02-dhcp-52-237:~# docker run -it --volume-driver=vsphere_0.15:latest -v VolXYZ1@'shared    Vmfs   -   0':/vol1 --name ub_15 ubuntu
root@a0d51f3c11d8:/#
root@a0d51f3c11d8:/# echo "1234 pre-upgrade" >> vol1/testUpgrade
root@a0d51f3c11d8:/#
root@a0d51f3c11d8:/#
root@a0d51f3c11d8:/# echo "1234 pre-upgrade" >> vol1/testUpgrade
root@a0d51f3c11d8:/#
root@a0d51f3c11d8:/# cat vol1/testUpgrade
1234 pre-upgrade
1234 pre-upgrade
root@a0d51f3c11d8:/#
  3. Then did the upgrade
root@sc-rdops-vm02-dhcp-52-237:~# docker plugin ls
ID                  NAME                  DESCRIPTION                           ENABLED
1929c6d75275        vsphere_0.15:latest   VMWare vSphere Docker Volume plugin   true
root@sc-rdops-vm02-dhcp-52-237:~# docker plugin disable 1929c6d75275
Error response from daemon: plugin vsphere_0.15:latest is in use
root@sc-rdops-vm02-dhcp-52-237:~#
root@sc-rdops-vm02-dhcp-52-237:~# docker plugin disable 1929c6d75275 -f
1929c6d75275
root@sc-rdops-vm02-dhcp-52-237:~#
root@sc-rdops-vm02-dhcp-52-237:~#
root@sc-rdops-vm02-dhcp-52-237:~#
root@sc-rdops-vm02-dhcp-52-237:~#
root@sc-rdops-vm02-dhcp-52-237:~# docker plugin upgrade vsphere_0.15:latest vmware/docker-volume-vsphere:0.16
Upgrading plugin vsphere_0.15:latest from vmware/docker-volume-vsphere:0.15 to vmware/docker-volume-vsphere:0.16
Plugin images do not match, are you sure? y
Plugin "vmware/docker-volume-vsphere:0.16" is requesting the following privileges:
 - mount: [/dev]
 - mount: [/var/run]
 - mount: [/etc]
 - mount: [/var/log]
 - allow-all-devices: [true]
 - capabilities: [CAP_SYS_ADMIN]
Do you grant the above permissions? [y/N] y
0.16: Pulling from vmware/docker-volume-vsphere
1d54a760dc2a: Download complete
Digest: sha256:5281d47506740ee927001ea5fc16d8584ed7fc04de0fb25969f0b11eb3243345
Status: Downloaded newer image for vmware/docker-volume-vsphere:0.16
Upgraded plugin vsphere_0.15:latest to vmware/docker-volume-vsphere:0.16
root@sc-rdops-vm02-dhcp-52-237:~#
root@sc-rdops-vm02-dhcp-52-237:~#
root@sc-rdops-vm02-dhcp-52-237:~# docker plugin ls
ID                  NAME                  DESCRIPTION                           ENABLED
1929c6d75275        vsphere_0.15:latest   VMWare vSphere Docker Volume plugin   false
root@sc-rdops-vm02-dhcp-52-237:~# docker plugin enable 1929c6d75275
1929c6d75275
root@sc-rdops-vm02-dhcp-52-237:~# docker plugin ls
ID                  NAME                  DESCRIPTION                           ENABLED
1929c6d75275        vsphere_0.15:latest   VMWare vSphere Docker Volume plugin   true
  4. Again tried to read the data after the upgrade finished - fails
root@a0d51f3c11d8:/#
root@a0d51f3c11d8:/# cat vol1/testUpgrade
cat: vol1/testUpgrade: No such file or directory     <<<<<<<<<<<<<<<<<
root@a0d51f3c11d8:/#

Logs:
docker.txt
docker-volume-vsphere.txt
vmdk_ops.txt

@ashahi1 ashahi1 added this to the Sprint - Mufasa (VMworld, US) milestone Aug 18, 2017
@ashahi1 ashahi1 changed the title Container looses access to the data after upgrade Container looses access to the data after plugin upgrade Aug 18, 2017
@govint
Contributor

govint commented Aug 18, 2017

Suggest the only thing to do is to document that disable -f is disallowed/unsupported.

Possibly the data isn't flushed to disk and hence doesn't reappear after these steps. Let me confirm that.

@ashahi1 ashahi1 assigned ashahi1 and pshahzeb and unassigned ashahi1 Aug 19, 2017
@pshahzeb
Contributor

pshahzeb commented Aug 21, 2017

The described data loss behavior is fixed in Docker 17.06, but enabling the plugin after the upgrade is broken if it was disabled with the -f option.
This needs more digging.

@pshahzeb pshahzeb changed the title Container looses access to the data after plugin upgrade Fix the plugin upgrade flow when disable -f is used (live upgrade). Aug 21, 2017
@pshahzeb
Contributor

@govint @shuklanirdesh82
I did the following quick test of upgrading with a container running alongside. Docker removes the plugin ID dir which contains vsphere.sock.
This is irrespective of the -f flag; either way it removes the dir. I double-confirmed this.
The error when re-enabling the plugin doesn't occur if the container isn't running.

It looks like, when a container is running with our volume attached, enabling the plugin after the upgrade makes Docker try to talk to our socket before we start listening on it.

See steps below.

root@sc-rdops-vm02-dhcp-52-237:~# docker plugin install --grant-all-permissions --alias vsphere vmware/docker-volume-vsphere:0.15
0.15: Pulling from vmware/docker-volume-vsphere
967f49dfadaf: Download complete
Digest: sha256:64a50830c2fa9befbede6efca3f28d2e2d8b4e595703f1b5984470579656e935
Status: Downloaded newer image for vmware/docker-volume-vsphere:0.15
Installed plugin vmware/docker-volume-vsphere:0.15
root@sc-rdops-vm02-dhcp-52-237:~# docker volume ls
DRIVER              VOLUME NAME

root@sc-rdops-vm02-dhcp-52-237:~# docker volume create -d vsphere --name vol1
vol1
root@sc-rdops-vm02-dhcp-52-237:~# docker volume ls
DRIVER              VOLUME NAME
vsphere:latest      vol1@sharedvmfs-0

root@sc-rdops-vm02-dhcp-52-237:~# docker run -it -v vol1@sharedvmfs-0:/data alpine
/ # ls /data
lost+found
/ #
root@sc-rdops-vm02-dhcp-52-237:~# docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
05b1b9066739        alpine              "/bin/sh"           10 seconds ago      Up 8 seconds                            jolly_stallman

Then I disabled the plugin using -f. It removes the vsphere.sock and the plugin id directory.

root@sc-rdops-vm02-dhcp-52-237:~# ls /run/docker/plugins/1cbe1458b4490fa9eb4a29f48296d6f0ae87db530a2ac164c228357d9067eb41
vsphere.sock
root@sc-rdops-vm02-dhcp-52-237:~# docker plugin disable -f vsphere:latest
vsphere:latest
root@sc-rdops-vm02-dhcp-52-237:~# ls /run/docker/plugins
root@sc-rdops-vm02-dhcp-52-237:~#
root@sc-rdops-vm02-dhcp-52-237:~#
docker plugin upgrade vsphere:latest vmware/docker-volume-vsphere:0.16
Upgrading plugin vsphere:latest from vmware/docker-volume-vsphere:0.15 to vmware/docker-volume-vsphere:0.16
Plugin images do not match, are you sure? [y/N] y
Plugin "vmware/docker-volume-vsphere:0.16" is requesting the following privileges:
 - mount: [/dev]
 - mount: [/var/run]
 - mount: [/etc]
 - mount: [/var/log]
 - allow-all-devices: [true]
 - capabilities: [CAP_SYS_ADMIN]
Do you grant the above permissions? [y/N] y
0.16: Pulling from vmware/docker-volume-vsphere
1d54a760dc2a: Download complete
Digest: sha256:5281d47506740ee927001ea5fc16d8584ed7fc04de0fb25969f0b11eb3243345
Status: Downloaded newer image for vmware/docker-volume-vsphere:0.16
Upgraded plugin vsphere:latest to vmware/docker-volume-vsphere:0.16

root@sc-rdops-vm02-dhcp-52-237:~# docker plugin enable vsphere:latest
Error response from daemon: dial unix /run/docker/plugins/1cbe1458b4490fa9eb4a29f48296d6f0ae87db530a2ac164c228357d9067eb41/vsphere.sock: connect: no such file or directory
root@sc-rdops-vm02-dhcp-52-237:~#

End snippet of our plugin log:

2017-08-21 17:51:29.331860073 -0700 PDT [INFO] No config file found. Using defaults.
2017-08-21 17:51:29.332108642 -0700 PDT [INFO] Starting plugin driver=vsphere log_level=info config="/etc/docker-volume-vsphere.conf"
2017-08-21 17:51:29.332527464 -0700 PDT [INFO] Getting volume data from unix:///var/run/docker.sock
2017-08-21 17:51:29.347369704 -0700 PDT [INFO] Found 1 running or paused containers
2017-08-21 17:51:29.349825909 -0700 PDT [INFO] Successfully retrieved mountsmap=map[]
2017-08-21 17:51:29.35004188 -0700 PDT [WARNING] Initiating recovery mount. refcnt=1 mounted=false dev= name="vol1@sharedvmfs-0"
2017-08-21 17:51:40.348233289 -0700 PDT [WARNING] Exceeded timeout while waiting for device attach to completedevice="/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0" timeout=10s
2017-08-21 17:51:40.348486257 -0700 PDT [INFO] Discovered 1 volumes in use.
2017-08-21 17:51:40.348500763 -0700 PDT [INFO] Volume name=vol1@sharedvmfs-0 count=1 mounted=false device=''
2017-08-21 17:51:40.348506664 -0700 PDT [INFO] Refcounting successfully completed
2017-08-21 17:51:40.348555863 -0700 PDT [INFO] Docker VMDK plugin started port=1019 mock_esx=false version="vSphere Volume Driver v0.5"
2017-08-21 17:51:40.348662839 -0700 PDT [INFO] Going into ServeUnix - Listening on Unix socket address="/run/docker/plugins/vsphere.sock"
2017-08-21 17:51:41.833609036 -0700 PDT [WARNING] Received signal signal=terminated

Should we instruct customers to not have containers running while upgrading the plugin?

@pshahzeb
Contributor

It looks like with Docker 17.06, docker plugin disable removes the /proc/mounts entries of mounted volumes.
So on enable, this triggers mount recovery on our plugin side. But since the volume is already attached, there is no completion event and we exceed the 10s timeout.
During these 10s, if a container is running, Docker tries to talk to our socket; since we only start listening on the socket after the initial recovery logic (syncing the refmap with mountpoints), Docker returns an error that it was unable to dial.
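
To make the ordering concrete, here is a minimal Go sketch of the startup flow described above (the function names and the channel-based attach notification are hypothetical, not the plugin's actual code): recovery runs first and can burn the full 10s timeout, and only afterwards does the plugin create its Unix socket, so any dial from Docker inside that window fails.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// recoverMount stands in for the recovery step: it waits for an "attach
// completed" notification for a volume that a running container still
// references. When the disk is already attached, no such event arrives and
// the whole timeout is burned (10s in the log above).
func recoverMount(vol string, attached <-chan struct{}, timeout time.Duration) {
	select {
	case <-attached:
		fmt.Println("recovered mount for", vol)
	case <-time.After(timeout):
		fmt.Println("exceeded timeout waiting for device attach:", vol)
	}
}

func main() {
	attachEvents := make(chan struct{}) // never fires: the volume is already attached

	// 1. Recovery runs first and blocks for up to 10s.
	recoverMount("vol1@sharedvmfs-0", attachEvents, 10*time.Second)

	// 2. Only after recovery does the plugin create and listen on its socket,
	//    so a "docker plugin enable" issued during step 1 fails to dial it.
	sock := "/tmp/vsphere.sock" // stand-in for /run/docker/plugins/<id>/vsphere.sock
	_ = os.Remove(sock)         // clean up any stale socket from a previous run
	l, err := net.Listen("unix", sock)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer l.Close()
	fmt.Println("listening on", sock)
}
```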

I did a quick hack to make the timeout 0s, and the above issue doesn't occur.

The reason for this behavior is the missing /proc/mounts entries for already-mounted volumes. Here's how I concluded this.

root@sc-rdops-vm02-dhcp-52-237:~# docker plugin install --grant-all-permissions --alias vsphere vmware/docker-volume-vsphere:latest VDVS_LOG_LEVEL=debug
latest: Pulling from vmware/docker-volume-vsphere
ad10a6bbce92: Download complete
Digest: sha256:feaa90e4234880fbf1669015e88cd485b2077270219ff84a730b35207bb622d4
Status: Downloaded newer image for vmware/docker-volume-vsphere:latest
Installed plugin vmware/docker-volume-vsphere:latest

root@sc-rdops-vm02-dhcp-52-237:~# docker run -it -v vol1@sharedvmfs-0:/data alpine
/ # echo "Hello World" > /data/text.txt
/ # cat /data/text.txt
Hello World
/ # cat /data/text.txt
Hello World
/ #

In another terminal:

root@sc-rdops-vm02-dhcp-52-237:~# cat /proc/mounts
..
/dev/disk/by-uuid/66a21986-55c4-4ede-a2d9-37d30aa41a8e /var/lib/docker/plugins/c4ed53b4fc9bf3ff38eb94fc70002b89b54031ef567d289bf07ac3294828877f/propagated-mount ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
/dev/disk/by-uuid/66a21986-55c4-4ede-a2d9-37d30aa41a8e /var/lib/docker/plugins/c4ed53b4fc9bf3ff38eb94fc70002b89b54031ef567d289bf07ac3294828877f/rootfs/mnt/vmdk ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
none /var/lib/docker/aufs/mnt/35d5b295292f7c69fbb90f4c2f2a80e185e41f1694c3a657d348c70b7f2762bc aufs rw,relatime,si=f8a49996dc3b793a,dio,dirperm1 0 0
shm /var/lib/docker/containers/261c86c670207519712a5ff608070c56db338c01e5378729496f5ecdb1eaba3a/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0
/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0 /var/lib/docker/plugins/c4ed53b4fc9bf3ff38eb94fc70002b89b54031ef567d289bf07ac3294828877f/rootfs/mnt/vmdk/vol1@sharedvmfs-0 ext4 rw,relatime,data=ordered 0 0
/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0 /var/lib/docker/plugins/c4ed53b4fc9bf3ff38eb94fc70002b89b54031ef567d289bf07ac3294828877f/propagated-mount/vol1@sharedvmfs-0 ext4 rw,relatime,data=ordered 0 0
nsfs /run/docker/netns/b1a0779c6528 nsfs rw 0 0
...


root@sc-rdops-vm02-dhcp-52-237:~# docker plugin disable vsphere:latest
vsphere:latest


root@sc-rdops-vm02-dhcp-52-237:~# cat /proc/mounts
...
systemd /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,name=systemd 0 0
/dev/disk/by-uuid/66a21986-55c4-4ede-a2d9-37d30aa41a8e /var/lib/docker/plugins ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
/dev/disk/by-uuid/66a21986-55c4-4ede-a2d9-37d30aa41a8e /var/lib/docker/aufs ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
none /var/lib/docker/aufs/mnt/35d5b295292f7c69fbb90f4c2f2a80e185e41f1694c3a657d348c70b7f2762bc aufs rw,relatime,si=f8a49996dc3b793a,dio,dirperm1 0 0
shm /var/lib/docker/containers/261c86c670207519712a5ff608070c56db338c01e5378729496f5ecdb1eaba3a/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0
nsfs /run/docker/netns/b1a0779c6528 nsfs rw 0 0


root@sc-rdops-vm02-dhcp-52-237:~# ls /var/lib/docker/plugins/c4ed53b4fc9bf3ff38eb94fc70002b89b54031ef567d289bf07ac3294828877f/rootfs/mnt/vmdk/
root@sc-rdops-vm02-dhcp-52-237:~# ls /var/lib/docker/plugins/c4ed53b4fc9bf3ff38eb94fc70002b89b54031ef567d289bf07ac3294828877f/propagated-mount/
vol1@sharedvmfs-0
root@sc-rdops-vm02-dhcp-52-237:~# ls /var/lib/docker/plugins/c4ed53b4fc9bf3ff38eb94fc70002b89b54031ef567d289bf07ac3294828877f/propagated-mount/vol1@sharedvmfs-0/
root@sc-rdops-vm02-dhcp-52-237:~# ls /var/lib/docker/plugins/c4ed53b4fc9bf3ff38eb94fc70002b89b54031ef567d289bf07ac3294828877f/propagated-mount/vol1@sharedvmfs-0/

Note from the above that the issue isn't with upgrade anymore.
With the entries missing from /proc/mounts, we issue a mount again for an already-mounted volume, which results in a 10-second timeout. Docker hits this window when it talks to us while a container is running, and the dial error on Docker's side results in failure to enable the plugin.

With Docker 17.06, our plugin's disable and subsequent enable is itself broken when a container is running.
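
For illustration, here is a minimal Go sketch of the /proc/mounts check described above (the function name and the mountpoint prefix are hypothetical, not the plugin's actual code): with the entries removed by the disable, the lookup comes back empty and recovery re-issues the mount.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// mountedDevice scans /proc/mounts and returns the device backing the first
// mountpoint that starts with the given prefix, or "" if no entry is found.
// If "docker plugin disable" has stripped the volume's entries, this returns
// "", so recovery treats an already-attached volume as unmounted and issues
// the mount again.
func mountedDevice(mountPrefix string) (string, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return "", err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text()) // device mountpoint fstype options dump pass
		if len(fields) >= 2 && strings.HasPrefix(fields[1], mountPrefix) {
			return fields[0], nil
		}
	}
	return "", s.Err()
}

func main() {
	// Mountpoint prefix taken from the /proc/mounts dump above.
	dev, err := mountedDevice("/var/lib/docker/plugins/")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	if dev == "" {
		fmt.Println("no entry found: volume looks unmounted, recovery will re-mount it")
		return
	}
	fmt.Println("volume already mounted from", dev)
}
```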

@govint
Contributor

govint commented Aug 22, 2017

Or, always spin the refcount logic as a deferred activity vs. attempting to do recovery and then creating the vsphere socket.
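
A minimal Go sketch of that ordering, assuming the suggestion above (names are hypothetical, not the plugin's actual code): create and listen on the socket first, then run the refcount/recovery pass in a background goroutine so enable never races against it.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// Create the socket and start listening right away, so "docker plugin enable"
	// can always dial us...
	sock := "/tmp/vsphere.sock" // stand-in for /run/docker/plugins/<id>/vsphere.sock
	_ = os.Remove(sock)         // clean up any stale socket from a previous run
	l, err := net.Listen("unix", sock)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer l.Close()
	fmt.Println("listening on", sock)

	// ...and run the refcount/recovery pass in the background instead of
	// blocking startup on it.
	done := make(chan struct{})
	go func() {
		defer close(done)
		time.Sleep(10 * time.Second) // stands in for the recovery/attach wait
		fmt.Println("deferred recovery finished")
	}()
	<-done // a real plugin would run its accept loop here instead of waiting
}
```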

@govint
Contributor

govint commented Aug 22, 2017

Does this work? On Docker 17.07.

docker version
Client:
Version: 17.07.0-ce-rc2
API version: 1.31
Go version: go1.8.3
Git commit: 36ce605
Built: Mon Aug 7 23:51:13 2017
OS/Arch: linux/amd64

Server:
Version: 17.07.0-ce-rc2
API version: 1.31 (minimum version 1.12)
Go version: go1.8.3
Git commit: 36ce605
Built: Mon Aug 7 23:46:39 2017
OS/Arch: linux/amd64
Experimental: false

root@sc-rdops-vm02-dhcp-52-237:~# ps -ef|grep docker

root 21223 21205 0 Aug16 ? 00:00:00 /usr/bin/docker-volume-vsphere --config /etc/docker-volume-vsphere.conf <-----------

root@sc-rdops-vm02-dhcp-52-237:~# docker plugin ls
ID                  NAME                DESCRIPTION                           ENABLED
06faf9abe464        vsphere:latest      VMWare vSphere Docker Volume plugin   true

root@sc-rdops-vm02-dhcp-52-237:~# kill -9 21223

root@sc-rdops-vm02-dhcp-52-237:~# docker plugin ls
ID                  NAME                DESCRIPTION                           ENABLED
06faf9abe464        vsphere:latest      VMWare vSphere Docker Volume plugin   true

root@sc-rdops-vm02-dhcp-52-237:~# ps -ef|grep docker

root 18481 18465 3 03:44 ? 00:00:00 /usr/bin/docker-volume-vsphere --config /etc/docker-volume-vsphere.conf <------------------

Plugin is enabled

root@sc-rdops-vm02-dhcp-52-237:~# docker plugin ls
ID                  NAME                DESCRIPTION                           ENABLED
06faf9abe464        vsphere:latest      VMWare vSphere Docker Volume plugin   true

root@sc-rdops-vm02-dhcp-52-237:~# ps -ef|grep docker

root 18481 18465 0 03:44 ? 00:00:00 /usr/bin/docker-volume-vsphere --config /etc/docker-volume-vsphere.conf

Kill plugin:
root@sc-rdops-vm02-dhcp-52-237:~# kill -9 18481

root@sc-rdops-vm02-dhcp-52-237:~# ps -ef|grep docker

root 18823 18805 0 03:50 ? 00:00:00 /usr/bin/docker-volume-vsphere --config /etc/docker-volume-vsphere.conf <---------

Plugin is enabled.
root@sc-rdops-vm02-dhcp-52-237:~# docker plugin ls
ID                  NAME                DESCRIPTION                           ENABLED
06faf9abe464        vsphere:latest      VMWare vSphere Docker Volume plugin   true

Container is still running
root@sc-rdops-vm02-dhcp-52-237:~# docker ps -a
CONTAINER ID        IMAGE                               COMMAND                  CREATED              STATUS              PORTS               NAMES
11e6de3910e1        busybox                             "sh"                     About a minute ago   Up About a minute                       elastic_leakey
d7a971c2dc19        root/docker-volume-vsphere:rootfs   "/usr/bin/docker-v..."   5 days ago           Created                                 tempContainer

Container is happy:
root@sc-rdops-vm02-dhcp-52-237:~# docker volume create -d vsphere tvol1
tvol1
root@sc-rdops-vm02-dhcp-52-237:~# docker run -it --rm -v tvol1:/mnt busybox
/ # cd /mnt
/mnt # ls
lost+found
/mnt # echo "hello world"> hw
/mnt # echo "hello world"> hw1
/mnt # echo "hello world"> hw2
/mnt # echo "hello world"> hw3
/mnt #
/mnt #
/mnt #
/mnt #
/mnt #
/mnt # ls
hw hw1 hw2 hw3 lost+found <----- after plugin killed and restarted
/mnt #

root@sc-rdops-vm02-dhcp-52-237:~# docker volume inspect tvol1
[
    {
        "CreatedAt": "0001-01-01T00:00:00Z",
        "Driver": "vsphere:latest",
        "Labels": {},
        "Mountpoint": "/mnt/vmdk/tvol1",
        "Name": "tvol1",
        "Options": {},
        "Scope": "global",
        "Status": {
            "access": "read-write",
            "attach-as": "independent_persistent",
            "capacity": {
                "allocated": "15MB",
                "size": "100MB"
            },
            "clone-from": "None",
            "created": "Tue Aug 22 10:48:53 2017",
            "created by VM": "ubuntu-VM0.0",
            "datastore": "sharedVmfs-0",
            "diskformat": "thin",
            "fstype": "ext4",
            "status": "detached"
        }
    }
]

Can create a new volume:
root@sc-rdops-vm02-dhcp-52-237:~# docker volume create -d vsphere tvol2
tvol2

@pshahzeb
Contributor

We should test this with plugin disable -> enable instead of killing the plugin.

@pshahzeb
Copy link
Contributor

> Or, always spin the refcount logic as a deferred activity vs. attempting to do recovery and then creating the vsphere socket.

This won't be entirely correct. It will still trigger the recovery, use /proc/mounts, and try to mount again the volumes that are already attached.
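
A minimal Go sketch of why deferral alone doesn't help (names and data are hypothetical, not the plugin's actual code): the recovery pass keys off the /proc/mounts-derived mount map, so a referenced volume whose entry was removed still looks unmounted and still gets a blocking mount call, whenever the pass runs.

```go
package main

import "fmt"

// recoverRefCounts sketches the refcount recovery pass: for every volume that
// running containers still reference, if the mount map built from /proc/mounts
// has no entry for it, issue a mount. With the entries stripped by
// "docker plugin disable", an already-attached volume lands in the
// "not mounted" branch, and the mount call still blocks waiting for an attach
// that never completes, whether the pass runs at startup or deferred.
func recoverRefCounts(refCounts map[string]int, mounted map[string]bool, mount func(string) error) {
	for vol, refs := range refCounts {
		if refs > 0 && !mounted[vol] {
			fmt.Println("re-mounting", vol)
			_ = mount(vol) // in the real plugin this waits on the attach timeout
		}
	}
}

func main() {
	refCounts := map[string]int{"vol1@sharedvmfs-0": 1} // one running container
	mounted := map[string]bool{}                        // /proc/mounts entries were removed
	recoverRefCounts(refCounts, mounted, func(vol string) error {
		return nil // stand-in for the real mount/attach call
	})
}
```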
