Swarm test fails on CI intermittently due to metadata file missing #1405
From the CI logs:

2017/06/13 07:26:48 Confirming attached status for volume [swarm_test_volume_tr8b5apa57f5]
2017/06/13 07:26:56 Failed to invoke command [docker volume rm swarm_test_volume_tr8b5apa57f5]: exit status 1
2017/06/13 07:26:56 Removing docker service [swarm_test_service_1c4n7y6zh4t08] on VM [192.168.31.81]
2017/06/13 07:26:56 Listing docker service [swarm_test_service_1c4n7y6zh4t08] running on VM [192.168.31.81]
2017/06/13 07:26:56 Failed to invoke command [docker service ps swarm_test_service_1c4n7y6zh4t08]: exit status 1
2017/06/13 07:26:56 Confirming detached status for volume [swarm_test_volume_tr8b5apa57f5]
The ESX logs show that the volume was attached, and subsequent requests to attach it to the same VM all fail with "already attached". Yet a detach fails to find the volume (???)
The test logic verifies that the volume status changes from attached to detached after the swarm service is removed. However, from the error logs above, it looks like the volume is gone:

Failed to invoke command [docker volume inspect --format '{{index .Status.status}}' swarm_test_volume_tr8b5apa57f5]: exit status 1

If the volume existed, the "docker volume inspect ..." command shouldn't fail. It's really strange that the volume is gone, though. No clue so far...
I've got some clues now. It looks like we are running into issue #1191. First, let's take a look at the test case logic - see steps 12-16 (// Test vDVS usage in swarm mode). As you can see, step 13 matches the scenario in issue #1191: we try to delete a volume that is still in attached status - see the ESX log below:

2017-06-13 07:26:50.661402554 +0000 UTC [INFO] Removing volume name="swarm_test_volume_tr8b5apa57f5"

After step 13, we remove the swarm service and then check the volume status again in step 16. That step is a simple "docker volume inspect ..." command. It failed because the vmfd file is missing - see the log below:

06/13/17 07:26:59 71629 [MainThread] [INFO ] Started new thread : 232138163968 with target <function execRequestThread at 0x360ae12488> and args (9, 72093, b'{"cmd":"get","details":{"Name":"swarm_test_volume_tr8b5apa57f5"},"version":"2"}')

@govint Since you have investigated a similar issue before, can you help take a look?
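For context, here is a rough sketch of what the step-16 check amounts to. The real check lives in the Go E2E test (swarm_test.go); this Python version only illustrates the "docker volume inspect" call and the attached/detached polling, and the helper names volume_status() and wait_until_detached() are made up for the sketch:

import subprocess
import time

def volume_status(name):
    """Return the vDVS-reported status string for a volume, or None if inspect fails."""
    cmd = ["docker", "volume", "inspect",
           "--format", "{{index .Status.status}}", name]
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.DEVNULL)
        return out.decode().strip()
    except subprocess.CalledProcessError:
        # This is the failure seen in the CI logs: the volume (or its
        # metadata) is gone, so "docker volume inspect" exits with status 1.
        return None

def wait_until_detached(name, timeout=60, interval=2):
    """Poll until the volume reports "detached", or give up after the timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if volume_status(name) == "detached":
            return True
        time.sleep(interval)
    return False

if __name__ == "__main__":
    print(wait_until_detached("swarm_test_volume_tr8b5apa57f5"))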
If this is the same issue Govindan pointed out in #1191 (comment), one temporary workaround to unblock CI is to comment out test step 13 and bypass the issue.
Can someone elaborate on what specifically the root cause is? This is a serious glitch.
@msterin I don't know whether Govindan figured out the root cause during his earlier investigation (#1191 (comment)). As of now, I'm still not clear on the root cause. We don't have any logs that track the metadata file. In the ESX logs, the first occurrence of the vmfd file is this FileNotFoundError:

06/13/17 07:26:59 71629 [photon.vmfs-datastore1._DEFAULT.swarm_test_volume_tr8b5apa57f5] [ERROR ] Failed to access b'/vmfs/volumes/datastore1/dockvols/11111111-1111-1111-1111-111111111111/swarm_test_volume_tr8b5apa57f5-22129b9236316450.vmfd'
I think @govint was referring to disklib behavior when a sidecar was deleted but the disk was not. Or maybe to something else? Let's find that out first, before adding a workaround or changing CI.
Thankfully, this was a false alarm. The log messages are all valid, though. The issue was with the change being tested in CI (#1260), which removed the calls that handle stale attaches in the service and turned them into log messages instead. That works fine for the attach/clone use cases, and it also works for the volume-remove use case as long as the volume has been detached first. Otherwise, removeVMDK() in the ESX service finds that the volume is still attached, logs the message, and then proceeds to clean up the VMDK (delete the vmdk). In this case, the test has a container using the volume but attempts to remove the volume, which is expected to fail. Since the volume was still attached, the delete failed but had already removed the KV (sidecar) - hence the issue. Updated the change (#1260) to restore returning from removeVMDK() if the volume is attached. The E2E test did catch the bug in the change being tested, although the change does need a new test in the basic tests category, which I'll add as part of the existing PR.
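A minimal sketch of the restored behavior, assuming a simplified shape for the ESX-side code - this is not the actual esx_service implementation, and get_attached_vm(), delete_kv_sidecar() and delete_disk() are placeholder stubs standing in for the real helpers:

def get_attached_vm(vmdk_path):
    """Placeholder: return the VM the disk is attached to, or None if detached."""
    return None

def delete_kv_sidecar(vmdk_path):
    """Placeholder: remove the volume's .vmfd metadata (KV sidecar) file."""

def delete_disk(vmdk_path):
    """Placeholder: remove the VMDK itself; fails if the disk is still attached."""

def removeVMDK(vmdk_path):
    # The restored guard: return an error up front if the volume is attached.
    # In the broken change this guard was only a log message, so the KV
    # sidecar below got deleted even though the disk delete then failed,
    # leaving an attached volume with no metadata file.
    attached_vm = get_attached_vm(vmdk_path)
    if attached_vm:
        return {"Error": "volume %s is in use by VM %s" % (vmdk_path, attached_vm)}

    delete_kv_sidecar(vmdk_path)
    delete_disk(vmdk_path)
    return None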
The swarm test seems to have errored with the current code in CI - https://ci.vmware.run/vmware/docker-volume-vsphere/494
FAIL: swarm_test.go:91: SwarmTestSuite.TestDockerSwarm
swarm_test.go:141:
c.Assert(status, Equals, true, Commentf("Volume %s is still attached", s.volumeName))
... obtained bool = false
... expected bool = true
... Volume swarm_test_volume_tr8b5apa57f5 is still attached