Swarm test fails on CI intermittently due to metadata file missing #1405

Closed
govint opened this issue Jun 13, 2017 · 9 comments

@govint
Contributor

govint commented Jun 13, 2017

The swarm test appears to have failed with the current code on CI - https://ci.vmware.run/vmware/docker-volume-vsphere/494

FAIL: swarm_test.go:91: SwarmTestSuite.TestDockerSwarm

swarm_test.go:141:
c.Assert(status, Equals, true, Commentf("Volume %s is still attached", s.volumeName))
... obtained bool = false
... expected bool = true
... Volume swarm_test_volume_tr8b5apa57f5 is still attached

@govint
Contributor Author

govint commented Jun 13, 2017

From the CI logs,
2017/06/13 07:26:48 Containers running on docker host [192.168.31.83]: swarm_test_service_1c4n7y6zh4t08.1.3gq04c7lxui5zvsa748ocddm7
swarm_test_service_1c4n7y6zh4t08.2.58z2cu3b7q2m93ifntmdp9jwo

2017/06/13 07:26:48 Confirming attached status for volume [swarm_test_volume_tr8b5apa57f5]
2017/06/13 07:26:50 Destroying volume [swarm_test_volume_tr8b5apa57f5]

2017/06/13 07:26:56 Failed to invoke command [docker volume rm swarm_test_volume_tr8b5apa57f5]: exit status 1

2017/06/13 07:26:56 Removing docker service [swarm_test_service_1c4n7y6zh4t08] on VM [192.168.31.81]

2017/06/13 07:26:56 Listing docker service [swarm_test_service_1c4n7y6zh4t08] running on VM [192.168.31.81]

2017/06/13 07:26:56 Failed to invoke command [docker service ps swarm_test_service_1c4n7y6zh4t08]: exit status 1

2017/06/13 07:26:56 Confirming detached status for volume [swarm_test_volume_tr8b5apa57f5]
2017/06/13 07:26:56 Sleep for 2 seconds
2017/06/13 07:26:59 Failed to invoke command [docker volume inspect --format '{{index .Status.status}}' swarm_test_volume_tr8b5apa57f5]: exit status 1
2017/06/13 07:26:59 Sleep for 2 seconds
2017/06/13 07:27:01 Failed to invoke command [docker volume inspect --format '{{index .Status.status}}' swarm_test_volume_tr8b5apa57f5]: exit status 1

@govint
Contributor Author

govint commented Jun 13, 2017

ESX logs show that the volume was attached, and subsequent requests to attach it to the same VM all fail with "already attached". But a detach fails to find the volume (???)

@shaominchen
Contributor

The test logic is trying to verify that the volume status changes from attached to detached after removing the swarm service. However, from the error logs above, it looks like the volume is gone:

Failed to invoke command [docker volume inspect --format '{{index .Status.status}}' swarm_test_volume_tr8b5apa57f5]: exit status 1

If the volume exists, the command "docker volume inspect ..." shouldn't fail, so it's really strange that the volume is gone. No clue so far...
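
For context, the detached-status check essentially boils down to polling docker volume inspect --format '{{index .Status.status}}' until it reports detached. A minimal sketch of that loop (Python here purely for illustration; the actual test is Go and drives the command over SSH, and the helper names and status strings below are assumptions):

import subprocess
import time

def get_volume_status(volume_name):
    """Return the status reported by the volume plugin, or None if inspect fails."""
    # Hypothetical helper: the real suite runs this command remotely from Go code.
    cmd = ["docker", "volume", "inspect",
           "--format", "{{index .Status.status}}", volume_name]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Matches the CI log: inspect exits 1 once the volume itself is gone.
        return None
    return result.stdout.strip()

def wait_for_detached(volume_name, attempts=5, delay=2):
    """Poll until the volume reports 'detached' (status string assumed)."""
    for _ in range(attempts):
        if get_volume_status(volume_name) == "detached":
            return True
        time.sleep(delay)
    return False

With the volume gone, the inspect command itself exits non-zero, so a loop like this never sees any status at all, which matches the repeated "Failed to invoke command" lines in the CI log.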

@shaominchen
Contributor

I've got some clues now. It looks like we are running into issue #1191.

First, let's take a look at the test case logic - in particular steps 12-16:

// Test vDVS usage in swarm mode
//
// Test steps:
// 1. Create a docker service with replicas setting to 1
// 2. Verify the service is up and running with one node
// 3. Verify one container is spawned
// 5. Verify the volume is in attached status
// 6. Scale the service to set replica numbers to 2
// 7. Verify the service is up and running with two nodes
// 8. Verify 2 containers are spawned
// 9. Stop one node of the service
// 10. Verify the service is still running with two nodes
// 11. Verify there are still 2 containers up and running
// 12. Verify the volume is in attached status
// 13. Delete the volume - expect fail
// 14. Remove the service
// 15. Verify the service is gone
// 16. Verify the volume is in detached status

As you can see, step 13 is similar to the scenario in issue #1191 - we are trying to delete a volume that is in attached status. See the ESX logs below:

2017-06-13 07:26:50.661402554 +0000 UTC [INFO] Removing volume name="swarm_test_volume_tr8b5apa57f5"
2017-06-13 07:26:56.637418449 +0000 UTC [ERROR] Failed to remove volume name="swarm_test_volume_tr8b5apa57f5" error="Failed to remove volume: Error caused by file /vmfs/volumes/58884956-881be6e0-1941-000c294680dc/dockvols/11111111-1111-1111-1111-111111111111/swarm_test_volume_tr8b5apa57f5.vmdk"
2017-06-13 07:29:18.254302006 +0000 UTC [INFO] Removing volume name="swarm_test_volume_tr8b5apa57f5"

After step 13, we remove the swarm service and then check the volume status again in step 16. This step is a simple "docker volume inspect ..." command. It failed because the vmfd file is missing - see the logs below:

06/13/17 07:26:59 71629 [MainThread] [INFO ] Started new thread : 232138163968 with target <function execRequestThread at 0x360ae12488> and args (9, 72093, b'{"cmd":"get","details":{"Name":"swarm_test_volume_tr8b5apa57f5"},"version":"2"}')
06/13/17 07:26:59 71629 [Thread-278] [INFO ] Auth DB /etc/vmware/vmdkops/auth-db is missing, allowing all access
06/13/17 07:26:59 71629 [photon.vmfs-datastore1._DEFAULT.swarm_test_volume_tr8b5apa57f5] [ERROR ] Failed to access b'/vmfs/volumes/datastore1/dockvols/11111111-1111-1111-1111-111111111111/swarm_test_volume_tr8b5apa57f5-22129b9236316450.vmfd'
Traceback (most recent call last):
File "/usr/lib/vmware/vmdkops/Python/kvESX.py", line 265, in load
with open(meta_file, "r") as fh:
FileNotFoundError: [Errno 2] No such file or directory: b'/vmfs/volumes/datastore1/dockvols/11111111-1111-1111-1111-111111111111/swarm_test_volume_tr8b5apa57f5-22129b9236316450.vmfd'
06/13/17 07:26:59 71629 [photon.vmfs-datastore1._DEFAULT.swarm_test_volume_tr8b5apa57f5] [ERROR ] Failed to get disk details for /vmfs/volumes/datastore1/dockvols/11111111-1111-1111-1111-111111111111/swarm_test_volume_tr8b5apa57f5.vmdk ('NoneType' object is not subscriptable)

@govint Since you have investigated similar issue before, can you help to take a look?

@shaominchen shaominchen changed the title Swarm test error on CI Swarm test error on CI intermittently due to metadata file missing Jun 13, 2017
@shaominchen shaominchen assigned govint and unassigned shaominchen Jun 13, 2017
@shaominchen shaominchen changed the title Swarm test error on CI intermittently due to metadata file missing Swarm test fails on CI intermittently due to metadata file missing Jun 13, 2017
@shaominchen
Contributor

shaominchen commented Jun 13, 2017

If this is the same issue as Govindan pointed out in #1191 (comment), one temporary workaround to unblock CI is to comment out test step 13.

@msterin
Contributor

msterin commented Jun 13, 2017

Can someone elaborate on what specifically the root cause is? This is a serious glitch.

@shaominchen
Contributor

@msterin I don't know if Govindan figured out the root cause during his previous investigation (#1191 (comment)). As of now, I'm still not clear on the root cause. We don't have any logs to track the metadata file. In the ESX logs, the first mention of the vmfd file is this FileNotFoundError:

06/13/17 07:26:59 71629 [photon.vmfs-datastore1._DEFAULT.swarm_test_volume_tr8b5apa57f5] [ERROR ] Failed to access b'/vmfs/volumes/datastore1/dockvols/11111111-1111-1111-1111-111111111111/swarm_test_volume_tr8b5apa57f5-22129b9236316450.vmfd'
Traceback (most recent call last):
File "/usr/lib/vmware/vmdkops/Python/kvESX.py", line 265, in load
with open(meta_file, "r") as fh:
FileNotFoundError: [Errno 2] No such file or directory: b'/vmfs/volumes/datastore1/dockvols/11111111-1111-1111-1111-111111111111/swarm_test_volume_tr8b5apa57f5-22129b9236316450.vmfd'
06/13/17 07:26:59 71629 [photon.vmfs-datastore1._DEFAULT.swarm_test_volume_tr8b5apa57f5] [ERROR ] Failed to get disk details for /vmfs/volumes/datastore1/dockvols/11111111-1111-1111-1111-111111111111/swarm_test_volume_tr8b5apa57f5.vmdk ('NoneType' object is not subscriptable)
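
Based on this traceback, the failure chain appears to be: the sidecar load hits FileNotFoundError, returns nothing, and the caller then subscripts that None while building the disk details. A rough approximation of that path (this is not the actual kvESX.py/vmdk_ops.py code, and the key name in the commented caller is made up):

import json
import logging

def load_sidecar(meta_file):
    """Approximation of the load path implied by the traceback: read the .vmfd
    metadata sidecar; if it is missing, log the failure and return None."""
    try:
        with open(meta_file, "r") as fh:
            return json.load(fh)
    except FileNotFoundError:
        logging.error("Failed to access %s", meta_file)
        return None

# A caller that assumes the sidecar always loads then fails exactly as logged:
#   kv = load_sidecar(meta_file)
#   status = kv["status"]   # TypeError: 'NoneType' object is not subscriptable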

@msterin
Contributor

msterin commented Jun 14, 2017

I think @govint was referring to disklib behavior when a sidecar was deleted and the disk was not. Or maybe to something else? Let's find that out first, before applying a workaround or changing CI.

@govint
Contributor Author

govint commented Jun 14, 2017

Thankfully, this was a false alarm. The log messages are all valid, though. The issue was with the change being tested in CI (#1260), which removed the calls that handle stale attaches in the service.

Those calls were removed in the change and converted into log messages instead. This works fine for the attach/clone use cases.

The change also works fine for the volume-remove use case, as long as the volume has been detached first. Otherwise, removeVMDK() in the ESX service finds that the volume is already attached, logs the message, and then proceeds to clean up the VMDK (delete the vmdk).

In this case, the test has a container that is using the volume but attempts to remove the volume, which is expected to fail. Since the volume was still attached, the delete failed but removed the KV (sidecar), and hence the issue.

Updated the change (#1260) to restore returning from removeVMDK() if the volume is attached.
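
Roughly, the restored behavior is an early return at the top of the remove path, so an attached volume's KV sidecar is never deleted. A sketch only, with stand-in helpers passed as parameters rather than the actual vmdk_ops code:

import logging

def remove_vmdk_sketch(vmdk_path, vol_name, get_attached_vm, cleanup_vmdk):
    """Sketch of the restored guard: refuse to remove a volume that is still
    attached, so its KV sidecar (.vmfd) is never deleted while the volume is
    in use. get_attached_vm and cleanup_vmdk stand in for the real ESX-side
    helpers and are passed in only to keep the sketch self-contained."""
    attached_vm = get_attached_vm(vmdk_path)
    if attached_vm:
        # The change under test only logged here and fell through to the cleanup,
        # which deleted the sidecar even though the vmdk removal itself failed.
        msg = "Failed to remove volume {0}: attached to VM {1}".format(vol_name, attached_vm)
        logging.error(msg)
        return {"Error": msg}       # report the failure back to the plugin
    return cleanup_vmdk(vmdk_path)  # only delete the vmdk (and sidecar) when detached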

The E2E test did catch the bug in the change being tested, although this change does need a new test in the basic tests category, which I'll add as part of the existing PR.
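
The missing coverage is essentially: removing an attached volume must fail and must leave the volume's metadata intact. An outline of that check (a Python pseudotest; the volume/container names, the "vsphere" driver string, and the expected status string are assumptions, not the repo's actual test code):

import subprocess

def run(cmd):
    """Run a shell command, returning (exit_code, stdout)."""
    p = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return p.returncode, p.stdout.strip()

def test_remove_attached_volume_fails_cleanly():
    vol = "remove_attached_test_vol"                        # hypothetical names
    run("docker volume create --driver=vsphere " + vol)     # driver name assumed
    run("docker run -d --name {0}_user -v {0}:/data busybox sleep 600".format(vol))

    # Removing an attached volume must fail ...
    rc, _ = run("docker volume rm " + vol)
    assert rc != 0, "volume rm should fail while the volume is attached"

    # ... and must not have destroyed the KV sidecar: inspect should still succeed.
    rc, out = run("docker volume inspect --format '{{index .Status.status}}' " + vol)
    assert rc == 0 and out == "attached", "volume metadata should be intact"

    # Cleanup
    run("docker rm -f {0}_user".format(vol))
    run("docker volume rm " + vol)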

@govint govint closed this as completed Jun 14, 2017