
CI is taking more time than usual while running tests on vsan datastore #1057

Closed
shuklanirdesh82 opened this issue Mar 21, 2017 · 12 comments

@shuklanirdesh82
Contributor

shuklanirdesh82 commented Mar 21, 2017

I've noticed many timeouts today, so I went ahead and grabbed some data that raises concern.

The following data is taken from https://ci.vmware.run/vmware/docker-volume-vsphere/1686. It is quite possible that a regression went in over the past couple of months.

Some observations:

  1. Tests take longer for VMs on VSAN than for VMs on VMFS.
  2. One test, TestConcurrency (3462.32s), takes close to an hour, which is unreasonable.
  3. Over time the local testbed has also suffered performance degradation (similar to what @msterin reported in #1054: vmdkops_admin.py volume ls is super slow with a large number of volumes).

Test execution times per run:

| Test                        | ESX 6.5 / VMFS | ESX 6.5 / VSAN | ESX 6.0 / VMFS | ESX 6.0 / VSAN |
| --------------------------- | -------------- | -------------- | -------------- | -------------- |
| TestVolumeCreationFirstTime | 15.99s         | 56.42s         | 17.14s         | 163.72s        |
| TestVolumeProperties        | 62.57s         | 131.33s        | 64.40s         | 348.25s        |
| test-esx                    | ~10 minutes    | n/a            | ~13 minutes    | n/a            |
| TestRefCnt                  | 163.08s        | 180.76s        | 163.13s        | 275.41s        |
| TestSanity                  | 10.33s         | 40.93s         | 17.01s         | 138.82s        |
| TestConcurrency             | 132.95s        | 731.32s        | 105.81s        | 3462.32s       |

Note: the last successful run I can recall before TestConcurrency was commented out is https://ci.vmware.run/vmware/docker-volume-vsphere/1020, roughly 2 months ago, where the test completion times were not as bad as those shown above.

/cc @govint @msterin @pdhamdhere

@shuklanirdesh82 shuklanirdesh82 added this to the 0.13 milestone Mar 21, 2017
@govint
Contributor

govint commented Mar 21, 2017

The refcnt test takes a minimum of 2+ minutes and a maximum of a little over 3 minutes, and it is run 3-4 times in this suite (why??). Each time, most of that is spent waiting for the plugin to restart (after the volumes are created and attached to 5 containers). This looks like a plugin start-up issue (managed plugin?).

The refcnt test is also flaky in that it greps the logs to verify that the refcount for a volume is reported correctly.

Let's change the plugin to add a refcnt field to the volume status, so the test (and any user) can see how many users a volume has. The plugin can populate the refcnt field after fetching the volume metadata from the ESX service; see the sketch below.

With the managed plugin, it is preferable to use the API to determine volume usage rather than grepping the plugin logs.

Beyond that, the real question is why the plugin takes so long to come up and report a single volume in use.
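
A minimal sketch of the idea, assuming a plugin-side in-memory refcount map and an ESX metadata lookup (the names and shapes here are hypothetical, not the actual docker-volume-vsphere code):

```go
// Hypothetical sketch: expose the refcount in the volume Status map that the
// plugin returns from its Get/List handlers, so callers can use
// `docker volume inspect` instead of grepping plugin logs.
package main

import "fmt"

// refCounts stands in for the plugin's existing in-memory refcount map,
// keyed by volume name (hypothetical name).
var refCounts = map[string]int{"DefaultTestVol": 1}

// VolumeStatus mimics the free-form Status map a Docker volume plugin can
// attach to each volume in its responses.
type VolumeStatus map[string]interface{}

// getVolumeStatus merges metadata fetched from the ESX service (esxMeta,
// assumed shape) with the local refcount for the volume.
func getVolumeStatus(name string, esxMeta map[string]interface{}) VolumeStatus {
	status := VolumeStatus{}
	for k, v := range esxMeta {
		status[k] = v
	}
	// Expose how many containers currently use the volume.
	status["refcount"] = refCounts[name]
	return status
}

func main() {
	meta := map[string]interface{}{"datastore": "vsanDatastore", "capacity": "100MB"}
	fmt.Println(getVolumeStatus("DefaultTestVol", meta))
}
```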

@msterin
Contributor

msterin commented Mar 21, 2017

Does the degradation apply to the ESX unit tests?

@shuklanirdesh82
Contributor Author

shuklanirdesh82 commented Mar 21, 2017

@msterin's comment #1057 (comment):

> Does the degradation apply to the ESX unit tests?

It seems so to me. Currently test-esx runs twice, once against ESX 6.5 and once against ESX 6.0, and the difference is 3+ minutes (the run against ESX 6.0 takes more time).

@shuklanirdesh82
Contributor Author

@govint's comment #1057 (comment):

> it is run 3-4 times in this suite (why??)

vm-test (go test) runs 4 times: twice from VMs on VMFS backed by ESX 6.0/6.5, and twice from VMs on a VSAN datastore backed by ESX 6.0/6.5.

@govint please let me know if I have missed any question from your comment. If I understand correctly, you are proposing to change the plugin to report the refcount inline with the volume status.

@govint
Contributor

govint commented Mar 22, 2017

@shuklanirdesh82 the CI run doesn't have the plugin logs; where can we get those?

@govint
Contributor

govint commented Mar 22, 2017

Ran locally with Photon VMs on VMFS:
Running tests on tcp://10.110.48.48:2375 (may take a while)...
--- PASS: TestSanity (4.96s)
sanity_test.go:199: Successfully connected to tcp://10.110.48.48:2375
sanity_test.go:199: Successfully connected to tcp://10.110.48.47:2375
sanity_test.go:216: Creating vol=DefaultTestVol on client tcp://10.110.48.48:2375.
sanity_test.go:103: Running cmd=&[touch /mnt/testvol/DefaultTestVol/file_to_touch] with vol=DefaultTestVol on client tcp://10.110.48.48:2375
sanity_test.go:103: Running cmd=&[stat /mnt/testvol/DefaultTestVol/file_to_touch] with vol=DefaultTestVol on client tcp://10.110.48.48:2375
=== RUN TestConcurrency
Running concurrent tests on tcp://10.110.48.48:2375 and tcp://10.110.48.47:2375 (may take a while)...
Running create/delete concurrent test...
Running same docker host concurrent create/delete test on tcp://10.110.48.48:2375...
Running clone concurrent test...
--- PASS: TestConcurrency (22.16s)
sanity_test.go:199: Successfully connected to tcp://10.110.48.48:2375
sanity_test.go:199: Successfully connected to tcp://10.110.48.47:2375
PASS
coverage: 40.4% of statements

We should be able to figure out exactly what's causing the slowness on CI (only) once we have the plugin and ESX service logs.

@govint
Contributor

govint commented Mar 28, 2017

The testbed (single-node VSAN) is being, or has been, retired. Closing this issue unless the slow runs are observed again.

@govint govint closed this as completed Mar 28, 2017
@shuklanirdesh82
Contributor Author

Reopening this issue; #1088 has disabled tests on vsan60.

@pdhamdhere pdhamdhere modified the milestones: 0.14, 0.13 Mar 30, 2017
@shuklanirdesh82
Contributor Author

shuklanirdesh82 commented Apr 5, 2017

The issue has been root caused. Logs have been collected for both of the following runs, and we are following up further with the VSAN team.

on VMFS: /vmfs/volumes/datastore1/esx-dhcp-192-168-31-62.ci.drone.local-2017-04-04--23.26-91298.tgz

root@photon-klJ0Zth7o [ ~ ]# time docker volume create -d vsphere --name=vmdk_vmfs1
vmdk_vmfs1

real	0m2.864s
user	0m0.004s
sys	0m0.004s

on VSAN: /vmfs/volumes/datastore1/esx-dhcp-192-168-31-62.ci.drone.local-2017-04-04--23.32-93788.tgz

root@sc-rdops-vm18-dhcp-57-89:~# time docker volume create -d vsphere --name=vmdk_vsan4
vmdk_vsan4

real	0m6.913s
user	0m0.008s
sys	0m0.000s
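
For a steadier comparison than a single `time` invocation, a measurement loop like the following could average the create latency over several volumes (a hypothetical helper, not part of the repo; it assumes the vsphere driver is installed and `docker` is on PATH):

```go
// Hypothetical latency probe: time `docker volume create -d vsphere` a few
// times, report the average, and clean up the throwaway volumes.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	const iterations = 5
	var total time.Duration
	for i := 0; i < iterations; i++ {
		name := fmt.Sprintf("latency_test_vol_%d", i) // throwaway volume name
		start := time.Now()
		out, err := exec.Command("docker", "volume", "create",
			"-d", "vsphere", "--name="+name).CombinedOutput()
		if err != nil {
			fmt.Printf("create failed: %v: %s\n", err, out)
			return
		}
		elapsed := time.Since(start)
		total += elapsed
		fmt.Printf("create %s took %v\n", name, elapsed)
		// Remove the volume so repeated runs start from the same state.
		exec.Command("docker", "volume", "rm", name).Run()
	}
	fmt.Printf("average create latency over %d runs: %v\n", iterations, total/iterations)
}
```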

/cc @pdhamdhere @govint @msterin

@msterin
Contributor

msterin commented Apr 5, 2017

> The issue has been root caused.

So what is the root cause?

@shuklanirdesh82
Contributor Author

> So what is the root cause?

The test run against VSAN is not performant; we are following up with the VSAN team.

Keeping this open, as discussed offline (thanks @msterin!).

@tusharnt tusharnt modified the milestones: v1 GA, 0.14 Apr 7, 2017
@tusharnt tusharnt modified the milestones: 0.15, v1 GA May 2, 2017
@shuklanirdesh82 shuklanirdesh82 changed the title Performance has been degraded recently (CI is taking more time than usual) CI is taking more time than usual while running tests on vsan datastore May 8, 2017
@shuklanirdesh82
Contributor Author

shuklanirdesh82 commented May 16, 2017

Checked with the VSAN performance team using the performance data collected from our run; the outcome is in line with their analysis as well. There are some extra steps when creating an object on a vSAN datastore compared with a VMFS datastore. In addition, our CI uses nested ESX VMs. We have planned to improve our test execution, hence closing this issue as wontfix. Moreover, we can avoid using nested VMs by configuring a dedicated ESX box for the test run.
