
CI is taking more time than usual while running tests on vsan datastore #1057

Closed
shuklanirdesh82 opened this issue Mar 21, 2017 · 12 comments

@shuklanirdesh82
Contributor

shuklanirdesh82 commented Mar 21, 2017

I've noticed many timeouts today, so I went ahead and grabbed some data that raises concern.

The following data is taken from https://ci.vmware.run/vmware/docker-volume-vsphere/1686. It is quite possible that a regression went in over the past couple of months.

Some observations:

  1. Tests take longer for VMs on VSAN than for VMs on VMFS.
  2. One test, TestConcurrency (3462.32s), takes close to an hour, which is unreasonable.
  3. Over time the local testbed has also suffered performance degradation (similar to what @msterin reported in #1054: vmdkops_admin.py volume ls is super slow with a large number of volumes).

Test execution times per run:

| Test                        | ESX 6.5 / VMFS | ESX 6.5 / VSAN | ESX 6.0 / VMFS | ESX 6.0 / VSAN |
| --------------------------- | -------------- | -------------- | -------------- | -------------- |
| TestVolumeCreationFirstTime | 15.99s         | 56.42s         | 17.14s         | 163.72s        |
| TestVolumeProperties        | 62.57s         | 131.33s        | 64.40s         | 348.25s        |
| test-esx                    | ~10 minutes    | n/a            | ~13 minutes    | n/a            |
| TestRefCnt                  | 163.08s        | 180.76s        | 163.13s        | 275.41s        |
| TestSanity                  | 10.33s         | 40.93s         | 17.01s         | 138.82s        |
| TestConcurrency             | 132.95s        | 731.32s        | 105.81s        | 3462.32s       |

Note: the last successful run I can recall before TestConcurrency was commented out is https://ci.vmware.run/vmware/docker-volume-vsphere/1020, roughly 2 months ago, where the test completion times were not as bad as those shown above.

/cc @govint @msterin @pdhamdhere

@shuklanirdesh82 shuklanirdesh82 added this to the 0.13 milestone Mar 21, 2017
@govint
Contributor

govint commented Mar 21, 2017

The refcnt test takes a minimum of 2+ minutes and a maximum of a little over 3 minutes, and it is run 3-4 times in this suite (why??). Each time, most of that is spent waiting for the plugin to restart (after the volumes are created and attached to 5 containers). This looks like a plugin start-up issue (managed plugin?).

The refcnt test is also flaky in that it greps the logs to verify that the refcount for a volume is reported correctly.

Let's change the plugin to add a refcnt field to the volume status, so the test (and any user) can see how many users a volume has. The plugin can populate the refcnt field after fetching the volume metadata from the ESX service; see the sketch below.

With the managed plugin, it is preferable to use the API to determine volume usage rather than grepping the plugin logs.

Beyond that, the real question is why the plugin takes so long to come up and report a single volume in use.
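
A minimal sketch of the idea, assuming a plugin-side in-memory refcount map and an ESX metadata lookup (the names and shapes here are hypothetical, not the actual docker-volume-vsphere code):

```go
// Hypothetical sketch: expose the refcount in the volume Status map that the
// plugin returns from its Get/List handlers, so callers can use
// `docker volume inspect` instead of grepping plugin logs.
package main

import "fmt"

// refCounts stands in for the plugin's existing in-memory refcount map,
// keyed by volume name (hypothetical name).
var refCounts = map[string]int{"DefaultTestVol": 1}

// VolumeStatus mimics the free-form Status map a Docker volume plugin can
// attach to each volume in its responses.
type VolumeStatus map[string]interface{}

// getVolumeStatus merges metadata fetched from the ESX service (esxMeta,
// assumed shape) with the local refcount for the volume.
func getVolumeStatus(name string, esxMeta map[string]interface{}) VolumeStatus {
	status := VolumeStatus{}
	for k, v := range esxMeta {
		status[k] = v
	}
	// Expose how many containers currently use the volume.
	status["refcount"] = refCounts[name]
	return status
}

func main() {
	meta := map[string]interface{}{"datastore": "vsanDatastore", "capacity": "100MB"}
	fmt.Println(getVolumeStatus("DefaultTestVol", meta))
}
```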

@msterin
Contributor

msterin commented Mar 21, 2017

Does the degradation apply to the ESX unit tests?

@shuklanirdesh82
Contributor Author

shuklanirdesh82 commented Mar 21, 2017

@msterin's comment #1057 (comment):

> Does the degradation apply to the ESX unit tests?

It seems so to me. Currently test-esx runs twice, once against ESX 6.5 and once against ESX 6.0, and the difference is 3+ minutes (the run against ESX 6.0 takes more time).

@shuklanirdesh82
Contributor Author

@govint's comment #1057 (comment):

> it is run 3-4 times in this suite (why??)

vm-test (go test) runs 4 times: twice from VMs on VMFS backed by ESX 6.0/6.5, and twice from VMs on a VSAN datastore backed by ESX 6.0/6.5.

@govint please let me know if I have missed any question from your comment. If I understand correctly, you are proposing to change the plugin to report the refcount inline with the volume status.

@govint
Contributor

govint commented Mar 22, 2017

@shuklanirdesh82 the CI run doesn't have the plugin logs; where can we get those?

@govint
Contributor

govint commented Mar 22, 2017

Ran locally with Photon VMs on VMFS:
Running tests on tcp://10.110.48.48:2375 (may take a while)...
--- PASS: TestSanity (4.96s)
sanity_test.go:199: Successfully connected to tcp://10.110.48.48:2375
sanity_test.go:199: Successfully connected to tcp://10.110.48.47:2375
sanity_test.go:216: Creating vol=DefaultTestVol on client tcp://10.110.48.48:2375.
sanity_test.go:103: Running cmd=&[touch /mnt/testvol/DefaultTestVol/file_to_touch] with vol=DefaultTestVol on client tcp://10.110.48.48:2375
sanity_test.go:103: Running cmd=&[stat /mnt/testvol/DefaultTestVol/file_to_touch] with vol=DefaultTestVol on client tcp://10.110.48.48:2375
=== RUN TestConcurrency
Running concurrent tests on tcp://10.110.48.48:2375 and tcp://10.110.48.47:2375 (may take a while)...
Running create/delete concurrent test...
Running same docker host concurrent create/delete test on tcp://10.110.48.48:2375...
Running clone concurrent test...
--- PASS: TestConcurrency (22.16s)
sanity_test.go:199: Successfully connected to tcp://10.110.48.48:2375
sanity_test.go:199: Successfully connected to tcp://10.110.48.47:2375
PASS
coverage: 40.4% of statements

We should be able to figure out exactly what's causing the slowness on CI (only) once we have the plugin and ESX service logs.

@govint
Contributor

govint commented Mar 28, 2017

The testbed (single-node VSAN) is being, or has been, retired. Closing this issue unless the slow runs are observed again.

@govint govint closed this as completed Mar 28, 2017
@shuklanirdesh82
Contributor Author

Reopening this issue; #1088 has disabled tests on vsan60.

@pdhamdhere pdhamdhere modified the milestones: 0.14, 0.13 Mar 30, 2017
@shuklanirdesh82
Contributor Author

shuklanirdesh82 commented Apr 5, 2017

The issue has been root caused. Logs have been collected for both of the following runs, and we are following up further with the VSAN team.

on VMFS: /vmfs/volumes/datastore1/esx-dhcp-192-168-31-62.ci.drone.local-2017-04-04--23.26-91298.tgz

root@photon-klJ0Zth7o [ ~ ]# time docker volume create -d vsphere --name=vmdk_vmfs1
vmdk_vmfs1

real	0m2.864s
user	0m0.004s
sys	0m0.004s

on VSAN: /vmfs/volumes/datastore1/esx-dhcp-192-168-31-62.ci.drone.local-2017-04-04--23.32-93788.tgz

root@sc-rdops-vm18-dhcp-57-89:~# time docker volume create -d vsphere --name=vmdk_vsan4
vmdk_vsan4

real	0m6.913s
user	0m0.008s
sys	0m0.000s
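
For a steadier comparison than a single `time` invocation, a measurement loop like the following could average the create latency over several volumes (a hypothetical helper, not part of the repo; it assumes the vsphere driver is installed and `docker` is on PATH):

```go
// Hypothetical latency probe: time `docker volume create -d vsphere` a few
// times, report the average, and clean up the throwaway volumes.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	const iterations = 5
	var total time.Duration
	for i := 0; i < iterations; i++ {
		name := fmt.Sprintf("latency_test_vol_%d", i) // throwaway volume name
		start := time.Now()
		out, err := exec.Command("docker", "volume", "create",
			"-d", "vsphere", "--name="+name).CombinedOutput()
		if err != nil {
			fmt.Printf("create failed: %v: %s\n", err, out)
			return
		}
		elapsed := time.Since(start)
		total += elapsed
		fmt.Printf("create %s took %v\n", name, elapsed)
		// Remove the volume so repeated runs start from the same state.
		exec.Command("docker", "volume", "rm", name).Run()
	}
	fmt.Printf("average create latency over %d runs: %v\n", iterations, total/iterations)
}
```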

/cc @pdhamdhere @govint @msterin

@msterin
Contributor

msterin commented Apr 5, 2017

> The issue has been root caused.

So what is the root cause?

@shuklanirdesh82
Contributor Author

> So what is the root cause?

The test run against VSAN is not performant; we are following up with the VSAN team.

Keeping this open, as discussed offline (thanks @msterin!).

@tusharnt tusharnt modified the milestones: v1 GA, 0.14 Apr 7, 2017
@tusharnt tusharnt modified the milestones: 0.15, v1 GA May 2, 2017
@shuklanirdesh82 shuklanirdesh82 changed the title Performance has been degraded recently (CI is taking more time than usual) CI is taking more time than usual while running tests on vsan datastore May 8, 2017
@shuklanirdesh82
Contributor Author

shuklanirdesh82 commented May 16, 2017

Checked with the VSAN performance team using the performance data collected from our run; the outcome is in line with their analysis as well. There are some extra steps when creating an object on a vSAN datastore compared with a VMFS datastore. In addition, our CI uses nested ESX VMs. We have planned to improve our test execution, hence closing this issue as wontfix. Moreover, we can avoid using nested VMs by configuring a dedicated ESX box for the test run.
