
Synchronise disklib operations in kvESX, add threadutils module. #708

Merged
merged 7 commits on Nov 7, 2016

Conversation

brunotm
Contributor

brunotm commented Nov 5, 2016

Synchronise disklib operations in kvESX; fix indentation and misc. issues.
Moved threading code to utils/threadutils.
Added threadutils.LockManager class.
Added threadutils.get_lock_decorator.
And misc. cleanups.

//CC @msterin
Closing #698 in favor of this.
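
For readers skimming the diff, a minimal sketch of what these two pieces could look like; the real implementation lives in utils/threadutils in this PR, and the exact signatures below are assumptions, not the merged code:

import functools
import threading

class LockManager:
    """Hands out one lock per name (e.g. per volume name) so that
    concurrent requests touching the same volume are serialized."""

    def __init__(self):
        self._lock = threading.Lock()  # guards the dict of named locks
        self._locks = {}

    def get_lock(self, name):
        # Create-or-fetch and return while still holding self._lock, so
        # two threads can never race to create different locks for the
        # same name.
        with self._lock:
            if name not in self._locks:
                self._locks[name] = threading.Lock()
            return self._locks[name]

    def list_locks(self):
        with self._lock:
            return list(self._locks)

def get_lock_decorator(lock):
    """Return a decorator that runs the wrapped function under `lock`."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with lock:
                return func(*args, **kwargs)
        return wrapper
    return decorate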

…nchronize disklib operations.

kvESX: Fix indentation, etc.
kvESX: Move comments to module and functions docstring
kvESX: Remove unused c_void_p import
@kerneltime
Contributor

I cleaned up the testbed before the test run failed. The relevant bit from the test log:

sanity_test.go:265: Parallel test failed, err: Error response from daemon: Error while removing volume volTestP10: remove volTestP10: VolumeDriver.Remove: Failed to remove /vmfs/volumes/vsanDatastore/dockvols/volTestP10.vmdk. Failed to delete virtual disk: Failed to lock the file (16392).
FAIL

The file is still present on disk:

[root@sabu-esx-01:/vmfs/volumes/vsan:52fe282b8b5a398c-e41bd80f9e940d46/5c448a57-c330-6d3f-3539-000c29e3609e] ls
policies                          volTestP01.vmdk                   volTestP11-7e7be1795d76687b.vmfd
volTestP01-60fce49a4bd304fb.vmfd  volTestP10.vmdk                   volTestP11.vmdk

@govint
Contributor

govint commented Nov 7, 2016

@kerneltime can we get a ps -ef on the ESX running the CI?


@brunotm
Contributor Author

brunotm commented Nov 7, 2016

@kerneltime
This error comes from vmkfstools itself. It seems the rate of change in the parallel tests on vSAN-backed datastores eventually leads to timing/locking issues (the docker host tries to modify a volume faster than vSAN releases the lock on the file). We already have a workaround in place in kvESX for this situation.

We can make removeVMDK retryable for this error (I think this will be the last case of this situation). I'll make a patch.

11/07/16 05:22:44 3519970 [Ubuntu.1404.vsan-volTestP10] [INFO ] executeRequest 'remove' completed with ret={u'Error': 'Failed to remove /vmfs/volumes/vsanDatastore/dockvols/volTestP10.vmdk. Failed to delete virtual disk: Failed to lock the file (16392).\n'}
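
Roughly, the retry could look like the sketch below; the error string is taken from the log above, while the function name, retry budget, and subprocess plumbing are illustrative assumptions rather than the actual patch:

import logging
import subprocess
import time

LOCK_RETRIES = 3       # hypothetical retry budget
RETRY_SLEEP_SEC = 1    # pause between attempts

def remove_vmdk(vmdk_path):
    """Retry `vmkfstools -U` while the vSAN file lock is still held."""
    for attempt in range(LOCK_RETRIES + 1):
        proc = subprocess.run(["/sbin/vmkfstools", "-U", vmdk_path],
                              capture_output=True, text=True)
        if proc.returncode == 0:
            return None  # matches "completed with ret=None" in the logs
        output = proc.stdout + proc.stderr
        if "Failed to lock the file" not in output or attempt == LOCK_RETRIES:
            return {"Error": "Failed to remove %s. %s" % (vmdk_path, output)}
        logging.info("*** removeVMDK: %s, couldn't lock volume for removal. "
                     "Retrying...", vmdk_path)
        time.sleep(RETRY_SLEEP_SEC)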

@brunotm
Contributor Author

brunotm commented Nov 7, 2016

With the retryable removeVMDK, while trying to remove an externally opened vmdk file (opened with less) and then closing it:

11/07/16 07:08:02 733061 [MainThread] [DEBUG  ] lib.vmci_get_one_op returns 9, buffer '{"cmd":"remove","details":{"Name":"vol102469"}}'
11/07/16 07:08:02 733061 [dev01-vol102469] [DEBUG  ] LockManager.get_lock: new lock: vol102469, <thread.lock object at 0xfff9f720>
11/07/16 07:08:02 733061 [dev01-vol102469] [DEBUG  ] Trying to aquire lock: vol102469
11/07/16 07:08:02 733061 [dev01-vol102469] [DEBUG  ] Aquired lock: vol102469
11/07/16 07:08:02 733061 [dev01-vol102469] [DEBUG  ] get_datastores: [('datastore1', '580d9727-1feb0bee-382e-b8aeedeb6b16', '/vmfs/volumes/580d9727-1feb0bee-382e-b8aeedeb6b16/dockvols'), ('datastore2', '580d9bb1-b8c6bf18-ada6-b8aeedeb6b16', '/vmfs/volumes/580d9bb1-b8c6bf18-ada6-b8aeedeb6b16/dockvols')]
11/07/16 07:08:02 733061 [dev01-vol102469] [DEBUG  ] get_datastore_name: path=580d9bb1-b8c6bf18-ada6-b8aeedeb6b16 name=['datastore2']
11/07/16 07:08:02 733061 [dev01-vol102469] [DEBUG  ] Authorize: vm_uuid=4236919a-d0fd-831b-4c58-4e8bf8b0cfff
11/07/16 07:08:02 733061 [dev01-vol102469] [DEBUG  ] Authorize: datastore=datastore2
11/07/16 07:08:02 733061 [dev01-vol102469] [DEBUG  ] Authorize: cmd=remove
11/07/16 07:08:02 733061 [dev01-vol102469] [DEBUG  ] Authorize: opt={}
11/07/16 07:08:02 733061 [dev01-vol102469] [DEBUG  ] VM 4236919a-d0fd-831b-4c58-4e8bf8b0cfff does not belong to any tenant
11/07/16 07:08:02 733061 [dev01-vol102469] [DEBUG  ] Found /vmfs/volumes/datastore2/dockvols, returning
11/07/16 07:08:02 733061 [dev01-vol102469] [DEBUG  ] executeRequest None /vmfs/volumes/datastore2/dockvols
11/07/16 07:08:03 733061 [dev01-vol102469] [INFO   ] *** removeVMDK: /vmfs/volumes/datastore2/dockvols/vol102469.vmdk
11/07/16 07:08:03 733061 [dev01-vol102469] [DEBUG  ] Running cmd /sbin/vmkfstools -U  /vmfs/volumes/datastore2/dockvols/vol102469.vmdk
11/07/16 07:08:06 733061 [dev01-vol102469] [INFO   ] *** removeVMDK: /vmfs/volumes/datastore2/dockvols/vol102469.vmdk, coudn't lock volume for removal. Retrying...
11/07/16 07:08:07 733061 [dev01-vol102469] [DEBUG  ] Running cmd /sbin/vmkfstools -U  /vmfs/volumes/datastore2/dockvols/vol102469.vmdk
11/07/16 07:08:10 733061 [dev01-vol102469] [INFO   ] *** removeVMDK: /vmfs/volumes/datastore2/dockvols/vol102469.vmdk, coudn't lock volume for removal. Retrying...
11/07/16 07:08:11 733061 [dev01-vol102469] [DEBUG  ] Running cmd /sbin/vmkfstools -U  /vmfs/volumes/datastore2/dockvols/vol102469.vmdk
11/07/16 07:08:13 733061 [dev01-vol102469] [INFO   ] *** removeVMDK: /vmfs/volumes/datastore2/dockvols/vol102469.vmdk, coudn't lock volume for removal. Retrying...
11/07/16 07:08:14 733061 [dev01-vol102469] [DEBUG  ] Running cmd /sbin/vmkfstools -U  /vmfs/volumes/datastore2/dockvols/vol102469.vmdk
11/07/16 07:08:14 733061 [dev01-vol102469] [INFO   ] executeRequest 'remove' completed with ret=None
11/07/16 07:08:14 733061 [dev01-vol102469] [DEBUG  ] lib.vmci_reply: VMCI replied with errcode 0
11/07/16 07:08:14 733061 [dev01-vol102469] [DEBUG  ] Released lock: vol102469

Leaving the file open:

11/07/16 07:12:20 733061 [dev01-vol102469] [DEBUG  ] executeRequest None /vmfs/volumes/datastore2/dockvols
11/07/16 07:12:20 733061 [dev01-vol102469] [INFO   ] *** removeVMDK: /vmfs/volumes/datastore2/dockvols/vol102469.vmdk
11/07/16 07:12:20 733061 [dev01-vol102469] [DEBUG  ] Running cmd /sbin/vmkfstools -U  /vmfs/volumes/datastore2/dockvols/vol102469.vmdk
11/07/16 07:12:23 733061 [dev01-vol102469] [INFO   ] *** removeVMDK: /vmfs/volumes/datastore2/dockvols/vol102469.vmdk, coudn't lock volume for removal. Retrying...
11/07/16 07:12:24 733061 [dev01-vol102469] [DEBUG  ] Running cmd /sbin/vmkfstools -U  /vmfs/volumes/datastore2/dockvols/vol102469.vmdk
11/07/16 07:12:27 733061 [dev01-vol102469] [INFO   ] *** removeVMDK: /vmfs/volumes/datastore2/dockvols/vol102469.vmdk, coudn't lock volume for removal. Retrying...
11/07/16 07:12:28 733061 [dev01-vol102469] [DEBUG  ] Running cmd /sbin/vmkfstools -U  /vmfs/volumes/datastore2/dockvols/vol102469.vmdk
11/07/16 07:12:31 733061 [dev01-vol102469] [INFO   ] *** removeVMDK: /vmfs/volumes/datastore2/dockvols/vol102469.vmdk, coudn't lock volume for removal. Retrying...
11/07/16 07:12:32 733061 [dev01-vol102469] [DEBUG  ] Running cmd /sbin/vmkfstools -U  /vmfs/volumes/datastore2/dockvols/vol102469.vmdk
11/07/16 07:12:35 733061 [dev01-vol102469] [INFO   ] *** removeVMDK: /vmfs/volumes/datastore2/dockvols/vol102469.vmdk, coudn't lock volume for removal. Retrying...
11/07/16 07:12:36 733061 [dev01-vol102469] [DEBUG  ] Running cmd /sbin/vmkfstools -U  /vmfs/volumes/datastore2/dockvols/vol102469.vmdk
11/07/16 07:12:39 733061 [dev01-vol102469] [INFO   ] *** removeVMDK: /vmfs/volumes/datastore2/dockvols/vol102469.vmdk, coudn't lock volume for removal. Retrying...
11/07/16 07:12:40 733061 [dev01-vol102469] [DEBUG  ] Running cmd /sbin/vmkfstools -U  /vmfs/volumes/datastore2/dockvols/vol102469.vmdk
11/07/16 07:12:43 733061 [dev01-vol102469] [INFO   ] executeRequest 'remove' completed with ret={u'Error': 'Failed to remove /vmfs/volumes/datastore2/dockvols/vol102469.vmdk. Failed to delete virtual disk: Failed to lock the file (16392).\n'}
11/07/16 07:12:43 733061 [dev01-vol102469] [DEBUG  ] lib.vmci_reply: VMCI replied with errcode 0
11/07/16 07:12:43 733061 [dev01-vol102469] [DEBUG  ] Released lock: vol102469

threadutils.LockManager: add list_locks()
threadutils.LockManager: return within self._lock scope in get_lock
@brunotm
Contributor Author

brunotm commented Nov 7, 2016

@kerneltime
The testbed will need cleaning again before submitting new builds.

@brunotm
Contributor Author

brunotm commented Nov 7, 2016

The vmkfstools -U command will unlink all volume-associated files except the busy one. After the first removeVMDK failure we are left with a vmdk file without the flat and .vmfd files, hence the subsequent errors.
I will also test the behavior using vim.VirtualDiskManager.

Contributor

@govint govint left a comment


Looks good, but the new threadutils code could use a block comment saying how the interfaces defined there must be used in other modules.

@msterin
Contributor

msterin commented Nov 7, 2016

The code looks good. I like threadutils and the way the decorator is defined and used - very clear.
The code addresses a set of issues, though:

  • removeVMDK partial successes and retries - IMO this should be merged now (LGTM from me on this commit)
  • general disklib protection - it may not be needed. I did mislead with the "disklib is not thread safe" claim; it was not fully correct. Disklib is thread safe as long as the dhandle is not shared between threads (see the sketch below). So unless we have something to prove there is an issue, we should delay this one.
  • KV protection (and VMDK ops protection) - if 2 docker engines are trying to operate on the same volume, we need to serialize it. I am not sure there is any code for this.

//CC @brunotm @kerneltime
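
A minimal sketch of the per-thread dhandle point above, assuming a hypothetical opener callable in place of the real disklib open call:

import threading

_tls = threading.local()  # per-thread storage; nothing shared across threads

def get_dhandle(vmdk_path, opener):
    """Return a dhandle private to the calling thread, opening the disk on
    first use, so disklib never sees one handle used from two threads."""
    handles = getattr(_tls, "handles", None)
    if handles is None:
        handles = _tls.handles = {}
    if vmdk_path not in handles:
        handles[vmdk_path] = opener(vmdk_path)  # e.g. a disklib open wrapper
    return handles[vmdk_path]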

@brunotm
Contributor Author

brunotm commented Nov 7, 2016

@govint
I'll update with comments.
Thanks!

@brunotm
Contributor Author

brunotm commented Nov 7, 2016

@msterin

general disklib protection

If we don't serialize, we'll need a loaded lib per thread, similar to the SQLite situation. As soon as we pass the dhandle for an operation from a request thread, we've crossed those boundaries. I really prefer the general locking approach for now.

KV and VMDK ops protection

Right now they'll be serialized within the lifetime of a request, but an op (mainly create/delete) is a composition of them; for that we'll need a higher-level lock and to remember the plugin's intention with some kind of id and state (one possible shape is sketched below). I have some ideas for that and can propose something as soon as I have time.
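
One hypothetical shape for that higher-level state, purely illustrative (none of these names come from this PR):

import threading
import uuid

_intents = {}                    # volume name -> in-flight intent record
_intents_lock = threading.Lock()

def begin_op(volume, op):
    """Record the plugin's intention (e.g. 'create', 'remove') for a volume
    so composite ops are serialized and interrupted ones can be detected."""
    with _intents_lock:
        if volume in _intents:
            raise RuntimeError("volume %s busy with %s"
                               % (volume, _intents[volume]["op"]))
        intent = {"id": str(uuid.uuid4()), "op": op, "state": "started"}
        _intents[volume] = intent
        return intent

def end_op(volume):
    with _intents_lock:
        _intents.pop(volume, None)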

@brunotm
Contributor Author

brunotm commented Nov 7, 2016

I'm replying from the phone, sorry for the brevity.

@msterin
Contributor

msterin commented Nov 7, 2016

@brunotm - good point about crossing the boundaries when passing the dhandle to an op.
LGTM then.

//CC @kerneltime

@kerneltime
Contributor

Fixes #626
