Make the plugin thread safe #810
Conversation
Good find, but can the same mutex be used that's used to serialize mount and unmount ops? OTOH, can the vmci client be made reentrant, or remove the lock used to serialize the mount and unmount ops and keep the new one?
…On Dec 9, 2016 6:38 PM, "Bruno Moura" ***@***.***> wrote:
The access to vmci data structures on the plugin side needs to be
serialized during the command/response sequence to avoid race conditions in
high concurrency scenarios.
Found when stressing concurrency on the same client in #802
<#802>.
//CC @msterin <https://github.com/msterin>
before:
=== RUN TestConcurrency
Running concurrent tests on tcp://vsc02:2375 and tcp://vsc02:2375 (may take a while)...
Running create/delete tests...
--- FAIL: TestConcurrency (20.30s)
sanity_test.go:190: Successfully connected to tcp://vsc02:2375
sanity_test.go:190: Successfully connected to tcp://vsc02:2375
sanity_test.go:287: Parallel test failed, err: Error response from daemon: create volTestP14: Post http://%2Frun%2Fdocker%2Fplugins%2Fvmdk.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0
FAIL
after:
=== RUN TestConcurrency
Running concurrent tests on tcp://vsc02:2375 and tcp://vsc02:2375 (may take a while)...
Running create/delete tests...
Running clone parallel tests...
--- PASS: TestConcurrency (9.89s)
sanity_test.go:190: Successfully connected to tcp://vsc02:2375
sanity_test.go:190: Successfully connected to tcp://vsc02:2375
PASS
With this change a mount takes the lock in plugin.go, then one in esx_command.go, and then one more on the esx service side, all of which serialise ops per VM. Possibly have one lock on the client and one on the esx service.
@govint As you said, it looks like we can drop the mutex in vmdkDriver, which is only held at mount/unmount per volume and is already protected by the volume/attach/detach locks in vmdk_ops. I'll test and update. |
@govint @msterin We can either keep that (which will synchronize more than the refcount) or drop it and use a mutex specific to the refcount. I think the latter is better for being explicit. What do you think? |
Added a commit with a rwmutex for the refcount. We can use either only the first commit, or all of them for the second approach. |
Force-pushed from 1353ea9 to bbb0e0e
@@ -45,7 +45,7 @@ const (
 )

 type vmdkDriver struct {
-	m *sync.Mutex   // create() serialization - for future use
+	m *sync.RWMutex // create() serialization - for future use
Nit - change the comment to match exactly how this lock is being used.
	}
	return rc.Mtx
}

// enumberates volumes and builds refCountsMap, then sync with mount info
Nit and optional - could add comments in the discover and sync functions that they don't use any locking when they set the count or mount/unmount volumes. They don't need to, as that happens on the init path when the plugin isn't up yet.
Left them out for this, mainly because the refCountsMap doesn't hold the responsibility for synchronisation. I have found another issue with this approach; comment below.
@govint Probably the best approach is to have the refcount map fully synchronized and keep a count of volume lock holders/waiters, and remove the locking from the driver (which really doesn't need to be synchronized itself). I'll try to have something to test later today. |
Changes look good.
Yes, that's correct, but as long as those races don't leave the refcount in an inconsistent state and clean up OK, it's not a problem. For example, Remove() checks for a zero ref count and, if it is zero, goes ahead and removes the volume, while a mount can race, create a refcount map entry and mount the volume; in that case the Remove() should fail, but it will at least not corrupt the refcount map. Do you see any cases where the refcount map would be trashed in some scenario? |
I'm aware that the possibility is low with the current code, but there are a few windows open where concurrent ops could hold different locks for the same volume. In principle, having a refcount deleted while its lock still lives elsewhere is not good. |
Use a mutex in EsxVmdkCmd to synchronize access to vmci data structures during the Run() command/response sequence.
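To make the thread easier to follow, here is a minimal sketch of what this commit does, assuming simplified names and a simplified Run signature; it is an illustration, not the actual esx_vmdkcmd.go code.

```go
// Sketch: serialize the VMCI command/response exchange with a mutex.
// Type layout and the Run signature are assumptions for illustration.
package vmdkops

import (
	"fmt"
	"sync"
)

// EsxVmdkCmd talks to the ESX-side service over VMCI.
type EsxVmdkCmd struct {
	mtx sync.Mutex // serializes the command/response sequence
}

// Run sends one command and waits for its response. Holding the lock for
// the whole exchange means a response can never be paired with the wrong
// request, even when many Docker requests arrive concurrently.
func (c *EsxVmdkCmd) Run(op, volName string) ([]byte, error) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	return c.sendAndReceive(op, volName)
}

// sendAndReceive stands in for the real (non-reentrant) VMCI call.
func (c *EsxVmdkCmd) sendAndReceive(op, volName string) ([]byte, error) {
	return []byte(fmt.Sprintf("ok: %s %s", op, volName)), nil
}
```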
Force-pushed from bbb0e0e to 66731f5
Force-pushed from 66731f5 to e2805be
return fmt.Errorf("Volume %s has no lockers, cannot be unlocked", vol) | ||
} | ||
|
||
rc.lockers-- |
Nit, rc.lockers is incremented with the refmap lock only and without the per-volume lock. Here it's decremented with the per-volume lock held. Does lockers need the per-volume lock?
No, only the refCountsMap lock is needed. It is composed this way because if we tried to lock the volume, and that blocked while holding the refCountsMap lock, it would block the whole refCountsMap and the driver/plugin.
func (r refCountsMap) lockVolume(vol string) {
	// Lock the refCountsMap
	r.mtx.Lock()
Defer unlock of refmap lock?
It has an explicit unlock below. Same reason as the previous comment.
	}
	r.mtx.Unlock() // Unlock.
This segment is guaranteed to run without any other threads; do we still need the locks here?
With the current code, no. I have protected those sections for future usage, like online recovery. If that's unlikely I can remove them (but they don't hurt anyone :)
	r.mtx.Lock()
	defer r.mtx.Unlock()

	for vol, cnt := range r.refMap {
Same as above, are the locks needed here? Need to watch out that functions called from syncMountsWithRefCounters() must never attempt calling refcount functions, as that will try locking the same lock. The startup functions preferably don't need to take locks.
Yes, attention will be needed when touching the init functions in the future. If the online recovery idea is unlikely we should remove the locks then.
	rc := r.refMap[vol]
	if rc == nil {
		return fmt.Errorf("Volume %s has no refcount, cannot be unlocked", vol)
Caller will leave the lock held?
In this case there's nothing to unlock, because there is no refcount for the volume. The error is there mostly to avoid bad usage.
	if rc.lockers == 0 {
		return fmt.Errorf("Volume %s has no lockers, cannot be unlocked", vol)
	}
Caller leaves the lock held?
There's nothing to unlock. This is for bad usage as above - calling unlockVolume() before lockVolume().
The only way to have a scenario where lockers == 0 and the mutex is locked is if we bypass the lockVolume() function and lock the volume directly in refMap[vol].mtx, which wouldn't make any sense.
Also, calling Unlock on an unlocked mutex throws a runtime error, and Go mutexes don't have an IsLocked() or TryLock().
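To keep the review readable, here is an approximate, self-contained reconstruction of the per-volume locking with a lockers counter that the fragments above come from; the names follow the diff, but this is a sketch, not the exact PR code.

```go
// Sketch of a per-volume lock managed by a refcount map. The map lock is
// dropped before blocking on the volume lock, so one busy volume cannot
// stall the whole map. Approximation only, not the PR's code.
package refcount

import (
	"fmt"
	"sync"
)

type refCount struct {
	count   uint        // containers using the volume
	mtx     *sync.Mutex // per-volume lock
	lockers uint        // goroutines holding or waiting on mtx
}

type refCountsMap struct {
	mtx    sync.Mutex // protects refMap and the lockers counters
	refMap map[string]*refCount
}

func newRefCountsMap() *refCountsMap {
	return &refCountsMap{refMap: make(map[string]*refCount)}
}

// lockVolume takes the per-volume lock, creating the entry if needed.
func (r *refCountsMap) lockVolume(vol string) {
	r.mtx.Lock()
	rc := r.refMap[vol]
	if rc == nil {
		rc = &refCount{mtx: &sync.Mutex{}}
		r.refMap[vol] = rc
	}
	rc.lockers++ // counted under the map lock only
	r.mtx.Unlock()

	rc.mtx.Lock() // may block; the map lock is already released
}

// unlockVolume releases the per-volume lock and drops the entry once
// nothing uses or waits on it.
func (r *refCountsMap) unlockVolume(vol string) error {
	r.mtx.Lock()
	defer r.mtx.Unlock()

	rc := r.refMap[vol]
	if rc == nil {
		return fmt.Errorf("Volume %s has no refcount, cannot be unlocked", vol)
	}
	if rc.lockers == 0 {
		return fmt.Errorf("Volume %s has no lockers, cannot be unlocked", vol)
	}
	rc.lockers--
	rc.mtx.Unlock()
	if rc.lockers == 0 && rc.count == 0 {
		delete(r.refMap, vol) // no users and no waiters left
	}
	return nil
}
```

In this sketch a waiter that has already incremented lockers keeps the entry alive, which is what prevents the "refcount deleted while its lock still lives elsewhere" situation mentioned above.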
@@ -209,6 +221,10 @@ func (d *vmdkDriver) unmountVolume(name string) error {
 // Name and driver specific options passed through to the ESX host
 func (d *vmdkDriver) Create(r volume.Request) volume.Response {

+	// Lock this volume for create
+	d.lockVolume(r.Name)
+	defer d.unlockVolume(r.Name)
This will essentially create a refmap entry and lock the entry; once create completes the lock is released and the refmap entry is removed. Do we then need the lock for the create? Same for remove: the last unmount will destroy the refmap entry, and remove creates and deletes it.
Agreed with the remove part.
For create I think it's necessary to protect the create protocol, which has several unprotected steps within the driver/plugin (create/attach/mkfs/detach).
What do you think?
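A hedged sketch of the sequence being described; the step helpers (createVolumeOnESX, attachVolume, mkfsOnDevice, detachVolume) are hypothetical stand-ins, not the plugin's real functions.

```go
// Illustration only: the multi-step create protocol that the per-volume
// lock is meant to protect. All helper names are hypothetical.
package plugin

// Stubs standing in for the real work on the ESX host and in the guest.
func createVolumeOnESX(name string) error      { return nil }
func attachVolume(name string) (string, error) { return "/dev/sdX", nil }
func detachVolume(name string) error           { return nil }
func mkfsOnDevice(device string) error         { return nil }

// volumeLocker is the per-volume locking interface sketched earlier.
type volumeLocker interface {
	lockVolume(name string)
	unlockVolume(name string) error
}

// createVolume holds the per-volume lock across create/attach/mkfs/detach,
// so a concurrent Mount() of the same name cannot attach and start using
// the disk while mkfs is still formatting it.
func createVolume(locks volumeLocker, name string) error {
	locks.lockVolume(name)
	defer locks.unlockVolume(name)

	if err := createVolumeOnESX(name); err != nil {
		return err
	}
	device, err := attachVolume(name)
	if err != nil {
		return err
	}
	defer detachVolume(name)

	return mkfsOnDevice(device)
}
```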
Do you know what the behaviour is if a container starts with a vol that is being removed? Mount() in parallel with Remove()? Does the docker daemon protect against this?
After having a second thought: without locking on create we can race while formatting the filesystem with the volume attached and another container tries to start with that volume. So locking here is really needed.
@brunotm - what do you mean by "container starts with a vol that is being removed"? I do not understand the sequence of events here...
> without locking on create we can race while formatting the filesystem with the volume attached and another container tries to start with that volume. So locking here is really needed.
Agreed
@msterin
The remove case was a bad example.
As much as the daemon seems to protect us from starting a container with a volume in the middle of Create(), I would like to keep the locks there just to be safe.
The changes look good, but locking for create and remove seems overkill, plus a few nits.
But that can't happen on the same VM (as the volume create request hasn't yet returned to Docker from the plugin); it may happen on another VM that's accessing the same volume, in which case the locks in the VM won't help.
OK, Docker may allow a Get to go through to the volume, which can find the volume? Get() too must lock the VM.
Sorry, not VM - volume.
If the daemon allows it, we probably do need to take the volume lock on Get() too. |
So to summarize the discussion, we seem to be saying that ... Is it about right? Also, talking about VMCI - where exactly is it not thread safe? |
From basic testing, the docker daemon seems to protect against this. Trying to start a container with a volume before the formatting is done waits till Create() returns, and apparently the same happens for Remove(). I'll remove the locks over create/remove. |
No, locking will be needed on mount/umount.
Yes, this is intra-docker-host.
We seem to be overwriting data over parallel requests, like the #802 error. |
sounds good
This is most likely due to the simple fact that the VMCI code does not track a requestId. So if request1 is sent and then request2 is sent, but reply2 arrives first, there is no code mapping it to the correct request. So while serialization of mount/unmount is hiding the problem (as these are longer-running ops), I suspect the issue will still be there, just not seen often. |
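As an illustration of the request-ID tracking described above as missing (purely hypothetical, not proposed plugin or VMCI code), a demultiplexer that pairs each reply with its request would look roughly like this:

```go
// Hypothetical sketch: every request carries an ID and responses are
// routed back to the waiting caller by that ID, so exchanges would not
// need to be fully serialized. Names are illustrative only.
package demux

import (
	"fmt"
	"sync"
)

type response struct {
	id      uint64
	payload []byte
}

type channelDemux struct {
	mtx     sync.Mutex
	nextID  uint64
	pending map[uint64]chan response
	send    func(id uint64, req []byte) error // transport hook (assumed)
}

func newChannelDemux(send func(id uint64, req []byte) error) *channelDemux {
	return &channelDemux{pending: make(map[uint64]chan response), send: send}
}

// roundTrip sends one request and waits for the matching response, even
// if responses to other requests arrive first.
func (d *channelDemux) roundTrip(req []byte) ([]byte, error) {
	d.mtx.Lock()
	d.nextID++
	id := d.nextID
	ch := make(chan response, 1)
	d.pending[id] = ch
	d.mtx.Unlock()

	if err := d.send(id, req); err != nil {
		d.mtx.Lock()
		delete(d.pending, id)
		d.mtx.Unlock()
		return nil, err
	}
	resp := <-ch
	return resp.payload, nil
}

// deliver is called by a receive loop for every incoming response.
func (d *channelDemux) deliver(resp response) error {
	d.mtx.Lock()
	ch, ok := d.pending[resp.id]
	delete(d.pending, resp.id)
	d.mtx.Unlock()
	if !ok {
		return fmt.Errorf("no pending request for id %d", resp.id)
	}
	ch <- resp
	return nil
}
```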
@govint @msterin
Agreed. |
@msterin @govint So the only things needed from the plugin side are the synchronized refCountsMap and the vmdkDriver. I'll update the PR. |
- Change refCountsMap to a struct
- Synchronize refcount ops, including init functions
- Remove vmdkDriver synchronization
@msterin |
Force-pushed from e2805be to 9e5d4c5
@msterin Please review. |
Looks good. |
The fix looks good and clean, but I do have one question (see inside).
func (r refCountsMap) getCount(vol string) uint {
	rc := r[vol]
	// RLocks the refCountsMap
	r.mtx.RLock()
Can't another thread change the value right after getCount returns and unlocks, but before the value is used for comparison in plugin.go:Remove()?
The docker daemon has a reference-counted, lock-by-name scheme (much like what we have in the esx service) that serializes operations per volume (pkg/locker and volume/store), so we can't have more than one op on the same daemon/volume at the same time; a race on the refcount value is not possible.
Our only concern is to serialize reads/writes on the refcount map.
@@ -170,10 +195,14 @@ func (r refCountsMap) getCount(vol string) uint {

// incr refCount for the volume vol. Creates new entry if needed.
func (r refCountsMap) incr(vol string) uint {
	rc := r[vol]
	// Locks the refCountsMap
Same question. There seems to be a gap between when the refcount is changed in incr() and when it is checked in Mount(). During this time another thread can change the value.
Same as above
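For context, a hedged approximation of the synchronized refcount map under review; the names and signatures follow the diff fragments but are not copied from the PR.

```go
// Approximation of a refcount map synchronized with an RWMutex: readers
// share the lock, writers take it exclusively. Not the exact PR code.
package refcount

import "sync"

type volRefCount struct {
	count uint // number of containers using the volume
}

type refCounts struct {
	mtx    sync.RWMutex
	refMap map[string]*volRefCount
}

func newRefCounts() *refCounts {
	return &refCounts{refMap: make(map[string]*volRefCount)}
}

// getCount returns the current refcount under a shared read lock.
func (r *refCounts) getCount(vol string) uint {
	r.mtx.RLock()
	defer r.mtx.RUnlock()
	if rc := r.refMap[vol]; rc != nil {
		return rc.count
	}
	return 0
}

// incr bumps the refcount, creating the entry if needed, under the
// exclusive write lock.
func (r *refCounts) incr(vol string) uint {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	rc := r.refMap[vol]
	if rc == nil {
		rc = &volRefCount{}
		r.refMap[vol] = rc
	}
	rc.count++
	return rc.count
}
```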
@@ -340,9 +346,6 @@ func (d *vmdkDriver) Path(r volume.Request) volume.Response {
 func (d *vmdkDriver) Mount(r volume.MountRequest) volume.Response {
 	log.WithFields(log.Fields{"name": r.Name}).Info("Mounting volume ")

-	d.m.Lock()
-	defer d.m.Unlock()
Maybe I am missing something, but won't it now be possible to get the refcnt (line 354), then another thread changes the actual refcount before the decision on line 356 is made? Or is there something else (e.g. Docker serializing mounts) that would prevent it? If there is, please add a comment to the code explaining it.
I'll update with comments
Thanks!
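For what it is worth, a sketch of the kind of explanatory comment being requested; the surrounding code is simplified and hypothetical, not the actual Mount() in plugin.go.

```go
// Illustration only: Mount() relying on Docker's own per-volume
// serialization instead of a driver-wide lock. Types are stand-ins.
package plugin

import "path/filepath"

type refCounter interface{ incr(vol string) uint }

type vmdkDriver struct {
	refCounts refCounter
	mountRoot string
}

func (d *vmdkDriver) getMountpoint(name string) string {
	return filepath.Join(d.mountRoot, name)
}

func (d *vmdkDriver) mountVolume(name string) (string, error) {
	// attach the VMDK and mount the filesystem (elided)
	return d.getMountpoint(name), nil
}

func (d *vmdkDriver) Mount(name string) (string, error) {
	// No driver-wide lock here: the Docker daemon already serializes
	// operations per volume name (pkg/locker + volume/store), so two
	// Mount/Unmount calls for the same volume never run concurrently on
	// this host. The refcount map therefore only needs to synchronize
	// its own reads and writes.
	if refcnt := d.refCounts.incr(name); refcnt > 1 {
		// Already attached and mounted for another container.
		return d.getMountpoint(name), nil
	}
	return d.mountVolume(name)
}
```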
Thanks! LGTM / Ready to merge after the comment is added.
Ready to merge when CI passes. |
Update to reflect the current code in this PR.
This makes the plugin thread safe by implementing a per-volume locking scheme inside refCountsMap{} and making it fully synchronized, removing the plugin-wide locks in mount/umount, and protecting the vmci data structures during the command/response sequence in EsxVmdkCmd.
The access to vmci data structures on the plugin side needs to be serialized during the command/response sequence to avoid race conditions in high concurrency scenarios.
Found when stressing concurrency on the same client in #802.
//CC @msterin
before:
after: