This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Make the plugin thread safe #810

Merged
merged 3 commits on Dec 20, 2016

Conversation

brunotm
Contributor

@brunotm brunotm commented Dec 9, 2016

Updated to reflect the current code in this PR.

This makes the plugin thread safe by implementing a per-volume locking scheme inside refCountsMap{}, making it fully synchronized, removing the plugin-wide locks in mount/unmount, and protecting the VMCI data structures during the command/response sequence in EsxVmdkCmd.

Access to the VMCI data structures on the plugin side needs to be serialized during the command/response sequence to avoid race conditions in high-concurrency scenarios.

Found when stressing concurrency on the same client in #802.
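A minimal sketch of the EsxVmdkCmd part of the change, with placeholder package, field, and helper names (sendAndReceive stands in for the real cgo/VMCI exchange and is not the plugin's code): Run() holds one mutex for the entire command/response round trip, so concurrent requests cannot interleave on the shared VMCI data structures.

package vmdkops

import "sync"

// EsxVmdkCmd sketch: a single mutex serializes the whole VMCI round trip.
type EsxVmdkCmd struct {
    mtx *sync.Mutex // protects the VMCI data structures during Run()
}

func (c EsxVmdkCmd) Run(op, name string, opts map[string]string) ([]byte, error) {
    c.mtx.Lock()
    defer c.mtx.Unlock() // held until the ESX response has been read

    return sendAndReceive(op, name, opts)
}

// sendAndReceive is a stand-in for the cgo/VMCI request/response exchange.
func sendAndReceive(op, name string, opts map[string]string) ([]byte, error) {
    return []byte("{}"), nil
}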

//CC @msterin

before:

=== RUN   TestConcurrency
Running concurrent tests on tcp://vsc02:2375 and tcp://vsc02:2375 (may take a while)...
Running create/delete tests...
--- FAIL: TestConcurrency (20.30s)
	sanity_test.go:190: Successfully connected to tcp://vsc02:2375
	sanity_test.go:190: Successfully connected to tcp://vsc02:2375
	sanity_test.go:287: Parallel test failed, err: Error response from daemon: create volTestP14: Post http://%2Frun%2Fdocker%2Fplugins%2Fvmdk.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0
FAIL

after:

=== RUN   TestConcurrency
Running concurrent tests on tcp://vsc02:2375 and tcp://vsc02:2375 (may take a while)...
Running create/delete tests...
Running clone parallel tests...
--- PASS: TestConcurrency (9.89s)
	sanity_test.go:190: Successfully connected to tcp://vsc02:2375
	sanity_test.go:190: Successfully connected to tcp://vsc02:2375
PASS

@govint
Contributor

govint commented Dec 9, 2016 via email

@govint
Contributor

govint commented Dec 9, 2016 via email

@brunotm
Contributor Author

brunotm commented Dec 9, 2016

@govint
The only locks that can serialize everything for a single Docker host while held are the per-VM mount/unmount locks in vmdk_ops, during which the plugin will also hold the vmdkcmd.Run lock (until it gets a response from ESX).

As you said, it looks like we can drop the mutex in vmdkDriver, which is only held at mount/unmount per volume and is already protected by the volume/attach/detach locks in vmdk_ops.

I'll test and update.

@brunotm
Contributor Author

brunotm commented Dec 9, 2016

@govint @msterin
Testing with the same tests from #802 went fine, but we still need a lock for the *RefCount functions, which can access/modify the same map concurrently and were previously protected in mount/unmount by the vmdkDriver mutex.

We can either keep that (which will synchronize more than the refcount) or drop it and use a mutex specific to the refcount. I think the latter is better for being explicit.

What do you think?
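A minimal sketch of the second option, assuming a simplified flat map and illustrative names (not the plugin's actual layout): a dedicated RWMutex guards every read/write of the refcount map, and the vmdkDriver mutex stays out of it.

package refcount

import "sync"

// refCountsMtx is dedicated to the refcount map: readers take the read lock,
// incr/decr take the write lock.
var refCountsMtx sync.RWMutex

type refCountsMap map[string]uint

func (r refCountsMap) getCount(vol string) uint {
    refCountsMtx.RLock()
    defer refCountsMtx.RUnlock()
    return r[vol]
}

func (r refCountsMap) incr(vol string) uint {
    refCountsMtx.Lock()
    defer refCountsMtx.Unlock()
    r[vol]++
    return r[vol]
}

func (r refCountsMap) decr(vol string) uint {
    refCountsMtx.Lock()
    defer refCountsMtx.Unlock()
    if r[vol] > 0 {
        r[vol]--
    }
    if r[vol] == 0 {
        delete(r, vol)
        return 0
    }
    return r[vol]
}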

@brunotm
Contributor Author

brunotm commented Dec 9, 2016

Added a commit with an RWMutex for the refcount. We can use either only the first commit, or all of them for the second approach.

@brunotm
Contributor Author

brunotm commented Dec 11, 2016

@msterin @govint
I have reviewed the change to better conform to the current code and to allow the plugin to run as freely as possible while keeping start/stop storm scenarios safe.

On my part it's ready for review/merge.

Thanks.

@@ -45,7 +45,7 @@ const (
)

type vmdkDriver struct {
m *sync.Mutex // create() serialization - for future use
m *sync.RWMutex // create() serialization - for future use
Contributor

Nit - change the comment to match exactly how this lock is being used.

}
return rc.Mtx
}

// enumerates volumes and builds refCountsMap, then syncs with mount info
Contributor

Nit and optional - could add comments in the discover and sync functions noting that they don't use any locking when they set the count or mount/unmount volumes. They don't need to, as that happens on the init path when the plugin isn't up yet.

Contributor Author

Left them out of this, mainly because the refCountsMap doesn't hold the responsibility for synchronisation. I have found another issue with this approach; comment below.

@brunotm
Contributor Author

brunotm commented Dec 12, 2016

@govint
After re-reading the code I understand that there is a possibility of races or stale locks, because a refcount can be removed from the map while its lock is still held outside of the path calling decr(). The same applies if we are creating and deleting a volume with the same name.

Probably the best approach is to have the refcount map fully synchronized and keep a count of volume lock holders/waiters, and to remove the locking from the driver (which really doesn't need to be synchronized itself).

I'll try to have something to test later today.
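A rough sketch of the shape described above; names and the cleanup policy are illustrative, not the final diff. The map is guarded by its own mutex, and each entry carries a per-volume mutex plus a count of lock holders/waiters, so an entry is never dropped while someone still holds or waits on its lock.

package refcount

import (
    "fmt"
    "sync"
)

type refCount struct {
    Mtx     *sync.Mutex // per-volume lock
    lockers uint        // current holders + waiters of Mtx
    count   uint        // mount refcount
}

type refCountsMap struct {
    mtx    *sync.RWMutex        // guards refMap itself
    refMap map[string]*refCount // volume name -> entry
}

// lockVolume holds the map lock only while it finds/creates the entry and
// bumps lockers; it releases the map lock before blocking on the per-volume
// mutex, so a slow operation on one volume never stalls the whole map.
func (r refCountsMap) lockVolume(vol string) {
    r.mtx.Lock()
    rc := r.refMap[vol]
    if rc == nil {
        rc = &refCount{Mtx: &sync.Mutex{}}
        r.refMap[vol] = rc
    }
    rc.lockers++
    r.mtx.Unlock() // explicit unlock: don't hold the map lock while blocking

    rc.Mtx.Lock() // only other users of this volume wait here
}

// unlockVolume decrements lockers under the map lock, releases the per-volume
// mutex, and drops the entry once nothing holds, waits on, or refcounts it.
func (r refCountsMap) unlockVolume(vol string) error {
    r.mtx.Lock()
    defer r.mtx.Unlock()

    rc := r.refMap[vol]
    if rc == nil {
        return fmt.Errorf("Volume %s has no refcount, cannot be unlocked", vol)
    }
    if rc.lockers == 0 {
        return fmt.Errorf("Volume %s has no lockers, cannot be unlocked", vol)
    }
    rc.lockers--
    rc.Mtx.Unlock()
    if rc.lockers == 0 && rc.count == 0 {
        delete(r.refMap, vol)
    }
    return nil
}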

Contributor

@govint govint left a comment

Changes look good.

@govint
Contributor

govint commented Dec 12, 2016

Yes, that's correct, but as long as those races don't leave the refcount in an inconsistent state and clean up OK, it's not a problem. For example, Remove() checks for a zero refcount and, if it is zero, goes ahead and removes the volume, while a mount can race with it, create a refcount map entry, and mount the volume. In that case the Remove() should fail, but it will at least not corrupt the refcount map.

Do you see any cases where the refcount map would be trashed in some scenario?

@brunotm
Contributor Author

brunotm commented Dec 12, 2016

I'm aware that the possibility is low with the current code, but there are a few windows where concurrent ops could hold different locks for the same volume. In principle, having a refcount deleted while its lock still lives elsewhere is not good.
I already have most of the code for the synchronization of the refCountsMap in place, with the lock/waiter count and locking semantics that don't expose the mutex to the caller. I didn't have the opportunity to test, so I haven't pushed yet :) but I would like to hear your opinion on it.

Use a mutex in EsxVmdkCmd to synchronize access to vmci data
structures during the Run() command/response sequence.
@brunotm
Contributor Author

brunotm commented Dec 12, 2016

@govint
This last push implements the strategy mentioned before. Tests were successful, including the tests from #802, taking ~3s less than the current code with the driver locked for mount/unmount.

Please review. I have a local branch with the previous code if necessary.

@brunotm brunotm changed the title from "Serialize vmdkcmd.Run" to "Make the plugin thread safe" on Dec 13, 2016
return fmt.Errorf("Volume %s has no lockers, cannot be unlocked", vol)
}

rc.lockers--
Contributor

Nit: rc.lockers is incremented with only the refmap lock held, without the per-volume lock, while here it's decremented with the per-volume lock held. Does lockers need the per-volume lock?

Contributor Author

No, only the refCountsMap lock is needed. It is composed this way because if we tried to lock the volume while holding the refCountsMap lock and that blocked, it would block the whole refCountsMap and the driver/plugin.

func (r refCountsMap) lockVolume(vol string) {
// Lock the refCountsMap
r.mtx.Lock()

Contributor

Defer unlock of refmap lock?

Contributor Author

It has an explicit unlock below, for the same reason as the previous comment.

}
r.mtx.Unlock() // Unlock.

Contributor

This segment is guaranteed to run without any other threads; do we still need the locks here?

Contributor Author

With the current code, no. I have protected those sections for future usage, like online recovery. If that's unlikely I can remove them (but they don't hurt anyone :)

r.mtx.Lock()
defer r.mtx.Unlock()

for vol, cnt := range r.refMap {
Contributor

@govint govint Dec 13, 2016

Same as above, are the locks needed here? We need to watch out that functions called from syncMountsWithRefCounters() never attempt to call refcount functions, as that would try to take the same lock. The start-up functions preferably shouldn't need to take locks.

Contributor Author

Yes, attention will be needed when touching the init functions in the future. If the online recovery idea is unlikely, we should remove the locks then.


rc := r.refMap[vol]
if rc == nil {
return fmt.Errorf("Volume %s has no refcount, cannot be unlocked", vol)
Contributor

Caller will leave the lock held?

Contributor Author

In this case there's nothing to unlock because there is no refcount for the volume. The error is there mostly to guard against bad usage.


if rc.lockers == 0 {
return fmt.Errorf("Volume %s has no lockers, cannot be unlocked", vol)
}
Contributor

Caller leaves the lock held?

Contributor Author

There's nothing to unlock. This is for bad usage as above: calling unlockVolume() before lockVolume().
The only way to get a scenario where lockers == 0 and the mutex is locked is to bypass the lockVolume() function and lock the volume directly via refMap[vol].mtx, which wouldn't make any sense.

Contributor Author

Also, calling Unlock on an unlocked mutex throws a runtime error, and Go mutexes don't have an IsLocked() or TryLock().
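For reference, a two-line illustration of that runtime error (sync.Mutex only gained TryLock much later, in Go 1.18, so at the time of this PR there was no way to probe a mutex's state):

package main

import "sync"

func main() {
    var m sync.Mutex
    m.Unlock() // fatal error: sync: unlock of unlocked mutex
}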

@@ -209,6 +221,10 @@ func (d *vmdkDriver) unmountVolume(name string) error {
// Name and driver specific options passed through to the ESX host
func (d *vmdkDriver) Create(r volume.Request) volume.Response {

// Lock this volume for create
d.lockVolume(r.Name)
defer d.unlockVolume(r.Name)
Contributor

This will essentially create a refmap entry and lock it; once create completes, the lock is released and the refmap entry is removed. Do we then need the lock for the create? Same for remove: the last unmount will destroy the refmap entry, and Remove() creates and deletes it.

Contributor Author

Agreed on the remove part.
For create I think it's necessary to protect the create protocol, which has several unprotected steps within the driver/plugin (create/attach/mkfs/detach).

What do you think?

Contributor Author

Do you know what the behaviour is if a container starts with a volume that is being removed? Mount() in parallel with Remove()? Does the Docker daemon protect against this?

Contributor Author

On second thought: without locking on create we can race while formatting the filesystem with the volume attached and another container tries to start with that volume. So locking here is really needed.

Contributor

@brunotm - what do you mean by "container starts with a vol that is being removed"? I do not understand the sequence of events here...

without locking on create we can race while formatting the filesystem with the volume attached and another container tries to start with that volume. So locking here is really needed.

Agreed

Contributor Author

@msterin
The remove case was a bad example.
Even though the daemon seems to protect us from starting a container with a volume in the middle of Create(), I would like to keep the locks there just to be safe.

Contributor

@govint govint left a comment

The changes look good, but locking for create and remove seems like overkill, plus a few nits.

@govint
Contributor

govint commented Dec 13, 2016 via email

@govint
Contributor

govint commented Dec 13, 2016 via email

@govint
Contributor

govint commented Dec 13, 2016 via email

@brunotm
Contributor Author

brunotm commented Dec 13, 2016

If the daemon allows it, we probably need the volume lock on Get() too.
As a test I'll try to start a container while formatting a big volume.

@msterin
Contributor

msterin commented Dec 13, 2016

so to summarize the discussion, we seem to be saying that:
(1) we need to serialize access to a volume in the plugin (get/create/mount/unmount)
- a per-volume lock replaces plugin-wide serialization of ALL volume operations
(2) we still need the per-VMDK lock on the ESX service side (not related to this PR code, just a note).

Is that about right? Also, talking about VMCI - where exactly is it not thread safe?

@brunotm
Contributor Author

brunotm commented Dec 13, 2016

From basic testing, the Docker daemon seems to protect against this. Trying to start a container with a volume before the formatting is done waits until Create() returns, and apparently the same happens for Remove().

I'll remove the locks over create/remove.

@brunotm
Contributor Author

brunotm commented Dec 13, 2016

@msterin

(1) we need to serialize access to a volume in the plugin (get/create/mount/unmount)

  • a per-volume lock replaces plugin-wide serialization of ALL volume operations

No, locking will be needed on mount/umount.

(2) we still need the per-VMDK lock on the ESX service side (not related to this PR code, just a note).

Yes, this is intra-docker-host.

Is that about right? Also, talking about VMCI - where exactly is it not thread safe?

We seem to be overwriting data over parallel requests, like the #802 error.
Synchronizing the vmdkDriver eliminates the problem from tests.

@msterin
Contributor

msterin commented Dec 13, 2016

No, locking will be needed on mount/umount.

sounds good

talking about VMCI - where exactly is it not thread safe?
We seem to be overwriting data over parallel requests, like the #802 error.

This is most likely due to the simple fact that the VMCI code does not track a requestId. So if request1 is sent and then request2 is sent, but reply2 arrives first, there is no code mapping it to the correct request. So while serialization of mount/unmount hides the problem (as these are longer-running ops), I suspect the issue will still be there, just seen less often.
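Purely as an illustration of the missing piece (hypothetical, not part of the plugin or this PR, which serializes the round trip instead): with a request ID on the wire, replies could be routed back to the caller that issued them even when they arrive out of order.

package vmci

import "sync"

// pendingCalls routes a tagged reply back to the goroutine that sent the
// matching request, instead of assuming replies arrive in request order.
type pendingCalls struct {
    mu      sync.Mutex
    nextID  uint64
    waiters map[uint64]chan []byte
}

func newPendingCalls() *pendingCalls {
    return &pendingCalls{waiters: make(map[uint64]chan []byte)}
}

// register allocates a request ID and the channel its reply will be sent on.
func (p *pendingCalls) register() (uint64, chan []byte) {
    p.mu.Lock()
    defer p.mu.Unlock()
    p.nextID++
    ch := make(chan []byte, 1)
    p.waiters[p.nextID] = ch
    return p.nextID, ch
}

// deliver hands a reply to whoever registered the given request ID.
func (p *pendingCalls) deliver(id uint64, reply []byte) {
    p.mu.Lock()
    ch, ok := p.waiters[id]
    delete(p.waiters, id)
    p.mu.Unlock()
    if ok {
        ch <- reply
    }
}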

@brunotm
Contributor Author

brunotm commented Dec 13, 2016

@govint @msterin
If it is OK, I would keep the per-volume locks on create/mount/unmount.
For the init functions I would like to keep the refCountsMap lock for possible future usage like online recovery, but if that scenario seems unlikely we can drop it.

@msterin

This is most likely due to the simple fact that the VMCI code does not track a requestId. So if request1 is sent and then request2 is sent, but reply2 arrives first, there is no code mapping it to the correct request. So while serialization of mount/unmount hides the problem (as these are longer-running ops), I suspect the issue will still be there, just seen less often.

Agreed.
The lock on vmdkDriver eliminates this as it serializes the command/response sequence.

@brunotm
Contributor Author

brunotm commented Dec 13, 2016

@msterin @govint
After inspecting the Docker code, they already have a lock-by-name scheme in the volume store (not driver-wide but volume-store global), similar to what I was trying to do, which already protects against these situations.

So the only things needed on the plugin side are the synchronized refCountsMap and vmdkDriver.

I'll update the PR.
Sorry about the noise, I should have done a proper investigation first, but it was worth the exercise.

Change refCountsMap to a struct
Synchronize refcounts ops including init functions
Remove vmdkDriver synchronization
@brunotm
Contributor Author

brunotm commented Dec 13, 2016

@msterin
In my previous comments about synchronizing the vmdkDriver, I meant EsxVmdkCmd.

@brunotm
Contributor Author

brunotm commented Dec 17, 2016

@msterin @govint
Do you think this still needs changes before merging?

@pdhamdhere
Contributor

@msterin Please review.

@govint
Contributor

govint commented Dec 19, 2016

Looks good.

Contributor

@msterin msterin left a comment

The fix looks good and clean, but I do have one question (see inside).

func (r refCountsMap) getCount(vol string) uint {
rc := r[vol]
// RLocks the refCountsMap
r.mtx.RLock()
Contributor

Can't another thread change the value right after getCount returns and unlocks, but before the value is used for the comparison in plugin.go:Remove()?

Contributor Author

The Docker daemon has a reference-counted lock-by-name scheme (much like what we have in the ESX service) that serializes operations per volume (pkg/locker and volume/store), so we can't have more than one op on the same daemon/volume at the same time, and a race on the refcount value is not possible.
Our only concern is to serialize reads/writes on the refcount map.
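Roughly how that daemon-side pattern is used, assuming the pkg/locker API of New/Lock/Unlock keyed by name (the volume name is just taken from the test log above): the lock entry for a name is created on first use and dropped once the last holder releases it, so two operations on the same volume name never overlap.

package main

import (
    "fmt"

    "github.com/docker/docker/pkg/locker"
)

func main() {
    l := locker.New()

    l.Lock("volTestP14") // any other op on this volume name blocks here
    // ... create/mount/remove the volume ...
    if err := l.Unlock("volTestP14"); err != nil {
        fmt.Println(err)
    }
}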

@@ -170,10 +195,14 @@ func (r refCountsMap) getCount(vol string) uint {

// incr refCount for the volume vol. Creates new entry if needed.
func (r refCountsMap) incr(vol string) uint {
rc := r[vol]
// Locks the refCountsMap
Contributor

Same question. There seems to be a gap between when the refcount is changed in incr() and when it is checked in Mount(). During this time another thread could change the value.

Contributor Author

Same as above

@@ -340,9 +346,6 @@ func (d *vmdkDriver) Path(r volume.Request) volume.Response {
func (d *vmdkDriver) Mount(r volume.MountRequest) volume.Response {
log.WithFields(log.Fields{"name": r.Name}).Info("Mounting volume ")

d.m.Lock()
defer d.m.Unlock()
Contributor

Maybe I am missing something, but won't it now be possible to get the refcnt (line 354), and then for another thread to change the actual refcount before the decision on line 356 is made? Or is there something else (e.g. Docker serializing mounts) that would prevent it? If there is, please add a comment to the code explaining it.

Contributor Author

I'll update with comments
Thanks!

Contributor

Thanks! LGTM / Ready to merge after the comment is added.

@msterin
Contributor

msterin commented Dec 20, 2016

Ready to merge when CI passes.

@msterin msterin merged commit f313636 into vmware-archive:master Dec 20, 2016