Conversation
esx_service/cli/vmdkops_admin.py
Outdated
@@ -1286,6 +1286,7 @@ def config_init(args):
    auth = auth_data.AuthorizationDataManager(db_path)
    err = auth.new_db()
    if err:
        os.remove(db_path)
This seems to be the same change as PR #1259?
Yes, I made a branch in the same source tree that had this change for #1259, and it pulled in the changes from #1259 as well.
Please remove the change in vmdkops_admin.py as it's a duplicate of PR #1259.
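For context, a minimal sketch of the cleanup pattern the diff above introduces (and which PR #1259 carries): remove the partially created auth DB when new_db() fails. The wrapper function and the error-return convention here are assumptions based on the snippet, not the exact config_init() code.

    import os
    import auth_data  # ESX service module; assumed importable in this environment

    def init_auth_db(db_path):
        # Sketch only: mirrors the diff above, not the full config_init() logic.
        auth = auth_data.AuthorizationDataManager(db_path)
        err = auth.new_db()
        if err:
            # Don't leave a half-initialized DB behind; let a retry start clean.
            os.remove(db_path)
            return err
        return None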
esx_service/vmdk_ops.py
Outdated
if ret:
    return ret
else:
    if cur_vm.runtime.powerState != VM_POWERED_OFF:
Line 1144 isn't going to return the VM if it is on another host (in essence, if the disk is attached to a VM on another host)?
Correct, this code can wrongly reset a volume's KV if the volume is accessed across hosts. The ESX service isn't cluster-aware, and the right way to fix this is to have cluster-aware plugins so volumes are used appropriately.
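To make the pitfall concrete, here is a hedged sketch of the stale-attach check: the UUID lookup only sees VMs registered on the local ESX host, so a volume legitimately attached to a VM on another host looks stale. The helper names and the powered-off constant are simplified stand-ins, not the exact vmdk_ops.py code.

    VM_POWERED_OFF = "poweredOff"  # simplified stand-in for the real constant

    def handle_stale_attach_sketch(vmdk_path, kv_uuid, find_vm_by_uuid, reset_vol_meta):
        cur_vm = find_vm_by_uuid(kv_uuid)   # searches only the local host's inventory
        if not cur_vm:
            # Pitfall: the VM may simply be running on another ESX host, yet the
            # KV would be reset here as if the attach record were stale.
            return reset_vol_meta(vmdk_path)
        if cur_vm.runtime.powerState != VM_POWERED_OFF:
            return "volume {0} is attached to a powered-on VM".format(vmdk_path)
        # Powered-off VM on the local host: clearing the attach record is safe.
        return reset_vol_meta(vmdk_path)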
Force-pushed from 4f43c58 to b2d0798
Why not remove handle_stale_attach entirely?

The check for the volume being attached to a live VM is needed, and in a cluster scenario the service doesn't assume the other node is running a version that handles hostd events.
Why? There are 2 places where ... In ... In ...

Do you mean some ESX hosts running 0.13 or an earlier version and some running 0.14 or later? Firstly, I don't see a reason to optimize code for this corner case. Secondly, I don't understand how keeping handle_stale_attach helps here.
If the KV says it's attached, do we treat it as attached no matter what the actual status of the volume is? What happens if the event handler misses an event for some reason (host crash, service crash)? Do we go with what the KV has in it, or verify the actual status of the VM from the platform? The KV is more or less an indication to check the actual status of the VM from the platform.
a) Either leave the function as-is, as a backup for cases where the event handler has either missed the event or doesn't get one, and this function resets the KV so the volume stays usable.
b) Or remove the function at all three call sites (attach, remove, clone) and retain the logging at these sites.
Leaving the function as-is doesn't help, since it only works if the VM using the volume is on the same ESX host. Good point on not failing the request based on KV status. So we should proceed with remove and clone and return an appropriate error?
Sorry, missed this over the weekend. Yes, I'd say attempt the removal/clone/detach and handle the failure. I'll post the change with the tests for that.
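A hedged sketch of the direction agreed on here: attempt the operation and surface the platform's failure, rather than refusing up front based only on the KV 'attached' flag. The helper names (remove_disk, err) are hypothetical stand-ins, not the actual vmdk_ops.py API.

    import logging

    def remove_vmdk_sketch(vmdk_path, kv_status, remove_disk, err):
        if kv_status.get("status") == "attached":
            # Log for diagnostics, but don't fail the request on KV data alone.
            logging.info("removeVMDK: KV says %s is attached; attempting removal anyway",
                         vmdk_path)
        failure = remove_disk(vmdk_path)   # the platform call is the source of truth
        if failure:
            # A disk that really is attached will refuse to be deleted; surface that.
            return err("Failed to remove volume {0}: {1}".format(vmdk_path, failure))
        return None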
Force-pushed from b2d0798 to 1bf87f7
Testing, volume is already in use:
ESX vmdk_ops.log
The fix doesn't reset the volume KV just because we don't find the volume on the local node. Using the same volume on multiple nodes is disallowed, and if a volume is in use on one node it's locked and can't be used on another node.
Force-pushed from 1bf87f7 to b3af042
esx_service/vmdk_ops.py
Outdated
logging.info("*** removeVMDK: %s is in use, volume = %s VM = %s VM-uuid = %s (%s)", | ||
vmdk_path, vol_name, attached_vm_name, kv_uuid, ret) | ||
return err("Failed to remove volume {0}, in use by VM = {1}.".format(vol_name, attached_vm_name)) | ||
log_attached_volume(vmdk_path, kv_uuid) |
Do we really need log_attached_volume here and in other places? I can understand using it on the error path to publish additional information, but it seems unnecessary on the regular code path.
Yes, log_attached_volume() is needed because this isn't expected in the code paths where it's called.
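For illustration, a sketch of what log_attached_volume() is used for in these paths: resolve the UUID recorded in the KV to a VM and log its name, or note that no such VM is visible locally. The lookup helper and message wording are assumptions; the real function lives in vmdk_ops.py.

    import logging

    def log_attached_volume_sketch(vmdk_path, kv_uuid, find_vm_by_uuid):
        vm = find_vm_by_uuid(kv_uuid)
        if vm:
            logging.warning("Disk %s is marked attached to VM %s (uuid %s)",
                            vmdk_path, vm.config.name, kv_uuid)
        else:
            logging.warning("Disk %s is marked attached to an unknown VM (uuid %s); "
                            "the VM may be on another host", vmdk_path, kv_uuid)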
esx_service/vmdk_ops.py
Outdated
ret = reset_vol_meta(vmdk_path)
if ret:
    return ret
msg = "Failed to find VM (id {0}), disk {1} is already attached".format(kv_uuid,
We should publish VM name from KV if available.
I checked the code again, and while the KV does/can give the volume name, log_attached_volume() actually checks the UUID and, if the VM is found, logs the VM name. If the UUID isn't resolved to a VM name, then it would be incorrect to print the name from the KV: the VM may have been renamed (in which case we get the current name) or moved to another node (and the service is local to the ESX host). The proposed changes handle printing the VM name as appropriate.
Agreed on this reasoning. However, renaming the VM is a rare scenario and I think we can still go ahead with printing the VM name we have in the KV.
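A sketch of the naming fallback under discussion: prefer the current name resolved from the UUID, and only fall back to the (possibly stale) name recorded in the KV when the UUID can't be resolved locally. Helper and field names are assumptions, not the actual vmdk_ops.py API.

    def attached_vm_display_name(kv_uuid, kv_vm_name, find_vm_by_uuid):
        vm = find_vm_by_uuid(kv_uuid)   # local-host lookup only
        if vm:
            return vm.config.name       # current name, even if the VM was renamed
        # Not found locally (e.g. the VM is on another host): fall back to the
        # name the KV recorded at attach time, which may be stale.
        return kv_vm_name or "unknown VM (uuid {0})".format(kv_uuid)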
Force-pushed from b3af042 to 83f03fc
The Swarm test seems to have timed out on a poll and errored out; otherwise the build is good.
FAIL: swarm_test.go:91: SwarmTestSuite.TestDockerSwarm swarm_test.go:141:
Force-pushed from 83f03fc to c738436
CI failed with the issue in #1372
Depends on #1368 to handle the swarm test error
Force-pushed from a0f8ee6 to e85e2aa
Force-pushed from 7556f0e to 0cc96a7
Apologies for the late review.
Let's wrap this up. I have a comment on the behavior.
Let's address it and merge this.
CI is in progress; once it's complete, we can merge.
@@ -92,6 +93,9 @@ func (s *BasicTestSuite) TestVolumeLifecycle(c *C) {
	status := verification.VerifyAttachedStatus(s.volName1, host, s.esx)
	c.Assert(status, Equals, true, Commentf("Volume %s is not attached", s.volName1))

	out, err = dockercli.DeleteVolume(host, s.volName1)
	c.Assert(err, Not(IsNil), Commentf(out))
Nit: We can use NotNil instead of Not(IsNil)
esx_service/vmdk_ops.py
Outdated
            vol_name = vmdk_utils.get_volname_from_vmdk_path(vmdk_path)
            logging.info("*** removeVMDK: %s is in use, volume = %s VM = %s VM-uuid = %s",
                         vmdk_path, vol_name, attached_vm_name, kv_uuid)
            return err("Failed to remove volume {0}, in use by VM = {1}.".format(vol_name, attached_vm_name))
AFAICT we are not relying on data in the KV for any operation; this would be an exception.
If I understand correctly, this is a workaround for a DiskLib issue. Rather than adding a workaround here, I would propose following the solution proposed in the original bug: read the KV into memory, delete the file, and if the delete fails, write the KV back to disk.
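A sketch of the alternative proposed in this comment; the reply below explains why it turned out not to be feasible (the sidecar can't be recreated while the disk is attached). load_kv, delete_disk and save_kv are hypothetical helpers, not the real KV API.

    def remove_with_kv_restore(vmdk_path, load_kv, delete_disk, save_kv):
        kv_backup = load_kv(vmdk_path)    # read the KV sidecar into memory
        failure = delete_disk(vmdk_path)  # attempt the DiskLib delete
        if failure:
            # Deletion failed: write the KV back so the volume metadata isn't lost.
            save_kv(vmdk_path, kv_backup)
            return failure
        return None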
This is not feasible because once the KV is deleted by the DiskLib issue, a new KV can't be created, as the disk can't be opened with the flags needed to create the sidecar. A sidecar can't be created for a disk that is in use (still attached to some VM). I tried this out and confirmed it.
Force-pushed from 0cc96a7 to b6c535a
CI failing due to #1371
@govint Please triage the CI failure and re-push if needed. This PR has been open for a while.
@pdhamdhere the CI failed with #1371. Can I merge this change, since it's not my change that's causing the issue?
The test that was causing #1371 is fixed. We should get a green CI.
Green CI.
Closing.
Removed the code in handle_stale_attach() that reset the KV if the VM is powered off. This code is not relevant anymore now that the service handles VM events.