Skip to content
This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Corrupted metadata file should be handled gracefully and not terminate operations #662

Closed
msterin opened this issue Oct 29, 2016 · 8 comments

Comments

@msterin
Copy link
Contributor

msterin commented Oct 29, 2016

Today if metadata file is not complete (e.g. missing created-by, which may happen in diffent error paths) , getVMDK returns error. Since docker runs "get()" command before anything else, that blocks operations like remove or inspect. This was found when fixing #661

I think it would be much better if vol_info simply returned "N/A" or "Unknown" or "Failed to get" for fields it cannot fill for any reason. This would still check for volume existence first in getVMDK , and all admin commands (inspect/ls) would keep working.

Example of blocked operations:

root@photon-1uL8XNsuB [ ~ ]# docker volume ls 
DRIVER              VOLUME NAME
vmdk                ty1
root@photon-1uL8XNsuB [ ~ ]# docker volume inspect ty1
[]
Error response from daemon: get ty1: VolumeDriver.Get: Failed to get disk details for /vmfs/volumes/datastore1/dockvols/ty1.vmdk ('created-by')
root@photon-1uL8XNsuB [ ~ ]# docker volume inspect ty1
[]
Error response from daemon: get ty1: VolumeDriver.Get: Failed to get disk details for /vmfs/volumes/datastore1/dockvols/ty1.vmdk ('created-by')
root@photon-1uL8XNsuB [ ~ ]# docker volume rm ty1
Error response from daemon: get ty1: VolumeDriver.Get: Failed to get disk details for /vmfs/volumes/datastore1/dockvols/ty1.vmdk ('created-by')

Assigning to @govint to take a look first, and decide the best way to handle it.

@govint
Copy link
Contributor

govint commented Oct 30, 2016

Taking a step back, the current behavior isn't really bad. First of all how did the meta-data get corrupted? Unless there is a code path or some sort of intentional tampering of the KV store there is no way to corrupt the KV store file. Prefer to identify the code path or workflow that can corrupt the KV and fix that.

And if the KV store is corrupt then pretty much that volume isn't usable or shouldn't be usable thereafter.

Or we can move the volume KV to the SQL DB.

@msterin
Copy link
Contributor Author

msterin commented Oct 30, 2016

The volume is perfectly usable wiithout the metadata in the majority of cases. And if it not needed, it least it should be deleteable. Then KV file was not corrupted, it was just saying detached and nothing else, so moving to SQLite won't help, I suspect it's a specific path in code rather that external corruption. This is why I did not want KV on decision path. I agree its much better to underatand how the KV got that ways, but meanwhile, the volume should be deleteable at the very least, and I think we should optimize for more reliable volume working even in some bugs case, rather than less reliable and becoming unusable and undeleteable on a slightest bug in KV.

On moving KV to SQLite. Itnqouls be good since it drops sidecar (and thus esx 6.0) dependency , but now the KV will be totally independent from vmdk and we'd need to understand the b and svmotion use cases. Maybe worth investigation as a separate issue.

@govint
Copy link
Contributor

govint commented Oct 31, 2016

The only time that a code path preemptively resets the meta-data for a volume is when the disk isn't found in the VM configuration on a detach request or the VM itself isn't found. Which, while its an error path, the volume meta-data is reset to "detached state" in these scenarios. Looks like that has happened in this test for the KV to report the volume as detached. Although why the disk that was meant to be attached to the VM got removed from the VM config is unknown.

But yes, the KV state need not restrict what a Get() call returns as meta-data for a volume.

@govint
Copy link
Contributor

govint commented Oct 31, 2016

@msterin can I get the KV file contents. Looks like this is a storage issue, if even "created-by" isn;t there in the KV file then the FS is loosing data written to the file and unreliable.

@msterin
Copy link
Contributor Author

msterin commented Oct 31, 2016

Great, so looks like we have an agreement that KV issues should not impact get() so at least 'rm' and 'inspect' and others would work. Do you want to get it fixed or do you want to use it as a training exercise for others ?

@govint
Copy link
Contributor

govint commented Nov 2, 2016

@msterin - frankly this is a very very contrived way to get changes submitted into the code. Training or otherwise is left to the individual.

Beats all reasoning how one KV pair alone got left inside the KV file while all other pairs got removed.

I'd love to use this issue though to figure out how these stupid things happen!! Not surprising :-))

@govint
Copy link
Contributor

govint commented Nov 3, 2016

Modified the code to implement volume status the way Docker defines it. Status on a Get() call is optional and in the event that getting status fails from the server its perfectly OK to return a NULL map to Docker (on Get()).

Once testing is complete I'll post the changes for review.

@govint
Copy link
Contributor

govint commented Nov 9, 2016

Closed via #699

# for free to subscribe to this conversation on GitHub. Already have an account? #.
Projects
None yet
Development

No branches or pull requests

3 participants