Plugin panic during format-remove #1062
Comments
We should upgrade Go to 1.7 and see whether the panic is still reproducible.
Line:85: cmdS := C.CString(string(jsonStr)) <----------- caused panic
Checking plugin logs for the remove command that caused this.
The volume (concurrent test on the same host) was created; the remove then hit the panic and the plugin restarted. The volume remove was retried and is now stuck (similar to the problem solved in the earlier issue, where the volume was left attached after a panic during volume create).

Plugin logs:

Volume is created

Volume remove is initiated and plugin restarts
2017-03-22 06:33:01.278789815 +0000 UTC [INFO] No config file found. Using defaults.

Volume remove was retried and fails because the earlier panic left the volume attached to the VM.
2017-03-22 06:33:16.123089018 +0000 UTC [INFO] Removing volume name=volTestP-same11
Just guessing, but we may be hitting a leak. According to the cgo documentation, if C.malloc() ever fails to allocate memory, it crashes the program (see https://golang.org/cmd/cgo/).
Looks like this issue got exposed with #1058. Before #1058, there was an implicit delay of 1+ sec, so the concurrency test wasn't fully concurrent! I propose we disable part of the concurrency test. What do you think, @shuklanirdesh82 & @govint?
Test will be re-enabled once we root cause & fix #1062
Yeah, agreed! I would rather comment out the whole test, as it delays PR merges more than expected (mentioned in #1057).
But how is this related to the wait for device attach? If it is, shouldn't the issue also happen with a larger load of create-volume requests to the managed plugin?
#1066 disabled the tests, but we still need to root-cause the panic triggered by the concurrency test.
I have been running a larger config of up to 500 volumes and more threads for the concurrency test, with Photon 4.4.41 and Ubuntu guests on ESXi6.0u4. No repro yet. From the CI runs, at least, it looks like the Go/C heap is corrupted. Converting a Go string to a C string causes a malloc() to create the C string, and this is where the panic is seen. I used the debug pkg to get a stack of all the threads and am checking that. I am currently trying more debug logs and checking for memory overruns, with both single and multiple threads. Another possible cause is the thread safety of the C calls that the plugin makes, since the issue happens with the concurrent tests. So that's one more angle to investigate.
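The thread does not show which call was used to collect the stacks. One way to capture the stacks of all goroutines from inside a Go process is runtime.Stack with its second argument set to true; the sketch below assumes that approach, and the helper name `allStacks` is hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
)

// allStacks is a hypothetical helper. It returns the formatted stack
// traces of every goroutine in the process, growing the buffer until
// the whole dump fits.
func allStacks() string {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true) // true = include all goroutines
		if n < len(buf) {
			return string(buf[:n])
		}
		// The dump may have been truncated; retry with a larger buffer.
		buf = make([]byte, 2*len(buf))
	}
}

func main() {
	fmt.Print(allStacks())
}
```

Logging this dump just before the C call that panics would show which goroutines are inside the plugin's C calls at the same time, which bears on the thread-safety angle.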
Tried the following and can no longer repro the issue. The hunch was that the buffer used to collect the response may be overflowing, causing the runtime corruption.
It seems the issue was internal to the Go/C runtime: with no code changes, it no longer reproduces with all tests enabled. @pdhamdhere, should we re-enable the concurrent tests in the code, given that the issue isn't reproducing?
Fixed via #1118
https://ci.vmware.run/vmware/docker-volume-vsphere/1759
CC / @govint