Fix issue where Python was killing reaper thread in pipe remove callbacks #55
…eaper thread. Use ffi.new_handle and ffi.from_handle for pipe callbacks, eliminating _live_sockets
Thanks for the PR! You have clearly identified an issue that needs to be fixed. Your approach makes sense to me. More comments on the code are going to be inline!
The failing builds by CI are non-issues. The Mac build is flaky on Azure pipelines, and I'm not using AppVeyor at all anymore.
```diff
@@ -329,16 +333,14 @@ def __init__(self, *,

        # set up pipe callbacks. This **must** be called before listen/dial to
        # avoid race conditions.
        as_void = ffi.cast('void *', id(self))

        handle = ffi.new_handle(self)
```
I'm having a hard time reasoning about this code with respect to use-after-free possibilities. I want to make sure it's not possible for a pipe callback to be called after Python has garbage collected the socket. My thinking here is that nothing stops that from happening: the socket keeps a reference to its handle, but since that is the only reference, the GC could collect both the socket and the handle at the same time.
Wait a second, now I'm pretty sure I'm wrong about that possibility. Since __del__ calls close on the socket, and IIRC nng calls the POST_PIPE_REM callback synchronously in close, the handle should always be valid. So actually, this looks great, and way better than keeping track of everything in the global dict.
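To make the lifetime question concrete, here is a minimal sketch of passing a Python object through a C-style void * callback argument. It uses the standard-library ctypes module as a stand-in for cffi's ffi.new_handle and ffi.from_handle so that it runs without pynng or nng; FakeSocket, PIPE_CB, and pipe_cb are hypothetical names, not pynng's. The point is the same in both libraries: the raw pointer is only valid while something keeps the handle object alive.

```python
import ctypes

# C-style callback signature: void cb(void *arg)
PIPE_CB = ctypes.CFUNCTYPE(None, ctypes.c_void_p)


class FakeSocket:
    """Hypothetical stand-in for a socket that registers pipe callbacks."""

    def __init__(self, name):
        self.name = name
        self.events = []
        # The handle owns a reference to self. Storing it on the socket keeps
        # the address below valid for as long as the socket is alive.
        self._handle = ctypes.py_object(self)
        self.user_data = ctypes.addressof(self._handle)


def _pipe_cb(arg):
    # Recover the Python object from the raw address (the role that
    # ffi.from_handle plays in cffi). This is only safe while the
    # handle has not been garbage collected.
    sock = ctypes.py_object.from_address(arg).value
    sock.events.append("pipe event")


# Keep a reference to the wrapped callback so it is not collected either.
pipe_cb = PIPE_CB(_pipe_cb)

sock = FakeSocket("rep")
# A C library would make this call itself, using the void * it stored earlier.
pipe_cb(sock.user_data)
```

If the socket and its handle were collected while the C side still held the address, the lookup in the callback would read freed memory, which is the use-after-free scenario under discussion.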
It seems that the NNG_PIPE_EV_REM_POST callback is only ever called from a reap_worker thread, so there may in fact be a possible use after free.
I'll see if I can come up with a test for it.
Running a simple stress test:
```python
from pynng import Req0, Rep0
import time

listener = Rep0(listen='tcp://localhost:9999')
while True:
    client = Req0(dial='tcp://localhost:9999')
    client.send(b'a')
    client.close()
    listener.recv_msg()
```
I get key errors such as this:
```
From cffi callback <function _nng_pipe_cb at 0x7f36a3ebae18>:
Traceback (most recent call last):
  File "/home/fuzz/pynng/pynng/nng.py", line 1291, in _nng_pipe_cb
    pipe = sock._pipes[pipe_id]
KeyError: 636305841
```
These are a result of the connect callbacks arriving out of order, but I haven't found any issues with use after free on the REM_POST events.
…ceived message with a pipe. Create a new Pipe if the post callback is called before pre
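The out-of-order case can be handled by having every callback look the pipe up with a get-or-create helper instead of indexing _pipes directly. The sketch below is pure Python and does not use pynng; PipeRegistry, FakePipe, and the method names are hypothetical and only illustrate the idea named in the commit message (create a new Pipe if the post callback is called before pre).

```python
class FakePipe:
    """Hypothetical stand-in for a pipe object, identified by its id."""

    def __init__(self, pipe_id):
        self.pipe_id = pipe_id
        self.seen = []  # events observed for this pipe, in arrival order


class PipeRegistry:
    """Tracks pipes by id without assuming callbacks arrive in order."""

    def __init__(self):
        self._pipes = {}

    def _get_or_create(self, pipe_id):
        # A plain self._pipes[pipe_id] raises KeyError when the "post"
        # event arrives first, so create the entry on demand instead.
        pipe = self._pipes.get(pipe_id)
        if pipe is None:
            pipe = self._pipes[pipe_id] = FakePipe(pipe_id)
        return pipe

    def on_add_pre(self, pipe_id):
        self._get_or_create(pipe_id).seen.append("add_pre")

    def on_add_post(self, pipe_id):
        self._get_or_create(pipe_id).seen.append("add_post")

    def on_rem_post(self, pipe_id):
        # Removal may arrive for an id that was never recorded;
        # pop() with a default tolerates that.
        pipe = self._pipes.pop(pipe_id, None)
        if pipe is not None:
            pipe.seen.append("rem_post")
        return pipe


registry = PipeRegistry()
registry.on_add_post(636305841)  # "post" arrives before "pre"
registry.on_add_pre(636305841)
removed = registry.on_rem_post(636305841)
```

A real implementation would likely also need a lock around the dictionary, since the callbacks may arrive on different threads.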
I added another commit to the PR which addresses the issues of:

Connect callbacks arriving out of order
In this case, the […]

Received messages arriving before pipe callbacks
Received messages can arrive before pipe callbacks were triggered, causing […] I acquire […]

Running this stress test

```python
from pynng import Req0, Rep0
import time

iterations = 100_000

stats = {
    'clients': 0,
    'pre_pipe_connect': 0,
    'post_pipe_connect': 0,
    'post_pipe_remove': 0
}


def pre_pipe_connect(pipe):
    stats['pre_pipe_connect'] += 1


def post_pipe_connect(pipe):
    stats['post_pipe_connect'] += 1


def post_pipe_remove(pipe):
    stats['post_pipe_remove'] += 1


listener = Rep0(listen='tcp://localhost:9999')
listener.add_pre_pipe_connect_cb(pre_pipe_connect)
listener.add_post_pipe_connect_cb(post_pipe_connect)
listener.add_post_pipe_remove_cb(post_pipe_remove)

while True:
    client = Req0(dial='tcp://localhost:9999')
    client.send(b'a')
    client.close()
    stats['clients'] += 1
    msg = listener.recv_msg()
    # Every received message should have an associated pipe
    assert msg.pipe is not None
    if stats['clients'] % 1000 == 0:
        print(stats)
    if stats['clients'] == iterations:
        break

# Close the listener
listener.close()

# At this point, every callback count should equal iterations
print(stats)
for k, v in stats.items():
    assert v == iterations
```

After 100,000 iterations we can see that each callback was called 100,000 times on the receive socket. The assert checked that all received messages had an associated pipe.

```
{'clients': 100000, 'pre_pipe_connect': 100000, 'post_pipe_connect': 100000, 'post_pipe_remove': 100000}
```
I also tried with an explicit […]
Note that none of this definitively proves that a use after free is impossible, but so far, empirically, this change is far more stable than the current master.
Thanks for looking into this @wtfuzz. I'm hoping to be able to look over your changes this weekend, but it may not happen until a bit later... like most people, it feels like there just aren't enough hours in the day :-)
Sorry for the big delay here. Calling […]
Thanks for the PR! Merging now.
I tracked down an issue with a deadlock when closing a socket. In the stack trace below, you can see that a callback for the NNG_PIPE_EV_REM_POST event, which originated from the reaper thread, was causing Python to call PyThread_exit_thread(), which effectively calls pthread_exit().
This is due to the interpreter being in a finalizing state before the callback is called. Python attempts to init the thread and immediately kills the reaper, which results in a deadlock.
I added an atexit handler which calls nng_fini() to ensure Python doesn't start tearing itself down until nng has cleaned up its threads.
I also eliminated _live_sockets by passing an ffi handle of the Socket through the pipe callback instead of having to maintain a map internally.
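The ordering that the atexit handler enforces can be sketched with the standard library alone. Here a daemon thread stands in for nng's reaper thread and shutdown_worker stands in for the call to nng_fini(); both names are hypothetical, and the actual change calls into nng through cffi rather than joining a Python thread.

```python
import atexit
import queue
import threading

events = []
jobs = queue.Queue()
_STOP = object()  # sentinel telling the worker to exit


def _worker():
    # Stand-in for a library-owned background thread that runs callbacks.
    while True:
        job = jobs.get()
        if job is _STOP:
            break
        events.append(job)


# daemon=True: like a native library thread, the interpreter will not wait
# for it on its own at exit.
worker = threading.Thread(target=_worker, daemon=True)
worker.start()


def shutdown_worker():
    """Stop the worker and wait for it; safe to call more than once."""
    if worker.is_alive():
        jobs.put(_STOP)
        worker.join()


# atexit handlers run before the interpreter enters its finalizing state, so
# the worker can finish pending callbacks while Python is still fully usable.
atexit.register(shutdown_worker)

jobs.put("pipe removed")
```

Without the registered handler, the pending job could still be waiting when the interpreter starts finalizing, which is the situation the PR description identifies as the cause of the deadlock.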