Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fix issue where Python was killing reaper thread in pipe remove callbacks #55

Merged
merged 2 commits into from
Feb 1, 2020

Conversation

wtfuzz
Copy link
Contributor

@wtfuzz wtfuzz commented Jan 18, 2020

I tracked down an issue with a deadlock when closing a socket. In the stack trace below, you can see that a callback for the NNG_PIPE_EV_REM_POST which originated from the reaper thread was causing python to call PyThread_exit_thread() which effectively calls pthread_exit().

This is due to the interpreter being in a finalizing state before the callback is called. Python attempts to init the thread and immediately kills the reaper, which results in a deadlock.

I added an atexit handler which calls nng_fini() to ensure python doesn't start tearing itself down until nng has cleaned up its threads.

I also eliminated _live_sockets by passing an ffi handle of the Socket through the pipe callback instead of having to maintain a map internally.

Thread 39 "python" hit Breakpoint 3, PyThread_exit_thread () at Python/thread_pthread.h:353
353	{
(gdb) bt
#0  PyThread_exit_thread () at Python/thread_pthread.h:353
#1  0x000055555567527d in exit_thread_if_finalizing (tstate=0x7fffbc000b90, runtime=<optimized out>) at Python/ceval.c:246
#2  PyEval_RestoreThread (tstate=tstate@entry=0x7fffbc000b90) at Python/ceval.c:400
#3  0x00005555556bdc11 in PyGILState_Ensure () at Python/pystate.c:1298
#4  0x00007ffff745f1c3 in gil_ensure () at c/misc_thread_common.h:352
#5  0x00007ffff7465077 in cffi_call_python (externpy=0x7ffff75f7280 <_cffi_externpy___nng_test>, args=0x7fffe47c9d10 "\f\235|\344\377\177") at c/call_python.c:268
#6  0x00007ffff74c6714 in _nng_test (a0=..., a1=<optimized out>, a2=<optimized out>) at build/temp.linux-x86_64-3.8-pydebug/pynng._nng.c:2074
#7  0x00007ffff75028e7 in _test_pipe_callback (p=..., event=NNG_PIPE_EV_REM_POST, arg=0x5555562fc120) at test.c:24
#8  0x00007ffff750244b in _cffi_d__test_pipe_callback (x0=..., x1=<optimized out>, x2=<optimized out>) at build/temp.linux-x86_64-3.8-pydebug/pynng._nng.c:2079
#9  0x00007ffff75247e9 in nni_pipe_run_cb (p=0x7fffcc000b60, ev=NNG_PIPE_EV_REM_POST) at /home/fuzz/pynng/nng/src/core/socket.c:1659
#10 0x00007ffff751fae8 in pipe_destroy (p=0x7fffcc000b60) at /home/fuzz/pynng/nng/src/core/pipe.c:66
#11 0x00007ffff7520b0b in reap_worker (notused=0x0) at /home/fuzz/pynng/nng/src/core/reap.c:37
#12 0x00007ffff75281b5 in nni_thr_wrap (arg=0x7ffff75fa200 <reap_thr>) at /home/fuzz/pynng/nng/src/core/thread.c:94
#13 0x00007ffff752daff in nni_plat_thr_main (arg=0x7ffff75fa200 <reap_thr>) at /home/fuzz/pynng/nng/src/platform/posix/posix_thread.c:218
#14 0x00007ffff7f9b669 in start_thread (arg=<optimized out>) at pthread_create.c:479
#15 0x00007ffff7d69323 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) 

…eaper thread. Use ffi.new_handle and ffi.from_handle for pipe callbacks, eliminating _live_sockets
@codypiersall
Copy link
Owner

Thanks for the PR! You have clearly identified an issue that needs to be fixed. Your approach makes sense to me. More comments on the code are going to be inline!

@codypiersall
Copy link
Owner

The failing builds by CI are non-issues. The Mac build is flaky on Azure pipelines, and I'm not using AppVeyor at all anymore.

@@ -329,16 +333,14 @@ def __init__(self, *,

# set up pipe callbacks. This **must** be called before listen/dial to
# avoid race conditions.
as_void = ffi.cast('void *', id(self))

handle = ffi.new_handle(self)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time reasoning about this code wrt use after free possibilities. I want to make sure it's not possible that a pipe callback gets called after Python has garbage collected the socket. My thinking here is that there is nothing stopping that from happening here; the socket keeps a reference to its handle, but since that is the only reference, it could happen that the GC collects both the socket and the handle at the same time.

Wait a second, now I'm pretty sure I'm wrong about that possibility. Since __del__ calls close on the socket, and IIRC nng calls the POST_PIPE_REM callback synchronously in close, the handle should always be valid. So actually, this looks great, and way better than keeping track of everything in the global dict.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that NNG_PIPE_EV_REM_POST is only ever called from a reap_worker thread, so there may in fact be a possible use after free.

I'll see if I can come up with a test for it.

Copy link
Contributor Author

@wtfuzz wtfuzz Jan 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running a simple stress test:

from pynng import Req0, Rep0
import time

listener = Rep0(listen='tcp://localhost:9999')

while True:
    client = Req0(dial='tcp://localhost:9999')
    client.send(b'a')
    client.close()
    listener.recv_msg()

I get key errors such as this:

From cffi callback <function _nng_pipe_cb at 0x7f36a3ebae18>:
Traceback (most recent call last):
  File "/home/fuzz/pynng/pynng/nng.py", line 1291, in _nng_pipe_cb
    pipe = sock._pipes[pipe_id]
KeyError: 636305841

These are a result of the connect callbacks arriving out of order, but I haven't found any issues with use after free on the REM_POST events.

…ceived message with a pipe. Create a new Pipe if the post callback is called before pre
@wtfuzz
Copy link
Contributor Author

wtfuzz commented Jan 20, 2020

I added another commit to the PR which addresses the issues of:

Connect callbacks arriving out of order

In this case, the Pipe is created and associated with the Socket in the post connect callback. When the pre connect callback fires, it will just find the existing Pipe and call any user registered callbacks.

Received messages arriving before pipe callbacks

Received messages can arrive before pipe callbacks were triggered, causing recv_msg to not associate a pipe to a message.

I acquire _pipe_notify_lock in _try_associate_msg_with_pipe to ensure no race with the callbacks, and then create a new Pipe and associate it with the Socket

Running this stress test

from pynng import Req0, Rep0
import time

iterations = 1e5

stats = {
    'clients': 0,
    'pre_pipe_connect': 0,
    'post_pipe_connect': 0,
    'post_pipe_remove': 0
}

def pre_pipe_connect(pipe):
    stats['pre_pipe_connect'] += 1

def post_pipe_connect(pipe):
    stats['post_pipe_connect'] += 1

def post_pipe_remove(pipe):
    stats['post_pipe_remove'] += 1

listener = Rep0(listen='tcp://localhost:9999')
listener.add_pre_pipe_connect_cb(pre_pipe_connect)
listener.add_post_pipe_connect_cb(post_pipe_connect)
listener.add_post_pipe_remove_cb(post_pipe_remove)

while True:
    client = Req0(dial='tcp://localhost:9999')
    client.send(b'a')
    client.close()
    stats['clients'] += 1

    msg = listener.recv_msg()
    assert(msg.pipe is not None)

    if stats['clients'] % 1000 == 0:
        print(stats)

    if stats['clients'] == iterations:
        break

# Close the listener
listener.close()

# At this point, the callback counts should == iterations

print(stats)
for k,v in stats.items():
    assert(v == iterations)

After 100,000 iterations we can see that 100k of each callback was successfully called on the receive socket. The assert checked that all received messages had an associated pipe.

{'clients': 100000, 'pre_pipe_connect': 100000, 'post_pipe_connect': 100000, 'post_pipe_remove': 100000}

@wtfuzz
Copy link
Contributor Author

wtfuzz commented Jan 20, 2020

I also tried with an explicit gc.collect() after closing each client socket, and have not been able to get it to misbehave.

Note that none of this definitively proves there isn't the possibility of a use after free, but so far empirically this change is far more stable than the current master.

@codypiersall
Copy link
Owner

Thanks for looking into this @wtfuzz. I'm hoping to be able to look over your changes this weekend, but it may not happen for a bit later... like most people, it feels like there just aren't enough hours in the day :-)

@codypiersall
Copy link
Owner

Sorry for the big delay here. Calling _add_pipe is a great move I think, and adding the atexit handler makes sense.

Thanks for the PR! Merging now.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants