Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Race condition with nng_pipe_cb and live_sockets #40

Open
chaoflow opened this issue May 29, 2019 · 19 comments
Open

Race condition with nng_pipe_cb and live_sockets #40

chaoflow opened this issue May 29, 2019 · 19 comments
Labels
bug Something isn't working

Comments

@chaoflow
Copy link

With pynng 0.4.0 and Python 3.7.1 there seems to be a race condition involving _nng_pipe_cb.

Traceback (most recent call last):
  File "/venv/lib/python3.7/site-packages/pynng/nng.py", line 1211, in _nng_pipe_cb
    sock = _live_sockets[sock_id]
  File "/usr/lib/python3.7/weakref.py", line 137, in __getitem__
    o = self.data[key]()
KeyError: 139787429631984

@codypiersall Do you have any suggestions what could lead to a socket being expected in _live_sockets but not being there anymore or not yet and how to debug this?

So far I'm failing to produce a minimal working example. The code involves async generators with Pair0 sockets used as context managers and the race condition is triggered independent of whether arecv/asend or recv/send are being used.

Moving https://github.com/codypiersall/pynng/blob/master/pynng/nng.py#L308 to https://github.com/codypiersall/pynng/blob/master/pynng/nng.py#L295 seems to reduce the occurrence but does not eliminate the race condition.

@codypiersall
Copy link
Owner

Thanks for opening the issue, @chaoflow.

You're 100% right about moving the addition to _live_sockets above the listen and dial calls. That's absolutely a race. If anyone else is following along, it's a race because it is possible that the socket starts listening, and it is possible that a remote connects to it before its id is added to the _live_sockets dict, which means when the callback gets called the socket won't be found in the _live_sockets dict.

There's a separate race inherent in my design here too, which is that when a socket is garbage collected on the Python end, it is possible that some of the asynchronous stuff happening in the underlying nng lib will still be running, but the socket won't be in the _live_sockets dict anymore. Nothing is

So there are (at least) two races: the one that you realized could be fixed by moving the line you moved, and one that can happen whenever a socket is gc'd.

After you made the fix you identified, do you think you're hitting the gc race, or do you think it may be a separate one?

@codypiersall
Copy link
Owner

Without modifying the source, a workaround for the first race is to not pass the listen or dial args to the socket constructor, but to call listen and dial on separate lines, if anyone else runs into this but doesn't want to modify the source.

@codypiersall codypiersall added the bug Something isn't working label May 29, 2019
@chaoflow
Copy link
Author

@codypiersall Thank you for the very fast response and the workaround, which works. The remaining races occur around context exits. I think it is the gc race.

@chaoflow
Copy link
Author

To be more precise, we exit the socket as context manager by also ending the async generator, i.e. the socket is out of scope (and garbage collected) right after closing it.

codypiersall added a commit that referenced this issue May 30, 2019
There was a race where the socket connect callbacks could get called
before the socket was added to the global cache.  Then, in the callback,
the socket would try to be retrieved from the cache, but it wouldn't be
there, resulting in an error.

This was a pretty bad race, and it affected real code. This was brought
up on #40 by GitHub user
chaoflow.  Many thanks to him for identifying the fix!
@codypiersall
Copy link
Owner

codypiersall commented May 30, 2019

The race condition in Socket.__init__ is fixed in master (as of 0ea1c43). Still left is the race around socket removal; it's a little thornier.

I've also verified that it is a problem with the sockets getting GC'd (and thus removed from the _live_sockets WeakValueDictionary) by changing _live_sockets to a plain old dict. This isn't an acceptable actual solution by itself, though, because then we'd be leaking a lot of memory--every socket would be leaked! But I do think it is a necessary part of a solution, with some additional bookkeeping to make sure that all outstanding connect/disconnect callbacks have been called.

Something like the following might work:

  1. change _live_sockets to a dict with the key of the pipe id and the val a tuple: (num_connected_pipes, pynng.Socket)
  2. Register a pre_pipe_connect callback which will increment the num_connected_pipes if the socket is not closed.
  3. Register a post_pipe_connect callback which will decrement the num_connected_pipes by one and also remove the socket from _live_sockets if num_connected_pipes is 0 and the socket is closed.
  4. Add logic to pynng.Socket.close to remove itself from _live_sockets if num_connected_pipes is 0 while holding the _pipe_notify_lock.

I think this will work, but I have not tested it yet, and I do not think I'll get to it tonight.

By the way, here is a simple reproducing script:

import threading
import pynng

addr = 'inproc:///beepbeep'
sock = pynng.Pair0(listen=addr)
while True:
    p0 = pynng.Pair0(dial=addr, block_on_dial=False)

This also exposes some race between getting pipes from the socket objects as well, but that is a separate issue.

@chaoflow
Copy link
Author

@codypiersall Are there legitimate cases where _nng_pipe_cb should run and do something meaningful despite the socket being already garbage collected or could we do something like:

def _nng_pipe_cb(lib_pipe, event, arg):
    sock_id = int(ffi.cast('size_t', arg))
    try:
        sock = _live_sockets[sock_id]
    except KeyError:
        logger.debug('Ignoring callback for garbage-collected socket')
        return
    with sock._pipe_notify_lock:

at https://github.com/codypiersall/pynng/blob/master/pynng/nng.py#L1213.

It would be nice though, to be sure that it got garbage collected and is not missing for some other reason.

@codypiersall
Copy link
Owner

Are there legitimate cases where _nng_pipe_cb should run and do something meaningful despite the socket being already garbage collected?

I can imagine legitimate use cases; whether they actually exist, I'm not sure. But, for example, someone may grab a resource in _pre_pipe_connect_cb and release it in post_pipe_remove_cb.

@codypiersall
Copy link
Owner

@chaoflow Which transport were you using when you saw these races? I'm seeing the GC race issue I was talking about only with the inproc transport. I'm going to do some spelunking in the nng library to understand what's going on under the hood.

@chaoflow
Copy link
Author

chaoflow commented Jun 1, 2019

@codypiersall That was with inproc as well.

@codypiersall
Copy link
Owner

After a little more investigation, it turns out that inproc was a red herring: in my ipc tests, the sockets were never actually establishing a connection, so I wasn't seeing a bug.

I've figured out that the root cause seems to be that pipe events can be called in an order I didn't expect. I expected that the events would strictly fire in the following order (I'm using the names from the nng lib):

  1. NNG_PIPE_EV_ADD_PRE
  2. NNG_PIPE_EV_ADD_POST
  3. NNG_PIPE_EV_REM_POST

However, I have observed that the sequence above is not always what happens in practice. The code that I used to determine this I just threw at https://github.com/codypiersall/pynng/tree/pipe_evt_dbg.

I think this may be something that needs to be fixed in the nng lib. I'll look into it and see about posting a PR or an issue there.

@chaoflow
Copy link
Author

chaoflow commented Jun 5, 2019

@codypiersall Thank you very much for the update and for looking into it!

@codypiersall
Copy link
Owner

In a surprising turn of events, I was able to write a patch for nng and make the callbacks behave where they can never be called in order of NNG_PIPE_EV_REM_POST before NNG_PIPE_EV_ADD_POST. I'll open an issue in nng about getting this upstream in the next few days. My current implementation for nng was pretty hacky, so I need to clean it up, and see if Garrett is interested in fixing this upstream.

@codypiersall
Copy link
Owner

Soon I'll make a PR in nng with a branch I made that fixes this issue. I think I should be able to get that upstream. If my fix gets merged into nng, I'll close out this issue; otherwise, I'll add a step in the build process of pynng to patch nng with this fix. I think upstream will be amenable to the change in my branch though.

@bsdis
Copy link

bsdis commented Aug 18, 2019

Are there any updates on this issue?

@codypiersall
Copy link
Owner

I'm hoping to open an issue in nng, but haven't had time to do it lately. I'm confident we'll come to a solution eventually, but not sure about the timeline.

@dongdyang
Copy link

Soon I'll make a PR in nng with a branch I made that fixes this issue. I think I should be able to get that upstream. If my fix gets merged into nng, I'll close out this issue; otherwise, I'll add a step in the build process of pynng to patch nng with this fix. I think upstream will be amenable to the change in my branch though.

After I use the branch you mentioned to install nng and pynng, I found that pynng package could not pass pytest. I am wondering whether in your local it has successfully passed the test case?

The following is what I got:

pytest
================================================================ test session starts =================================================================
platform darwin -- Python 3.7.3, pytest-5.1.1, py-1.8.0, pluggy-0.12.0
rootdir: /Users/xxx/Workspaces/pynng
collected 50 items

test/test_aio.py sssss                                                                                                                         [ 10%]
test/test_api.py .....ss...                                                                                                                    [ 30%]
test/test_msg.py ..ss..s                                                                                                                       [ 44%]
test/test_options.py .....**F.**                                                                                                                   [ 58%]
test/test_pipe.py ........s.                                                                                                                   [ 78%]
test/test_protocols.py ...........                                                                                                             [100%]

Thanks for answer

@codypiersall
Copy link
Owner

Hi @dongdongzzz,

That failing test I believe is due to the way that nng formats the sockaddr changing. I got the same results.

I'm not convinced the branch I made in nng is actually the best way to fix the issue, however. At this point I'm not sure what the best way is.

@codypiersall
Copy link
Owner

So, I finally made a C reproducer that shows that there is a race condition in nng: nanomsg/nng#961 (comment)

That said, the race, as far as I can tell, can only happen when a socket calls dial with block=False and closes the dialing socket very quickly afterward. To work around this issue, just don't close the socket very quickly. I think in all real applications you probably wouldn't want to close the socket quickly anyway, but this is still a bug.

This needs to be fixed in nng, and I'll keep poking around to see if I can come up with a good solution. The PR I made does fix the issue, but I don't know that it's the best way to do it.

@wtfuzz
Copy link
Contributor

wtfuzz commented Jan 18, 2020

My changes in pull request #55 might address this issue. The reason for the callbacks arriving out of order is most likely thread scheduling, since the add and remove events originate from two different threads internally to nng.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants