-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Race condition with nng_pipe_cb and live_sockets #40
Comments
Thanks for opening the issue, @chaoflow. You're 100% right about moving the addition to There's a separate race inherent in my design here too, which is that when a socket is garbage collected on the Python end, it is possible that some of the asynchronous stuff happening in the underlying nng lib will still be running, but the socket won't be in the So there are (at least) two races: the one that you realized could be fixed by moving the line you moved, and one that can happen whenever a socket is gc'd. After you made the fix you identified, do you think you're hitting the gc race, or do you think it may be a separate one? |
Without modifying the source, a workaround for the first race is to not pass the |
@codypiersall Thank you for the very fast response and the workaround, which works. The remaining races occur around context exits. I think it is the gc race. |
To be more precise, we exit the socket as context manager by also ending the async generator, i.e. the socket is out of scope (and garbage collected) right after closing it. |
There was a race where the socket connect callbacks could get called before the socket was added to the global cache. Then, in the callback, the socket would try to be retrieved from the cache, but it wouldn't be there, resulting in an error. This was a pretty bad race, and it affected real code. This was brought up on #40 by GitHub user chaoflow. Many thanks to him for identifying the fix!
The race condition in I've also verified that it is a problem with the sockets getting GC'd (and thus removed from the Something like the following might work:
I think this will work, but I have not tested it yet, and I do not think I'll get to it tonight. By the way, here is a simple reproducing script: import threading
import pynng
addr = 'inproc:///beepbeep'
sock = pynng.Pair0(listen=addr)
while True:
p0 = pynng.Pair0(dial=addr, block_on_dial=False) This also exposes some race between getting pipes from the socket objects as well, but that is a separate issue. |
@codypiersall Are there legitimate cases where def _nng_pipe_cb(lib_pipe, event, arg):
sock_id = int(ffi.cast('size_t', arg))
try:
sock = _live_sockets[sock_id]
except KeyError:
logger.debug('Ignoring callback for garbage-collected socket')
return
with sock._pipe_notify_lock: at https://github.com/codypiersall/pynng/blob/master/pynng/nng.py#L1213. It would be nice though, to be sure that it got garbage collected and is not missing for some other reason. |
I can imagine legitimate use cases; whether they actually exist, I'm not sure. But, for example, someone may grab a resource in |
@chaoflow Which transport were you using when you saw these races? I'm seeing the GC race issue I was talking about only with the |
@codypiersall That was with |
After a little more investigation, it turns out that I've figured out that the root cause seems to be that pipe events can be called in an order I didn't expect. I expected that the events would strictly fire in the following order (I'm using the names from the nng lib):
However, I have observed that the sequence above is not always what happens in practice. The code that I used to determine this I just threw at https://github.com/codypiersall/pynng/tree/pipe_evt_dbg. I think this may be something that needs to be fixed in the nng lib. I'll look into it and see about posting a PR or an issue there. |
@codypiersall Thank you very much for the update and for looking into it! |
In a surprising turn of events, I was able to write a patch for nng and make the callbacks behave where they can never be called in order of |
Soon I'll make a PR in nng with a branch I made that fixes this issue. I think I should be able to get that upstream. If my fix gets merged into nng, I'll close out this issue; otherwise, I'll add a step in the build process of |
Are there any updates on this issue? |
I'm hoping to open an issue in nng, but haven't had time to do it lately. I'm confident we'll come to a solution eventually, but not sure about the timeline. |
After I use the branch you mentioned to install nng and pynng, I found that pynng package could not pass The following is what I got:
Thanks for answer |
Hi @dongdongzzz, That failing test I believe is due to the way that nng formats the sockaddr changing. I got the same results. I'm not convinced the branch I made in nng is actually the best way to fix the issue, however. At this point I'm not sure what the best way is. |
So, I finally made a C reproducer that shows that there is a race condition in nng: nanomsg/nng#961 (comment) That said, the race, as far as I can tell, can only happen when a socket calls This needs to be fixed in nng, and I'll keep poking around to see if I can come up with a good solution. The PR I made does fix the issue, but I don't know that it's the best way to do it. |
My changes in pull request #55 might address this issue. The reason for the callbacks arriving out of order is most likely thread scheduling, since the add and remove events originate from two different threads internally to nng. |
With pynng 0.4.0 and Python 3.7.1 there seems to be a race condition involving
_nng_pipe_cb
.@codypiersall Do you have any suggestions what could lead to a socket being expected in
_live_sockets
but not being there anymore or not yet and how to debug this?So far I'm failing to produce a minimal working example. The code involves async generators with Pair0 sockets used as context managers and the race condition is triggered independent of whether arecv/asend or recv/send are being used.
Moving https://github.com/codypiersall/pynng/blob/master/pynng/nng.py#L308 to https://github.com/codypiersall/pynng/blob/master/pynng/nng.py#L295 seems to reduce the occurrence but does not eliminate the race condition.
The text was updated successfully, but these errors were encountered: