Deadlock triggered in test_can_send_from_pipe #62

Closed
codypiersall opened this issue Apr 20, 2020 · 15 comments

Comments

@codypiersall
Owner

There are intermittent deadlocks(?) when the test test_can_send_from_pipe runs. This causes testing to be a pain, and also prevents CI from being useful at all.

I suspect (but am really not sure) that this is related to a race condition between pipe callbacks being called and a socket closing. An unlikely but possible outcome is that libnng has a deadlock. Most likely it's on our end though.
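
For reference, the shape of code that exercises this path looks roughly like the sketch below. This is a hypothetical reconstruction rather than the actual test; add_post_pipe_connect_cb and Pipe.send are the pynng APIs assumed to be involved here.

import time
import pynng

addr = "tcp://127.0.0.1:31416"
pipes = []

with pynng.Pair1() as s0, pynng.Pair1() as s1:
    # remember each pipe as its post-connect callback fires
    s0.add_post_pipe_connect_cb(pipes.append)
    s0.listen(addr)
    s1.dial(addr)
    while not pipes:
        time.sleep(0.001)        # wait for the connect callback to fire
    pipes[0].send(b'hello')      # send from the Pipe rather than the Socket
    assert s1.recv() == b'hello'
# the sockets close here, racing with any callbacks still in flight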

@codypiersall
Owner Author

Adding a sleep(0.05) call at the end of the test prevents the problem from happening. This is not really something I'm happy about.
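
A minimal sketch of what that band-aid looks like (a hypothetical test body, not the real test_can_send_from_pipe):

import time
import pynng

def test_sketch_with_band_aid():
    addr = "tcp://127.0.0.1:31414"
    with pynng.Pair1(listen=addr) as s0, pynng.Pair1(dial=addr) as s1:
        s0.send(b'hello')
        assert s1.recv() == b'hello'
        # band-aid: sleep briefly before the sockets are closed so that any
        # in-flight work in nng can settle first
        time.sleep(0.05)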

@codypiersall
Owner Author

This happens in a couple of tests, and fc703be works around it for those tests. I'm going to leave this issue open, though, because the fix applied was a band-aid.

@wtfuzz
Contributor

wtfuzz commented Apr 23, 2020

I wonder if you're running into something similar to what I tracked down in PR #55.

@codypiersall
Owner Author

Could be. I'll make a simpler reproducer and run it in GDB so I can get stack traces for all the threads. I used py-spy to dump the Python traceback of the main thread, but it didn't show what any of the nng threads were doing.

@wtfuzz
Contributor

wtfuzz commented Apr 23, 2020

This doesn't appear to be related to #55. I was able to reproduce this, and it's deadlocking in nni_sock_shutdown() waiting on a condition variable during close of the socket.

I'm reproducing using:

import pynng
import logging
logging.basicConfig(level=logging.DEBUG)

addr = "tcp://localhost:9898"
count = 0

while True:
  with pynng.Pair1(listen=addr) as s0, pynng.Pair1(dial=addr) as s1:

    print("send")
    s0.send(b'hello')
    print("recv")
    s1.recv()

    count += 1
    print(count)

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff71ab4ce6 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff71b76185 libsystem_pthread.dylib`_pthread_cond_wait + 701
    frame #2: 0x0000000100bc208d _nng.abi3.so`nni_plat_cv_wait + 13
    frame #3: 0x0000000100bb66f8 _nng.abi3.so`nni_sock_shutdown + 936
    frame #4: 0x0000000100bb6772 _nng.abi3.so`nni_sock_close + 18
    frame #5: 0x0000000100b91b80 _nng.abi3.so`nng_close + 32
    frame #6: 0x0000000100b62099 _nng.abi3.so`_cffi_f_nng_close(self=<unavailable>, arg0=<unavailable>) at pynng._nng.c:3114:14 [opt]
    frame #7: 0x000000010041bad3 python`_PyMethodDef_RawFastCallKeywords + 131
    frame #8: 0x0000000100558de2 python`call_function + 306
    frame #9: 0x0000000100555b92 python`_PyEval_EvalFrameDefault + 42226
    frame #10: 0x000000010041b2d5 python`function_code_fastcall + 117
    frame #11: 0x0000000100558d67 python`call_function + 183
    frame #12: 0x0000000100555afc python`_PyEval_EvalFrameDefault + 42076
    frame #13: 0x000000010054a46e python`_PyEval_EvalCodeWithName + 414
    frame #14: 0x000000010041a7c7 python`_PyFunction_FastCallDict + 231
    frame #15: 0x000000010041e702 python`method_call + 130
    frame #16: 0x000000010041a4d5 python`_PyObject_FastCallDict + 245
    frame #17: 0x0000000100556726 python`_PyEval_EvalFrameDefault + 45190
    frame #18: 0x000000010054a46e python`_PyEval_EvalCodeWithName + 414
    frame #19: 0x00000001005adce0 python`PyRun_FileExFlags + 256
    frame #20: 0x00000001005ad157 python`PyRun_SimpleFileExFlags + 391
    frame #21: 0x00000001005dadc3 python`pymain_main + 9635
    frame #22: 0x00000001003edf2d python`main + 125
    frame #23: 0x00007fff719717fd libdyld.dylib`start + 1
    frame #24: 0x00007fff719717fd libdyld.dylib`start + 1

This very well could be an nng internal deadlock, but would need to debug a little deeper to be sure.

@JoelStienlet

Regarding the last bug from @wtfuzz: I think this happens because the previously opened socket is not always closed after each cycle of the while loop, and you can't listen on the same IP port twice, so it blocks on pynng.Pair1(listen=addr).
When I explicitly ask the garbage collector to destroy all objects after each iteration, I no longer see this "deadlock":

#!/usr/bin/env python3

import gc
import pynng
import logging
logging.basicConfig(level=logging.DEBUG)

addr = "tcp://localhost:9898"
count = 0

while True:
  with pynng.Pair1(listen=addr) as s0, pynng.Pair1(dial=addr) as s1:

    print("send")
    s0.send(b'hello')
    print("recv")
    s1.recv()

    count += 1
    print(count)
    gc.collect()

@wtfuzz
Contributor

wtfuzz commented Apr 25, 2020

@JoelStienlet I think adding the explicit gc is just adding some delay and not triggering the race condition that leads to deadlock.

The deadlock is in a close, so it's not actually looping around and trying to listen again.

Attempting to listen on an already bound endpoint would cause the bind to fail, and the error would propagate up as an exception.
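
For what it's worth, that failure mode is easy to see directly. A minimal sketch (catching broadly, since the exact pynng exception class isn't the point here):

import pynng

addr = "tcp://127.0.0.1:31415"
with pynng.Pair1(listen=addr) as s0:
    try:
        # a second listener on an endpoint that is already bound...
        s1 = pynng.Pair1(listen=addr)
        s1.close()
    except Exception as exc:
        # ...fails at bind time and raises, rather than blocking
        print("second listen failed:", exc)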

@wtfuzz
Contributor

wtfuzz commented Apr 25, 2020

Possibly related: nanomsg/nng#1219

@JoelStienlet

@wtfuzz you're right!
I did some more tests on my (Linux) system:

  • eventually it crashes with the gc too; it just requires more iterations
  • when I add time.sleep(0.1) after the print(count) (and without gc), it doesn't crash
  • when I add time.sleep(0.1) before the "with" (still without gc), it crashes quickly

So it looks like what matters is the time delay between the close and the opening (sketch below).
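
To make the two sleep placements concrete (same addr as above; enable one at a time):

import time
import pynng

addr = "tcp://localhost:9898"
count = 0

while True:
    # time.sleep(0.1)  # placement A: before the "with" (still crashes quickly)
    with pynng.Pair1(listen=addr) as s0, pynng.Pair1(dial=addr) as s1:
        s0.send(b'hello')
        s1.recv()
        count += 1
        print(count)
        # time.sleep(0.1)  # placement B: after print(count) (no crash observed)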

@codypiersall
Owner Author

I love having other people comment on these issues! Thanks.

I think that linked issue looks really promising, @wtfuzz. It wouldn't be the first time that the tests in pynng have triggered an edge condition in nng, either.

@JoelStienlet

The order of destruction between s0 and s1 may be important too: I observe a huge difference depending on the position of the forced garbage collection in the following code (just uncomment one of the two gc.collect() calls, then run again with the other one uncommented):

#!/usr/bin/env python3
# for i in {1..10}; do ./test_deadlock.py $i || break; done

import gc
import sys
import pynng

addr = "tcp://localhost:9898"
count = 0

while count < 10000:
  with pynng.Pair1(listen=addr) as s0:
    with pynng.Pair1(dial=addr) as s1:
      s0.send(b'hello')
      s1.recv()
      count += 1
      if count % 10 == 0:
        print(count)
      #gc.collect()  # position 1: runs before the dialler (s1) closes
    #gc.collect()    # position 2: runs after the dialler closes, before the listener (s0) closes
print("OK.", sys.argv[1])

When the dialler is destroyed before the listener I've not seen a crash yet (but I may lack statistics here).
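
A sketch of that ordering experiment with explicit close() calls instead of the with blocks, so the close order is spelled out:

import pynng

addr = "tcp://localhost:9898"

s0 = pynng.Pair1(listen=addr)   # listener
s1 = pynng.Pair1(dial=addr)     # dialler

s0.send(b'hello')
s1.recv()

# close the dialler before the listener: the ordering that has not been
# observed to crash in these experiments so far
s1.close()
s0.close()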

@codypiersall
Owner Author

I was able to reproduce this issue in plain ol' C:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#include <nng/nng.h>
#include <nng/protocol/pair0/pair.h>

#define CHECK(x) do { \
    int ret = (x); \
    if (ret != 0) { \
        printf("error %d: %s\n", ret, nng_strerror(ret));\
        abort(); \
    } \
} while (0)

char addr[] = "tcp://localhost:9899";
int main() {
    char buf[50];
    for (int i = 0; ; i++) {
        size_t got = sizeof buf;
        nng_socket s0, s1;
        CHECK(nng_pair_open(&s0));
        CHECK(nng_listen(s0, addr, NULL, 0));
        CHECK(nng_pair_open(&s1));
        CHECK(nng_dial(s1, addr, NULL, 0));

        CHECK(nng_send(s0, "hello", 5, 0));
        CHECK(nng_recv(s1, buf, &got, 0));
        printf("%d\n", i);
        nng_close(s1);
        nng_close(s0);
    }

    return 0;
}

This is the C equivalent of what the Python version was doing. (Well, the C is a little simpler; in Python we call nng_recv with the flag NNG_FLAG_ALLOC...)

I'll open an issue on upstream nng.

@JoelStienlet

Nice!
(Out of curiosity I inverted the order of the two nng_close() calls; here too it makes a huge difference, but maybe this is also related to timing(?), so I'm not sure how to interpret it.)

@codypiersall
Owner Author

nng fixed this upstream (thanks Garrett!), but the bug will be present in pynng until NNG_REV is updated to point at a fixed version of nng. I'm going to leave this issue open until I update NNG_REV. I may wait until nng puts out a new release, depending on how long that takes.

@codypiersall
Owner Author

nng fixed this upstream and it's in v0.6.0 now. Hurray!
