Deadlock triggered in test_can_send_from_pipe #62

Closed
codypiersall opened this issue Apr 20, 2020 · 15 comments

Comments

@codypiersall
Owner

There are intermittent deadlocks(?) when the test test_can_send_from_pipe runs. This causes testing to be a pain, and also prevents CI from being useful at all.

I suspect (but am really not sure) that this is related to a race condition between pipe callbacks being called and a socket closing. An unlikely but possible outcome is that libnng has a deadlock. Most likely it's on our end though.
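
For reference, the shape of code that exercises this path looks roughly like the sketch below. This is a hypothetical reconstruction rather than the actual test; add_post_pipe_connect_cb and Pipe.send are the pynng APIs assumed to be involved here.

import time
import pynng

addr = "tcp://127.0.0.1:31416"
pipes = []

with pynng.Pair1() as s0, pynng.Pair1() as s1:
    # remember each pipe as its post-connect callback fires
    s0.add_post_pipe_connect_cb(pipes.append)
    s0.listen(addr)
    s1.dial(addr)
    while not pipes:
        time.sleep(0.001)        # wait for the connect callback to fire
    pipes[0].send(b'hello')      # send from the Pipe rather than the Socket
    assert s1.recv() == b'hello'
# the sockets close here, racing with any callbacks still in flight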

@codypiersall
Owner Author

Adding a sleep(0.05) call at the end of the test prevents the problem from happening. This is not really something I'm happy about.
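
A minimal sketch of what that band-aid looks like (a hypothetical test body, not the real test_can_send_from_pipe):

import time
import pynng

def test_sketch_with_band_aid():
    addr = "tcp://127.0.0.1:31414"
    with pynng.Pair1(listen=addr) as s0, pynng.Pair1(dial=addr) as s1:
        s0.send(b'hello')
        assert s1.recv() == b'hello'
        # band-aid: sleep briefly before the sockets are closed so that any
        # in-flight work in nng can settle first
        time.sleep(0.05)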

@codypiersall
Owner Author

This happens in a couple of tests, and fc703be works around it for those tests. I'm going to leave this issue open, though, because the fix applied was a band-aid.

@wtfuzz
Contributor

wtfuzz commented Apr 23, 2020

I wonder if you're running into something similar to what I tracked down in PR #55.

@codypiersall
Owner Author

Could be. I'll make a simpler reproducer and run it in GDB so I can get stack traces for all the threads. I used py-spy to dump the Python traceback of the main thread, but it didn't show what any of the nng threads were doing.

@wtfuzz
Contributor

wtfuzz commented Apr 23, 2020

This doesn't appear to be related to #55. I was able to reproduce this, and it's deadlocking in nni_sock_shutdown() waiting on a condition variable during close of the socket.

I'm reproducing using:

import pynng
import logging
logging.basicConfig(level=logging.DEBUG)

addr = "tcp://localhost:9898"
count = 0

while True:
  with pynng.Pair1(listen=addr) as s0, pynng.Pair1(dial=addr) as s1:

    print("send")
    s0.send(b'hello')
    print("recv")
    s1.recv()

    count += 1
    print(count)

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff71ab4ce6 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff71b76185 libsystem_pthread.dylib`_pthread_cond_wait + 701
    frame #2: 0x0000000100bc208d _nng.abi3.so`nni_plat_cv_wait + 13
    frame #3: 0x0000000100bb66f8 _nng.abi3.so`nni_sock_shutdown + 936
    frame #4: 0x0000000100bb6772 _nng.abi3.so`nni_sock_close + 18
    frame #5: 0x0000000100b91b80 _nng.abi3.so`nng_close + 32
    frame #6: 0x0000000100b62099 _nng.abi3.so`_cffi_f_nng_close(self=<unavailable>, arg0=<unavailable>) at pynng._nng.c:3114:14 [opt]
    frame #7: 0x000000010041bad3 python`_PyMethodDef_RawFastCallKeywords + 131
    frame #8: 0x0000000100558de2 python`call_function + 306
    frame #9: 0x0000000100555b92 python`_PyEval_EvalFrameDefault + 42226
    frame #10: 0x000000010041b2d5 python`function_code_fastcall + 117
    frame #11: 0x0000000100558d67 python`call_function + 183
    frame #12: 0x0000000100555afc python`_PyEval_EvalFrameDefault + 42076
    frame #13: 0x000000010054a46e python`_PyEval_EvalCodeWithName + 414
    frame #14: 0x000000010041a7c7 python`_PyFunction_FastCallDict + 231
    frame #15: 0x000000010041e702 python`method_call + 130
    frame #16: 0x000000010041a4d5 python`_PyObject_FastCallDict + 245
    frame #17: 0x0000000100556726 python`_PyEval_EvalFrameDefault + 45190
    frame #18: 0x000000010054a46e python`_PyEval_EvalCodeWithName + 414
    frame #19: 0x00000001005adce0 python`PyRun_FileExFlags + 256
    frame #20: 0x00000001005ad157 python`PyRun_SimpleFileExFlags + 391
    frame #21: 0x00000001005dadc3 python`pymain_main + 9635
    frame #22: 0x00000001003edf2d python`main + 125
    frame #23: 0x00007fff719717fd libdyld.dylib`start + 1
    frame #24: 0x00007fff719717fd libdyld.dylib`start + 1

This very well could be an nng internal deadlock, but would need to debug a little deeper to be sure.

@JoelStienlet

Regarding the last bug from @wtfuzz: I think this happens because the previously opened socket is not always closed after each cycle of the while loop, and you can't listen on the same IP port twice, so it blocks on pynng.Pair1(listen=addr).
When I explicitly ask the garbage collector to destroy all objects after each iteration, I no longer see this "deadlock":

#!/usr/bin/env python3

import gc
import pynng
import logging
logging.basicConfig(level=logging.DEBUG)

addr = "tcp://localhost:9898"
count = 0

while True:
  with pynng.Pair1(listen=addr) as s0, pynng.Pair1(dial=addr) as s1:

    print("send")
    s0.send(b'hello')
    print("recv")
    s1.recv()

    count += 1
    print(count)
    gc.collect()

@wtfuzz
Contributor

wtfuzz commented Apr 25, 2020

@JoelStienlet I think adding the explicit gc is just adding some delay and not triggering the race condition that leads to deadlock.

The deadlock is in a close, so it's not actually looping around and trying to listen again.

Attempting to listen on an already bound endpoint would cause the bind to fail, and the error would propagate up as an exception.
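
For what it's worth, that failure mode is easy to see directly. A minimal sketch (catching broadly, since the exact pynng exception class isn't the point here):

import pynng

addr = "tcp://127.0.0.1:31415"
with pynng.Pair1(listen=addr) as s0:
    try:
        # a second listener on an endpoint that is already bound...
        s1 = pynng.Pair1(listen=addr)
        s1.close()
    except Exception as exc:
        # ...fails at bind time and raises, rather than blocking
        print("second listen failed:", exc)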

@wtfuzz
Contributor

wtfuzz commented Apr 25, 2020

Possibly related: nanomsg/nng#1219

@JoelStienlet

@wtfuzz you're right!
I did some more tests on my (Linux) system:

  • eventually it crashes with the gc too; it just requires more iterations
  • when I add time.sleep(0.1) after the print(count) (and without gc), it doesn't crash
  • when I add time.sleep(0.1) before the "with" (still without gc), it crashes quickly

So it looks like what matters is the time delay between the close and the opening (sketch below).
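
To make the two sleep placements concrete (same addr as above; enable one at a time):

import time
import pynng

addr = "tcp://localhost:9898"
count = 0

while True:
    # time.sleep(0.1)  # placement A: before the "with" (still crashes quickly)
    with pynng.Pair1(listen=addr) as s0, pynng.Pair1(dial=addr) as s1:
        s0.send(b'hello')
        s1.recv()
        count += 1
        print(count)
        # time.sleep(0.1)  # placement B: after print(count) (no crash observed)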

@codypiersall
Owner Author

I love having other people comment on these issues! Thanks.

I think that linked issue looks really promising, @wtfuzz. It wouldn't be the first time that the tests in pynng have triggered an edge condition in nng, either.

@JoelStienlet

The order of destruction between s0 and s1 may be important too: I observe a huge difference depending on the position of the forced garbage collection in the following code (just uncomment one of the two gc.collect() calls, then run again with the other one uncommented):

#!/usr/bin/env python3
# for i in {1..10}; do ./test_deadlock.py $i || break; done

import gc
import sys
import pynng

addr = "tcp://localhost:9898"
count = 0

while count < 10000:
  with pynng.Pair1(listen=addr) as s0:
    with pynng.Pair1(dial=addr) as s1:
      s0.send(b'hello')
      s1.recv()
      count += 1
      if count % 10 == 0:
        print(count)
      #gc.collect()  # position 1: runs before the dialler (s1) closes
    #gc.collect()    # position 2: runs after the dialler closes, before the listener (s0) closes
print("OK.", sys.argv[1])

When the dialler is destroyed before the listener I've not seen a crash yet (but I may lack statistics here).
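
A sketch of that ordering experiment with explicit close() calls instead of the with blocks, so the close order is spelled out:

import pynng

addr = "tcp://localhost:9898"

s0 = pynng.Pair1(listen=addr)   # listener
s1 = pynng.Pair1(dial=addr)     # dialler

s0.send(b'hello')
s1.recv()

# close the dialler before the listener: the ordering that has not been
# observed to crash in these experiments so far
s1.close()
s0.close()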

@codypiersall
Owner Author

I was able to reproduce this issue in plain ol' C:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#include <nng/nng.h>
#include <nng/protocol/pair0/pair.h>

#define CHECK(x) do { \
    int ret = (x); \
    if (ret != 0) { \
        printf("error %d: %s\n", ret, nng_strerror(ret));\
        abort(); \
    } \
} while (0)

char addr[] = "tcp://localhost:9899";
int main() {
    char buf[50];
    for (int i = 0; ; i++) {
        size_t got = sizeof buf;
        nng_socket s0, s1;
        CHECK(nng_pair_open(&s0));
        CHECK(nng_listen(s0, addr, NULL, 0));
        CHECK(nng_pair_open(&s1));
        CHECK(nng_dial(s1, addr, NULL, 0));

        CHECK(nng_send(s0, "hello", 5, 0));
        CHECK(nng_recv(s1, buf, &got, 0));
        printf("%d\n", i);
        nng_close(s1);
        nng_close(s0);
    }

    return 0;
}

This is the C equivalent of what the Python version was doing. (Well, the C is a little simpler; in Python we call nng_recv with the flag NNG_FLAG_ALLOC...)

I'll open an issue on upstream nng.

@JoelStienlet

Nice!
(Out of curiosity I inverted the order of the two nng_close() calls; here too it makes a huge difference, but maybe this is also related to timing(?), so I'm not sure how to interpret it.)

@codypiersall
Owner Author

nng fixed this upstream (thanks Garrett!), but the bug will be present in pynng until NNG_REV is updated to point at a fixed version of nng. I'm going to leave this issue open until I update NNG_REV. I may wait until nng puts out a new release, depending on how long that takes.

@codypiersall
Owner Author

nng fixed this upstream and it's in v0.6.0 now. Hurray!
