gnrc: crash with (excessive) traffic in native #6123

Closed

kaspar030 opened this issue Nov 15, 2016 · 44 comments
Labels: Area: network, Platform: native, Type: bug

Comments

@kaspar030 (Contributor) commented Nov 15, 2016

#2071 seems to have fixed native interrupt handling, but I can still crash a RIOT native instance by flood-pinging it.

Steps to reproduce:

  1. run gnrc_networking
  2. start a couple (5-10) of ping -f <riot-ip>%tap0 instances
  3. wait a few minutes

Sometimes it takes 30 seconds, sometimes 5 minutes, but RIOT reliably crashes.

The problem seems to be somewhere in GNRC, as my other stack has been running fine since #2071 was merged.
Here's a backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x0804df49 in gnrc_pktbuf_start_write ()
(gdb) bt
#0  0x0804df49 in gnrc_pktbuf_start_write ()
#1  0x0805c0ae in _send ()
#2  0x0805ba21 in _event_loop ()
#3  0xf7dba58b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()
(gdb)

Edit: summary of reasons why this might not be a GNRC issue in #6123 (comment).

kaspar030 added the Type: bug and GNRC labels on Nov 15, 2016
@miri64 (Member) commented Nov 15, 2016

I have a hunch that this might be related to #5748 (or at least fixable with #6086). Can you try with #6086?

@miri64 (Member) commented Nov 15, 2016

(it might be that gnrc_pktbuf_start_write() needs fixing in #6086, too)

@miri64 (Member) commented Nov 16, 2016

Okay, this does not seem to be related. Here's a more detailed GDB dump:

Program received signal SIGSEGV, Segmentation fault.
0x0804e026 in gnrc_pktbuf_start_write (pkt=0x20) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_static/gnrc_pktbuf_static.c:256
256     if ((pkt == NULL) || (pkt->size == 0)) {
(gdb) where
#0  0x0804e026 in gnrc_pktbuf_start_write (pkt=0x20) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_static/gnrc_pktbuf_static.c:256
#1  0x0805c447 in _send (pkt=0x8083ef4 <_pktbuf+756>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:684
#2  0x0805bd8e in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:279
#3  0xf7e3116b in makecontext () at ../sysdeps/unix/sysv/linux/i386/makecontext.S:91
#4  0x00000000 in ?? ()

@miri64 (Member) commented Nov 16, 2016

Or maybe it is... sorry about the noise. pkt=0x20 comes from the packet in _send, so this still might be a packet buffer problem.

@miri64 (Member) commented Nov 16, 2016

(but #6086 does not solve the problem)

@kaspar030 (Contributor, Author)

> (but #6086 does not solve the problem)

Too bad. Thanks for testing, I didn't get around to it yet. So you can reproduce this?

@miri64 (Member) commented Nov 16, 2016

> So you can reproduce this?

Yepp. But since #6086 doesn't fix #5748 either, the problems still seem to be related. I don't have the time to look into that at the moment, though. :(

@smlng (Member) commented Jan 31, 2018

Cannot reproduce with 2018.01-RC2 on macOS native; flood-pinged for >20 min with 8 instances against 1 RIOT instance and it still works.

@kaspar030 (Contributor, Author)

Still happens here, on current master, on Linux.

@kaspar030 (Contributor, Author)

8 instances, started manually. After 6-7, first packets get lost and RIOT prints a couple of "gnrc_netif: possibly lost interrupt." messages, about one per second, then it crashes:

gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
Segmentation fault (core dumped)

@smlng (Member) commented Jan 31, 2018

My test is still running on macOS; RIOT native has not crashed after >1 h with 8 parallel ping6 -f on its link-local address. Though, according to Wireshark, not all ping requests get a reply, but many do.

@kaspar030 (Contributor, Author)

> cannot reproduce with 2018.01-RC2 on macOS native

That might hint that the problem is Linux-specific. What machine did you run the test on? Maybe this only happens when RIOT doesn't get scheduled enough? My ThinkPad is fairly slow by today's standards.

@kaspar030 (Contributor, Author)

> my test still runs on macOS, RIOT native not crashed after > 1h with 8 parallel ping6 -f on its link-local address.

Can you try increasing to 16 or more?

@smlng (Member) commented Jan 31, 2018

Latest iMac with an i7 @ ~4 GHz and 16 GB RAM.

> Can you try increasing to 16 or more?

Sure!

@kaspar030 (Contributor, Author) commented Jan 31, 2018

> latest iMac with i7 @ ~4Ghz and 16GB RAM

That's a 4 GHz quad core? Eight ping6 -f instances are enough to break native on a dual-core 1.6 GHz machine. Do the math; we need to starve the RIOT native thread. ;)

@smlng (Member) commented Jan 31, 2018

So, 16 parallel pings against RIOT native, still working - but lots of "ping6: sendmsg: No buffer space available" from macOS.

@smlng (Member) commented Jan 31, 2018

And Wireshark has a hard time capturing all those ping requests and replies...

@bergzand (Member)

The other difference between macOS and Linux is that the async_read used by netdev_tap has a different implementation. Maybe that is causing the different observations here.

@kYc0o (Contributor) commented Jul 20, 2018

I tried to reproduce this issue and it took a really long time before crashing, though I didn't exaggerate the number of pings: only 4 instances flooding, on Vagrant.

Is it expected that RIOT ensures an "eternal" functioning state?

@miri64 (Member) commented Jul 20, 2018

It is expected that a system exposed to a network is able to either handle the packets or drop them. Crashing is unacceptable, as this leaves doors open for DoS attacks. It is unclear, though, whether it is actually GNRC, RIOT, the native sublayer, or even Linux' TAP infrastructure that is crashing here. Maybe some simple, complexity-increasing tests (first test a TAP application on Linux, then just netdev_tap, then netdev_tap in a non-main thread, then go to the setup explained above) could pinpoint the cause of the issue.
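As a rough sketch of the first step in that list, a plain Linux TAP reader along these lines could be flood-pinged without any RIOT code involved; the interface name tap0, the buffer size, and the assumption that the persistent TAP device already exists and is not attached to anything else are mine, not from this issue:

```c
/* minimal TAP frame reader (sketch): attach to an existing persistent TAP
 * device and read frames as fast as they arrive */
#include <fcntl.h>
#include <net/if.h>
#include <linux/if_tun.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    struct ifreq ifr;
    unsigned char buf[2048];
    int fd = open("/dev/net/tun", O_RDWR);

    if (fd < 0) {
        perror("open(/dev/net/tun)");
        return 1;
    }
    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;        /* raw Ethernet frames */
    strncpy(ifr.ifr_name, "tap0", IFNAMSIZ - 1);
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        perror("ioctl(TUNSETIFF)");
        return 1;
    }
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n < 0) {
            perror("read");
            return 1;
        }
        /* if this loop keeps up under a flood ping, the host TAP side is
         * probably not the bottleneck */
        printf("frame: %zd bytes\n", n);
    }
}
```

If that survives the flood pings, the next steps would move the same read loop behind netdev_tap and then into a non-main thread, as suggested above.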

@kaspar030 (Contributor, Author)

> Is it expected that RIOT ensures an "eternal" functioning state?

Yes.

@miri64 (Member) commented Jul 21, 2018

I can still reproduce it, btw. Maybe in September I'll find some time to sink my teeth into this like I did with the leak in gnrc_pktbuf_static.

cladmi changed the title from "gnrc: crash with (excessive) traffic" to "gnrc: crash with (excessive) traffic in native" on Aug 11, 2018
miri64 added the Platform: MSP label on Sep 30, 2018
@miri64 (Member) commented Jan 26, 2019

This issue was re-reported in #10875 with a bit more detailed test description. If the description above is not enough for a tester, please refer to #10875.

@miri64 (Member) commented Jan 26, 2019

In #10875 I said:

> […] in [the GDB dump in this issue and in #10875] the packet seems to get corrupted while being in gnrc_ipv6's message queue (possibly due to a too early release)

I can confirm this now, at least for one isolated case (note that pkt in _send is at an unused spot):

Program received signal SIGSEGV, Segmentation fault.
0x5656d0a2 in gnrc_netif_hdr_get_netif (hdr=0x18) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
291         return gnrc_netif_get_by_pid(hdr->if_pid);
(gdb) call gnrc_pktbuf_stats()
packet buffer: first byte: 0x5659cd80, last byte: 0x5659e580 (size: 6144)
  position of last byte used: 6144
=========== chunk   0 (0x5659cd80 size:   96) ===========
00000000  98  CD  59  56  B0  D3  59  56  DC  05  00  00  01  00  00  00
00000010  01  00  00  00  00  00  00  00  00  00  00  00  B0  CD  59  56
00000020  14  00  00  00  01  00  00  00  FF  FF  FF  FF  3A  40  FE  80
00000030  06  06  06  00  00  00  00  00  1E  02  ED  E1  3E  C0  1E  02
00000040  ED  E1  3E  C1  00  00  1C  02  08  CE  59  56  80  DF  59  56
00000050  DC  05  00  00  01  00  00  00  01  00  00  00  00  00  DD  63
~ unused: 0x5659cde0 (next: 0x5659ce38, size:   40) ~
=========== chunk   1 (0x5659ce08 size:   48) ===========
00000000  00  00  00  00  20  CE  59  56  14  00  00  00  01  00  00  00
00000010  FF  FF  FF  FF  46  47  48  49  06  06  06  00  00  00  00  00
[…]
(gdb) where
#0  0x5656d0a2 in gnrc_netif_hdr_get_netif (hdr=0x18) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
#1  0x5656dcde in _send (pkt=0x5659cde0 <_pktbuf+96>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:539
#2  0x5656d4a2 in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:193
#3  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()

I'll try to reproduce it a few times more now.

@miri64 (Member) commented Jan 26, 2019

Reproduced it (with different pointers) three more times, and once even with sudo ping -s0 -f fe80::1c02:edff:fee1:3ec1%tapbr0, so the assessment by @gschorcht in #10875 (comment) that size doesn't matter here is also confirmed. I'll try to reproduce with gnrc_pktbuf_malloc and valgrind now.

@miri64 (Member) commented Jan 26, 2019

Was able to reproduce with valgrind, but I'm confused by the output.

==25483== Invalid read of size 4
==25483==    at 0x120672: _send (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:533)
==25483==    by 0x11FE55: _event_loop (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:193)
==25483==    by 0x498927A: makecontext (in /usr/lib32/libc-2.28.so)
==25483==  Address 0x5fd6c98 is 16 bytes inside a block of size 20 free'd
==25483==    at 0x4836857: free (/build/valgrind/src/valgrind/coregrind/m_replacemalloc/vg_replace_malloc.c:530)
==25483==    by 0x1120E5: _release_error_locked (sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:214)
==25483==    by 0x112165: gnrc_pktbuf_release_error (sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:228)
==25483==    by 0x12B753: gnrc_pktbuf_release (sys/include/net/gnrc/pktbuf.h:171)
==25483==    by 0x12B7F0: gnrc_pktbuf_remove_snip (sys/net/gnrc/pktbuf/gnrc_pktbuf.c:59)
==25483==    by 0x1206DC: _send (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:558)
==25483==    by 0x11FE55: _event_loop (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:193)
==25483==    by 0x498927A: makecontext (in /usr/lib32/libc-2.28.so)
==25483==  Block was alloc'd at
==25483==    at 0x483562B: malloc (/build/valgrind/src/valgrind/coregrind/m_replacemalloc/vg_replace_malloc.c:299)
==25483==    by 0x112301: _create_snip (sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:276)
==25483==    by 0x111BD8: gnrc_pktbuf_add (sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:97)
==25483==    by 0x11158D: gnrc_netif_hdr_build (sys/net/gnrc/netif/hdr/gnrc_netif_hdr.c:20)
==25483==    by 0x11E7E9: gnrc_icmpv6_echo_req_handle (sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c:94)
==25483==    by 0x11E252: gnrc_icmpv6_demux (sys/net/gnrc/network_layer/icmpv6/gnrc_icmpv6.c:91)
==25483==    by 0x11FC3F: _demux (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:112)
==25483==    by 0x120CF5: _receive (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:799)
==25483==    by 0x11FE42: _event_loop (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:188)
==25483==    by 0x498927A: makecontext (in /usr/lib32/libc-2.28.so)

So it crashes in

if (pkt->type == GNRC_NETTYPE_NETIF) {

because the packet was already released in

pkt = gnrc_pktbuf_remove_snip(tmp_pkt, tmp_pkt);

(which comes after it in the code...). Either the packet was put into the queue twice, or it did not properly dequeue for some reason. Could that be possible, @kaspar030?
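To illustrate what valgrind is flagging here (a standalone sketch with made-up names; malloc/free stand in for the packet buffer, this is not GNRC code): if the same packet pointer is delivered to _send() a second time, the type check reads a snip that the first pass already released.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int type;
} snip_t;

/* stand-in for the part of _send() that checks the netif header snip and
 * then removes/releases it */
static void send_once(snip_t *pkt)
{
    if (pkt->type == 1) {     /* "is this a GNRC_NETTYPE_NETIF snip?" */
        printf("netif header present\n");
    }
    free(pkt);                /* "gnrc_pktbuf_remove_snip()" releases it */
}

int main(void)
{
    snip_t *pkt = calloc(1, sizeof(*pkt));

    pkt->type = 1;
    send_once(pkt);   /* first delivery: fine */
    send_once(pkt);   /* duplicate delivery of the same pointer: the type
                         check is now an invalid read into freed memory,
                         which is what the valgrind trace above shows */
    return 0;
}
```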

@miri64 (Member) commented Jan 26, 2019

Since I am not yet able to reproduce this after adding some printfs around the msg_receive in gnrc_ipv6.c:_event_loop, I'm wondering if this might be related to #1811 (edit: which was only closed because vtimer got kicked out, so the issue wasn't reproducible any more as described)... I did hit the race condition now after some time.

@miri64 (Member) commented Jan 26, 2019

I was able to reproduce the crash with the following patch:

diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..d42696c18 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -180,7 +180,11 @@ static void *_event_loop(void *args)
     /* start event loop */
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
+        printf("%d\n", cib_peek((cib_t *)&sched_active_thread->msg_queue));
         msg_receive(&msg);
+        printf("%d (%p)\n", cib_peek((cib_t *)&sched_active_thread->msg_queue),
+               msg.content.ptr);
+
 
         switch (msg.type) {
             case GNRC_NETAPI_MSG_TYPE_RCV:

Shortly before the crash I get the following output (using gnrc_pktbuf_malloc again, but just plain gdb):

[…]
7
0 (0x565ac0a0)
0
1 (0x565ab1d0)
1
2 (0x565ac060)
2
3 (0x565ab1f0)
3
4 (0x565acd60)
4
5 (0x565ac740)
5
6 (0x565ab190)
6
7 (0x565abfc0)
7
0 (0x565ab210)
0
1 (0x565ac0a0)
1
2 (0x565ab1d0)
2
3 (0x565ac060)
3
4 (0x565ab1f0)
4
5 (0x565acd60)
5
6 (0x565ac740)
6
7 (0x565ab190)
7
-1 (0x565ab230)
0
-1 (0x565a9e50)
-1
-1 (0x565a9e50)

Program received signal SIGSEGV, Segmentation fault.
0x5655efa7 in _release_error_locked (pkt=0x20, err=22) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:210
210             tmp = pkt->next;
(gdb) where
#0  0x5655efa7 in _release_error_locked (pkt=0x20, err=22) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:210
#1  0x5655f05e in gnrc_pktbuf_release_error (pkt=0x565a9de8, err=22) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:228
#2  0x5656d6e9 in _send (pkt=0x565a9de8, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:566
#3  0x5656ce44 in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:197
#4  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#5  0x00000000 in ?? ()

Notice the last four lines before the SIGSEGV. Weirdly, msg.content.ptr is also different from the pointer handed to _send as pkt (which should be msg.content.ptr).

@miri64 (Member) commented Jan 26, 2019

7
-1 (0x565ab230)
0
-1 (0x565a9e50)
-1
-1 (0x565a9e50)

I think that can be seen as significant, because these are the first -1 values (i.e. cib_peek() reporting an empty message queue) that come up once the ping starts.

@miri64 (Member) commented Jan 26, 2019

Also able to reproduce with gnrc_pktbuf_static. This time the pointers match.

[…]
5
6 (0x5659cdb0)
6
7 (0x5659ce28)
7
0 (0x5659ce40)
0
1 (0x5659d1d0)
1
2 (0x5659cf88)
2
3 (0x5659d1e8)
3
4 (0x5659cf68)
4
5 (0x5659cff8)
5
6 (0x5659d080)
6
7 (0x5659d098)
7
0 (0x5659d118)
0
1 (0x5659ce50)
1
2 (0x5659cd80)
2
3 (0x5659cff8)
3
-1 (0x5659d118)
-1
-1 (0x5659d118)

Program received signal SIGSEGV, Segmentation fault.
0x5656d13f in gnrc_netif_hdr_get_netif (hdr=0x1468) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
291         return gnrc_netif_get_by_pid(hdr->if_pid);
(gdb) where
#0  0x5656d13f in gnrc_netif_hdr_get_netif (hdr=0x1468) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
#1  0x5656ddd4 in _send (pkt=0x5659d118 <_pktbuf+920>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:543
#2  0x5656d598 in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:197
#3  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()
(gdb) call gnrc_pktbuf_stats()
gnrc_netif: possibly lost interrupt.
packet buffer: first byte: 0x5659cd80, last byte: 0x5659e580 (size: 6144)
  position of last byte used: 2984
=========== chunk   0 (0x5659cd80 size:  224) ===========
00000000  B8  CF  59  56  50  CE  59  56  0E  00  00  00  01  00  00  00
00000010  FF  FF  FF  FF  00  00  00  00  00  00  00  00  B0  CE  59  56
[…]
=========== chunk   6 (0x5659d0f0 size:   40) ===========
00000000  60  00  00  00  00  08  3A  40  FE  80  00  00  00  00  00  00
00000010  1C  02  ED  FF  FE  E1  3E  C1  FE  80  00  00  00  00  00  00
00000020  1C  02  ED  FF  FE  E1  3E  C0
~ unused: 0x5659d118 (next: (nil), size: 5224) ~

@miri64 (Member) commented Jan 26, 2019

New patch, same result (but an interesting insight):

diff --git a/core/msg.c b/core/msg.c
index a46875f16..61de2e7a2 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -310,6 +310,9 @@ static int _msg_receive(msg_t *m, int block)
         DEBUG("_msg_receive: %" PRIkernel_pid ": _msg_receive(): We've got a queued message.\n",
               sched_active_thread->pid);
         *m = me->msg_array[queue_index];
+        if (sched_active_pid == 4) {
+            printf("idx: %d (%p)\n", queue_index, m->content.ptr);
+        }
     }
     else {
         me->wait_data = (void *) m;
@@ -330,6 +333,9 @@ static int _msg_receive(msg_t *m, int block)
             thread_yield_higher();
 
             /* sender copied message */
+            if (sched_active_pid == 4) {
+                printf("blk %d => %p\n", queue_index, m->content.ptr);
+            }
         }
         else {
             irq_restore(state);
@@ -353,6 +359,7 @@ static int _msg_receive(msg_t *m, int block)
         /* copy msg */
         msg_t *sender_msg = (msg_t*) sender->wait_data;
         *m = *sender_msg;
+        printf("sbl %d => %p\n", queue_index, m->content.ptr);
 
         /* remove sender from queue */
         uint16_t sender_prio = THREAD_PRIORITY_IDLE;
idx: 0 (0x5659d210)
idx: 1 (0x5659cd80)
idx: 2 (0x5659cd98)
idx: 3 (0x5659cfe0)
idx: 4 (0x5659d0f0)
idx: 5 (0x5659cf40)
blk -1 => 0x5659cda8
idx: 6 (0x5659cdc0)
idx: 7 (0x5659cd80)
idx: 0 (0x5659cdd8)
idx: 1 (0x5659ce50)
idx: 2 (0x5659cec8)
idx: 3 (0x5659cf80)
idx: 4 (0x5659d178)
idx: 5 (0x5659cfe0)
idx: 6 (0x5659d108)
idx: 7 (0x5659ce38)
idx: 0 (0x5659ceb0)
idx: 1 (0x5659cf68)
idx: 2 (0x5659cff8)
idx: 3 (0x5659cd80)
idx: 4 (0x5659d130)
blk -1 => 0x5659d130

Program received signal SIGSEGV, Segmentation fault.
0x5656d114 in gnrc_netif_hdr_get_netif (hdr=0x1450) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
291         return gnrc_netif_get_by_pid(hdr->if_pid);
(gdb) where
#0  0x5656d114 in gnrc_netif_hdr_get_netif (hdr=0x1450) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
#1  0x5656dd50 in _send (pkt=0x5659d130 <_pktbuf+944>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:543
#2  0x5656d514 in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:197
#3  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()
(gdb) call gnrc_pktbuf_stats()
packet buffer: first byte: 0x5659cd80, last byte: 0x5659e580 (size: 6144)
  position of last byte used: 2944
=========== chunk   0 (0x5659cd80 size:   48) ===========
00000000  08  CE  59  56  D8  CD  59  56  30  00  00  00  01  00  00  00
00000010  01  00  00  00  00  00  00  00  80  CE  59  56  50  CE  59  56
[…]
~ unused: 0x5659d0d8 (next: 0x5659d148, size:   16) ~
=========== chunk   4 (0x5659d0e8 size:   96) ===========
00000000  60  09  3B  94  00  08  3A  40  FE  80  00  00  00  00  00  00
00000010  1C  02  ED  FF  FE  E1  3E  C0  FE  80  00  00  00  00  00  00
00000020  1C  02  ED  FF  FE  E1  3E  C1  80  00  54  C2  46  29  58  86
00000030  00  00  00  00  30  D1  59  56  14  00  00  00  01  00  00  00
00000040  FF  FF  FF  FF  3A  40  FE  80  06  06  06  00  00  00  00  00
00000050  1E  02  ED  E1  3E  C0  1E  02  ED  E1  3E  C1  00  00  1C  02
~ unused: 0x5659d148 (next: 0x5659d1a0, size:   40) ~
=========== chunk   5 (0x5659d170 size:   48) ===========
00000000  60  09  3B  94  00  08  3A  40  FE  80  00  00  00  00  00  00
00000010  1C  02  ED  FF  FE  E1  3E  C0  FE  80  00  00  00  00  00  00
00000020  1C  02  ED  FF  FE  E1  3E  C1  80  00  55  43  46  2B  58  03
~ unused: 0x5659d1a0 (next: (nil), size: 5088) ~

First interesting difference: this time, for the first time, pkt does not refer to an unused spot, but I think this might be irrelevant for now. More interesting is that the first time, 0x5659d130 was taken from the queue in RIOT/core/msg.c, line 312 (at bdd2d52):

*m = me->msg_array[queue_index];

while the second time it was copied directly from the sending thread into wait_data, which is set here (RIOT/core/msg.c, line 315 at bdd2d52):

me->wait_data = (void *) m;

I'm not 100% sure, but I think that puts the odds more on the side of the packet somehow being dispatched for sending twice in a row... I'll investigate further.

@gschorcht (Contributor)

Good job 😄

@miri64 (Member) commented Jan 26, 2019

(btw the IPv6 thread itself is the only one in this scenario sending packets to IPv6)

@miri64 (Member) commented Jan 26, 2019

> (btw the IPv6 thread itself is the only one in this scenario sending packets to IPv6)

Here is where I'm confused... msg_send_to_self() apparently only uses the queue:

RIOT/core/msg.c, lines 174 to 183 (at bdd2d52):

int msg_send_to_self(msg_t *m)
{
    unsigned state = irq_disable();
    m->sender_pid = sched_active_pid;
    int res = queue_msg((thread_t *) sched_active_thread, m);
    irq_restore(state);
    return res;
}

So how does the message end up in wait_data?
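For context, here is a heavily simplified sketch of the two delivery paths being contrasted above; all names are made up for the illustration (this is not core/msg.c): path A copies a queued message out of the thread's message array, path B registers the caller's message as wait_data so a sender can copy into it directly while the receiver blocks. msg_send_to_self() only ever feeds path A, which is why a delivery via wait_data is surprising here.

```c
#include <stdio.h>

typedef struct { void *content; } my_msg_t;

typedef struct {
    my_msg_t msg_array[8];   /* per-thread message queue (order ignored) */
    int queued;              /* number of queued messages */
    my_msg_t *wait_data;     /* sender copies here while receiver blocks */
} my_thread_t;

static void my_msg_receive(my_thread_t *me, my_msg_t *m)
{
    if (me->queued > 0) {
        /* path A: take a previously queued message; the only path that
         * msg_send_to_self() can feed, since it just calls queue_msg() */
        *m = me->msg_array[--me->queued];
    }
    else {
        /* path B: queue empty, so register m as wait_data and block; a
         * later sender copies into *m directly.  In the logs above the
         * duplicate delivery arrives through this path, although nothing
         * in this scenario should be using it. */
        me->wait_data = m;
        /* ... would block here until a sender has filled *m ... */
    }
}

int main(void)
{
    my_thread_t ipv6 = { .queued = 0, .wait_data = NULL };
    my_msg_t m = { .content = NULL };
    int dummy = 0;

    /* what msg_send_to_self() does: queue the message, nothing else */
    ipv6.msg_array[ipv6.queued++] = (my_msg_t){ .content = &dummy };

    my_msg_receive(&ipv6, &m);   /* delivered via path A */
    my_msg_receive(&ipv6, &m);   /* queue empty: path B sets wait_data */
    printf("wait_data set: %s\n", ipv6.wait_data ? "yes" : "no");
    return 0;
}
```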

@miri64 (Member) commented Jan 26, 2019

Yepp.. definitely no double dispatch:

echo:112 0x5659d230
echo:112 0x5659d168
echo:112 0x5659cf28
echo:112 0x5659cf78
echo:112 0x5659d030
echo:112 0x5659d0e0
echo:112 0x5659cef8
echo:112 0x5659d180
echo:112 0x5659d180
echo:112 0x5659d0a0
echo:112 0x5659d198
echo:112 0x5659cf70
echo:112 0x5659d200
echo:112 0x5659cff8
echo:112 0x5659d218
echo:112 0x5659d230
echo:112 0x5659d238
echo:112 0x5659ce88
echo:112 0x5659d078
echo:112 0x5659d0d0
echo:112 0x5659d130
echo:112 0x5659cff0
echo:112 0x5659cfd0
echo:112 0x5659cf80
echo:112 0x5659d1c8
echo:112 0x5659ced0
echo:112 0x5659d1e0
echo:112 0x5659d068
echo:112 0x5659d0c8
echo:112 0x5659d1f8
echo:112 0x5659cf60
echo:112 0x5659d210
echo:112 0x5659d1f8
echo:112 0x5659cfe0
echo:112 0x5659d140
echo:112 0x5659d058
echo:112 0x5659d180
echo:112 0x5659d098
echo:112 0x5659d0b0
echo:112 0x5659cf58
echo:112 0x5659d270
echo:112 0x5659ceb0
echo:112 0x5659d1f0
echo:112 0x5659d0c8
echo:112 0x5659d010
echo:112 0x5659d098
echo:112 0x5659d208
echo:112 0x5659d1b8
echo:112 0x5659d1d0
echo:112 0x5659d1a8
echo:112 0x5659d0d8
echo:112 0x5659ce20
echo:112 0x5659ce98
echo:112 0x5659cf38
echo:112 0x5659cfd0
echo:112 0x5659d048
echo:112 0x5659d0c0
echo:112 0x5659d1d8
echo:112 0x5659d060
echo:112 0x5659d160
echo:112 0x5659ce98
echo:112 0x5659cf38
echo:112 0x5659cf10
echo:112 0x5659d108
echo:112 0x5659d048
echo:112 0x5659d198
echo:112 0x5659d128

Program received signal SIGSEGV, Segmentation fault.
0x5656d0fc in gnrc_netif_hdr_get_netif (hdr=0x18) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
291         return gnrc_netif_get_by_pid(hdr->if_pid);
(gdb) where
#0  0x5656d0fc in gnrc_netif_hdr_get_netif (hdr=0x18) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
#1  0x5656dd38 in _send (pkt=0x5659d128 <_pktbuf+936>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:539
#2  0x5656d4fc in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:193
#3  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()
(gdb)

(this was the patch I used)

https://gist.github.com/miri64/a812f7b71fcf6dc71c70d8a9d79b24a9

I'll retry that with the msg.content.ptr print from above again, to make 100% sure.

@miri64 (Member) commented Jan 26, 2019

Yepp no double dispatch...

diff --git a/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c b/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
index a48ac04f3..0b6f9a73c 100644
--- a/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
+++ b/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
@@ -109,6 +109,7 @@ void gnrc_icmpv6_echo_req_handle(gnrc_netif_t *netif, ipv6_hdr_t *ipv6_hdr,
 
     LL_PREPEND(pkt, hdr);
 
+    printf("echo:112 %p\n", (void *)pkt);
     if (!gnrc_netapi_dispatch_send(GNRC_NETTYPE_IPV6, GNRC_NETREG_DEMUX_CTX_ALL,
                                    pkt)) {
         DEBUG("icmpv6_echo: no receivers for IPv6 packets\n");
diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..e5dc0f32f 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -181,6 +181,7 @@ static void *_event_loop(void *args)
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
         msg_receive(&msg);
+        printf("ipv6:184 %p\n", msg.content.ptr);
 
         switch (msg.type) {
             case GNRC_NETAPI_MSG_TYPE_RCV:
[…]
ipv6:184 0x5659d0f0
ipv6:184 0x5659cd80
echo:112 0x5659d148
ipv6:184 0x5659ce08
echo:112 0x5659cea8
ipv6:184 0x5659d148
ipv6:184 0x5659cea8
ipv6:184 0x5659cea8
<crash>

@miri64 (Member) commented Jan 26, 2019

@kaspar030 it looks more and more like this is some issue in either core or native; since the message somehow ends up in wait_data, I rather suspect the former.

@miri64 (Member) commented Jan 26, 2019

Summary

To summarize my findings from above (so you don't have to read all that):

  1. A GNRC_NETAPI_MSG_TYPE_SND message is received twice in a row by the gnrc_ipv6 thread from itself at some point, causing the thread to access parts of the packet released in
    pkt = gnrc_pktbuf_remove_snip(tmp_pkt, tmp_pkt);
  2. The first time, the message is retrieved from the thread's message queue. At the second receive, the queue is empty and the message comes from thread_t::wait_data.
  3. msg_send_to_self() only queues the packet.
  4. It was confirmed that the packet is not dispatched twice for sending by the IPv6 thread, so at the moment the bug seems to be somewhere in the IPC state machine.
  5. Bonus (not found above): there is no other msg_send() that could cause this (confirmed with git grep msg_send).

@miri64 (Member) commented Jan 26, 2019

That was bullshit... forget what I said. I should stop now.

Edit: I deleted that stupid comment...

@miri64 (Member) commented Jan 26, 2019

While the ping is running:

^C
Program received signal SIGINT, Interrupt.
native_isr_entry (sig=29, info=0x5659a5cc <sigalt_stk+6956>, context=0x5659a64c <sigalt_stk+7084>) at /home/mlenders/Repositories/RIOT-OS/RIOT/cpu/native/irq_cpu.c:287
287     {
(gdb) print sched_threads[4]->wait_data
$1 = (void *) 0x565a7480 <_stack+7648>
(gdb) print cib_avail(&sched_threads[4]->msg_queue)
$2 = 6
(gdb) call ps()
        pid | name                 | state    Q | pri | stack  ( used) | base addr  | current
          - | isr_stack            | -        - |   - |   8192 (   -1) | 0x56596500 | 0x56596500
          1 | idle                 | pending  Q |  15 |   8192 (  436) | 0x56594220 | 0x56596084
          2 | main                 | pending  Q |   7 |  12288 ( 1688) | 0x56591220 | 0x56594084
          3 | pktdump              | bl rx    _ |   6 |  12288 ( 1008) | 0x5659e5a0 | 0x565a1404
          4 | ipv6                 | running  Q |   4 |   8192 ( 1864) | 0x565a56a0 | 0x565a7504
          5 | udp                  | bl rx    _ |   5 |   8192 ( 1008) | 0x565a15c0 | 0x565a3424
          6 | gnrc_netdev_tap      | bl rx    _ |   2 |   8192 ( 2380) | 0x565a3680 | 0x565a54e4
          7 | RPL                  | bl rx    _ |   5 |   8192 (  944) | 0x565a7aa0 | 0x565a9904
            | SUM                  |            |     |  73728 ( 9328)

Why is wait_data set when the thread is running? This has, IMHO, potential for a race condition.

Edit: 0x565a7480 <_stack+7648> is the pointer to the msg variable in gnrc_ipv6.c:_event_loop.

@miri64 (Member) commented Jan 27, 2019

I definitely can cause some funky behavior:

diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..0c3fa97c4 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -180,6 +180,7 @@ static void *_event_loop(void *args)
     /* start event loop */
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
+        memset(&msg, 0, sizeof(msg));
         msg_receive(&msg);
 
         switch (msg.type) {
@@ -220,6 +221,7 @@ static void *_event_loop(void *args)
                 gnrc_ipv6_nib_handle_timer_event(msg.content.ptr, msg.type);
                 break;
             default:
+                printf("ipv6: unknown message type 0x%04x\n", msg.type);
                 break;
         }
     }

yields

[…]
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
ipv6: unknown message type 0x0000
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
[…]

@miri64 (Member) commented Jan 27, 2019

BTW, with that patch it took way longer (~1.5 h) to crash the node, and it crashed somewhere else (in some _receive(), but I don't know which one, since I forgot to compile with debugging symbols; I guess it is not in gnrc_ipv6.c ;-)).

@miri64 (Member) commented Jan 27, 2019

BTW, I've opened an issue (#10881) to have the discussion regarding the potential msg_receive() bug there.

miri64 added a commit to RIOT-OS/Release-Specs that referenced this issue on Jan 29, 2019:

> Intended to check if there is a regression to [1].
>
> [1] RIOT-OS/RIOT#6123
@miri64 (Member) commented Feb 1, 2019

💃
