gnrc: crash with (excessive) traffic in native #6123

Closed

kaspar030 opened this issue Nov 15, 2016 · 44 comments
Labels: Area: network, Platform: native, Type: bug

Comments

@kaspar030 (Contributor) commented Nov 15, 2016

#2071 seems to have fixed native interrupt handling, but I can still crash a RIOT native instance by flood-pinging it.

Steps to reproduce:

  1. run gnrc_networking
  2. start a couple (5-10) of ping -f <riot-ip>%tap0 instances
  3. wait a few minutes

Sometimes it takes 30 seconds, sometimes 5 minutes, but RIOT reliably crashes.

The problem seems to be somewhere in GNRC, as my other stack has been running fine since #2071 was merged.
Here's a backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x0804df49 in gnrc_pktbuf_start_write ()
(gdb) bt
#0  0x0804df49 in gnrc_pktbuf_start_write ()
#1  0x0805c0ae in _send ()
#2  0x0805ba21 in _event_loop ()
#3  0xf7dba58b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()
(gdb)

Edit: summary of reasons why this might not be a GNRC issue in #6123 (comment).

kaspar030 added the Type: bug and GNRC labels on Nov 15, 2016
@miri64 (Member) commented Nov 15, 2016

I have a hunch that this might be related to #5748 (or at least fixable with #6086). Can you try with #6086?

@miri64 (Member) commented Nov 15, 2016

(it might be that gnrc_pktbuf_start_write() needs fixing in #6086, too)

@miri64 (Member) commented Nov 16, 2016

Okay, this does not seem to be related. Here's a more detailed GDB dump:

Program received signal SIGSEGV, Segmentation fault.
0x0804e026 in gnrc_pktbuf_start_write (pkt=0x20) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_static/gnrc_pktbuf_static.c:256
256     if ((pkt == NULL) || (pkt->size == 0)) {
(gdb) where
#0  0x0804e026 in gnrc_pktbuf_start_write (pkt=0x20) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_static/gnrc_pktbuf_static.c:256
#1  0x0805c447 in _send (pkt=0x8083ef4 <_pktbuf+756>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:684
#2  0x0805bd8e in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:279
#3  0xf7e3116b in makecontext () at ../sysdeps/unix/sysv/linux/i386/makecontext.S:91
#4  0x00000000 in ?? ()

@miri64 (Member) commented Nov 16, 2016

Or maybe it is... sorry about the noise. pkt=0x20 comes from the packet in _send, so this still might be a packet buffer problem.

@miri64 (Member) commented Nov 16, 2016

(but #6086 does not solve the problem)

@kaspar030 (Contributor, Author)

> (but #6086 does not solve the problem)

Too bad. Thanks for testing, I didn't get around to it yet. So you can reproduce this?

@miri64 (Member) commented Nov 16, 2016

> So you can reproduce this?

Yepp. But since #6086 doesn't fix #5748 either, the problems still seem to be related. I don't have the time to look into that at the moment, though. :(

@smlng (Member) commented Jan 31, 2018

Cannot reproduce with 2018.01-RC2 on macOS native; flood-pinged for >20 min with 8 instances against 1 RIOT instance and it still works.

@kaspar030 (Contributor, Author)

Still happens here, on current master, on Linux.

@kaspar030 (Contributor, Author)

8 instances, started manually. After 6-7, first packets get lost and RIOT prints a couple of "gnrc_netif: possibly lost interrupt." messages, about one per second, then it crashes:

gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
Segmentation fault (core dumped)

@smlng (Member) commented Jan 31, 2018

My test is still running on macOS; RIOT native has not crashed after >1 h with 8 parallel ping6 -f on its link-local address. Though, according to Wireshark, not all ping requests get a reply, but many do.

@kaspar030 (Contributor, Author)

> cannot reproduce with 2018.01-RC2 on macOS native

That might hint that the problem is Linux-specific. What machine did you run the test on? Maybe this only happens when RIOT doesn't get scheduled enough? My ThinkPad is fairly slow by today's standards.

@kaspar030 (Contributor, Author)

> my test still runs on macOS, RIOT native not crashed after > 1h with 8 parallel ping6 -f on its link-local address.

Can you try increasing to 16 or more?

@smlng (Member) commented Jan 31, 2018

Latest iMac with an i7 @ ~4 GHz and 16 GB RAM.

> Can you try increasing to 16 or more?

Sure!

@kaspar030 (Contributor, Author) commented Jan 31, 2018

> latest iMac with i7 @ ~4Ghz and 16GB RAM

That's a 4 GHz quad core? Eight ping6 -f instances are enough to break native on a dual-core 1.6 GHz machine. Do the math; we need to starve the RIOT native thread. ;)

@smlng (Member) commented Jan 31, 2018

So, 16 parallel pings against RIOT native, still working - but lots of "ping6: sendmsg: No buffer space available" from macOS.

@smlng (Member) commented Jan 31, 2018

And Wireshark has a hard time capturing all those ping requests and replies...

@bergzand (Member)

The other difference between macOS and Linux is that the async_read used by netdev_tap has a different implementation. Maybe that is causing the different observations here.

@kYc0o (Contributor) commented Jul 20, 2018

I tried to reproduce this issue and it took a really long time before crashing, though I didn't exaggerate the number of pings: only 4 instances flooding, on Vagrant.

Is it expected that RIOT ensures an "eternal" functioning state?

@miri64 (Member) commented Jul 20, 2018

It is expected that a system exposed to a network is able to either handle the packets or drop them. Crashing is unacceptable, as this leaves doors open for DoS attacks. It is unclear, though, whether it is actually GNRC, RIOT, the native sublayer, or even Linux' TAP infrastructure that is crashing here. Maybe some simple, complexity-increasing tests (first test a TAP application on Linux, then just netdev_tap, then netdev_tap in a non-main thread, then go to the setup explained above) could pinpoint the cause of the issue.
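As a rough sketch of the first step in that list, a plain Linux TAP reader along these lines could be flood-pinged without any RIOT code involved; the interface name tap0, the buffer size, and the assumption that the persistent TAP device already exists and is not attached to anything else are mine, not from this issue:

```c
/* minimal TAP frame reader (sketch): attach to an existing persistent TAP
 * device and read frames as fast as they arrive */
#include <fcntl.h>
#include <net/if.h>
#include <linux/if_tun.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    struct ifreq ifr;
    unsigned char buf[2048];
    int fd = open("/dev/net/tun", O_RDWR);

    if (fd < 0) {
        perror("open(/dev/net/tun)");
        return 1;
    }
    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;        /* raw Ethernet frames */
    strncpy(ifr.ifr_name, "tap0", IFNAMSIZ - 1);
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        perror("ioctl(TUNSETIFF)");
        return 1;
    }
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n < 0) {
            perror("read");
            return 1;
        }
        /* if this loop keeps up under a flood ping, the host TAP side is
         * probably not the bottleneck */
        printf("frame: %zd bytes\n", n);
    }
}
```

If that survives the flood pings, the next steps would move the same read loop behind netdev_tap and then into a non-main thread, as suggested above.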

@kaspar030 (Contributor, Author)

> Is it expected that RIOT ensures an "eternal" functioning state?

Yes.

@miri64 (Member) commented Jul 21, 2018

I can still reproduce it, btw. Maybe in September I'll find some time to sink my teeth into this like I did with the leak in gnrc_pktbuf_static.

cladmi changed the title from "gnrc: crash with (excessive) traffic" to "gnrc: crash with (excessive) traffic in native" on Aug 11, 2018
miri64 added the Platform: MSP label on Sep 30, 2018
@miri64 (Member) commented Jan 26, 2019

This issue was re-reported in #10875 with a bit more detailed test description. If the description above is not enough for a tester, please refer to #10875.

@miri64 (Member) commented Jan 26, 2019

In #10875 I said:

> […] in [the GDB dump in this issue and in #10875] the packet seems to get corrupted while being in gnrc_ipv6's message queue (possibly due to a too early release)

I can confirm this now, at least for one isolated case (note that pkt in _send is at an unused spot):

Program received signal SIGSEGV, Segmentation fault.
0x5656d0a2 in gnrc_netif_hdr_get_netif (hdr=0x18) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
291         return gnrc_netif_get_by_pid(hdr->if_pid);
(gdb) call gnrc_pktbuf_stats()
packet buffer: first byte: 0x5659cd80, last byte: 0x5659e580 (size: 6144)
  position of last byte used: 6144
=========== chunk   0 (0x5659cd80 size:   96) ===========
00000000  98  CD  59  56  B0  D3  59  56  DC  05  00  00  01  00  00  00
00000010  01  00  00  00  00  00  00  00  00  00  00  00  B0  CD  59  56
00000020  14  00  00  00  01  00  00  00  FF  FF  FF  FF  3A  40  FE  80
00000030  06  06  06  00  00  00  00  00  1E  02  ED  E1  3E  C0  1E  02
00000040  ED  E1  3E  C1  00  00  1C  02  08  CE  59  56  80  DF  59  56
00000050  DC  05  00  00  01  00  00  00  01  00  00  00  00  00  DD  63
~ unused: 0x5659cde0 (next: 0x5659ce38, size:   40) ~
=========== chunk   1 (0x5659ce08 size:   48) ===========
00000000  00  00  00  00  20  CE  59  56  14  00  00  00  01  00  00  00
00000010  FF  FF  FF  FF  46  47  48  49  06  06  06  00  00  00  00  00
[…]
(gdb) where
#0  0x5656d0a2 in gnrc_netif_hdr_get_netif (hdr=0x18) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
#1  0x5656dcde in _send (pkt=0x5659cde0 <_pktbuf+96>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:539
#2  0x5656d4a2 in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:193
#3  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()

I'll try to reproduce it a few times more now.

@miri64 (Member) commented Jan 26, 2019

Reproduced it (with different pointers) three more times, and once even with sudo ping -s0 -f fe80::1c02:edff:fee1:3ec1%tapbr0, so the assessment by @gschorcht in #10875 (comment) that size doesn't matter here is also confirmed. I'll try to reproduce with gnrc_pktbuf_malloc and valgrind now.

@miri64 (Member) commented Jan 26, 2019

Was able to reproduce with valgrind, but I'm confused by the output.

==25483== Invalid read of size 4
==25483==    at 0x120672: _send (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:533)
==25483==    by 0x11FE55: _event_loop (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:193)
==25483==    by 0x498927A: makecontext (in /usr/lib32/libc-2.28.so)
==25483==  Address 0x5fd6c98 is 16 bytes inside a block of size 20 free'd
==25483==    at 0x4836857: free (/build/valgrind/src/valgrind/coregrind/m_replacemalloc/vg_replace_malloc.c:530)
==25483==    by 0x1120E5: _release_error_locked (sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:214)
==25483==    by 0x112165: gnrc_pktbuf_release_error (sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:228)
==25483==    by 0x12B753: gnrc_pktbuf_release (sys/include/net/gnrc/pktbuf.h:171)
==25483==    by 0x12B7F0: gnrc_pktbuf_remove_snip (sys/net/gnrc/pktbuf/gnrc_pktbuf.c:59)
==25483==    by 0x1206DC: _send (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:558)
==25483==    by 0x11FE55: _event_loop (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:193)
==25483==    by 0x498927A: makecontext (in /usr/lib32/libc-2.28.so)
==25483==  Block was alloc'd at
==25483==    at 0x483562B: malloc (/build/valgrind/src/valgrind/coregrind/m_replacemalloc/vg_replace_malloc.c:299)
==25483==    by 0x112301: _create_snip (sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:276)
==25483==    by 0x111BD8: gnrc_pktbuf_add (sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:97)
==25483==    by 0x11158D: gnrc_netif_hdr_build (sys/net/gnrc/netif/hdr/gnrc_netif_hdr.c:20)
==25483==    by 0x11E7E9: gnrc_icmpv6_echo_req_handle (sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c:94)
==25483==    by 0x11E252: gnrc_icmpv6_demux (sys/net/gnrc/network_layer/icmpv6/gnrc_icmpv6.c:91)
==25483==    by 0x11FC3F: _demux (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:112)
==25483==    by 0x120CF5: _receive (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:799)
==25483==    by 0x11FE42: _event_loop (sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:188)
==25483==    by 0x498927A: makecontext (in /usr/lib32/libc-2.28.so)

So it crashes in

if (pkt->type == GNRC_NETTYPE_NETIF) {

because the packet was already released in

pkt = gnrc_pktbuf_remove_snip(tmp_pkt, tmp_pkt);

(which comes after it in the code...). Either the packet was put into the queue twice, or it did not properly dequeue for some reason. Could that be possible, @kaspar030?
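To illustrate what valgrind is flagging here (a standalone sketch with made-up names; malloc/free stand in for the packet buffer, this is not GNRC code): if the same packet pointer is delivered to _send() a second time, the type check reads a snip that the first pass already released.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int type;
} snip_t;

/* stand-in for the part of _send() that checks the netif header snip and
 * then removes/releases it */
static void send_once(snip_t *pkt)
{
    if (pkt->type == 1) {     /* "is this a GNRC_NETTYPE_NETIF snip?" */
        printf("netif header present\n");
    }
    free(pkt);                /* "gnrc_pktbuf_remove_snip()" releases it */
}

int main(void)
{
    snip_t *pkt = calloc(1, sizeof(*pkt));

    pkt->type = 1;
    send_once(pkt);   /* first delivery: fine */
    send_once(pkt);   /* duplicate delivery of the same pointer: the type
                         check is now an invalid read into freed memory,
                         which is what the valgrind trace above shows */
    return 0;
}
```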

@miri64 (Member) commented Jan 26, 2019

Since I am not yet able to reproduce this after adding some printfs around the msg_receive in gnrc_ipv6.c:_event_loop, I'm wondering if this might be related to #1811 (edit: which was only closed because vtimer got kicked out, so the issue wasn't reproducible any more as described)... I did hit the race condition now after some time.

@miri64 (Member) commented Jan 26, 2019

I was able to reproduce the crash with the following patch:

diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..d42696c18 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -180,7 +180,11 @@ static void *_event_loop(void *args)
     /* start event loop */
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
+        printf("%d\n", cib_peek((cib_t *)&sched_active_thread->msg_queue));
         msg_receive(&msg);
+        printf("%d (%p)\n", cib_peek((cib_t *)&sched_active_thread->msg_queue),
+               msg.content.ptr);
+
 
         switch (msg.type) {
             case GNRC_NETAPI_MSG_TYPE_RCV:

Shortly before the crash I get the following output (using gnrc_pktbuf_malloc again, but just plain gdb):

[…]
7
0 (0x565ac0a0)
0
1 (0x565ab1d0)
1
2 (0x565ac060)
2
3 (0x565ab1f0)
3
4 (0x565acd60)
4
5 (0x565ac740)
5
6 (0x565ab190)
6
7 (0x565abfc0)
7
0 (0x565ab210)
0
1 (0x565ac0a0)
1
2 (0x565ab1d0)
2
3 (0x565ac060)
3
4 (0x565ab1f0)
4
5 (0x565acd60)
5
6 (0x565ac740)
6
7 (0x565ab190)
7
-1 (0x565ab230)
0
-1 (0x565a9e50)
-1
-1 (0x565a9e50)

Program received signal SIGSEGV, Segmentation fault.
0x5655efa7 in _release_error_locked (pkt=0x20, err=22) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:210
210             tmp = pkt->next;
(gdb) where
#0  0x5655efa7 in _release_error_locked (pkt=0x20, err=22) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:210
#1  0x5655f05e in gnrc_pktbuf_release_error (pkt=0x565a9de8, err=22) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/pktbuf_malloc/gnrc_pktbuf_malloc.c:228
#2  0x5656d6e9 in _send (pkt=0x565a9de8, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:566
#3  0x5656ce44 in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:197
#4  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#5  0x00000000 in ?? ()

Notice the last four lines before the SIGSEGV. Weirdly, msg.content.ptr is also different from the pointer handed to _send as pkt (which should be msg.content.ptr).

@miri64 (Member) commented Jan 26, 2019

7
-1 (0x565ab230)
0
-1 (0x565a9e50)
-1
-1 (0x565a9e50)

I think that can be seen as significant, because these are the first -1 values (i.e. cib_peek() reporting an empty message queue) that come up once the ping starts.

@miri64 (Member) commented Jan 26, 2019

Also able to reproduce with gnrc_pktbuf_static. This time the pointers match.

[…]
5
6 (0x5659cdb0)
6
7 (0x5659ce28)
7
0 (0x5659ce40)
0
1 (0x5659d1d0)
1
2 (0x5659cf88)
2
3 (0x5659d1e8)
3
4 (0x5659cf68)
4
5 (0x5659cff8)
5
6 (0x5659d080)
6
7 (0x5659d098)
7
0 (0x5659d118)
0
1 (0x5659ce50)
1
2 (0x5659cd80)
2
3 (0x5659cff8)
3
-1 (0x5659d118)
-1
-1 (0x5659d118)

Program received signal SIGSEGV, Segmentation fault.
0x5656d13f in gnrc_netif_hdr_get_netif (hdr=0x1468) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
291         return gnrc_netif_get_by_pid(hdr->if_pid);
(gdb) where
#0  0x5656d13f in gnrc_netif_hdr_get_netif (hdr=0x1468) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
#1  0x5656ddd4 in _send (pkt=0x5659d118 <_pktbuf+920>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:543
#2  0x5656d598 in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:197
#3  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()
(gdb) call gnrc_pktbuf_stats()
gnrc_netif: possibly lost interrupt.
packet buffer: first byte: 0x5659cd80, last byte: 0x5659e580 (size: 6144)
  position of last byte used: 2984
=========== chunk   0 (0x5659cd80 size:  224) ===========
00000000  B8  CF  59  56  50  CE  59  56  0E  00  00  00  01  00  00  00
00000010  FF  FF  FF  FF  00  00  00  00  00  00  00  00  B0  CE  59  56
[…]
=========== chunk   6 (0x5659d0f0 size:   40) ===========
00000000  60  00  00  00  00  08  3A  40  FE  80  00  00  00  00  00  00
00000010  1C  02  ED  FF  FE  E1  3E  C1  FE  80  00  00  00  00  00  00
00000020  1C  02  ED  FF  FE  E1  3E  C0
~ unused: 0x5659d118 (next: (nil), size: 5224) ~

@miri64 (Member) commented Jan 26, 2019

New patch, same result (but an interesting insight):

diff --git a/core/msg.c b/core/msg.c
index a46875f16..61de2e7a2 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -310,6 +310,9 @@ static int _msg_receive(msg_t *m, int block)
         DEBUG("_msg_receive: %" PRIkernel_pid ": _msg_receive(): We've got a queued message.\n",
               sched_active_thread->pid);
         *m = me->msg_array[queue_index];
+        if (sched_active_pid == 4) {
+            printf("idx: %d (%p)\n", queue_index, m->content.ptr);
+        }
     }
     else {
         me->wait_data = (void *) m;
@@ -330,6 +333,9 @@ static int _msg_receive(msg_t *m, int block)
             thread_yield_higher();
 
             /* sender copied message */
+            if (sched_active_pid == 4) {
+                printf("blk %d => %p\n", queue_index, m->content.ptr);
+            }
         }
         else {
             irq_restore(state);
@@ -353,6 +359,7 @@ static int _msg_receive(msg_t *m, int block)
         /* copy msg */
         msg_t *sender_msg = (msg_t*) sender->wait_data;
         *m = *sender_msg;
+        printf("sbl %d => %p\n", queue_index, m->content.ptr);
 
         /* remove sender from queue */
         uint16_t sender_prio = THREAD_PRIORITY_IDLE;
idx: 0 (0x5659d210)
idx: 1 (0x5659cd80)
idx: 2 (0x5659cd98)
idx: 3 (0x5659cfe0)
idx: 4 (0x5659d0f0)
idx: 5 (0x5659cf40)
blk -1 => 0x5659cda8
idx: 6 (0x5659cdc0)
idx: 7 (0x5659cd80)
idx: 0 (0x5659cdd8)
idx: 1 (0x5659ce50)
idx: 2 (0x5659cec8)
idx: 3 (0x5659cf80)
idx: 4 (0x5659d178)
idx: 5 (0x5659cfe0)
idx: 6 (0x5659d108)
idx: 7 (0x5659ce38)
idx: 0 (0x5659ceb0)
idx: 1 (0x5659cf68)
idx: 2 (0x5659cff8)
idx: 3 (0x5659cd80)
idx: 4 (0x5659d130)
blk -1 => 0x5659d130

Program received signal SIGSEGV, Segmentation fault.
0x5656d114 in gnrc_netif_hdr_get_netif (hdr=0x1450) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
291         return gnrc_netif_get_by_pid(hdr->if_pid);
(gdb) where
#0  0x5656d114 in gnrc_netif_hdr_get_netif (hdr=0x1450) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
#1  0x5656dd50 in _send (pkt=0x5659d130 <_pktbuf+944>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:543
#2  0x5656d514 in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:197
#3  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()
(gdb) call gnrc_pktbuf_stats()
packet buffer: first byte: 0x5659cd80, last byte: 0x5659e580 (size: 6144)
  position of last byte used: 2944
=========== chunk   0 (0x5659cd80 size:   48) ===========
00000000  08  CE  59  56  D8  CD  59  56  30  00  00  00  01  00  00  00
00000010  01  00  00  00  00  00  00  00  80  CE  59  56  50  CE  59  56
[…]
~ unused: 0x5659d0d8 (next: 0x5659d148, size:   16) ~
=========== chunk   4 (0x5659d0e8 size:   96) ===========
00000000  60  09  3B  94  00  08  3A  40  FE  80  00  00  00  00  00  00
00000010  1C  02  ED  FF  FE  E1  3E  C0  FE  80  00  00  00  00  00  00
00000020  1C  02  ED  FF  FE  E1  3E  C1  80  00  54  C2  46  29  58  86
00000030  00  00  00  00  30  D1  59  56  14  00  00  00  01  00  00  00
00000040  FF  FF  FF  FF  3A  40  FE  80  06  06  06  00  00  00  00  00
00000050  1E  02  ED  E1  3E  C0  1E  02  ED  E1  3E  C1  00  00  1C  02
~ unused: 0x5659d148 (next: 0x5659d1a0, size:   40) ~
=========== chunk   5 (0x5659d170 size:   48) ===========
00000000  60  09  3B  94  00  08  3A  40  FE  80  00  00  00  00  00  00
00000010  1C  02  ED  FF  FE  E1  3E  C0  FE  80  00  00  00  00  00  00
00000020  1C  02  ED  FF  FE  E1  3E  C1  80  00  55  43  46  2B  58  03
~ unused: 0x5659d1a0 (next: (nil), size: 5088) ~

First interesting difference: this time, for the first time, pkt does not refer to an unused spot, but I think this might be irrelevant for now. More interesting is that the first time, 0x5659d130 was taken from the queue in RIOT/core/msg.c, line 312 (at bdd2d52):

*m = me->msg_array[queue_index];

while the second time it was copied directly from the sending thread into wait_data, which is set here (RIOT/core/msg.c, line 315 at bdd2d52):

me->wait_data = (void *) m;

I'm not 100% sure, but I think that puts the odds more on the side of the packet somehow being dispatched for sending twice in a row... I'll investigate further.

@gschorcht (Contributor)

Good job 😄

@miri64 (Member) commented Jan 26, 2019

(btw the IPv6 thread itself is the only one in this scenario sending packets to IPv6)

@miri64 (Member) commented Jan 26, 2019

> (btw the IPv6 thread itself is the only one in this scenario sending packets to IPv6)

Here is where I'm confused... msg_send_to_self() apparently only uses the queue:

RIOT/core/msg.c, lines 174 to 183 (at bdd2d52):

int msg_send_to_self(msg_t *m)
{
    unsigned state = irq_disable();
    m->sender_pid = sched_active_pid;
    int res = queue_msg((thread_t *) sched_active_thread, m);
    irq_restore(state);
    return res;
}

So how does the message end up in wait_data?
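For context, here is a heavily simplified sketch of the two delivery paths being contrasted above; all names are made up for the illustration (this is not core/msg.c): path A copies a queued message out of the thread's message array, path B registers the caller's message as wait_data so a sender can copy into it directly while the receiver blocks. msg_send_to_self() only ever feeds path A, which is why a delivery via wait_data is surprising here.

```c
#include <stdio.h>

typedef struct { void *content; } my_msg_t;

typedef struct {
    my_msg_t msg_array[8];   /* per-thread message queue (order ignored) */
    int queued;              /* number of queued messages */
    my_msg_t *wait_data;     /* sender copies here while receiver blocks */
} my_thread_t;

static void my_msg_receive(my_thread_t *me, my_msg_t *m)
{
    if (me->queued > 0) {
        /* path A: take a previously queued message; the only path that
         * msg_send_to_self() can feed, since it just calls queue_msg() */
        *m = me->msg_array[--me->queued];
    }
    else {
        /* path B: queue empty, so register m as wait_data and block; a
         * later sender copies into *m directly.  In the logs above the
         * duplicate delivery arrives through this path, although nothing
         * in this scenario should be using it. */
        me->wait_data = m;
        /* ... would block here until a sender has filled *m ... */
    }
}

int main(void)
{
    my_thread_t ipv6 = { .queued = 0, .wait_data = NULL };
    my_msg_t m = { .content = NULL };
    int dummy = 0;

    /* what msg_send_to_self() does: queue the message, nothing else */
    ipv6.msg_array[ipv6.queued++] = (my_msg_t){ .content = &dummy };

    my_msg_receive(&ipv6, &m);   /* delivered via path A */
    my_msg_receive(&ipv6, &m);   /* queue empty: path B sets wait_data */
    printf("wait_data set: %s\n", ipv6.wait_data ? "yes" : "no");
    return 0;
}
```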

@miri64 (Member) commented Jan 26, 2019

Yepp.. definitely no double dispatch:

echo:112 0x5659d230
echo:112 0x5659d168
echo:112 0x5659cf28
echo:112 0x5659cf78
echo:112 0x5659d030
echo:112 0x5659d0e0
echo:112 0x5659cef8
echo:112 0x5659d180
echo:112 0x5659d180
echo:112 0x5659d0a0
echo:112 0x5659d198
echo:112 0x5659cf70
echo:112 0x5659d200
echo:112 0x5659cff8
echo:112 0x5659d218
echo:112 0x5659d230
echo:112 0x5659d238
echo:112 0x5659ce88
echo:112 0x5659d078
echo:112 0x5659d0d0
echo:112 0x5659d130
echo:112 0x5659cff0
echo:112 0x5659cfd0
echo:112 0x5659cf80
echo:112 0x5659d1c8
echo:112 0x5659ced0
echo:112 0x5659d1e0
echo:112 0x5659d068
echo:112 0x5659d0c8
echo:112 0x5659d1f8
echo:112 0x5659cf60
echo:112 0x5659d210
echo:112 0x5659d1f8
echo:112 0x5659cfe0
echo:112 0x5659d140
echo:112 0x5659d058
echo:112 0x5659d180
echo:112 0x5659d098
echo:112 0x5659d0b0
echo:112 0x5659cf58
echo:112 0x5659d270
echo:112 0x5659ceb0
echo:112 0x5659d1f0
echo:112 0x5659d0c8
echo:112 0x5659d010
echo:112 0x5659d098
echo:112 0x5659d208
echo:112 0x5659d1b8
echo:112 0x5659d1d0
echo:112 0x5659d1a8
echo:112 0x5659d0d8
echo:112 0x5659ce20
echo:112 0x5659ce98
echo:112 0x5659cf38
echo:112 0x5659cfd0
echo:112 0x5659d048
echo:112 0x5659d0c0
echo:112 0x5659d1d8
echo:112 0x5659d060
echo:112 0x5659d160
echo:112 0x5659ce98
echo:112 0x5659cf38
echo:112 0x5659cf10
echo:112 0x5659d108
echo:112 0x5659d048
echo:112 0x5659d198
echo:112 0x5659d128

Program received signal SIGSEGV, Segmentation fault.
0x5656d0fc in gnrc_netif_hdr_get_netif (hdr=0x18) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
291         return gnrc_netif_get_by_pid(hdr->if_pid);
(gdb) where
#0  0x5656d0fc in gnrc_netif_hdr_get_netif (hdr=0x18) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/include/net/gnrc/netif/hdr.h:291
#1  0x5656dd38 in _send (pkt=0x5659d128 <_pktbuf+936>, prep_hdr=true) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:539
#2  0x5656d4fc in _event_loop (args=0x0) at /home/mlenders/Repositories/RIOT-OS/RIOT/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c:193
#3  0xf7d2e27b in makecontext () from /usr/lib32/libc.so.6
#4  0x00000000 in ?? ()
(gdb)

(this was the patch I used)

https://gist.github.com/miri64/a812f7b71fcf6dc71c70d8a9d79b24a9

I'll retry that with the msg.content.ptr print from above again, to make 100% sure.

@miri64 (Member) commented Jan 26, 2019

Yepp no double dispatch...

diff --git a/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c b/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
index a48ac04f3..0b6f9a73c 100644
--- a/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
+++ b/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
@@ -109,6 +109,7 @@ void gnrc_icmpv6_echo_req_handle(gnrc_netif_t *netif, ipv6_hdr_t *ipv6_hdr,
 
     LL_PREPEND(pkt, hdr);
 
+    printf("echo:112 %p\n", (void *)pkt);
     if (!gnrc_netapi_dispatch_send(GNRC_NETTYPE_IPV6, GNRC_NETREG_DEMUX_CTX_ALL,
                                    pkt)) {
         DEBUG("icmpv6_echo: no receivers for IPv6 packets\n");
diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..e5dc0f32f 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -181,6 +181,7 @@ static void *_event_loop(void *args)
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
         msg_receive(&msg);
+        printf("ipv6:184 %p\n", msg.content.ptr);
 
         switch (msg.type) {
             case GNRC_NETAPI_MSG_TYPE_RCV:
[…]
ipv6:184 0x5659d0f0
ipv6:184 0x5659cd80
echo:112 0x5659d148
ipv6:184 0x5659ce08
echo:112 0x5659cea8
ipv6:184 0x5659d148
ipv6:184 0x5659cea8
ipv6:184 0x5659cea8
<crash>

@miri64 (Member) commented Jan 26, 2019

@kaspar030 it looks more and more like this is some issue in either core or native; since the message somehow ends up in wait_data, I rather suspect the former.

@miri64 (Member) commented Jan 26, 2019

Summary

To summarize my findings from above (so you don't have to read all that):

  1. A GNRC_NETAPI_MSG_TYPE_SND message is received twice in a row by the gnrc_ipv6 thread from itself at some point, causing the thread to access parts of the packet released in
    pkt = gnrc_pktbuf_remove_snip(tmp_pkt, tmp_pkt);
  2. The first time, the message is retrieved from the thread's message queue. At the second receive, the queue is empty and the message comes from thread_t::wait_data.
  3. msg_send_to_self() only queues the packet.
  4. It was confirmed that the packet is not dispatched twice for sending by the IPv6 thread, so at the moment the bug seems to be somewhere in the IPC state machine.
  5. Bonus (not found above): there is no other msg_send() that could cause this (confirmed with git grep msg_send).

@miri64 (Member) commented Jan 26, 2019

That was bullshit... forget what I said. I should stop now.

Edit: I deleted that stupid comment...

@miri64 (Member) commented Jan 26, 2019

While the ping is running:

^C
Program received signal SIGINT, Interrupt.
native_isr_entry (sig=29, info=0x5659a5cc <sigalt_stk+6956>, context=0x5659a64c <sigalt_stk+7084>) at /home/mlenders/Repositories/RIOT-OS/RIOT/cpu/native/irq_cpu.c:287
287     {
(gdb) print sched_threads[4]->wait_data
$1 = (void *) 0x565a7480 <_stack+7648>
(gdb) print cib_avail(&sched_threads[4]->msg_queue)
$2 = 6
(gdb) call ps()
        pid | name                 | state    Q | pri | stack  ( used) | base addr  | current
          - | isr_stack            | -        - |   - |   8192 (   -1) | 0x56596500 | 0x56596500
          1 | idle                 | pending  Q |  15 |   8192 (  436) | 0x56594220 | 0x56596084
          2 | main                 | pending  Q |   7 |  12288 ( 1688) | 0x56591220 | 0x56594084
          3 | pktdump              | bl rx    _ |   6 |  12288 ( 1008) | 0x5659e5a0 | 0x565a1404
          4 | ipv6                 | running  Q |   4 |   8192 ( 1864) | 0x565a56a0 | 0x565a7504
          5 | udp                  | bl rx    _ |   5 |   8192 ( 1008) | 0x565a15c0 | 0x565a3424
          6 | gnrc_netdev_tap      | bl rx    _ |   2 |   8192 ( 2380) | 0x565a3680 | 0x565a54e4
          7 | RPL                  | bl rx    _ |   5 |   8192 (  944) | 0x565a7aa0 | 0x565a9904
            | SUM                  |            |     |  73728 ( 9328)

Why is wait_data set when the thread is running? This has, IMHO, potential for a race condition.

Edit: 0x565a7480 <_stack+7648> is the pointer to the msg variable in gnrc_ipv6.c:_event_loop.

@miri64 (Member) commented Jan 27, 2019

I definitely can cause some funky behavior:

diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..0c3fa97c4 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -180,6 +180,7 @@ static void *_event_loop(void *args)
     /* start event loop */
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
+        memset(&msg, 0, sizeof(msg));
         msg_receive(&msg);
 
         switch (msg.type) {
@@ -220,6 +221,7 @@ static void *_event_loop(void *args)
                 gnrc_ipv6_nib_handle_timer_event(msg.content.ptr, msg.type);
                 break;
             default:
+                printf("ipv6: unknown message type 0x%04x\n", msg.type);
                 break;
         }
     }

yields

[…]
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
ipv6: unknown message type 0x0000
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
gnrc_netif: possibly lost interrupt.
[…]

@miri64 (Member) commented Jan 27, 2019

BTW, with that patch it took way longer (~1.5 h) to crash the node, and it crashed somewhere else (in some _receive(), but I don't know which one, since I forgot to compile with debugging symbols; I guess it is not in gnrc_ipv6.c ;-)).

@miri64 (Member) commented Jan 27, 2019

BTW, I've opened an issue (#10881) to have the discussion regarding the potential msg_receive() bug there.

miri64 added a commit to RIOT-OS/Release-Specs that referenced this issue on Jan 29, 2019:

> Intended to check if there is a regression to [1].
>
> [1] RIOT-OS/RIOT#6123
@miri64 (Member) commented Feb 1, 2019

💃
