gnrc: crash with (excessive) traffic in native #6123
Comments
(it might be, that …
Okay, does not seem to be related. Here's a more detailed GDB dump:

[GDB dump omitted]

Or maybe it is.. sorry about the noise.
(but #6086 does not solve the problem)
Too bad. Thanks for testing, I didn't get around to it yet. So you can reproduce this?
Cannot reproduce with 2018.01-RC2 on macOS native; flood pinged for >20 min with 8 instances against 1 RIOT instance, and it still works.
Still happens here, current master, on Linux.
8 instances, started manually. After 6-7, the first packets get lost and RIOT prints a couple of "gnrc_netif: possibly lost interrupt." messages, roughly every second, then it crashes:

[crash output omitted]
My test is still running on macOS; RIOT native has not crashed after >1 h with 8 parallel ping6 -f on its link-local address. Though, according to Wireshark, not all ping requests get a reply, but many do.
That might hint that the problem is Linux-specific. What machine did you run the test on? Maybe this only happens when RIOT doesn't get scheduled enough? My ThinkPad is fairly slow by today's standards.
Can you try increasing to 16 or more?
Latest iMac with i7 @ ~4 GHz and 16 GB RAM.
Sure!
That's a 4 GHz quad core? Eight ping6 -f are …
So 16 parallel pings on RIOT native, still working - but lots of `ping6: sendmsg: No buffer space available` from macOS.
And Wireshark has a hard time capturing all those ping requests and replies ...
The other difference between macOS and Linux is that the async_read used by netdev_tap has a different implementation. Maybe that is causing the different observations here.
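To make the suspected difference concrete, here is a small, self-contained sketch contrasting the two classic ways to read a tap-like file descriptor asynchronously. This is not RIOT's actual netdev_tap/async_read code, and which style each platform's build actually uses is an unverified assumption here: signal-driven I/O interrupts the program at nearly arbitrary points, much like a hardware IRQ, while a select()-based loop only notices readiness at well-defined blocking points. That difference in interrupt timing is the kind of thing that could make Linux and macOS behave differently under flood-ping load.

```c
/* Illustrative sketch only, NOT RIOT's netdev_tap code. It contrasts
 * signal-driven reads (SIGIO) with a blocking select() call, using stdin as a
 * stand-in for the tap file descriptor. */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

static volatile sig_atomic_t data_ready;

/* Style 1: signal-driven I/O. The handler may fire between almost any two
 * instructions of the main program, similar to a hardware interrupt. */
static void sigio_handler(int sig)
{
    (void)sig;
    data_ready = 1;   /* only set a flag; the main loop does the read() */
}

static void setup_sigio(int fd)
{
    signal(SIGIO, sigio_handler);
    fcntl(fd, F_SETOWN, getpid());             /* deliver SIGIO to this process */
    fcntl(fd, F_SETFL, O_ASYNC | O_NONBLOCK);  /* enable asynchronous mode */
}

/* Style 2: a blocking select() call (often run in a helper thread). Readiness
 * is only observed at this one well-defined blocking point. */
static void select_once(int fd)
{
    fd_set rfds;
    char buf[1500];

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, NULL) > 0) {
        ssize_t n = read(fd, buf, sizeof(buf));
        printf("read %zd bytes\n", n);
    }
}

int main(void)
{
    setup_sigio(STDIN_FILENO);   /* style 1: wait for a SIGIO-driven wakeup */
    while (!data_ready) {
        pause();
    }
    select_once(STDIN_FILENO);   /* style 2: wait at an explicit blocking call */
    return 0;
}
```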
I tried to reproduce this issue and it really took a long time before crashing, though I didn't exaggerate the number of pings - only 4 instances flooding on Vagrant. Is it expected that RIOT ensures an "eternal" functioning state?
It is expected that a system exposed to a network is able to either handle the packets or drop them. Crashing is unacceptable, as this leaves doors open for DoS attacks. It is unclear, though, whether it is actually GNRC, RIOT, the …
Yes.
I can still reproduce, btw. Maybe in September I'll find some time to sink my teeth into this like I did with the leak in …
In #10875 I said: …
I can confirm this now, at least for one isolated case (note that …).
I'll try to reproduce it a few times more now.
Reproduced it (with different pointers) three more times, and one time even with …
Was able to reproduce with …
So it crashes in …

[output omitted]
I was able to reproduce the crash with the following patch:

```diff
diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..d42696c18 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -180,7 +180,11 @@ static void *_event_loop(void *args)
     /* start event loop */
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
+        printf("%d\n", cib_peek((cib_t *)&sched_active_thread->msg_queue));
         msg_receive(&msg);
+        printf("%d (%p)\n", cib_peek((cib_t *)&sched_active_thread->msg_queue),
+               msg.content.ptr);
+
         switch (msg.type) {
             case GNRC_NETAPI_MSG_TYPE_RCV:
```

Shortly before the crash I get the following output (using …
Notice the last four lines before the …
I think that can be seen as significant, because these are the first …
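For readers unfamiliar with the cib_peek() calls in the patch: they inspect the circular index buffer (cib) behind the thread's message queue. Below is a minimal, self-contained model of that bookkeeping; it is a simplified re-implementation for illustration, not the code from RIOT's cib.h, and it assumes the usual semantics of returning the next readable index or -1 when the queue is empty, which is how the numbers printed before and after msg_receive() should be read.

```c
/* A minimal model of the circular index buffer behind a thread's message
 * queue, just to make the debug output above readable. Simplified
 * re-implementation, not RIOT's cib.h. */
#include <assert.h>
#include <stdio.h>

typedef struct {
    unsigned read_count;   /* total items consumed */
    unsigned write_count;  /* total items produced */
    unsigned mask;         /* queue size - 1 (size is a power of two) */
} mini_cib_t;

/* Index of the next slot that would be read, or -1 if the queue is empty.
 * This is what the cib_peek() calls in the patch print. */
static int mini_cib_peek(const mini_cib_t *cib)
{
    if (cib->read_count == cib->write_count) {
        return -1;                       /* nothing queued */
    }
    return (int)(cib->read_count & cib->mask);
}

static int mini_cib_put(mini_cib_t *cib)
{
    if (cib->write_count - cib->read_count > cib->mask) {
        return -1;                       /* queue full */
    }
    return (int)(cib->write_count++ & cib->mask);
}

static int mini_cib_get(mini_cib_t *cib)
{
    if (cib->read_count == cib->write_count) {
        return -1;
    }
    return (int)(cib->read_count++ & cib->mask);
}

int main(void)
{
    mini_cib_t cib = { 0, 0, 7 };        /* 8-entry queue, size chosen arbitrarily */

    assert(mini_cib_peek(&cib) == -1);   /* "-1" in the log: queue empty, the     */
    mini_cib_put(&cib);                  /* thread will block in msg_receive()    */
    assert(mini_cib_peek(&cib) == 0);    /* ">= 0": a message is queued and will  */
    mini_cib_get(&cib);                  /* be copied out of msg_array            */
    assert(mini_cib_peek(&cib) == -1);
    puts("ok");
    return 0;
}
```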
Also able to reproduce with …
New patch, same result (but an interesting insight):

```diff
diff --git a/core/msg.c b/core/msg.c
index a46875f16..61de2e7a2 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -310,6 +310,9 @@ static int _msg_receive(msg_t *m, int block)
         DEBUG("_msg_receive: %" PRIkernel_pid ": _msg_receive(): We've got a queued message.\n",
               sched_active_thread->pid);
         *m = me->msg_array[queue_index];
+        if (sched_active_pid == 4) {
+            printf("idx: %d (%p)\n", queue_index, m->content.ptr);
+        }
     }
     else {
         me->wait_data = (void *) m;
@@ -330,6 +333,9 @@ static int _msg_receive(msg_t *m, int block)
         thread_yield_higher();
         /* sender copied message */
+        if (sched_active_pid == 4) {
+            printf("blk %d => %p\n", queue_index, m->content.ptr);
+        }
     }
     else {
         irq_restore(state);
@@ -353,6 +359,7 @@ static int _msg_receive(msg_t *m, int block)
         /* copy msg */
         msg_t *sender_msg = (msg_t*) sender->wait_data;
         *m = *sender_msg;
+        printf("sbl %d => %p\n", queue_index, m->content.ptr);
         /* remove sender from queue */
         uint16_t sender_prio = THREAD_PRIORITY_IDLE;
```

First interesting difference: this time … (see line 312 at bdd2d52); wait_data is set here (line 315 at bdd2d52).
I'm not 100% sure, but I think that puts the odds more to the side of the packet being somehow …
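To make the three instrumented code paths easier to follow ("idx" = a message was already queued, "blk" = the receiver blocked and a later sender copies straight into its stack-allocated msg_t via wait_data, "sbl" = a sender was already blocked and its message is copied out directly), here is a stripped-down, single-threaded model of the first two paths. It only shows where the contents of the received msg_t come from in each case; the real logic lives in core/msg.c and involves the scheduler, so treat this purely as an illustration.

```c
/* Single-threaded model of two of the receive paths instrumented above.
 * Not RIOT's msg.c, just an illustration of the data flow. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    unsigned type;
    void *content;
} model_msg_t;

typedef struct {
    model_msg_t queue[8];  /* per-thread message queue (like msg_array)     */
    int queued;            /* number of queued messages                     */
    void *wait_data;       /* where a blocked receiver parks &msg ("blk")   */
} model_thread_t;

/* Path 1 ("idx"): a message is already queued -> copy it out of the queue. */
static int receive_queued(model_thread_t *me, model_msg_t *m)
{
    if (me->queued == 0) {
        return 0;
    }
    *m = me->queue[--me->queued];          /* simplified: no circular buffer */
    printf("idx: %d (%p)\n", me->queued, m->content);
    return 1;
}

/* Path 2 ("blk"): nothing queued, no sender waiting -> the receiver records
 * the address of its local msg_t in wait_data and blocks; a later sender
 * copies its message straight into that location. */
static void receive_blocked(model_thread_t *me, model_msg_t *m)
{
    me->wait_data = m;                     /* the wait_data discussed above */
}

static void sender_delivers(model_thread_t *receiver, const model_msg_t *out)
{
    model_msg_t *target = receiver->wait_data;
    *target = *out;                        /* direct copy into the receiver's stack */
    printf("blk => %p\n", target->content);
}

int main(void)
{
    model_thread_t ipv6 = { .queued = 0, .wait_data = NULL };
    model_msg_t msg;
    model_msg_t pkt_msg = { .type = 1, .content = (void *)(uintptr_t)0xabcd };

    receive_blocked(&ipv6, &msg);          /* receiver blocks in msg_receive()   */
    sender_delivers(&ipv6, &pkt_msg);      /* sender hands over the packet       */

    ipv6.queue[ipv6.queued++] = pkt_msg;   /* next time, the message was queued  */
    receive_queued(&ipv6, &msg);
    return 0;
}
```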
@… good job 😄
(BTW, the IPv6 thread itself is the only one in this scenario sending packets to IPv6.)
Here is where I'm confused... (lines 174 to 183 at bdd2d52)
so how does the message end up in …
Yepp.. definitely no double dispatch:
(this was the patch I used: https://gist.github.com/miri64/a812f7b71fcf6dc71c70d8a9d79b24a9) I'll retry that with the …
Yepp, no double dispatch...

```diff
diff --git a/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c b/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
index a48ac04f3..0b6f9a73c 100644
--- a/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
+++ b/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
@@ -109,6 +109,7 @@ void gnrc_icmpv6_echo_req_handle(gnrc_netif_t *netif, ipv6_hdr_t *ipv6_hdr,
     LL_PREPEND(pkt, hdr);
+    printf("echo:112 %p\n", (void *)pkt);
     if (!gnrc_netapi_dispatch_send(GNRC_NETTYPE_IPV6, GNRC_NETREG_DEMUX_CTX_ALL,
                                    pkt)) {
         DEBUG("icmpv6_echo: no receivers for IPv6 packets\n");
diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..e5dc0f32f 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -181,6 +181,7 @@ static void *_event_loop(void *args)
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
         msg_receive(&msg);
+        printf("ipv6:184 %p\n", msg.content.ptr);
         switch (msg.type) {
             case GNRC_NETAPI_MSG_TYPE_RCV:
```
@kaspar030 it looks more and more like this is some issue in either …
Summary: To summarize my findings from above (so you don't have to read all that): …
That was bullshit... forget what I said. I should stop now. Edit: I deleted that stupid comment...
While the ping is running: …
Why is … Edit: …
I definitely can cause some funky behavior:

```diff
diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..0c3fa97c4 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -180,6 +180,7 @@ static void *_event_loop(void *args)
     /* start event loop */
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
+        memset(&msg, 0, sizeof(msg));
         msg_receive(&msg);
         switch (msg.type) {
@@ -220,6 +221,7 @@ static void *_event_loop(void *args)
                 gnrc_ipv6_nib_handle_timer_event(msg.content.ptr, msg.type);
                 break;
             default:
+                printf("ipv6: unknown message type 0x%04x\n", msg.type);
                 break;
         }
     }
```

yields:

[console output omitted]
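As an aside, the memset trick above can be pushed a little further: instead of zero-filling, poison the receive buffer with a value no legitimate sender uses and check for it after the blocking call. That distinguishes "the receive call returned without writing the message at all" from "somebody really sent an all-zero message". The sketch below is generic C with a stand-in receive function, not RIOT's msg API; the type values and names are made up for illustration.

```c
/* Generic poison-check sketch (stand-in receive function, not RIOT's msg API). */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint16_t type;
    void *content;
} demo_msg_t;

#define POISON_TYPE 0xdeadu   /* a type value no real sender would use */

/* Stand-in for the blocking receive; a buggy path may return without
 * touching *m at all. */
static int demo_receive(demo_msg_t *m, int buggy)
{
    if (!buggy) {
        m->type = 0x0042;              /* arbitrary "real" message type */
        m->content = "packet";
    }
    return 1;
}

int main(void)
{
    for (int buggy = 0; buggy <= 1; buggy++) {
        demo_msg_t msg = { .type = POISON_TYPE, .content = NULL };

        demo_receive(&msg, buggy);
        if (msg.type == POISON_TYPE) {
            /* the poison survived: the "received" message was never filled in */
            printf("receive returned without writing the message!\n");
        }
        else {
            printf("got type 0x%04x\n", msg.type);
        }
    }
    return 0;
}
```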
BTW, with that patch it took way longer (~1.5 h) to crash the node, and it crashed somewhere else (in some …
I've opened an issue, btw (#10881), to discuss the potential …
Intended to check if there is a regression to [1]. [1] RIOT-OS/RIOT#6123
💃
#2071 seems to have fixed native interrupt handling. But I can still crash a RIOT native instance by flood pinging.
Steps to reproduce:
`ping -f <riot-ip>%tap0`
Sometimes it takes 30 s, sometimes 5 min, but RIOT reliably crashes.
The problem seems to be somewhere in gnrc, as my other stack has run fine since #2071 was merged.
Here's a backtrace:

[backtrace omitted]

Edit: summary of the reasons why this might not be a GNRC issue in #6123 (comment).