Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

vtimer_sleep fails after long uptime #2435

Closed
maxencechotard opened this issue Feb 11, 2015 · 22 comments
Closed

vtimer_sleep fails after long uptime #2435

maxencechotard opened this issue Feb 11, 2015 · 22 comments
Assignees
Labels
Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Comments

@maxencechotard
Copy link
Member

I have experienced a problem with nodes on samr21 : after running about two hours, the main thread is automatically killed even if there is while(1) loop inside. I also tried to create the while(1) loop in a thread launched from the main thread and it also stopped after less than two hours. Finally, I tried the rpl_udp example on native and it also stopped after two hours. Is the problem coming from the watchdog ? The branch I used is : https://github.com/maxencechotard/RIOT/tree/UDP_ONLY.

@LudwigKnuepfer
Copy link
Member

sounds like a timer overrun

@maxencechotard
Copy link
Member Author

Is there a timer which stops the main thread ? Because there is no timer timer in the code I have implemented.

@LudwigKnuepfer
Copy link
Member

What exactly do you mean with "the main thread is automatically killed"?

@maxencechotard
Copy link
Member Author

After less than two hours the while(1) I used to send messages to other nodes in the main is stopped and the main thread is not running anymore (I mean it doesn't send anything then and I can't break in the while(1) anymore)

@LudwigKnuepfer
Copy link
Member

Are any other threads still running? what is ps saying?

@LudwigKnuepfer
Copy link
Member

(I assumed the whole node crashed due to some null pointer dereference or the likes of it after the timer overflowed..)

@maxencechotard
Copy link
Member Author

Actually the node is still running in idle thread after leaving main thread

@maxencechotard
Copy link
Member Author

I think that using ps command after leaving main thread will be not possible because not only main thread is stopped but also other which have been created in the main. When the main is stopped I can just break inside the IDLE thread.

@maxencechotard
Copy link
Member Author

Actually, when I execute a ps command after the main thread goes stopped I get :

pid | name               | state Q   | pri | stack ( used) | location
1   | idle               | pending Q | 15  | 256 ( 160)    | 0x200006cc
2   | main               | bl mutex _| 7   | 1536 ( 1296)  | 0x200000cc
3   | uart0              | bl rx _   | 6   | 1024 ( 280)   | 0x2000096c
4   | Transceiver        | bl rx _   | 4   | 1024 ( 508)   | 0x20004488
5   | radio              | bl rx _   | 5   | 1536 ( 300)   | 0x20003500
6   | ip_process         | bl rx _   | 6   | 1536 ( 312)   | 0x2000259c
7   | lowpan_context_rem | bl mutex _| 8   | 1024 ( 304)   | 0x20002b9c
8   | lowpan_transfer    | sleeping _| 6   | 1024 ( 200)   | 0x20002f9c
9   | udp_packet_handler | bl rx _   | 7   | 1536 ( 284)   | 0x20004a88
10  | init_udp_server    | bl rx _   | 7   | 1536 ( 528)   | 0x200013ac
11  | send_acquisition   | running Q | 5   | 1536 ( 556)   | 0x20000da4
    | SUM                |           |     | 13568 ( 4728)

So the main thread is not killed as I said before but it looks like a mutex somewhere stopped it.

@cgundogan
Copy link
Member

when I talked with @haukepetersen @authmillenon and @phiros in the RoMe meeting, I got the impression that they said something about the vtimer having an overflow when running the rpl_udp example for a long time. maybe you guys wanna add something here?

@thomaseichinger
Copy link
Member

@maxencechotard Could you try running ps after 70 and 72 minutes. The underlying hardware timer of vtimer has an overrun after 71 minutes (32bit at 1MHz). And additionally after 46 and 48 minutes. From your experience, does this happen closer to 47 or 70 minutes?

@maxencechotard
Copy link
Member Author

@thomaseichinger I am going try those ps and give you a feedback

@maxencechotard
Copy link
Member Author

I tried a ps at 46,48,70 and 72 minutes and the main thread is still running. I get the same ps print than the one I sent you before. I am trying now to figure out when the main thread is stopped.

@maxencechotard
Copy link
Member Author

I tried this morning and it tooks between 2h to 2h20minutes to stop main thread. I'll try to be more precise in next experiment.

@LudwigKnuepfer LudwigKnuepfer added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) and removed Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) labels Feb 12, 2015
@LudwigKnuepfer LudwigKnuepfer changed the title Problem with rpl_udp example without rpl : main thread is killed after running less than 2 hours vtimer_sleep fails after long uptime Feb 12, 2015
@LudwigKnuepfer
Copy link
Member

Judging from the ps output and your main loop, the vtimer_sleep call is responsible for your problems, and it has nothing to do with rpl/udp/...

@maxencechotard
Copy link
Member Author

@LudwigOrtmann Is there any way I can debug vtimer_sleep function ?

@LudwigKnuepfer
Copy link
Member

If you have a look at the implementation of vtimer_sleep, you will see it is flawless and has no debug output. The problem is probably in vtimer_set. You can enable debug output for vtimer like this:

index 4662ac8..721962e 100644
--- a/sys/vtimer/vtimer.c
+++ b/sys/vtimer/vtimer.c
@@ -32,7 +32,7 @@

 #include "vtimer.h"

-#define ENABLE_DEBUG    (0)
+#define ENABLE_DEBUG    (1)
 #include "debug.h"

Feel free to add some debug output in the same fashion you find all over the place to vtimer_sleep ;)

@maxencechotard
Copy link
Member Author

I can confirm that every time it take between 2h12 and 2h14 to stop the main thread. Sounds really like a timer problem :)

@jfischer-no
Copy link
Contributor

I can confirm that every time it take between 2h12 and 2h14 to stop the main thread. Sounds really like a timer problem :)

I have also observed it. Can not say more currently, but it is around 2hxx

@OlegHahm
Copy link
Member

Could be related to #1753

lightblu pushed a commit to lightblu/RIOT that referenced this issue Mar 2, 2015
vtimer does not handle well the different timers (vtimer <-> hwtimer)
with regard  to their overflows:

* in update_shortterm HWTIMER_TICKS cannot be just applied to next, this will be wrong when next overflows.
* in vtimer_now wrong parentheses mix up vtimer and hwtimer ticks.

Maybe related issues:
* RIOT-OS#2435
* RIOT-OS#1753
daniel-k pushed a commit to daniel-k/RIOT that referenced this issue Jun 1, 2015
vtimer does not handle well the different timers (vtimer <-> hwtimer)
with regard  to their overflows:

* in update_shortterm HWTIMER_TICKS cannot be just applied to next, this will be wrong when next overflows.
* in vtimer_now wrong parentheses mix up vtimer and hwtimer ticks.

Maybe related issues:
* RIOT-OS#2435
* RIOT-OS#1753
@jnohlgard
Copy link
Member

I managed to get a backtrace on a hanged system running the vtimer_msg test. The timer ran fine until hwtimer_now reported 12313 seconds (decimal) and 0x44e9 ticks (hexadecimal, 1/32768 second per tick). This was on a mulle with #3140, #3147 and #3146 applied.

The short term queue contains a loop:

(gdb) print shortterm_priority_queue_root 
$11 = {first = 0x2000112c <main_stack+1380>}
(gdb) print shortterm_priority_queue_root.first 
$12 = (priority_queue_node_t *) 0x2000112c <main_stack+1380>
(gdb) print *shortterm_priority_queue_root.first 
$13 = {next = 0x2000000c <msg_a>, priority = 3912824240, data = 16777216}
(gdb) print *shortterm_priority_queue_root.first->next
$14 = {next = 0x2000000c <msg_a>, priority = 3914335200, data = 0}
(gdb) print *shortterm_priority_queue_root.first->next->next
$15 = {next = 0x2000000c <msg_a>, priority = 3914335200, data = 0}
(gdb) print hwtimer_stats
$16 = {count_set_total = 25376, count_set_absolute = 25376, count_unset_total = 4847, count_isr_total = 20529, 
  count_isr_rollover = 0, count_isr_handler_call = 20528, count_isr_unset = 1}
(gdb) print stimer 
$17 = {counter32b = 403477422, cmr32b = 403494142, diff = 16720}
(gdb) print /x stimer 
$18 = {counter32b = 0x180c93ae, cmr32b = 0x180cd4fe, diff = 0x4150}

I still have the system paused in the debugger so I can investigate more but I don't have any time to start reading through the vtimer code right now.

Edit: Added some more debug information (hwtimer_stats and below)

@jnohlgard
Copy link
Member

vtimer was removed (replaced by xtimer)

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

No branches or pull requests

8 participants