Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

mrouted segfault after several hours #56

Closed
Lnx4F opened this issue May 25, 2022 · 12 comments
Closed

mrouted segfault after several hours #56

Lnx4F opened this issue May 25, 2022 · 12 comments
Milestone

Comments

@Lnx4F
Copy link

Lnx4F commented May 25, 2022

Hello! Currently have a setup using mrouted 4.4 on 2 interfaces on a brand new installation of Debian 11 CLI.
Downloaded the 4.4 client from deb.troglobit.com.

No modifications to mrouted.conf, just launching the daemon and letting it do it's thing for the 2 interfaces.

Please spare my lack of knowledge as I am neither a Linux pro, or a developer. :(

My 2 interfaces are:

  1. 10.11.12.0/24
  2. 192.168.1.0/24

I can see that the daemon crashes and pulled the syslog at that moment:

May 24 18:16:44 mrtr01 mrouted[625]: warning - Failed MRT_DEL_MFC(169.254.153.104 239.255.255.250): No such file or directory
May 24 18:16:44 mrtr01 mrouted[625]: warning - age_table_entry() trying to delete no-route (169.254.153.104 239.255.255.250): No such file or directory
May 25 05:03:56 mrtr01 kernel: [39776.168675] mrouted[625]: segfault at 21 ip 000055cd86b1fc61 sp 00007fff879362a0 error 6 in mrouted[55cd86b0e000+17000]
May 25 05:03:56 mrtr01 kernel: [39776.168727] Code: 30 85 ff 7f 54 8b 7b 38 85 ff 7f 3d 8b 73 10 89 ef e8 23 8e ff ff 48 8b 03 48 8b 53 08 48 85 c0 74 4f 48 89 50 08 48 8b 53 08 <48> 89 02 48 89 df e8 d4 e3>
May 25 05:03:56 mrtr01 systemd[1]: mrouted.service: Main process exited, code=killed, status=11/SEGV
May 25 05:03:56 mrtr01 systemd[1]: mrouted.service: Failed with result 'signal'.
May 25 05:03:56 mrtr01 systemd[1]: mrouted.service: Consumed 4.906s CPU time.

Restarting the daemon will allow it to run for another few hours, but will crash again. I previously had mrouted running on an Ubuntu VM, and the same issue happened. I am definitely suspecting something strange on MY network affecting the daemon. Perhaps the strange APIPA route del requests are doing something to the daemon?

My linux knowledge is very minimal, but if I'm given the steps to do something I will try and get it done for you. I am willing to debug or do whatever is needed. My goal is simply to have a stable daemon that I do not need to restart every so often. Appreciate any help that can be given!

@troglobit
Copy link
Owner

troglobit commented May 29, 2022

Hmm, yeah that's not right, it shouldn't segfault obviously. Regardless of your network setup.

The failure to delete the APIPA route could be related, but there are quite a few hours in between that log entry and the segfault.

My recommendation is to rebuild with GDB debug flags, start mrouted manually (or run make install instead of make install-strip) so we retain the debug symbols. When it eventually crashes, you launch coredumpctl to 1) list, and then 2) debug the crash. Like this:

make clean
./configure CFLAGS="-g -Og"  your-other-configure-flags-here-like-prefix-etc
make -j9
sudo make install

Start it ... wait for crash

coredumpctl debug
> bt full

The bt full command (in GDB, which you need to have installed) shows a detailed bactrace that you can attach to this issue. Thanks!

@troglobit
Copy link
Owner

I've set up a long-term test in my home network to see if I can reproduce, but it would be real helpful if you could get the backtrace (bt) from your setup.

@troglobit
Copy link
Owner

Hi again, unfortunately I've not been able to reproduce your crash in my (limited) setup at home. I'll let this issue remain open for other ppl to chime in as well.

@jjr-simiatec
Copy link

jjr-simiatec commented Oct 5, 2024

Hi.

With the setup I have at the moment, this problem occurs a few times an hour. I've looked at multiple core dumps and the crash occurs in the same place, which is good. Please find the stack backtrace below:

Core was generated by `/usr/sbin/mrouted --foreground --syslog --config /tmp/mrouted.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x004f64c2 in delete_group_cb (arg=0x1201318) at vif.c:2307
2307        TAILQ_REMOVE(&uv->uv_groups, g, al_link);
(gdb) bt full
#0  0x004f64c2 in delete_group_cb (arg=0x1201318) at vif.c:2307
        cbk = 0x1201318
        g = 0x1200129
        vifi = 2
        uv = 0x12071a0
#1  0x004f528c in timer_age_queue (elapsed_time=0) at timer.c:100
        ptr = 0x1201818
        tmp = 0x0
        i = 1
#2  0x004e9498 in timeout (n=-1, n@entry=0) at main.c:673
        difftime = {tv_sec = 0, tv_nsec = 943374966}
        curtime = {tv_sec = 10102, tv_nsec = 196178963}
        lasttime = {tv_sec = 10102, tv_nsec = 196178963}
        init = 0
        secs = 0
#3  0x004e9bee in main (argc=<optimized out>, argv=<optimized out>) at main.c:491
        fp = <optimized out>
        foreground = 1
        vers = <optimized out>
        n = 0
        i = <optimized out>
        ch = <optimized out>
        pfd = 0x1207340
        sa = {__sigaction_handler = {sa_handler = 0x4e8ee9 <handle_signals>, sa_sigaction = 0x4e8ee9 <handle_signals>}, sa_mask = {
            __val = {0, 0, 1996389268, 2, 1996257656, 9, 0, 1, 1996387824, 0, 1996387824, 1996389364, 1996264656, 1996386304, 0,
              2124702348, 1996389364, 2124702352, 1038094023, 2124702440, 4294967295, 1996264656, 1996387824, 1, 2124702464, 0, 0,
              0, 1996388904, 0, 1996266112, 1996264656}}, sa_flags = 0, sa_restorer = 0x7ea46298}
        long_options = {{name = 0x4f98cc "debug", has_arg = 2, flag = 0x0, val = 100}, {name = 0x4f98d4 "config", has_arg = 1,
            flag = 0x0, val = 102}, {name = 0x4f98dc "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x4f98e4 "ident",
            has_arg = 1, flag = 0x0, val = 105}, {name = 0x4f98ec "loglevel", has_arg = 1, flag = 0x0, val = 108}, {
            name = 0x4f98f8 "foreground", has_arg = 0, flag = 0x0, val = 110}, {name = 0x4f9904 "pidfile", has_arg = 1,
            flag = 0x0, val = 112}, {name = 0x4f990c "syslog", has_arg = 0, flag = 0x0, val = 115}, {name = 0x4f9914 "table-id",
            has_arg = 1, flag = 0x0, val = 116}, {name = 0x4f9920 "ipc", has_arg = 1, flag = 0x0, val = 117}, {
            name = 0x4f9924 "version", has_arg = 0, flag = 0x0, val = 118}, {name = 0x4f992c "startup-delay", has_arg = 1,
            flag = 0x0, val = 119}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}

Also the configuration file referenced on the command line:

no phyint
phyint br1 enable
phyint br2 enable

mrouted version is 4.5. There are no errors/warnings generated by mrouted in syslog.

@troglobit
Copy link
Owner

Thank you @jjr-simiatec for the backtrace! That code path is entered when a group subscription times out, i.e., when a receiver stops sending IGMP messages to "join" the given group.

I see now that the delete_group_cb() callback can potentially run if the group has already been removed. This can happen if the interface is taken down (ifconfig down), or SIGHUP/restart command is received, before the timer elapses. I tested the hypothesis and managed to trigger your exact segfault. Is this something you think occurred in your setup?

troglobit added a commit that referenced this issue Oct 6, 2024
Ensure all group timers are stopped when stopping interfaces.

Fixes #56

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
@troglobit
Copy link
Owner

@jjr-simiatec I've just pushed a branch misc with fixes for the issues identified so far. If you could give it a go in your setup, I'd be very happy.

@jjr-simiatec
Copy link

Hi @troglobit. That was some fast work!

I've tested the version on the misc branch and unfortunately the fault occurs at the same location within delete_group_cb():

Core was generated by `/usr/sbin/mrouted --foreground --syslog --config /tmp/mrouted.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00470542 in delete_group_cb (arg=0x16828e0) at vif.c:2328
2328        TAILQ_REMOVE(&uv->uv_groups, g, al_link);

I have added the -fsanitize=address compiler flag, rebuilt and repeated testing. This results in the output shown below:

=================================================================
==24375==ERROR: AddressSanitizer: heap-use-after-free on address 0x74301150 at pc 0x00489ba3 bp 0x7e83af78 sp 0x7e83af7c
READ of size 4 at 0x74301150 thread T0
    #0 0x489ba0 in delete_group_cb /usr/src/mrouted-misc/src/vif.c:2310
    #1 0x4856b2 in timer_age_queue /usr/src/mrouted-misc/src/timer.c:100
    #2 0x46045a in timeout /usr/src/mrouted-misc/src/main.c:673
    #3 0x461264 in main /usr/src/mrouted-misc/src/main.c:491
    #4 0x768fe446  (/lib/libc.so.6+0x23446)
    #5 0x768fe4da in __libc_start_main (/lib/libc.so.6+0x234da)

0x74301150 is located 0 bytes inside of 16-byte region [0x74301150,0x74301160)
freed by thread T0 here:
    #0 0x76abb044 in free ../../../../libsanitizer/asan/asan_malloc_linux.cpp:52
    #1 0x489b94 in delete_group_cb /usr/src/mrouted-misc/src/vif.c:2331
    #2 0x4856b2 in timer_age_queue /usr/src/mrouted-misc/src/timer.c:100
    #3 0x46045a in timeout /usr/src/mrouted-misc/src/main.c:673
    #4 0x461264 in main /usr/src/mrouted-misc/src/main.c:491
    #5 0x768fe446  (/lib/libc.so.6+0x23446)
    #6 0x768fe4da in __libc_start_main (/lib/libc.so.6+0x234da)

previously allocated by thread T0 here:
    #0 0x76abb41a in calloc ../../../../libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x48666e in delete_group_timer /usr/src/mrouted-misc/src/vif.c:2341
    #2 0x48b106 in accept_group_report /usr/src/mrouted-misc/src/vif.c:1016
    #3 0x467df8 in accept_igmp /usr/src/mrouted-misc/src/igmp.c:340
    #4 0x4683f0 in igmp_read /usr/src/mrouted-misc/src/igmp.c:257
    #5 0x4612d6 in main /usr/src/mrouted-misc/src/main.c:501
    #6 0x768fe446  (/lib/libc.so.6+0x23446)
    #7 0x768fe4da in __libc_start_main (/lib/libc.so.6+0x234da)

SUMMARY: AddressSanitizer: heap-use-after-free /usr/src/mrouted-misc/src/vif.c:2310 in delete_group_cb
Shadow bytes around the buggy address:
  0x74300e80: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x74300f00: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x74300f80: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x74301000: fa fa fd fd fa fa 00 00 fa fa 00 00 fa fa fd fd
  0x74301080: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
=>0x74301100: fa fa fd fd fa fa fd fd fa fa[fd]fd fa fa fd fd
  0x74301180: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x74301200: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x74301280: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x74301300: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x74301380: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==24375==ABORTING

Again for multiple runs, the stack traces when the failure occurs are the same.

With regard to your query about a possible cause, the test is free-running without user interaction. There's nothing in the logs to indicate link state changes are occurring either.

Hope this provides a little more insight into the problem!

@troglobit
Copy link
Owner

Great stuff, thank you so much! This was really helpful, and also confirms a nagging thought I had: the callback seems to be called twice -- which was also the only logical option left.

While debugging this during the weekend, I also noticed the internal time was off by a factor two, so I started looking at the timer code again and ... well it's a legacy implementation that will have to go. Fortunately all of these multicast routers were built on top of each other around the same time, so I have another implementation that I know works.

It'll take a few evenings to sort it out though, so can't give you a prognosis for when I can have something new for you to test.

@jjr-simiatec
Copy link

No problem, I'll be ready to test when you are.

troglobit added a commit that referenced this issue Oct 27, 2024
The existing timer implementation was too imprecise and jittered several
seconds between query intervals.  This patch fixes that by replacing it
with the Public Domain library Portable Events (pev).

Issue #56

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
@troglobit
Copy link
Owner

@jjr-simiatec took a bit longer than expected, but found a few other nasty things, as well as a potential speedup at startup. Hope the latest commits on the misc branch work better.

@jjr-simiatec
Copy link

@troglobit looks like a lot of great work! I recreated the test set up with the original version of mrouted to verify that I could still trigger the issue and then dropped in your new version. It's rock-solid and has been for hours - can't argue with results!

The only issue I could find is when mrouted is compiled with 64-bit time support on a 32-bit Linux platform by adding the options -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64. This scenario can occur when mrouted is built with Buildroot and the Build Y2038-ready code option has been enabled, for example. I'm not sure this scenario is supported so won't go into specifics right now, but just so you're aware.

If there are any other aspects you would like me to test, please let me know.

@troglobit
Copy link
Owner

troglobit commented Oct 28, 2024

@troglobit looks like a lot of great work! I recreated the test set up with the original version of mrouted to verify that I could still trigger the issue and then dropped in your new version. It's rock-solid and has been for hours - can't argue with results!

Glad to hear it works better now, finally! 🎉 🥳

Then I'll go ahead and clean up the commits, squash, and merge to master. There are some other things I'd like to do before the next release (like look at your issue below), but then I'll get right on the release train.

The only issue I could find is when mrouted is compiled with 64-bit time support on a 32-bit Linux platform by adding the options -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64. This scenario can occur when mrouted is built with Buildroot and the Build Y2038-ready code option has been enabled, for example. I'm not sure this scenario is supported so won't go into specifics right now, but just so you're aware.

This sounds very interesting! Please go ahead and report that as a separate issue and I'll have a look. (Massive Buildroot user and maintainer here 👋 :)

If there are any other aspects you would like me to test, please let me know.

Nothing that comes to mind at the moment. Thank you so much for helping out testing and nailing this one down!

troglobit added a commit that referenced this issue Oct 28, 2024
Ensure all group timers are stopped when stopping interfaces.

Issue #56

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
@troglobit troglobit added this to the 4.6 milestone Oct 28, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants