
[T2][202405] Zebra process consuming a large amount of memory resulting in OOM kernel panics #20337

Closed · arista-nwolfe opened this issue Sep 23, 2024 · 11 comments
Labels: Chassis 🤖 Modular chassis support · P0 Priority of the issue · Triaged (this issue has been triaged)

arista-nwolfe (Contributor) commented Sep 23, 2024

On full T2 devices running 202405, Arista is seeing the zebra process in FRR consume a large amount of memory (roughly 10x what it consumes on 202205).

202405:

root@cmp206-4:~# docker exec -it bgp0 bash
root@cmp206-4:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.2  38116 32024 pts/0    Ss+  21:23   0:01 /usr/bin/python3 /usr/local/bin/supervisord
root          44  0.1  0.2 131684 31888 pts/0    Sl   21:23   0:06 python3 /usr/bin/supervisor-proc-exit-listener --container-name bgp
root          47  0.0  0.0 230080  4164 pts/0    Sl   21:23   0:00 /usr/sbin/rsyslogd -n -iNONE
frr           51 27.5  8.1 2018736 1283692 pts/0 Sl   21:23  16:57 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp

202205:

root@cmp210-3:~# docker exec -it bgp bash
root@cmp210-3:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.1  30524 26232 pts/0    Ss+  21:59   0:00 /usr/bin/python3 /usr/local/bin/supervisord
root          26  0.0  0.1  30808 25712 pts/0    S    21:59   0:00 python3 /usr/bin/supervisor-proc-exit-listener --container-name bgp
root          27  0.0  0.0 220836  3764 pts/0    Sl   21:59   0:00 /usr/sbin/rsyslogd -n -iNONE
frr           31  9.7  0.7 730360 128852 pts/0   Sl   21:59   2:32 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M fpm -M snmp

This leaves the system with very little free memory:

> free -m
               total        used        free      shared  buff/cache   available
Mem:           15388       15304         158         284         481          83

If we run a command that causes zebra to consume even more memory, such as show ip route, it can trigger a kernel panic due to OOM:

[74531.234009] Kernel panic - not syncing: Out of memory: compulsory panic_on_oom is enabled
[74531.260707] CPU: 1 PID: 735 Comm: auditd Kdump: loaded Tainted: G           OE      6.1.0-11-2-amd64 #1  Debian 6.1.38-4
[74531.313431] Call Trace:
[74531.365891]  <TASK>
[74531.418342]  dump_stack_lvl+0x44/0x5c
[74531.470844]  panic+0x118/0x2ed
[74531.523334]  out_of_memory.cold+0x67/0x7e
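
The "compulsory panic_on_oom" wording in the trace typically corresponds to the kernel's vm.panic_on_oom sysctl being set to 2 (panic unconditionally on OOM). As a quick sanity check on an affected device, the setting can be read directly (illustrative commands, not taken from the original logs):

# check whether the kernel is configured to panic on OOM
# (2 = always panic, which matches the "compulsory" wording in the trace above)
sysctl vm.panic_on_oom
cat /proc/sys/vm/panic_on_oom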

When we look at show memory in FRR, we see that the maximum number of Nexthop allocations (Max#) is significantly higher on 202405 than on 202205.
202405:

show memory
Memory statistics for zebra:
  Total heap allocated:  > 2GB
--- qmem libfrr ---
Type                          : Current#   Size       Total     Max#  MaxBytes
Nexthop                       :     1669    160      280536  8113264 1363218720    # ASIC0
Nexthop                       :     1535    160      258120  2097270 352476288     # ASIC1

202205:

show memory
Memory statistics for zebra:
  Total heap allocated:  72 MiB
--- qmem libfrr ---
Type                          : Current#   Size       Total     Max#  MaxBytes
Nexthop                       :     1173    152      178312    36591   5563080
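
For reference, per-ASIC Nexthop lines like the ones annotated above can be pulled for every zebra instance with a small loop over vtysh -n, as in the Cisco output further down in this thread. A sketch, assuming the -n instance numbers match the ASIC indices; adjust the range to the platform's ASIC count:

# dump the Nexthop qmem counters for each zebra instance (one per ASIC)
for asic in 0 1; do
    echo "--- ASIC${asic} ---"
    vtysh -n "${asic}" -c 'show memory' | grep -i 'Nexthop'
done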

NOTES:
- both 202205 and 202405 have the same number of routes installed
- we also see an increase on t2-min topologies, but the absolute memory usage is at least half of what T2 is seeing, so we aren't seeing OOMs on t2-min
- the FRR version changed between the two releases: 202205 uses FRRouting 8.2.2 and 202405 uses FRRouting 8.5.4

kenneth-arista (Contributor) commented:

@arlakshm @wenyiz2021

arlakshm (Contributor) commented Sep 23, 2024

bingwang-ms added the Chassis 🤖 Modular chassis support and Triaged labels on Sep 25, 2024
arlakshm added the P0 Priority of the issue label on Sep 25, 2024
arlakshm self-assigned this on Sep 25, 2024
anamehra (Contributor) commented Oct 2, 2024

This is the output from Cisco Chassis LC0:

Nexthop                       :     1274    160      214128  1352174 227269168
Nexthop Group                 :        0     32           0        2        80
Nexthop Group Entry           :      440    144       71056     6096    934928
Nexthop Group Connected       :      645     40       25848     1037     41528
Nexthop Group Context         :        0   2104           0        1      2104
Nexthop tracking object       :      173    248       42920      173     42920
Nexthop                       :      159    160       26888      167     28232
Static Nexthop tracking data  :      140     88       12544      140     12544
Static Nexthop                :      141    224       32872      141     32872
root@sfd-t2-lc0:/home/cisco# vtysh  -n 1 -c 'show memory'| grep Next
Nexthop                       :     1270    160      213424  1453138 244261040
Nexthop Group                 :        0     32           0        2        80
Nexthop Group Entry           :      458    144       69824    18917   2878184
Nexthop Group Connected       :      697     40       27880     1437     57480
Nexthop Group Context         :        0   2104           0        1      2104
Nexthop tracking object       :      177    248       43928      177     43928
Nexthop                       :      163    160       27560      173     29240
Static Nexthop tracking data  :      140     88       12752      140     12752
Static Nexthop                :      141    224       33016      141     33016
root@sfd-t2-lc0:/home/cisco# vtysh  -n 0 -c 'show memory'| grep Next
Nexthop                       :     1270    160      213392   189032  31782240
Nexthop Group                 :        0     32           0        2        80
Nexthop Group Entry           :      440    144       69232     2890    439440
Nexthop Group Connected       :      641     40       25672     1191     47672
Nexthop Group Context         :        0   2104           0        1      2104
Nexthop tracking object       :      169    248       41928      169     41928
Nexthop                       :      155    160       26088      161     27096
Static Nexthop tracking data  :      140     88       12688      140     12688
Static Nexthop                :      141    224       32824      141     32824
root@sfd-t2-lc0:/home/cisco# docker exec -it bgp2 ps aux | grep frr
frr           52  0.2  2.2 1418524 718064 pts/0  Sl   05:00   1:46 /usr/lib/frr/
frr           71  0.0  0.0  44380 14460 pts/0    S    05:00   0:00 /usr/lib/frr/
frr           72  0.2  1.1 664720 354504 pts/0   Sl   05:00   1:50 /usr/lib/frr/
root@sfd-t2-lc0:/home/cisco# docker exec -it bgp1 ps aux | grep frr
frr           52  1.6  2.8 1612844 912296 pts/0  Sl   05:00  12:19 /usr/lib/frr/
frr           67  0.0  0.0  44384 14420 pts/0    S    05:00   0:01 /usr/lib/frr/
frr           68  0.2  1.2 703736 395336 pts/0   Sl   05:00   2:12 /usr/lib/frr/
root@sfd-t2-lc0:/home/cisco# docker exec -it bgp0 ps aux | grep frr
frr           58  1.9  4.0 2000064 1298412 pts/0 Sl   05:00  14:38 /usr/lib/frr/
frr           75  0.0  0.0  44380 14460 pts/0    S    05:00   0:00 /usr/lib/frr/
frr           76  0.2  0.9 619544 314072 pts/0   Sl   05:00   1:39 /usr/lib/frr/

There are some high Nexthop Max# entries for asic1. We did not see the OOM issue, though. Will check the 202305 image for comparison.

arista-nwolfe (Contributor, Author) commented:

Thanks for the output, @anamehra.
The zebra/FRR memory usage looks comparable to what we're seeing on Arista, specifically the RSS in the ps aux output.
Cisco

root@sfd-t2-lc0:/home/cisco# docker exec -it bgp0 ps aux | grep frr
frr           58  1.9  4.0 2000064 1298412 pts/0 Sl   05:00  14:38 /usr/lib/frr/

Arista

root@cmp206-4:~# docker exec -it bgp0 bash
root@cmp206-4:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
frr           51 27.5  8.1 2018736 1283692 pts/0 Sl   21:23  16:57 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp

I'm guessing the difference in %MEM is due to total memory differences between the two devices.
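
As a rough cross-check of that guess: %MEM is essentially RSS over total RAM, and 1283692 kB against 15388 MB works out to about 8.1%, which matches the ps output above. Something along these lines computes it directly on a device (an illustrative sketch, not output from the boxes above):

# approximate zebra's share of RAM: RSS (kB) / MemTotal (kB) * 100
rss_kb=$(ps -o rss= -C zebra | head -n1)
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
awk -v r="$rss_kb" -v m="$mem_kb" 'BEGIN {printf "zebra uses %.1f%% of RAM\n", 100*r/m}'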

arista-nwolfe (Contributor, Author) commented Oct 9, 2024

We tried patching #19717 into 202405 and we saw that the amount of memory Zebra used was significantly reduced:

root@cmp210-3:~# show ip route summary
:
Route Source         Routes               FIB  (vrf default)
kernel               26                   26
connected            28                   28
ebgp                 50841                50841
ibgp                 435                  435
------
Totals               51330                51330

root@cmp210-3:~# ps aux | grep -i zebra
300        37412 23.8  1.4 930336 226108 pts/0   Sl   Oct08   7:21 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp
root       37424  0.0  0.0  96744  7696 ?        Sl   Oct08   0:00 /usr/bin/rsyslog_plugin -r /etc/rsyslog.d/zebra_regex.json -m sonic-events-bgp
root       54932  0.0  0.0   6972  2044 pts/0    S+   00:20   0:00 grep -i zebra

cmp210-3# show memory
Memory statistics for zebra:
System allocator statistics:
  Total heap allocated:  156 MiB
--- qmem libfrr ---
Type                          : Current#   Size       Total     Max#  MaxBytes
Nexthop                       :      675    160      113512   446268  74995760

This explains why master doesn't see high Zebra memory usage, as #19717 is currently only present in master.
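
For anyone who wants to confirm the backport keeps zebra bounded over time, a simple loop can log zebra's RSS and the Nexthop Max# per ASIC. This is a rough sketch that assumes the bgp0/bgp1 container naming and the vtysh -n instance numbering shown earlier in the thread:

# log zebra RSS (kB) and the first Nexthop qmem line per ASIC once a minute
while true; do
    for asic in 0 1; do
        rss=$(docker exec "bgp${asic}" ps -o rss= -C zebra | head -n1)
        nh=$(vtysh -n "${asic}" -c 'show memory' | grep -m1 '^Nexthop ')
        echo "$(date -u +%FT%TZ) asic${asic} zebra_rss_kb=${rss} ${nh}"
    done
    sleep 60
done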

arlakshm (Contributor) commented Oct 9, 2024

@lguohan, @StormLiangMS, @dgsudharsan for visibility.

rawal01 commented Oct 9, 2024

Output from Nokia LC:
202405:
docker exec -it bgp0 ps aux | grep frr
frr 53 0.5 3.6 1916528 1211680 pts/0 Sl Oct08 6:43 /usr/lib/frr/

docker exec -it bgp1 ps aux | grep frr
frr 53 0.7 3.5 1876120 1171496 pts/0 Sl Oct08 8:47 /usr/lib/frr/

The issue exists on master too, but the usage appears lower than on 202405.
master:
bgp0:
frr 54 6.9 1.2 1117048 415040 pts/0 Sl 14:21 5:18 /usr/lib/frr/
bgp1:
frr 54 34.4 1.5 1275912 521904 pts/0 Sl 14:21 4:21 /usr/lib/frr/

arlakshm (Contributor) commented:

Attaching the full output of show memory
frr_show_mem_output.txt

anamehra (Contributor) commented Oct 21, 2024

Hi @abdosi, do we plan to pick #19717 for 202405? Based on the above comments, it looks like this is needed to fix the zebra memory consumption issue on 202405. Thanks.

anamehra (Contributor) commented:

> Hi @abdosi, do we plan to pick #19717 for 202405? Based on the above comments, it looks like this is needed to fix the zebra memory consumption issue on 202405. Thanks.

Please ignore, already merged in 202405.

arlakshm (Contributor) commented:

#19717 merged to 202405.
