SNMP service started and triggered SWSS start in the middle of warm-reboot #2750

Closed
stepanblyschak opened this issue Apr 5, 2019 · 2 comments
@stepanblyschak
Collaborator

Description
The SNMP service start is delayed by 3 minutes after boot and depends on the SWSS service.
If sudo warm-reboot is issued after the daemons have reached the reconciled state (~2 min after system start), the SNMP service may wake up in the middle of the warm-reboot script, after the script has already killed SWSS, and start the SWSS container again. This causes the warm reboot to fail and leaves the system in an inconsistent state.

See syslog messages from "Apr 5 13:24:46.321628" to "Apr 5 13:25:50.512102"

Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: drained queue
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- handleRestartQuery: received WARM switch shutdown event
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- profile_get_value: SAI_WARM_BOOT_WRITE_FILE: /var/warmboot/
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: using warmBootWriteFile: '/var/warmboot/'
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: Warm Reboot requested, keeping data plane running
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: Removing the switch gSwitchId=0x100000021
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: Fast/warm reboot requested, keeping data plane running
Apr  5 13:25:03.784348 arc-switch1025 INFO syncd#supervisord: syncd Apr 05 13:25:03 NOTICE  SAI_UTILS: mlnx_sai_utils.c[2307]- set_dispatch_attrib_handler: Set RESTART_WARM, key:Switch ID 1, val:true
Apr  5 13:25:03.784542 arc-switch1025 INFO syncd#supervisord: syncd Apr 05 13:25:03 NOTICE  SAI_UTILS: mlnx_sai_utils.c[2307]- set_dispatch_attrib_handler: Set UNINIT_DATA_PLANE_ON_REMOVAL, key:Switch ID 1, val:false
Apr  5 13:25:03.784542 arc-switch1025 INFO syncd#supervisord: syncd Apr 05 13:25:03 NOTICE  SAI_SWITCH: mlnx_sai_switch.c[5041]- mlnx_disconnect_switch: Disconnect switch
Apr  5 13:25:03.784896 arc-switch1025 INFO syncd.sh[12173]: requested WARM shutdown
Apr  5 13:25:03.785978 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: remove switch took 0.001976 sec
Apr  5 13:25:03.786457 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: calling api uninitialize
Apr  5 13:25:03.786457 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: uninitialize finished
Apr  5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,752 INFO reaped unknown pid 39
Apr  5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,753 INFO reaped unknown pid 31
Apr  5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,754 INFO reaped unknown pid 23
Apr  5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,756 INFO reaped unknown pid 47
Apr  5 13:25:04.325406 arc-switch1025 NOTICE root: Finished warm shutdown syncd process ...
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:03,801 INFO exited: syncd (exit status 0; expected)
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:04,479 WARN received SIGTERM indicating exit request
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:04,479 INFO waiting for rsyslogd, mlnx-sfpd to die
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:07,484 INFO waiting for rsyslogd, mlnx-sfpd to die
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:10,488 INFO waiting for rsyslogd, mlnx-sfpd to die
Apr  5 13:25:14.482720 arc-switch1025 INFO dockerd[1680]: time="2019-04-05T13:25:14.482018945Z" level=info msg="Container f214ce00b9045dd7edba5c441a05fa747bb2f12936aaf7baaa29665f093ec8bf failed to exit within 10 seconds of signal 15 - using the force"
Apr  5 13:25:14.746779 arc-switch1025 INFO containerd[1678]: time="2019-04-05T13:25:14.746534087Z" level=info msg="shim reaped" id=f214ce00b9045dd7edba5c441a05fa747bb2f12936aaf7baaa29665f093ec8bf
Apr  5 13:25:14.756912 arc-switch1025 INFO dockerd[1680]: time="2019-04-05T13:25:14.756756411Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr  5 13:25:14.812920 arc-switch1025 INFO syncd.sh[12173]: syncd
Apr  5 13:25:14.813636 arc-switch1025 INFO syncd.sh[6050]: 137
Apr  5 13:25:14.822741 arc-switch1025 NOTICE root: Stopped syncd service...
Apr  5 13:25:14.829094 arc-switch1025 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from syncd service
Apr  5 13:25:14.842953 arc-switch1025 INFO systemd[1]: Stopped syncd service.
Apr  5 13:25:43.154294 arc-switch1025 INFO systemd[1]: Starting switch state service...
Apr  5 13:25:43.164059 arc-switch1025 NOTICE root: Starting swss service...
Apr  5 13:25:43.171856 arc-switch1025 NOTICE root: Locking /tmp/swss-syncd-lock from swss service
Apr  5 13:25:43.181013 arc-switch1025 NOTICE root: Locked /tmp/swss-syncd-lock (10) from swss service
Apr  5 13:25:44.918430 arc-switch1025 NOTICE root: Warm boot flag: swss true.
Apr  5 13:25:46.137482 arc-switch1025 INFO swss.sh[12753]: Starting existing swss container with HWSKU ACS-MSN2700
Apr  5 13:25:46.244446 arc-switch1025 INFO containerd[1678]: time="2019-04-05T13:25:46.242046842Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/d8fdfb3a13a185551f59650deea154bd6144adeb2c030377374133b58b7b254c/shim.sock" debug=false pid=13130
Apr  5 13:25:46.502350 arc-switch1025 INFO swss.sh[12753]: swss
Apr  5 13:25:47.386488 arc-switch1025 NOTICE root: Started swss service...
Apr  5 13:25:47.399593 arc-switch1025 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from swss service
Apr  5 13:25:47.412764 arc-switch1025 INFO systemd[1]: Started switch state service.
Apr  5 13:25:47.413771 arc-switch1025 INFO systemd[1]: Starting SNMP container...
Apr  5 13:25:49.406666 arc-switch1025 INFO snmp.sh[13325]: Starting existing snmp container with HWSKU ACS-MSN2700
Apr  5 13:25:49.575907 arc-switch1025 INFO containerd[1678]: time="2019-04-05T13:25:49.575188155Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/13a2ea7a84dfe96b70d702893c561f4cd8732f50be9552937f1f169060951c25/shim.sock" debug=false pid=13547
Apr  5 13:25:49.949186 arc-switch1025 INFO snmp.sh[13325]: snmp
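
The race is visible in the excerpt above: systemd reports syncd stopped, and roughly half a minute later it starts the switch state service again, followed by the SNMP container. A minimal sketch for spotting this pattern in a syslog file is shown below. The function names are hypothetical, and the year is an assumption taken from the issue date because the syslog prefix omits it.

```python
from datetime import datetime

ASSUMED_YEAR = 2019  # assumption: syslog prefix has no year; taken from the issue date


def parse_ts(line):
    """Parse the leading 'Apr  5 13:25:14.842953' timestamp of a syslog line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(
        f"{ASSUMED_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S.%f"
    )


def find_unexpected_swss_start(lines):
    """Return (stop_ts, start_ts) if swss is started after syncd was stopped.

    Returns None when no such sequence is found.
    """
    stopped = None
    for line in lines:
        if "Stopped syncd service" in line:
            stopped = parse_ts(line)
        elif stopped is not None and "Starting switch state service" in line:
            return stopped, parse_ts(line)
    return None


# Two lines copied from the excerpt above.
sample = [
    "Apr  5 13:25:14.842953 arc-switch1025 INFO systemd[1]: Stopped syncd service.",
    "Apr  5 13:25:43.154294 arc-switch1025 INFO systemd[1]: Starting switch state service...",
]

hit = find_unexpected_swss_start(sample)
if hit is not None:
    stop_ts, start_ts = hit
    gap = (start_ts - stop_ts).total_seconds()
    print(f"swss restarted {gap:.3f} s after syncd stopped")
```

On the two sample lines the gap is about 28.3 seconds, which matches the window in which the delayed SNMP start fired.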

Steps to reproduce the issue:

  1. Execute sudo warm-reboot a few times

Describe the results you received:
See description

Describe the results you expected:
SNMP start does not interfere with warm-reboot.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**
SONiC Software Version: SONiC.HEAD.42-5c663ca
Distribution: Debian 9.8
Kernel: 4.9.0-8-amd64
Build commit: 5c663ca
Build date: Fri Apr  5 04:51:41 UTC 2019
Built by: johnar@jenkins-worker-3

Docker images:
REPOSITORY                 TAG                 IMAGE ID            SIZE
docker-orchagent-mlnx      HEAD.42-5c663ca     92227cd6d9e5        286MB
docker-orchagent-mlnx      latest              92227cd6d9e5        286MB
docker-syncd-mlnx          HEAD.42-5c663ca     213b01ae948a        331MB
docker-syncd-mlnx          latest              213b01ae948a        331MB
docker-lldp-sv2            HEAD.42-5c663ca     27eb551cdb9e        274MB
docker-lldp-sv2            latest              27eb551cdb9e        274MB
docker-dhcp-relay          HEAD.42-5c663ca     c918d4fb6c35        256MB
docker-dhcp-relay          latest              c918d4fb6c35        256MB
docker-database            HEAD.42-5c663ca     20c13f3c77c4        255MB
docker-database            latest              20c13f3c77c4        255MB
docker-teamd               HEAD.42-5c663ca     a29189c45fe7        274MB
docker-teamd               latest              a29189c45fe7        274MB
docker-snmp-sv2            HEAD.42-5c663ca     c102aa38ddf5        294MB
docker-snmp-sv2            latest              c102aa38ddf5        294MB
docker-router-advertiser   HEAD.42-5c663ca     c46d7f87c715        254MB
docker-router-advertiser   latest              c46d7f87c715        254MB
docker-platform-monitor    HEAD.42-5c663ca     6784680cc095        286MB
docker-platform-monitor    latest              6784680cc095        286MB
docker-fpm-quagga          HEAD.42-5c663ca     d26abe5b1234        281MB
docker-fpm-quagga          latest              d26abe5b1234        281MB
**Attach debug file `sudo generate_dump`:**

```
syslog.txt
```
@yxieca
Contributor

yxieca commented Apr 9, 2019

@stepanblyschak great finding! Should we proactively stop snmp before warm reboot?
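
One way to picture the "stop snmp first" idea is the sketch below. It is only an illustration, not the fix that was adopted: the unit names `snmp.timer` and `snmp.service` are assumptions that should be checked on the target image, and the helper name is hypothetical. The timer is stopped together with the service so that the delayed start cannot fire while the warm-reboot script is running.

```python
import subprocess

# Assumed unit names; verify them on the target image before relying on this.
SNMP_UNITS = ("snmp.timer", "snmp.service")


def stop_snmp_before_warm_reboot(run=subprocess.run):
    """Stop the delayed-start timer and the service in one systemctl call.

    `run` is injectable so the helper can be exercised without systemd.
    Returns the command that was executed.
    """
    cmd = ["systemctl", "stop", *SNMP_UNITS]
    run(cmd, check=True)  # raises CalledProcessError if systemctl fails
    return cmd
```

Calling `stop_snmp_before_warm_reboot()` at the top of a warm-reboot wrapper would need root privileges. Stopping the units does not address the underlying dependency of the SNMP start on a fixed delay.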

@yxieca
Contributor

yxieca commented Apr 10, 2019

It is hard to solve this issue unless we have an indication that system boot-up is done.

This timer is a one-time timer that runs from boot-up. The snmp service waits 3.5 minutes before starting.

Can we fix the snmp service so that it does not depend on the delay? This delay is a result of delaying counter creation. Why did we delay creating counters? @pavel-shirshov can you fill in the information?

Per discussion in the warm reboot meeting, we would like to see if we could remove this 3.5-minute delay.

  • Do we have a counters DB ready flag?
  • Do not log error/warning messages until the counters are ready.

@stepanblyschak will drive the action items.
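
As an illustration of the "ready flag instead of fixed delay" idea from the action items above, here is a minimal sketch of a polling wait with a timeout. It assumes nothing about how the flag is stored: `is_ready` is a hypothetical callable that the caller would implement (for example, by reading whatever readiness indicator ends up being defined for the counters DB).

```python
import time


def wait_until_ready(is_ready, timeout_s, poll_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `is_ready()` until it returns True or `timeout_s` elapses.

    Returns True as soon as the flag is seen, False on timeout.
    `clock` and `sleep` are injectable to keep the helper testable.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

With such a helper the service could start as soon as the counters are available and fall back to the current behavior only when the timeout expires, instead of always waiting the full 3.5 minutes.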

dgsudharsan added a commit to dgsudharsan/sonic-buildimage that referenced this issue Jul 11, 2023
Update sonic-utilities submodule pointer to include the following:
* ff380e04 [hash]: Implement GH frontend ([sonic-net#2580](sonic-net/sonic-utilities#2580))
* 61bad064 [db_migrator] Set correct CURRENT_VERSION, extend UT ([sonic-net#2895](sonic-net/sonic-utilities#2895))
* 6b8ee47c [CLI][Show][BGP] Show BGP Change for no neighbor scenario ([sonic-net#2885](sonic-net/sonic-utilities#2885))
* 73d8d633 [doc] Update Command-Reference.md, change show bgp peer command to show bfd peer ([sonic-net#2750](sonic-net/sonic-utilities#2750))
* 7bc08c28 [db_migrator] Remove hardcoded config and migrate config from minigraph ([sonic-net#2887](sonic-net/sonic-utilities#2887))
* b1aa9426 [generate_dump]: Enhance show techsupport for Marvell platform ([sonic-net#2676](sonic-net/sonic-utilities#2676))
* 316b14c0 Add support for secure upgrade ([sonic-net#2698](sonic-net/sonic-utilities#2698))
* dc2945bc [dns] Implement config and show commands for static DNS. ([sonic-net#2737](sonic-net/sonic-utilities#2737))
* 8414a709 [chassis][multi asic] change acl_loader to use tcp socket for db communication ([sonic-net#2525](sonic-net/sonic-utilities#2525))
* 0b629ba1 Revert [chassis][voq] Clear fabric counters queue/port (2789) ([sonic-net#2882](sonic-net/sonic-utilities#2882))
* 3ba8241a [db_migtrator] Add migration of FLEX_COUNTER_DELAY_STATUS during 1911->master upgrade + fast-reboot. Add UT. ([sonic-net#2839](sonic-net/sonic-utilities#2839))
* fceef2ed [chassis][voq] Clear fabric counters queue/port ([sonic-net#2789](sonic-net/sonic-utilities#2789))

Signed-off-by: dgsudharsan <sudharsand@nvidia.com>
mssonicbld added a commit that referenced this issue Jul 11, 2023
…atically (#15456)

#### Why I did it
src/sonic-utilities
```
* ff380e04 - (HEAD -> master, origin/master, origin/HEAD) [hash]: Implement GH frontend (#2580) (13 hours ago) [Nazarii Hnydyn]
* 61bad064 - [db_migrator] Set correct CURRENT_VERSION, extend UT (#2895) (4 days ago) [Vadym Hlushko]
* 6b8ee47c - [CLI][Show][BGP] Show BGP Change for no neighbor scenario (#2885) (6 days ago) [Dev Ojha]
* 73d8d633 - [doc] Update Command-Reference.md, change "show bgp peer" command to "show bfd peer" (#2750) (11 days ago) [PinghaoQu]
* 7bc08c28 - [db_migrator] Remove hardcoded config and migrate config from minigraph (#2887) (11 days ago) [Vaibhav Hemant Dixit]
* b1aa9426 - [generate_dump]: Enhance show techsupport for Marvell platform (#2676) (11 days ago) [pavannaregundi]
* 316b14c0 - Add support for secure upgrade (#2698) (2 weeks ago) [ycoheNvidia]
* dc2945bc - [dns] Implement config and show commands for static DNS. (#2737) (2 weeks ago) [Oleksandr Ivantsiv]
* 8414a709 - [chassis][multi asic] change acl_loader to use tcp socket for db communication (#2525) (2 weeks ago) [Arvindsrinivasan Lakshmi Narasimhan]
* 0b629ba1 - Revert "[chassis][voq] Clear fabric counters queue/port (#2789)" (#2882) (3 weeks ago) [RoRonoa]
* 3ba8241a - [db_migtrator] Add migration of FLEX_COUNTER_DELAY_STATUS during 1911->master upgrade + fast-reboot. Add UT. (#2839) (4 weeks ago) [Vadym Hlushko]
* fceef2ed - [chassis][voq] Clear fabric counters queue/port (#2789) (4 weeks ago) [jfeng-arista]
```
#### How I did it
#### How to verify it
#### Description for the changelog