Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[config] Adding sanity checks for config reload #1664

Merged
merged 4 commits into from
Jun 21, 2021
Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
42 changes: 41 additions & 1 deletion config/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -702,6 +702,32 @@ def _restart_services():
click.echo("Reloading Monit configuration ...")
clicommon.run_command("sudo monit reload")

def _get_delay_timers():
out = clicommon.run_command("systemctl list-dependencies delay.target --plain |sed '1d'", return_cmd=True)
return [timer.strip() for timer in out.splitlines()]

def _delay_timers_elapsed():
for timer in _get_delay_timers():
out = clicommon.run_command("systemctl show {} --property=LastTriggerUSecMonotonic --value".format(timer), return_cmd=True)
if out.strip() == "0":
return False
return True

def _swss_ready():
out = clicommon.run_command("systemctl show swss.service --property ActiveState --value", return_cmd=True)
if out.strip() != "active":
return False
out = clicommon.run_command("systemctl show swss.service --property ActiveEnterTimestampMonotonic --value", return_cmd=True)
swss_up_time = float(out.strip())/1000000
now = time.monotonic()
if (now - swss_up_time > 120):
return True
else:
return False

def _system_running():
out = clicommon.run_command("sudo systemctl is-system-running", return_cmd=True)
return out.strip() == "running"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While verifying PR sonic-net/sonic-buildimage#8003. I hit this condition where I was I was not able to do "config reload". Even after system is up for a while the systemctl was showing running state as degraded

admin@str-s6000-acs-9:~$ sudo systemctl is-system-running
degraded
admin@str-s6000-acs-9:~$ sudo config reload -y
System is not up. Retry later or use -f to avoid system checks
admin@str-s6000-acs-9:~$ docker ps
CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS              PORTS               NAMES
8baf489477e6        docker-snmp:latest                   "/usr/local/bin/supe…"   18 minutes ago      Up 2 minutes                            snmp
ad835c8b42ba        docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   18 minutes ago      Up 2 minutes                            mgmt-framework
930f188f20a7        docker-router-advertiser:latest      "/usr/bin/docker-ini…"   20 minutes ago      Up 5 minutes                            radv
15fd0337e6ea        docker-dhcp-relay:latest             "/usr/bin/docker_ini…"   20 minutes ago      Up 5 minutes                            dhcp_relay
c37a6752f57f        docker-lldp:latest                   "/usr/bin/docker-lld…"   20 minutes ago      Up 5 minutes                            lldp
79cb8660e7d2        docker-syncd-brcm:latest             "/usr/local/bin/supe…"   20 minutes ago      Up 5 minutes                            syncd
77e8ca15b220        docker-teamd:latest                  "/usr/local/bin/supe…"   20 minutes ago      Up 5 minutes                            teamd
ce63284a0f0a        docker-orchagent:latest              "/usr/bin/docker-ini…"   20 minutes ago      Up 5 minutes                            swss
277e43234314        docker-fpm-frr:latest                "/usr/bin/docker_ini…"   20 minutes ago      Up 5 minutes                            bgp
d9c63106202a        docker-platform-monitor:latest       "/usr/bin/docker_ini…"   21 minutes ago      Up 5 minutes                            pmon
34129d689ebd        docker-database:latest               "/usr/local/bin/dock…"   21 minutes ago      Up 6 minutes                            database

I have seen cases where system remains in degraded state as may be one of the systemd service is not started. @dgsudharsan do share your observations.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@judyjoseph I believe we should try to fix the service that is in the degraded state. The whole point of introducing these checks were to avoid issues around config reload which happen when the system was degraded. Until the faulty services are fixed we can expect config reload without force not to work. However to give a handle we introduced '-f' option which lets user to take own risk and perform config reload when the system may not be in a stable state.


def interface_is_in_vlan(vlan_member_table, interface_name):
""" Check if an interface is in a vlan """
Expand Down Expand Up @@ -1191,12 +1217,26 @@ def list_checkpoints(ctx, verbose):
@click.option('-l', '--load-sysinfo', is_flag=True, help='load system default information (mac, portmap etc) first.')
@click.option('-n', '--no_service_restart', default=False, is_flag=True, help='Do not restart docker services')
@click.option('-d', '--disable_arp_cache', default=False, is_flag=True, help='Do not cache ARP table before reloading (applies to dual ToR systems only)')
@click.option('-f', '--force', default=False, is_flag=True, help='Force config reload without system checks')
@click.argument('filename', required=False)
@clicommon.pass_db
def reload(db, filename, yes, load_sysinfo, no_service_restart, disable_arp_cache):
def reload(db, filename, yes, load_sysinfo, no_service_restart, disable_arp_cache, force):
"""Clear current configuration and import a previous saved config DB dump file.
<filename> : Names of configuration file(s) to load, separated by comma with no spaces in between
"""
if not force and not no_service_restart:
if not _system_running():
click.echo("System is not up. Retry later or use -f to avoid system checks")
return

if not _delay_timers_elapsed():
click.echo("Relevant services are not up. Retry later or use -f to avoid system checks")
return

if not _swss_ready():
click.echo("SwSS container is not ready. Retry later or use -f to avoid system checks")
return

if filename is None:
message = 'Clear current config and reload config from the default config file(s) ?'
else:
Expand Down