Mid-December changes to SWSS made SONiC on Dell N3248TE-ON unusable #18421
Comments
@jeff-yin This might need attention from Dell. I saw a comment on another issue where the symptom is similar (multi-container failure on Dell Broadcom(Trident) units) although the direct cause may be different (I'm not using subinterfaces). |
@vpsubramaniam please take a look and self-assign this issue to yourself. |
@dgsudharsan would you be able to work with @prsunny to ensure swss does not crash on a supported SAI call? |
I still have this image installed as a secondary on my switch and can gather more logs if you need. This was installed from the 202311 base image with the following configuration entered manually:
But this is the same behavior I see on this platform regardless of configuration (even using a build with some significant changes related to OSPF management and DB-integrated routing configuration layered on top). Ever since that point in mid-December, all builds display this cascading container failure. The set of functionality I employ is consistent (VLANs, dhcp-relay, BGP, etc.) |
Due to merges like #18038, I can't build from commits as far back as 12/2023 anymore (the files referenced in the older commits are no longer available). I've tried cherry-picking some commits to see if I can get the updated URLs merged without whatever SWSS changes (presumably) are causing the failures, but I haven't been successful yet. I'm going to try going back further to the 202305 branch and see if that's currently stable on this platform. It looks to me like maybe the 202305 branch lags master more than 202311 (i.e., not as much stuff is back-ported). Hopefully that's accurate. |
202305 seems to be stable on the N3248TE platform, so the changes that are causing problems in 202311 and master were not backported to 202305. |
@justindthomas, the image below seems fine; all docker services come up without any issues. Something was probably fixed in the latest 202311 branch. Please check this image and, if you still see any issues, kindly share the configuration details. |
@vpsubramaniam Okay, I'll try loading up the current 202311 image tomorrow. My suspicion is that the failure is in something that's activated by the configuration (e.g., maybe the activation of BGP). Hopefully it's fixed, though. I'll report back. |
My guess is you hit the same issue as me with ipv6 link local neighbor removal since you mentioned ipv6: #21247 |
I've been running a custom build (with some of my own changes) of the master branch from December 11 on my Dell N3248TE-ON for months because any attempts to use a commit date later than around that time result in the docker-orchagent container periodically dying and taking everything else down with it.
I had assumed it was just because I was trying to be on the bleeding edge and figured it would be resolved eventually. Today I decided to roll back to the "current" release of 202311 from https://sonic.software with a clean configuration so that I could focus on some IPv6 work and not worry about my platform. But the problematic changes to swss seem to have been merged into that branch and I'm seeing the same behavior as when I build on master.
Here is a log of how the swss container fails:
From there, the system tries to restart everything, but the whole thing just cycles from failure to failure. Note that this starts a few minutes after the system has come up and is successfully passing traffic.
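Since the failure only starts a few minutes after boot, it may help to record exactly when each container stops running. The following is a minimal sketch, not part of the original report: it assumes Python 3 and the docker CLI are available on the switch, and the helper names (parse_docker_ps, not_running, watch) are hypothetical. It polls docker ps and prints a timestamped line whenever the set of non-running containers changes.

```python
import subprocess
import time
from datetime import datetime


def parse_docker_ps(output):
    """Parse tab-separated 'name<TAB>status' lines into a {name: status} dict."""
    states = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        states[name.strip()] = status.strip()
    return states


def not_running(states):
    """Return the sorted names of containers whose status does not start with 'Up'."""
    return sorted(name for name, status in states.items() if not status.startswith("Up"))


def watch(interval_s=5):
    """Poll docker forever; print a timestamped line when the set of down containers changes."""
    previous = None
    while True:
        # The format string asks docker for one 'name<TAB>status' line per container.
        result = subprocess.run(
            ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
            capture_output=True, text=True, check=True,
        )
        down = not_running(parse_docker_ps(result.stdout))
        if down != previous:
            print(f"{datetime.now().isoformat()} not running: {down}")
            previous = down
        time.sleep(interval_s)
```

Calling watch() right after boot and leaving it running would give a timeline of which container goes down first, which can then be matched against the syslog timestamps.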
This is the version I'm running: