Add SmartSwitch HA feature test plan #13043
base: master
/azp run
Azure Pipelines could not run because the pipeline triggers exclude this branch/path.
### Setup configuration

This testbed simulates a production VM-to-VM traffic scenario: two VMs, possibly located in different clusters in the data center, communicate with each other, and their traffic passes through the HA set under test. dpu0 in SmartSwitch0 is assumed to be the Active node, and dpu0 in SmartSwitch1 is set to Standby. Both DPUs share the same network configuration.
might need to call out that inline sync is emitted in the graph to avoid confusion.
The naming convention for a test case is “\<Test Scenario\>-[Active|Standby]”, where the suffix indicates whether the traffic is sent through the initially active side or the initially standby side.
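The naming convention above can be sketched as a pair of small helpers that build and parse case names. This is only an illustration: `build_case_name`, `parse_case_name`, and `CaseName` are hypothetical names that do not appear in the test plan, and only the "<Test Scenario>-[Active|Standby]" pattern is taken from it.

```python
import re
from typing import NamedTuple

# Hypothetical helpers: only the "<Test Scenario>-[Active|Standby]"
# pattern comes from the test plan; every name below is illustrative.
SIDES = ("Active", "Standby")
_CASE_NAME = re.compile(r"^(?P<scenario>.+)-(?P<side>Active|Standby)$")


class CaseName(NamedTuple):
    scenario: str
    side: str


def build_case_name(scenario: str, side: str) -> str:
    """Join a scenario and the side that initially receives the traffic."""
    if side not in SIDES:
        raise ValueError(f"side must be one of {SIDES}, got {side!r}")
    return f"{scenario}-{side}"


def parse_case_name(name: str) -> CaseName:
    """Split a case name at its final '-Active' or '-Standby' suffix."""
    match = _CASE_NAME.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow '<Test Scenario>-[Active|Standby]'")
    return CaseName(match["scenario"], match["side"])
```

Because the scenario part is matched greedily, scenario names that contain hyphens themselves (for example "NPU-to-DPU probe drop") are still split at the final side suffix.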
### Module 1 Normal OP
Need to specify the initial state of the test setup, e.g. two SmartSwitches forming an HA pair. This matters because some test cases can run on a single SmartSwitch, but those are outside the scope of HA testing.
| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|------|------|------------|---------------------------------|------------------------------|
| Normal OP – Active | Verify normal operation in healthy state | • Start I/O through Active side | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption. |
"Start I/O from active side": this needs a more detailed description of how the setup makes the traffic land on the active side. The outer packet's destination IP is the VIP, so without special configuration the traffic could land on either side.
Also, better to change "Start I/O from active side" to "Start sending traffic to active side".
| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|------|------|------------|---------------------------------|------------------------------|
| Normal OP – Standby | Verify normal operation in healthy state | • Start I/O through Standby side | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption from the active side. |
Same question as for the active case: how can we ensure the packet lands on the standby side?
How do we know the packet is coming back from the active side?
| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|------|------|------------|---------------------------------|------------------------------|
| syncd on DPU | Verify behavior when syncd crashes on the DPU. | • Start I/O<br>• Kill syncd on DPU | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption. |
| hamgrd on NPU | Verify behavior when hamgrd crashes on the NPU. | • Start I/O<br>• Kill hamgrd on NPU | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption. |
| pmon on NPU | Verify behavior when pmon crashes on the NPU. | • Start I/O<br>• Kill pmon on NPU | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption. |
| bgpd on NPU | Verify behavior when bgpd crashes on the NPU. | • Start I/O<br>• Kill bgpd on NPU | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption. |
This section consists entirely of unplanned events, for which it is really hard to achieve zero packet drops. The current goal for unplanned events is a "time from detection to mitigation" of 2s.
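As a rough sketch of how a data plane check could use a time budget instead of "no packet drops", the helper below measures the longest traffic outage seen by the receiver and compares it with the 2s goal. It rests on assumptions that are not in the test plan: the traffic generator records a send timestamp per packet, the receiver reports which packet indices arrived, and the traffic-side outage is used only as a proxy for "time from detection to mitigation". `longest_outage_s` and `within_budget` are hypothetical names.

```python
from typing import Iterable, Sequence

# The 2 s goal comes from the review comment; everything else is illustrative.
MITIGATION_BUDGET_S = 2.0


def longest_outage_s(send_times: Sequence[float], received: Iterable[int]) -> float:
    """Return the longest loss burst in seconds.

    send_times[i] is the send timestamp (seconds) of packet i and
    `received` holds the indices of the packets that arrived. A burst
    runs from the first lost packet to the next delivered packet, or to
    the last packet sent if losses continue until the end of the stream.
    """
    got = set(received)
    longest = 0.0
    burst_start = None  # send time of the first lost packet in the current burst
    for index, sent_at in enumerate(send_times):
        if index in got:
            if burst_start is not None:
                longest = max(longest, sent_at - burst_start)
                burst_start = None
        elif burst_start is None:
            burst_start = sent_at
    if burst_start is not None:  # stream ended while packets were still being lost
        longest = max(longest, send_times[-1] - burst_start)
    return longest


def within_budget(send_times: Sequence[float], received: Iterable[int],
                  budget_s: float = MITIGATION_BUDGET_S) -> bool:
    """True if the longest observed outage does not exceed the budget."""
    return longest_outage_s(send_times, received) <= budget_s
```

A case such as "syncd on DPU" could then assert `within_budget(...)` rather than requiring that every packet is delivered.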
| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|------|------|------------|---------------------------------|------------------------------|
| Active NPU-to-DPU probe drop – Active | Verify packet flow when NPU1 to DPU1 link starts dropping probe packets. | • Start I/O through active side.<br>• Configure the NPU1-to-DPU1 link to drop packets. | DPU1 becomes non-active, DPU2 becomes standalone. | T2 receives packets with 1 allowed disruption. |
The expected behavior is a bit vague.
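One way to make "1 allowed disruption" checkable is to count contiguous loss bursts in the received packet sequence. Note that this interpretation is an assumption: the test plan does not define what counts as one disruption, and `count_disruptions` is a hypothetical helper, not part of any existing test library.

```python
from typing import Iterable


def count_disruptions(total_sent: int, received: Iterable[int]) -> int:
    """Count maximal runs of consecutive lost packets.

    Packets are numbered 0..total_sent-1 and `received` holds the
    sequence numbers that arrived. Each run of one or more consecutive
    missing sequence numbers is treated as one disruption (assumed
    definition, not taken from the test plan).
    """
    got = set(received)
    disruptions = 0
    in_burst = False
    for seq in range(total_sent):
        lost = seq not in got
        if lost and not in_burst:
            disruptions += 1  # a new loss burst starts here
        in_burst = lost
    return disruptions
```

Under this reading, the probe drop case would pass when `count_disruptions(...) <= 1`, which could be combined with a limit on how long that single burst may last.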
| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|------|------|------------|---------------------------------|------------------------------|
| DPU hardware failure | Verify traffic flow when DPU hardware fails | • Start I/O<br>• Force DPU reset (ChassisStateDB DPU_STATE) | DPU1 becomes non-active, DPU2 becomes standalone. | T2 receives packets without disruption. |
Unplanned events will drop packets.
| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|------|------|------------|---------------------------------|------------------------------|
| Shutdown/Startup BGP sessions from NPU | Verify traffic when BGP sessions are shut down and started up from the NOS | • Start I/O<br>• Shutdown all BGP sessions on NPU<br>• Startup all BGP sessions on NPU | Impacted side becomes non-active; the peer side becomes standalone. | T2 receives packets without disruption. |
I believe these will cause packet drops too. Maybe I missed something, or maybe the scenario we intend to test here is not described clearly.
| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|------|------|------------|---------------------------------|------------------------------|
| TSA on T1 | Verify traffic when TSA on T1 | • Start I/O<br>• TSA on T1<br>• TSB on T1 | Impacted side becomes non-active; the peer side becomes standalone. | T2 receives packets without disruption. |
| Config reload on T1 | Verify traffic when config reload on T1 | • Start I/O<br>• Config reload on T1 | Impacted side becomes non-active; the peer side becomes standalone. | T2 receives packets without disruption. |
Same as above: all 3 cases should have data path impact.
Description of PR
Summary:
Fixes # (issue)
Adds a test plan for the SmartSwitch HA feature.
sign-off: Jing Zhang zhangjing@microsoft.com