Add SmartSwitch HA feature test plan #13043

Open: wants to merge 11 commits into base: master

@zjswhhh (Contributor) commented May 29, 2024

Description of PR

Summary:
Fixes # (issue)
Adds a test plan for the SmartSwitch HA feature.

sign-off: Jing Zhang zhangjing@microsoft.com

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012
  • 202205
  • 202305
  • 202311

Approach

What is the motivation for this PR?

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@zjswhhh zjswhhh requested review from wangxin and yxieca as code owners May 29, 2024 22:45
@zjswhhh zjswhhh requested a review from r12f May 29, 2024 22:47
@Pterosaur Pterosaur self-requested a review May 31, 2024 08:28
@zjswhhh zjswhhh changed the title [smart switch][HA] add test plan Add SmartSwitch HA feature test plan May 31, 2024
@mssonicbld (Collaborator)

/azp run

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@zjswhhh zjswhhh requested a review from bingwang-ms February 13, 2025 18:57
docs/testplan/HA-SmartSwitch-test-plan.md (10 resolved review threads, 6 of them outdated)

### Setup configuration

The testbed simulates a production VM-to-VM traffic scenario: two VMs, possibly located in different clusters of the data center, communicate with each other, and their traffic passes through the HA set under test. dpu0 in SmartSwitch0 is assumed to be the Active node, and dpu0 in SmartSwitch1 is set to Standby. Both DPUs share the same network configuration.
Contributor commented:

We might need to call out that inline sync is omitted from the graph, to avoid confusion.


Test cases are named “\<Test Scenario\>-[Active|Standby]”, where the suffix indicates whether the traffic is sent through the initially active or the initially standby side.
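
As an illustration of the naming convention, the following minimal sketch expands a list of scenarios into the per-side test case names. The helper name `case_names` and the `SIDES` constant are hypothetical and not part of the test plan or of sonic-mgmt.

```python
from itertools import product

# The two sides traffic can initially be sent through (from the naming convention).
SIDES = ("Active", "Standby")


def case_names(scenarios):
    """Return '<Test Scenario>-<Side>' for every scenario and side."""
    return [f"{scenario}-{side}" for scenario, side in product(scenarios, SIDES)]


if __name__ == "__main__":
    for name in case_names(["Normal OP", "DPU hardware failure"]):
        print(name)
```

Generating the names from one list keeps the Active and Standby variants of each scenario in step.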

### Module 1 Normal OP
Contributor commented:

We need to specify the initial state of the test setup, e.g. two SmartSwitches forming a pair. Tests can also run on a single SmartSwitch, but that is outside the scope of HA testing.

| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|---------------------------|--------------------------------------------|--------------------------------------------------------------------------------------------|--------------------------------------------------------------|---------------------------------------------------|
| Normal OP – Active | Verify normal operation in healthy state | • Start I/O through Active side | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption. |
Contributor commented:

"Start I/O from active side": the setup needs a more detailed description of how the traffic lands on the active side. The outer packet's destination IP is the VIP, so without special configuration the traffic can land on either side.

Also, it would be better to change "Start I/O from active side" to "Start sending traffic to active side".

| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|---------------------------|--------------------------------------------|--------------------------------------------------------------------------------------------|--------------------------------------------------------------|---------------------------------------------------|
| Normal OP – Active | Verify normal operation in healthy state | • Start I/O through Active side | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption. |
| Normal OP – Standby | Verify normal operation in healthy state | • Start I/O through Standby side | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption from the active side. |
Contributor commented:

Same question for standby: how can we ensure the packet lands on the standby side?

Contributor commented:

How do we know the packet is coming back from the active side?

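| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|---------------------------|--------------------------------------------|--------------------------------------------------------------------------------------------|--------------------------------------------------------------|---------------------------------------------------|
| syncd on DPU | Verify behavior when syncd crashes on the DPU. | • Start I/O<br>• Kill syncd on DPU | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption. |
| hamgrd on NPU | Verify behavior when hamgrd crashes on the NPU. | • Start I/O<br>• Kill hamgrd on NPU | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption. |
| pmon on NPU | Verify behavior when pmon crashes on the NPU. | • Start I/O<br>• Kill pmon on NPU | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption. |
| bgpd on NPU | Verify behavior when bgpd crashes on the NPU. | • Start I/O<br>• Kill bgpd on NPU | DPU1 remains active, DPU2 remains standby. | T2 receives packets without disruption. |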
Contributor commented:

This section covers only unplanned events, for which zero packet drops is very hard to achieve. The current goal for unplanned events is a "time from detection to mitigation" of 2s.
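
One way such a goal could be checked from the data plane is sketched below. It assumes the receiver records an arrival timestamp per packet and treats the longest gap between consecutive arrivals as a proxy for the disruption window. That proxy is an assumption: it also includes detection time, so it is not the same quantity as "time from detection to mitigation". The function names and the `budget_s` parameter are hypothetical.

```python
def longest_gap(rx_times):
    """Largest interval in seconds between consecutive packet arrivals."""
    ordered = sorted(rx_times)
    # Pair each arrival with the next one; an empty or single-entry list has no gap.
    return max((later - earlier for earlier, later in zip(ordered, ordered[1:])), default=0.0)


def within_budget(rx_times, budget_s=2.0):
    """True if no receive gap exceeds the allowed disruption budget."""
    return longest_gap(rx_times) <= budget_s


if __name__ == "__main__":
    arrivals = [0.0, 0.5, 1.0, 2.5, 3.0]  # synthetic timestamps, not measured data
    print(longest_gap(arrivals), within_budget(arrivals))
```

A test for an unplanned event could then assert `within_budget(...)` instead of expecting zero loss.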


| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|-----------------------------------------|----------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|--------------------------------------------------|---------------------------------------------------|
| Active NPU-to-DPU probe drop -Active | Verify packet flow when NPU1 to DPU1 link starts dropping probe packets. | • Start I/O through active side.<br>• Configure the NPU1-to-DPU1 link to drop packets. | DPU1 becomes non-active, DPU2 becomes standalone. | T2 receives packets with 1 allowed disruption. |
Contributor commented:

The expected behavior is a bit vague.
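
One possible way to make "1 allowed disruption" concrete is to count contiguous runs of lost packets. The sketch below assumes the sender numbers packets `0..sent_count-1` and the receiver logs the sequence numbers it saw. Both the sequence-number scheme and the `loss_windows` helper are hypothetical, not something the test plan defines.

```python
def loss_windows(received_seqs, sent_count):
    """Return (first_missing, last_missing) for each contiguous run of lost packets."""
    received = set(received_seqs)
    windows = []
    start = None  # first sequence number of the loss run currently being tracked
    for seq in range(sent_count):
        if seq not in received:
            if start is None:
                start = seq
        elif start is not None:
            windows.append((start, seq - 1))
            start = None
    if start is not None:
        # Loss run extends to the last packet sent.
        windows.append((start, sent_count - 1))
    return windows


if __name__ == "__main__":
    windows = loss_windows([0, 1, 2, 6, 7, 8, 9], sent_count=10)
    print(windows, len(windows) <= 1)
```

With this definition, a case passes when `len(loss_windows(...)) <= 1`, and the width of the single window can additionally be bounded.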


| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|------------------------|--------------------------------------------------|------------------------------------------------------|--------------------------------------------------|---------------------------------------------------|
| DPU hardware failure | Verify traffic flow when DPU hardware fails | • Start I/O<br>• Force DPU reset (ChassisStateDB DPU_STATE) | DPU1 becomes non-active, DPU2 becomes standalone. | T2 receives packets without disruption. |
Contributor commented:

Unplanned events will drop packets.


| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|----------------------------------------|---------------------------------------------------------|------------------------------------------------------------------------------|--------------------------------------------------|---------------------------------------------------|
| Shutdown/Startup BGP sessions from NPU | Verify traffic when BGP sessions are shut down and started up from the NOS | • Start I/O<br>• Shutdown all BGP sessions on NPU<br>• Startup all BGP sessions on NPU | Impacted side becomes non-active, the peer side becomes standalone. | T2 receives packets without disruption. |
Contributor commented:

I believe these will cause packet drops too. Maybe I missed something, or maybe the scenario we intend to test here is not described clearly.

| Case | Goal | Test Steps | Expected Control Plane Behavior | Expected Data Plane Behavior |
|----------------------------------------|---------------------------------------------------------|------------------------------------------------------------------------------|--------------------------------------------------|---------------------------------------------------|
| Shutdown/Startup BGP sessions from NPU | Verify traffic when BGP sessions are shut down and started up from the NOS | • Start I/O<br>• Shutdown all BGP sessions on NPU<br>• Startup all BGP sessions on NPU | Impacted side becomes non-active, the peer side becomes standalone. | T2 receives packets without disruption. |
| TSA on T1 | Verify traffic when TSA on T1 | • Start I/O<br>• TSA on T1<br>• TSB on T1 | Impacted side becomes non-active, the peer side becomes standalone. | T2 receives packets without disruption. |
| Config reload on T1 | Verify traffic when config reload on T1 | • Start I/O<br>• Config reload on T1 | Impacted side becomes non-active, the peer side becomes standalone. | T2 receives packets without disruption. |
Contributor commented:

Same as above: all 3 cases should have data path impact.

3 participants