
WIP: Testing communication resilience #500

Open
wants to merge 13 commits into
base: develop

igor-davidyuk
Contributor

@igor-davidyuk igor-davidyuk commented Sep 6, 2022

This PR introduces a bash script that emulates cutting the network connection of one or several FL actors.
All the actors run in docker containers attached to a docker network. A disconnection can be triggered by specific log output on a container's stdout.
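
As a rough illustration of the idea (not the PR's actual code), cutting and restoring one container's connection can be sketched with `docker network disconnect` and `docker network connect`. The helper name `cut_connection`, its arguments, and the `DOCKER` override variable are assumptions made for this sketch.

```shell
# Hypothetical helper, not taken from the PR's script.
# DOCKER can be pointed at a stub for dry runs; it defaults to the docker CLI.
DOCKER="${DOCKER:-docker}"

# cut_connection NETWORK CONTAINER SECONDS
# Detaches CONTAINER from NETWORK, waits SECONDS, then reattaches it.
cut_connection() {
    local network="$1"
    local container="$2"
    local timeout="$3"

    "$DOCKER" network disconnect "$network" "$container" || return 1
    sleep "$timeout"
    "$DOCKER" network connect "$network" "$container"
}
```

For example, `cut_connection my-fl-network collaborator2 20` would keep the (hypothetically named) container `collaborator2` offline for 20 seconds.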

Several experiments can be conducted to test the Collaborator's tolerance to network breakage:
A. cut it off when it is CPU bound, performing some calculations
B. cut it off when it is waiting for an Aggregator's response
C. cut it off when it is sending a message
D. cut it off when it is receiving a message

Luckily, the Collaborator's gRPC client features exactly 3 RPCs: get_tasks, get_aggregated_tensor, and send_local_task_results.
There is a peculiarity in the Collaborator's gRPC client implementation: the aggregated model is fetched from the aggregator tensor by tensor, while the local model is sent in one piece, as a stream. Therefore it should be easier to catch a collaborator while it is sending data than while it is receiving.

How to use:

  1. You need a clean Python virtual environment (a supported Python version) with an upgraded pip and openfl installed.
  2. Run the script with the desired parameters (see the commands below); it will create all the required artifacts.
  3. test-federation will contain the logs of all the actors for you to analyze.
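
Step 1 could look roughly like the sketch below. The `.venv` default path, the `setup_env` and `run` helper names, and the `DRY_RUN` switch are assumptions for illustration. Installing `openfl` from PyPI is also an assumption: to test this branch you would more likely install OpenFL from the checked-out sources.

```shell
# Hypothetical helpers, not part of the PR.
# run CMD...: executes CMD, or only prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# setup_env [VENV_DIR]: creates a virtual environment and installs openfl.
setup_env() {
    local venv_dir="${1:-.venv}"

    run python3 -m venv "$venv_dir" || return 1
    # Activation modifies the current shell, so it is skipped in dry-run mode.
    if [ "${DRY_RUN:-0}" != "1" ]; then
        . "$venv_dir/bin/activate" || return 1
    fi
    run python -m pip install --upgrade pip || return 1
    run python -m pip install openfl
}
```

Setting `DRY_RUN=1` before calling `setup_env` prints the commands without running them, which is handy for checking the sequence first.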

Please note that our tests still require a lot of manual changes to the script, so treat it as just a base layer.
The script accepts the following parameters:

  • STAGE - integer. The stage to start from: 1 - build the base image; 2 - create a new workspace and dockerize it; 3 - certify the federation (choose any number of collaborators); 4 - run the existing image
  • NUMBER_OF_COLS - integer. You can start as many collaborators as you want, but the script will disconnect only the last one!
  • RECONNECTION_TIMEOUT - integer, seconds. Time after which the last collaborator is connected back to the network.
  • MAXIMUM_DISCONNECTIONS - integer. Limits the number of disconnections throughout the experiment.
  • CUT_ON_LOG - string. A pattern in the logs that triggers a disconnection. The script uses wildcards to detect the pattern.
  • TEMPLATE - string. OpenFL template to use.
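
To make the roles of CUT_ON_LOG and MAXIMUM_DISCONNECTIONS concrete, here is a minimal sketch of wildcard-based log matching with a cap on the number of triggers. It is not the script's actual code: `watch_log` and its argument order are hypothetical, and the action to run on a match is passed in as a command.

```shell
# Hypothetical helper, not taken from the PR's script.
# watch_log PATTERN MAX_HITS CMD...
# Reads log lines from stdin. Each time a line contains PATTERN, runs CMD,
# and stops after MAX_HITS matches.
watch_log() {
    local pattern="$1"
    local max_hits="$2"
    shift 2
    local hits=0
    local line

    # Nothing to do if no disconnections are allowed.
    [ "$max_hits" -gt 0 ] || return 0

    while IFS= read -r line; do
        case "$line" in
            # The quoted pattern is matched literally, anywhere in the line.
            *"$pattern"*)
                # Keep the command from consuming the log stream.
                "$@" </dev/null
                hits=$((hits + 1))
                if [ "$hits" -ge "$max_hits" ]; then
                    break
                fi
                ;;
        esac
    done
}

# Possible usage (container name and action are placeholders):
#   docker logs -f collaborator2 2>&1 | watch_log "$CUT_ON_LOG" "$MAXIMUM_DISCONNECTIONS" some_disconnect_command
```

Passing the action as a command keeps the matching logic independent of how the disconnection itself is performed.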

Returning to the experiments.

A: cut it off when it is CPU bound, performing some calculations.

We can use the "Run 0 epoch of N round" log message in the keras_cnn_mnist experiment to trigger a disconnection while the collaborator is doing computations, with the command:
bash tests/github/docker_disconnecting_test.sh 1 2 20 1 "Run 0 epoch of 1 round"
It disconnects the second collaborator when training for the first round starts. 20 seconds of disconnection is enough in this case. The collaborator is affected only once it tries to send the task results. It recovers, seemingly thanks to #465, which resends task results if the client gets a grpc.StatusCode.UNKNOWN error code. Logs are attached:
A_aggregator.log
A_collaborator1.log
A_collaborator2.log

B: cut it off when it is waiting for an Aggregators response

C: cut it off when it is sending a message.

We can use the send_local_task_results RPC to drop the connection while task results are being sent.
Here is the exact command that I used. "Setting stream chunks with size 10" is a debug message inside the RPC; 10 is a number specific to the default keras_cnn_mnist experiment.
bash tests/github/docker_disconnecting_test.sh 4 2 20 2 "Setting stream chunks with size 10"
Logs are attached. TL;DR: the disconnected collaborator never recovers after reconnection.
C_aggregator.log
C_collaborator1.log
C_collaborator2.log

@igor-davidyuk igor-davidyuk changed the title from "Testing communication resilience" to "WIP: Testing communication resilience" Sep 8, 2022
@psfoley
Contributor

psfoley commented Dec 14, 2022

@igor-davidyuk this is functional in its current state, correct? Can we merge this for the 1.5 release and continue to make improvements later?

@igor-davidyuk igor-davidyuk force-pushed the disconnecting-col-test branch from 3ddcebd to 1cb91d4 January 11, 2023 13:11
@theakshaypant
Collaborator

@igor-davidyuk can you please resolve the conflicts and request a review again?


3 participants