WIP: Testing communication resilience #500
Open. igor-davidyuk wants to merge 13 commits into securefederatedai:develop from igor-davidyuk:disconnecting-col-test
@igor-davidyuk this is functional in its current state, correct? Can we merge this for the 1.5 release and continue to make improvements later?
3ddcebd to 1cb91d4
running on a virtual network, but does not support disconnection yet
Signed-off-by: igor-davidyuk <igor.davidyuk@intel.com>
@igor-davidyuk can you please resolve the conflicts and request a review again?
This PR introduces a bash script that emulates cutting the network connection of one or several FL actors.
All the actors run in docker containers and are connected to a docker network. A disconnection may be triggered by certain log output in a container's stdout.
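The log-triggered disconnection described above could be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the function names `line_matches` and `cut_on_log`, and the `$NETWORK` and `$CONTAINER` placeholders, are hypothetical. Only `docker logs -f`, `docker network disconnect`, and `docker network connect` are real Docker CLI commands.

```bash
# Hypothetical helper: succeeds if the log line contains the trigger pattern.
# The pattern is wrapped in wildcards, so it may appear anywhere in the line.
line_matches() {
    case "$1" in
        *"$2"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: reads log lines from stdin. On each matching line it
# disconnects the container from the network, waits, and reconnects it,
# at most max_cuts times.
cut_on_log() {
    local network="$1" container="$2" pattern="$3" timeout="$4" max_cuts="$5"
    local cuts=0 line
    while IFS= read -r line; do
        if [ "$cuts" -lt "$max_cuts" ] && line_matches "$line" "$pattern"; then
            docker network disconnect "$network" "$container"
            sleep "$timeout"
            docker network connect "$network" "$container"
            cuts=$((cuts + 1))
        fi
    done
}

# Usage (assumes a running container attached to the network):
#   docker logs -f "$CONTAINER" 2>&1 | \
#       cut_on_log "$NETWORK" "$CONTAINER" "Run 0 epoch of 1 round" 20 1
```

Note that the loop does not read further log lines while it sleeps, so triggers that appear during the disconnection window are only seen after the reconnection.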
There are several experiments that may be conducted to test the Collaborator's tolerance to network breakage:
A. cut it off when it is CPU bound, performing some calculations
B. cut it off when it is waiting for an Aggregator's response
C. cut it off when it is sending a message
D. cut it off when it is receiving a message
Luckily, the Collaborator's gRPC client features exactly 3 RPCs: `get_tasks`, `get_aggregated_tensor`, and `send_local_task_results`.
There is a peculiarity in the Collaborator's gRPC client implementation: fetching an aggregated model from the aggregator happens tensor by tensor, while a local model is sent in one piece, as a stream. Therefore it would be easier to catch a collaborator sending data than receiving it.
How to use:
`test-federation` will contain logs for all the actors, which you can analyze.
Please note that our tests still require a lot of manual intervention in the script, so treat it as just a base layer.
The script accepts the following parameters:
`STAGE` - integer. 1 - Start by building the base image; 2 - Start by creating a new workspace and dockerizing it; 3 - Certify federation, choose any number of collaborators; 4 - Run the existing image
`NUMBER_OF_COLS` - integer. You can start as many collaborators as you want, but the script will disconnect only the last one!
`RECONNECTION_TIMEOUT` - integer, seconds. Timeout to connect the last collaborator back to the network.
`MAXIMUM_DISCONNECTIONS` - integer. In case you want to limit the number of disconnections throughout the experiment.
`CUT_ON_LOG` - string. A pattern in logs that will trigger disconnections. In the script, wildcards are used to detect the pattern.
`TEMPLATE` - string. OpenFL template to use.
Returning to the experiments.
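As a reference for reading the commands in the experiments, the positional parameters could be parsed roughly like this. This is a sketch under assumptions, not the PR's code: the function `parse_args` is hypothetical, the positional order is inferred from the parameter list and the example commands, and all default values (including `keras_cnn_mnist` for `TEMPLATE`) are assumed.

```bash
# Hypothetical parser. Variable names follow the PR description;
# the positional order and every default value are assumptions.
parse_args() {
    STAGE="${1:-1}"
    NUMBER_OF_COLS="${2:-2}"
    RECONNECTION_TIMEOUT="${3:-20}"
    MAXIMUM_DISCONNECTIONS="${4:-1}"
    CUT_ON_LOG="${5:-}"
    TEMPLATE="${6:-keras_cnn_mnist}"

    # STAGE selects where the script starts, so only 1..4 are accepted.
    case "$STAGE" in
        1|2|3|4) ;;
        *) echo "STAGE must be an integer from 1 to 4" >&2; return 1 ;;
    esac
}

# Example: the arguments of experiment A's command.
parse_args 1 2 20 1 "Run 0 epoch of 1 round"
echo "stage=$STAGE cols=$NUMBER_OF_COLS timeout=${RECONNECTION_TIMEOUT}s cut_on='$CUT_ON_LOG'"
```

With this mapping, the fourth argument caps the number of disconnections and the quoted fifth argument is the log pattern that triggers them.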
A: cut it off when it is CPU bound, performing some calculations.
We can use the `Run 0 epoch of N round` log message in the `keras_cnn_mnist` experiment to trigger disconnection while the collaborator does computations, with the command:
`bash tests/github/docker_disconnecting_test.sh 1 2 20 1 "Run 0 epoch of 1 round"`
It will disconnect the second collaborator when training for the first round starts. 20 seconds of disconnection is enough in this case. The collaborator will be affected once it tries to send the task results. It recovers, seemingly thanks to #465, which resends task results if we get a `grpc.StatusCode.UNKNOWN` error code on the client side. Logs are attached:
A_aggregator.log
A_collaborator1.log
A_collaborator2.log
B: cut it off when it is waiting for an Aggregator's response
C: cut it off when it is sending a message.
We can use the `send_local_task_results` RPC to drop the connection while sending task results.
Here is the exact command that I used. "Setting stream chunks with size 10" is a debug message inside the RPC; 10 is a number specific to the default `keras_cnn_mnist` experiment.
`bash tests/github/docker_disconnecting_test.sh 4 2 20 2 "Setting stream chunks with size 10"`
Logs are attached. TL;DR: the disconnected collaborator never recovers after reconnection.
C_aggregator.log
C_collaborator1.log
C_collaborator2.log