[CAY-1089, 1127, 1130] Introduce worker-side components of SyncSGD without backup worker #1131

Closed
wants to merge 7 commits into from

@hjp615 hjp615 commented May 5, 2017

This closes #1089 , #1127 , #1130

Block diagram and sequence diagram of SyncSGD are in the following presentation:
https://docs.google.com/presentation/d/1ao_9D3qbyxilypLM7xohZbR2Ihin5N8Hd50JyrSVOmY/edit?usp=sharing

Two main policy changes for SyncSGD

  1. PSModelAccessor can push new models to the server only when PushBarrier is unblocked.
  2. In AsyncWorkerTask, the next mini-batch can start only when MiniBatchBarrier is unblocked.

Specific changes in each file

1. Parameters
A new parameter, Synchronicity, is added. Its default value is async, so if nothing is given on the command line, the parameter server works asynchronously.

2. syncmsg.avsc
The messages needed for SyncSGD are defined in this file. The message names are self-explanatory.
Messages from worker to server: RequestPushPermissionMsg, MiniBatchFinishedMsg
Messages from server to worker: PermitPushMsg, StartNextMiniBatchMsg, TerminateLearningMsg

3. AsyncDolphinLauncher
The isAsync boolean value distinguishes which model the parameter server runs in.
In the async model, NullPushBarrier and NullMiniBatchBarrier, which do nothing, are bound.
In the sync model, SyncPushBarrier and SyncMiniBatchBarrier are bound.
For communication, BatchManager is added as a client of CentCommConf.
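
The binding choice described above can be sketched in plain Java. This is only an illustration, not the launcher's actual configuration code: the empty interfaces and classes are stand-ins for the types named in this PR, the BarrierBindingSketch helper is hypothetical, and the literal values "async" and "sync" for Synchronicity are assumed.

```java
/** Hypothetical stand-ins for the worker-side barrier types named in this PR. */
interface PushBarrier { }
interface MiniBatchBarrier { }

final class NullPushBarrier implements PushBarrier { }
final class SyncPushBarrier implements PushBarrier { }
final class NullMiniBatchBarrier implements MiniBatchBarrier { }
final class SyncMiniBatchBarrier implements MiniBatchBarrier { }

/** Hypothetical helper: picks the implementation class to bind for each barrier. */
final class BarrierBindingSketch {
  private BarrierBindingSketch() { }

  /** A missing value falls back to the default, async. The literal "sync" is an assumption. */
  static boolean isAsync(final String synchronicity) {
    return synchronicity == null || !synchronicity.trim().equalsIgnoreCase("sync");
  }

  /** Async model: no-op barrier. Sync model: blocking barrier. */
  static Class<? extends PushBarrier> pushBarrierFor(final boolean isAsync) {
    return isAsync ? NullPushBarrier.class : SyncPushBarrier.class;
  }

  static Class<? extends MiniBatchBarrier> miniBatchBarrierFor(final boolean isAsync) {
    return isAsync ? NullMiniBatchBarrier.class : SyncMiniBatchBarrier.class;
  }
}
```

With no command-line value, isAsync(null) is true, so both Null barriers are selected and the worker never blocks.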

4. AsyncWorkerTask
A StateMachine with three states (MINI_BATCH_RUNNING, WAITING_NEXT_MINI_BATCH, MINI_BATCH_CLOSING) is added.

The two main changes are the following:
a. For loop over epochs
In the async model, each worker finishes its own for loop asynchronously when epochIdx == maxNumEpochs. In the sync model, a worker finishes its for loop when it receives TerminateLearningMsg from the driver. When that message is received, the learningFlag value is changed so that the for loop ends.
b. MiniBatchBarrier
After trainer.runMiniBatch() finishes, workers are blocked by MiniBatchBarrier.
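
A minimal sketch of the loop described in (a) and (b), assuming a fixed number of mini-batches per epoch. The class WorkerLoopSketch, the runEpochs signature, and the miniBatchesPerEpoch parameter are hypothetical; only epochIdx, maxNumEpochs, learningFlag, and the barrier method name come from this PR.

```java
enum LearningState { ProgressLearning, TerminateLearning }

/** Stand-in for the barrier interface; the Null variant would always return ProgressLearning. */
interface MiniBatchBarrier {
  LearningState waitMiniBatchControlMsgFromDriver() throws InterruptedException;
}

final class WorkerLoopSketch {
  private WorkerLoopSketch() { }

  /** Runs mini-batches until the loop ends and returns how many were run. */
  static int runEpochs(final boolean isAsync, final int maxNumEpochs, final int miniBatchesPerEpoch,
                       final Runnable runMiniBatch, final MiniBatchBarrier barrier)
      throws InterruptedException {
    boolean learningFlag = true;
    int miniBatchesRun = 0;
    // Async: stop at maxNumEpochs. Sync: keep going until the driver terminates learning.
    for (int epochIdx = 0; learningFlag && (!isAsync || epochIdx < maxNumEpochs); epochIdx++) {
      for (int i = 0; learningFlag && i < miniBatchesPerEpoch; i++) {
        runMiniBatch.run();
        miniBatchesRun++;
        // Blocks in the sync model; returns immediately in the async model.
        if (barrier.waitMiniBatchControlMsgFromDriver() == LearningState.TerminateLearning) {
          learningFlag = false;
        }
      }
    }
    return miniBatchesRun;
  }
}
```

Note that in this sketch a sync worker never leaves the loop on its own, which matches the description that only TerminateLearningMsg ends learning in the sync model.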

5. PSModelAccessor
Before a push operation, PushBarrier asks the driver whether it is OK to push.

6. ResettableCountDownLatch
A modified version of CountDownLatch that can be reset, since CountDownLatch itself cannot be.
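
One possible shape for such a latch is sketched below; it is not necessarily how the class in this PR is written. The sketch wraps java.util.concurrent.CountDownLatch and swaps in a fresh instance on reset(). It assumes reset() is only called when no thread is still waiting on the previous latch, since such a thread would keep waiting on the old instance.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Sketch of a latch that can be re-armed by replacing the wrapped CountDownLatch. */
final class ResettableCountDownLatch {
  private final int initialCount;
  private volatile CountDownLatch latch;

  ResettableCountDownLatch(final int initialCount) {
    this.initialCount = initialCount;
    this.latch = new CountDownLatch(initialCount);
  }

  void countDown() {
    latch.countDown();
  }

  void await() throws InterruptedException {
    latch.await();
  }

  /** Returns false if the timeout elapsed before the count reached zero. */
  boolean await(final long timeout, final TimeUnit unit) throws InterruptedException {
    return latch.await(timeout, unit);
  }

  long getCount() {
    return latch.getCount();
  }

  /** Re-arms the latch with its initial count. */
  void reset() {
    latch = new CountDownLatch(initialCount);
  }
}
```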

7. Driver-side components (including BatchManager, DriverSideSyncSGDMsgSender)
BatchManager manages the workers' mini-batch life cycle. DriverSideSyncSGDMsgSender sends SyncSGD-related messages to workers. They are introduced in this PR only so that BatchManager can be added as a client of CentCommConf. Because they are driver-side components, they will be implemented in a later PR.

8. LearningState
This enum indicates the learning state of AsyncWorkerTask. If the state is ProgressLearning, the next mini-batch will be started. If the state is TerminateLearning, AsyncWorkerTask will finish its learning.

9. NullMiniBatchBarrier, NullPushBarrier
These are for the async model. These components do nothing (no blocking).

10. SyncMiniBatchBarrier
As mentioned in AsyncWorkerTask, this barrier blocks in waitMiniBatchControlMsgFromDriver().
There are two kinds of MiniBatchControlMsg: StartNextMiniBatchMsg and TerminateLearningMsg. Either message counts down miniBatchLatch, which unblocks the barrier. If StartNextMiniBatchMsg is received, the function returns ProgressLearning, which allows the next mini-batch to start. If TerminateLearningMsg is received, the function returns TerminateLearning, which makes AsyncWorkerTask stop its learning.
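
The barrier behavior can be sketched as follows. This is a simplified illustration under stated assumptions: the class name and the two on...Msg() methods are hypothetical, the learning state is kept inside the barrier (in the PR it lives in AsyncWorkerTask), and the latch is re-armed after each wait on the assumption that the driver sends the next control message only after the worker has reported its next MiniBatchFinishedMsg.

```java
import java.util.concurrent.CountDownLatch;

enum LearningState { ProgressLearning, TerminateLearning }

final class SyncMiniBatchBarrierSketch {
  private volatile CountDownLatch miniBatchLatch = new CountDownLatch(1);
  private volatile LearningState learningState = LearningState.ProgressLearning;

  /** Blocks until a control message arrives, then reports what the worker should do next. */
  LearningState waitMiniBatchControlMsgFromDriver() throws InterruptedException {
    miniBatchLatch.await();
    final LearningState state = learningState;
    // Re-arm for the next mini-batch (assumes no control message can arrive before this point).
    miniBatchLatch = new CountDownLatch(1);
    return state;
  }

  /** Hypothetical hook called by the message handler on StartNextMiniBatchMsg. */
  void onStartNextMiniBatchMsg() {
    learningState = LearningState.ProgressLearning;
    miniBatchLatch.countDown();
  }

  /** Hypothetical hook called by the message handler on TerminateLearningMsg. */
  void onTerminateLearningMsg() {
    learningState = LearningState.TerminateLearning;
    miniBatchLatch.countDown();
  }
}
```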

11. SyncPushBarrier
As mentioned in PSModelAccessor, the push operation is blocked by pushBarrier.requestPushPermission().
If the worker is a slow worker, the push operation is blocked until StartNextMiniBatchMsg is received from the driver. However, because this PR implements SyncSGD without backup workers, this situation will not happen. The thisRoundNum value is needed to distinguish an up-to-date RequestPushPermissionMsg from an old one.
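
A rough sketch of the push barrier, assuming the request carries thisRoundNum and that permission stays valid until the next round starts. The class name, the IntConsumer used in place of WorkerSideSyncSGDMsgSender, and the on...Msg() hooks are hypothetical, and the slow-worker path is not modeled since it cannot occur without backup workers.

```java
import java.util.concurrent.CountDownLatch;
import java.util.function.IntConsumer;

final class SyncPushBarrierSketch {
  // Hypothetical stand-in for the sender: receives the round number of each request.
  private final IntConsumer requestSender;
  private volatile CountDownLatch pushLatch = new CountDownLatch(1);
  private volatile int thisRoundNum = 0;

  SyncPushBarrierSketch(final IntConsumer requestSender) {
    this.requestSender = requestSender;
  }

  /** Sends RequestPushPermissionMsg tagged with the current round, then waits for permission. */
  void requestPushPermission() throws InterruptedException {
    requestSender.accept(thisRoundNum);
    pushLatch.await();
  }

  /** Hypothetical hook for PermitPushMsg: unblocks the waiting push. */
  void onPermitPushMsg() {
    pushLatch.countDown();
  }

  /** Hypothetical hook for StartNextMiniBatchMsg: move to the new round and re-arm the latch. */
  void onStartNextMiniBatchMsg(final int nextRoundNum) {
    thisRoundNum = nextRoundNum;
    pushLatch = new CountDownLatch(1);
  }

  int getThisRoundNum() {
    return thisRoundNum;
  }
}
```

Tagging each request with the round number lets the receiver tell a request from the current round apart from one left over from an earlier round.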

12. WorkerSideSyncSGDMsgHandler
The following happens when each message is received.
PermitPushMsg: syncPushBarrier is unblocked.
StartNextMiniBatchMsg: the thisRoundNum value of syncPushBarrier is updated and its latch is reset. Then syncMiniBatchBarrier is unblocked to start the next mini-batch.
TerminateLearningMsg: the learningState value of AsyncWorkerTask is updated to TerminateLearning. Then syncMiniBatchBarrier is unblocked to terminate learning.
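
The dispatch order above can be sketched like this. The WorkerSyncHooks interface, the enum, and the roundNum argument are hypothetical and exist only to keep the example self-contained; whether StartNextMiniBatchMsg actually carries a round number is an assumption.

```java
/** Message kinds the worker receives from the driver (names as listed for syncmsg.avsc). */
enum DriverMsgType { PermitPushMsg, StartNextMiniBatchMsg, TerminateLearningMsg }

/** Hypothetical hooks into the two barriers and the worker task. */
interface WorkerSyncHooks {
  void unblockPushBarrier();
  void startNextPushRound(int roundNum); // update thisRoundNum and reset the push latch
  void markLearningTerminated();         // set learningState to TerminateLearning
  void unblockMiniBatchBarrier();
}

final class WorkerSideSyncSGDMsgHandlerSketch {
  private final WorkerSyncHooks hooks;

  WorkerSideSyncSGDMsgHandlerSketch(final WorkerSyncHooks hooks) {
    this.hooks = hooks;
  }

  void onMessage(final DriverMsgType type, final int roundNum) {
    switch (type) {
    case PermitPushMsg:
      hooks.unblockPushBarrier();
      break;
    case StartNextMiniBatchMsg:
      // Prepare the push barrier for the new round before releasing the worker.
      hooks.startNextPushRound(roundNum);
      hooks.unblockMiniBatchBarrier();
      break;
    case TerminateLearningMsg:
      // Record the terminal state first so the released worker observes it.
      hooks.markLearningTerminated();
      hooks.unblockMiniBatchBarrier();
      break;
    default:
      throw new IllegalArgumentException("Unexpected message type: " + type);
    }
  }
}
```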

13. WorkerSideSyncSGDMsgSender
Two kinds of messages are sent from the worker to the driver: RequestPushPermissionMsg and MiniBatchFinishedMsg.

14. SyncPushBarrierTest
The driver-side components are not implemented yet. Therefore, a test class for SyncPushBarrier is needed to check that it works correctly. This test class tests the handlers for each message that the worker receives from the driver.
When PermitPushMsg is received, the handler should unblock syncPushBarrier by counting down its pushLatch. This is tested in testPermitPush().
When StartNextMiniBatchMsg is received, the handler should update thisRoundNum. This is tested in testStartNextMiniBatch().

@hjp615 hjp615 added the SyncSGD label May 5, 2017
@hjp615 hjp615 requested review from yunseong and wynot12 May 5, 2017 03:09
yunseong commented May 5, 2017

@hjp615 Thanks a lot for the work! I'll review this PR from today. :)

@hjp615 and I discussed offline and we'll work on experiments for SIGMOD submission with the highest priority while making progress on syncSGD.

@yunseong

@hjp615, @wynot12 and I decided to close this PR for now and come back to it later when we work on SyncSGD.

@yunseong yunseong closed this May 31, 2017
@wynot12 wynot12 deleted the SyncSGD_NEW branch December 16, 2017 11:21
Successfully merging this pull request may close these issues.

Introduce PushBarrier on worker-side
2 participants