The set-up of timely communication channels, from timely_communication, makes strong assumptions about the synchronization of the workers. In particular, it assumes that if a message is received for a channel that does not yet exist, it is safe to spin waiting for that channel to appear (the associated worker is assumed to be constructing the same graph at the same moment, perhaps just more slowly).
This has the potential to go wrong if the worker is blocked for whatever reason, for example on the wrong side of a worker.step_while() call: the channel is then never constructed and the spin never ends. While the workers are expected to be running equivalent code, slight non-determinism could cause some divergence.
Instead, the channels could probably rendezvous fairly easily: whichever end-point arrives first creates the appropriate (send, recv) pairs in some common location, and each end-point then extracts its own entry from the list. The process-local channels look a bit like this.
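To make the rendezvous idea concrete, here is a minimal sketch using only the standard library. It is not the timely_communication code: `Rendezvous`, `allocate`, and `rendezvous_demo` are hypothetical names, and the sketch assumes all workers live in one process and share the registry through an `Arc`.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// One worker's view of a channel: a sender to every peer, plus its own receiver.
type Endpoint<T> = (Vec<Sender<T>>, Receiver<T>);

/// Hypothetical common location in which channel endpoints rendezvous.
struct Rendezvous<T> {
    peers: usize,
    // channel id -> endpoints that have not yet been claimed, indexed by worker
    pending: Mutex<HashMap<usize, Vec<Option<Endpoint<T>>>>>,
}

impl<T> Rendezvous<T> {
    fn new(peers: usize) -> Arc<Self> {
        Arc::new(Rendezvous { peers, pending: Mutex::new(HashMap::new()) })
    }

    /// Whichever worker calls first builds every (send, recv) pair for the
    /// channel; every caller, first or not, then takes its own endpoint.
    fn allocate(&self, channel_id: usize, worker: usize) -> Endpoint<T> {
        let mut pending = self.pending.lock().unwrap();
        let peers = self.peers;
        let slots = pending.entry(channel_id).or_insert_with(|| {
            let (senders, receivers): (Vec<_>, Vec<_>) =
                (0..peers).map(|_| channel()).unzip();
            receivers
                .into_iter()
                .map(|receiver| Some((senders.clone(), receiver)))
                .collect()
        });
        let endpoint = slots[worker].take().expect("endpoint already claimed");
        let all_claimed = slots.iter().all(Option::is_none);
        if all_claimed {
            // Every worker has its endpoint; the shared entry can go away.
            pending.remove(&channel_id);
        }
        endpoint
    }
}

/// Two workers build channel 7, with worker 1 deliberately arriving late.
/// Worker 0 sends before worker 1 has allocated, and nobody has to spin.
fn rendezvous_demo() -> Vec<Vec<usize>> {
    let registry = Rendezvous::<usize>::new(2);
    let handles: Vec<_> = (0..2)
        .map(|worker| {
            let registry = Arc::clone(&registry);
            thread::spawn(move || {
                if worker == 1 {
                    thread::sleep(Duration::from_millis(20));
                }
                let (senders, receiver) = registry.allocate(7, worker);
                for sender in &senders {
                    sender.send(worker).unwrap();
                }
                let mut got: Vec<usize> =
                    (0..2).map(|_| receiver.recv().unwrap()).collect();
                got.sort();
                got
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Because the slow worker's receiver already exists in the shared table, messages sent to it early simply queue up until it claims its endpoint. A cross-process version would need a different mechanism, which this sketch does not attempt.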
I haven't actually seen this happen in practice yet, but we also haven't exercised dataflow construction at any point other than the start of the computation, before workers have had a chance to fall out of step with each other. If nothing else, it would be valuable to spec out what is expected to work and when, as guidance for writing worker code that doesn't diverge.
The zero-copy allocators in #135 no longer make this assumption: they create the shared queues for a channel either when the channel is constructed or when they first see a message bearing that channel's identifier, whichever comes first.
There is still an assumption of determinism in graph construction, but the problem above of channel construction should be resolved.
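As a rough illustration of "create the queue on construction or on first message", the following sketch keeps a table of shared queues keyed by channel identifier. It is not the code from #135: `QueueTable`, `deliver`, `allocate`, and `early_message_demo` are hypothetical names, and a real allocator would deal in serialized bytes rather than a generic `T`.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::{Arc, Mutex};

type SharedQueue<T> = Arc<Mutex<VecDeque<T>>>;

/// Hypothetical table of per-channel queues, shared by the receiving side
/// (which delivers messages) and the worker (which constructs channels).
struct QueueTable<T> {
    queues: Mutex<HashMap<usize, SharedQueue<T>>>,
}

impl<T> QueueTable<T> {
    fn new() -> Self {
        QueueTable { queues: Mutex::new(HashMap::new()) }
    }

    /// Returns the queue for `id`, creating it if neither side has yet.
    fn queue_for(&self, id: usize) -> SharedQueue<T> {
        let mut queues = self.queues.lock().unwrap();
        Arc::clone(
            queues
                .entry(id)
                .or_insert_with(|| Arc::new(Mutex::new(VecDeque::new()))),
        )
    }

    /// Called when a message arrives; works even if the channel is not built yet.
    fn deliver(&self, id: usize, message: T) {
        self.queue_for(id).lock().unwrap().push_back(message);
    }

    /// Called when the worker constructs the channel; finds any early messages.
    fn allocate(&self, id: usize) -> SharedQueue<T> {
        self.queue_for(id)
    }
}

/// A message for channel 3 arrives before the worker constructs channel 3.
fn early_message_demo() -> Vec<&'static str> {
    let table = QueueTable::new();
    table.deliver(3, "early");
    let queue = table.allocate(3);
    table.deliver(3, "late");
    let drained: Vec<_> = queue.lock().unwrap().drain(..).collect();
    drained
}
```

Neither path waits on the other, so a worker that is slow to construct the channel only delays consumption of the buffered messages. As noted above, this still relies on workers assigning the same identifier to the same channel.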