Scheduler kills whole cluster with 10 instances #1

Open
philipgiuliani opened this issue Sep 4, 2020 · 2 comments

Comments

@philipgiuliani

Hi,
we have a cluster running 10 instances, and the whole cluster gets killed when using quantum-swarm. With just 2 instances it was working fine.

08:12:56.435 [info] [swarm on A] [tracker:ensure_swarm_started_on_remote_node] nodeup B
08:12:56.435 [info] [swarm on A] [tracker:handle_topology_change] topology change complete
08:13:18.898 [info] GenStage consumer MyProject.Scheduler.ExecutorSupervisor is stopping after receiving cancel from producer #PID<61039.8804.0> with reason: :shutdown
08:13:18.898 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094427137.33179>, :process, #PID<61039.8804.0>, :shutdown}
08:13:18.899 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094427137.33175>, :process, #PID<61039.8802.0>, :shutdown}
08:13:18.901 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094427137.33171>, :process, #PID<61039.8801.0>, :shutdown}
08:13:18.901 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094427137.33167>, :process, #PID<61039.8799.0>, :shutdown}
08:13:18.902 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094427137.33163>, :process, #PID<61039.8797.0>, :shutdown}
08:13:18.903 [warn] [swarm on my_project@XXX.XX.XX.21] [tracker:handle_replica_event] received track event for MyProject.Scheduler.NodeSelectorBroadcaster, mismatched pids, local clock conflicts with remote clock, event unhandled
08:13:18.906 [warn] [swarm on my_project@XXX.XX.XX.21] [tracker:handle_replica_event] received track event for MyProject.Scheduler.JobBroadcaster, mismatched pids, local clock conflicts with remote clock, event unhandled
08:13:18.907 [warn] [swarm on my_project@XXX.XX.XX.21] [tracker:handle_replica_event] received track event for MyProject.Scheduler.ExecutionBroadcaster, mismatched pids, local clock conflicts with remote clock, event unhandled
08:13:18.911 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094164994.181688>, :process, #PID<61061.8615.0>, :noproc}
08:13:18.911 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094164994.181719>, :process, #PID<61043.8615.0>, :noproc}
08:13:18.911 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094164994.181711>, :process, #PID<61061.8621.0>, :shutdown}
08:13:18.911 [info] GenStage consumer MyProject.Scheduler.ExecutorSupervisor is stopping after receiving cancel from producer #PID<61043.8615.0> with reason: :noproc
08:13:18.912 [error] GenServer MyProject.Scheduler.ExecutorSupervisor terminating
** (stop) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
Last message: {:DOWN, #Reference<0.981856171.4094164994.181723>, :process, #PID<61043.8615.0>, :noproc}
08:13:18.912 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094164994.181702>, :process, #PID<61061.8619.0>, :shutdown}
08:13:18.912 [error] Supervisor received unexpected message: {:DOWN, #Reference<0.981856171.4094164994.181698>, :process, #PID<61061.8617.0>, :shutdown}
08:13:18.913 [warn] [swarm on my_project@XXX.XX.XX.21] [tracker:handle_replica_event] received track event for MyProject.Scheduler.TaskRegistry, mismatched pids, local clock conflicts with remote clock, event unhandled
08:13:18.914 [warn] [swarm on my_project@XXX.XX.XX.21] [tracker:handle_replica_event] received track event for MyProject.Scheduler.NodeSelectorBroadcaster, mismatched pids, local clock conflicts with remote clock, event unhandled
08:13:18.919 [error] GenServer #PID<0.8135.0> terminating
** (stop) 'stopping because dependent process <0.8127.0> died: shutdown'
Last message: {:EXIT, #PID<0.8127.0>, :shutdown}
08:13:18.919 [error] GenServer #PID<0.8138.0> terminating
** (stop) 'stopping because dependent process <0.8128.0> died: shutdown'
Last message: {:EXIT, #PID<0.8128.0>, :shutdown}
08:13:18.919 [error] GenServer #PID<0.8131.0> terminating
** (stop) 'stopping because dependent process <0.8126.0> died: shutdown'
Last message: {:EXIT, #PID<0.8126.0>, :shutdown}
08:13:18.927 [info] Application my_project exited: shutdown
"Kernel pid terminated (application_controller) ({application_terminated,my_project,shutdown})
"
"{"Kernel pid terminated",application_controller,"{application_terminated,my_project,shutdown}"}
"

Crash dump is being written to: erl_crash.dump...done

I am not sure what other information I could supply that would help.

@philipgiuliani
Author

Hey @maennchen,

Maybe it's just our architecture, but this seems like a critical problem to me. If you use this library in a cluster, it will crash the whole production system 😀

@maennchen
Member

@philipgiuliani Hm, I only tested it with two machines so far and it seemed to work fine.

I’ll try to replicate the issue.

If you’re able to determine the problem, a PR would also be very welcome.
