workers launched with htcondor cluster manager cannot connect back with master? #107

Closed
rgavazzi opened this issue Nov 19, 2018 · 3 comments


rgavazzi commented Nov 19, 2018

I get the following error on my local cluster with the HTCondor scheduler (Julia version 1.1.0-DEV):

julia> addprocs_htc(4)
Error launching condor
MethodError(iterate, (Process(`condor_submit /raid/gavazzi/.julia-htc/julia-1195449.sub`, ProcessExited(0)),), 0x00000000000061f6)
0-element Array{Int64,1}
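
A note on the `MethodError(iterate, (Process(...),))` above: one pattern that produces this kind of error is destructuring the result of `open(cmd)`. Julia 0.6 returned a `(stream, process)` tuple there, while Julia 0.7 and later return a single `Process` object. Whether condor.jl does exactly this is an assumption, not something confirmed in this thread. The sketch below uses `echo` as a stand-in for `condor_submit`.

```julia
cmd = `echo submitted`   # stand-in for the real condor_submit command

# Destructuring the result, as Julia 0.6-era code could do, now throws
# a MethodError because a Process is not iterable.
destructure_failed = try
    out, p = open(cmd)
    false
catch err
    err isa MethodError
end

# On Julia >= 0.7, keep the single Process object and read from it directly.
proc = open(cmd)
output = read(proc, String)   # everything the command wrote to stdout
ok = success(proc)            # waits for exit and checks the exit code

println("destructuring failed: ", destructure_failed)
println("command succeeded:    ", ok)
```

If this is indeed the cause, the submit step itself may have worked (note the `ProcessExited(0)` in the error), and only the handling of its result failed.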

The created condor script file seems OK:

executable = /bin/bash
arguments = ./julia-1195449.sh
universe = vanilla
should_transfer_files = yes
transfer_input_files = /home/dir/.julia-htc/julia-1195449.sh
Notification = Error
output = /home/dir/.julia-htc/julia-1195449-1.o
error= /home/dir/.julia-htc/julia-1195449-1.e
queue
output = /home/dir/.julia-htc/julia-1195449-2.o
error= /home/dir/.julia-htc/julia-1195449-2.e
queue
output = /home/dir/.julia-htc/julia-1195449-3.o
error= /home/dir/.julia-htc/julia-1195449-3.e
queue
output = /home/dir/.julia-htc/julia-1195449-4.o
error= /home/dir/.julia-htc/julia-1195449-4.e
queue

The temporary shell script file /home/dir/.julia-htc/julia-1195449.sh seems OK:

#!/bin/sh
cd /tmp
/usr/bin/julia --worker=o7tjjc9VsZGKA8qn | /usr/bin/telnet  machinenode.from_which_I_ran.julia 8848

All output *.o files look like:
Trying 192.168.1.3...

All output *.e files look like:
telnet: connect to address 192.168.1.3: Connection refused

(machinenode.from_which_I_ran.julia has the local IP address 192.168.1.3.)

Other issue: the method "addprocs_htc(np::Integer) = addprocs(HTCManager(np))" does not seem to allow specifying a different working directory. In many cases, HTCondor places julia-1195449.sh and the associated files in a temporary scratch directory, where one may want to stay for the lifetime of the worker. Couldn't we avoid the unconditional cd with a

(dir!=nothing) && println(scriptf, "cd $(Base.shell_escape(dir))")

and
addprocs_htc(np::Integer ; dir=nothing ) = addprocs(HTCManager(np) , dir=dir)

change in condor.jl
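
The proposal above can be sketched in isolation as follows. `write_worker_script` and its arguments are hypothetical names for illustration, not the actual code in condor.jl. The only point is that the `cd` line is emitted when, and only when, a directory is given.

```julia
# Hypothetical helper: write the worker shell script, changing directory
# only if `dir` was supplied by the caller.
function write_worker_script(io::IO, worker_cmd::AbstractString; dir=nothing)
    println(io, "#!/bin/sh")
    # Base.shell_escape quotes the path so spaces survive the shell.
    dir !== nothing && println(io, "cd ", Base.shell_escape(dir))
    println(io, worker_cmd)
end

buf = IOBuffer()

# No `dir`: the job stays in the scratch directory HTCondor put it in.
write_worker_script(buf, "julia --worker")
script_default = String(take!(buf))

# With `dir`: an explicit cd is written before the worker command.
write_worker_script(buf, "julia --worker"; dir="/scratch/my job")
script_with_dir = String(take!(buf))

print(script_with_dir)
```

Using `dir !== nothing` rather than `dir != nothing` is the more idiomatic identity check in current Julia, but both behave the same here.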

@vchuravy
Member

Condor might need a similar fix to JuliaParallel/MPI.jl#222

@juliohm
Collaborator

juliohm commented Oct 6, 2020

Too old to reproduce. Please retry with the current stable release and reopen the issue if needed.

juliohm closed this as completed Oct 6, 2020
@rgavazzi
Author

As far as I can tell, the problem is still present! I still cannot launch workers with HTCondor, and the symptom is unchanged.
telnet keeps complaining:

telnet: connect to address 192.168.1.3: Connection refused

If I run "nc -l 8200" directly on a machine mmm in the cluster and then run "telnet mmm 8200", the telnet connection succeeds!
It seems to me that the equivalent of the nc -l command is the listen(portnum) call at line 45 of the condor.jl script...
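
One difference from `nc -l` is worth checking: per the Julia Sockets documentation, `listen(port)` without an address listens on localhost only, whereas passing `IPv4(0)` listens on all interfaces. A server bound only to the loopback address would refuse connections arriving at 192.168.1.3. Whether this is the cause here is not confirmed in this thread. The sketch below uses `listenany`, which shares the same default host, so that it picks a free port instead of failing when 8848 is taken.

```julia
using Sockets

# Return the local address a TCPServer is bound to.
bound_address(server) = first(getsockname(server))

# With no host argument the server is bound to localhost,
# the same default that listen(port) uses.
port_a, loopback_server = listenany(8848)

# Passing IPv4(0) binds to every IPv4 interface instead.
port_b, wildcard_server = listenany(IPv4(0), 8848)

loopback_addr = bound_address(loopback_server)
wildcard_addr = bound_address(wildcard_server)

println("default host -> ", loopback_addr, ":", Int(port_a))
println("IPv4(0)      -> ", wildcard_addr, ":", Int(port_b))

close(loopback_server)
close(wildcard_server)
```

On the submitting machine, a tool such as `ss -ltn` or `netstat -ltn` shows which address the Julia process is actually listening on while `addprocs_htc` is waiting.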

Anyhow, I'd be interested to hear from anyone using ClusterManagers with an HTCondor scheduler, whether or not they face the same issue!
