Job cancel due to time limit on IFB cluster #22

Open
florianecoulmance opened this issue Nov 9, 2019 · 7 comments

Comments

@florianecoulmance

Hello everyone,

My psiblast job was cancelled because it hit the time limit, with the following error:

slurmstepd: error: *** JOB 3162130 ON cpu-node-7 CANCELLED AT 2019-11-08T15:32:30 DUE TO TIME LIMIT ***

Because I am running PSI-BLAST against UniRef50, the job will take more than 10 days to run.

What is the time limit?

Can I change it with the following line, and do I have to specify it for the job to work:
#SBATCH --time=20-24:00:00 # days-hh:mm:ss

Have a good weekend,
Floriane

@emorice

emorice commented Nov 9, 2019

Hi, did you try submitting it to the long queue/partition with -p long?
If I read this correctly:

[emorice@clust-slurm-client ~]$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
fast*          up 1-00:00:00      3    mix cpu-node-[9,13-14]
fast*          up 1-00:00:00     54   idle cpu-node-[6-8,10-12,15-62]
long           up 30-00:00:0      2    mix cpu-node-[13-14]
long           up 30-00:00:0     19   idle cpu-node-[10-12,15-30]
bigmem         up 60-00:00:0      1    mix cpu-node-69
training       up 30-00:00:0      5   idle cpu-node-[1-5]
maintenance    up 30-00:00:0     13  drain cpu-node-[70-74,76-83]
maintenance    up 30-00:00:0      1   idle cpu-node-75

The default partition is fast, with a time limit of 1 day, while long has a limit of 30 days. (I am not familiar with Slurm and have not tested this yet; it is just what I understand.)

Also, I believe the purpose of --time is to force a shorter time limit than the queue/partition default (e.g. you want to run a job of unknown length but have it killed if it does not finish in, say, one hour); it does not allow a longer one.
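
To make the comparison between partitions easier to read, here is a minimal sketch that extracts one time limit per partition from the sinfo output quoted above. The sample text is copied from that output as printed (including the truncated TIMELIMIT column); the variable names are illustrative only. On a live cluster, `sinfo -o "%P %l"` should print the partition name and time limit directly.

```shell
# Sample copied from the sinfo output above; values are kept exactly as printed.
sinfo_sample='PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
fast*          up 1-00:00:00      3    mix cpu-node-[9,13-14]
fast*          up 1-00:00:00     54   idle cpu-node-[6-8,10-12,15-62]
long           up 30-00:00:0      2    mix cpu-node-[13-14]
long           up 30-00:00:0     19   idle cpu-node-[10-12,15-30]
bigmem         up 60-00:00:0      1    mix cpu-node-69
training       up 30-00:00:0      5   idle cpu-node-[1-5]
maintenance    up 30-00:00:0     13  drain cpu-node-[70-74,76-83]
maintenance    up 30-00:00:0      1   idle cpu-node-75'

# Skip the header line, then print each partition once with its TIMELIMIT column.
limits="$(printf '%s\n' "$sinfo_sample" | awk 'NR > 1 && !seen[$1]++ { print $1, $3 }')"
printf '%s\n' "$limits"
```

The asterisk after fast marks it as the default partition, which is why a job submitted without -p ends up under the 1-day limit.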

@florianecoulmance
Author

It did not work, but I found a solution:

#SBATCH --partition=long

I put this line in the header of my script, so I guess the limit is now up to 30 days :)

Thank you,

Floriane

@elolaine
Collaborator

elolaine commented Nov 9, 2019

10 days for a psiblast?! That sounds a bit crazy... Is it because you launch all the queries one after the other? Alternatively, you could launch N jobs in parallel for N queries...?

@florianecoulmance
Author

Can I run 400 jobs at the same time on the cluster?
One psiblast query against UniRef50 takes 40 min to run.
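
As a rough check of these numbers, the sketch below works out the total runtime for 400 queries at 40 minutes each, run one after the other, and compares it with a run spread over parallel slots. The figure of 20 simultaneous slots is a purely hypothetical assumption; the real number depends on how busy the cluster is.

```shell
queries=400            # number of psiblast queries mentioned above
minutes_per_query=40   # runtime of one query against UniRef50

# Sequential run: every query waits for the previous one.
total_minutes=$((queries * minutes_per_query))
total_hours=$((total_minutes / 60))   # integer division, rounds down
total_days=$((total_hours / 24))      # integer division, rounds down
echo "sequential: $total_minutes min, about $total_hours h, about $total_days days"

# Parallel run with a hypothetical number of simultaneous slots.
slots=20
batches=$(( (queries + slots - 1) / slots ))   # ceiling of queries / slots
parallel_minutes=$((batches * minutes_per_query))
echo "with $slots slots: $batches batches, $parallel_minutes min"
```

The sequential total is consistent with the "more than 10 days" estimate given at the top of the thread.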

@elolaine
Collaborator

Well, the 400 jobs are probably not going to run all at the same time... But you can submit your 400 jobs independently to the cluster queue, and at least some of them will run in parallel. The point of using the cluster is to be able to run jobs on several CPUs at the same time! (I'll ask the IFB support team if there's a cleverer way to submit the 400 queries.)

@florianecoulmance
Author

Great, thank you!

I found some advice on the internet, but let me know what the IFB support team advises. I do not want to break the cluster...

@elolaine
Collaborator

OK, so, in case you have any questions about using the cluster, you can post them here: https://community.cluster.france-bioinformatique.fr.

For this particular problem, you should try and use Slurm's job array mode. You can find the full documentation about it here: https://slurm.schedmd.com/job_array.html.

Here is an example of a job array launching 30 fastqc runs on 30 different sequences: https://ifb-elixirfr.gitlab.io/cluster/trainings/slurm/ebai2019.html#56. This should be pretty similar to what you want to do.

When sbatch sees the --array option, it launches the job once for each value in the given range (for instance from 0 to 400). Each job loads a variable holding the list of files to analyze. The processing is then run on one of those files, selected with the environment variable $SLURM_ARRAY_TASK_ID, whose value is the index of the current job (0, 1, 2, 3, etc., up to 400).
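
The mechanism described above could look roughly like the following batch script. This is only a sketch under several assumptions that do not come from the thread: a file named queries.txt listing one FASTA path per line, a BLAST database called uniref50, a 2-hour limit per task, and indices 0 to 399 for exactly 400 queries. The helper pick_query is a hypothetical name introduced for illustration.

```shell
#!/bin/bash
#SBATCH --partition=fast      # each task is short, so the default partition may be enough
#SBATCH --array=0-399         # one task per query (assumes exactly 400 queries)
#SBATCH --time=02:00:00       # assumed margin over the ~40 min per query

# Assumed layout: queries.txt lists one FASTA path per line.
QUERY_LIST="${QUERY_LIST:-queries.txt}"

# Print the path stored on line (task id + 1) of the list,
# because array indices start at 0 while line numbers start at 1.
pick_query() {
    local list="$1" task_id="$2"
    sed -n "$((task_id + 1))p" "$list"
}

# Only run psiblast when Slurm has set the array index for this task.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    query="$(pick_query "$QUERY_LIST" "$SLURM_ARRAY_TASK_ID")"
    # The database name uniref50 is an assumption; adapt it to the local setup.
    psiblast -query "$query" -db uniref50 -out "${query%.fasta}.psiblast.out"
fi
```

Slurm also accepts a throttle on the array specification, for example --array=0-399%20, which keeps at most 20 tasks running at once. That may be a reasonable way to avoid monopolizing the cluster.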
