automatic splitting fails on missing throughput file #8792
Comments
It looks like the problem is that the tail step runs without information from the main one. The file
which may have something to do with the odd log
|
last lines of
|
After some debugging, I believe the problem is here: CRABServer/src/python/TaskWorker/Actions/PreDAG.py Lines 112 to 127 in 3b2e705
Line 123 is wrong; it should instead be
That will make the error in this issue go away. But since things usually work, there may be something more to it. |
I believe this problem only happens when all CRABServer/src/python/TaskWorker/Actions/PreDAG.py Lines 200 to 201 in 3b2e705
but, for reasons obscure to me, processFailed is set to False, which makes failed jobs NOT skipped (!!) CRABServer/src/python/TaskWorker/Actions/PreDAG.py Lines 123 to 124 in 3b2e705
while "normally" the default processFailed=True is used.
The variable naming is certainly confusing! CRABServer/src/python/TaskWorker/Actions/PreDAG.py Lines 117 to 118 in 3b2e705
At this point I have no idea why the
But failed jobs have no throughput report, so they can't be used! |
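To make the flag semantics described above easier to follow, here is a toy sketch. It is not the PreDAG.py code: the functions `completed_jobs` and `jobs_missing_report`, the `process_failed` argument, and the status strings are all hypothetical stand-ins, assuming only what this thread states (with the flag True, failed jobs are recorded and skipped; with it False, they are returned like finished ones).

```python
def completed_jobs(statuses, process_failed=True):
    """Toy stand-in for the behavior discussed in this thread (hypothetical).

    statuses maps a job id to 'finished' or 'failed'.
    process_failed=True : failed jobs are recorded and skipped.
    process_failed=False: failed jobs are returned like finished ones.
    """
    usable = []
    failed = []
    for job_id, status in sorted(statuses.items()):
        if status == 'failed' and process_failed:
            failed.append(job_id)  # remembered as failed, not returned as usable
            continue
        usable.append(job_id)
    return usable, failed


def jobs_missing_report(job_ids, reports):
    """Return the job ids that have no entry in reports (hypothetical helper)."""
    return [job_id for job_id in job_ids if job_id not in reports]
```

In this sketch, calling `completed_jobs` with `process_failed=False` when every job has failed returns all of them as usable, and `jobs_missing_report` then flags each one as lacking a report, which would match the symptom in the issue title if the real code behaves similarly.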
I made that task's DAG complete successfully by rerunning PreDag manually after changing
estimates = set(self.completedJobs(stage='processing', processFailed=True)) which basically forces submission of a tail job with the same configuration as the processing one (OK, since the failure was an accidental 8028). But I am still worried that making the change in the code for everybody may trigger problems in other situations which I cannot imagine or test now. |
Maybe there are situations where processing jobs fail but still produce a report? E.g. if they hit the time limit? CRABServer/scripts/TweakPSet.py Lines 209 to 212 in 3b2e705
Or will they count as successful? |
that
No comments, no issue. I am still unsure what to do. |
Some (but not all) probe jobs failing together with all processing jobs failing is, all in all, a very rare case. |
I have prepared PR #8795 with that fix, but I need to think more about possible side effects. |
I am now convinced that the problematic line
is only relevant when all processing jobs fail, and therefore it is correct to change it. If the problem never surfaced before, it must be because all processing jobs failing is very rare and never happened together with some probes also failing. |
I found this while looking at a stuck automatic splitting task in the CI pipeline:
https://cmsweb-testbed.cern.ch/crabserver/ui/task/241113_203248%3Acrabint1_crab_20241113_213248