StatusTracking should handle tail jobs in automatic splitting #8794

Closed
belforte opened this issue Nov 14, 2024 · 4 comments

@belforte
Member

If the processing step fails but the task completes via the tail jobs, the ST script does not notice: it keeps seeing a failed job in the status summary and tries to resubmit. So the test is stuck forever in "testResubmitted" even though the task is OK.

@belforte
Member Author

this was triggered by https://cmsweb-testbed.cern.ch/crabserver/ui/task/241113_203248%3Acrabint1_crab_20241113_213248
even after I made it complete successfully, the CI pipeline test was failing

@belforte
Member Author

At first sight, one problem is here:

# remove failed probe jobs (job id of X-Y kind) if any from count
for job in task['jobs'].keys():
    if '-' in job and task['jobs'][job]['State'] == 'failed':
        task['jobsPerStatus']['failed'] -= 1

That code also removes failed tail jobs from the count, which is not what the comment says.
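A minimal sketch of a narrower check, assuming (as the job list in this issue suggests) that probe jobs are exactly those whose id starts with `0-`, while tail jobs have ids like `1-1`. The helper names `is_probe` and `discount_failed_probes` are hypothetical and not part of the ST script; the `task` layout is taken from the snippet above.

```python
def is_probe(job_id):
    """Assumption: probe jobs are exactly those whose id starts with '0-'."""
    return job_id.startswith('0-')


def discount_failed_probes(task):
    """Subtract failed probe jobs, and only those, from the failed count."""
    for job_id, info in task['jobs'].items():
        if is_probe(job_id) and info['State'] == 'failed':
            task['jobsPerStatus']['failed'] -= 1
    return task


# Illustrative data shaped like the task in this issue:
# one failed probe, one failed processing job, one finished tail job.
task = {
    'jobs': {
        '0-4': {'State': 'failed'},
        '1': {'State': 'failed'},
        '1-1': {'State': 'finished'},
    },
    'jobsPerStatus': {'finished': 5, 'failed': 2},
}
discount_failed_probes(task)
print(task['jobsPerStatus']['failed'])  # 1: only the probe was discounted
```

With this check a failed tail job such as `1-1` would stay in the failed count, which matches what the original comment describes.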

@belforte
Member Author

But the real problem is that this script test is based on job counting. In this case there are 5 probes (one failing), 1 processing (failed), one tail (OK). Namely

(Pdb) status_command_output['jobsPerStatus']
{'finished': 5, 'failed': 2}
(Pdb) status_command_output['jobList']
[['finished', '0-5'], ['finished', '0-3'], ['failed', '0-4'], ['finished', '0-1'], ['finished', '0-2'], ['failed', '1'], ['finished', '1-1']]
(Pdb) 

We need some smarter logic to tell that "yes, one job failed, but the tail stage took care of it". Of course we can't gloss over failed tails as is done for probes. And if we ever run a larger task with automatic splitting, we will also face the problem that there can be multiple tail stages and the number of jobs is not known in advance.

@belforte
Member Author

maybe "as simple as"

  • if there is any job id of the form 0-x, call the task automatic
  • if automatic, check whether all processing jobs were OK; if not, check that all tail jobs are OK
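The two bullets above could be sketched roughly as follows. This is only an illustration, not the actual ST script: the function names `classify` and `task_succeeded` are hypothetical, the job id convention (`0-x` for probes, a plain number for processing, any other `N-M` for tails) is inferred from the job list shown earlier, and the check assumes all jobs have already reached a final state.

```python
def classify(job_id):
    """Assumed id convention: '0-x' probe, 'N' processing, other 'N-M' tail."""
    if job_id.startswith('0-'):
        return 'probe'
    if '-' in job_id:
        return 'tail'
    return 'processing'


def task_succeeded(job_list):
    """job_list is a list of [state, job_id] pairs, as in jobList above."""
    states = {'probe': [], 'processing': [], 'tail': []}
    for state, job_id in job_list:
        states[classify(job_id)].append(state)

    automatic = bool(states['probe'])  # any 0-x job means automatic splitting
    processing = states['processing']
    processing_ok = bool(processing) and all(s == 'finished' for s in processing)

    if not automatic or processing_ok:
        return processing_ok
    # Some processing job failed: accept only if tails exist and all finished.
    tails = states['tail']
    return bool(tails) and all(s == 'finished' for s in tails)


# The job list reported in this issue.
job_list = [['finished', '0-5'], ['finished', '0-3'], ['failed', '0-4'],
            ['finished', '0-1'], ['finished', '0-2'], ['failed', '1'],
            ['finished', '1-1']]
print(task_succeeded(job_list))  # True
```

Because the decision looks at job kinds rather than counts, it does not depend on how many tail stages or tail jobs there are. Failed probes are ignored here, while a failed tail makes the whole check fail.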
