Cleanup after switch to input args #8893

Merged
merged 1 commit into from
Jan 30, 2025

Conversation

belforte
Member

No description provided.

@belforte belforte force-pushed the CleanupAfterSwitchTo-input_args branch from 7bc90d3 to b4657d5 Compare January 27, 2025 22:07

@belforte
Member Author

belforte commented Jan 27, 2025

now testing in test2: https://gitlab.cern.ch/crab3/CRABServer/-/pipelines/10013422
BUG 1 fixed
test used crab-dev and triggered a bug in DagmanCreator: requiredmicroarch was not consistently set to a string.

@belforte belforte added To be tested PR: do not merge This PR is a work in progress and not ready to be merged labels Jan 27, 2025

@belforte
Member Author

belforte commented Jan 27, 2025

new test pipeline after above fix: https://gitlab.cern.ch/crab3/CRABServer/-/pipelines/10013506
BUG 2 fixed in 02b8699
CV fails because crab preparelocal fails with

  File "/cvmfs/cms.cern.ch/share/cms/crab-dev/v3.241218.00/lib/CRABClient/Commands/preparelocal.py", line 153, in prepareDir
    inputArgsForScript[key] = jobArgs[value]
KeyError: 'CRAB_Archive'

so pipeline does not submit CCV
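
A minimal sketch of how a name-mapping loop like the one in the traceback could report missing keys instead of stopping at the first KeyError. The helper `build_script_args`, the `name_map` dictionary and the sample data are hypothetical; only `CRAB_Archive` and the `inputArgsForScript[key] = jobArgs[value]` pattern come from the traceback above, and this is not the actual preparelocal.py code.

```python
def build_script_args(name_map, job_args):
    """Copy job_args[value] into the result under key, for each (key, value).

    Returns the mapped arguments plus the list of source names that were
    not found in job_args, so the caller can decide how to react.
    """
    mapped = {}
    missing = []
    for key, value in name_map.items():
        if value in job_args:
            mapped[key] = job_args[value]
        else:
            missing.append(value)
    return mapped, missing


# Hypothetical inputs: the job arguments no longer carry CRAB_Archive.
name_map = {"archive": "CRAB_Archive", "jobid": "CRAB_Id"}
job_args = {"CRAB_Id": "1"}
mapped, missing = build_script_args(name_map, job_args)
print(mapped)
print(missing)
```

With a check like this the client could print which argument names the server stopped providing, which makes a client/server format mismatch easier to spot.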

@belforte
Member Author

belforte commented Jan 27, 2025

during automatic splitting, the report for a probe job on the WN is called e.g. jobReport.json.5, while HTCondor will want to transfer, and PostJob will look for,
JOB AD: TransferOutput = "jobReport.json.0-5,WMArchiveReport.json.0-5"
I need to keep renaming files in CMSRunAnalysis.py, and this means that I need to keep the confusing functionality overlap with CRAB_Destination until I understand this part well enough to do a proper cleanup.
BUG 3 fixed in 88de9c9
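
The renaming described above can be sketched as follows. This is only an illustration under stated assumptions: `rename_report` is a hypothetical helper, not the code in CMSRunAnalysis.py, and the file names are taken from the example in this comment (jobReport.json.5 on the WN, jobReport.json.0-5 in TransferOutput).

```python
import os
import tempfile


def rename_report(workdir, job_number, crab_id, base="jobReport.json"):
    """Rename <base>.<job_number> to <base>.<crab_id> inside workdir.

    Returns the new path, or None when the source file does not exist.
    """
    src = os.path.join(workdir, f"{base}.{job_number}")
    dst = os.path.join(workdir, f"{base}.{crab_id}")
    if not os.path.exists(src):
        return None
    os.rename(src, dst)
    return dst


# Simulate a probe job: the report is written with the integer index,
# but the file to transfer must carry the CRAB id "0-5".
with tempfile.TemporaryDirectory() as workdir:
    open(os.path.join(workdir, "jobReport.json.5"), "w").close()
    new_path = rename_report(workdir, 5, "0-5")
    print(os.path.basename(new_path))
```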

Jobs get held and retried and then fail with

[crabtw@vocms059 cluster10080353.proc0.subproc0]$ condor_q 10080374.0 -af holdreason
Transfer output files failure at the execution point while sending files to access point vocms059. Details: reading from file /storage/local/data1/condor/execute/dir_3383924/glide_3MGoC2/execute/dir_2079020/jobReport.json.0-1: (errno 2) No such file or directory
[crabtw@vocms059 cluster10080353.proc0.subproc0]$ 

problem is still there after the jobNumber fix for BUG 6

damn.... the failing job has
== JOB AD: CRAB_Id = 0 - 1
it should be
== JOB AD: CRAB_Id = "0-1"
!!!!!
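
The `0 - 1` form looks like what one gets when the value is written into the ad without quotes and is then read as an arithmetic expression rather than a string. A minimal sketch of quoting a value as a ClassAd string literal before writing it; `quote_classad_string` and `format_ad` are hypothetical helpers, not functions from CRABServer or the HTCondor bindings.

```python
def quote_classad_string(value):
    """Return value as a double-quoted string literal.

    Backslashes and double quotes inside the value are escaped so the
    result stays a single string token.
    """
    text = str(value)
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def format_ad(name, value):
    """Format one 'name = "value"' line with the value forced to a string."""
    return f"{name} = {quote_classad_string(value)}"


print(format_ad("CRAB_Id", "0-1"))
```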

crab kill did nothing on that task. To be investigated.
BUG 5 Fixed in #8901
Task URL: https://cmsweb-test2.cern.ch/crabserver/ui/task/250127_223131%3Abelforte_crab_20250127_233124

problem is

Failure message from server:	Problem handling 250127_223131:belforte_crab_20250127_233124 because of 'clusterid' failure, traceback follows
				Traceback (most recent call last):
				  File "/data/srv/current/lib/python/site-packages/TaskWorker/Actions/Handler.py", line 98, in executeAction
				    output = work.execute(nextinput, task=self.task, tempDir=self.tempDir)
				  File "/data/srv/current/lib/python/site-packages/TaskWorker/Actions/DagmanKiller.py", line 109, in execute
				    self.executeInternal(*args, **kwargs)
				  File "/data/srv/current/lib/python/site-packages/TaskWorker/Actions/DagmanKiller.py", line 42, in executeInternal
				    if not self.task['tw_name'] or not self.task['clusterid']:
				KeyError: 'clusterid'

understood: the problem is this line, introduced in 8ed2195. At that time I did not notice that the query used to retrieve tasks to process does not retrieve the full info from the DB; in particular clusterid is always missing from the task dictionary at this point. Need to decide which is the best way ahead.
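
One possible defensive option, shown only as a sketch and not necessarily what #8901 does: treat a missing key the same as an empty value by using `dict.get`. The function name `missing_submission_info` is hypothetical; the two keys are the ones in the failing line of DagmanKiller.py.

```python
def missing_submission_info(task):
    """Return True when tw_name or clusterid is absent or empty.

    Unlike task['clusterid'], task.get('clusterid') returns None for a
    missing key, so no KeyError is raised.
    """
    return not task.get("tw_name") or not task.get("clusterid")


# The task dictionary as seen by the killer: clusterid was never retrieved.
print(missing_submission_info({"tw_name": "tw1"}))
print(missing_submission_info({"tw_name": "tw1", "clusterid": 10080353}))
```

Note that this only avoids the crash. If clusterid is never retrieved by the query, the check would always report missing info, so the query itself (or where the check is done) still needs a decision.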

@belforte
Member Author

belforte commented Jan 27, 2025

submitted CCV via Jenkins: #8894
mixed result: 8 tests failed, others OK
BUG 6 fixed via 754ad1d 7ebdb5b and 0c08e3c

FAILED TESTS:
maxMemoryMB-check.sh 250127_234911:cmsbot_crab_maxMemoryMB - 1
maxJobRuntimeMin-check.sh 250127_234914:cmsbot_crab_maxJobRuntimeMin - 1
numCores-check.sh 250127_234919:cmsbot_crab_numCores - 1
scriptExe-check.sh 250127_234922:cmsbot_crab_scriptExe - 1
scriptArgs-check.sh 250127_234925:cmsbot_crab_scriptArgs - 1
lumiMaskFile-check.sh 250127_234946:cmsbot_crab_lumiMaskFile - 1
lumiMaskUrl-check.sh 250127_234949:cmsbot_crab_lumiMaskUrl - 1
runRange-check.sh 250127_234956:cmsbot_crab_runRange - 1
  • maxMemoryMB , maxJobRuntimeMin , numCores were due to missing ads in Job.submit - fixed

  • lumiMaskFile fails because the ad format has changed to CRAB_AlgoArgs = "....'lumis': ['1,10 instead of lumis": \["1,10' I am not sure it is worth doing anything other than changing our test when we merge the final PR. Same for lumiMaskUrl, where the test looks for grep -q '== JOB AD: CRAB_AlgoArgs.*"273158"' crab_lumiMaskUrl/results/job_out.1.0.txt but the file now has single quotes around run number 273158, not double quotes

  • runRange: again single vs. double quotes in the ad value.

  • the new format is "better" since it is the same as in the DataBase

  • Idea: maybe I can remove the quotes from the grep in the test; it will not be "as strict" but should still do

    • scriptExe* real bug. When the script runs on the WN it gets None as argument, instead of the job ID. Indeed job stdout has
    • ==== Will execute stdbuf -oL -eL /srv/SIMPLE-SCRIPT.sh None > cmsRun-stdout.log.tmp 2>&1
    • need to add jobNumber to input_args.json !! As usual it is not immediately clear what to use for automatic splitting (see also BUG 3): the integer index ($count) of the dagman spec for that job, or the crab ID in the 0-x, 1-x... etc. format
  • same for scriptArgs: the arguments are passed fine, but the test also looks for the first arg (jobNumber)
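
An alternative to dropping the quotes from the test altogether is a pattern that accepts either quote style around the run number. A minimal sketch in Python rather than grep; the two sample lines are made up for illustration and do not reproduce the real content of job_out.1.0.txt, only the `== JOB AD: CRAB_AlgoArgs` prefix and the run number 273158 come from the test quoted above.

```python
import re

# Accept the run number wrapped in either double or single quotes.
RUN_PATTERN = re.compile(r"""== JOB AD: CRAB_AlgoArgs.*["']273158["']""")


def has_run(line):
    """Return True if the ad line mentions run 273158 in quotes of either kind."""
    return RUN_PATTERN.search(line) is not None


# Hypothetical ad lines in the old (double quote) and new (single quote) style.
old_style = '== JOB AD: CRAB_AlgoArgs = {"runs": ["273158"]}'
new_style = "== JOB AD: CRAB_AlgoArgs = {'runs': ['273158']}"
print(has_run(old_style), has_run(new_style))
```

This stays stricter than matching the bare number, since an unquoted 273158 elsewhere in the line would not match.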

@belforte
Member Author

belforte commented Jan 27, 2025

CRAB_Workflow is missing in OpenSearch so jobs do not appear in Grafana

  • odd because I find it in job classAds

BUG 4 Fixed in 75fbb0a
ok. I was defining CRAB_workflow with lowercase w. Grafana is clearly case sensitive.
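
A small check that could catch this kind of mistake before it reaches monitoring, sketched under assumptions: `find_case_mismatches` is a hypothetical helper and the list of expected names is illustrative, with only CRAB_Workflow and CRAB_workflow taken from this comment.

```python
def find_case_mismatches(ad_names, expected_names):
    """Map each ad name that differs from an expected name only by case
    to the expected spelling."""
    expected_by_lower = {name.lower(): name for name in expected_names}
    mismatches = {}
    for name in ad_names:
        expected = expected_by_lower.get(name.lower())
        if expected is not None and expected != name:
            mismatches[name] = expected
    return mismatches


# The job ad was defined with a lowercase "w".
result = find_case_mismatches(
    ["CRAB_workflow", "CRAB_Id"],
    ["CRAB_Workflow", "CRAB_Id"],
)
print(result)
```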

belforte added a commit to belforte/CRABServer that referenced this pull request Jan 28, 2025
belforte added a commit to belforte/CRABServer that referenced this pull request Jan 28, 2025

@belforte
Member Author

belforte commented Jan 28, 2025

trying the CCV test again via Jenkins; now I expect only scriptExe and scriptArgs to fail.
This time I also submitted CV, which should now work since I fixed the problem with preparelocal (BUG 2)
Jenkins test: #8896

  • CV is OK
  • CCV script* failed as expected; runRange appears to be still running, which is odd since the job completed 30 min ago


@belforte
Member Author

belforte commented Jan 29, 2025

test again CV and CCV via Jenkins

All OK at last! The only remaining problem is BUG 3 about Automatic Splitting

@belforte
Member Author

belforte commented Jan 30, 2025

Fixed all bugs found so far.
New automatic split task - processing stage jobs are running

New pipeline CV and CCV OK

belforte added a commit that referenced this pull request Jan 30, 2025
the current code does not really set to an integer, but to a string. Somehow I did not properly understand all the parameter transformation/formatting in the DagmanCreator code. No point in digging in since it works in the new code in #8893
@belforte belforte force-pushed the CleanupAfterSwitchTo-input_args branch from 88de9c9 to a3457ab Compare January 30, 2025 16:17
@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 70 comments to review
  • Pycodestyle check: succeeded
    • 268 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2361/artifact/artifacts/PullRequestReport.html

@belforte belforte merged commit 367e569 into dmwm:master Jan 30, 2025
2 checks passed
@belforte belforte deleted the CleanupAfterSwitchTo-input_args branch January 30, 2025 16:24
@belforte belforte removed To be tested PR: do not merge This PR is a work in progress and not ready to be merged labels Feb 3, 2025