
WMCore debugging tools

todor-ivanov edited this page Apr 3, 2020 · 11 revisions

This wiki is meant to list debugging use cases, either to solve/debug Operations issues or internal Dev ones.

[Ops] Debug whether all jobs have been recovered via ACDCs

Problem: Ops ask us to check why a workflow hasn't processed 100% of the lumi sections, even though all the failures have been recovered via ACDCs.

Solution: first we need to make sure that ACDCs have been created AND executed for every single task path (fileset_name, in terms of ACDC collection).

Details: what we need to retrieve/check is:

  • did the ACDCs get created after the initial/original workflow moved to completed status?
  • list the number of jobs/lumis in each fileset_name, from the ACDC collection
  • query reqmgr2 for ACDC workflows recovering that workflow (and fetch their InitialTaskPath)
  • make sure that those ACDC workflows are in completed status
  • anything else
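
The last three checks above can be sketched as a small coverage report. This is only an illustration under stated assumptions: the input shapes (a mapping of fileset_name to job count taken from the ACDC collection, and a list of ACDC request dictionaries with RequestName, InitialTaskPath and RequestStatus keys as fetched from reqmgr2) are hypothetical, the set of statuses treated as "done" is a guess, and fetching the data is left out entirely.

```python
# Assumed input shapes (hypothetical, not a WMCore API):
#   filesets:      {fileset_name: number_of_jobs} taken from the ACDC collection
#   acdc_requests: [{"RequestName", "InitialTaskPath", "RequestStatus"}, ...] from reqmgr2
DONE_STATUSES = {"completed", "closed-out", "announced"}  # assumption: "completed" or later


def acdc_coverage(filesets, acdc_requests, done_statuses=DONE_STATUSES):
    """Return task paths without any ACDC and ACDC requests that are not done yet."""
    report = {"not_created": [], "not_completed": []}
    for task_path in sorted(filesets):
        # An ACDC workflow recovers a task path when its InitialTaskPath matches it.
        matching = [r for r in acdc_requests if r["InitialTaskPath"] == task_path]
        if not matching:
            report["not_created"].append(task_path)
            continue
        for request in matching:
            if request["RequestStatus"] not in done_statuses:
                report["not_completed"].append(request["RequestName"])
    return report
```

An empty report means that every fileset_name with failures has at least one ACDC workflow and that all of those workflows have reached one of the assumed "done" statuses.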

[Ops] Find out which run/lumi is missing in the output dataset

Problem: Ops ask us to investigate why the output datasets are missing statistics, even though there are no job failures reported (or they have all been recovered).

Solution: there is not necessarily a single solution. Part of the solution above applies here as well, so first check whether all lumis have been recovered. In addition, we could have a tool that takes a workflow as input, finds all the run/lumis meant to be processed, randomly selects one output dataset and compares it against the input dataset, finally yielding a list of run/lumis missing in the output dataset.
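
The comparison step of such a tool could look like the sketch below. It assumes the run/lumi information has already been retrieved into plain dictionaries mapping a run number to its lumi sections; the function name and data shapes are hypothetical, and how the lumis are fetched (and whether a lumi mask applies) is not covered.

```python
def missing_lumis(expected, produced):
    """Return {run: sorted lumis} present in `expected` but absent from `produced`.

    Both arguments map a run number to an iterable of lumi section numbers.
    """
    missing = {}
    for run, lumis in expected.items():
        # A run that is entirely absent from the output counts as fully missing.
        diff = set(lumis) - set(produced.get(run, ()))
        if diff:
            missing[run] = sorted(diff)
    return missing
```

For example, `missing_lumis({1: [1, 2, 3]}, {1: [1, 3]})` returns `{1: [2]}`.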

[Dev] Debugging subscriptions not finished

Problem: When we are completing the agent draining procedure, there are some rare cases where subscriptions are stuck in an unfinished state (finished=0). It usually also means that there is at least one GQ workqueue element in Running state (and its equivalent LQ workqueue/workqueue_inbox element).

Solution: there are many possible reasons for having a subscription stuck, so there is no common solution. Among the checks we can perform are: correlate the subscription to its fileset and workflow task; check whether they have files either in the available or acquired tables.
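
A possible way to run these checks in one go is sketched below. The table and column names (wmbs_subscription, wmbs_workflow, wmbs_fileset, wmbs_sub_files_available, wmbs_sub_files_acquired) reflect our reading of the WMBS schema and should be treated as assumptions to verify against the agent database; the function only needs a DB-API connection and is not part of WMCore.

```python
# Assumption: table/column names follow the WMBS schema; verify before use.
STUCK_SUBSCRIPTIONS_SQL = """
SELECT s.id,
       w.name,
       w.task,
       f.name,
       (SELECT COUNT(*) FROM wmbs_sub_files_available a WHERE a.subscription = s.id),
       (SELECT COUNT(*) FROM wmbs_sub_files_acquired q WHERE q.subscription = s.id)
FROM wmbs_subscription s
JOIN wmbs_workflow w ON w.id = s.workflow
JOIN wmbs_fileset f ON f.id = s.fileset
WHERE s.finished = 0
ORDER BY s.id
"""


def stuck_subscriptions(conn):
    """List unfinished subscriptions with their workflow task, fileset and file counts."""
    cursor = conn.cursor()
    cursor.execute(STUCK_SUBSCRIPTIONS_SQL)
    columns = ("subscription", "workflow", "task", "fileset", "available", "acquired")
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

A subscription that shows up here with non-zero available or acquired counts still has files waiting to be processed, which narrows down where to look next.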

Details: further details can be extracted from this github issue: https://github.com/dmwm/WMCore/issues/9568

[Dev/Ops] Debug where (in which component of the system) a Workflow is stuck

Problem: While traversing the whole chain from ReqMgr to the final worker node and back, a workflow can get stuck in any state (from 'new' to 'announced' [1]). For each of those states there is a respective component in the system which holds the workflow at that moment.

[1] https://github.com/dmwm/WMCore/blob/master/doc/wmcore/RequestStateTransition.png

Solution: As an example, the WF may stay in 'acquired' or 'running-open' in the agent, while Condor may not have generated any jobs for it. The corresponding action in this case is to look for the WF in the local workqueue and for its jobs (if there are any) in the condor queue, and then compare the results. One way of querying the local workqueue is to open a tunnel to the agent and access the CouchDB Futon interface. An alternative to that approach is to parse the WorkQueueManager logs. For the condor queue, a simple condor_q with the proper constraints will do.
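
As an illustration of the condor_q part, the sketch below builds such a command for a given workflow name. It assumes that the agent tags its jobs with a WMAgent_RequestName ClassAd attribute; that attribute name, like the helper itself, is an assumption to check against the jobs in your own schedd.

```python
def condor_q_command(request_name, attributes=("ClusterId", "ProcId", "JobStatus")):
    """Build a condor_q invocation listing the jobs of one workflow.

    Assumption: jobs carry the workflow name in the WMAgent_RequestName attribute.
    """
    constraint = 'WMAgent_RequestName == "{}"'.format(request_name)
    # -constraint filters the jobs, -af (autoformat) prints only the given attributes.
    return ["condor_q", "-constraint", constraint, "-af"] + list(attributes)
```

The returned list can be passed directly to `subprocess.run` on the agent node. An empty result for a workflow that the local workqueue reports as running suggests that job creation or submission is where the workflow got stuck.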

Details: This was just one of the possible status transitions discussed above. We need to add similar details for the rest.

[Dev/Ops] Find all the output blocks for a given workflow name

Problem: There might be a situation where we need to invalidate (in PhEDEx and DBS) blocks produced by a given workflow. Among the reasons, it could be that there were two workflows writing to the same output (like a duplicate ACDC).

Solution: we need to find out which agents were processing that given workflow. With that information in hand, we can then query their local SQL database and list all the output blocks (from all the tasks). What to do with the output blocks afterwards is out of the scope of this debugging.

Details: a SQL query like the following can yield all the output blocks (starting from files associated to blocks) for a given workflow:

SELECT DISTINCT dbsbuffer_block.id AS blockid, dbsbuffer_block.blockname AS blockname
  FROM dbsbuffer_block
  INNER JOIN dbsbuffer_file ON dbsbuffer_block.id = dbsbuffer_file.block_id
  INNER JOIN dbsbuffer_workflow ON dbsbuffer_file.workflow = dbsbuffer_workflow.id
  WHERE dbsbuffer_workflow.name = 'cmsunified_ACDC0_Run2016B-v2-ZeroBias2-21Feb2020_UL2016_HIPM_1068p1_200313_133114_5167';