
Fallback Mechanism in Unified PanDA Queues

Paul Nilsson edited this page Feb 5, 2025 · 1 revision

What Are Unified PanDA Queues?

Unified PanDA Queues, also known as Grand-Unified-Queues (GUQs), allow production and analysis activities to share the same PanDA queues. Fairshare allocations between production and analysis tasks are adjusted dynamically, which improves overall resource utilization. Analysis activities benefit in particular: they get better access to computing resources, which increases the efficiency and success rate of analysis jobs.

Fallback Mechanism

When a unified PanDA queue encounters an issue with its primary data writing endpoint (e.g., write_lan/0 being down), it can automatically fall back to an alternative endpoint, such as write_lan/1. This capability is particularly beneficial for queues that use a remote Rucio Storage Element (RSE) to store job output (e.g., VP queues).

How the Pilot Determines the Destination

The pilot dynamically selects the appropriate destination based on:

  1. Job Type:
    • For analysis jobs, the Pilot considers write_lan_analysis and write_lan activities
    • For production jobs, the default selection is write_lan
  2. Fallback Logic:
    • If the preferred write_lan/0 endpoint is unavailable, the Pilot automatically tries write_lan/1
    • This fallback helps prevent stage-out failures caused by temporary storage or network issues
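
The selection rules above can be sketched in a few lines of Python. This is a minimal illustration, not the Pilot's actual code: the function names, the endpoints_by_activity mapping, the is_available callback and the RSE names are all hypothetical, and trying write_lan_analysis before write_lan for analysis jobs is an assumption (the text above only says that both activities are considered).

```python
from typing import Callable, Dict, List, Optional


def candidate_activities(is_analysis: bool) -> List[str]:
    """Return the activities to consult for a job (ordering is an assumption)."""
    if is_analysis:
        return ["write_lan_analysis", "write_lan"]
    return ["write_lan"]


def select_endpoint(
    is_analysis: bool,
    endpoints_by_activity: Dict[str, List[str]],
    is_available: Callable[[str], bool],
) -> Optional[str]:
    """Pick the first available endpoint, in priority order.

    endpoints_by_activity maps an activity name to its ordered endpoints,
    so index 0 plays the role of write_lan/0 and index 1 of write_lan/1.
    """
    for activity in candidate_activities(is_analysis):
        for endpoint in endpoints_by_activity.get(activity, []):
            if is_available(endpoint):
                return endpoint
    return None  # no usable endpoint; the caller must handle the failure


# Hypothetical queue with a primary and one fallback endpoint.
queue = {"write_lan": ["RSE_PRIMARY", "RSE_FALLBACK"]}
chosen = select_endpoint(False, queue, lambda rse: rse != "RSE_PRIMARY")
print(chosen)  # RSE_FALLBACK, because the primary is reported as down
```

Keeping the availability check as a callback keeps the sketch independent of how the Pilot actually detects that an endpoint is down, which this page does not describe.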

Exclusion of job.nucleus as an Alternative

One important restriction applies to the fallback mechanism: job.nucleus (the originally desired destination of the output data) is never considered as an alternative stage-out destination.

This exclusion is managed through the "Associated Pilot Copytools" configuration within the Computing Resource Information Catalog (CRIC). Each PanDA queue has its own settings in CRIC, defining which copy tools (data transfer utilities) can be used for job output transfer.
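
However the exclusion is configured, its effect on the list of fallback candidates can be pictured as a simple filter. The sketch below is illustrative only: alternative_destinations and the RSE names are hypothetical, and it assumes that the nucleus can be compared directly with an endpoint name, which may not match how the Pilot represents job.nucleus.

```python
from typing import List


def alternative_destinations(endpoints: List[str], nucleus: str) -> List[str]:
    """Return fallback candidates for an ordered endpoint list.

    endpoints[0] is treated as the primary (write_lan/0); everything after
    it is a potential alternative, except the entry matching the nucleus.
    """
    return [endpoint for endpoint in endpoints[1:] if endpoint != nucleus]


# Hypothetical example: RSE_B is the job's nucleus, so only RSE_C remains.
print(alternative_destinations(["RSE_A", "RSE_B", "RSE_C"], "RSE_B"))
```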