Enhancement: Implement monitoring for suspicious replica recoverer #862

Open
haozturk opened this issue Nov 12, 2024 · 2 comments

@haozturk
Contributor

Enhancement Description

We need monitoring for the actions that the replica recoverer daemon takes.

Use Case

  • To see the impact of this daemon, i.e. how many replicas it "fixes".
  • If a user reports corrupt replicas, we can check this monitoring to see whether they were caught by this machinery.

Possible Solution

I think we simply need two things: the suspicious replicas that the daemon processes (file name and RSE, or simply the PFN) and the action it took on each (ignore, create rule, declare bad, or declare temporarily unavailable). Ideally these should be pushed to Rucio event monitoring.
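
To make the proposal concrete, here is a minimal sketch of what one such monitoring record could look like. Everything in it is an assumption for illustration: the class names, the field names, the `event_type` string and the example RSE and file name are hypothetical and are not taken from Rucio's code or its event schema. It only shows the replica identity plus the action being serialized to JSON, which whatever transport is eventually chosen could carry.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class RecovererAction(str, Enum):
    """The four actions listed above (identifier spellings are hypothetical)."""
    IGNORE = "ignore"
    CREATE_RULE = "create_rule"
    DECLARE_BAD = "declare_bad"
    DECLARE_TEMPORARY_UNAVAILABLE = "declare_temporary_unavailable"


@dataclass
class SuspiciousReplicaEvent:
    """One record per suspicious replica processed by the daemon."""
    scope: str
    name: str
    rse: str
    action: RecovererAction
    pfn: Optional[str] = None         # optional, if only the PFN is known
    created_at: Optional[str] = None  # ISO 8601 timestamp, filled in if missing

    def to_message(self) -> dict:
        payload = asdict(self)
        # Store the plain string so the record is JSON-serializable everywhere.
        payload["action"] = self.action.value
        payload["created_at"] = (
            self.created_at or datetime.now(timezone.utc).isoformat()
        )
        # "suspicious-replica-recoverer" is a placeholder event type name.
        return {"event_type": "suspicious-replica-recoverer", "payload": payload}


# Hypothetical example values, not real CMS data.
event = SuspiciousReplicaEvent(
    scope="cms",
    name="/store/example/file.root",
    rse="T2_EXAMPLE",
    action=RecovererAction.DECLARE_BAD,
)
print(json.dumps(event.to_message(), indent=2))
```

With records of this shape, counting events per `action` would answer the "how many replicas it fixes" question, and filtering on `name` and `rse` would answer the user-report question.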

Related Issues

No response

@haozturk haozturk self-assigned this Nov 12, 2024
@haozturk
Contributor Author

High-level monitoring is already available in the FTS monitoring dashboard [1]. Rucio uses the Recovery activity for the transfers that replace bad replicas with healthy ones. The dashboard already shows that this activity picked up after we enabled the daemon.

[1] https://monit-grafana.cern.ch/d/mtQFDScGk/cms-fts-metrics?from=1730234072240&orgId=11&to=1731493015953&var-activity=Recovery&var-bin=1h&var-dst_rse=All&var-fts_server=All&var-group_by=dst_rse&var-src_rse=All&var-vo=cms&var-protocol=All&viewPanel=11&var-auth_method=All

@haozturk
Contributor Author

Waiting for rucio/rucio#7167.
