Feature: Deploy suspicious replica recoverer daemon #806
Comments
[1]
[2] #692 (comment) |
The daemon is deployed. Now, I'm working on making it work. This configuration has been added initially [1] in addition to the config file living as a secret [2]. Currently, it's only working on Caltech. There were 42 suspicious replicas at Caltech and it set these 42 replicas [1]
[2]
[3]
|
We increased Now I enabled the daemon for Estonia. These are the suspicious replicas there [2]. I'll monitor the situation. [1] dmwm/rucio-flux#339
|
What is the count above? For example, in the first line, what is 48? |
The number of times the replica has been declared suspicious. |
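To make the meaning of that count concrete, here is a minimal sketch that tallies suspicious declarations per replica. The record layout, the RSE and file names, and the `min_count` threshold are all assumptions for illustration; this is not Rucio's schema or its default settings.

```python
from collections import Counter

# Hypothetical declaration log: one (rse, lfn) entry per "suspicious" declaration.
declarations = [
    ("T2_EXAMPLE_A", "/store/example/file1.root"),
    ("T2_EXAMPLE_A", "/store/example/file1.root"),
    ("T2_EXAMPLE_A", "/store/example/file2.root"),
    ("T2_EXAMPLE_B", "/store/example/file1.root"),
]


def count_declarations(records):
    """Return how many times each (rse, lfn) replica was declared suspicious."""
    return Counter(records)


def over_threshold(counts, min_count):
    """Return the replicas declared suspicious at least min_count times, sorted."""
    return sorted(replica for replica, n in counts.items() if n >= min_count)


counts = count_declarations(declarations)
flagged = over_threshold(counts, min_count=2)  # min_count is an assumed threshold
```

In this toy data only the first file at `T2_EXAMPLE_A` reaches the assumed threshold, so it is the only candidate a recovery step would look at.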
Ok, I thought so. And which clients are declaring these files suspicious? |
conveyor-finisher only at the moment. Hopefully kronos too in the future if we can manage to propagate job failure info to rucio via traces |
And what does the conveyor use to decide if some file is suspicious? |
Your answers are here: https://indico.cern.ch/event/1356295/contributions/5713494/attachments/2770348/4826824/Suspicious_replica_recovery_121223.pdf :) See slide 9 |
Not exactly, what patterns did we end up using? Where is that configured? The reason for me asking all this is that the first file looks all good to me.
Thank you so much for the link to the slides. It slipped my mind. |
Thanks for the feedback Rahul, it's a config in the conveyor section that I specified here [1]. This is the config ATLAS is using. We need operational experience to tune it better, so your feedback is valuable. Ping me anytime, we can chat about this. [1] #806 (comment) |
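As a rough sketch of how pattern-based classification of this kind can work, the snippet below checks a transfer error message against a list of regular expressions. The patterns and the function names are hypothetical and chosen only for illustration; they are not the patterns from the ATLAS or CMS configuration, and this is not the conveyor's actual code.

```python
import re

# Hypothetical patterns; the real ones live in the conveyor configuration.
SUSPICIOUS_PATTERNS = [
    r"no such file or directory",
    r"checksum mismatch",
]


def compile_patterns(patterns):
    """Compile the configured regexes once, ignoring case."""
    return [re.compile(p, re.IGNORECASE) for p in patterns]


def is_suspicious(error_message, compiled):
    """Return True if any pattern is found anywhere in the error message."""
    return any(p.search(error_message) for p in compiled)


compiled = compile_patterns(SUSPICIOUS_PATTERNS)
```

With a setup like this, a file that copies fine with gfal-copy can still be flagged if a single failed transfer produced a message matching one of the patterns, which is why the pattern list needs tuning from operational experience.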
Sure, I would appreciate that. I will write to you on Mattermost about the chat. So, the first thing here would be to go after the reason this file is being declared suspicious even though it is a good file as far as gfal-copy is concerned. |
I'd say the daemon is deployed as of two weeks ago. Please open a new issue related to further improvements. |
Feature Description
With rucio/rucio#6396 fixed, this daemon should be ready to deploy. It won't do anything until we successfully start marking replicas suspicious, but we don't need to wait for other issues to be fixed to do it. Will work on this a
Use Case
https://indico.cern.ch/event/1356295/
Possible Solution
To be figured out by checking how other daemons are deployed
Related Issues
@voetberg fyi