
Feature: Deploy suspicious replica recoverer daemon #806

Closed
haozturk opened this issue May 28, 2024 · 13 comments
@haozturk
Contributor

Feature Description

With rucio/rucio#6396 fixed, this daemon should be ready to deploy. It won't do anything until we successfully start marking replicas suspicious, but we don't need to wait for the other issues to be fixed before deploying it. I will work on this.

Use Case

https://indico.cern.ch/event/1356295/

Possible Solution

To be figured out by checking how other daemons are deployed

Related Issues

@voetberg fyi

@haozturk haozturk self-assigned this May 28, 2024
@haozturk haozturk changed the title Feature: Deploy suspicious replica recoveror daemon Feature: Deploy suspicious replica recoverer daemon Jun 6, 2024
@haozturk
Contributor Author

  1. I deployed the daemon https://github.com/dmwm/rucio-flux/pull/294/files
  2. Added its config file [1] as a secret:
    1. add a secret mount for replica recoverer config rucio-flux#297
    2. Add subpath for replica recoverer secret mount rucio-flux#298
  3. Added the necessary configs
[haozturk@lxplus9107 ~]$ rucio-admin-int config set --section replicarecoverer  --option rule_rse_expression --value "cms_type=int"
Set configuration: replicarecoverer.rule_rse_expression=cms_type=int
[haozturk@lxplus9107 ~]$ rucio-admin-int config set --section replicarecoverer  --option use_file_metadata --value False
Set configuration: replicarecoverer.use_file_metadata=False
[haozturk@lxplus9107 ~]$ rucio-admin-int config set --section replicarecoverer  --option did_name_expression --value "RAW"
Set configuration: replicarecoverer.did_name_expression=RAW
  4. I declared a replica suspicious 5 times manually (in a very hacky way; the automatic suspicious declaration doesn't work at the moment [2])
$ rucio-int list-suspicious-replicas
RSE Expression:        Scope:    Created at:            Nattempts:  File Name:
---------------------  --------  -------------------  ------------  -------------------------------------------------------------------------------------------------------------
T1_US_FNAL_Tape_Input  cms       2021-10-28 15:35:01             5  /store/data/Run2016F/MuonEG/MINIAOD/HIPM_UL2016_MiniAODv2-v2/280000/B3FF92F9-855D-5144-BA28-3877560A93B2.root
  5. Now, trying to fix the next issue:
{"message": "[1/6]: Exception\nProvided RSE expression is considered invalid.\nDetails: RSE Expression resulted in an empty set.\n  File \"/usr/local/lib/python3.9/site-packages/rucio/daemons/common.py\", line 215, in _generator\n    result = run_once_fnc(heartbeat_handler=heartbeat_handler, activity=activity)\n  File \"/usr/local/lib/python3.9/site-packages/rucio/daemons/replicarecoverer/suspicious_replica_recoverer.py\", line 242, in run_once\n    rse_list = sorted([rse for rse in parse_expression('enable_suspicious_file_recovery=true') if rse['vo'] == vo], key=lambda k: k['rse'])\n  File \"/usr/local/lib/python3.9/site-packages/rucio/db/sqla/session.py\", line 453, in new_funct\n    result = function(*args, session=session, **kwargs)\n  File \"/usr/local/lib/python3.9/site-packages/rucio/core/rse_expression_parser.py\", line 95, in parse_expression\n    raise InvalidRSEExpression('RSE Expression resulted in an empty set.')\n", "error": {"type": "InvalidRSEExpression", "message": "Provided RSE expression is considered invalid.\nDetails: RSE Expression resulted in an empty set.", "stack_trace": "  File \"/usr/local/lib/python3.9/site-packages/rucio/daemons/common.py\", line 215, in _generator\n    result = run_once_fnc(heartbeat_handler=heartbeat_handler, activity=activity)\n  File \"/usr/local/lib/python3.9/site-packages/rucio/daemons/replicarecoverer/suspicious_replica_recoverer.py\", line 242, in run_once\n    rse_list = sorted([rse for rse in parse_expression('enable_suspicious_file_recovery=true') if rse['vo'] == vo], key=lambda k: k['rse'])\n  File \"/usr/local/lib/python3.9/site-packages/rucio/db/sqla/session.py\", line 453, in new_funct\n    result = function(*args, session=session, **kwargs)\n  File \"/usr/local/lib/python3.9/site-packages/rucio/core/rse_expression_parser.py\", line 95, in parse_expression\n    raise InvalidRSEExpression('RSE Expression resulted in an empty set.')\n"}, "@timestamp": "2024-06-11T10:06:26.377Z", "log": {"level": "CRITICAL", 
"logger": "root"}, "process": {"pid": 9}}
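The CRITICAL log above comes from the daemon filtering RSEs by the `enable_suspicious_file_recovery=true` expression, which matches no RSE until the attribute is actually set somewhere. A minimal, self-contained sketch of that failure mode (the data structures and helper below are illustrative, not Rucio's real `parse_expression` API):

```python
# Illustrative sketch: the daemon only considers RSEs carrying the
# "enable_suspicious_file_recovery" attribute, and raises when none match.

class InvalidRSEExpression(Exception):
    pass

RSES = [
    {"rse": "T1_US_FNAL_Tape_Input", "attributes": {}},
    {"rse": "T2_US_Caltech", "attributes": {}},
]

def parse_expression(key, value, rses):
    """Return RSEs whose attribute `key` equals `value`; raise if none do."""
    matches = [r for r in rses if r["attributes"].get(key) == value]
    if not matches:
        raise InvalidRSEExpression("RSE Expression resulted in an empty set.")
    return matches

# No RSE has the attribute yet, reproducing the error in the log above:
try:
    parse_expression("enable_suspicious_file_recovery", "true", RSES)
except InvalidRSEExpression as exc:
    print(exc)  # RSE Expression resulted in an empty set.

# After something like `rucio-admin rse set-attribute ... --value true`
# on one RSE, the lookup succeeds:
RSES[1]["attributes"]["enable_suspicious_file_recovery"] = "true"
print([r["rse"] for r in parse_expression("enable_suspicious_file_recovery", "true", RSES)])
# ['T2_US_Caltech']
```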

[1]

[
    {
        "action": "ignore",
        "datatype": ["RAW"],
        "scope": []
    },
    {
        "action": "declare bad",
        "datatype": [],
        "scope": []
    }
]
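The config file above can be read as an ordered rule list: "ignore RAW, declare everything else bad". A minimal sketch of one plausible evaluation, assuming the first matching rule wins and an empty datatype list acts as a wildcard (this is an illustration, not the daemon's actual code):

```python
# Hypothetical interpreter for the secret config shown above.
POLICY = [
    {"action": "ignore", "datatype": ["RAW"], "scope": []},
    {"action": "declare bad", "datatype": [], "scope": []},
]

def action_for(datatype, policy):
    """Return the action of the first rule matching `datatype`.

    An empty "datatype" list is treated as match-anything.
    """
    for rule in policy:
        if not rule["datatype"] or datatype in rule["datatype"]:
            return rule["action"]
    return None

print(action_for("RAW", POLICY))     # ignore
print(action_for("AODSIM", POLICY))  # declare bad
```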

[2] #692 (comment)

@haozturk
Contributor Author

haozturk commented Sep 24, 2024

The daemon is deployed. Now I'm working on making it work. This configuration was added initially [1], in addition to the config file living as a secret [2]. Currently it's only working on Caltech: there were 42 suspicious replicas at Caltech, and it set all 42 of them temporarily unavailable. Check these replicas via [3]. Currently the limit is 5 replicas, beyond which the replicas are declared 'T'. This is too low; I'm planning to increase it, which requires changes in the Helm charts. For now, I switched the daemon off for Caltech. In principle, the minos-temporary-expiration daemon should set these replicas available again in 3 days. I will keep an eye on them.

[1]

rucio-admin config set --section policy --option pfn2lfn --value cms_pfn2lfn
rucio-admin config set --section conveyor --option suspicious_pattern --value ".*No such file or directory.*,.*no such file or directory.*,.*CHECKSUM MISMATCH Source and destination checksums do not match.*,.*SOURCE CHECKSUM MISMATCH.*,.*Unable to read file - wrong file checksum.*,.*checksum verification failed.*,.*direct_access.*,.*Copy failed with mode 3rd pull, with error: Transfer failed: failure: Server returned nothing.*,.*HTTP 404 : File not found.*"
rucio-admin config set --section replicarecoverer  --option rule_rse_expression --value "cms_type=real&rse_type=DISK&tier=2"
rucio-admin config set --section replicarecoverer  --option use_file_metadata --value False
rucio-admin config set --section replicarecoverer  --option did_name_expression --value "(/LHE/|/GEN-SIM-DIGI-RAW-MINIAOD/|/AODSIM/|/MINIAODSIM/|/GEN-SIM-RAW/|/GEN-SIM-RECO/|/GEN-SIM-RECODEBUG/|/AOD/|/MINIAOD/|/ALCARECO/|/USER/|/RAW-RECO/|/NANOAOD/|/NANOAODSIM/|/FEVT/|/PREMIX/|/GEN-SIM-DIGI-RAW-HLTDEBUG-RECO/|/GEN-SIM/|/GEN/|/SIM/|/GEN-SIM-DIGI-RAW/|/RAW/|/GEN-SIM-DIGI/|/GEN-SIM-DIGI-RAW-HLTDEBUG/|/DQMIO/|/DQM/|/RECO/|/RECODEBUG/|/RAWAODSIM/|/FEVTDEBUGHLT|/RAW/|/HC/)"
rucio-admin rse set-attribute --rse T2_US_Caltech --key "enable_suspicious_file_recovery" --value true
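The `suspicious_pattern` value set above is a comma-separated list of regexes matched against transfer error strings. A rough sketch of such matching, using a comma-free subset of the patterns (the real list contains a pattern with an embedded comma, so actual parsing must be more careful than naive splitting; this is not conveyor-finisher's exact code):

```python
import re

# Assumed: patterns are split on commas and any match flags the replica.
SUSPICIOUS_PATTERN = (
    ".*No such file or directory.*,"
    ".*CHECKSUM MISMATCH Source and destination checksums do not match.*,"
    ".*HTTP 404 : File not found.*"
)

PATTERNS = [re.compile(p) for p in SUSPICIOUS_PATTERN.split(",")]

def is_suspicious(error_message):
    """True if any configured pattern matches the transfer error."""
    return any(p.match(error_message) for p in PATTERNS)

print(is_suspicious("TRANSFER ERROR: HTTP 404 : File not found"))  # True
print(is_suspicious("Connection timed out"))                       # False
```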

[2]

[
    {
        "action": "ignore",
        "datatype": ["/RAW", "/HC/"],
        "scope": []
    },
    {
        "action": "declare bad",
        "datatype": ["/LHE/", "/GEN-SIM-DIGI-RAW-MINIAOD/", "/AODSIM/", "/MINIAODSIM/", "/GEN-SIM-RAW/", "/GEN-SIM-RECO/", "/GEN-SIM-RECODEBUG/", "/AOD/", "/MINIAOD/", "/ALCARECO/", "/USER/", "/RAW-RECO/", "/NANOAOD/", "/FEVT/", "/PREMIX/", "/GEN-SIM-DIGI-RAW-HLTDEBUG-RECO/", "/GEN-SIM/", "/GEN/", "/SIM/", "/GEN-SIM-DIGI-RAW/", "/GEN-SIM-DIGI/", "/GEN-SIM-DIGI-RAW-HLTDEBUG/", "/DQMIO/", "/DQM/", "/RECO/", "/RECODEBUG/", "/RAWAODSIM/", "/FEVTDEBUGHLT", "/NANOAODSIM/"],
        "scope": []
    }
]
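The config above matches datatypes as `/MINIAODSIM/`-style path segments. A hypothetical helper (not Rucio code) that pulls the datatype out of a CMS LFN and looks up the action; the segment position is an assumption that holds for the `/store/mc` and `/store/data` examples in this thread, and the policy below is a trimmed subset of the real config:

```python
POLICY = [
    {"action": "ignore", "datatype": ["/RAW/", "/HC/"], "scope": []},
    {"action": "declare bad",
     "datatype": ["/MINIAODSIM/", "/AODSIM/", "/GEN-SIM/"], "scope": []},
]

def datatype_segment(lfn):
    """Extract the datatype segment, e.g. '/MINIAODSIM/', from a CMS LFN."""
    parts = [p for p in lfn.split("/") if p]
    return "/" + parts[4] + "/"  # 5th segment for /store/{mc,data}/... LFNs

def action_for(lfn, policy):
    seg = datatype_segment(lfn)
    for rule in policy:
        if not rule["datatype"] or seg in rule["datatype"]:
            return rule["action"]
    return None

lfn = "/store/mc/Run3Summer23MiniAODv4/GluGluHtoZZto4L_M-3000/MINIAODSIM/130X/2830000/x.root"
print(action_for(lfn, POLICY))  # declare bad
```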

[3]

SELECT *
FROM cms_rucio_prod.replicas 
WHERE state = 'T'
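The per-RSE limit described above can be sketched as a simple guard (assumed semantics, based on this comment: beyond the limit the daemon declares the replicas temporarily unavailable, state 'T' in the replicas table, rather than attempting recovery):

```python
def recover_or_park(n_suspicious_on_rse, limit=5):
    """Hypothetical decision for one RSE's batch of suspicious replicas."""
    if n_suspicious_on_rse > limit:
        return "declare TEMPORARY_UNAVAILABLE"  # shows up as state = 'T'
    return "attempt recovery"

print(recover_or_park(42))       # Caltech's 42 replicas exceed the limit of 5
print(recover_or_park(42, 100))  # after raising the limit to 100
```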

@haozturk
Contributor Author

We increased limitSuspiciousFilesOnRse to 100 [1].

Now I enabled the daemon for Estonia. These are the suspicious replicas there [2]. I'll monitor the situation.

[1] dmwm/rucio-flux#339
[2]

$ rucio list-suspicious-replicas | grep Estonia
T2_EE_Estonia       cms       2024-03-28 13:49:44            48  /store/mc/Run3Summer23MiniAODv4/GluGluHtoZZto4L_M-3000_TuneCP5_13p6TeV_powheg-jhugen-pythia8/MINIAODSIM/130X_mcRun3_2023_realistic_v15-v2/2830000/d3472b6a-e0c0-4014-bd7b-30bdd6b9cac7.root
T2_EE_Estonia       cms       2022-10-31 19:24:31            51  /store/mc/RunIISummer20UL17RECO/NMSSM_XToYHTo2Tau2B_MX-2400_MY-250_TuneCP5_13TeV-madgraph-pythia8/AODSIM/106X_mc2017_realistic_v6-v2/2530000/50D7679C-BCE8-7745-98AC-375B28282C4A.root
T2_EE_Estonia       cms       2024-02-22 15:57:25            27  /store/mc/Run3Summer23MiniAODv4/ZH_Hto2C_Zto2Q_M-125_TuneCP5_13p6TeV_powheg-pythia8/MINIAODSIM/130X_mcRun3_2023_realistic_v14-v2/2820000/05ce2ca3-3db3-4562-93eb-58f2f56ca616.root
T2_EE_Estonia       cms       2022-10-07 12:07:10            24  /store/backfill/1/data/Tier0_REPLAY_2022/EGamma/MINIAOD/PromptReco-v7132607/000/359/813/00000/5d55225b-af6b-4d63-8277-df4058146d94.root
T2_EE_Estonia       cms       2022-11-01 18:58:24            53  /store/mc/RunIISummer20UL17RECO/NMSSM_XToYHTo2Tau2B_MX-2200_MY-900_TuneCP5_13TeV-madgraph-pythia8/AODSIM/106X_mc2017_realistic_v6-v2/2520000/871DF23C-075F-8B41-B33B-9EB7044AE4DF.root
T2_EE_Estonia       cms       2022-11-01 18:58:40            27  /store/mc/RunIISummer20UL17RECO/NMSSM_XToYHTo2Tau2B_MX-1600_MY-70_TuneCP5_13TeV-madgraph-pythia8/AODSIM/106X_mc2017_realistic_v6-v2/2520000/A82E5A52-6C2A-254A-8445-D0C67309138D.root
T2_EE_Estonia       cms       2024-03-28 13:50:34            50  /store/mc/Run3Summer23BPixDRPremix/GluGlutoHHto4B_kl-0p00_kt-1p00_c2-0p00_TuneCP5_13p6TeV_powheg-pythia8/GEN-SIM-RAW/130X_mcRun3_2023_realistic_postBPix_v6-v2/2830000/bdcde240-16c7-4381-87dc-f71b331d091b.root
T2_EE_Estonia       cms       2024-03-29 03:21:43            24  /store/mc/Run3Summer23BPixGS/QCD_PT-15to7000_TuneCP5_Flat2022_13p6TeV_pythia8/GEN-SIM/130X_mcRun3_2023_realistic_postBPix_v5-v3/2830000/1b88718d-db3a-4c2b-a520-3a06d1b6b60c.root
T2_EE_Estonia       cms       2022-10-31 19:25:04            27  /store/mc/RunIISummer20UL17RECO/SUSYGluGluHToAA_AToTauTau_M-125_M-16_TuneCP5_13TeV_PSWeights_pythia8/AODSIM/106X_mc2017_realistic_v6-v2/2530000/4EA578E6-D47A-5342-BD24-C648CA3CD0A9.root
T2_EE_Estonia       cms       2024-04-02 12:00:22             1  /store/mc/Run3Summer22EEMiniAODv4/DYto2L-2Jets_MLL-4to10_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/MINIAODSIM/130X_mcRun3_2022_realistic_postEE_v6-v2/40000/c230736c-9da4-421c-97af-88a693b8b971.root
T2_EE_Estonia       cms       2024-04-02 13:28:54            51  /store/mc/Run3Summer23BPixDRPremix/DYto2L-2Jets_MLL-4to10_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/AODSIM/130X_mcRun3_2023_realistic_postBPix_v6-v2/50005/1ac286ce-e62d-4306-b938-afaef6c84ac2.root
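For monitoring listings like the one above, the rows can be sliced into fields and filtered by declaration count. A quick sketch with made-up sample rows (the column layout is assumed from the listing: RSE, scope, date, time, attempt count, file name):

```python
SAMPLE = """\
T2_EE_Estonia  cms  2024-03-28 13:49:44  48  /store/mc/sample/a.root
T2_EE_Estonia  cms  2022-10-31 19:24:31  51  /store/mc/sample/b.root
T2_EE_Estonia  cms  2024-04-02 12:00:22   1  /store/mc/sample/c.root
"""

def parse_rows(text):
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # fields: rse, scope, date, time, nattempts, file name
        rows.append({"rse": fields[0],
                     "nattempts": int(fields[4]),
                     "name": fields[5]})
    return rows

rows = parse_rows(SAMPLE)
print(len([r for r in rows if r["nattempts"] >= 24]))  # 2
```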

@dynamic-entropy
Contributor

dynamic-entropy commented Sep 26, 2024

What is the count above?

For example, in the first line, what is 48?
T2_EE_Estonia cms 2024-03-28 13:49:44 48 /store/mc/Run3Summer23MiniAODv4/GluGluHtoZZto4L_M-3000_TuneCP5_13p6TeV_powheg-jhugen-pythia8/MINIAODSIM/130X_mcRun3_2023_realistic_v15-v2/2830000/d3472b6a-e0c0-4014-bd7b-30bdd6b9cac7.root

@haozturk
Contributor Author

The number of times it has been declared suspicious.

@dynamic-entropy
Contributor

Ok, I thought so. And which clients are declaring these files as such?

@haozturk
Contributor Author

Only conveyor-finisher at the moment. Hopefully kronos too in the future, if we can manage to propagate job failure info to Rucio via traces.

@dynamic-entropy
Contributor

And what does the conveyor use to decide if some file is suspicious?

@haozturk
Contributor Author

@dynamic-entropy
Contributor

Not exactly; which patterns did we end up using, and where is that configured?

The reason I'm asking all this is that the first file looks perfectly fine to me.

T2_EE_Estonia       cms       2024-03-28 13:49:44            48  /store/mc/Run3Summer23MiniAODv4/GluGluHtoZZto4L_M-3000_TuneCP5_13p6TeV_powheg-jhugen-pythia8/MINIAODSIM/130X_mcRun3_2023_realistic_v15-v2/2830000/d3472b6a-e0c0-4014-bd7b-30bdd6b9cac7.root

Thank you so much for the link to the slides. It slipped my mind.

@haozturk
Contributor Author

Thanks for the feedback, Rahul. It's a config in the conveyor section that I specified here [1]. This is the config ATLAS is using; we need operational experience to tune it better, so your feedback is valuable.

Ping me anytime, we can chat about this.

[1] #806 (comment)

@dynamic-entropy
Contributor

Sure, I would appreciate that. I will write to you on Mattermost about the chat.

So, the first thing here would be to go after the reason this file is being declared suspicious while it looks like a good file as far as gfal-copy is concerned.

@ericvaandering
Member

I'd say the daemon is deployed as of two weeks ago. Please open a new issue related to further improvements.
