Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Access PathsAndConsumesOfModules from new signal #47467

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

wddgit
Copy link
Contributor

@wddgit wddgit commented Feb 26, 2025

PR description:

Use the new LookupInitializationComplete signal to get access to the PathsAndConsumesOfModules object that is available to Services. Previously the PreBeginJob signal was used for this purpose, but since PR #46929 was merged, the object is only partially filled at preBeginJob. When the new signal is emitted it is fully filled. PathsAndConsumesOfModules now includes information related to EventSetup modules that previously did not exist. That is the part that is not filled at preBeginJob.

This shouldn't affect the output of anything (except DependencyGraph service used with SubProcesses, see below).

This PR removes the argument from the preBeginJob signal interface. For several of the services modified in this PR, that argument is not used and the PR just deletes it from the argument list. These changes are trivial and the fact that those services compile should be a sufficient test.

This PR modifies other affected services to use the new signal.

  • A few services are used in Framework unit tests and those test pass with the changes in this PR.
  • DependencyGraph runs and produces a graph successfully. If one modifies it to run without SubProcesses, the output .gv file is identical before and after this PR. (Note that with SubProcesses the version of DependencyGraph before this PR has a bug that is fixed with the changes in this PR. The old version assumes the PostBeginRun signal is a "local" signal" in ActivityRegistry but it is "global" so none of the SubProcess info is included in the .gv file that is output in the old version.)
  • FastTimerService is moved to the new signal and it looks like nothing should change in the output. There are many references to FastTimerService in CMSSW, but I don't see any unit test devoted to testing it. Is there an existing test of FastTimerService somewhere that I could run? Or is there an expert willing to test this? It is probably just a matter of running and checking the output is reasonable. Nothing should change.

There is one Service where the changes are more complex. That is the NVProfilerService. The PathsAndConsumesOfModules object is used to size some vectors at PreBeginJob. Not only is this case more complicated but the current implementation appears to be incorrect. There is a bug that could cause out of bounds vector access. The size of the ProcessCallGraph was being used to size the vectors, but that size could be incorrect if modules are deleted. The framework will delete unused modules. The ID from the ModuleDescription is incremented for every module constructed, including ones that are later deleted. This ID is not modified when a module is deleted. The index used to access the vector is the ID from the ModuleDescription, but in the old version the size is a count of undeleted modules from PathsAndConsumesOfModules and that does have deleted modules removed. This is fixed in the new version where the size is derived from the number of modules constructed and it counts the ones that later get deleted.

Note that there are no tests in the CMSSW repository of NVProfilerService. In fact there are no references to it at all in CMSSW. Is there a test of NVProfilerService somewhere that I could run? Or is there an expert willing to test this?

Should I develop tests for these two services? If necessary, this could be accomplished in a separate PR...

Resolves cms-sw/framework-team#1194

PR validation:

Relied on existing unit tests which all pass.

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 26, 2025

cms-bot internal usage

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-47467/43899

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @wddgit for master.

It involves the following packages:

  • EventFilter/Utilities (daq)
  • FWCore/Framework (core)
  • FWCore/Integration (core)
  • FWCore/PrescaleService (core)
  • FWCore/ServiceRegistry (core)
  • FWCore/Services (core)
  • FWCore/TestProcessor (core)
  • HLTrigger/Timer (hlt)
  • HeterogeneousCore/CUDAServices (heterogeneous)
  • HeterogeneousCore/SonicTriton (heterogeneous)
  • IOMC/RandomEngine (core)
  • PerfTools/AllocMonitor (core)

@Dr15Jones, @Martin-Grunewald, @cmsbuild, @emeschi, @fwyzard, @makortel, @mmusich, @smorovic, @smuzaffar can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @fabiocos, @felicepantaleo, @fwyzard, @kpedro88, @makortel, @missirol, @mmusich, @riga, @rovere this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@makortel
Copy link
Contributor

@cmsbuild, please test

@wddgit
Copy link
Contributor Author

wddgit commented Feb 26, 2025

I just noticed that FastTimerService will also see a behavior change caused by this PR if used with SubProcesses. The activity that used to be in PostBeginJob would not have been executed for SubProcess signals and now would be. Not sure if this matters as we are not currently using SubProcess and discussing removing the SubProcess feature entirely. Possibly this fixes a problem also.

The behavior of other services would not be affected by this PR, but they might be having similar problems if used with SubProcess. NVProfilerService uses the PostBeginJob signal and might or might not be affected (even if it has nothing to do with this PR).

@makortel
Copy link
Contributor

Should I develop tests for these two services?

No.

@mmusich
Copy link
Contributor

mmusich commented Feb 26, 2025

@cmsbuild, please abort

@mmusich
Copy link
Contributor

mmusich commented Feb 26, 2025

test parameters:

  • enable = hlt_p2_timing

@mmusich
Copy link
Contributor

mmusich commented Feb 26, 2025

@cmsbuild, please test

@wddgit
Copy link
Contributor Author

wddgit commented Feb 26, 2025

@mmusich Thanks.

@fwyzard
Copy link
Contributor

fwyzard commented Feb 26, 2025

hold

@fwyzard
Copy link
Contributor

fwyzard commented Feb 26, 2025

until I will have time to assess the impact on the FastTimerService

@cmsbuild
Copy link
Contributor

Pull request has been put on hold by @fwyzard
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

@cmsbuild cmsbuild added the hold label Feb 26, 2025
@wddgit
Copy link
Contributor Author

wddgit commented Feb 26, 2025

I was looking back at the comments above and they seem to be confusing, so I thought I would provide some more background information related to SubProcess.

First, I want to emphasize this only applies to jobs including one or more SubProcesses. None of this affects jobs without a SubProcess.

None of what I am explaining in this comment is new. The code has behaved like this for more than a decade.

A signal emitted in a job with SubProcesses can be propagated in two ways. "global" signals are propagated to services in the process or SubProcess where it was emitted and to child SubProcesses. So a "global" signal emitted in the top level process would be seen by services in the top level processes and all SubProcesses. But a "global" signal emitted in a SubProcess would not be seen by a service in the top level process or any higher level SubProcess.

A "local" signal goes in the opposite direction. A service in the top level process will see every "local" signal emitted in the top level process or any SubProcess. The signal is propagated upward from the SubProcess it is emitted in.

This is the place where signals are assigned as global or local:

https://cmssdt.cern.ch/lxr/source/FWCore/ServiceRegistry/src/ActivityRegistry.cc#0056

And finally we are discussing deleting the SubProcess feature entirely and currently no one uses it. So maybe I'm wasting time discussing this...

@cmsbuild
Copy link
Contributor

-1

Failed Tests: UnitTests
Size: This PR adds an extra 212KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b40efd/44702/summary.html
COMMIT: c5a164e
CMSSW: CMSSW_15_1_X_2025-02-26-1100/el8_amd64_gcc12
Additional Tests: HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47467/44702/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test Zmumuall had ERRORS

Comparison Summary

Summary:

  • You potentially added 6 lines to the logs
  • Reco comparison results: 11 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3920300
  • DQMHistoTests: Total failures: 11
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3920269
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 214 log files, 184 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Contributor

The unit test failure is

===== Test "Zmumuall" ====
file has vanished: "/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_1_X_2025-02-26-1100/unit_tests/validateAlignments/test_yaml/JetHT/merge/testJob/unitTestJetHT/JetHTAnalysis_merged.root"
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1189) [sender=3.1.3]

---> test Zmumuall had ERRORS

Not sure what that means

@fwyzard
Copy link
Contributor

fwyzard commented Feb 27, 2025

And finally we are discussing deleting the SubProcess feature entirely and currently no one uses it. So maybe I'm wasting time discussing this...

If subprocesses are removed from CMSSW, the FastTimerService can be cleaned up, and these changes may become irrelevant.
As long as subprocesses are part of CMSSW, I want to make sure that the FastTimerService works well with them.

@wddgit
Copy link
Contributor Author

wddgit commented Feb 27, 2025

The postBeginJob signal has been a "global" signal since January 2011 when SubProcess was first implemented by Bill. Chris and I were around then and also others when those commits were merged and involved in the discussions. If a postBeginJob signal is emitted in a SubProcess, then a service in the top level process does not see it and the postBeginJob method is not called. Any services relying on that signal coming from SubProcesses were already broken before this PR (only for jobs including a SubProcess). They have probably always been broken. I have not surveyed all services to determine which services suffer from this problem. The commit is back on the old CVS branch:

f4d9ee0#diff-8138d6b90e119ac25a32ffb17e3b852c5c8f2588b8eb0f7300a0097fedf4e80c

This PR moves work from the postBeginJob signal to the new LookupInitializationComplete signal for the 3 services, DependencyGraph, FastTimerService, and NVProfilerService. The new signal is "local" so the functions will start being executed with a SubProcess in the job. I have not tested this, but this might be enough to fix them. Although it is probably safe to assume they were never tested with SubProcesses before so it is difficult to say with certainty that this will be enough to fix them. Maybe more needs to be done. (I verified this behavior with DependencyGraph, I didn't test FastTimerService or NVProfilerService with a SubProcess yet.)

I'm happy to fix something if it can be identified as wrong and be quickly fixed. I'm not sure it makes sense to hold up this PR for a long time for something that is already broken. We also could follow up with additional PRs to get specific services working with SubProcess if anyone has time to do that.

I'll take a look at the Zmumuall unit test failure. I was hoping someone else would comment about that. At a glance, I've no idea what that one is related to.

@wddgit
Copy link
Contributor Author

wddgit commented Feb 27, 2025

please test

Try it again. See if the unit test is reproducible.

@cmsbuild
Copy link
Contributor

+1

Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b40efd/44728/summary.html
COMMIT: c5a164e
CMSSW: CMSSW_15_1_X_2025-02-27-1100/el8_amd64_gcc12
Additional Tests: HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47467/44728/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3920300
  • DQMHistoTests: Total failures: 57
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3920223
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 214 log files, 184 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@wddgit
Copy link
Contributor Author

wddgit commented Feb 27, 2025

Tests pass this time. Comparison failures are ones we've seen before in other PRs:

The unit test failure we saw in the first run of the PR tests vanished. It appears unrelated to this PR to me. I cannot see how the changes in this PR in services that are mostly monitoring things could cause this error. Given that the error doesn't reproduce, I'm inclined to write it off as a testing system glitch, but if anyone has suggestions on anything I could do related to it I am all ears.

===== Test "Zmumuall" ====
file has vanished: "/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_15_1_X_2025-02-26-1100/unit_tests/validateAlignments/test_yaml/JetHT/merge/testJob/unitTestJetHT/JetHTAnalysis_merged.root"
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1189) [sender=3.1.3]

---> test Zmumuall had ERRORS

@wddgit
Copy link
Contributor Author

wddgit commented Feb 27, 2025

I see nice looking HLT circle charts by following the link above but am not familiar with them enough to say whether they are good or bad. I do not know whether they are populated by output of the FastTimerService or not.

@fwyzard
Copy link
Contributor

fwyzard commented Feb 28, 2025

The postBeginJob signal has been a "global" signal since January 2011 when SubProcess was first implemented by Bill. Chris and I were around then and also others when those commits were merged and involved in the discussions. If a postBeginJob signal is emitted in a SubProcess, then a service in the top level process does not see it and the postBeginJob method is not called. Any services relying on that signal coming from SubProcesses were already broken before this PR (only for jobs including a SubProcess). They have probably always been broken.

The DependencyGraph and the FastTimerService were not "broken", quite the opposite: they were written explicitly taking this behaviour into account:

  • the preBeginJob signal may come both for the main process and for subprocesses, and is handled accordlingly;
  • the postBeginJob signal is expected to come only for the main process.

The DependencyGraph includes this logic directly in its preBeginJob() method.

The FastTimerService relies on the handling inside the ProcessCallGraph helper object and its preBeginJob() method, and on the fact that postBeginJob() is only called once.

This PR moves work from the postBeginJob signal to the new LookupInitializationComplete signal for the 3 services, DependencyGraph, FastTimerService, and NVProfilerService. The new signal is "local" so the functions will start being executed with a SubProcess in the job. I have not tested this, but this might be enough to fix them.

No, it will actually break those services in the presence of subprocesses.

Although it is probably safe to assume they were never tested with SubProcesses before so it is difficult to say with certainty that this will be enough to fix them.

Yeah, well, please do not make these assumptions.

I'm happy to fix something if it can be identified as wrong and be quickly fixed. I'm not sure it makes sense to hold up this PR for a long time for something that is already broken. We also could follow up with additional PRs to get specific services working with SubProcess if anyone has time to do that.

It makes sense to hold this PR (and any other PRs that may break existing functionality and cause regressions) until the time they have been properly reviewed, and the impact on the existing and expected behaviour has been assessed.

@mmusich
Copy link
Contributor

mmusich commented Feb 28, 2025

I see nice looking HLT circle charts by following the link above but am not familiar with them enough to say whether they are good or bad. I do not know whether they are populated by output of the FastTimerService or not.

the input data used to populate those diagrams is fetched via FastTimerService. The test is not meant to be exhaustive and it's actually meant for other reasons. The fact that you see circles filled it means that nothing is badly broken in a way that would make that particular test unusable.
There are subtleties though (see Andrea's comment #47467 (comment) to which I completely defer as main developer and expert of the tool).

@wddgit
Copy link
Contributor Author

wddgit commented Feb 28, 2025

@fwyzard Thanks for the comments. I took a closer look.

Before I continue, I want to emphasize all of this only affects jobs using SubProcess. As far as I know, no one is currently using SubProcess.

Two apologies. First, I see that FastTimerService did work with SubProcess in the past. I was wrong when I said it didn't. Second, I am the one that broke it in July 2024. It was an unintentional side effect of PR #45434. My mistake. That PR was one of a series of PRs where we went through all the begin and end transitions and tried to give them consistent behavior, especially under exceptions. Prior to that PR each different pair of begin/end transitions behaved differently and it took a careful look at code to understand how a particular begin/end pair of transitions behaved.

So since July 2024, several services have been broken when using SubProcesses, including FastTimerService. As far as I know I am the first to notice. I noticed while testing this PR. Our PR and IB tests aren't good enough to detect the problem.

There are 2 ways forward:

  • Delete SubProcess and accept that jobs with SubProcess are broken temporarily until we do that. If we are doing that soon, this is probably the best approach.
  • I can revert the part of PR Improve Framework behavior after exceptions in begin/end transitions (Job, Stream, ProcessBlock) #45434 that is causing the problem. I would suggest doing that in a separate PR first. It also affects postEndJob, not just postBeginJob. And then I would also need to make a few changes to this PR. This isn't hard and we could still delete SubProcess in the future. Either way is OK with me.

@makortel
Copy link
Contributor

  • Delete SubProcess and accept that jobs with SubProcess are broken temporarily until we do that. If we are doing that soon, this is probably the best approach.

On a quick thought I'd be in favor just proceeding this way, but let's discuss on Monday.

@wddgit
Copy link
Contributor Author

wddgit commented Mar 4, 2025

An update. I am starting to work on a PR removing the SubProcess feature from the CMS Python code used to support configurations and also remove any SubProcess tests. That will completely disable it. Then subsequent PRs will follow that remove all traces of SubProcess from Core code and possibly CMSSW. Even before we realized there was a problem introduced last summer, Matti had proposed this in his O&C talk. I think he plans to announce it again, but if no one in CMS comes forward to say that they use SubProcess or need it for the future for some specific reason, it seems to be where we are headed.

@fwyzard
Copy link
Contributor

fwyzard commented Mar 5, 2025

@wddgit thanks for looking into this and for the additional details.

  • Delete SubProcess and accept that jobs with SubProcess are broken temporarily until we do that. If we are doing that soon, this is probably the best approach.

OK, I am not going to be the one asking to keep support for subprocess in CMSSW if nobody is using it.

@makortel
Copy link
Contributor

makortel commented Mar 6, 2025

I think he plans to announce it again, but if no one in CMS comes forward to say that they use SubProcess or need it for the future for some specific reason, it seems to be where we are headed.

Now announced in https://cms-talk.web.cern.ch/t/removal-of-subprocess-feature/117449

# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove access to partially filled PathsAndConsumes object
5 participants