-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Access PathsAndConsumesOfModules from new signal #47467
base: master
Are you sure you want to change the base?
Conversation
cms-bot internal usage |
A new Pull Request was created by @wddgit for master. It involves the following packages:
@Dr15Jones, @Martin-Grunewald, @cmsbuild, @emeschi, @fwyzard, @makortel, @mmusich, @smorovic, @smuzaffar can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
@cmsbuild, please test |
I just noticed that FastTimerService will also see a behavior change caused by this PR if used with SubProcesses. The activity that used to be in PostBeginJob would not have been executed for SubProcess signals and now would be. Not sure if this matters as we are not currently using SubProcess and discussing removing the SubProcess feature entirely. Possibly this fixes a problem also. The behavior of other services would not be affected by this PR, but they might be having similar problems if used with SubProcess. NVProfilerService uses the PostBeginJob signal and might or might not be affected (even if it has nothing to do with this PR). |
No. |
@cmsbuild, please abort |
test parameters:
|
@cmsbuild, please test
|
@mmusich Thanks. |
hold |
until I will have time to assess the impact on the FastTimerService |
Pull request has been put on hold by @fwyzard |
I was looking back at the comments above and they seem to be confusing, so I thought I would provide some more background information related to SubProcess. First, I want to emphasize this only applies to jobs including one or more SubProcesses. None of this affects jobs without a SubProcess. None of what I am explaining in this comment is new. The code has behaved like this for more than a decade. A signal emitted in a job with SubProcesses can be propagated in two ways. "global" signals are propagated to services in the process or SubProcess where it was emitted and to child SubProcesses. So a "global" signal emitted in the top level process would be seen by services in the top level processes and all SubProcesses. But a "global" signal emitted in a SubProcess would not be seen by a service in the top level process or any higher level SubProcess. A "local" signal goes in the opposite direction. A service in the top level process will see every "local" signal emitted in the top level process or any SubProcess. The signal is propagated upward from the SubProcess it is emitted in. This is the place where signals are assigned as global or local: https://cmssdt.cern.ch/lxr/source/FWCore/ServiceRegistry/src/ActivityRegistry.cc#0056 And finally we are discussing deleting the SubProcess feature entirely and currently no one uses it. So maybe I'm wasting time discussing this... |
-1 Failed Tests: UnitTests Unit TestsI found 1 errors in the following unit tests: ---> test Zmumuall had ERRORS
Comparison SummarySummary:
|
The unit test failure is
Not sure what that means |
If subprocesses are removed from CMSSW, the FastTimerService can be cleaned up, and these changes may become irrelevant. |
The postBeginJob signal has been a "global" signal since January 2011 when SubProcess was first implemented by Bill. Chris and I were around then and also others when those commits were merged and involved in the discussions. If a postBeginJob signal is emitted in a SubProcess, then a service in the top level process does not see it and the postBeginJob method is not called. Any services relying on that signal coming from SubProcesses were already broken before this PR (only for jobs including a SubProcess). They have probably always been broken. I have not surveyed all services to determine which services suffer from this problem. The commit is back on the old CVS branch: f4d9ee0#diff-8138d6b90e119ac25a32ffb17e3b852c5c8f2588b8eb0f7300a0097fedf4e80c This PR moves work from the postBeginJob signal to the new LookupInitializationComplete signal for the 3 services, DependencyGraph, FastTimerService, and NVProfilerService. The new signal is "local" so the functions will start being executed with a SubProcess in the job. I have not tested this, but this might be enough to fix them. Although it is probably safe to assume they were never tested with SubProcesses before so it is difficult to say with certainty that this will be enough to fix them. Maybe more needs to be done. (I verified this behavior with DependencyGraph, I didn't test FastTimerService or NVProfilerService with a SubProcess yet.) I'm happy to fix something if it can be identified as wrong and be quickly fixed. I'm not sure it makes sense to hold up this PR for a long time for something that is already broken. We also could follow up with additional PRs to get specific services working with SubProcess if anyone has time to do that. I'll take a look at the Zmumuall unit test failure. I was hoping someone else would comment about that. At a glance, I've no idea what that one is related to. |
please test Try it again. See if the unit test is reproducible. |
+1 Size: This PR adds an extra 16KB to repository
Comparison SummarySummary:
|
Tests pass this time. Comparison failures are ones we've seen before in other PRs:
The unit test failure we saw in the first run of the PR tests vanished. It appears unrelated to this PR to me. I cannot see how the changes in this PR in services that are mostly monitoring things could cause this error. Given that the error doesn't reproduce, I'm inclined to write it off as a testing system glitch, but if anyone has suggestions on anything I could do related to it I am all ears.
|
I see nice looking HLT circle charts by following the link above but am not familiar with them enough to say whether they are good or bad. I do not know whether they are populated by output of the FastTimerService or not. |
The
The The
No, it will actually break those services in the presence of subprocesses.
Yeah, well, please do not make these assumptions.
It makes sense to hold this PR (and any other PRs that may break existing functionality and cause regressions) until the time they have been properly reviewed, and the impact on the existing and expected behaviour has been assessed. |
the input data used to populate those diagrams is fetched via |
@fwyzard Thanks for the comments. I took a closer look. Before I continue, I want to emphasize all of this only affects jobs using SubProcess. As far as I know, no one is currently using SubProcess. Two apologies. First, I see that FastTimerService did work with SubProcess in the past. I was wrong when I said it didn't. Second, I am the one that broke it in July 2024. It was an unintentional side effect of PR #45434. My mistake. That PR was one of a series of PRs where we went through all the begin and end transitions and tried to give them consistent behavior, especially under exceptions. Prior to that PR each different pair of begin/end transitions behaved differently and it took a careful look at code to understand how a particular begin/end pair of transitions behaved. So since July 2024, several services have been broken when using SubProcesses, including FastTimerService. As far as I know I am the first to notice. I noticed while testing this PR. Our PR and IB tests aren't good enough to detect the problem. There are 2 ways forward:
|
On a quick thought I'd be in favor just proceeding this way, but let's discuss on Monday. |
An update. I am starting to work on a PR removing the SubProcess feature from the CMS Python code used to support configurations and also remove any SubProcess tests. That will completely disable it. Then subsequent PRs will follow that remove all traces of SubProcess from Core code and possibly CMSSW. Even before we realized there was a problem introduced last summer, Matti had proposed this in his O&C talk. I think he plans to announce it again, but if no one in CMS comes forward to say that they use SubProcess or need it for the future for some specific reason, it seems to be where we are headed. |
@wddgit thanks for looking into this and for the additional details.
OK, I am not going to be the one asking to keep support for subprocess in CMSSW if nobody is using it. |
Now announced in https://cms-talk.web.cern.ch/t/removal-of-subprocess-feature/117449 |
PR description:
Use the new LookupInitializationComplete signal to get access to the PathsAndConsumesOfModules object that is available to Services. Previously the PreBeginJob signal was used for this purpose, but since PR #46929 was merged, the object is only partially filled at preBeginJob. When the new signal is emitted it is fully filled. PathsAndConsumesOfModules now includes information related to EventSetup modules that previously did not exist. That is the part that is not filled at preBeginJob.
This shouldn't affect the output of anything (except DependencyGraph service used with SubProcesses, see below).
This PR removes the argument from the preBeginJob signal interface. For several of the services modified in this PR, that argument is not used and the PR just deletes it from the argument list. These changes are trivial and the fact that those services compile should be a sufficient test.
This PR modifies other affected services to use the new signal.
There is one Service where the changes are more complex. That is the NVProfilerService. The PathsAndConsumesOfModules object is used to size some vectors at PreBeginJob. Not only is this case more complicated but the current implementation appears to be incorrect. There is a bug that could cause out of bounds vector access. The size of the ProcessCallGraph was being used to size the vectors, but that size could be incorrect if modules are deleted. The framework will delete unused modules. The ID from the ModuleDescription is incremented for every module constructed, including ones that are later deleted. This ID is not modified when a module is deleted. The index used to access the vector is the ID from the ModuleDescription, but in the old version the size is a count of undeleted modules from PathsAndConsumesOfModules and that does have deleted modules removed. This is fixed in the new version where the size is derived from the number of modules constructed and it counts the ones that later get deleted.
Note that there are no tests in the CMSSW repository of NVProfilerService. In fact there are no references to it at all in CMSSW. Is there a test of NVProfilerService somewhere that I could run? Or is there an expert willing to test this?
Should I develop tests for these two services? If necessary, this could be accomplished in a separate PR...
Resolves cms-sw/framework-team#1194
PR validation:
Relied on existing unit tests which all pass.