Add support and validation to GPU in StdBase/ReReco #10799
Conversation
Copying a few names from the GH issue discussion: @justinasr @mrceyhun @fwyzard @srimanob @hufnagel. This is still a work in progress, but I wanted to draw your attention to the fact that this is happening, and that this is the current candidate implementation to go into the Workload Management system, hopefully in the next week. The most important input/feedback I would like to get from you is whether you see any inconsistencies or use cases that are not covered by the current schema (data type + length + regular expression). Thank you very much! UPDATE: adding Jordan as well @jordan-martins
The baseline development to support GPUs within WMCore is provided in this PR, which also supports these new workflow arguments in the ReReco spec type. TaskChain and StepChain will be addressed in different issues/PRs. I'm going to run some extra tests in the next few hours and, if everything goes fine, I will request these changes to be deployed to cmsweb-testbed today, run a final validation, and deploy this to production as well on Thursday.
Looking at a real request JSON, I see this:
which doesn't look great with those escaped chars. Maybe we should default it to encoded
I'm going to need this code merged, such that I can resume working on its implementation for TaskChain #10400 and StepChain #10401. Basic tests went fine in my VM. I'm going to squash the 5th commit with the 1st one, and from my side it's good to go.
fix Lexicon logic for GPUParams and its internals
add RequiresGPU vs GPUParams validation; fix Py2 compatibility
Change default from empty string to None. StdBase and Lexicon
Call GPU setter in StdBase.setupProcessingTask
fix WMWorkload set call
update getters/setters to deal with None default instead
clean unit tests
update Lexicon unit tests for new None default
unit test for getter/setters methods for GPU settings
WMWorkload unit test fix
update unit tests for getters/setters with None
try:
    data = json.loads(candidate)
except Exception:
    raise AssertionError("GPUParams is not a valid JSON object")
Sorry for the question @amaltaro, but why should we raise an AssertionError here? It seems strange to me to catch a general exception and raise an AssertionError instead. And even more, from the message it seems this is tied to a specific use case related to the data structure itself.
Lexicon checks either return False or raise an AssertionError in case of failures during the input data validation. This is also the standard behaviour of the check function in this module.
return _gpuInternalParameters(data)
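For context, here is a minimal self-contained sketch of that convention, assuming a simplified version of this check; the _gpuInternalParameters stub below is a hypothetical stand-in for the real internal validation, which enforces the full GPUParams schema:

import json

def _gpuInternalParameters(data):
    # hypothetical stand-in: the real code validates mandatory/optional keys,
    # types, string lengths and regular expressions
    assert data is None or isinstance(data, dict), "GPUParams must be null or a JSON object"
    return True

def gpuParameters(candidate):
    # Lexicon-style check: return True on success, raise AssertionError on failure
    try:
        data = json.loads(candidate)
    except Exception:
        raise AssertionError("GPUParams is not a valid JSON object")
    return _gpuInternalParameters(data)

print(gpuParameters('null'))                    # True: the default value
print(gpuParameters('{"GPUMemoryMB": 8000}'))   # True: a JSON object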
CUDA_VERSION_REGEX = {"re": r"^\d+\.\d+(\.\d+)?$", "maxLength": 100}
If this one is going to be in the global scope, I'd say we should move it to the top of the file, similar to the others defined on lines 27-31.
That could be done as well. However, looking at this module, you can see that the regular expression is usually defined close to the function that consumes it, so I just kept that consistent.
Anyhow, this will be refactored once a decision is made on how to separate Lexicon logic from Lexicon rules/regex.
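To illustrate how this constraint is meant to be applied, here is a short sketch; the helper name checkCudaVersion is invented for the example and is not the Lexicon API:

import re

CUDA_VERSION_REGEX = {"re": r"^\d+\.\d+(\.\d+)?$", "maxLength": 100}

def checkCudaVersion(value):
    # enforce both the regular expression and the maximum string length
    assert isinstance(value, str), "CUDA version must be a string"
    assert len(value) <= CUDA_VERSION_REGEX["maxLength"], "CUDA version string is too long"
    assert re.match(CUDA_VERSION_REGEX["re"], value), "CUDA version does not match the expected format"
    return True

print(checkCudaVersion("11.2"))      # True
print(checkCudaVersion("11.2.152"))  # True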
task of this spec.
:param requiresGPU: string defining whether GPUs are needed. For TaskChains, it
    could be a dictionary key'ed by the taskname.
:param gpuParams: GPU settings. A JSON encoded object, from either a None object
What happens if it is a JSON encoding of None? I can guess, of course, but it is a little bit obscure.
This:
In [2]: json.dumps(None)
Out[2]: 'null'
which will be the default value in the specs/workflows. Also defined in the StdBase spec file.
all underneath CMSSW type step object.
:param requiresGPU: string defining whether GPUs are needed. For TaskChains, it
    could be a dictionary key'ed by the taskname.
:param gpuParams: GPU settings. A JSON encoded object, from either a None object
Same comment as the one given below for setGPUSettings.
taskIterator = self.taskIterator()

for task in taskIterator:
    task.setTaskGPUSettings(requiresGPU, gpuParams)
The following comment is just for my personal education (I am pretty sure you have already double-checked this).
Isn't this a repeated recursion, when combined with the one from setTaskGPUSettings()
on this line: https://github.com/dmwm/WMCore/pull/10799/files#diff-81efe0a8bcf6b4cb2d5ee526c24a027563fa129b414de2aca75e78f1b38acbf1R1503
That's a good question! And indeed it's confusing!
The WMWorkload taskIterator method only iterates through the top-level tasks, and each workload usually (always?) has a single top-level task.
The recursion in WMTask, on the other hand, iterates through all sub-tasks. For example, from a Processing task we have a Merge sub-task, and then LogCollect and Cleanup sub-tasks as well.
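For illustration only, here is a toy model of the two traversals; these classes are simplified stand-ins, not the actual WMWorkload/WMTask implementations:

class Task:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.gpuSettings = None

    def setTaskGPUSettings(self, requiresGPU, gpuParams):
        # recursion: apply the settings to this task and to all of its sub-tasks
        self.gpuSettings = (requiresGPU, gpuParams)
        for child in self.children:
            child.setTaskGPUSettings(requiresGPU, gpuParams)

class Workload:
    def __init__(self, topLevelTasks):
        self.topLevelTasks = topLevelTasks

    def setGPUSettings(self, requiresGPU, gpuParams):
        # the workload itself only walks its top-level task(s)
        for task in self.topLevelTasks:
            task.setTaskGPUSettings(requiresGPU, gpuParams)

processing = Task("Processing", [Task("Merge", [Task("Cleanup"), Task("LogCollect")])])
Workload([processing]).setGPUSettings("required", '{"GPUMemoryMB": 8000}')

So the two loops complement each other rather than repeating the same traversal.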
Thanks for this quite big PR @amaltaro. From what I managed to grasp of the whole picture, it looks good to me. Of course, I cannot catch possible errors better than the tests you have already done. I just left a few inline comments related to the clarity of the code while reading. None of them points to an actual problem, so they are not blockers in any case.
Thank you very much for this review, Todor. Hopefully the manual tests and unit tests will be enough to make sure this code behaves ;)
Here is fairly good documentation for these GPU developments: https://github.com/dmwm/WMCore/wiki/GPU-Support
Fixes #10388
Status
ready
Description
This PR implements GPU functionality within WMCore (only at the request level; the job level will be done in a different issue/PR).
Summary of changes:
- RequiresGPU: can be one of these values ("forbidden", "optional", "required"), with default value "forbidden", thus not using GPUs.
- GPUParams: a JSON-encoded dictionary, with a default value of None JSON encoded. It must be provided if RequiresGPU=optional or RequiresGPU=required (an illustrative example is given after these lists).

List of mandatory parameters within GPUParams:
- GPUMemoryMB (renamed from GPUMemory!): integer greater than 0
- CUDACapabilities: a list of string values. Each value must match the CUDA_VERSION_REGEX regular expression and maximum length.
- CUDARuntime: a string value matching the CUDA_VERSION_REGEX constraints.

And the list of the 3 optional parameters:
- GPUName: a string value with less than 100 chars
- CUDADriverVersion: a string value matching the CUDA_VERSION_REGEX constraints.
- CUDARuntimeVersion: a string value matching the CUDA_VERSION_REGEX constraints.

NOTE: full support in TaskChain and StepChain is going to be done in a different GH issue/pull request, but the bulk of the development is already provided in this PR.
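For illustration, a request fragment carrying these arguments could look like the sketch below; the specific values (memory size, capabilities, GPU name, driver and runtime versions) are invented for the example and are not taken from this PR:

import json

gpuParams = {
    # mandatory parameters
    "GPUMemoryMB": 8000,
    "CUDACapabilities": ["7.5", "8.0"],
    "CUDARuntime": "11.2",
    # optional parameters
    "GPUName": "Tesla T4",
    "CUDADriverVersion": "470.57.02",
    "CUDARuntimeVersion": "11.2.152",
}

requestArgs = {
    "RequiresGPU": "required",
    "GPUParams": json.dumps(gpuParams),  # the spec expects a JSON-encoded string
}
print(requestArgs["GPUParams"])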
Is it backward compatible (if not, which system it affects?)
NO (new feature!)
Related PRs
None
External dependencies / deployment changes
None