Add support and validation to GPU in StdBase/ReReco #10799
Conversation
Copying a few names from the GH issue discussion: @justinasr @mrceyhun @fwyzard @srimanob @hufnagel. This is still a work in progress, but I wanted to draw your attention to the fact that this is happening, and that this is the current candidate implementation to go into the Workload Management system, hopefully in the next week. The most important input/feedback I would like to get from you is whether you see any inconsistencies or use cases that are not covered by the current schema (data type + length + regular expression). Thank you very much! UPDATE: adding Jordan as well @jordan-martins
The baseline development to support GPUs within WMCore is provided in this PR, which also supports these new workflow arguments in the ReReco spec type. TaskChain and StepChain will be addressed in different issues/PRs. I'm going to run some extra tests in the next few hours and, if everything goes fine, I will request these changes to be deployed to cmsweb-testbed today, run a final validation, and deploy this to production as well on Thursday.
Looking at a real request JSON, I see this:
which doesn't look great with those escaped chars. Maybe we should default it to encoded
I'm going to need this code merged, such that I can resume working on its implementation for TaskChain #10400 and StepChain #10401. Basic tests went fine in my VM. I'm going to squash the 5th commit with the 1st one, and from my side it's good to go.
fix Lexicon logic for GPUParams and its internals
add RequiresGPU vs GPUParams validation; fix Py2 compatibility
Change default from empty string to None. StdBase and Lexicon
Call GPU setter in StdBase.setupProcessingTask
fix WMWorkload set call
update getters/setters to deal with None default instead
clean unit tests
update Lexicon unit tests for new None default
unit test for getter/setters methods for GPU settings
WMWorkload unit test fix
update unit tests for getters/setters with None
try:
    data = json.loads(candidate)
except Exception:
    raise AssertionError("GPUParams is not a valid JSON object")
Sorry for the question @amaltaro, but why should we raise an AssertionError here? It seems strange to me to catch a general exception and raise an AssertionError instead. And even more, from the message it seems this is tied to a specific use case related to the data structure itself.
Lexicon checks either return False or raise an AssertionError in case of failures during the input data validation. This is also the standard behaviour of the check function in this module.
return _gpuInternalParameters(data)
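For context, here is a minimal self-contained sketch of that convention, assuming a simplified version of this check; the _gpuInternalParameters stub below is a hypothetical stand-in for the real internal validation, which enforces the full GPUParams schema:

import json

def _gpuInternalParameters(data):
    # hypothetical stand-in: the real code validates mandatory/optional keys,
    # types, string lengths and regular expressions
    assert data is None or isinstance(data, dict), "GPUParams must be null or a JSON object"
    return True

def gpuParameters(candidate):
    # Lexicon-style check: return True on success, raise AssertionError on failure
    try:
        data = json.loads(candidate)
    except Exception:
        raise AssertionError("GPUParams is not a valid JSON object")
    return _gpuInternalParameters(data)

print(gpuParameters('null'))                    # True: the default value
print(gpuParameters('{"GPUMemoryMB": 8000}'))   # True: a JSON object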
CUDA_VERSION_REGEX = {"re": r"^\d+\.\d+(\.\d+)?$", "maxLength": 100}
If this one is going to be in the global scope, I'd say we should move it to the top of the file, similar to the others defined on lines 27-31.
That could be done as well. However, looking at this module, you can see that the regular expression is usually defined close to the function that consumes it, so I just kept that consistent.
Anyhow, this will be refactored once a decision is made on how to separate Lexicon logic from Lexicon rules/regex.
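To illustrate how this constraint is meant to be applied, here is a short sketch; the helper name checkCudaVersion is invented for the example and is not the Lexicon API:

import re

CUDA_VERSION_REGEX = {"re": r"^\d+\.\d+(\.\d+)?$", "maxLength": 100}

def checkCudaVersion(value):
    # enforce both the regular expression and the maximum string length
    assert isinstance(value, str), "CUDA version must be a string"
    assert len(value) <= CUDA_VERSION_REGEX["maxLength"], "CUDA version string is too long"
    assert re.match(CUDA_VERSION_REGEX["re"], value), "CUDA version does not match the expected format"
    return True

print(checkCudaVersion("11.2"))      # True
print(checkCudaVersion("11.2.152"))  # True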
task of this spec.
:param requiresGPU: string defining whether GPUs are needed. For TaskChains, it
    could be a dictionary key'ed by the taskname.
:param gpuParams: GPU settings. A JSON encoded object, from either a None object
What happens if it is a JSON encoding of None? I can guess, of course, but it is a little bit obscure.
This:
In [2]: json.dumps(None)
Out[2]: 'null'
which will be the default value in the specs/workflows. Also defined in the StdBase spec file.
all underneath CMSSW type step object.
:param requiresGPU: string defining whether GPUs are needed. For TaskChains, it
    could be a dictionary key'ed by the taskname.
:param gpuParams: GPU settings. A JSON encoded object, from either a None object
Same comment as the one given below for setGPUSettings.
taskIterator = self.taskIterator()

for task in taskIterator:
    task.setTaskGPUSettings(requiresGPU, gpuParams)
The following comment is just for my personal education (I am pretty sure you have already double-checked this).
Isn't this a repeated recursion, when combined with the one from setTaskGPUSettings()
on this line: https://github.com/dmwm/WMCore/pull/10799/files#diff-81efe0a8bcf6b4cb2d5ee526c24a027563fa129b414de2aca75e78f1b38acbf1R1503
That's a good question! And indeed it's confusing!
The WMWorkload taskIterator method only iterates through the top-level tasks, and each workload usually (always?) has a single top-level task.
The recursion in WMTask, on the other hand, iterates through all sub-tasks. For example, from a Processing task we have a Merge sub-task, and then LogCollect and Cleanup sub-tasks as well.
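For illustration only, here is a toy model of the two traversals; these classes are simplified stand-ins, not the actual WMWorkload/WMTask implementations:

class Task:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.gpuSettings = None

    def setTaskGPUSettings(self, requiresGPU, gpuParams):
        # recursion: apply the settings to this task and to all of its sub-tasks
        self.gpuSettings = (requiresGPU, gpuParams)
        for child in self.children:
            child.setTaskGPUSettings(requiresGPU, gpuParams)

class Workload:
    def __init__(self, topLevelTasks):
        self.topLevelTasks = topLevelTasks

    def setGPUSettings(self, requiresGPU, gpuParams):
        # the workload itself only walks its top-level task(s)
        for task in self.topLevelTasks:
            task.setTaskGPUSettings(requiresGPU, gpuParams)

processing = Task("Processing", [Task("Merge", [Task("Cleanup"), Task("LogCollect")])])
Workload([processing]).setGPUSettings("required", '{"GPUMemoryMB": 8000}')

So the two loops complement each other rather than repeating the same traversal.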
Thanks for this quite big PR @amaltaro. From what I managed to grasp of the whole picture, it looks good to me. Of course, I cannot catch possible errors better than the tests you have already done. I just left a few inline comments related to the clarity of the code while reading. None of them points to an actual problem, so they are not blockers in any case.
Thank you very much for this review, Todor. Hopefully the manual tests and unit tests will be enough to make sure this code behaves ;)
Here is fairly good documentation for these GPU developments: https://github.com/dmwm/WMCore/wiki/GPU-Support
Fixes #10388
Status
ready
Description
This PR implements GPU functionality within WMCore (only at the request level; the job level will be done in a different issue/PR).
Summary of changes:
- RequiresGPU: can be one of these values ("forbidden", "optional", "required"), with default value "forbidden", thus not using GPUs.
- GPUParams: a JSON-encoded dictionary, with a default value of None JSON encoded. It must be provided if RequiresGPU=optional or RequiresGPU=required (an illustrative example is given after these lists).

List of mandatory parameters within GPUParams:
- GPUMemoryMB (renamed from GPUMemory!): integer greater than 0
- CUDACapabilities: a list of string values. Each value must match the CUDA_VERSION_REGEX regular expression and maximum length.
- CUDARuntime: a string value matching the CUDA_VERSION_REGEX constraints.

And the list of the 3 optional parameters:
- GPUName: a string value with less than 100 chars
- CUDADriverVersion: a string value matching the CUDA_VERSION_REGEX constraints.
- CUDARuntimeVersion: a string value matching the CUDA_VERSION_REGEX constraints.

NOTE: full support in TaskChain and StepChain is going to be done in a different GH issue/pull request, but the bulk of the development is already provided in this PR.
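For illustration, a request fragment carrying these arguments could look like the sketch below; the specific values (memory size, capabilities, GPU name, driver and runtime versions) are invented for the example and are not taken from this PR:

import json

gpuParams = {
    # mandatory parameters
    "GPUMemoryMB": 8000,
    "CUDACapabilities": ["7.5", "8.0"],
    "CUDARuntime": "11.2",
    # optional parameters
    "GPUName": "Tesla T4",
    "CUDADriverVersion": "470.57.02",
    "CUDARuntimeVersion": "11.2.152",
}

requestArgs = {
    "RequiresGPU": "required",
    "GPUParams": json.dumps(gpuParams),  # the spec expects a JSON-encoded string
}
print(requestArgs["GPUParams"])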
Is it backward compatible (if not, which system it affects?)
NO (new feature!)
Related PRs
None
External dependencies / deployment changes
None