A pure function is a fully-specified computation with a deterministic result based on its arguments. If the function and its arguments remain the same, then a future call to that function, even from a different process on a different computer, should be able to return the previously-computed result rather than re-computing.
Note: If you aren’t using the pure.magic API, a unique pipeline id is generated for every application run, so no results will be memoized across processes. See pipeline ids for how to change this.
At the time of function call, each function wrapped with mops MemoizingPicklingRunner (or @pure.magic) will combine:
- the current blob root
- a mops-specific prefix, mops2-mpf, which exists to keep the root namespace organized
- the fully-qualified name (including module) of the function
- the function-logic-key on the docstring (if any)
- and the hash of the serialized arguments (input) to the function
to produce a deterministic remote storage location for the invocation and also for the eventual results.
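As a rough illustration of how these pieces could combine into a deterministic location, the sketch below joins them into a single path. It is not the mops implementation: the helper name sketch_storage_location, the use of pickle plus SHA-256 for the argument hash, the "@" separator for the logic key, and the slash-joined layout are all assumptions made for illustration.

```python
import hashlib
import pickle


def sketch_storage_location(blob_root, func, args, kwargs, logic_key=""):
    """Hypothetical helper: join the components into one deterministic path."""
    # Fully-qualified function name: module plus name.
    function_id = f"{func.__module__}:{func.__name__}"
    if logic_key:
        # Assumption: the logic key is appended to the function name.
        function_id += f"@{logic_key}"
    # Assumption: arguments are pickled, then hashed with SHA-256.
    args_hash = hashlib.sha256(pickle.dumps((args, kwargs))).hexdigest()
    return "/".join([blob_root.rstrip("/"), "mops2-mpf", function_id, args_hash])


def add(a, b):
    return a + b


first = sketch_storage_location("file:///tmp/mops-root", add, (1, 2), {})
again = sketch_storage_location("file:///tmp/mops-root", add, (1, 2), {})
other = sketch_storage_location("file:///tmp/mops-root", add, (1, 3), {})
print(first == again)  # same function and arguments: compares the two locations
print(first == other)  # different arguments: compares the two locations
```

The point of the sketch is only that every input to the path is derived from the function and its arguments, so any process that can see the same blob root would compute the same location.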
Why so many bits and pieces?
When you think about a function in the common sense, it’s easy to forget that, mathematically, a function isn’t just a named bit of code - it’s a unique and immutable transform of the input domain into the output codomain.
But in code, we change the code for functions (or the other functions they call) all the time. We don’t ordinarily rename the function afterward, but the actual effect of the function has changed - technically it’s a totally new function!
mops provides several different approaches to making it easier to map the mathematical universe onto the everyday one that programmers inhabit. Perhaps too many approaches - they certainly overlap a bit at points. But collectively they provide a set of tools for making your functions and their results a bit easier to wrangle, organize, and reason about.
In its entirety, the unique invocation of your function with its arguments hash constitutes the memo_uri - a mops-controlled namespace for the specific function under the specific circumstances of its single invocation.
Before invoking the function remotely, the MemoizingPicklingRunner will check to see if a result already exists at the expected path, and if it does, that result will be returned instead of running the remote computation.
This allows function results to be reused across time regardless of who calls the function or when they call it.
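The check-then-run behaviour described above can be sketched with a plain dictionary standing in for the blob store. The names run_memoized, store, and calls are hypothetical, and the real runner works against blob storage rather than an in-memory dict; this is only a minimal model of the idea.

```python
import hashlib
import pickle

store = {}  # stands in for a shared blob store (assumption for this sketch)
calls = []  # records each real execution so reuse is observable


def run_memoized(func, *args, **kwargs):
    """Hypothetical helper: return a stored result if one exists, else compute it."""
    args_hash = hashlib.sha256(pickle.dumps((args, kwargs))).hexdigest()
    key = f"{func.__module__}:{func.__name__}/{args_hash}"
    if key in store:
        # A result already exists at the expected location, so reuse it.
        return pickle.loads(store[key])
    result = func(*args, **kwargs)
    store[key] = pickle.dumps(result)
    return result


def square(x):
    calls.append(x)
    return x * x


first = run_memoized(square, 4)
second = run_memoized(square, 4)  # served from the store; square is not re-run
print(first, second, len(calls))
```

Because the key depends only on the function and its arguments, a second caller with access to the same store would get the stored result without executing the function.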
Warning: Whether or not memoized results are available at the fully-derived In other words, no error will be raised if the result is not already present. This is a non-destructive re-use of the namespace, because no existing results will be modified in any way - but an existing blob store is never immune to modification if provided to
In cases where all you want is memoization (you don’t care about transferring execution to a truly remote execution environment), you can use @pure.magic() with no further customization.
In order to share results across machines, you’ll want to configure at minimum the blob root to point to a shared blob store.
The other bits and pieces and their purposes are documented below.
At the time of function call, a fully-qualified memo_uri is derived from the following components, which are either chosen explicitly by the user, determined by names in code, or which arise from @pure.magic defaults.
- A blob root (non-optional, user-controlled, has a pure.magic default).
  pure.magic sets the default blob root as $HOME/.mops-root.
- The pipeline_id (non-optional, user-controlled, defaults provided either by pure.magic or by local system state).
  More documentation about pipeline ids may be found here, but you should conceptualize the pipeline id as representing a grouping mechanism within your application.
- A function_id (defined as a constant by derivation from the function’s __module__ + __name__ - not user-controllable).
  This exists to keep identical (*args, **kwargs) separate from each other if passed to different functions. There is no API for this; it’s derived automatically.
  Warning: If you rename your function, all previous memoized results will no longer be 'found' for that function.
- A function-logic-key (optional, user-controlled, defaults to empty string).
  The function logic key, if any, will be automatically extracted when present from the docstring of the top-level function or any function passed as an argument (however nested) to the top-level function.
  The intent of the function-logic-key is to allow you to annotate your function’s logic as having changed (and therefore invalidating previous memoization) without renaming your function (as, in common software development practice, the name of a function is often a high-level name, not subject to change every time the function is modified to produce different outputs).

      @pure.use_runner(...)
      def barbaz():
          """does some stuff.

          function-logic-key: 241024-v2
          """
          ...

  Function logic keys MUST NOT contain spaces - any whitespace character will be interpreted as the end of the key.
  A function may have both a pipeline-id and function-logic-key annotation in its docstring. The order does not matter, but each should be on a separate line.
- Hash of function arguments (deterministically defined based on actual arguments passed to the function - not user-controllable).
  This can’t be affected in any way other than passing different arguments. Don’t even think about it.
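To make the docstring annotation concrete, here is a minimal sketch of how a function-logic-key could be pulled out of a docstring. The regular expression and the helper name sketch_extract_logic_key are assumptions for illustration, not the pattern mops itself uses; the sketch only reflects the stated rule that any whitespace ends the key.

```python
import re

# Assumption: the key is written as "function-logic-key: <key>" in the docstring,
# as in the barbaz example; this regex is illustrative, not the one mops uses.
_LOGIC_KEY = re.compile(r"function-logic-key:\s*(\S+)")


def sketch_extract_logic_key(func):
    """Hypothetical helper: return the function-logic-key, or '' if absent."""
    match = _LOGIC_KEY.search(func.__doc__ or "")
    return match.group(1) if match else ""


def barbaz():
    """does some stuff.

    function-logic-key: 241024-v2
    """


def no_key():
    """does other stuff."""


print(sketch_extract_logic_key(barbaz))
print(repr(sketch_extract_logic_key(no_key)))
```

Changing the key in the docstring changes the extracted value, which is what lets a key bump separate new results from old ones without renaming the function.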
This is an example full memo_uri with all its constituent parts labeled. You’ll find most of these names directly in the source code. For memoization to retrieve an existing result, the full constructed memo uri must be retrievable from the provided storage system.
adls://thdsscratch/tmp/mops2-mpf/Peter-Gaultney/2023-04-12T15:46:24-p36529/demandforecast.extract:extract_asset_geo_level/CoastOilAsset.IVZ9KplQKlNgxQHav0jIMUS9p4Kbn3N481e0Uvs/
<blob root ---------->
<runner prefix ----------------> <pipeline_id ---------------------------> <function_id --------------------------------> <(args, kwargs) sha256 hash ------------------------>
<pipeline memospace ----------------------------------------------------->
<function memospace ---------------------------------------------------------------------------------------------------->
<invocation-unique-key ---------------------------------------------------------------------------------------------------------------------->
<memo uri -------------------------------------------------------------------------------------------------------------------------------------------------------------------->
Note that the invocation-unique-key is a way of uniquely identifying a function invocation solely by reference to the user-controllable, storage-agnostic elements of the memo_uri.
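The labeled example can also be read programmatically. The sketch below splits the example memo_uri into its named parts; the helper sketch_split_memo_uri is hypothetical, and it assumes the layout shown in the example (runner prefix mops2-mpf, with the function_id and argument hash as the final two segments).

```python
MEMO_URI = (
    "adls://thdsscratch/tmp/mops2-mpf/Peter-Gaultney/2023-04-12T15:46:24-p36529/"
    "demandforecast.extract:extract_asset_geo_level/"
    "CoastOilAsset.IVZ9KplQKlNgxQHav0jIMUS9p4Kbn3N481e0Uvs/"
)
RUNNER_PREFIX = "mops2-mpf"


def sketch_split_memo_uri(memo_uri):
    """Hypothetical helper: label the parts of a memo_uri shaped like the example."""
    blob_root, _, rest = memo_uri.partition(f"/{RUNNER_PREFIX}/")
    parts = rest.strip("/").split("/")
    # Assumption: the last two segments are the function_id and the args hash;
    # everything before them is the pipeline_id, which may itself contain slashes.
    *pipeline_parts, function_id, args_hash = parts
    return {
        "blob_root": blob_root,
        "pipeline_id": "/".join(pipeline_parts),
        "function_id": function_id,
        "args_hash": args_hash,
    }


for name, value in sketch_split_memo_uri(MEMO_URI).items():
    print(f"{name}: {value}")
```

Joining the pipeline_id, function_id, and args_hash back together with slashes gives the portion of the URI that does not depend on where the blob store lives.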
In general, the blob root and pipeline id should be encoded either in your code (often preferable and less 'spooky') or in some kind of config that gets loaded into your code at runtime. So the information above is mostly about 'understanding what they do.'
However, if you want to call a function and get a result that you know already exists (was run previously and therefore memoized by MemoizingPicklingRunner), and you don’t wish to change your current code, you have several options, which are documented separately here.