Skip to content

Latest commit

 

History

History
173 lines (140 loc) · 8.23 KB

memoization.adoc

File metadata and controls

173 lines (140 loc) · 8.23 KB

Memoization/caching

A pure function is a fully-specified computation with a deterministic result based on its arguments. If the function and its arguments remain the same, then a future call to that function, even from a different process on a different computer, should be able to return the previously-computed result rather than re-computing.

One scenario where this might be useful would be in the case of some kind of failure of your long-running orchestrator process. Another scenario would be the architectural approach of sharing results that you know someone else has already computed.
Note
If you aren’t using the pure.magic API, a unique pipeline id is generated for every application run, so no results will be memoized across processes. See pipeline ids for how to change this.

Conceptual system in mops:

At the time of function call, each function wrapped with mops MemoizingPicklingRunner (or @pure.magic) will combine:

  • the current blob root

  • a mops-specific prefix, mops2-mpf, which exists to keep the root namespace organized

  • the current pipeline_id

  • the fully-qualfied name (including module) of the function

  • the function-logic-key on the docstring (if any)

  • and the hash of the serialized arguments (input) to the function

to produce a deterministic remote storage location for the invocation and also for the eventual results.

Why so many bits and pieces?

When you think about a function in the common sense, it’s easy to forget that, mathematically, a function isn’t just a named bit of code - it’s a unique and immutable transform, of the input domain into the output codomain.

But in code, we change the code for functions (or the other functions they call) all the time. We don’t ordinarily rename the function afterward, but the actual effect of the function has changed - technically it’s a totally new function!

mops provides several different approaches to making it easier to map the mathematical universe onto the everyday one that programmers inhabit. Perhaps too many approaches - they certainly overlap a bit at points. But collectively they provide a set of tools to making your functions and their results a bit easier to wrangle, organize, and reason about.

In its entirety, the unique invocation of your function with its arguments hash constitutes the memo_uri - a mops-controlled namespace for the specific function under the specific circumstances of its single invocation.

Before invoking the function remotely, the MemoizingPicklingRunner will check to see if a result already exists at the expected path, and if it does, that result will be returned instead of running the remote computation.

This allows function results to be reused across time regardless of who calls the function or when they call it.

Warning

Whether or not memoized results are available at the fully-derived memo_uri will not prevent new, unmemoized results from being computed and inserted into the blob store.

In other words, no error will be raised if the result is not already present. This is a non-destructive re-use of the namespace, because no existing results will be modified in any way - but an existing blob store is never immune to modification if provided to mops.

In cases where all you want is memoization (you don’t care about transferring execution to a truly remote execution environment), you can use @pure.magic() with no further customization.

In order to share results across machines, you’ll want to configure at minimum the blob root to point to a shared blob store.

The other bits and pieces and their purposes are documented below.

Configurability

At the time of function call, a fully-qualified memo_uri is derived from the following components which are either chosen explicitly by the user, determined by names in code , or which arise from @pure.magic defaults`.

  1. A blob root (non-optional, user-controlled, has pure.magic default). pure.magic sets the default blob root as $HOME/.mops-root.

  2. The pipeline_id (non-optional, user-controlled, defaults provided either by pure.magic or by local system state)

    More documentation about pipeline ids may be be found here, but you you should conceptualize the pipeline id as representing a grouping mechanism. within your application.

  3. A function_id (defined as a constant by derivation from the function’s __module__ + __name__ - not user-controllable).

    This exists to keep identical (*args, **kwargs) separate from each other if passed to different functions. There is no API for this; it’s derived automatically.

    Warning
    If you rename your function, all previous memoized results will no longer be 'found' for that function.
  4. A function-logic-key (optional, user-controlled, defaults to empty string)

    The function logic key, if any, will be automatically extracted when present from the docstring of the top-level function or any function passed as an argument (however nested) to the top-level function.

    The intent of the function-logic-key is to allow you to annotate your function’s logic as having changed (and therefore invalidating previous memoization) without renaming your function (as, in common software development practice, the name of a function is often a high-level name, not subject to change every time the function is modified to produce different outputs).

    @pure.use_runner(...)
    def barbaz():
       """does some stuff.
    
       function-logic-key: 241024-v2
       """
       ...

    Function logic keys MUST NOT contain spaces - any whitespace character will be interpreted as the end of the key.

    A function may have both a pipeline-id and function-logic-key annotation in its docstring. The order does not matter, but each should be on a separate line.

  5. Hash of function arguments (deterministically defined based on actual arguments passed to the function - not user-controllable)

    This can’t be affected in any way other than passing different arguments. Don’t even think about it.

Memospace parts

This is an example full memo_uri with all its constituent parts labeled. You’ll find most of these names directly in the source code. For memoization to retrieve an existing result, the full constructed memo uri must be retrievable from the provided storage system.

adls://thdsscratch/tmp/mops2-mpf/Peter-Gaultney/2023-04-12T15:46:24-p36529/demandforecast.extract:extract_asset_geo_level/CoastOilAsset.IVZ9KplQKlNgxQHav0jIMUS9p4Kbn3N481e0Uvs/
<blob root ---------->
<runner prefix ----------------> <pipeline_id ---------------------------> <function_id --------------------------------> <(args, kwargs) sha256 hash ------------------------>
<pipeline memospace ----------------------------------------------------->
<function memospace ---------------------------------------------------------------------------------------------------->
                                 <invocation-unique-key ---------------------------------------------------------------------------------------------------------------------->
<memo uri -------------------------------------------------------------------------------------------------------------------------------------------------------------------->

Note that the invocation-unique-key is a way of uniquely identifying a function invocation solely by reference to the user-controllable, storage-agnostic elements of the memo_uri.

Advanced Usage

In general, the blob root and pipeline id should be encoded either in your code (often preferable and less 'spooky') or in some kind of config that gets loaded into your code at runtime. So the information above is mostly about 'understanding what they do.'

However, if you want to call a function and get a result that you know already exists (was run previously and therefore memoized by MemoizingPicklingRunner), and you don’t wish to change your current code, you have several options, which are documented separately here.