mops and Pure functions

The overarching approach of mops is to realize the potential for parallelizing functions using simple and scalable architectural primitives. A function that depends on nothing but its input parameter(s) and produces no meaningful result other than a returned value can easily be moved into a different runtime context, and is therefore easily parallelized. This is the principle behind multiprocessing.Pool, behind the MapReduce paradigm, and behind many other implementations of the same idea.
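As a minimal sketch of that idea, assuming Python's standard-library multiprocessing.Pool as the parallel runtime (mops itself is not used here, and scale, run_serial, and run_parallel are hypothetical names chosen for illustration):

```python
from multiprocessing import Pool


def scale(value: float, factor: float) -> float:
    """Pure: the result depends only on the arguments and nothing is mutated."""
    return value * factor


def run_serial(pairs):
    # Evaluate every call in the current process.
    return [scale(value, factor) for value, factor in pairs]


def run_parallel(pairs, processes=2):
    # The same calls, but each one may execute in a different worker process.
    # This only works because scale() needs nothing beyond its arguments.
    with Pool(processes=processes) as pool:
        return pool.starmap(scale, pairs)
```

Because scale reads nothing but its arguments, run_parallel(pairs) should return the same list as run_serial(pairs). On platforms that start workers with the spawn method, call run_parallel from under an `if __name__ == "__main__":` guard.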

What this means for the user of this system is that you’re going to have an easy time if your computation is already encapsulated in a pure function, and a progressively harder time (more work to be done) the less pure your existing computation is. You will almost certainly need to refactor out any existing instances of impurity.

Practically speaking, some things to avoid/remove from your functions:

  • Global mutable variables.

  • Global constants (prefer passing the value of the constant as a function argument).

  • Direct use of environment variables (again, pass values instead).

  • Code that modifies input arguments, e.g. a dict or list.

  • Any kind of randomness, including datetime.now() and friends.

  • Non-random things that present to pickle as random, including the built-in set.[1]

  • References to files or other state external to the function (with exceptions for explicit passing of filesystem primitives as described here, or via core.Source).

  • References to large amounts of static reference data. If possible, select only the data you need before passing it to the function. If not, see the docs on large shared objects.

  • Returning a result that does not contain everything you might possibly want to know about the completed computation.
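To make several items in the list above concrete, here is a hedged before/after sketch. The function names, the REGION environment variable, and the record layout are all hypothetical, and returning a sorted list in place of a set is just one possible workaround, not something mops prescribes:

```python
import os
from datetime import datetime, timezone

DEFAULT_REGION = "us-east"  # hypothetical global constant


def tag_records_impure(records):
    """Impure: hidden inputs, a changing timestamp, mutation, empty result."""
    region = os.environ.get("REGION", DEFAULT_REGION)  # env var + global constant
    stamp = datetime.now(timezone.utc).isoformat()     # different on every call
    for record in records:
        record["region"] = region                      # modifies the caller's dicts
        record["tagged_at"] = stamp
    return None                                        # tells the caller nothing


def tag_records(records, region, tagged_at):
    """Pure: every input is an argument and every outcome is in the result."""
    # Build new dicts instead of modifying the ones passed in.
    tagged = [{**record, "region": region, "tagged_at": tagged_at} for record in records]
    # Return unique values as a sorted list rather than a set, so the
    # serialized form has a stable order.
    kinds = sorted({record["kind"] for record in records})
    return {"records": tagged, "count": len(tagged), "kinds": kinds}
```

The caller, not the function, decides where region and tagged_at come from (for example by reading the environment and the clock once, before any work is dispatched), so repeated calls with the same arguments return equal results.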

By implementing pure functions, we can keep the What (the business logic of your function) separate from the How (the details of what environment it runs on), which not only enables plug-and-play parallelization but also makes your code much easier to read and reason about.

Ultimately, if your functions are truly pure, you won’t even need mops in the long run: you’ll be able to find other off-the-shelf libraries and frameworks that will let you parallelize your computation. mops strives to be an implementation detail that your code is not tied to, which is itself a valuable 'feature'.

Logical (as opposed to theoretical) purity

A function that is theoretically/truly pure is one that performs absolutely no side effects. In a distributed system, this may be difficult to achieve for various reasons.

A function that is logically pure may perform side effects such as logging, uploading or downloading data, writing to temporary files, performing arbitrary network communication, etc., as long as the results of those side effects are reasonably believed to be deterministic from the point of view of the caller of the function.

In other words, if calling your 'logically pure' function with the same arguments will result in the same return value each time, and the side effects used to produce that result will not interfere with the operation of other functions, then you may consider your function to be pure and it should work fine with mops.
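A small sketch of what logical purity can look like in practice, using only the Python standard library. The function name is hypothetical; it logs and writes a scratch file, yet the caller only ever observes a deterministic return value:

```python
import hashlib
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def checksum_via_scratch_file(payload: bytes) -> str:
    """Logically pure: has side effects, but the same payload gives the same digest."""
    logger.info("hashing %d bytes", len(payload))   # side effect: logging
    # The scratch directory is private to this call and removed on exit,
    # so it cannot interfere with other functions running at the same time.
    with tempfile.TemporaryDirectory() as scratch:
        path = Path(scratch) / "payload.bin"
        path.write_bytes(payload)                   # side effect: file I/O
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest
```

The temporary file and the log line are not visible in the return value and leave nothing behind that another call could depend on, which is what lets the function be treated as pure from the caller's point of view.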


1. We expect to eventually support set transparently with a serialization shim, but this has not yet made it to the top of the backlog.