Experiment design, deployment, and optimization
EXP is a Python experiment management toolset created to simplify two common tasks: designing and deploying experiments in the form of Python modules/files.
An experiment is a series of runs of a given configurable module for a specified set of parameters. This tool covers one of the most prevalent experiment deployment scenarios: testing a set of parameters in parallel on a local machine or homogeneous cluster. EXP also supports global optimization using Gaussian processes or other surrogate models such as random forests. This can be used, for instance, for hyperparameter tuning of machine learning models.
- parameter space design based on configuration files (TOML format);
- parallel experiment deployment using multiprocessing processes;
- CUDA GPU workers, with one parallel process per available GPU (uses the variable CUDA_VISIBLE_DEVICES);
- global optimization from parameter spaces (e.g. for hyperparameter tuning) using scikit-optimize.
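As a rough illustration of the GPU worker idea in the list above, the following sketch shows one way a per-worker device assignment could be derived from CUDA_VISIBLE_DEVICES. The helpers `visible_gpu_ids` and `worker_env` are hypothetical and are not EXP's implementation; the sketch only assumes that the variable holds a comma-separated list of device ids.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [gpu.strip() for gpu in raw.split(",") if gpu.strip()]


def worker_env(gpu_id, env=None):
    """Build an environment for one worker that can only see a single GPU."""
    base = dict(os.environ if env is None else env)
    base["CUDA_VISIBLE_DEVICES"] = gpu_id
    return base


# One worker per visible GPU: each worker process would be started with its own env.
example_env = {"CUDA_VISIBLE_DEVICES": "0,1,3"}
gpus = visible_gpu_ids(example_env)
envs = [worker_env(gpu, example_env) for gpu in gpus]
```

Restricting each process to a single device this way means the code inside the worker does not need to know which physical GPU it was given.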
pip install exp
or, with pipenv:
pipenv install exp
EXP provides two CLI modules:
- exp.run:
python -m exp.run -p basic.conf -m runnable.py --workers 10
- exp.gopt:
python -m exp.gopt -p basic.conf -m runnable.py --workers 4 -n 100 --plot
For more information, check each command's help:
python -m exp.run -h
The first step is to create a module to use in our experiments. A basic configurable module runnable.py looks like this:
def run(x=1, **kwargs):
    return x ** 2
This module computes the square of a parameter x. Note that kwargs is included to capture other parameters that the experiment runner might use (even if they are not used by your module). Since run receives a dictionary of keyword arguments, you could also define it as follows:
def run(**kwargs):
    x = kwargs.get('x', 1)
    return x ** 2
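A quick way to check such a module before handing it to the runner is to call run directly with the kind of keyword arguments it might receive. The extra "id" and "run" keys used here are only illustrative; they mirror the additional parameters that configuration dictionaries carry.

```python
def run(x=1, **kwargs):
    # extra keyword arguments (e.g. "id", "run") are accepted and ignored
    return x ** 2


# simulate one configuration passed as a dictionary of parameters
config = {"x": 3, "id": 0, "run": 1}
result = run(**config)
print(result)
```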
Next, we need a configuration file basic.conf where the parameters are specified:
[x]
type = "range"
bounds = [-10,10]
This defines a parameter space with a single parameter x with values in the range [-10,10]. For how to specify parameter spaces, see the Parameter Space Specification.
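To see what such a configuration looks like once parsed, the sketch below reads the same TOML text with Python's standard tomllib module (Python 3.11+, with the tomli backport as an assumed fallback). It only shows how each TOML table maps to a parameter name and its specification; it is not how EXP itself loads the file.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:  # older interpreters: assumes the tomli backport is installed
    import tomli as tomllib

BASIC_CONF = """
[x]
type = "range"
bounds = [-10,10]
"""

# each TOML table becomes one entry: parameter name -> specification
spec = tomllib.loads(BASIC_CONF)
for name, param in spec.items():
    print(name, param["type"], param["bounds"])
```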
Our simple module returns x**2, and the optimizer tries to find the minimum value of this function based on the parameter space given by the configuration file. In this case, the optimizer will look at values of x in [-10,10] and try to find the minimum value.
python -m exp.gopt --params basic.conf --module runnable.py --n 20 --workers 4
This command finds a solution very close to 0. By default, the optimizer assumes a range defines the boundaries of a real-valued variable. If you wish to optimize discrete integers, use the following specification:
[x]
type = "range"
bounds = [-10,10]
dtype = "int"
The optimizer will explore discrete values between -10 and 10 inclusively. Also, using the --plot flag displays a real-time convergence plot for the optimization process.
In this case, the optimization converges immediately because the function being optimized is quite simple, but the goal is to optimize complex models, choosing from a large set of parameters without having to run an exhaustive search through all possible parameter combinations.
Parameter space files use the TOML format. I recommend taking a look at the specification and getting familiar with how to define values, arrays, etc. ParamSpaces in EXP have 4 types of parameters, namely:
- value: single value parameter;
- range: a range of numbers between bounds;
- random: a random real/int value between bounds;
- list: a list of values (used for example to specify categorical parameters).
Below, I supply an example for each type of parameter:
Single value parameter.
# this is a single-valued parameter with a boolean value
[some_param]
type = "value"
value = true
A parameter with a set of values within a range.
# TOML files can handle comments which is useful to document experiment configurations
[some_range_param]
type = "range"
bounds = [-10,10]
step = 1 # this is optional and assumed to be 1
dtype = "float" # also optional and assumed to be float
The commands run and gopt will treat this parameter definition differently. The optimizer will explore values within the bounds, including the end-points. The runner will take values between bounds[0] and bounds[1], excluding the last end-point (much like a Python range or NumPy arange).
The dtype also influences how the optimizer looks for values in the range: if set to "int", it explores discrete integer values within the bounds; if set to "float", it assumes the parameter takes a continuous value between the specified bounds.
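The difference in end-point handling can be illustrated with plain Python. This is only a sketch of the semantics described above, using integers for readability; it does not call EXP.

```python
low, high, step = -10, 10, 1

# runner-style grid: like Python's range, the high end-point is excluded
grid_values = list(range(low, high, step))

# optimizer-style search interval: both end-points are candidates
optimizer_interval = (low, high)

print(len(grid_values), grid_values[0], grid_values[-1])
```

If the runner needs to include the upper bound in its grid, the bound in the configuration has to be set one step higher.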
A parameter with n random values sampled from a "uniform" or "log-uniform" distribution between the given bounds. If used with run, the parameter space will be populated with a list of random values according to the specification. If used with gopt, n is ignored and the bounds are used instead, along with the prior.
For optimization purposes, this works like range, except that you can specify the prior, which can be "uniform" or "log-uniform". A range parameter assumes that values are generated from a "uniform" prior when it is used for optimization.
The other difference between parameter grids and optimization is that the bounds do not include the end-points when generating parameter values for grid search. The optimizer will explore random values within the bounds specified, including the high end-point.
[random_param]
type="random"
bounds=[0,3] # optional, default range is [0,1]
prior="uniform" # optional, default value is "uniform"
dtype="float" # optional, default value is "float"
n=1 # optional, default value is 1 (number of random parameters to be sampled)
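For intuition about the two priors, here is a minimal sketch of uniform and log-uniform sampling using only the standard library. The `sample` function is a hypothetical helper, not part of EXP, and the log-uniform branch assumes strictly positive bounds.

```python
import math
import random


def sample(low, high, prior="uniform", rng=random):
    """Draw one value between low and high under the given prior (illustrative only)."""
    if prior == "uniform":
        return rng.uniform(low, high)
    if prior == "log-uniform":
        # uniform in log space; requires 0 < low < high
        return math.exp(rng.uniform(math.log(low), math.log(high)))
    raise ValueError(f"unknown prior: {prior}")


rng = random.Random(0)  # seeded so that repeated runs draw the same values
uniform_draws = [sample(0, 3, "uniform", rng) for _ in range(5)]
log_draws = [sample(1e-4, 1e-1, "log-uniform", rng) for _ in range(5)]
```

A log-uniform prior spreads samples evenly across orders of magnitude, which is why it is commonly used for parameters such as learning rates.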
A list is just a homogeneous series of values a parameter can take.
[another_param]
type="list"
value = [1,2,3]
The array in "value" must be homogeneous; something like value=[1,2,"A"] would throw a Not a homogeneous array error. List parameters are treated by the gopt command as categorical parameters, which are encoded using one-hot encoding for optimization.
Also, for optimization purposes, a list is treated like a set: if you provide duplicate values, it will only explore the unique values. For example, if you want to specify a boolean parameter, use a list:
[some_boolean_decision]
type="list"
value = [true,false]
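The following sketch shows what one-hot encoding over the unique values of a list looks like. The `one_hot` helper is hypothetical and written only for illustration; the optimizer's own encoding may order or represent categories differently.

```python
def one_hot(value, categories):
    """Encode value as a one-hot vector over the unique categories."""
    unique = list(dict.fromkeys(categories))  # drop duplicates, keep first-seen order
    if value not in unique:
        raise ValueError(f"{value!r} is not one of {unique}")
    return [1 if value == category else 0 for category in unique]


categories = ["A", "B", "A", "C"]  # the duplicate "A" is ignored
encoded = {c: one_hot(c, categories) for c in categories}
```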
EXP also provides different tools to specify parameter spaces programmatically.
The exp.params.ParamSpace class provides a way to create parameter spaces and iterate over all the possible combinations of parameters as follows:
>>> from exp.params import ParamSpace
>>> ps = ParamSpace()
>>> ps.add_value("p1", 1)
>>> ps.add_list("p2", [True, False])
>>> ps.add_range("p3", low=0, high=10, dtype=int)
>>> ps.size
20
grid = ps.param_grid(runs=2)
grid has 2*ps.size configurations because we repeat each configuration 2 times (number of runs). Each configuration dictionary includes 2 additional parameters, "id" and "run", which are the unique configuration id and run id respectively.
for config in grid:
    # config is a dictionary with the params of a unique configuration in the parameter space
    do_something(config)
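To make the shape of the grid concrete, here is a standalone sketch that builds an equivalent set of configurations with itertools.product. The `space` dictionary and this `param_grid` function are hypothetical stand-ins rather than EXP's code, and the way "id" and "run" are numbered is an assumption.

```python
from itertools import product

# hypothetical stand-in for the space built above: p1 value, p2 list, p3 range
space = {
    "p1": [1],
    "p2": [True, False],
    "p3": list(range(0, 10)),  # high end-point excluded, as in the runner
}


def param_grid(space, runs=1):
    """Yield one dict per (configuration, run) pair; id/run numbering is an assumption."""
    names = list(space)
    for config_id, values in enumerate(product(*(space[n] for n in names))):
        for run_id in range(1, runs + 1):
            config = dict(zip(names, values))
            config["id"] = config_id
            config["run"] = run_id
            yield config


grid = list(param_grid(space, runs=2))
print(len(grid))
```

The length of this list is the product of the number of values per parameter multiplied by the number of runs, which matches the 2*ps.size count described above.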
ParamDict from the exp.args module is a very simple dictionary where you can specify default values for different parameters. exp.args.Param is a named tuple (typefn, default, options), where typefn is a type function like int or float that transforms strings into values of the given type if necessary, default is a default value, and options is a list of possible values for the parameter.
This is just a very simple alternative to using argparse with a lot of parameters. Example of usage:
from exp.args import ParamDict, Namespace
# these are interpreted by a ParamDict as an exp.args.Param named tuple
param_spec = {
    'x': (float, None),
    'id': (int, 0),
    'run': (int, 1),
    'cat': (str, "A", ["A", "B"])
}
def run(**kargs):
    args = ParamDict(param_spec)  # creates a param dict from default values and options
    args.from_dict(kargs)  # updates the dictionary with new values where the parameter name overlaps
    ns = args.to_namespace()  # creates a namespace object so you can access ns.x, ns.run etc
    ...
Another nice thing is that there are basic type conversions from string to boolean, int, float, etc. Depending on the arguments received in kwargs, ParamDict converts the values automatically according to the parameter specifications.
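The idea behind this kind of conversion can be sketched without EXP. In the example below, `Param`, `str_to_bool`, and `convert` are hypothetical stand-ins that follow the (typefn, default, options) layout described above; they are not the library's implementation.

```python
from collections import namedtuple

# hypothetical stand-in for exp.args.Param, following the (typefn, default, options) layout
Param = namedtuple("Param", ["typefn", "default", "options"], defaults=[None])


def str_to_bool(value):
    """Parse common boolean spellings; bool("False") would wrongly be True."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise ValueError(f"not a boolean: {value!r}")


def convert(spec, received):
    """Fill defaults, convert types, and check options for the declared parameters."""
    result = {}
    for name, raw in spec.items():
        param = Param(*raw)
        value = received.get(name, param.default)
        if value is not None:
            value = param.typefn(value)
        if param.options is not None and value not in param.options:
            raise ValueError(f"{name}={value!r} not in {param.options}")
        result[name] = value
    return result


spec = {
    "x": (float, None),
    "run": (int, 1),
    "flag": (str_to_bool, False),
    "cat": (str, "A", ["A", "B"]),
}
args = convert(spec, {"x": "2.5", "flag": "false", "cat": "B"})
```

Booleans need a dedicated parser because calling bool on any non-empty string, including "False", returns True.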