Inspired by the Star Trek universe and following the Ferengi's 3rd rule of acquisition - "Never spend more for an acquisition than you have to," and the 74th rule - "Knowledge equals profit," we introduce two batch selection strategies for cost-efficient BO to find a good cost and yield increase compromise.
Bayesian Optimization (BO) taking batch cost into account. In this revised approach to BO, we optimize chemical experiments not just for their potential improvement in yield over the previous iteration, but also for the cost of performing them. Computationally simulated BO experiments result in selections that may overlook the varying costs of the chemicals involved. For instance, instead of acquiring a new substance, chemists might first study a reaction under varying conditions that can easily be controlled, such as temperature. Such experiments not only cost less but also give a better-informed posterior and higher confidence before new compounds are acquired. Our modified approach adds a crucial dimension to BO: the cost and ease of availability of each compound used at each batch iteration. Cost-informed BO thus mimics more closely the yield optimization process in a chemistry lab.
Python dependencies:
torch
gpytorch
botorch
rdkit
matplotlib
sklearn
numpy
It is best to create a new environment and then run
pip install -r requirements.txt
After setting up an environment with these packages, clone the repository (the path below assumes it is cloned into your home directory):
git clone git@github.com:janweinreich/rules_of_acquisition.git
Then add
export PYTHONPATH=$PYTHONPATH:$HOME/rules_of_acquisition
to your .bashrc file. Then, run
source ~/.bashrc
Currently tested with Python 3.8.16 and botorch 0.8.1.
Perform various regression tasks on the Pd-catalyzed C-H arylation dataset [1], resulting in a scatter plot with error bars (correlation.png).
All regressors are compatible with botorch:
- Gaussian Process Regression: GPR.py. Try the effect of different kernels: the Tanimoto kernel performs quite well and is the default choice. Other options include RBF, Matern and Linear.
- Random Forest Regression: ForestReg.py
- XgBoost Regression: XgBoostREG.py
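To illustrate what the Tanimoto kernel measures, here is a minimal, dependency-free sketch of the standard Tanimoto similarity between two fingerprint vectors. It is not the code in GPR.py; the function names and the convention for two all-zero vectors are assumptions made for this example.

```python
from typing import List, Sequence


def tanimoto_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Tanimoto similarity <x, y> / (|x|^2 + |y|^2 - <x, y>)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    # Assumption: two all-zero fingerprints are treated as identical.
    return 1.0 if denom == 0 else dot / denom


def tanimoto_kernel_matrix(
    X: Sequence[Sequence[float]], Y: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Pairwise Tanimoto similarities between two sets of fingerprints."""
    return [[tanimoto_similarity(x, y) for y in Y] for x in X]


# Two toy binary fingerprints sharing 2 of 4 set bits overall.
fp_a = [1, 1, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1]
similarity = tanimoto_similarity(fp_a, fp_b)
kernel = tanimoto_kernel_matrix([fp_a, fp_b], [fp_a, fp_b])
```

For binary fingerprints this reduces to the number of shared bits divided by the number of bits set in either fingerprint, which is why it is a natural choice for molecular representations.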
Use a modified acquisition function in botorch.
Empirically we find a good choice for the acquisition function value associated to each experiment
If a ligand was already included (by buying 1 g of the substance) we divide by
is selected.
Corresponds to a greedy strategy where the user has to define a maximal cost for the batch using the config file exp_configs.py. For example, setting max_batch_cost=100 means that a cost of at most 100 can be spent per iteration. If a suggested batch is more expensive, it is disregarded and the next-best batch that can be afforded is selected instead.
More on exp_configs.py below! If no batch can be afforded, take one where no new compounds are bought, meaning that only different temperatures or concentrations are measured.
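The greedy rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's implementation: the `CandidateBatch` class, the `pick_batch` function, and the example batches and costs are all hypothetical; only the `max_batch_cost` name comes from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateBatch:
    name: str                 # hypothetical label for the batch
    acquisition_value: float  # value assigned by the acquisition function
    cost: float               # cost of newly bought compounds in the batch


def pick_batch(
    candidates: List[CandidateBatch],
    max_batch_cost: float,
    fallback: Optional[CandidateBatch] = None,
) -> Optional[CandidateBatch]:
    """Return the best-ranked batch that fits the budget, else the fallback."""
    ranked = sorted(candidates, key=lambda b: b.acquisition_value, reverse=True)
    for batch in ranked:
        if batch.cost <= max_batch_cost:
            return batch
    # Nothing affordable: use a batch that buys no new compounds,
    # e.g. one that only varies temperature or concentration.
    return fallback


candidates = [
    CandidateBatch("new ligands A and B", 0.9, 250.0),
    CandidateBatch("new ligand A", 0.7, 120.0),
    CandidateBatch("new ligand C", 0.5, 80.0),
]
free_batch = CandidateBatch("vary temperature only", 0.2, 0.0)

chosen = pick_batch(candidates, max_batch_cost=100, fallback=free_batch)
chosen_tight = pick_batch(candidates, max_batch_cost=50, fallback=free_batch)
print(chosen.name, "|", chosen_tight.name)
```

With a budget of 100 the two most valuable batches are skipped as too expensive and the third is taken; with a budget of 50 nothing is affordable and the zero-cost batch is used.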
Contains all experiments that are shown in the SI of the paper. Documentation for that is not up to date (see below).
- Objective: To maximize acquisition value within budget limits.
- Process:
- Begin with the best batch as per the acquisition function.
- Sequentially evaluate affordability of subsequent batches.
- Objective: To find a feasible batch within budget constraints.
- Process:
- Content: Space for experimental or outdated items.
- Description: Implements the Greedy algorithm with fixed sample costs.
- Subfolders:
  - values: Focuses on ligands with the lowest yield across conditions.
  - distance: Divides dataset by proximity to the best ligand in feature space, starting with the furthest half.
- Start with the desired batch size.
- Increase batch size incrementally if the current batch is not affordable.
- Subselect from suggested batches to fit the budget.
- Description: Greedy algorithm considering variable costs. Acquiring a sample reduces its cost to zero for subsequent changes like temperature or solvent adjustments.
- Start with the desired batch size.
- Increase batch size incrementally if the current batch is not affordable.
- Subselect from suggested batches to fit the budget.
- Description: Cost based on similarity to previously synthesized compounds. Assumes new compounds are synthesized, not purchased, with cost reflecting synthetic difficulty.
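The batch-growing procedure from the greedy variants above (start with the desired batch size, enlarge the suggested batch if it is not affordable, then subselect to fit the budget) could look roughly like the sketch below. It assumes the variable-cost setting in which a ligand, once bought, costs nothing for further experiments. The function name, the data layout, and the prices are hypothetical and not taken from the repository.

```python
from typing import Dict, List, Set, Tuple

Experiment = Tuple[str, str]  # (ligand, condition), a hypothetical layout


def select_batch(
    ranked: List[Experiment],   # experiments sorted best-first by acquisition value
    prices: Dict[str, float],   # price of buying each ligand once
    owned: Set[str],            # ligands bought in earlier iterations
    batch_size: int,
    budget: float,
) -> Tuple[List[Experiment], float]:
    """Grow the suggested batch until batch_size affordable experiments are found."""
    if batch_size <= 0:
        return [], 0.0
    chosen: List[Experiment] = []
    spent = 0.0
    size = min(batch_size, len(ranked))
    while size <= len(ranked):
        chosen, spent, bought = [], 0.0, set()
        for ligand, condition in ranked[:size]:
            # A ligand already owned or bought in this batch is free.
            already_paid = ligand in owned or ligand in bought
            extra = 0.0 if already_paid else prices[ligand]
            if spent + extra > budget:
                continue  # skip experiments that would exceed the budget
            chosen.append((ligand, condition))
            spent += extra
            bought.add(ligand)
            if len(chosen) == batch_size:
                return chosen, spent
        size += 1  # suggested batch not affordable: consider one more candidate
    return chosen, spent  # may hold fewer than batch_size experiments


ranked = [("L1", "100C"), ("L2", "100C"), ("L1", "120C"), ("L3", "100C"), ("L2", "120C")]
prices = {"L1": 80.0, "L2": 50.0, "L3": 10.0}

batch, cost = select_batch(ranked, prices, owned=set(), batch_size=3, budget=100.0)
batch_owned, cost_owned = select_batch(ranked, prices, owned={"L2"}, batch_size=3, budget=100.0)
```

In the first call L2 is skipped because buying both L1 and L2 would exceed the budget, so the batch is filled with a second, free experiment on L1 and the cheap ligand L3. In the second call L2 is already owned, so the top three suggestions are affordable as they are.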
We welcome contributions and suggestions!
This project is licensed under the MIT License
[1] Shields, B. J.; Stevens, J.; Li, J.; Parasram, M.; Damani, F.; Alvarado, J. I. M.; Janey, J. M.; Adams, R. P.; Doyle, A. G. Nature 2021, 590, 89–96