This repository contains the human subject study data and the code used in the paper Designing Decision Support Systems Using Counterfactual Prediction Sets, which was presented at the ICML 2023 AI & HCI Workshop (Best Paper Award).
The dataset and a more detailed description of it are under study_data/.
Experiments were run with Python 3.10.9. To install the required libraries, run:
pip install -r requirements.txt
- algorithms/successive_elimination.py implements vanilla successive elimination, assumption-free counterfactual successive elimination, and counterfactual successive elimination.
- algorithms/ucb.py implements vanilla $\texttt{UCB1}$, assumption-free counterfactual $\texttt{UCB1}$, and counterfactual $\texttt{UCB1}$.
- config.py includes the configuration of the experimental setup.
- utils.py implements data splitting as well as helper functions.
- models/model.py implements the model used by all decision support systems.
- conformal_prediction.py implements all conformal predictors given a model and a fixed calibration set.
- robustness/ includes the datasets with violations of interventional monotonicity, as well as the script to produce them:
  - datasets/permutation_violations is the directory under which the datasets with violations are saved.
  - create_violations.py creates datasets with monotonicity violations on different fractions (30%, 60%, 100%) of the data.
- scripts/ includes the following scripts to execute the algorithms:
  - batch_run.py executes all bandit algorithms for 30 realizations given the same calibration set.
  - run_bandit.py executes one realization of one algorithm given a calibration set. It can also optionally be used to compute the empirical expert success probability for each decision support system, i.e., for each $\alpha$ value, under the strict implementation, given a fixed calibration set.
- scripts/ includes the following scripts to evaluate the performance of experts under the strict and lenient implementation:
  - test_other.py computes the empirical expert success probability for each decision support system, i.e., for each $\alpha$ value, under the lenient implementation, given a fixed calibration set.
  - misplaced_trust_loss.py computes, for each $\alpha$ value under the lenient implementation and given a fixed calibration set, the number of expert predictions in which the prediction sets do not contain the true label and the experts succeed, as well as the total number of expert predictions in which the experts misplace their trust.
- plotters/ includes the following scripts to produce the plots in the paper:
  - monotonicity.py produces the plots related to the empirical expert success probability per prediction set size for images of similar difficulty across all experts and across experts with the same level of competence.
  - regret.py produces the empirical expected regret plot for all bandit algorithms.
  - lenient.py produces the plots comparing the expert performance under the strict and lenient implementation for each $\alpha$ value.
  - strict.py produces the plots showing the expert performance under the strict implementation for each $\alpha$ value.
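For readers new to conformal prediction, the following is a minimal sketch of how a split conformal predictor can turn model probabilities and a fixed calibration set into prediction sets for a given $\alpha$ value. It is not the code in conformal_prediction.py: the function names, the nonconformity score (one minus the probability of the true label), and the toy data are assumptions made for illustration only.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Hypothetical helper: score threshold computed from a calibration set."""
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Rank of the finite-sample corrected quantile among the calibration scores.
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: every label is included.
        return np.inf
    return np.sort(scores)[k - 1]

def prediction_set(probs, q_hat):
    """Indices of the labels whose score 1 - p does not exceed the threshold."""
    return np.flatnonzero(1.0 - probs <= q_hat)

# Toy data (assumed): 200 calibration points, 4 labels.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4), size=200)
cal_labels = np.array([rng.choice(4, p=p) for p in cal_probs])

q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
print(prediction_set(np.array([0.7, 0.2, 0.05, 0.05]), q_hat))
```

In this sketch, smaller $\alpha$ values give a larger threshold and therefore larger prediction sets, which is why each $\alpha$ value corresponds to a different decision support system.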
To execute all bandit algorithms for 30 realizations given a fixed calibration set run:
python -m scripts.batch_run
To execute a single bandit algorithm run:
python -m scripts.run_bandit --alg <algorithm> --seed_run <seed_run> --cal_run <cal_run>
where <algorithm> can be one of:
- SE: vanilla successive elimination.
- SE_no_mon: assumption-free counterfactual successive elimination.
- SE_ours: counterfactual successive elimination.
- UCB: vanilla $\texttt{UCB1}$.
- UCB_no_mon: assumption-free counterfactual $\texttt{UCB1}$.
- UCB_ours: counterfactual $\texttt{UCB1}$.
<seed_run> and <cal_run> can be any integers. They fix the random seeds for the random procedures in the realization of the algorithm and for the random selection of the calibration set, respectively. After each algorithm finishes, the arms it pulled during execution are saved under results/<algorithm>/ (they are required to compute the empirical expected regret).
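To make the notion of "arms pulled during the execution" concrete, here is a minimal sketch of textbook vanilla UCB1 on Bernoulli arms, where each arm stands for one $\alpha$ value and a reward of 1 stands for an expert success. It is not the code in algorithms/ucb.py: the function name, the simulated success probabilities, and the return format are assumptions, and the counterfactual variants are not shown.

```python
import math
import random

def ucb1(success_probs, horizon, seed=0):
    """Run vanilla UCB1 on simulated Bernoulli arms; return the pulled arms."""
    rng = random.Random(seed)
    n_arms = len(success_probs)
    counts = [0] * n_arms   # number of pulls per arm
    means = [0.0] * n_arms  # running mean reward per arm
    pulled = []
    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialization: pull every arm once.
            arm = t - 1
        else:
            # Pick the arm with the largest upper confidence bound.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        # Simulated expert outcome (assumed Bernoulli with a known probability).
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        pulled.append(arm)
    return pulled

pulled = ucb1([0.55, 0.7, 0.85], horizon=1000, seed=1)
print([pulled.count(a) for a in range(3)])
```

The returned list plays the role of the per-run record of pulled arms described above; the actual file format used under results/ is not assumed here.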
To produce the plots about the empirical success probability of experts per prediction set size for images of similar difficulty across all experts and across experts with the same level of competence run:
python -m plotters.monotonicity
To compute and plot the empirical expected regret for all bandit algorithms run:
python -m scripts.eval_plot_regret
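As a rough illustration of how an empirical expected regret curve can be obtained from the recorded arms, the sketch below averages the cumulative regret over realizations. It assumes the standard definition of regret (gap between the best arm's success probability and that of the pulled arm, summed over time) and that per-arm success probabilities are available; the function name and input layout are hypothetical and do not describe the repository's scripts.

```python
import numpy as np

def empirical_expected_regret(pulled_arms, success_probs):
    """Mean cumulative regret over realizations.

    pulled_arms:   (n_realizations, horizon) integer array of arm indices.
    success_probs: (n_arms,) success probability of each arm (assumed known).
    Returns an array of length horizon.
    """
    pulled_arms = np.asarray(pulled_arms)
    success_probs = np.asarray(success_probs, dtype=float)
    # Per-arm gap to the best arm.
    gaps = success_probs.max() - success_probs
    # Instantaneous regret of every pull, then cumulative sum over time.
    cumulative = gaps[pulled_arms].cumsum(axis=1)
    # Average across realizations.
    return cumulative.mean(axis=0)

# Two toy realizations over three rounds with two arms.
curve = empirical_expected_regret([[0, 1, 1], [1, 1, 0]], [0.5, 1.0])
print(curve)
```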
To compute and plot the empirical expert success probability for each
python -m scripts.eval_plot_strict_vs_lenient
All experiments use the data collected from the human subject study, which are in study_data/. We include a detailed description and the license of the data in study_data/README.md.
To create the datasets with violations of interventional monotonicity for
python -m robustness.create_violations
To execute all bandit algorithms for 30 realizations given a fixed calibration set on datasets with violations of interventional monotonicity, run:
python -m scripts.batch_run --pv <frac>
where <frac> is the fraction of the data with violations, i.e., the
To produce the plots about the empirical success probability of experts per prediction set size for images of similar difficulty across all experts for the datasets with interventional monotonicity violations, run:
python -m plotters.monotonicity --pv <frac>
where <frac> is the same as above.
To compute and plot the empirical expert success probability for each
python -m scripts.eval_plot_strict --pv <frac>
where <frac> is the same as above.
If you use parts of the code/data in this repository for your own research purposes, please consider citing:
@article{straitouri2023designing,
title={Designing Decision Support Systems Using Counterfactual Prediction Sets},
author={Straitouri, Eleni and Gomez-Rodriguez, Manuel},
journal={arXiv preprint arXiv:2306.03928},
year={2023}
}