Baleen: ML Admission & Prefetching for Flash Caches
Paper (Preprint) | Code | Data | Video walkthrough | Reproduce on Chameleon
This repository is targeted at those seeking to reproduce the results found in the Baleen paper and contains a frozen copy of the code. If you are looking to use Baleen, please go to https://github.com/wonglkd/BCacheSim/ for the latest version.
Scope: this repository contains Python code to reproduce the simulator results in the Baleen paper. The testbed code modified a proprietary internal version of CacheLib and will not be released at this time, pending a rebase on the open-source version of CacheLib. Another key difference is that Meta's exact constants for the disk head time function will not be released, meaning that results will not be exactly the same; instead, we use constants (seek time and bandwidth) measured on the hard disks in our university testbed.
Nomenclature: Some terms were renamed after coding for better clarity in the paper. However, they mean the same thing.
- Service Time (in the code) was renamed to Disk Head Time (in the paper)
- Chunks (in the code) are called segments (in the paper)
We have verified that our instructions work on Chameleon, and have recorded a video showing the setup of the environment and the reproduction of the instructions below (YouTube: http://tiny.cc/BaleenArtifactYT). This video shows the setup on Chameleon, the running of the instructions below and the running of all notebooks successfully run with no error cells.
Time estimate: 60 mins (20 mins interactive).
Time estimate: 30 minutes (10 mins interactive).
The recommended way is to use Chameleon Trovi, an academic cloud. Note that you will require an allocation; if you are affiliated with FAST, you can request to be added to the associated project (CHI-231080). To do this (and for any other issues with Chameleon), please contact the helpdesk at help@chameleoncloud.org.
- Launch artifact on Trovi
- (Optional) Open notebook
chameleon/1-getting-started.ipynb
which will walk you through the Getting Started section of this README. You may run one cell at a time, or click Run -> Run All Cells to execute all commands. If processes get killed, you need a dedicated server. - (Recommended) The shared JupyterHub has limited RAM/disk. Run notebook
chameleon/2-start-dedicated-server.ipynb
, which provisions a beefier node (for 7 days) that you can create a SSH tunnel to.
Alternatively, you may do a manual install. These commands are also available in getting-started.sh for your convenience.
- Clone the repository (if not already done)
git clone --recurse-submodules https://github.com/wonglkd/Baleen-FAST24.git
cd Baleen-FAST24
Note: this repository uses submodules. As a reminder, when you pull, you'll likely want to use git pull --recurse-submodules
.
- Install Python dependencies with Conda/Mamba/Micromamba or pip. (We developed with Micromamba 1.4.1.)
conda env create -f BCacheSim/install/env_cachelib-py-3.11.yaml
conda activate cachelib-py-3.11
# PyPy is optional (for faster non-ML runs)
# conda env create -f BCacheSim/install/env_cachelib-pypy-3.8.yaml
Alternatively, use pip:
python3 -m pip install --user -r BCacheSim/install/requirements.txt
- Download trace files (see here for more details on the traces)
cd data
bash get-tectonic.sh
Time estimate: 30 minutes (10 mins interactive).
- Manually run the simulator with the baseline RejectX. (4 mins)
./BCacheSim/run_py.sh py -B -m BCacheSim.cachesim.simulate_ap --config runs/example/rejectx/config.json
- Manually train Baleen's ML models (25 secs) and run the simulator with Baleen (~30 mins).
./BCacheSim/run_py.sh py -B -m BCacheSim.episodic_analysis.train --exp example --policy PolicyUtilityServiceTimeSize2 --region Region1 --sample-ratio 0.1 --sample-start 0 --trace-group 201910 --supplied-ea physical --target-wrs 34 50 100 75 20 10 60 90 30 --target-csizes 366.475 --output-base-dir runs/example/baleen --eviction-age 5892.856 --rl-init-kwargs filter_=prefetch --train-target-wr 35.599 --train-models admit prefetch --train-split-secs-start 0 --train-split-secs-end 86400 --ap-acc-cutoff 15 --ap-feat-subset meta+block+chunk
./BCacheSim/run_py.sh py -B -m BCacheSim.cachesim.simulate_ap --config runs/example/baleen/prefetch_ml-on-partial-hit/config.json
- Use notebooks/example/example.ipynb to view and plot results.
This section assumes you have completed the 'Getting Started' section and have installed the code and downloaded the traces.
As it requires too much computation time to rerun every single experiment, we suggest the following steps to maximize the use of reviewers' time in evaluating our paper. We supply our traces, code, and the intermediate results from our experimental runs.
Roadmap for evaluation:
- Test out Baleen's ML training & simulator (in Getting Started).
- What: simulate RejectX baseline, train Baleen models, simulate Baleen
- Expected results: notebooks/example/example.ipynb
- Plot graphs using our intermediate results.
- Example: notebooks/paper-figs/fig-01bc,17-202309.ipynb
- Select additional simulations to run if desired.
- See notebooks/reproduce/, in particular commands.ipynb and reproduce_commands.sh
- data: traces that are used as input
- runs: where experiment results are stored
- tmp: temporary directory for ML models, generated episode files
- notebooks: Jupyter notebooks for experiments
- notebooks/figs: Output directory for figures
624 machine-days were used for the final runs to generate the results used in the paper. Each simulation of a ML policy takes at least 30 minutes, multiplied by 7 traces and 10 samples each.
notebooks/reproduce/exps-cluster-sample.ipynb will be useful to allow you to run experiments efficiently, but with more dependencies required (brooce, redis).
If you face any issues, please try the following things:
- Making sure you have the latest version of the repository
git pull --recurse-submodules
- Making sure you have the latest copy of the data.
cd data
bash clean.sh
bash get-tectonic.sh
- If you need to get an allocation on Chameleon or face any difficulties with the platform itself, please contact their helpdesk.
Please raise a GitHub issue. Support is best effort; you may also email me (contact details at https://wonglkd.fi-de.net).
Baleen: ML Admission & Prefetching for Flash Caches
Daniel Lin-Kit Wong, Hao Wu, Carson Molder, Sathya Gunasekar, Jimmy Lu, Snehal Khandkar, Abhinav Sharma, Daniel S. Berger, Nathan Beckmann, Gregory R. Ganger
USENIX FAST 2024