Add documentation #8

Merged: 5 commits, Feb 21, 2024
5 changes: 5 additions & 0 deletions docs/arch.md
@@ -0,0 +1,5 @@
# Architecture

@alecbcs mocked up a wonderful flow chart showing our vision for this project!

![flow chart showing spack-gantry architecture](./img/arch.png)
47 changes: 47 additions & 0 deletions docs/context.md
@@ -0,0 +1,47 @@
# Context

Build definitions compiled in Spack's CI are manually assigned to categories that reflect the resources they are expected to consume. These amounts are used to pack multiple jobs onto a CI runner as efficiently as possible. Without accurate information about how much memory or CPU a build will use, resources can be misallocated, which affects the following components of the CI system:

- Cost per job
- Build walltime
- Efficiency of VM packing
- Utilization of resources per job
- Build failures due to lack of memory
- Build error rate
- Overall throughput

Jobs with mismatched time estimates are also allocated to instances inappropriately, leading to situations where many small jobs finish quickly while a larger job holds the instance for the rest of its duration without using every available cycle. We currently retry jobs up to three times to work around stochastic CI failures, which creates more potential waste when the error was genuine. Instead, we would like to retry a job only when the cause of termination was resource contention.

Due to the scale of the problem and the inability to manually determine the resource demand of a given build, we have decided that an automated framework that captures historical utilization and predicts future usage is the best course of action.

With this setup, we can transition to a system where build jobs request the appropriate amount of resources, reducing waste and contention with other jobs within the same namespace. Additionally, by amassing a large repository of build attributes and historical usage, we can further analyze the behavior of these jobs and run experiments within the context of the CI: for instance, understanding why certain packages vary so much in memory usage during compilation, or determining whether there is a "sweet spot" that minimizes resource usage while still yielding a good build time for a given configuration (i.e., a scaling study).

A corollary to this is building a system that handles job failures with some intelligence. For instance, if a build was OOM-killed, `gantry` would resubmit the same job with more memory. Jobs that fail for other reasons would be resolved through other channels.

## Current resource allocation

Each build job comes with a memory and CPU request, which Kubernetes uses to allocate the job onto a specific VM. No limits are sent, meaning that a compilation can crowd out other jobs on the same node and faces no consequences for exceeding the resources it is expected to utilize.
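
To make the shape of this concrete, here is what a request-only resource specification looks like, written as a Python dict mirroring the Kubernetes container `resources` block (the values are invented for illustration):

```python
# Illustrative only: a Kubernetes container "resources" block with requests
# but no limits, as currently submitted for build jobs (values are made up).
build_job_resources = {
    "requests": {
        "memory": "8Gi",  # scheduler guarantees at least this much memory
        "cpu": "4",       # and this many cores when placing the pod
    },
    # No "limits" key: the container is not capped, so it can exceed its
    # request and crowd out neighboring jobs on the same node.
}
```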

-----

To illustrate the problem, let's go through some usage numbers (across all builds):

**Memory**

- avg/request = 0.26
- max/request = 0.64

**CPU**

- avg/request = 1.25
- max/request = 2.69

There is a lot of misallocation going on here. As noted above, limits are not enforced, so the request is the closest thing we have to a useful baseline for comparing usage. Bottom line: we use far less memory than we request, and more CPU than we ask for.
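
As a sketch of how ratios like these can be derived from per-build records (the field names and numbers below are illustrative, not gantry's actual schema):

```python
# Hypothetical per-build records: mean and max usage alongside the request.
jobs = [
    {"mem_mean": 2.1e9, "mem_max": 5.0e9, "mem_request": 8.0e9,
     "cpu_mean": 5.0, "cpu_max": 10.8, "cpu_request": 4.0},
    # ... one record per build ...
]

def usage_ratios(jobs, prefix):
    """Average of per-build mean/request and max/request for one resource."""
    mean_ratio = sum(j[f"{prefix}_mean"] / j[f"{prefix}_request"] for j in jobs) / len(jobs)
    max_ratio = sum(j[f"{prefix}_max"] / j[f"{prefix}_request"] for j in jobs) / len(jobs)
    return mean_ratio, max_ratio

print("memory:", usage_ratios(jobs, "mem"))  # (avg/request, max/request)
print("cpu:", usage_ratios(jobs, "cpu"))
```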

## Cost per job

Work is being performed to determine the best way to model the cost per job and node.

**Notes**:
- waste needs to be quantified in the cost per job
- instance type should be controlled for, as we don't want the number of cores to be the variable in the cost function, since the performance can vary drastically
18 changes: 18 additions & 0 deletions docs/data-collection.md
@@ -0,0 +1,18 @@
# Data Collection

Job metadata is retrieved through the Spack Prometheus service (https://prometheus.spack.io).

Gantry exposes a webhook handler at `/v1/collection` that accepts a job status payload from GitLab, collects the build's attributes and usage, and submits them to the database.
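
As a rough illustration of the collection flow (this is a sketch, not gantry's actual handler; the helper functions are hypothetical, and the payload field follows GitLab's job event format):

```python
# Sketch of a webhook endpoint in the spirit of /v1/collection, using aiohttp.
from aiohttp import web

routes = web.RouteTableDef()

@routes.post("/v1/collection")
async def collect_job(request: web.Request) -> web.Response:
    payload = await request.json()  # GitLab job status webhook body
    # Only finished builds have usage data worth collecting.
    if payload.get("build_status") != "success":
        return web.Response(text="ignored")
    # fetch_usage() and store_build() stand in for the Prometheus queries
    # and database insert described above; they are not real gantry functions.
    # usage = await fetch_usage(payload)
    # await store_build(payload, usage)
    return web.Response(text="collected")

app = web.Application()
app.add_routes(routes)
# web.run_app(app) would start the service
```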

See `/db/schema.sql` for a full list of the data that is being collected.

## Units

Memory usage is stored in bytes, while CPU usage is stored in cores. Pay special attention to these fields when performing calculations or sending their values to Kubernetes or another external service, as those systems may expect different units.
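
As an example of the kind of conversion this implies (a hedged sketch, not gantry code): Kubernetes resource strings commonly express memory in Mi/Gi and CPU in millicores, so values pulled from these fields would need to be translated.

```python
# Convert stored units (bytes, cores) to common Kubernetes resource strings.
def bytes_to_mebibytes(mem_bytes: float) -> str:
    return f"{int(mem_bytes / 2**20)}Mi"

def cores_to_millicores(cores: float) -> str:
    return f"{int(cores * 1000)}m"

print(bytes_to_mebibytes(4.2e9))  # "4005Mi"
print(cores_to_millicores(1.5))   # "1500m"
```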

------

Links to documentation for the available metrics:
- [node](https://github.com/kubernetes/kube-state-metrics/blob/main/docs/node-metrics.md)
- [pod](https://github.com/kubernetes/kube-state-metrics/blob/main/docs/pod-metrics.md)
- [container](https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md)
11 changes: 11 additions & 0 deletions docs/home.md
@@ -0,0 +1,11 @@
# `spack-gantry`



## Table of Contents

1. [Context](context.md)
2. [Data Collection](data-collection.md)
3. [Architecture](arch.md)
4. [Prediction](prediction.md)

Binary file added docs/img/arch.png
79 changes: 79 additions & 0 deletions docs/prediction.md
@@ -0,0 +1,79 @@
# Prediction

The basic idea here is: given metadata about a future build, recommend resource requests and limits, as well as the number of build jobs.

The goal is that this eventually becomes a self-learning system which can facilitate better predictions over time without much interference.

At the moment, this isn't accomplished through some super fancy model. Here's our approach for each type of prediction:

**Requests**

We optimize so that mean usage / predicted mean is as close to 1 as possible.

Formula: avg(mean_usage) for past N builds.
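
Read literally, that amounts to something like the following sketch (the helper name is hypothetical):

```python
# Predicted request = average of the mean usage of the last N matching builds.
def predict_request(mean_usages: list[float]) -> float:
    """mean_usages: per-build mean usage (e.g. memory in bytes) of the
    N most recent matching builds."""
    return sum(mean_usages) / len(mean_usages)

predict_request([2.1e9, 1.9e9, 2.4e9, 2.0e9, 2.2e9])  # ~2.12e9 bytes
```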

**Limits**

- For CPU: optimize the number of cores for best efficiency.
- For RAM: optimize to avoid OOM kills (max usage / predicted max < 1).

This one is a bit trickier to implement because we would like to limit OOM kills as much as possible; one idea we've considered is allocating 10-15% above the historical maximum usage.
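
A sketch of that idea for memory (the headroom value and helper name are illustrative):

```python
# Predicted memory limit = historical maximum plus 10-15% headroom.
def predict_mem_limit(max_usages: list[float], headroom: float = 0.15) -> float:
    """max_usages: per-build max memory (bytes) of recent matching builds."""
    return max(max_usages) * (1 + headroom)

predict_mem_limit([5.1e9, 4.8e9, 5.6e9])  # 5.6e9 * 1.15 ≈ 6.44e9 bytes
```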

We could also figure out the upper threshold by calculating:

`upper threshold = skimpiest predicted limit / maximum usage for that job`

However, when doing this, I've stumbled upon packages with unpredictable usage patterns whose builds can swing between 200-400% of each other (with no discernible differences between them).

More research and care will be needed when we finally decide to implement limit prediction.

### Predictors

We've done some analysis to determine the best predictors of resource usage. Because we need to return a result regardless of the confidence we have in it, we've developed a priority list of predictors to match on.

1. `("pkg_name", "pkg_version", "compiler_name", "compiler_version")`
2. `("pkg_name", "compiler_name", "compiler_version")`
3. `("pkg_name", "pkg_version", "compiler_name")`
4. `("pkg_name", "compiler_name")`
5. `("pkg_name", "pkg_version")`
6. `("pkg_name",)`

Variants are always included as a predictor.

Our analysis shows that the optimal number of builds to include in the prediction function is five, though we will accept four matching builds rather than drop down to the next set in the list.
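
A sketch of the fallback logic described above (`match_builds` and the `pkg_variants` field are hypothetical, and the thresholds follow the five-preferred/four-accepted rule):

```python
# Predictor sets, ordered from most to least specific.
PREDICTORS = [
    ("pkg_name", "pkg_version", "compiler_name", "compiler_version"),
    ("pkg_name", "compiler_name", "compiler_version"),
    ("pkg_name", "pkg_version", "compiler_name"),
    ("pkg_name", "compiler_name"),
    ("pkg_name", "pkg_version"),
    ("pkg_name",),
]

def training_sample(job: dict, match_builds, min_builds: int = 4, n: int = 5):
    """Return up to n recent builds from the most specific predictor set
    that yields at least min_builds matches; variants are always included."""
    for fields in PREDICTORS:
        key = {f: job[f] for f in fields}
        builds = match_builds(key, variants=job["pkg_variants"])
        if len(builds) >= min_builds:
            return builds[:n]
    return []  # no usable history for this package
```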

We do not use PR builds as part of the training data, as they are potential vectors for manipulation and can be error-prone. The predictions will apply to both PR and develop jobs.

## Plan

1. In the pilot phase, we will only be implementing predictions for requests, and ensuring that they will only increase compared to current allocations.

2. If we see success in the pilot, we'll implement functionality which retries jobs with higher memory allocations if they've been shown to fail due to OOM kills.

3. Then, we will "drop the floor" and allow the predictor to allocate less memory than the package is used to. At this step, requests will be fully implemented.

4. Limits for CPU and memory will be implemented.

5. Next, we want to introduce some experimentation in the system and perform a [scaling study](#fuzzing).

6. Design a scheduler that decides which instance type a job should be placed on based on cost and expected usage and runtime.

## Evaluation

The success of our predictions can be evaluated against a number of factors:

- How much cheaper is the job?
- Closeness of request or limit to actual usage
- Jobs being killed due to resource contention
- Error distribution of prediction
- How much waste is there per build type?

## Fuzzing

10-15% of builds would be randomly selected to have their CPU limit modified up or down. This would happen a few times for each build, so we can see whether there is an optimal efficiency for the job, which would then be used to define its future CPU limit and number of build jobs.

We're essentially adding variance to the resource allocation and seeing how the system responds.

This is a strong scaling study, and the plot of interest is the efficiency curve.

Efficiency is defined as cores / build time.
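
A sketch of how this could be wired up (names and perturbation factors are illustrative, not a settled design):

```python
import random

# Fuzzing: for a random 10-15% of builds, scale the predicted CPU limit up
# or down so we can observe efficiency (cores / build time) at each setting.
def maybe_fuzz_cpu_limit(predicted_cores: float, fraction: float = 0.125) -> float:
    if random.random() < fraction:
        return predicted_cores * random.choice([0.5, 0.75, 1.25, 1.5])
    return predicted_cores

def efficiency(cores: float, build_seconds: float) -> float:
    return cores / build_seconds

# Collecting (cores, efficiency) pairs across several fuzzed runs of the
# same build traces out the strong-scaling curve of interest.
```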