Merge pull request #119 from NiklasPfister/Development
Version update of adaXT from 1.3.0 to 1.4.0
svbrodersen authored Jan 6, 2025
2 parents c377b44 + 58f1c48 commit 1463539
Showing 57 changed files with 1,915 additions and 786 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/github-pages.yml
@@ -3,13 +3,16 @@ on:
push:
branches:
- main
- Development
permissions:
contents: write
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
ref: "Development"
- name: Configure Git Credentials
run: |
git config user.name github-actions[bot]
2 changes: 2 additions & 0 deletions .gitignore
@@ -183,3 +183,5 @@ scrap.pyx
*.pyd
*.html
src/adaXT/decision_tree/setup.py
startup.sh
test.py
10 changes: 5 additions & 5 deletions Makefile
@@ -5,11 +5,11 @@ build_ext:
python setup.py build_ext --inplace

clean:
rm -f ./src/adaXT/decision_tree/*.so ./src/adaXT/decision_tree/*.html ./src/adaXT/decision_tree/*.cpp
rm -f ./src/adaXT/criteria/*.so ./src/adaXT/criteria/*.html ./src/adaXT/criteria/*.cpp
rm -f ./src/adaXT/predict/*.so ./src/adaXT/predict/*.html ./src/adaXT/predict/*.cpp
rm -f ./src/adaXT/leaf_builder/*.so ./src/adaXT/leaf_builder/*.html ./src/adaXT/leaf_builder/*.cpp
rm -f ./src/adaXT/*.so ./src/adaXT/*.html ./src/adaXT/*.cpp
find ./src | grep -i .so | xargs rm -rf
find ./src | grep -i .cpp | xargs rm -rf
find ./src | grep -i .html | xargs rm -rf
find ./src | grep -i egg-info | xargs rm -rf
find ./src | grep -i pycache | xargs rm -rf

lint:
cython-lint src/* --max-line-length=127
3 changes: 2 additions & 1 deletion README.md
@@ -32,7 +32,8 @@ implemented:

Beyond these pre-defined tree types, adaXT offers a simple interface to extend
or modify most components of the tree models. For example, it is easy to create
a [custom criteria](/docs/user_guide/creatingCriteria.md) function that is used
a [custom criteria](https://NiklasPfister.github.io/adaXT/user_guide/creatingCriteria/)
function that is used
to create splits.

### Getting started
4 changes: 2 additions & 2 deletions docs/api_docs/DecisionTree.md
@@ -6,12 +6,12 @@ then be applied to data.

- [Criteria](Criteria.md)
- [LeafBuilder](LeafBuilder.md)
- [Prediction](Prediction.md)
- [Predictor](Predictor.md)

Instead of the user specifying all three components individually, it is also
possible to only specify the `tree_type`, which then internally selects the
corresponding default components for several established tree algorithms; see the
[user guide](/docs/user_guide/decision_tree.md).
[user guide](../user_guide/decision_tree.md).
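
For example, a minimal sketch (hedged: it assumes `"Regression"` is one of the supported `tree_type` values and that `DecisionTree` is imported from `adaXT.decision_tree`):

```python
import numpy as np

from adaXT.decision_tree import DecisionTree

X = np.random.normal(size=(100, 5))
Y = X[:, 0] + np.random.normal(size=100)

# Specifying only tree_type selects the default Criteria, LeafBuilder
# and Predictor components for that tree algorithm.
tree = DecisionTree("Regression")
tree.fit(X, Y)
predictions = tree.predict(X)
```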

For more advanced modifications, it might be necessary to change how the
splitting is performed. This can be done by passing a custom
8 changes: 1 addition & 7 deletions docs/api_docs/Nodes.md
@@ -3,10 +3,4 @@
This is the collection of the different implemented Nodes used by the
[DecisionTree](DecisionTree.md).

::: adaXT.decision_tree
options:
members:
- Node
- LeafNode
- DecisionNode
- LinearRegressionLeafNode
::: adaXT.decision_tree.nodes
11 changes: 11 additions & 0 deletions docs/api_docs/Parallel.md
@@ -0,0 +1,11 @@
# ParallelModel class

This model is created together with the
[RandomForest](RandomForest.md). It is later passed to the
[Predictor](Predictor.md) class as input to the static
method [forest_predictor](../api_docs/Predictor.md#adaXT.predictor.predictor.Predictor.forest_predictor).
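
As a rough sketch of the calling convention (based on the quantile example in the user guide; `predict_one_tree`, `trees` and `X_new` are illustrative names, not part of the API):

```python
# Top-level function applied to every tree; it must be globally
# available so that multiprocessing can pickle it.
def predict_one_tree(tree, X):
    return tree.predict(X)

# `parallel` is the ParallelModel created by the RandomForest.
# async_map calls predict_one_tree once per tree, in parallel,
# and returns one result per tree.
results = parallel.async_map(predict_one_tree, map_input=trees, X=X_new)
```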

::: adaXT.parallel
options:
members:
- ParallelModel
12 changes: 0 additions & 12 deletions docs/api_docs/Prediction.md

This file was deleted.

13 changes: 13 additions & 0 deletions docs/api_docs/Predictor.md
@@ -0,0 +1,13 @@
# Predictor Class

The Predictor class is used to customize how `tree.predict` behaves. The
default implementations can be seen below.
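
Keyword arguments passed to `tree.predict` are forwarded to the active predictor. As a hedged sketch (assuming `"Quantile"` is the `tree_type` that selects PredictorQuantile):

```python
import numpy as np

from adaXT.decision_tree import DecisionTree

X = np.random.normal(size=(200, 3))
Y = X[:, 0] + np.random.normal(size=200)

# PredictorQuantile reads the `quantile` keyword argument,
# so it can be passed straight through tree.predict.
tree = DecisionTree("Quantile")
tree.fit(X, Y)
q = tree.predict(X, quantile=[0.1, 0.5, 0.9])
```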

::: adaXT.predictor.predictor
options:
members:
- Predictor
- PredictorClassification
- PredictorRegression
- PredictorLocalPolynomial
- PredictorQuantile
8 changes: 6 additions & 2 deletions docs/api_docs/RandomForest.md
@@ -3,14 +3,18 @@
This is the class used to construct a random forest. Random forests consist of
multiple individual decision trees that are trained on subsets of the data and
then combined via averaging. This can greatly improve the generalization
performance by avoiding the tendency of decision trees to overfit to the
performance by avoiding the tendency of decision trees to over fit to the
training data. Since random forests learn individual trees, many of the
parameters and functionality in this class overlap with the
[DecisionTree](DecisionTree.md) class.

The RandomForest can be imported as follows:

```python
from adaXT.random_forest import RandomForest
```
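
A hedged usage sketch (it assumes the constructor takes the forest type as its first argument, mirroring `DecisionTree`, and an `n_estimators` parameter for the number of trees):

```python
import numpy as np

from adaXT.random_forest import RandomForest

X = np.random.normal(size=(500, 4))
Y = X[:, 0] ** 2 + np.random.normal(size=500)

# Train 100 regression trees on subsets of the data and
# combine their predictions by averaging.
forest = RandomForest("Regression", n_estimators=100)
forest.fit(X, Y)
predictions = forest.predict(X)
```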

::: adaXT.random_forest.RandomForest
::: adaXT.random_forest.random_forest
options:
members:
- RandomForest
6 changes: 5 additions & 1 deletion docs/api_docs/tree_utils.md
@@ -7,4 +7,8 @@ All methods are available in the decision tree module.
```python
import adaXT.decision_tree.tree_utils
```
::: adaXT.decision_tree.tree_utils

::: adaXT.decision_tree.tree_utils
options:
members:
- plot_tree
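
A hedged sketch of how `plot_tree` might be used together with matplotlib (the exact signature is documented below; here we assume it accepts a fitted tree and draws onto the current figure):

```python
import matplotlib.pyplot as plt
import numpy as np

from adaXT.decision_tree import DecisionTree
from adaXT.decision_tree.tree_utils import plot_tree

X = np.random.normal(size=(100, 2))
Y = (X[:, 0] > 0).astype(np.float64)

tree = DecisionTree("Classification", max_depth=3)
tree.fit(X, Y)

plot_tree(tree)  # assumed: takes the fitted DecisionTree
plt.show()
```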
Binary file added docs/assets/figures/DecisionTreePlot.png
4 changes: 2 additions & 2 deletions docs/user_guide/creatingCriteria.md
@@ -1,7 +1,7 @@
# Creating a custom criteria

In this section we explain how to create a custom criteria function by walking
through the required steps. The [Criteria](/docs/api_docs/Criteria.md) class is
through the required steps. The [Criteria](../api_docs/Criteria.md) class is
implemented as a Cython
[extension type](https://cython.readthedocs.io/en/latest/src/tutorial/cdef_classes.html)
-- also known as a cdef class. While this ensures that the criteria evaluations
@@ -43,7 +43,7 @@ which is as follows:

The variable `indices` refers to the sample indices for which the impurity value
should be computed. To access the feature and response you can make use of
`self.x` and `self.y`, respectively. More specifically, `self.x[indices] ` and
`self.x` and `self.y`, respectively. More specifically, `self.x[indices]` and
`self.y[indices]` are the feature and response samples for which the impurity
needs to be computed. With this in place you should be able to implement almost
any criteria function you can imagine. Keep in mind that the `impurity` method
3 changes: 0 additions & 3 deletions docs/user_guide/creatingPrediction.md

This file was deleted.

201 changes: 201 additions & 0 deletions docs/user_guide/creatingPredictor.md
@@ -0,0 +1,201 @@
# Creating a custom Predictor

## General overview of the Predictor

Like other elements in adaXT, it is possible to create a custom
[Predictor](../api_docs/Predictor.md). You can start by creating a new
.pyx file using the following template:

```cython
# Imports needed to make the template self-contained (assumed module paths).
import numpy as np

cimport numpy as cnp
from adaXT.decision_tree import DecisionTree
from adaXT.parallel import ParallelModel
from adaXT.predictor cimport Predictor


cdef class MyPredictorClass(Predictor):
    # Declare any custom attributes here, e.g.:
    # cdef:
    #     attribute_type attribute_name

    def __init__(
            self,
            double[:, ::1] X,
            double[:, ::1] Y,
            object root,
            **kwargs):
        super().__init__(X, Y, root, **kwargs)
        # Any custom initialization you would need for your predictor class.
        # If you don't have any, you don't need to define the __init__ function.

    def predict(self, cnp.ndarray X, **kwargs) -> np.ndarray:
        # Define your own custom predict function.
        ...

    @staticmethod
    def forest_predict(cnp.ndarray X_old, cnp.ndarray Y_old, cnp.ndarray X_new,
                       trees: list[DecisionTree], parallel: ParallelModel,
                       **kwargs) -> np.ndarray:
        # Define special handling for the RandomForest predict. If it is not
        # defined, the RandomForest will take the mean of the individual tree
        # predictions.
        ...
```

The template includes three main components:

1. \_\_init\_\_ function: This function is used to initialize the class. Because Cython
removes a lot of the boilerplate of default Python classes, you cannot
add attributes to a cdef class without explicitly declaring them. The \_\_init\_\_
function allows you to initialize these attributes after you have declared them above.
If you do not need additional attributes, you can skip this step.
2. predict method: This method is used to compute predictions for the given input X
values. It is a standard Python method and can be used like any other. Within this
method, you have access to the general attributes of the
[Predictor](../api_docs/Predictor.md) class, including the number of features and
the root node object, which can be used to traverse the tree.
3. forest_predict method: This static method aggregates predictions across multiple
trees for forest predictions. It enables parallel processing across trees. If your
custom Predictor simply averages tree predictions, you can inherit this method
from the base Predictor class.

## Example of creating a Predictor

To illustrate each component, we go over the PredictorQuantile class, which is used
for quantile regression trees and forests. It does not add any additional attributes,
so the \_\_init\_\_ function is not needed in this case.

### The predict method

In quantile regression we want to predict the quantiles of the conditional distribution
instead of just the conditional mean as in regular regression. For a single tree this can
be done with the following predict method:

```cython
# Imports needed to make this excerpt self-contained (assumed module paths).
from collections.abc import Sequence

import numpy as np

cimport numpy as cnp
from adaXT.decision_tree.nodes import DecisionNode
from adaXT.predictor cimport Predictor

DOUBLE = np.float64  # dtype used for the returned predictions


cdef class PredictorQuantile(Predictor):
    def predict(self, cnp.ndarray X, **kwargs) -> np.ndarray:
        cdef:
            int i, cur_split_idx, n_obs
            double cur_threshold
            object cur_node
            cnp.ndarray prediction
        if "quantile" not in kwargs.keys():
            raise ValueError(
                "quantile called without quantile passed as argument"
            )
        quantile = kwargs['quantile']
        # Make sure that X fits the dimensions.
        n_obs = X.shape[0]
        # Check if quantile is an array
        if isinstance(quantile, Sequence):
            prediction = np.empty((n_obs, len(quantile)), dtype=DOUBLE)
        else:
            prediction = np.empty(n_obs, dtype=DOUBLE)
        for i in range(n_obs):
            cur_node = self.root
            while isinstance(cur_node, DecisionNode):
                cur_split_idx = cur_node.split_idx
                cur_threshold = cur_node.threshold
                if X[i, cur_split_idx] < cur_threshold:
                    cur_node = cur_node.left_child
                else:
                    cur_node = cur_node.right_child
            prediction[i] = np.quantile(self.Y.base[cur_node.indices, 0], quantile)
        return prediction
```

Here, we first define the types of the variables used. This allows Cython to
optimize the code, which leads to a faster prediction runtime.

Next, we check the kwargs for the key `quantile`. Any keyword arguments passed
to DecisionTree.predict are passed directly to Predictor.predict, meaning
that we can access the desired quantile from the predict signature without having
to change anything else. As we want to allow multiple quantiles to be
predicted at the same time, we have to initialize the `prediction` variable differently
depending on whether `quantile` is a Sequence or just a single element.

Finally, we traverse the tree: for every observation, we start at the root node
and loop as long as we are in a DecisionNode. In each step, we check whether to
split to the left or the right, and move down the tree accordingly. Once `cur_node`
is no longer an instance of DecisionNode, we know that we have reached a LeafNode.
We can access all Y values via `self.Y.base` (`.base` has to be added,
as we are indexing with a list of elements) and the indices of the elements
within the LeafNode via `cur_node.indices`. As we only have a single Y output
value, we simply take the first column of Y. This is then repeated for the rest
of the provided X values.

### The forest_predict method

The forest_predict method looks a lot more intimidating, but is just as
straightforward as the predict method. Here is the code:

```cython
# Same assumed imports as in the predict example above.
import numpy as np

cimport numpy as cnp
from adaXT.decision_tree import DecisionTree
from adaXT.decision_tree.nodes import DecisionNode
from adaXT.parallel import ParallelModel
from adaXT.predictor cimport Predictor

DOUBLE = np.float64  # dtype used for the returned predictions


def predict_quantile(
        tree: DecisionTree, X: np.ndarray, n_obs: int
) -> list:
    # Collect, for every observation, the indices stored in the
    # LeafNode it ends up in.
    indices = []
    for i in range(n_obs):
        cur_node = tree.root
        while isinstance(cur_node, DecisionNode):
            cur_split_idx = cur_node.split_idx
            cur_threshold = cur_node.threshold
            if X[i, cur_split_idx] < cur_threshold:
                cur_node = cur_node.left_child
            else:
                cur_node = cur_node.right_child
        indices.append(cur_node.indices)
    return indices


cdef class PredictorQuantile(Predictor):
    @staticmethod
    def forest_predict(cnp.ndarray X_old, cnp.ndarray Y_old, cnp.ndarray X_new,
                       trees: list[DecisionTree], parallel: ParallelModel,
                       **kwargs) -> np.ndarray:
        cdef:
            int i, j, n_obs, n_trees
            list prediction_indices, pred_indices_combined, indices_combined
        if "quantile" not in kwargs.keys():
            raise ValueError(
                "quantile called without quantile passed as argument"
            )
        quantile = kwargs['quantile']
        n_obs = X_new.shape[0]
        prediction_indices = parallel.async_map(predict_quantile,
                                                map_input=trees, X=X_new,
                                                n_obs=n_obs)
        # In case the leaf nodes have multiple elements and not just one, we
        # have to combine them together
        n_trees = len(prediction_indices)
        pred_indices_combined = []
        for i in range(n_obs):
            indices_combined = []
            for j in range(n_trees):
                indices_combined.extend(prediction_indices[j][i])
            pred_indices_combined.append(indices_combined)
        ret = np.quantile(Y_old[pred_indices_combined], quantile)
        return np.array(ret, dtype=DOUBLE)
```

The forest_predict method is a static method, meaning that it is tied to the
Predictor class itself and not to a specific instance of the class. The reason for
this is that it allows us to fully control the parallelization over trees, which,
for the PredictorQuantile, we want to handle ourselves.

As before, we define the variables used and check the input for the kwarg
`quantile`. However, this time we need to define a function
`predict_quantile` at the top level of the file. It has to be globally available
for the multiprocessing to work properly. This function traverses a given tree,
finds the LeafNode each element of X ends up in, and collects the indices
of the training elements already in that LeafNode. We then call `predict_quantile`
using parallel.async_map, which is adaXT's way of making
parallelization more manageable; it makes use of the
[Parallel](../api_docs/Parallel.md) class. The async_map calls
`predict_quantile` with all the trees in parallel and returns the result. This
means that `prediction_indices` will be a list with length equal
to the number of trees in the forest, where each element is a single
tree's prediction indices for the input array X. We then create a list
`pred_indices_combined` in which we combine the indices from all trees.
To get the final result, we then just call numpy's quantile implementation.
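
Putting it together, a hedged end-to-end sketch (assuming `"Quantile"` is the forest type that selects PredictorQuantile):

```python
import numpy as np

from adaXT.random_forest import RandomForest

X = np.random.normal(size=(500, 3))
Y = X[:, 0] + np.random.normal(size=500)

forest = RandomForest("Quantile")
forest.fit(X, Y)
# The quantile keyword is forwarded to PredictorQuantile.forest_predict,
# which pools leaf indices across all trees before taking quantiles.
q = forest.predict(X, quantile=[0.1, 0.9])
```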