Merge pull request #119 from NiklasPfister/Development
Version update of adaXT from 1.3.0 to 1.4.0
svbrodersen authored Jan 6, 2025
2 parents c377b44 + 58f1c48 commit 1463539
Showing 57 changed files with 1,915 additions and 786 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/github-pages.yml
@@ -3,13 +3,16 @@ on:
push:
branches:
- main
- Development
permissions:
contents: write
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
ref: "Development"
- name: Configure Git Credentials
run: |
git config user.name github-actions[bot]
2 changes: 2 additions & 0 deletions .gitignore
@@ -183,3 +183,5 @@ scrap.pyx
*.pyd
*.html
src/adaXT/decision_tree/setup.py
startup.sh
test.py
10 changes: 5 additions & 5 deletions Makefile
@@ -5,11 +5,11 @@ build_ext:
python setup.py build_ext --inplace

clean:
rm -f ./src/adaXT/decision_tree/*.so ./src/adaXT/decision_tree/*.html ./src/adaXT/decision_tree/*.cpp
rm -f ./src/adaXT/criteria/*.so ./src/adaXT/criteria/*.html ./src/adaXT/criteria/*.cpp
rm -f ./src/adaXT/predict/*.so ./src/adaXT/predict/*.html ./src/adaXT/predict/*.cpp
rm -f ./src/adaXT/leaf_builder/*.so ./src/adaXT/leaf_builder/*.html ./src/adaXT/leaf_builder/*.cpp
rm -f ./src/adaXT/*.so ./src/adaXT/*.html ./src/adaXT/*.cpp
find ./src | grep -i .so | xargs rm -rf
find ./src | grep -i .cpp | xargs rm -rf
find ./src | grep -i .html | xargs rm -rf
find ./src | grep -i egg-info | xargs rm -rf
find ./src | grep -i pycache | xargs rm -rf

lint:
cython-lint src/* --max-line-length=127
3 changes: 2 additions & 1 deletion README.md
@@ -32,7 +32,8 @@ implemented:

Beyond these pre-defined tree types, adaXT offers a simple interface to extend
or modify most components of the tree models. For example, it is easy to create
a [custom criteria](/docs/user_guide/creatingCriteria.md) function that is used
a [custom criteria](https://NiklasPfister.github.io/adaXT/user_guide/creatingCriteria/)
function that is used
to create splits.

### Getting started
4 changes: 2 additions & 2 deletions docs/api_docs/DecisionTree.md
@@ -6,12 +6,12 @@ then be applied to data.

- [Criteria](Criteria.md)
- [LeafBuilder](LeafBuilder.md)
- [Prediction](Prediction.md)
- [Predictor](Predictor.md)

Instead of the user specifying all three components individually, it is also
possible to only specify the `tree_type`, which then internally selects the
corresponding default components for several established tree algorithms; see the
[user guide](/docs/user_guide/decision_tree.md).
[user guide](../user_guide/decision_tree.md).
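
For example, a minimal sketch (hedged: it assumes `"Regression"` is one of the supported `tree_type` values and that `DecisionTree` is imported from `adaXT.decision_tree`):

```python
import numpy as np

from adaXT.decision_tree import DecisionTree

X = np.random.normal(size=(100, 5))
Y = X[:, 0] + np.random.normal(size=100)

# Specifying only tree_type selects the default Criteria, LeafBuilder
# and Predictor components for that tree algorithm.
tree = DecisionTree("Regression")
tree.fit(X, Y)
predictions = tree.predict(X)
```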

For more advanced modifications, it might be necessary to change how the
splitting is performed. This can be done by passing a custom
8 changes: 1 addition & 7 deletions docs/api_docs/Nodes.md
@@ -3,10 +3,4 @@
This is the collection of the different implemented Nodes used by the
[DecisionTree](DecisionTree.md).

::: adaXT.decision_tree
options:
members:
- Node
- LeafNode
- DecisionNode
- LinearRegressionLeafNode
::: adaXT.decision_tree.nodes
11 changes: 11 additions & 0 deletions docs/api_docs/Parallel.md
@@ -0,0 +1,11 @@
# ParallelModel class

This model is created together with the
[RandomForest](RandomForest.md). It is later passed to the
[Predictor](Predictor.md) class as input to the static
method [forest_predictor](../api_docs/Predictor.md#adaXT.predictor.predictor.Predictor.forest_predictor).
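
As a rough sketch of the calling convention (based on the quantile example in the user guide; `predict_one_tree`, `trees` and `X_new` are illustrative names, not part of the API):

```python
# Top-level function applied to every tree; it must be globally
# available so that multiprocessing can pickle it.
def predict_one_tree(tree, X):
    return tree.predict(X)

# `parallel` is the ParallelModel created by the RandomForest.
# async_map calls predict_one_tree once per tree, in parallel,
# and returns one result per tree.
results = parallel.async_map(predict_one_tree, map_input=trees, X=X_new)
```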

::: adaXT.parallel
options:
members:
- ParallelModel
12 changes: 0 additions & 12 deletions docs/api_docs/Prediction.md

This file was deleted.

13 changes: 13 additions & 0 deletions docs/api_docs/Predictor.md
@@ -0,0 +1,13 @@
# Predictor Class

The Predictor class is used to customize how `tree.predict` behaves. The
default implementations can be seen below.
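
Keyword arguments passed to `tree.predict` are forwarded to the active predictor. As a hedged sketch (assuming `"Quantile"` is the `tree_type` that selects PredictorQuantile):

```python
import numpy as np

from adaXT.decision_tree import DecisionTree

X = np.random.normal(size=(200, 3))
Y = X[:, 0] + np.random.normal(size=200)

# PredictorQuantile reads the `quantile` keyword argument,
# so it can be passed straight through tree.predict.
tree = DecisionTree("Quantile")
tree.fit(X, Y)
q = tree.predict(X, quantile=[0.1, 0.5, 0.9])
```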

::: adaXT.predictor.predictor
options:
members:
- Predictor
- PredictorClassification
- PredictorRegression
- PredictorLocalPolynomial
- PredictorQuantile
8 changes: 6 additions & 2 deletions docs/api_docs/RandomForest.md
@@ -3,14 +3,18 @@
This is the class used to construct a random forest. Random forests consist of
multiple individual decision trees that are trained on subsets of the data and
then combined via averaging. This can greatly improve the generalization
performance by avoiding the tendency of decision trees to overfit to the
performance by avoiding the tendency of decision trees to over fit to the
training data. Since random forests learn individual trees, many of the
parameters and functionality in this class overlap with the
[DecisionTree](DecisionTree.md) class.

The RandomForest can be imported as follows:

```python
from adaXT.random_forest import RandomForest
```
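
A hedged usage sketch (it assumes the constructor takes the forest type as its first argument, mirroring `DecisionTree`, and an `n_estimators` parameter for the number of trees):

```python
import numpy as np

from adaXT.random_forest import RandomForest

X = np.random.normal(size=(500, 4))
Y = X[:, 0] ** 2 + np.random.normal(size=500)

# Train 100 regression trees on subsets of the data and
# combine their predictions by averaging.
forest = RandomForest("Regression", n_estimators=100)
forest.fit(X, Y)
predictions = forest.predict(X)
```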

::: adaXT.random_forest.RandomForest
::: adaXT.random_forest.random_forest
options:
members:
- RandomForest
6 changes: 5 additions & 1 deletion docs/api_docs/tree_utils.md
@@ -7,4 +7,8 @@ All methods are available in the decision tree module.
```python
import adaXT.decision_tree.tree_utils
```
::: adaXT.decision_tree.tree_utils

::: adaXT.decision_tree.tree_utils
options:
members:
- plot_tree
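
A hedged sketch of how `plot_tree` might be used together with matplotlib (the exact signature is documented below; here we assume it accepts a fitted tree and draws onto the current figure):

```python
import matplotlib.pyplot as plt
import numpy as np

from adaXT.decision_tree import DecisionTree
from adaXT.decision_tree.tree_utils import plot_tree

X = np.random.normal(size=(100, 2))
Y = (X[:, 0] > 0).astype(np.float64)

tree = DecisionTree("Classification", max_depth=3)
tree.fit(X, Y)

plot_tree(tree)  # assumed: takes the fitted DecisionTree
plt.show()
```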
Binary file added docs/assets/figures/DecisionTreePlot.png
4 changes: 2 additions & 2 deletions docs/user_guide/creatingCriteria.md
@@ -1,7 +1,7 @@
# Creating a custom criteria

In this section we explain how to create a custom criteria function by walking
through the required steps. The [Criteria](/docs/api_docs/Criteria.md) class is
through the required steps. The [Criteria](../api_docs/Criteria.md) class is
implemented as a Cython
[extension type](https://cython.readthedocs.io/en/latest/src/tutorial/cdef_classes.html)
-- also known as a cdef class. While this ensures that the criteria evaluations
@@ -43,7 +43,7 @@ which is as follows:

The variable `indices` refers to the sample indices for which the impurity value
should be computed. To access the feature and response you can make use of
`self.x` and `self.y`, respectively. More specifically, `self.x[indices] ` and
`self.x` and `self.y`, respectively. More specifically, `self.x[indices]` and
`self.y[indices]` are the feature and response samples for which the impurity
needs to be computed. With this in place you should be able to implement almost
any criteria function you can imagine. Keep in mind that the `impurity` method
3 changes: 0 additions & 3 deletions docs/user_guide/creatingPrediction.md

This file was deleted.

201 changes: 201 additions & 0 deletions docs/user_guide/creatingPredictor.md
@@ -0,0 +1,201 @@
# Creating a custom Predictor

## General overview of the Predictor

Like other elements in adaXT, it is possible to create a custom
[Predictor](../api_docs/Predictor.md). You can start by creating a new
.pyx file using the following template:

```cython
# Imports needed to make the template self-contained (assumed module paths).
import numpy as np

cimport numpy as cnp
from adaXT.decision_tree import DecisionTree
from adaXT.parallel import ParallelModel
from adaXT.predictor cimport Predictor


cdef class MyPredictorClass(Predictor):
    # Declare any custom attributes here, e.g.:
    # cdef:
    #     attribute_type attribute_name

    def __init__(
            self,
            double[:, ::1] X,
            double[:, ::1] Y,
            object root,
            **kwargs):
        super().__init__(X, Y, root, **kwargs)
        # Any custom initialization you would need for your predictor class.
        # If you don't have any, you don't need to define the __init__ function.

    def predict(self, cnp.ndarray X, **kwargs) -> np.ndarray:
        # Define your own custom predict function.
        ...

    @staticmethod
    def forest_predict(cnp.ndarray X_old, cnp.ndarray Y_old, cnp.ndarray X_new,
                       trees: list[DecisionTree], parallel: ParallelModel,
                       **kwargs) -> np.ndarray:
        # Define special handling for the RandomForest predict. If it is not
        # defined, the RandomForest will take the mean of the individual tree
        # predictions.
        ...
```

The template includes three main components:

1. \_\_init\_\_ function: This function is used to initialize the class. Because Cython
removes a lot of the boilerplate of default Python classes, you cannot
add attributes to a cdef class without explicitly declaring them. The \_\_init\_\_
function allows you to initialize these attributes after you have declared them above.
If you do not need additional attributes, you can skip this step.
2. predict method: This method is used to compute predictions for the given input X
values. It is a standard Python method and can be used like any other. Within this
method, you have access to the general attributes of the
[Predictor](../api_docs/Predictor.md) class, including the number of features and
the root node object, which can be used to traverse the tree.
3. forest_predict method: This static method aggregates predictions across multiple
trees for forest predictions. It enables parallel processing across trees. If your
custom Predictor simply averages tree predictions, you can inherit this method
from the base Predictor class.

## Example of creating a Predictor

To illustrate each component, we go over the PredictorQuantile class, which is used
for quantile regression trees and forests. It does not add any additional attributes,
so the \_\_init\_\_ function is not needed in this case.

### The predict method

In quantile regression we want to predict the quantiles of the conditional distribution
instead of just the conditional mean as in regular regression. For a single tree this can
be done with the following predict method:

```cython
# Imports needed to make this excerpt self-contained (assumed module paths).
from collections.abc import Sequence

import numpy as np

cimport numpy as cnp
from adaXT.decision_tree.nodes import DecisionNode
from adaXT.predictor cimport Predictor

DOUBLE = np.float64  # dtype used for the returned predictions


cdef class PredictorQuantile(Predictor):
    def predict(self, cnp.ndarray X, **kwargs) -> np.ndarray:
        cdef:
            int i, cur_split_idx, n_obs
            double cur_threshold
            object cur_node
            cnp.ndarray prediction
        if "quantile" not in kwargs.keys():
            raise ValueError(
                "quantile called without quantile passed as argument"
            )
        quantile = kwargs['quantile']
        # Make sure that X fits the dimensions.
        n_obs = X.shape[0]
        # Check if quantile is an array
        if isinstance(quantile, Sequence):
            prediction = np.empty((n_obs, len(quantile)), dtype=DOUBLE)
        else:
            prediction = np.empty(n_obs, dtype=DOUBLE)
        for i in range(n_obs):
            cur_node = self.root
            while isinstance(cur_node, DecisionNode):
                cur_split_idx = cur_node.split_idx
                cur_threshold = cur_node.threshold
                if X[i, cur_split_idx] < cur_threshold:
                    cur_node = cur_node.left_child
                else:
                    cur_node = cur_node.right_child
            prediction[i] = np.quantile(self.Y.base[cur_node.indices, 0], quantile)
        return prediction
```

Here, we first define the types of the variables used. This allows Cython to
optimize the code, which leads to a faster prediction runtime.

Next, we check the kwargs for the key `quantile`. Any keyword arguments passed
to DecisionTree.predict are passed directly to Predictor.predict, meaning
that we can access the desired quantile from the predict signature without having
to change anything else. As we want to allow multiple quantiles to be
predicted at the same time, we have to initialize the `prediction` variable differently
depending on whether `quantile` is a Sequence or just a single element.

Finally, we traverse the tree: for every observation, we start at the root node
and loop as long as we are in a DecisionNode. In each step, we check whether to
split to the left or the right, and move down the tree accordingly. Once `cur_node`
is no longer an instance of DecisionNode, we know that we have reached a LeafNode.
We can access all Y values via `self.Y.base` (`.base` has to be added,
as we are indexing with a list of elements) and the indices of the elements
within the LeafNode via `cur_node.indices`. As we only have a single Y output
value, we simply take the first column of Y. This is then repeated for the rest
of the provided X values.

### The forest_predict method

The forest_predict method looks a lot more intimidating, but is just as
straightforward as the predict method. Here is the code:

```cython
# Same assumed imports as in the predict example above.
import numpy as np

cimport numpy as cnp
from adaXT.decision_tree import DecisionTree
from adaXT.decision_tree.nodes import DecisionNode
from adaXT.parallel import ParallelModel
from adaXT.predictor cimport Predictor

DOUBLE = np.float64  # dtype used for the returned predictions


def predict_quantile(
        tree: DecisionTree, X: np.ndarray, n_obs: int
) -> list:
    # Collect, for every observation, the indices stored in the
    # LeafNode it ends up in.
    indices = []
    for i in range(n_obs):
        cur_node = tree.root
        while isinstance(cur_node, DecisionNode):
            cur_split_idx = cur_node.split_idx
            cur_threshold = cur_node.threshold
            if X[i, cur_split_idx] < cur_threshold:
                cur_node = cur_node.left_child
            else:
                cur_node = cur_node.right_child
        indices.append(cur_node.indices)
    return indices


cdef class PredictorQuantile(Predictor):
    @staticmethod
    def forest_predict(cnp.ndarray X_old, cnp.ndarray Y_old, cnp.ndarray X_new,
                       trees: list[DecisionTree], parallel: ParallelModel,
                       **kwargs) -> np.ndarray:
        cdef:
            int i, j, n_obs, n_trees
            list prediction_indices, pred_indices_combined, indices_combined
        if "quantile" not in kwargs.keys():
            raise ValueError(
                "quantile called without quantile passed as argument"
            )
        quantile = kwargs['quantile']
        n_obs = X_new.shape[0]
        prediction_indices = parallel.async_map(predict_quantile,
                                                map_input=trees, X=X_new,
                                                n_obs=n_obs)
        # In case the leaf nodes have multiple elements and not just one, we
        # have to combine them together
        n_trees = len(prediction_indices)
        pred_indices_combined = []
        for i in range(n_obs):
            indices_combined = []
            for j in range(n_trees):
                indices_combined.extend(prediction_indices[j][i])
            pred_indices_combined.append(indices_combined)
        ret = np.quantile(Y_old[pred_indices_combined], quantile)
        return np.array(ret, dtype=DOUBLE)
```

The forest_predict method is a static method, meaning that it is tied to the
Predictor class itself and not to a specific instance of the class. The reason for
this is that it allows us to fully control the parallelization over trees, which,
for the PredictorQuantile, we want to handle ourselves.

As before, we define the variables used and check the input for the kwarg
`quantile`. However, this time we need to define a function
`predict_quantile` at the top level of the file. It has to be globally available
for the multiprocessing to work properly. This function traverses a given tree,
finds the LeafNode each element of X ends up in, and collects the indices
of the training elements already in that LeafNode. We then call `predict_quantile`
using parallel.async_map, which is adaXT's way of making
parallelization more manageable; it makes use of the
[Parallel](../api_docs/Parallel.md) class. The async_map calls
`predict_quantile` with all the trees in parallel and returns the result. This
means that `prediction_indices` will be a list with length equal
to the number of trees in the forest, where each element is a single
tree's prediction indices for the input array X. We then create a list
`pred_indices_combined` in which we combine the indices from all trees.
To get the final result, we then just call numpy's quantile implementation.
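
Putting it together, a hedged end-to-end sketch (assuming `"Quantile"` is the forest type that selects PredictorQuantile):

```python
import numpy as np

from adaXT.random_forest import RandomForest

X = np.random.normal(size=(500, 3))
Y = X[:, 0] + np.random.normal(size=500)

forest = RandomForest("Quantile")
forest.fit(X, Y)
# The quantile keyword is forwarded to PredictorQuantile.forest_predict,
# which pools leaf indices across all trees before taking quantiles.
q = forest.predict(X, quantile=[0.1, 0.9])
```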