This is a re-implementation of the classic C4.5 decision tree algorithm, kept as faithful to the original as possible. The implementation includes:
- divide-and-conquer recursive partitioning
- error-based pruning
- direct handling of both numeric and categorical attributes
- support for missing values
Current assumptions (when using the `build_decision_tree` function directly; see the sketch after this list):

- the class attribute is the last attribute of the dataframe
- the data types of the dataframe are correctly set for each column: `str` for categorical attributes and `float` for numeric attributes
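As a minimal sketch, the snippet below builds a pandas dataframe that satisfies both assumptions; the attribute names are borrowed from the iris example shown later in this document:

```python
import pandas as pd

# A dataframe that satisfies both assumptions: numeric attributes use
# `float`, categorical attributes use `str` (stored as object dtype),
# and the class attribute ("Class" here) is the LAST column.
data = pd.DataFrame({
    "petal-length": pd.Series([1.4, 4.7, 5.1], dtype=float),
    "petal-width": pd.Series([0.2, 1.4, 1.9], dtype=float),
    "Class": pd.Series(
        ["Iris-setosa", "Iris-versicolor", "Iris-virginica"], dtype=str
    ),
})
```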
There are two ways to run the algorithm: directly from the command line, or imported as a library.
You may need to install additional dependencies to run the algorithm. The easiest way to do that is to run the following in your terminal:

```
pip install -r requirements.txt
```
The command line allows you to run the algorithm and set its parameters. The complete set of options can be viewed by running:

```
python p45.py
```
This will display the list of options and a short description of each:

```
usage: p45.py [-h] [-m cases] [--seed seed] [--unpruned] [--csv] [-t <test file>] <training file>

P45: A Python C4.5 implementation.

positional arguments:
  <training file>  training file

optional arguments:
  -h, --help       show this help message and exit
  -m cases         minimum number of cases for at least two branches of a split
  --seed seed      random seed value
  --unpruned       disables pruning
  --csv            reads input as a CSV file
  -t <test file>   test file
```
At a minimum, you need to specify the training file:

```
python p45.py iris.arff
```
After a successful run, the algorithm produces a decision tree:
```
P45 [Release 1.0] Thu April 7 06:00:00 2022
-----------------
Options:
    Pruning=True
    Cases=2
    Seed=0

Class specified by attribute 'Class'

Read 150 cases (4 predictor attributes) from:
    -> iris.arff

Decision tree:

petal-width <= 0.6: Iris-setosa (50.0)
petal-width > 0.6:
|   petal-width > 1.7: Iris-virginica (46.0/1.0)
|   petal-width <= 1.7:
|   |   petal-length <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petal-length > 4.9:
|   |   |   petal-width <= 1.5: Iris-virginica (3.0)
|   |   |   petal-width > 1.5: Iris-versicolor (3.0/1.0)

Time: 2.7 secs
```
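The options can be combined as needed. For example, the following builds an unpruned tree with a fixed seed and evaluates it on a separate test file (the file name `iris-test.arff` is only illustrative):

```
python p45.py --unpruned --seed 42 -t iris-test.arff iris.arff
```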
You can also add `P45` as a dependency to your project:

```
import p45
```

Then you can use the `build_decision_tree` function to create a decision tree. This function returns a `Node` object representing the root node of the tree. From the root node, you can classify new instances by calling the `predict` or `probabilities` functions.
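Putting it together, a minimal end-to-end sketch might look like the following. The exact signatures of `build_decision_tree`, `predict`, and `probabilities` are assumptions here, so adapt them to the actual API:

```python
import pandas as pd
import p45

# Training data: the last column is the class attribute and the
# dtypes follow the assumptions listed above. Reading a CSV here is
# only illustrative.
train = pd.read_csv("iris.csv")
train = train.astype({train.columns[-1]: str})

# Build the tree; passing the dataframe directly is an assumed signature.
root = p45.build_decision_tree(train)

# Classify a new instance; the argument shape expected by `predict`
# and `probabilities` is also an assumption about the API.
instance = train.drop(columns=train.columns[-1]).iloc[0]
print(root.predict(instance))         # predicted class label
print(root.probabilities(instance))   # class probability estimates
```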
Currently the code can take a while to run on large datasets. The most likely reason for this is the (over-)use of dataframe operations to split the data during the recursive tree-building procedure.
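To illustrate the cost involved, each dataframe filter in a recursive split copies the selected rows into a new dataframe, whereas a mask-based approach (shown here only for contrast, not as code from this project) keeps a single underlying array and only passes index arrays down the recursion:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.rand(1_000_000)})

# Dataframe-heavy style: each split materialises two new dataframes.
left_df = df[df["x"] <= 0.5]
right_df = df[df["x"] > 0.5]

# Mask-based style: each split only produces index arrays over the
# same underlying data, deferring any copying.
x = df["x"].to_numpy()
mask = x <= 0.5
left_idx = np.flatnonzero(mask)
right_idx = np.flatnonzero(~mask)
```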