The following workflow demonstrates how to build a deep learning model in KNIME for binary classification of the breast cancer Wisconsin dataset.
Breast cancer classification: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Breast Cancer Binary Classification Workflow: https://tinyurl.com/59xfxsay
The breast cancer data is a CSV file with 32 columns describing each tumour. The diagnosis column contains the labels, either M for malignant or B for benign.
Class distribution: 357 benign, 212 malignant.
Use the CSV Reader node to read the data file.
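Outside KNIME, the same step can be sketched with Python's standard csv module. The inline sample below is a hypothetical two-row stand-in for the real file (the column names follow the Kaggle dataset linked above).

```python
import csv
import io

# Tiny inline stand-in for the breast cancer CSV file; the real data has
# 32 columns and 569 rows.
sample = "id,diagnosis,radius_mean\n842302,M,17.99\n8510426,B,13.54\n"

# DictReader plays the role of the CSV Reader node: each row becomes a
# dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["diagnosis"])  # -> M
```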
Use the Missing Value node to remove any rows with missing data. Alternatively, you can fill in missing values using the maximum, mean, median, minimum, etc.; the right method depends on your dataset.
Under the column settings tab, you can also individually select what will happen to missing values in each column. For now, we are just going to remove the row if there are any missing values.
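Both strategies the node offers can be sketched in plain Python (the sample values here are illustrative, not taken from the dataset):

```python
def drop_rows_with_missing(rows):
    """Keep only rows where every value is present (not None)."""
    return [row for row in rows if all(v is not None for v in row)]

def fill_missing_with_mean(column):
    """Replace missing values in a numeric column with the column mean."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

rows = [[17.99, 10.38], [13.54, None], [11.42, 20.38]]
print(drop_rows_with_missing(rows))          # middle row is removed
print(fill_missing_with_mean([13.0, None, 11.0]))  # None -> 12.0
```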
The Category to Number node works similarly to the Many to One node: it converts all the different labels in one column into numbers. We pass the diagnosis column into this node to obtain M = 0, B = 1.
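The encoding the node performs can be sketched as a simple first-appearance mapping (so M, seen first, becomes 0 and B becomes 1):

```python
def category_to_number(labels):
    """Map each distinct label to an integer, in order of first appearance."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return mapping, [mapping[label] for label in labels]

mapping, encoded = category_to_number(["M", "B", "B", "M"])
print(mapping)   # -> {'M': 0, 'B': 1}
print(encoded)   # -> [0, 1, 1, 0]
```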
Partitioning the data into test, train, and validation allows us to effectively evaluate the model’s performance during and after training.
How we partition the data is also very important. Depending on the type of data set, we may need to use stratified sampling or we might need to take from the top. “Use random seed” is selected to specify a seed for random number generation for the partitioning. Setting this option results in the same records being assigned to the same set on successive runs.
We are going to partition the data 70-30 using stratified sampling from the diagnosis column. The column that we take the stratified sampling from needs to be a string column.
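The Partitioning node's behaviour, a seeded 70-30 stratified split, can be sketched like this (the 70/30 fraction and seed value here mirror the configuration above; the seed number itself is arbitrary):

```python
import random

def stratified_split(rows, label_index, train_frac=0.7, seed=42):
    """Split rows 70/30 while preserving the label ratio in each part."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_index], []).append(row)
    rng = random.Random(seed)  # fixed seed -> same split on every run
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Illustrative data: 10 benign and 10 malignant rows.
rows = [("B", i) for i in range(10)] + [("M", i) for i in range(10)]
train, test = stratified_split(rows, label_index=0)
# Each class keeps its 70/30 ratio: 7 of each label in train, 3 in test.
```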
The input layer's shape will match the number of feature columns in the CSV data file.
Each block in the network will be made out of a dense layer, a batch normalization layer, and a dropout layer. (Since the input is tabular rather than image data, these are dense blocks rather than convolutional ones.) The first block will have a dense layer of 256 units.
The batch normalization layer will have an axis of -1, with "center" and "scale" ticked.
The second block has the same components as the first, with almost all the same configuration. For the dense layer in the second block, you can either keep 256 units or decrease it to 128.
The final dense layer will have a sigmoid activation function and 1 unit, since this is a binary classification problem with a single output probability.
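The sigmoid activation simply squashes the final layer's single raw output into the range (0, 1), so it can be read as a probability:

```python
import math

def sigmoid(x):
    """Map a real-valued network output into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))   # -> 0.5 (an undecided output)
print(sigmoid(4.0))   # close to 1, a confident "class 1" prediction
```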
You can connect a DL Python Network Editor node to the end of the network to check a summary of the model. It will show the total trainable and non-trainable parameters.
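As a sanity check on that summary, the parameter counts can be derived by hand. A dense layer has in × out weights plus out biases; batch normalization has 2 trainable parameters per unit (gamma, beta) and 2 non-trainable ones (moving mean, moving variance); dropout has none. The sketch below assumes 30 input features (the 32 columns minus id and diagnosis) and a 256 → 128 → 1 stack, per the architecture described above:

```python
def dense_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(units):
    """Returns (trainable, non-trainable): gamma/beta vs. moving stats."""
    return 2 * units, 2 * units

# Assumed architecture: 30 inputs -> Dense(256)+BN -> Dense(128)+BN -> Dense(1)
trainable = dense_params(30, 256)          # 7936
t, nt = batchnorm_params(256)
trainable += t                             # +512
non_trainable = nt                         # 512
trainable += dense_params(256, 128)        # +32896
t, nt = batchnorm_params(128)
trainable += t                             # +256
non_trainable += nt                        # +256
trainable += dense_params(128, 1)          # +129
print(trainable, non_trainable)            # -> 41729 768
```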
We are going to use binary cross entropy as our loss function, the standard choice for a binary classification problem with a sigmoid output.
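Binary cross entropy rewards confident correct probabilities and heavily penalizes confident wrong ones; a minimal version of the formula is:

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over true 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Confident, correct predictions give a small loss; wrong ones a large one.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.1, 0.9]))
```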
The number of epochs required for a deep learning model varies by application, so it is best to start with smaller numbers, as training can take a very long time; for now, let's use 100 epochs. Where possible, keep the same random seed throughout the workflow, and although Adam is a good default optimizer, feel free to experiment with others.
This executor node will take in the partitioned test set and use the model that was just trained to predict on the new data. Follow the configurations in the image.
The output of the Keras Network Executor will look something like this. The original labels are in the diagnosis column, and the output_0 column is the predicted probability that a tumour is benign (recall B = 1).
If the output_0 column holds a probability of less than 0.5, the Rule Engine will append a prediction column with the value M for malignant; otherwise it appends B for benign.
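The Rule Engine's thresholding rule amounts to a one-line function:

```python
def label_from_probability(p_benign, threshold=0.5):
    """Mirror the Rule Engine: low benign probability -> malignant."""
    return "M" if p_benign < threshold else "B"

# Example network outputs (illustrative values):
predictions = [label_from_probability(p) for p in [0.03, 0.97, 0.51]]
print(predictions)  # -> ['M', 'B', 'B']
```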
The Scorer node will compare the selected columns to determine the model's accuracy. Follow the configurations. To see the confusion matrix, right-click the node and select "Confusion matrix."
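What the Scorer computes can be sketched by counting the four confusion-matrix cells and deriving accuracy from them (the label lists here are illustrative):

```python
def confusion_and_accuracy(actual, predicted, positive="M"):
    """Count TP/FP/TN/FN, treating malignant as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a != positive and p != positive:
            tn += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(actual)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}, accuracy

matrix, acc = confusion_and_accuracy(["M", "B", "M", "B"],
                                     ["M", "B", "B", "B"])
print(matrix, acc)  # one malignant case was missed -> accuracy 0.75
```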