The following workflow will demonstrate how to use a CNN to do deep learning in KNIME for image classification of a skin cancer dataset.
Skin cancer classification: https://www.kaggle.com/fanconic/skin-cancer-malignant-vs-benign
CNN Meta node for skin cancer Workflow: https://tinyurl.com/2p97rckw
Skin Cancer (Transfer Learning) Workflow: https://tinyurl.com/72br59ss
Here we have two image readers rather than one because the source files are already split into training and test data, so the two sets have to be read separately. Both go through the same rule engine, which appends a class column using the following syntax. The rule engine can take multiple expressions, so this node is used to check whether the file path contains the word “benign” or “malignant” and to append a new “Class” column accordingly.
In this case, the statement “TRUE => “M”” acts as a catch-all: if the input RowID does NOT equal benign or have benign in its file path, the first rule does not match, so the second statement applies and M is appended instead of B.
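The first-match-wins behaviour of the two rules can be sketched in plain Python. This is only an illustration of the logic, not the Rule Engine syntax itself; the function name `class_from_path`, the example paths, and the case-insensitive match are assumptions made for the sketch.

```python
def class_from_path(path: str) -> str:
    """Mimic a two-rule Rule Engine where the first matching rule wins."""
    # Rule 1: the path contains "benign" -> "B"
    # (lower-casing the path is an assumption of this sketch)
    if "benign" in path.lower():
        return "B"
    # Rule 2: catch-all, playing the role of TRUE => "M"
    return "M"


# Hypothetical file paths, for illustration only
paths = ["train/benign/12.jpg", "train/malignant/7.jpg"]
labels = [class_from_path(p) for p in paths]
print(labels)  # ['B', 'M']
```

Because the catch-all rule comes last, any path that does not contain “benign” is labelled M, which works here only because the dataset has exactly two classes.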
The category to number node takes columns with nominal data and maps every category to an integer. Use the arrows to move the Class column to the green include box and tick “append columns.” The output will show that “Benign” = 0, and “Malignant” = 1.
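As a rough picture of what this mapping does, the sketch below assigns consecutive integers to categories in order of first appearance, starting at 0. The helper name `category_to_number` and the ordering rule are assumptions for illustration; check the node's output table for the mapping it actually produced.

```python
def category_to_number(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)  # next unused integer, starting at 0
    return mapping, [mapping[value] for value in values]


# Hypothetical class column, for illustration only
classes = ["B", "M", "B", "M", "M"]
mapping, encoded = category_to_number(classes)
print(mapping)  # {'B': 0, 'M': 1}
print(encoded)  # [0, 1, 0, 1, 1]
```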
We then normalize the image pixel values to the range 0 to 1 so that the model converges faster. When the data is not normalized, the shared weights of the network have different calibrations for different features, which can make the cost function converge very slowly and ineffectively. Enter the normalization expression into the expressions box. You can then choose to replace the column or append a new one. The resulting pixel type should be FLOATTYPE.
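The exact expression used in the workflow is not reproduced here. As a hedged illustration of what normalizing to the range 0 to 1 means, the sketch below assumes 8-bit pixel values (0 to 255) and divides each by 255; the function name `normalize_pixels` and the tiny example image are hypothetical.

```python
def normalize_pixels(pixels, max_value=255.0):
    """Scale pixel values to [0, 1], assuming they lie in [0, max_value]."""
    return [[p / max_value for p in row] for row in pixels]


# A tiny hypothetical 2x3 greyscale image with 8-bit values
image = [[0, 128, 255],
         [64, 32, 16]]
normalized = normalize_pixels(image)

flat = [p for row in normalized for p in row]
print(min(flat), max(flat))  # 0.0 1.0
```

The results are floating-point numbers, which is why the output pixel type has to be a float type rather than an integer type.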
Next, use the image resizer to resize the images. You can enter the size you want (in pixels) for each dimension. Here we resize the images to 224x224x3. The channel dimension indicates whether the picture is in RGB (3) or B&W (1).
Partitioning the data into test, train, and validation sets allows us to effectively evaluate the model’s performance during and after training. As we already have our test data ready, we just need one partitioning node to split the training data into training and validation sets. How we partition the data is also very important. Depending on the type of data set, we may need to use stratified sampling or we might need to take from the top. “Use random seed” is selected to specify a seed for random number generation for the partitioning. Setting this option results in the same records being assigned to the same set on successive runs. Partition the data 80-20 using stratified sampling on the Class column.
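To make the idea of a seeded, stratified 80-20 split concrete, here is a minimal standard-library sketch. It is not how the partitioning node is implemented; the function name `stratified_split`, the seed value 42, and the balanced toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=42):
    """Split row indices so each class keeps the same train/validation ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed -> same split on every run
    train, valid = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        valid.extend(indices[cut:])
    return sorted(train), sorted(valid)


# Hypothetical class column: 50 benign rows followed by 50 malignant rows
labels = ["B"] * 50 + ["M"] * 50
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx))  # 80 20
```

Each class contributes 80% of its rows to training and 20% to validation, so the class balance is preserved in both sets, and calling the function again with the same seed returns the same split.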
Next, the CNN is built using the method we explained in the sign language workflow. Just drag and drop the necessary nodes and configure them as listed in the description. It is important to include dropout nodes after certain layers in order to prevent overfitting. Another way to prevent overfitting would be to use the regularisation options that can be set in the Keras Dense layers. The final dense node has 1 unit because we only have 2 classes under 1 column, so sigmoid activation is better suited to this scenario.
The Keras Network Learner node will be configured similarly to our sign language workflow. Under the input data tab, select CONVERSION = FROM IMAGE and include IMAGE inside the green box. Under the target data tab, select CONVERSION = FROM NUMBER(INTEGER) and make sure the CLASS(TO NUMBER) COLUMN is included in the green box. Select the STANDARD LOSS FUNCTION = BINARY CROSS ENTROPY, as we only have 2 classes and this is best suited to our current dataset. You can set the number of epochs to any value. With 15 epochs you will still get a good accuracy of more than 80%, but with 20 epochs it is possible to achieve 86%.
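For readers who want to see what binary cross entropy measures, the sketch below computes the standard formula by hand: the mean of -(y·log(p) + (1-y)·log(1-p)) over all rows. The function name and the clipping constant `eps` are choices made for this illustration, not settings taken from the learner node.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy for 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical targets (0 = benign, 1 = malignant) and sigmoid outputs
targets = [0, 1, 1, 0]
good_predictions = [0.1, 0.9, 0.8, 0.2]
bad_predictions = [0.9, 0.1, 0.2, 0.8]

good_loss = binary_cross_entropy(targets, good_predictions)
bad_loss = binary_cross_entropy(targets, bad_predictions)
print(good_loss < bad_loss)  # True
```

The loss is small when the predicted probability is close to the true class and grows quickly for confident wrong predictions, which is what the learner minimises during training.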
We include one more rule engine to convert the probability column back to classes. Since we are using sigmoid activation in the final dense layer, the output looks a bit different from our previous workflow. Instead of a 0 or 1 in a class column, the network outputs a single probability: if the output probability <= 0.5, the image is classified as benign; if the output probability > 0.5, it is classified as malignant.
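The thresholding rule described above can be sketched in a few lines of Python. The function name `class_from_probability` and the sample probabilities are hypothetical; the 0.5 cut-off and the direction of the comparison follow the text.

```python
def class_from_probability(probability, threshold=0.5):
    """Map a sigmoid output to a class: > threshold is malignant, else benign."""
    return "M" if probability > threshold else "B"


# Hypothetical sigmoid outputs, for illustration only
probabilities = [0.03, 0.5, 0.51, 0.97]
predicted = [class_from_probability(p) for p in probabilities]
print(predicted)  # ['B', 'B', 'M', 'M']
```

Note that a probability of exactly 0.5 falls on the benign side, matching the <= 0.5 rule.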
For transfer learning, since the model has already been pre-trained, we just import it into KNIME and edit a few of the last layers to better suit our dataset and workflow. We use the DL Python Network Creator and Editor nodes to do this. The script imports the VGG16 model into KNIME. We changed the input shape to suit our skin cancer images and did not include the top.
Since we did not include the top when importing the VGG16 model, we create a simple network using the following Python script.
The configurations for the learner are quite straightforward. Since we have a binary classification, we will use binary cross entropy as our loss function.
The number of epochs required for the deep learning model will vary according to the application, so it is best to try smaller numbers first, as training can take a very long time. Keep the random seed the same throughout the workflow. Although Adam is the preferred optimizer, feel free to experiment with other optimizers. For now, follow the configurations in the image.
Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches. You can later untick this configuration to see how it will affect the outcome.
The Keras Network Executor takes in the test set and applies the model that was just trained to this new data to produce predictions. Follow the configurations in the image.
In this section we evaluate the accuracy of the model based on the partitioned test data.
The rule engine can take multiple expressions, so this node is used to convert the numbers back to letters, which will then be used to score against the original class. Follow the syntax in the image.
The scorer will use the selected columns to determine the model’s accuracy. Follow the configurations. To see the confusion matrix, right click on the node and select “confusion matrix.”
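To show what the scorer reports, the sketch below builds a 2x2 confusion matrix and an accuracy value by hand. The function names and the five example rows are made up for illustration and are not results from this workflow.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes=("B", "M")):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def accuracy(actual, predicted):
    """Fraction of rows where the prediction matches the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels, not output from the skin cancer model
actual = ["B", "B", "M", "M", "M"]
predicted = ["B", "M", "M", "M", "B"]

matrix = confusion_matrix(actual, predicted)
score = accuracy(actual, predicted)
print(matrix)  # [[1, 1], [1, 2]]
print(score)   # 0.6
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total number of rows.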