Non-Image Classification - Breast Cancer using a CNN

The following workflow demonstrates how to use a CNN for deep learning in KNIME to perform binary classification on a non-image (tabular) breast cancer dataset.

Dataset Link

Breast cancer classification: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Workflow Link

Breast Cancer Binary Classification Workflow: https://tinyurl.com/59xfxsay

Breast Cancer Classification - Deep Learning

The breast cancer data is a CSV file with 32 columns describing each tumour. The diagnosis column contains the labels: M for malignant or B for benign.

Class distribution: 357 benign, 212 malignant.
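The class counts above imply a mild imbalance, which is worth quantifying before judging any accuracy figure. The following is a minimal sketch in plain Python (not part of the KNIME workflow) that computes each class share and the accuracy of a trivial model that always predicts the majority class.

```python
# Class counts as stated in the dataset description.
counts = {"B": 357, "M": 212}
total = sum(counts.values())

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(counts.values()) / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any trained model should beat this baseline by a clear margin to be considered useful.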

image

Data Acquisition and Pre-Processing

image

Use the CSV Reader node to read the data file.

image

Use the Missing Value node to filter out any rows that have missing data. Alternatively, you can fill in missing values with the column maximum, mean, median, minimum, and so on. The method you choose will depend on your dataset.

image

Under the Column Settings tab, you can also choose individually how missing values are handled in each column. For now, we simply remove any row that contains a missing value.
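The two strategies mentioned here, removing incomplete rows and filling gaps with a column statistic, can be sketched in plain Python. This is only an illustration of the idea, not what the KNIME node runs internally; the rows are hypothetical toy values and `None` stands in for a missing cell.

```python
from statistics import mean

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"radius_mean": 17.99, "texture_mean": 10.38},
    {"radius_mean": None, "texture_mean": 17.77},
    {"radius_mean": 19.69, "texture_mean": 21.25},
]

def drop_incomplete(rows):
    """Keep only rows in which every value is present."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_with_mean(rows):
    """Replace each missing value with the mean of its column."""
    columns = list(rows[0].keys())
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in columns}
        for r in rows
    ]

dropped = drop_incomplete(rows)
filled = fill_with_mean(rows)
print(len(dropped), "rows kept after dropping")
print("imputed radius_mean:", filled[1]["radius_mean"])
```

Dropping rows is the simpler choice when few values are missing; imputation keeps more data at the cost of introducing estimated values.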

The Category to Number node works similarly to the Many to One node: it converts each distinct label in a column to an integer. We pass the diagnosis column into this node to obtain M = 0, B = 1.
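As a rough illustration of the label encoding, the sketch below assigns integers to categories in order of first appearance. That ordering rule is an assumption made for this example; it reproduces M = 0, B = 1 when a malignant row comes first, but the node's actual assignment should be checked in its output. The function name `category_to_number` is hypothetical.

```python
def category_to_number(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping, [mapping[value] for value in values]

# Hypothetical slice of the diagnosis column.
diagnosis = ["M", "M", "B", "M", "B", "B"]
mapping, encoded = category_to_number(diagnosis)
print(mapping)
print(encoded)
```

With this encoding, a sigmoid output close to 1 corresponds to B (benign) and an output close to 0 corresponds to M (malignant).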

image

Partitioning

Partitioning the data into training, validation, and test sets lets us evaluate the model’s performance both during and after training.

image

How we partition the data is also very important. Depending on the type of dataset, we may need to use stratified sampling, or we may need to take rows from the top. “Use random seed” is selected so that a fixed seed drives the random number generation for the partitioning. With this option set, the same records are assigned to the same set on successive runs.

image

We partition the data 70-30 using stratified sampling on the diagnosis column. The column used for stratified sampling needs to be a string column.
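A seeded, stratified 70-30 split can be sketched in plain Python as follows. This is a minimal illustration of the technique rather than the Partitioning node's implementation; the function name `stratified_split`, the seed value, and the synthetic rows are all assumptions for the example.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, train_fraction=0.7, seed=42):
    """Split rows so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Synthetic rows with the class counts of the dataset (357 B, 212 M).
rows = [{"id": i, "diagnosis": "B"} for i in range(357)]
rows += [{"id": 357 + i, "diagnosis": "M"} for i in range(212)]

train, test = stratified_split(rows, "diagnosis")
print(len(train), len(test))
```

Because the seed is fixed, calling `stratified_split` again on the same rows returns the same partition, which mirrors the effect of the “Use random seed” option.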

Convolutional Neural Network (CNN)

image

The shape of the input layer equals the number of columns from the CSV data file that are passed to the network as inputs.

image

image

Each convolution block consists of a dense layer, a batch normalization layer, and a dropout layer. The dense layer in the first block has 256 units.

image

image

The batch normalization layer uses an axis of -1, with both “center” and “scale” ticked.
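For reference, the standard batch normalization transform (as used in Keras, which the KNIME layer wraps) can be written as below. Here $\mu_B$ and $\sigma_B^2$ are the mean and variance of a feature over the current batch and $\epsilon$ is a small constant for numerical stability; the symbols are the conventional ones, not names taken from the workflow.

```latex
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},
\qquad
y = \gamma \, \hat{x} + \beta
```

Ticking “scale” enables the learned multiplier $\gamma$, and ticking “center” enables the learned offset $\beta$. An axis of -1 means the statistics are computed per feature along the last dimension.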

image

image

The second convolutional block has the same components as the first, with almost the same configuration. For its dense layer, you can either keep 256 units or reduce it to 128.

The final dense layer has 1 unit and a sigmoid activation function, because this is a binary classification problem with a single output.

image

image

To check a summary of the model, connect a DL Python Network Editor node to the end of the CNN. It shows the total number of trainable and non-trainable parameters.
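The parameter totals in the summary can be checked by hand. The sketch below assumes 30 input features (the 32 columns minus an ID and the diagnosis column, which is an assumption about what is fed to the network), a second dense layer of 128 units, and the usual Keras conventions: a dense layer has weights plus biases, batch normalization has two trainable and two non-trainable values per feature, and dropout has none. The numbers in your own summary will differ if your configuration does.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

def batch_norm_params(n_features):
    """Return (trainable, non_trainable): gamma and beta vs. moving mean and variance."""
    return 2 * n_features, 2 * n_features

n_inputs = 30            # assumption: feature columns only
block_widths = [256, 128]  # assumption: second block reduced to 128 units

trainable = 0
non_trainable = 0
previous = n_inputs
for width in block_widths:
    trainable += dense_params(previous, width)
    bn_trainable, bn_fixed = batch_norm_params(width)
    trainable += bn_trainable
    non_trainable += bn_fixed
    previous = width  # dropout adds no parameters

# Final dense layer with a single sigmoid unit.
trainable += dense_params(previous, 1)

print("trainable:", trainable)
print("non-trainable:", non_trainable)
print("total:", trainable + non_trainable)
```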

image

image

Training & Evaluation

image

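Select the loss function in the Keras Network Learner node. Because the network ends in a single sigmoid unit that predicts a 0/1 label, binary cross entropy is the standard choice; mean absolute error also runs on this output, but it is primarily a regression loss.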
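To make the choice of loss function more concrete, the sketch below computes mean absolute error alongside binary cross entropy on a few hypothetical predictions. Binary cross entropy is included as a common alternative for sigmoid outputs; its use here is an assumption for comparison, and the labels and probabilities are made up for illustration.

```python
import math

def mean_absolute_error(y_true, y_prob):
    """Average absolute difference between labels and predicted probabilities."""
    return sum(abs(t - p) for t, p in zip(y_true, y_prob)) / len(y_true)

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels (1 = B, 0 = M) and predicted probabilities of B.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.95]

mae = mean_absolute_error(y_true, y_prob)
bce = binary_cross_entropy(y_true, y_prob)
print(f"MAE: {mae:.4f}")
print(f"BCE: {bce:.4f}")
```

The last prediction is confidently wrong, and binary cross entropy penalizes it far more heavily than mean absolute error does, which is one reason it is often preferred for classification.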

image

image

The number of epochs a deep learning model needs varies by application, so it is best to start with smaller numbers, as training can take a very long time. For now, let’s try 100 epochs. When possible, use the same random seed throughout the workflow. Adam is the preferred optimizer here, but feel free to experiment with others.

image

The Keras Network Executor node takes in the partitioned test set and uses the model that was just trained to generate predictions for the new data. Follow the configuration shown in the image.

image

image

The output of the Keras Network Executor looks something like this. The original labels are in the diagnosis column, and the output_0 column holds the predicted probability that the tumour is benign.

image

If the output_0 column holds a probability below 0.5, the Rule Engine appends a prediction column with the value M for malignant; otherwise the value is B for benign.
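The thresholding rule described here is simple enough to express in a few lines of Python. This is a sketch of the logic only, not Rule Engine syntax; the function name `rule_engine` and the sample probabilities are hypothetical.

```python
def rule_engine(prob_benign, threshold=0.5):
    """Return "M" when the benign probability is below the threshold, else "B"."""
    return "M" if prob_benign < threshold else "B"

# Hypothetical network outputs (probability of benign).
probabilities = [0.03, 0.49, 0.50, 0.97]
predictions = [rule_engine(p) for p in probabilities]
print(predictions)
```

Note that a probability of exactly 0.5 falls on the benign side under this rule.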

image

image

The Scorer node uses the selected columns to determine the model’s accuracy. Follow the configuration shown. To see the confusion matrix, right-click the node and select “Confusion Matrix.”
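What the Scorer reports can be reproduced by hand from the actual and predicted label columns. The sketch below is a plain-Python illustration with hypothetical labels, not the node's implementation; rows of the matrix are actual classes and columns are predicted classes.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels=("B", "M")):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def accuracy(actual, predicted):
    """Fraction of rows where the prediction matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical diagnosis and prediction columns.
actual = ["B", "B", "M", "M", "B", "M"]
predicted = ["B", "M", "M", "M", "B", "B"]

matrix = confusion_matrix(actual, predicted)
for label, row in zip(("B", "M"), matrix):
    print(label, row)
print(f"accuracy: {accuracy(actual, predicted):.3f}")
```

For a medical dataset, the off-diagonal cell for malignant tumours predicted as benign deserves particular attention, since accuracy alone hides which kind of error the model makes.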

image

image