---
title: "Covid-19 effect on Liver Cancer (Handling Missing records)"
author: "Oghenekeno Eribewe"
date: "`r Sys.Date()`"
output:
  word_document: default
  html_document: default
  pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, include=FALSE}
library(Boruta) #feature selection
library(dplyr) # data manipulation
library(Hmisc) # miscellaneous functions
library(ggplot2) #data visualization
library(caretEnsemble) #ensembles of caret models
library(tidyr) #tidy messy data
library(rlang) # core R & tidyverse features
library(RANN) # fast NN search
library(mice) # multiple imputation
```
#### Brief overview of the data set
The data set comes from <https://www.kaggle.com/datasets/fedesoriano/covid19-effect-on-liver-cancer-prediction-dataset>. It is loaded with strings converted to factors and `header` set to true, since the file includes a header row.
```{r}
df = read.csv("covid-liver.csv", header = TRUE, stringsAsFactors = TRUE)
```
Next, we take a brief look at the data and its summary statistics to see where values are missing and decide how to handle them.
```{r}
#View(df)
head(df, 5)
summary(df)
```
Quite a number of fields have missing values, so we count and plot the missing values per field to see how badly each one is affected.
```{r}
# checking the sum of NA's in all columns
# sapply(df, anyNA)
df.Na = colSums(is.na(df))
print(df.Na)
# visualizing the number of missing values in each field
# (fields are on the x-axis, counts on the y-axis)
barplot(df.Na, xlab = "Fields", ylab = "Count of missing values", main = "Number of missing values per field")
```
In quite a number of the fields, more than 50% of the values are missing. Too little data remains in those fields to predict or assume the missing values reliably, so we remove them.
```{r}
# removing the columns using their field numbers
df[,c(13,17,19,20,21,22,24,25,27)] = NULL
summary(df)
```
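The column positions above are specific to this file and would shift if the CSV layout changed. As a minimal sketch of an alternative, the columns to drop can be derived from the share of missing values instead. The data frame `toy`, its column names, and the `max_missing` cutoff are made up for illustration and are not part of the data set.

```r
# Toy data frame standing in for the real data (hypothetical columns)
toy <- data.frame(
  id     = 1:6,
  mostly = c(NA, NA, NA, NA, 5, 6),  # 4 of 6 missing, above the cutoff
  some   = c(1, NA, 3, 4, 5, 6),     # 1 of 6 missing, kept
  none   = letters[1:6]              # nothing missing, kept
)

max_missing <- 0.5  # assumed cutoff: drop a column if more than 50% is missing

# Share of missing values per column
na_share <- colMeans(is.na(toy))

# Keep only the columns at or below the cutoff
kept <- toy[, na_share <= max_missing, drop = FALSE]

print(round(na_share, 2))
print(names(kept))
```

Selecting by a computed share rather than by position keeps the step valid if columns are added or reordered, at the cost of not being able to see at a glance which fields were removed.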
Now we can fix up the remaining fields. We start with the fields that have only a few missing values, as they need little technical effort: Size, Treatment_grps, PS, and Prev_known_cirrhosis.
```{r}
# quick visual of the distribution of the values in the field
hist(x = df$Size, freq = TRUE)
```
The distribution is positively skewed, so we impute the median: it is less affected by the long right tail and is therefore a more representative value than the mean.
```{r}
df$Size = impute((df$Size), median)
summary(df$Size)
```
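To illustrate why the median is preferred here, the following sketch compares the two candidate fill values on a small right-skewed sample. The numbers are invented for the example and the imputation is done in base R rather than with `Hmisc::impute`.

```r
# Small right-skewed sample with one missing value (made-up numbers)
size <- c(1, 2, 2, 3, 50, NA)

mean_fill   <- mean(size, na.rm = TRUE)    # pulled upwards by the value 50
median_fill <- median(size, na.rm = TRUE)  # stays with the bulk of the data

# Base R median imputation: replace NA with the median of the observed values
size_imputed <- ifelse(is.na(size), median_fill, size)

print(c(mean = mean_fill, median = median_fill))
print(size_imputed)
```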
Next, we impute the mode for Treatment_grps, as it has only 2 missing values and we do not want to drop the rows.
```{r}
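# next we handle the field Treatment_grps
# note: base R's mode() returns the storage mode, not the most frequent value;
# this call relies on Hmisc::impute filling factor columns with the most frequent level
df$Treatment_grps = impute((df$Treatment_grps), mode)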
summary(df$Treatment_grps)
```
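One thing to keep in mind is that base R has no built-in function for the statistical mode: `mode()` reports how an object is stored. The sketch below shows one way to make the intent explicit with a small helper. The name `stat_mode` and the example factor are hypothetical and only serve the illustration.

```r
# Hypothetical helper: most frequent non-missing value (first one on ties)
stat_mode <- function(x) {
  counts <- table(x, useNA = "no")
  names(counts)[which.max(counts)]
}

# Made-up factor with two missing entries
grp <- factor(c("A", "B", "A", NA, "C", "A", NA))

print(mode(grp))       # storage mode of the object, not the most frequent level
print(stat_mode(grp))  # most frequent level

# Fill the missing entries with the most frequent level
grp_imputed <- grp
grp_imputed[is.na(grp_imputed)] <- stat_mode(grp)
print(table(grp_imputed, useNA = "ifany"))
```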
Next, we visualise the distribution of PS, which also has only 2 missing values.
```{r}
# quick visual of the distribution of the values in the field
hist(x = df$PS)
# the median seems ideal to impute as the distribution is awkwardly skewed
df$PS = impute((df$PS), median)
summary(df$PS)
```
```{r}
summary(df)
```
Next, we impute the mode for Prev_known_cirrhosis, as it has only 5 missing values and we do not want to drop the rows.
```{r}
# next we handle the field Prev_known_cirrhosis
# (as with Treatment_grps, this relies on the column being a factor)
df$Prev_known_cirrhosis = impute((df$Prev_known_cirrhosis), mode)
summary(df$Prev_known_cirrhosis)
```
From the mice library, we use the `mice` function with the random forest method (`method = "rf"`) to predict the remaining missing categorical values.
```{r, include=FALSE}
imputed_data = mice(df, m=5, method = "rf")
summary(imputed_data)
```
With `m=5`, mice produces 5 imputed versions of the data set; we use the 5th of these.
```{r}
cleaned_df = complete(imputed_data, 5)
summary(cleaned_df)
df1 = cleaned_df
write.csv(df1, "clean-CovidLiver.csv", row.names = FALSE)
```
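Before handing the file on, it can be worth confirming that the written CSV really contains no missing values. The following is a minimal sketch of such a check on a toy data frame; `cleaned_toy` and its values are invented, and a temporary file is used so that nothing in the project folder is touched.

```r
# Toy stand-in for the cleaned data (hypothetical values)
cleaned_toy <- data.frame(Size = c(1.5, 2, 3), PS = c(0, 1, 2))

# Write to a temporary file instead of the project folder
out_file <- tempfile(fileext = ".csv")
write.csv(cleaned_toy, out_file, row.names = FALSE)

# Read it back and confirm shape and completeness
check <- read.csv(out_file)
stopifnot(!anyNA(check), identical(dim(check), dim(cleaned_toy)))
```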
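In summary, we carried out simple data processing on a messy data set: we dropped fields that were mostly missing, imputed the median or mode where only a few values were missing, and used mice for the rest. The resulting clean data set can now be used for proper analysis and exploration.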