covid-19 effect on liver cancer pred (r).rmd

---
title: "Covid-19 effect on Liver Cancer (Handling Missing records)"
author: "Oghenekeno Eribewe"
date: "`r Sys.Date()`"
output:
  word_document: default
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r, include=FALSE}
library(Boruta) #feature selection
library(dplyr) # data manipulation
library(Hmisc) # miscellaneous functions
library(ggplot2) #data visualization
library(caretEnsemble) #ensembles of caret models
library(tidyr) #tidy messy data
library(rlang) #core r $ tidyverse features
library(RANN) # fast NN Search
library(mice)
```

#### Brief overview of the data set

Loading the data set gotten from <https://www.kaggle.com/datasets/fedesoriano/covid19-effect-on-liver-cancer-prediction-dataset> with strings loaded as factors and header set to true as the data comes with headers.

```{r}
df = read.csv("covid-liver.csv", header = T, stringsAsFactors = T)
```

Next, we take a brief view and stats of the data to analyse the missing values and proceed with what ever solution we can come up with to sort those missing values.

```{r}
#View(df)
head(df, 5)
summary(df)
```

It is observed that there are quite a number of fields with missing values, so we need to visualize these fields to observe the levels of these missing values in the different fields.

```{r}
# checking the sum of NA's in all columns
# sapply(covidLiver, anyNA)
df.Na = colSums(is.na(df))
print(df.Na)

# visualizing the number of columns with the quantity of  missing values
barplot(df.Na, xlab = "Count of missing values in the field", ylab = "Fields", main = "Visual of all fields with the number of missing values")
```

Quite a number of the fields have missing values consisting of more than 50% of the field and there is no way we can predict those missing values as there is not enough remaining data in the fields to make such prediction/assumption. Hence, there is need to remove these fields.

```{r}
# removing the columns using their field numbers
df[,c(13,17,19,20,21,22,24,25,27)] = NULL
summary(df)
```

Now, we can fix up the remaining fields. We start by fixing the fields with very small amount of missing values as it won't require too much technicalities Starting from the size, treatment_grps, PS, and Prev_Known_Cirrhosis fields.

```{r}
# quick visual of the distribution of the values in the field
hist(x = df$Size, freq = TRUE)
```

The distribution is positively skewed, and so the best option would be to impute the median as it would be a more representative value than the mean.

```{r}
df$Size = impute((df$Size), median)
summary(df$Size)
```

Next, we impute the mode for the treatment_grps as it only has 2 missing values and we do not want to drop the rows

```{r}
# next we handle the field treatment_grps
df$Treatment_grps = impute((df$Treatment_grps), mode)
summary(df$Treatment_grps)
```

Next, we visualise the distribution for the PS as it only has 2 missing values

```{r}
# quick visual of the distribution of the values in the field
hist(x = df$PS)

# the median seems ideal to impute as the distribution is awkwardly skewed
df$PS = impute((df$PS), median)
summary(df$PS)
```

```{r}
summary(df)
```

Next, we impute the mode for the Prev_known_cirrhosis as it only has 5 missing values and we do not want to drop the rows

```{r}
# next we handle the field Prev_known_cirrhosis
df$Prev_known_cirrhosis = impute((df$Prev_known_cirrhosis), mode)
summary(df$Prev_known_cirrhosis)
```

From the mice library, we use the random forest mice function to predict the missing categorical values.

```{r, include=FALSE}
imputed_data = mice(df, m=5, method = "rf")
summary(imputed_data)
```

So we use the 5th cycle out of the 5 cycles of predictions made.

```{r}
cleaned_df = complete(imputed_data, 5)
summary(cleaned_df)

df1 = cleaned_df

write.csv(df1, "clean-CovidLiver.csv", row.names = FALSE)
```

In summary, we have carried out simple data processing to handle the messy dataset to generate a clean dataset which can now be used for proper analysis and exploration.