-
Notifications
You must be signed in to change notification settings - Fork 0
/
Missing Data_Assign 1.rmd
93 lines (59 loc) · 3.98 KB
/
Missing Data_Assign 1.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
---
output:
word_document: default
---
# Paula McCree Bailey
## Module 4 - Assignment 1
### Missing Data
```{r load library and data, message=FALSE}
#install.packages("VIM")
#install.packages("mice")
library(tidyverse)
library(VIM)
library(mice)
grades = read.csv("class-grades.csv")
summary(grades)
```
**Task 1** How much data is missing and in what variables?
**There are a total of 11 missing variables. The variables with missing data are Tutorial, Midterm, TakeHome, and Final.**
**Task 2** Use the VIM package to visualize missingness. Does there appear to be systematic missingness? In other words, are there students that are musing multiple pieces of data?
**Yes, there is one student that is missing multiple pieces of data. This student is missing midterm and take home grades.**
```{r view missingness}
vim_plot = aggr(grades, numbers = TRUE, prop = c(TRUE, FALSE),cex.axis=.7)
```
**Task 3** Use row-wise deletion of missing values to create a new data frame. How many rows remain in
this data frame?
**89 rows remain after using row-wise deletion of missing values.**
```{r row-wise deletion}
grades_delRows = grades %>% drop_na(Final)
grades_delRows = grades_delRows %>% drop_na(TakeHome)
grades_delRows = grades_delRows %>% drop_na(Midterm)
grades_delRows = grades_delRows %>% drop_na(Tutorial)
```
**Task 4** Use column-wise deletion of missing values to create a new data frame (from the original data
frame not from the data frame created in Task 3). How many columns remain in this data frame?
**2 columns remain after using column-wise deletion of missing values.**
```{r column-wise deletio}
grades_delCols = grades
grades_delCols = grades_delCols %>% dplyr::select(-Final, - TakeHome, -Midterm, -Tutorial)
```
**Task 5** Which approach (Task 3 or Task 4) seems preferable for this dataset? Briefly discuss your answer.
**The row-wise deletion of missing values is the better approach. With row-wise, you still retain most of the dataset only loosing 10 observations. However, with column-wise, you are only left with 2 columns - prefix and assignment. It would be difficult to analysis data with just these 2 variables.**
**Task 6** Impute the missing values in the dataset using the mice package.
```{r Impute missing value}
grades_imp = mice(grades, m=1, method = "pmm", seed = 12345)
#in line above: m=1 -> runs one imputation, seed sets the random number seed to get repeatable results
summary(grades_imp)
densityplot(grades_imp)
#red imputed, blue original, only shows density plots when more than 1 value the variable was imputed
#note that the density plots are fairly uninteresting given the small amount of missing data
grades_complete = complete(grades_imp)
summary(grades_complete)
```
**Task 7** Briefly discuss potential issues that could be encountered when working with missing data.
**Potential issues that could be encountered when working with missing data **
**a) There could be so much data missing from the dataset that you are unable to analysis the data. This is similar to our homework. After removing all the columns with missing data, there was nothing remaining to analyze.**
**b) If you decided to take an approach to impute the missing date, it could change the original intent of the data - maybe skewing it in a different direction.**
Describe situations where imputation may not be advisable.
**A situation where imputation may not be advisable is when that variable will be a key predictor for medical research, and drug (medical) trials.**
**For example, I've been always told type I diabetes is more likely to occur in African american families, and the disease is known to skip every other generation in a family. So, if your Mom had it, you may not get it; however your children maybe more likely to get type 1. If you're trying to gauge the validity that type 1 skips generations & your survey doesn't ask about race or about other members of your household, it would not be advisable to use imputed data based on the survey dataset to fill in the missing data.**