% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/ampute.R
\name{ampute}
\alias{ampute}
\title{Generate missing data for simulation purposes}
\usage{
ampute(
  data,
  prop = 0.5,
  patterns = NULL,
  freq = NULL,
  mech = "MAR",
  weights = NULL,
  std = TRUE,
  cont = TRUE,
  type = NULL,
  odds = NULL,
  bycases = TRUE,
  run = TRUE
)
}
\arguments{
\item{data}{A complete data matrix or data frame. Values should be numeric.
Categorical variables should have been transformed to dummies.}
\item{prop}{A scalar specifying the proportion of missingness. Should be a value
between 0 and 1. Default is a missingness proportion of 0.5.}
\item{patterns}{A matrix or data frame of size #patterns by #variables where
\code{0} indicates that a variable should have missing values and \code{1} indicates
that a variable should remain complete. The user may specify as many patterns as
desired. One pattern (a vector) is possible as well. Default
is a square matrix of size #variables where each pattern has missingness on one
variable only (created with \code{\link{ampute.default.patterns}}). After the
amputation procedure, \code{\link{md.pattern}} can be used to investigate the
missing data patterns in the data.}
\item{freq}{A vector of length #patterns containing the relative frequency with
which the patterns should occur. For example, for three missing data patterns,
the vector could be \code{c(0.4, 0.4, 0.2)}, meaning that of all cases with
missing values, 40 percent should have pattern 1, 40 percent pattern 2 and 20
percent pattern 3. The vector should sum to 1. Default is an equal probability
for each pattern, created with \code{\link{ampute.default.freq}}.}
\item{mech}{A string specifying the missingness mechanism, either "MCAR"
(Missing Completely At Random), "MAR" (Missing At Random) or "MNAR" (Missing Not At
Random). Default is a MAR missingness mechanism.}
\item{weights}{A matrix or data frame of size #patterns by #variables. The matrix
contains the weights that will be used to calculate the weighted sum scores. For
a MAR mechanism, the weights of the variables that will be made incomplete should be
zero. For a MNAR mechanism, these weights could have any possible value. Furthermore,
the weights may differ between patterns and between variables. They may be negative
as well. Within each pattern, the relative sizes of the values are what matter.
The default weights matrix is made with \code{\link{ampute.default.weights}} and
returns a matrix with equal weights for all variables. In case of MAR, variables
that will be amputed will be weighted with \code{0}. For MNAR, variables
that will be observed will be weighted with \code{0}. If the mechanism is MCAR, the
weights matrix will not be used.}
\item{std}{Logical. Whether the weighted sum scores should be calculated with
standardized data (TRUE) or with non-standardized data (FALSE). The latter is
especially advised when making use of train and test sets in order to prevent
leakage. Default is TRUE.}
\item{cont}{Logical. Whether the probabilities should be based on a continuous
or a discrete distribution. If TRUE, the probabilities of being missing are based
on a continuous logistic distribution function. \code{\link{ampute.continuous}}
will be used to calculate and assign the probabilities. These probabilities will then
be based on the argument \code{type}. If FALSE, the probabilities of being missing are
based on a discrete distribution (\code{\link{ampute.discrete}}) based on the \code{odds}
argument. Default is TRUE.}
\item{type}{A string or vector of strings containing the type of missingness for each
pattern. Either \code{"LEFT"}, \code{"MID"}, \code{"TAIL"} or \code{"RIGHT"}.
If a single missingness type is given, all patterns will be created with the same
type. If the missingness types should differ between patterns, a vector of missingness
types should be given. Default is RIGHT for all patterns and is the result of
\code{\link{ampute.default.type}}.}
\item{odds}{A matrix with one row per pattern (#patterns rows). Each row should contain
the odds of being missing for the corresponding pattern. The number of odds values
defines in how many quantiles the sum scores will be divided. The odds values are
relative probabilities: a quantile with odds value 4 will have a probability of
being missing that is four times higher than a quantile with odds 1. The
number of quantiles may differ between the patterns; specify NA for cells that remain empty.
Default is 4 quantiles with odds values 1, 2, 3 and 4 and is created by
\code{\link{ampute.default.odds}}.}
\item{bycases}{Logical. If TRUE, the proportion of missingness is defined in
terms of cases. If FALSE, the proportion of missingness is defined in terms of
cells. Default is TRUE.}
\item{run}{Logical. If TRUE, the amputations are implemented. If FALSE, the
return object will contain everything except for the amputed data set.}
}
\value{
Returns an S3 object of class \code{\link{mads}} (multivariate
amputed data set).
}
\description{
This function generates multivariate missing data under an MCAR, MAR or MNAR
missing data mechanism. Imputation of data sets containing missing values can
be performed with \code{\link{mice}}.
}
\details{
This function generates missing values in complete data sets. Amputation of complete
data sets is useful for the evaluation of imputation techniques, such as multiple
imputation (performed with function \code{\link{mice}} in this package).
The basic strategy underlying multivariate imputation was suggested by
Don Rubin during discussions in the 1990s. Brand (1999) created one particular
implementation, and his method found its way into the FCS paper
(Van Buuren et al., 2006).
Until recently, univariate amputation procedures were used to generate missing
data in complete, simulated data sets. With this approach, variables are made
incomplete one variable at a time. When more than one variable needs to be amputed,
the procedure is repeated multiple times.
With the univariate approach, it is difficult to relate the missingness on one
variable to the missingness on another variable. A multivariate amputation procedure
solves this issue and moreover, it does justice to the multivariate nature of
data sets. Hence, \code{ampute} is developed to perform multivariate amputation.
The idea behind the function is the specification of several missingness
patterns. Each pattern is a combination of variables with and without missing
values (denoted by \code{0} and \code{1} respectively). For example, one might
want to create two missingness patterns on a data set with four variables. The
patterns could be something like: \code{0,0,1,1} and \code{1,0,1,0}.
Each combination of zeros and ones may occur.
Furthermore, the researcher specifies the proportion of missingness, either the
proportion of missing cases or the proportion of missing cells, and the relative
frequency with which each pattern occurs. Consequently, the data is split into multiple subsets,
one subset per pattern. Now, each case is a candidate for a certain missingness pattern,
but whether the case will eventually have missing values depends on other specifications.
The first of these specifications is the missingness mechanism. There are three possible
mechanisms: the missingness depends completely on chance (MCAR), the missingness
depends on the values of the observed variables (i.e. the variables that remain
complete) (MAR) or on the values of the variables that will be made incomplete (MNAR).
When the user specifies the missingness mechanism to be \code{"MCAR"}, the candidates
have an equal probability of becoming incomplete. For a \code{"MAR"} or \code{"MNAR"} mechanism,
weighted sum scores are calculated. These scores are a linear combination of the
variables.
To calculate the weighted sum scores, the data is first standardized (see argument
\code{std}). For this reason, the data has to be numeric. Second, for each case, the values in
the data set are multiplied with the weights, specified by argument \code{weights}.
These weighted scores will be summed, resulting in a weighted sum score for each case.
The weights may differ between patterns and they may be negative or zero as well.
Naturally, in case of a MAR mechanism, the weights corresponding to the
variables that will be made incomplete are 0. Note that this may be
different for each pattern. In case of MNAR missingness, especially
the weights of the variables that will be made incomplete are of importance. However,
the other variables may be weighted as well.
It is the relative difference between the weights that will result in an effect
in the sum scores. For example, for the first missing data
pattern mentioned above, the weights for the third and fourth variables could
be set to 2 and 4. However, weight values of 0.2 and 0.4 will have the exact
same effect on the weighted sum score: the fourth variable is weighted twice as
much as variable 3.
Based on the weighted sum scores, either a discrete or continuous distribution
of probabilities is used to calculate whether a candidate will have missing values.
For a discrete distribution of probabilities, the weighted sum scores are
divided into subgroups of equal size (quantiles). Thereafter, the user
specifies for each subgroup the odds of being missing. Both the number of
subgroups and the odds values are important for the generation of missing data.
For example, for a RIGHT-like mechanism, scoring in one of the
higher quantiles should have high missingness odds, whereas for a MID-like
mechanism, the central groups should have higher odds. Again, it is not the absolute size of
the odds values that matters, but the size of the values relative to each other.
The continuous distributions of probabilities are based on the logistic distribution function.
The user can specify the type of missingness, which, again, may differ between patterns.
For an example and more explanation about how the arguments interact with
each other, we refer to the vignette:
\href{https://rianneschouten.github.io/mice_ampute/vignette/ampute.html}{Generate missing values with ampute}.
}
\examples{
# start with a complete data set
compl_boys <- cc(boys)[1:3]
# Perform amputation with default settings
mads_boys <- ampute(data = compl_boys)
mads_boys$amp
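# Illustrative addition (not part of the original examples): a minimal sketch
# of inspecting the generated missing data patterns with md.pattern(), as
# suggested in the description of the patterns argument. plot = FALSE is an
# assumption made here to return only the pattern table.
md.pattern(mads_boys$amp, plot = FALSE)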
# Change default matrices as desired
my_patterns <- mads_boys$patterns
my_patterns[1:3, 2] <- 0
my_weights <- mads_boys$weights
my_weights[2, 1] <- 2
my_weights[3, 1] <- 0.5
# Rerun amputation
my_mads_boys <- ampute(
  data = compl_boys, patterns = my_patterns,
  freq = c(0.3, 0.3, 0.4), weights = my_weights,
  type = c("RIGHT", "TAIL", "LEFT")
)
my_mads_boys$amp
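# Illustrative addition (not part of the original examples): a minimal sketch
# of discrete probabilities (cont = FALSE) with one row of odds per pattern.
# The odds values are assumptions chosen for illustration: row 1 is
# RIGHT-like, row 2 is LEFT-like and row 3 is MID-like.
my_odds <- matrix(
  c(
    1, 2, 3, 4,
    4, 3, 2, 1,
    1, 4, 4, 1
  ),
  nrow = 3, byrow = TRUE
)
disc_mads_boys <- ampute(data = compl_boys, cont = FALSE, odds = my_odds)
disc_mads_boys$amp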
}
\references{
Brand, J.P.L. (1999) \emph{Development, implementation and
evaluation of multiple imputation strategies for the statistical analysis of
incomplete data sets.} pp. 110-113. Dissertation. Rotterdam: Erasmus University.

Schouten, R.M., Lugtig, P. and Vink, G. (2018)
Generating missing values for simulation purposes: A multivariate
amputation procedure.
\emph{Journal of Statistical Computation and Simulation}, 88(15): 1909-1930.
\doi{10.1080/00949655.2018.1491577}

Schouten, R.M. and Vink, G. (2018) The Dance of the Mechanisms: How Observed
Information Influences the Validity of Missingness Assumptions.
\emph{Sociological Methods and Research}, 50(3): 1243-1258.
\doi{10.1177/0049124118799376}

Van Buuren, S., Brand, J.P.L., Groothuis-Oudshoorn, C.G.M., Rubin, D.B. (2006)
Fully conditional specification in multivariate imputation.
\emph{Journal of Statistical Computation and Simulation}, 76(12): 1049-1064.
\doi{10.1080/10629360600810434}

Van Buuren, S. (2018).
\emph{Flexible Imputation of Missing Data. Second Edition.}
Chapman & Hall/CRC. Boca Raton, FL.

Vink, G. (2016) Towards a standardized evaluation of multiple imputation routines.
}
\seealso{
\code{\link{mads}}, \code{\link{bwplot.mads}},
\code{\link{xyplot.mads}}
}
\author{
Rianne Schouten, Gerko Vink, Peter Lugtig, 2016
}