-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtests.Rmd
503 lines (358 loc) · 14.8 KB
/
tests.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
---
title: 'SFB 1036 Course on Data Analysis: Lesson 3'
author: "Simon Anders"
output:
html_document:
df_print: paged
---
```{r echo=FALSE,results='hide'}
set.seed(12345)
```
# Tests and p values
## The Lady Tasting Tea
In the early 1930, at a summer party of the [Rothamstad Research Station][https://en.wikipedia.org/wiki/Rothamsted_Research] in England, a discussion
took place about tea, namely whether it makes a difference to the taste whether one
pours first the milk or first the tea into the cup. A lady claimed that she could tell
the difference which was not believed. See [here](https://en.wikipedia.org/wiki/Lady_tasting_tea)
for Wikipedia's account of the anecdote, which is considered the occasion where
[R. A. Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) came up with the idea of the
null hypothesis and the p value.
A randomized trial was set up to test her claim: Eight cups were taken, into four of them
tea was poured first ("T") and then milk, into the other four milk first ("M"), then tea.
We represent this in R as a vector:
```{r}
cups <- c( "T", "T", "T", "T", "M", "M", "M", "M" )
```
We can use the `sample` function to randomly shuffle ("permute") the order of the cups:
```{r}
sample( cups )
```
Every time you call `sample` you get a random permutation:
```{r}
sample( cups )
sample( cups )
```
Presented with these eight cups in random order, can the lady pick the four cups which had the tea poured in
first? If so, does this consitute sufficient evidence that we should believe her
claim that she can taste a difference? WHat if she only get three of the four cups
right?
[R. A. Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher), who was present at
the party, agued that we have to calculate the probability that she could achieve
this result by chance, by just picking four cups at random and thus making a lucky guess.
So, he started with formulating what he called the ``null hypothesis'': The lady's
pick of four cups is entirely random (because she cannot taste a difference).
Let R select four cups at random. Again, we use the `sample` function, which now
picks a "random sample" of 4 elements from our vector `cups` with 8 elements
```{r}
s <- sample( cups, 4 )
s
```
How many of the cups in this sample `s` have the tea first?
```{r}
sum( s == "T" )
```
We now want to see how often it happens that one gets 3 or more cups with tea first.
Does `s` contain three or more "T" cups?
```{r}
sum( s == "T" ) >= 3
```
Let's repeat these steps 100 times
```{r}
replicate( 100,
sum( sample( cups, 4 ) == "T" ) >= 3 )
```
What fraction of these results were `TRUE`?
Let's do it again, this time even 10,000 times and use the `mean` function to get the fraction of `TRUE`s
```{r}
mean( replicate( 10000, sum( sample( cups, 4 ) == "T" ) >= 3 ) )
```
Were 10,000 samples enough to get a stable estimate of the probability? Let's try again:
```{r}
mean( replicate( 10000, sum( sample( cups, 4 ) == "T" ) >= 3 ) )
```
Yes, we got the same first two digits. It seems that 24% is a good answer to the question?
WHat is the probability of getting at least 3 cups right?
And how often, that one gets all four right?
```{r}
mean( replicate( 10000, sum( sample( cups, 4 ) == "T" ) == 4 ) )
```
The lady did in fact get all four cups right. The probability for her managing to do so under the assumption
of the null hypothesis (that she was guessing blindly) is less than 2%. The value we just obtaines is the what we call the *p value* associated with the outcome of our experiment.
We asked our computer to test 10,000 random samples to get the p value. But in 1930, FIsher did not have a
computer, and probably also wouldnt have had the patience to make thousands of tried. How did he get the p value?
If we randomly rearrange ("permute") the eight cups, there are only 70 possible ways to permute them. One way to see
that is to ask R to list them all. (You need to install the `combinat` package to run this command.)
```{r}
sapply( unique( combinat::permn( cups ) ), paste, collapse=" " )
```
You can also arrive at the number 70 by calculating the [binomial coefficient]() $\binom{8}{4}$,
as we have 4 "T" cups among our 8 tea cups. (See the Wikipedia article on [combinations](https://en.wikipedia.org/wiki/Combination) if you want to know what that means. )
Let's say we always pick the first 4 cups: How many of the 70 permutations above have "T" at the
four first places? Only one. Hence we probability to get all four cups correct is 1 in 70. The
exact value of the p value we tried to calculate above by random tries is 1/70:
```{r}
1/70
```
There are 16 permutations that have 3 "T" cups and 1 "M" cups among the first four. So, the probability to get exactly 3 cups right is 16/70, and to get *at least* 3 right is 17/70. Hence, if the lady had gotten only 3 cups
right, the p value for that ourcome would have been
```{r}
17/70
```
The general way to calculate this is the [hypergeometric distribution](https://en.wikipedia.org/wiki/Hypergeometric_distribution).
For example, the probability to get exactly 3 cups right (16/70) is
```{r}
16/70
dhyper( 3, 4, 4, 4 )
```
The `dhyper` function is called thus:
dhyper(
number of cups correctly said to have milk first,
number of cups with milk first,
number of cups with tea first,
number of cups said to have milk first )
If you look at its help page, you will read the same, but talking about black and white balls
rather than tea and milk.
This kind of problem is quite common, and so, Fisher turned his approach into a general
method, called "Fisher's test":
We first set up what is called a "contingency table", with rows for the actual truth
and columns for the guess:
```
| guess
truth | tea first milk first | Sum
-----------+------------------------+------
tea first | 3 1 | 4
milk first | 1 3 | 4
-----------+------------------------+------
Sum | 4 4 | 8
```
The margins are just the sums, we don't need them. We put just the core part into
a variable. We use `rbind` (row bind) to enter the two matrix rows and bind them
together into a matrix
```{r}
m <- rbind( c(3,1), c(1,3) )
m
```
Now, we can add this to the the FIsher test function
```{r}
fisher.test( m, alternative = "greater" )
```
And, again, we get our p value of 17/70.
## A loaded die
A similar example: A die is thrown 60 times. We would expect to see a six 10 times, but we see it
15 times. Is the die loaded?
Our null hypothesis: The die is fair, i.e., the probability for a six is 1/6
Then, the probability to see $k$ sixes among the 60 throws is given by the [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution). Instead of copying the
formula from WIkipedia, we simply ask R what the probability is to see exactly 15 sixes:
```{r}
dbinom( 15, 60, 1/6 )
```
In our experiment to see whether the die is loaded, we give the outcome by counting
how many sixes we have. The number of sixes is our *test statistic*. The test statistic can
take on 61 possible values, the numbers from 0 to 60. In R, we write this with a colon:
```{r}
0:60
```
For each value, `dbinom` can give us the probability of obtaining this value under the assumption
of our null hypothesis (namely, that the die is fair and the probability for a six is 1/6):
```{r}
dbinom( 0:60, 60, 1/6 )
```
Let's plot this
```{r}
plot(
x = 0:60,
y = dbinom( 0:60, 60, 1/6 ) )
```
The probability to then see 15 or more sixes is obtained by calculating `dbinom` for all the
values from 15 to 60 and summing them up
```{r}
sum( dbinom( 15:60, 60, 1/6 ) )
```
There is also the `pbinom` function which can tell us the probability to see 14 or fewer sixes.
We shoudl hence get the same value this way:
```{r}
1 - pbinom( 14, 60, 1/6 )
```
Did we use these functions correctly? Is the the binomial distribution the right one to use? To
be sure, let's simulate:
```{r}
mean( replicate( 100000, sum( sample( 1:6, 60, replace=TRUE ) == 6 ) >= 15 ) )
```
We get aproximately the same p value.
How did this complicates command work? To understand it, read it from inside out. Start with the middle part:
```{r}
sample( 1:6, 60, replace=TRUE )
```
This gives us 60 throws of the die. The `replace=TRUE` tells `sample` that it can report the same number several times, i.e., that it should pick among each of the six faces of the die (1:6) anew for each throw, rather than removing from the pool the numbers that have already been picked.
Then, we check, how many sixes there are in such a sample of 60 throws
```{r}
sum( sample( 1:6, 60, replace=TRUE ) == 6 )
```
We do this many times, to see how many sixes one might see
```{r}
replicate( 100, sum( sample( 1:6, 60, replace=TRUE ) == 6 ) )
```
We then check for each outcome wherher it is 15 or more and get the fraction of TRUE results
```{r}
mean( replicate( 10000, sum( sample( 1:6, 60, replace=TRUE ) == 6 ) >= 15 ) )
```
This is indeed the same as
```{r}
sum( dbinom( 15:60, 60, 1/6 ) )
```
or, if we use the *binomial test* function
```{r}
binom.test( 15, 60, 1/6, alternative = "greater" )
```
### Luck or foul play?
What if we see 25 times a six in our 60n throws?
```{r}
sum( dbinom( 25:60, 60, 1/6 ) )
```
If our opponent got 15 sixes (p=0.065), we might believe that he was lucky, but if he got
25 sixes (p=0.0000042), we would be justified in being quite sure that he cheated.
Here are the p values for various outcomes
```{r message=FALSE}
library(tidyverse)
```
```{r}
tibble( min_num_of_sixes = 0:60 ) %>%
mutate( probability = pbinom( min_num_of_sixes, 60, 1/6, lower.tail=FALSE ) ) %>%
ggplot() +
geom_point( aes( x = min_num_of_sixes, y = probability) ) +
scale_y_log10()
```
When do we start supecting that the die might be loaded? When are we sure?
Is $p<=0.05$ really a good threshold? Does this depend on the stakes? Or on our prior beliefs?
See, e.g., [xkcd #1132](https://www.xkcd.com/1132/)
## Correlation test
One more example:
We have two sets of numbers:
```{r}
x <- c(0.32, 0.39, 0.11, 0.6, 0.94, 0.27, 0.19, 0.66, 0.91, 0.41)
y <- c(0.3, 0.77, 0.09, 0.76, 0.62, 0.19, 0.07, 0.89, 0.92, 0.39)
```
They seem correlated
```{r}
plot( x, y, asp=1 )
```
The Pearson correlation coefficient is
```{r}
cor( x, y )
```
Could such a high correlation coefficient arise by chance?
If we shuffle one of the vectors, we should destroy the correlation
```{r}
cor( x, sample(y) )
```
But it won't be zero. What values should we expect to seet to arise by chance if we
pair the values at random?
```{r}
hist( replicate( 10000, cor( x, sample(y) ) ) )
```
How many of these are above the correlation value we have seen above?
```{r}
mean( abs( replicate( 10000, cor( x, sample(y) ) ) ) > cor( x, y ) )
```
Note the `abs`, which removes the minus signs. We check whetehr the correlation is more extreme than
the real ones to get a p value.
R has, again, a way to calculate this number without having to try out thousands of correlations:
```{r}
cor.test( x, y )
```
# Multiple testing
Two further points why a p value threshold at 0.05 must not be used with a "one size fits all" attitude:
- If many researchers test the same or hypothesis (or a set of similar hypotheses), then one in 20
research teams will get a p value below 1/20=5%, even if all the hypotheses are wrong. See [xkcd #882](https://xkcd.com/882/).
Only the "successful" team will publish a paper, while we will never hear about the other 19 teams. Hence, fo all the findings
reported in the literature with p values to just barely below 0.05 level, we may expect that very many, if not most of them,
are wrong.
- [Sally Clark](https://en.wikipedia.org/wiki/Sally_Clark) lost two babies to sudden infant death syndrome ([SIDS](https://en.wikipedia.org/wiki/Sudden_infant_death_syndrome)), i.e. to unexplained sudden death while sleeping.
As SIDS only happens to around one in 8000 babies, the court considered it impossibly unlikely that both babies died of natural
causes and convicted her for murder. Two subsequent SIDS cases have very low probability, $1 / 8000^2 = 1.6\cdot 10^{-8}$, but given that
millions of babies are born every year in Europe, we should fully expect that once in a while, a mother suffers this fate twice,
without concluding that she must be a murderer. Hence, even an extremely low p value can be misleading.
We will have to discuss much more about these issues.
# Further material (not tidied up yet)
```
# Now continuous data
# Load NHANES again
library( tidyverse )
inner_join(
haven::read_xpt( "NHANES/DEMO_I.XPT" ),
haven::read_xpt( "NHANES/BMX_I.XPT" ),
by="SEQN" ) %>%
select(
subjectId = SEQN,
age = RIDAGEYR,
sex = RIAGENDR,
height = BMXHT,
weight = BMXWT,
ethnicity = RIDRETH3,
bornInUS = DMDBORN4,
) %>%
mutate(
sex = fct_recode( factor(sex), male="1", female="2" ),
ethnicity = fct_recode( factor(ethnicity), mexican="1", hispanic="2",
white="3", black="4", asian="6", other="7" ),
bornInUS = fct_recode( factor(bornInUS), yes="1", no="2" ),
) ->
nhanes
nhanes %>%
filter( age>20, sex=="male", ethnicity=="mexican", !is.na(height) ) %>%
ggplot +
geom_density( aes( x=height, col=bornInUS ) )
library(ggbeeswarm)
nhanes %>%
filter( age>20, sex=="male", ethnicity=="mexican", !is.na(height) ) %>%
ggplot +
geom_beeswarm( aes( y=height, x=bornInUS ) )
nhanes %>%
filter( age>20, sex=="male", ethnicity=="mexican", bornInUS=="yes", !is.na(height) ) %>%
pull( height ) ->
height_A
nhanes %>%
filter( age>20, sex=="male", ethnicity=="mexican", bornInUS=="no", !is.na(height) ) %>%
pull( height ) ->
height_B
t.test( height_A, height_B )
smplA <- sample( height_A, 20 )
smplB <- sample( height_B, 20 )
t.test( smplA, smplB )
mean( sample( height_B, 20 ) ) - mean( sample( height_A, 20 ) )
smplA <- sample( height_A, 50 )
smplB <- sample( height_B, 50 )
t.test( smplA, smplB )
t.test( smplA, smplB, var.equal=TRUE )
# our test statistic is
t <- ( mean(smplB) - mean(smplA) ) / ( ( sd(smplA) + sd(smplB) )/2 )
# common mean
mu0 <- mean( c( smplA, smplB ) )
mu0
# SDs
sd0 <- ( sd( smplA ) + sd( smplB ) )/2
# Null hypothesis: The mean is the same in both groups
# For example, like this:
smplA0 <- rnorm( 50, mu0, sd0 )
smplB0 <- rnorm( 50, mu0, sd0 )
# test statistic for this null sample
c( mean(smplB0) - mean(smplA0) ) / ( ( sd(smplA0) + sd(smplB0 ) )/2 )
#What is the null distriution of our test statistic?
t0 <- replicate( 10000, {
smplA0 <- rnorm( 50, mu0, sd0 )
smplB0 <- rnorm( 50, mu0, sd0 )
c( mean(smplB0) - mean(smplA0) ) / ( ( sd(smplA0) + sd(smplB0 ) )/2 )
} )
hist(t0)
t
table( abs(t0) > abs(t) )
mean( abs(t0) > abs(t) )
# Alternative: Permutation test
t0 <- replicate( 10000, {
smpl_AB0 <- sample( c( smplA, smplB ) )
smplA0 <- smpl_AB0[1:50]
smplB0 <- smpl_AB0[51:100]
c( mean(smplB0) - mean(smplA0) ) / ( ( sd(smplA0) + sd(smplB0 ) )/2 )
} )
mean( abs(t0) > abs(t) )
```