-
Notifications
You must be signed in to change notification settings - Fork 85
/
Copy path09-HypothesisTesting.Rmd
669 lines (465 loc) · 47.1 KB
/
09-HypothesisTesting.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
---
output:
html_document: default
bookdown::gitbook:
lib_dir: "book_assets"
includes:
in_header: google_analytics.html
pdf_document: default
---
# Hypothesis testing
```{r echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(ggplot2)
library(cowplot)
set.seed(123456) # set random seed to exactly replicate results
library(knitr)
# load the NHANES data library
library(NHANES)
# drop duplicated IDs within the NHANES dataset
NHANES <-
NHANES %>%
dplyr::distinct(ID,.keep_all = TRUE)
NHANES_adult <-
NHANES %>%
drop_na(PhysActive,BMI) %>%
subset(Age >= 18)
```
In the first chapter we discussed the three major goals of statistics:
- Describe
- Decide
- Predict
In this chapter we will introduce the ideas behind the use of statistics to make decisions -- in particular, decisions about whether a particular hypothesis is supported by the data.
## Null Hypothesis Statistical Testing (NHST)
The specific type of hypothesis testing that we will discuss is known (for reasons that will become clear) as *null hypothesis statistical testing* (NHST). If you pick up almost any scientific or biomedical research publication, you will see NHST being used to test hypotheses, and in their introductory psychology textbook, Gerrig & Zimbardo (2002) referred to NHST as the “backbone of psychological research”. Thus, learning how to use and interpret the results from hypothesis testing is essential to understand the results from many fields of research.
It is also important for you to know, however, that NHST is deeply flawed, and that many statisticians and researchers (including myself) think that it has been the cause of serious problems in science, which we will discuss in Chapter \@ref(doing-reproducible-research). For more than 50 years, there have been calls to abandon NHST in favor of other approaches (like those that we will discuss in the following chapters):
- “The test of statistical significance in psychological research may be taken as an instance of a kind of essential mindlessness in the conduct of research” (Bakan, 1966)
- Hypothesis testing is “a wrongheaded view about what constitutes scientific progress” (Luce, 1988)
NHST is also widely misunderstood, largely because it violates our intuitions about how statistical hypothesis testing should work. Let's look at an example to see this.
## Null hypothesis statistical testing: An example
There is great interest in the use of body-worn cameras by police officers, which are thought to reduce the use of force and improve officer behavior. However, in order to establish this we need experimental evidence, and it has become increasingly common for governments to use randomized controlled trials to test such ideas. A randomized controlled trial of the effectiveness of body-worn cameras was performed by the Washington, DC government and DC Metropolitan Police Department in 2015/2016. Officers were randomly assigned to wear a body-worn camera or not, and their behavior was then tracked over time to determine whether the cameras resulted in less use of force and fewer civilian complaints about officer behavior.
Before we get to the results, let's ask how you would think the statistical analysis might work. Let's say we want to specifically test the hypothesis of whether the use of force is decreased by the wearing of cameras. The randomized controlled trial provides us with the data to test the hypothesis -- namely, the rates of use of force by officers assigned to either the camera or control groups. The next obvious step is to look at the data and determine whether they provide convincing evidence for or against this hypothesis. That is: What is the likelihood that body-worn cameras reduce the use of force, given the data and everything else we know?
It turns out that this is *not* how null hypothesis testing works. Instead, we first take our hypothesis of interest (i.e. that body-worn cameras reduce use of force), and flip it on its head, creating a *null hypothesis* -- in this case, the null hypothesis would be that cameras do not reduce use of force. Importantly, we then assume that the null hypothesis is true. We then look at the data, and determine how likely the data would be if the null hypothesis were true. If the the data are sufficiently unlikely under the null hypothesis that we can reject the null in favor of the *alternative hypothesis* which is our hypothesis of interest. If there is not sufficient evidence to reject the null, then we say that we retain (or "fail to reject") the null, sticking with our initial assumption that the null is true.
Understanding some of the concepts of NHST, particularly the notorious "p-value", is invariably challenging the first time one encounters them, because they are so counter-intuitive. As we will see later, there are other approaches that provide a much more intuitive way to address hypothesis testing (but have their own complexities). However, before we get to those, it's important for you to have a deep understanding of how hypothesis testing works, because it's clearly not going to go away any time soon.
## The process of null hypothesis testing
We can break the process of null hypothesis testing down into a number of steps:
1. Formulate a hypothesis that embodies our prediction (*before seeing the data*)
2. Specify null and alternative hypotheses
3. Collect some data relevant to the hypothesis
4. Fit a model to the data that represents the alternative hypothesis and compute a test statistic
5. Compute the probability of the observed value of that statistic assuming that the null hypothesis is true
5. Assess the “statistical significance” of the result
For a hands-on example, let's use the NHANES data to ask the following question: Is physical activity related to body mass index? In the NHANES dataset, participants were asked whether they engage regularly in moderate or vigorous-intensity sports, fitness or recreational activities (stored in the variable $PhysActive$). The researchers also measured height and weight and used them to compute the *Body Mass Index* (BMI):
$$
BMI = \frac{weight(kg)}{height(m)^2}
$$
### Step 1: Formulate a hypothesis of interest
We hypothesize that BMI is greater for people who do not engage in physical activity, compared to those who do.
### Step 2: Specify the null and alternative hypotheses
For step 2, we need to specify our null hypothesis (which we call $H_0$) and our alternative hypothesis (which we call $H_A$). $H_0$ is the baseline against which we test our hypothesis of interest: that is, what would we expect the data to look like if there was no effect? The null hypothesis always involves some kind of equality (=, $\le$, or $\ge$). $H_A$ describes what we expect if there actually is an effect. The alternative hypothesis always involves some kind of inequality ($\ne$, >, or <). Importantly, null hypothesis testing operates under the assumption that the null hypothesis is true unless the evidence shows otherwise.
We also have to decide whether we want to test a *directional* or *non-directional* hypotheses. A non-directional hypothesis simply predicts that there will be a difference, without predicting which direction it will go. For the BMI/activity example, a non-directional null hypothesis would be:
$H0: BMI_{active} = BMI_{inactive}$
and the corresponding non-directional alternative hypothesis would be:
$HA: BMI_{active} \neq BMI_{inactive}$
A directional hypothesis, on the other hand, predicts which direction the difference would go. For example, we have strong prior knowledge to predict that people who engage in physical activity should weigh less than those who do not, so we would propose the following directional null hypothesis:
$H0: BMI_{active} \ge BMI_{inactive}$
and directional alternative:
$HA: BMI_{active} < BMI_{inactive}$
As we will see later, testing a non-directional hypothesis is more conservative, so this is generally to be preferred unless there is a strong *a priori* reason to hypothesize an effect in a particular direction. Hypotheses, including whether they are directional or not, should always be specified prior to looking at the data!
### Step 3: Collect some data
In this case, we will sample 250 individuals from the NHANES dataset. Figure \@ref(fig:bmiSample) shows an example of such a sample, with BMI shown separately for active and inactive individuals, and Table \@ref(tab:summaryTable) shows summary statistics for each group.
```{r summaryTable, echo=FALSE}
# sample 250 adults from NHANES and compute mean BMI separately for active
# and inactive individuals
sampSize <- 250
NHANES_sample <-
NHANES_adult %>%
sample_n(sampSize)
sampleSummary <-
NHANES_sample %>%
group_by(PhysActive) %>%
summarize(
N = length(BMI),
mean = mean(BMI),
sd = sd(BMI)
)
# calculate the mean difference in BMI between active
# and inactive individuals; we'll use this later to calculate the t-statistic
meanDiff <-
sampleSummary %>%
select(
PhysActive,
mean
) %>%
spread(PhysActive, mean) %>%
mutate(
meanDiff = No - Yes
) %>%
pull(meanDiff)
# calculate the summed variances in BMI for active
# and inactive individuals; we'll use this later to calculate the t-statistic
sumVariance <-
sampleSummary %>%
select(
PhysActive,
N,
sd
) %>%
gather(column, stat, N:sd) %>%
unite(temp, PhysActive, column) %>%
spread(temp, stat) %>%
mutate(
sumVariance = No_sd**2 / No_N + Yes_sd**2 / Yes_N
) %>%
pull(sumVariance)
s1 = sampleSummary$sd[1]
s2 = sampleSummary$sd[2]
n1 = sampleSummary$N[1]
n2 = sampleSummary$N[2]
welch_df = (s1/n1 + s2/n2)**2 / ((s1/n1)**2/(n1-1) + (s2/n2)**2/(n2-1))
# print sampleSummary table
kable(sampleSummary, digits=4,caption='Summary of BMI data for active versus inactive individuals')
```
```{r bmiSample, echo=FALSE,fig.cap="Box plot of BMI data from a sample of adults from the NHANES dataset, split by whether they reported engaging in regular physical activity.",fig.width=4,fig.height=4,out.height='50%'}
ggplot(NHANES_sample,aes(PhysActive,BMI)) +
geom_boxplot() +
xlab('Physically active?') +
ylab('Body Mass Index (BMI)')
```
### Step 4: Fit a model to the data and compute a test statistic
We next want to use the data to compute a statistic that will ultimately let us decide whether the null hypothesis is rejected or not. To do this, the model needs to quantify the amount of evidence in favor of the alternative hypothesis, relative to the variability in the data. Thus we can think of the test statistic as providing a measure of the size of the effect compared to the variability in the data. In general, this test statistic will have a probability distribution associated with it, because that allows us to determine how likely our observed value of the statistic is under the null hypothesis.
For the BMI example, we need a test statistic that allows us to test for a difference between two means, since the hypotheses are stated in terms of mean BMI for each group. One statistic that is often used to compare two means is the *t* statistic, first developed by the statistician William Sealy Gossett, who worked for the Guiness Brewery in Dublin and wrote under the pen name "Student" - hence, it is often called "Student's *t* statistic". The *t* statistic is appropriate for comparing the means of two groups when the sample sizes are relatively small and the population standard deviation is unknown. The *t* statistic for comparison of two independent groups is computed as:
$$
t = \frac{\bar{X_1} - \bar{X_2}}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}
$$
where $\bar{X}_1$ and $\bar{X}_2$ are the means of the two groups, $S^2_1$ and $S^2_2$ are the estimated variances of the groups, and $n_1$ and $n_2$ are the sizes of the two groups. Because the variance of a difference between two independent variables is the sum of the variances of each individual variable ($var(A - B) = var(A) + var(B)$), we add the variances for each group divided by their sample sizes in order to compute the standard error of the difference. Thus, one can view the the *t* statistic as a way of quantifying how large the difference between groups is in relation to the sampling variability of the difference between means.
The *t* statistic is distributed according to a probability distribution known as a *t* distribution. The *t* distribution looks quite similar to a normal distribution, but it differs depending on the number of degrees of freedom. When the degrees of freedom are large (say 1000), then the *t* distribution looks essentially like the normal distribution, but when they are small then the *t* distribution has longer tails than the normal (see Figure \@ref(fig:tVersusNormal)). In the simplest case, where the groups are the same size and have equal variance, the degrees of freedom for the *t* test is the number of observations minus 2, since we have computed two means and thus given up two degrees of freedom. In this case it's pretty clear from the box plot that the inactive group is more variable than then active group, and the numbers in each group differ, so we need to use a slightly more complex formula for the degrees of freedom, which is often referred to as a "Welch t-test". The formula is:
$$
\mathrm{d.f.} = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{\left(S_1^2/n_1\right)^2}{n_1-1} + \frac{\left(S_2^2/n_2\right)^2}{n_2-1}}
$$
This will be equal to $n_1 + n_2 - 2$ when the variances and sample sizes are equal, and otherwise will be smaller, in effect imposing a penalty on the test for differences in sample size or variance. For this example, that comes out to `r I(sprintf("%0.2f", welch_df))` which is slightly below the value of 248 that one would get by subtracting 2 from the sample size.
```{r tVersusNormal,echo=FALSE,fig.cap="Each panel shows the t distribution (in blue dashed line) overlaid on the normal distribution (in solid red line). The left panel shows a t distribution with 4 degrees of freedom, in which case the distribution is similar but has slightly wider tails. The right panel shows a t distribution with 1000 degrees of freedom, in which case it is virtually identical to the normal.",fig.width=8,fig.height=4,out.height='50%'}
distDfNormal <- data.frame(x=seq(-4,4,0.01)) %>%
mutate(normal=dnorm(x), Distribution='Normal')
distDft4 <- data.frame(x=seq(-4,4,0.01)) %>%
mutate(normal=dt(x, df=4), Distribution='t (df=4)')
distDft1000 <- data.frame(x=seq(-4,4,0.01)) %>%
mutate(normal=dt(x, df=1000), Distribution='t (df=1000)')
p1 <- ggplot(rbind(distDfNormal, distDft4),aes(x=x, y=normal, color=Distribution)) +
geom_line(aes(linetype=Distribution), size=2) +
ggtitle('df = 4') +
ylab('density') +
ylim(0, 0.5) +
theme(text = element_text(size=14)) +
theme(legend.position=c(0.2, 0.8),
legend.title = element_text(size = 20),
legend.text = element_text(size = 20)
)
p2 <-ggplot(rbind(distDfNormal, distDft1000),aes(x=x, y=normal, color=Distribution)) +
geom_line(aes(linetype=Distribution), size=2) +
ggtitle('df = 1000') +
ylab('density') +
ylim(0, 0.5) +
theme(text = element_text(size=14)) +
theme(legend.position=c(0.2, 0.8),
legend.title = element_text(size = 20),
legend.text = element_text(size = 20)
)
plot_grid(p1,p2)
```
### Step 5: Determine the probability of the observed result under the null hypothesis
This is the step where NHST starts to violate our intuition. Rather than determining the likelihood that the null hypothesis is true given the data, we instead determine the likelihood under the null hypothesis of observing a statistic at least as extreme as one that we have observed --- because we started out by assuming that the null hypothesis is true! To do this, we need to know the expected probability distribution for the statistic under the null hypothesis, so that we can ask how likely the result would be under that distribution. Note that when I say "how likely the result would be", what I really mean is "how likely the observed result or one more extreme would be". There are (at least) two reasons that we need to add this caveat. The first is that when we are talking about continuous values, the probability of any particular value is zero (as you might remember if you've taken a calculus class). More importantly, we are trying to determine how weird our result would be if the null hypothesis were true, and any result that is more extreme will be even more weird, so we want to count all of those weirder possibilities when we compute the probability of our result under the null hypothesis.
We can obtain this "null distribution" either using a theoretical distribution (like the *t* distribution), or using randomization. Before we move to our BMI example, let's start with some simpler examples.
#### P-values: A very simple example {#pvalues-very-simple}
Let's say that we wish to determine whether a particular coin is biased towards landing heads. To collect data, we flip the coin 100 times, and let's say we count 70 heads. In this example, $H_0: P(heads) \le 0.5$ and $H_A: P(heads) > 0.5$, and our test statistic is simply the number of heads that we counted. The question that we then want to ask is: How likely is it that we would observe 70 or more heads in 100 coin flips if the true probability of heads is 0.5? We can imagine that this might happen very occasionally just by chance, but doesn't seem very likely. To quantify this probability, we can use the *binomial distribution*:
$$
P(X \le k) = \sum_{i=0}^k \binom{N}{k} p^i (1-p)^{(n-i)}
$$
This equation will tell us the probability of a certain number of heads ($k$) or fewer, given a particular probability of heads ($p$) and number of events ($N$). However, what we really want to know is the probability of a certain number or more, which we can obtain by subtracting from one, based on the rules of probability:
$$
P(X \ge k) = 1 - P(X < k)
$$
```{r coinFlips,echo=FALSE,fig.cap="Distribution of numbers of heads (out of 100 flips) across 100,000 simulated runs with the observed value of 70 flips represented by the vertical line.",fig.width=4,fig.height=4,out.height='50%'}
# simulate tossing of 100,000 flips of 100 coins to identify empirical
# probability of 70 or more heads out of 100 flips
# create function to toss coins
tossCoins <- function() {
flips <- runif(100) > 0.5
return(sum(flips))
}
# compute the probability of 69 or fewer heads, when P(heads)=0.5
p_lt_70 <- pbinom(69, 100, 0.5)
# the probability of 70 or more heads is simply the complement of p_lt_70
p_ge_70 <- 1 - p_lt_70
# use a large number of replications since this is fast
coinFlips <- replicate(100000, tossCoins())
p_ge_70_sim <- mean(coinFlips >= 70)
ggplot(data.frame(coinFlips),aes(coinFlips)) +
geom_histogram(binwidth = 1) +
geom_vline(xintercept = 70,color='red',size=1)
```
Using the binomial distribution, the probability of 69 or fewer heads given P(heads)=0.5 is `r I(sprintf("%0.6f", p_lt_70))`, so the probability of 70 or more heads is simply one minus that value (`r I(sprintf("%0.6f", p_ge_70))`).
This computation shows us that the likelihood of getting 70 or more heads if the coin is indeed fair is very small.
Now, what if we didn't have a standard function to tell us the probability of that number of heads? We could instead determine it by simulation -- we repeatedly flip a coin 100 times using a true probability of 0.5, and then compute the distribution of the number of heads across those simulation runs. Figure \@ref(fig:coinFlips) shows the result from this simulation. Here we can see that the probability computed via simulation (`r I(sprintf('%0.6f',p_ge_70_sim))`) is very close to the theoretical probability (`r I(sprintf("%0.6f", p_ge_70))`).
```{r echo=FALSE}
tStat <-
meanDiff / sqrt(sumVariance)
pvalue_tdist <-
pt(tStat, df = welch_df, lower.tail = FALSE)
pvalue_tdist_twotailed <-
pt(tStat, df = welch_df, lower.tail = FALSE) +
pt(-1 * tStat, df = welch_df, lower.tail = TRUE)
```
#### Computing p-values using the *t* distribution {#pvalues-tdist}
Now let's compute a p-value for our BMI example using the *t* distribution. First we compute the *t* statistic using the values from our sample that we calculated above, where we find that `r I(sprintf('t = %0.2f',tStat))`. The question that we then want to ask is: What is the likelihood that we would find a *t* statistic of this size, if the true difference between groups is zero or less (i.e. the directional null hypothesis)?
We can use the *t* distribution to determine this probability. Above we noted that the appropriate degrees of freedom (after correcting for differences in variance and sample size) was `r I(sprintf('t = %0.2f',welch_df))`. We can use a function from our statistical software to determine the probability of finding a value of the *t* statistic greater than or equal to our observed value. We find that `r I(sprintf("p(t > %0.2f, df = %0.2f) = %0.6f", tStat, welch_df, pvalue_tdist))`, which tells us that our observed *t* statistic value of `r I(tStat)` is relatively unlikely if the null hypothesis really is true.
In this case, we used a directional hypothesis, so we only had to look at one end of the null distribution. If we wanted to test a non-directional hypothesis, then we would need to be able to identify how unexpected the size of the effect is, regardless of its direction. In the context of the t-test, this means that we need to know how likely it is that the statistic would be as extreme in either the positive or negative direction. To do this, we multiply the observed *t* value by -1, since the *t* distribution is centered around zero, and then add together the two tail probabilities to get a *two-tailed* p-value: `r I(sprintf("p(t > %0.2f or t< %0.2f, df = %0.2f) = %0.6f", tStat, -1 * tStat, welch_df, pvalue_tdist_twotailed))`. Here we see that the p value for the two-tailed test is twice as large as that for the one-tailed test, which reflects the fact that an extreme value is less surprising since it could have occurred in either direction.
How do you choose whether to use a one-tailed versus a two-tailed test? The two-tailed test is always going to be more conservative, so it's always a good bet to use that one, unless you had a very strong prior reason for using a one-tailed test. In that case, you should have written down the hypothesis before you ever looked at the data. In Chapter \@ref(doing-reproducible-research) we will discuss the idea of pre-registration of hypotheses, which formalizes the idea of writing down your hypotheses before you ever see the actual data. You should *never* make a decision about how to perform a hypothesis test once you have looked at the data, as this can introduce serious bias into the results.
#### Computing p-values using randomization
So far we have seen how we can use the t-distribution to compute the probability of the data under the null hypothesis, but we can also do this using simulation. The basic idea is that we generate simulated data like those that we would expect under the null hypothesis, and then ask how extreme the observed data are in comparison to those simulated data. The key question is: How can we generate data for which the null hypothesis is true? The general answer is that we can randomly rearrange the data in a particular way that makes the data look like they would if the null was really true. This is similar to the idea of bootstrapping, in the sense that it uses our own data to come up with an answer, but it does it in a different way.
#### Randomization: a simple example
Let's start with a simple example. Let's say that we want to compare the mean squatting ability of football players with cross-country runners, with $H_0: \mu_{FB} \le \mu_{XC}$ and $H_A: \mu_{FB} > \mu_{XC}$. We measure the maximum squatting ability of 5 football players and 5 cross-country runners (which we will generate randomly, assuming that $\mu_{FB} = 300$, $\mu_{XC} = 140$, and $\sigma = 30$). The data are shown in Table \@ref(tab:squatPlot).
```{r squatPlot,echo=FALSE,fig.cap="Left: Box plots of simulated squatting ability for football players and cross-country runners.Right: Box plots for subjects assigned to each group after scrambling group labels.",fig.width=8,fig.height=4,out.height='50%'}
# generate simulated data for squatting ability across football players
# and cross country runners
# reset random seed for this example
set.seed(1234)
# create a function to round values to nearest product of 5,
# to keep example simple
roundToNearest5 <- function(x, base = 5) {
return(base * round(x / base))
}
# create and show data frame containing simulated data
squatDf <- tibble(
group = as.factor(c(rep("FB", 5), rep("XC", 5))),
squat = roundToNearest5(c(rnorm(5) * 30 + 300, rnorm(5) * 30 + 140))
)
squatDf <- squatDf %>%
mutate(shuffledSquat = sample(squat))
kable(squatDf, caption='Squatting data for the two groups')
p1 <- ggplot(squatDf,aes(x=group,y=squat)) +
geom_boxplot() +
ylab('max squat (lbs)')
# create a scrambled version of the group membership variable
p2 <- ggplot(squatDf,aes(x=group,y=shuffledSquat)) +
geom_boxplot() +
ylab('max squat (lbs)')
plot_grid(p1, p2)
```
From the plot on the left side of Figure \@ref(fig:squatPlot) it's clear that there is a large difference between the two groups. We can do a standard t-test to test our hypothesis; for this example we will use the `t.test()` command in R, which gives the following result:
```{r echo=FALSE}
# compute and print t statistic comparing two groups
tt <-
t.test(
squat ~ group,
data = squatDf,
alternative = "greater"
)
print(tt)
```
If we look at the p-value reported here, we see that the likelihood of such a difference under the null hypothesis is very small, using the *t* distribution to define the null.
Now let's see how we could answer the same question using randomization. The basic idea is that if the null hypothesis of no difference between groups is true, then it shouldn't matter which group one comes from (football players versus cross-country runners) -- thus, to create data that are like our actual data but also conform to the null hypothesis, we can randomly reorder the data for the individuals in the dataset, and then recompute the difference between the groups. The results of such a shuffle are shown in the column labeled "shuffleSquat" in Table \@ref(tab:squatPlot), and the boxplots of the resulting data are in the right panel of Figure \@ref(fig:squatPlot).
```{r shuffleHist, echo=FALSE,fig.cap="Histogram of t-values for the difference in means between the football and cross-country groups after randomly shuffling group membership. The vertical line denotes the actual difference observed between the two groups, and the dotted line shows the theoretical t distribution for this analysis.",fig.width=4,fig.height=4,out.height='50%'}
# shuffle data 10,000 times and compute distribution of t values
nRuns <- 10000
shuffleAndMeasure <- function(df) {
dfScram <-
df %>%
mutate(
squat = sample(squat)
)
tt <- t.test(
squat ~ group,
data = dfScram,
alternative = "greater",
var.equal = TRUE
)
return(tt$statistic)
}
shuffleDiff <- replicate(nRuns, shuffleAndMeasure(squatDf))
# compute p value using randomization
pvalRandomization <- mean(shuffleDiff >= tt$statistic)
ggplot(data.frame(shuffleDiff),aes(shuffleDiff)) +
geom_histogram(aes(y=..density..),bins=50, color='gray', fill='gray') +
geom_vline(xintercept = tt$statistic,color='red') +
xlab('t values after random shuffling') +
stat_function(fun = dt, args = list(df = 8),n = 50,size=1.5,linetype='dotted')
```
After scrambling the data, we see that the two groups are now much more similar, and in fact the cross-country group now has a slightly higher mean. Now let's do that 10000 times and store the *t* statistic for each iteration; if you are doing this on your own computer, it will take a moment to complete. Figure \@ref(fig:shuffleHist) shows the histogram of the *t* values across all of the random shuffles. As expected under the null hypothesis, this distribution is centered at zero (the mean of the distribution is `r I(sprintf("%0.3f", mean(shuffleDiff)))`). From the figure we can also see that the distribution of *t* values after shuffling roughly follows the theoretical *t* distribution under the null hypothesis (with mean=0), showing that randomization worked to generate null data. We can compute the p-value from the randomized data by measuring how many of the shuffled values are at least as extreme as the observed value: `r I(sprintf('p(t > %0.2f, df = 8) using randomization = %0.5f',tt$statistic, pvalRandomization))`. This p-value is very similar to the p-value that we obtained using the *t* distribution, and both are quite extreme, suggesting that the observed data are very unlikely to have arisen if the null hypothesis is true - and in this case we *know* that it's not true, because we generated the data.
##### Randomization: BMI/activity example
```{r echo=FALSE}
# create function to shuffle BMI data
shuffleBMIstat <- function() {
bmiDataShuffled <-
NHANES_sample %>%
select(BMI, PhysActive) %>%
mutate(
BMI = sample(BMI)
)
# compute the difference
simResult <- t.test(
BMI ~ PhysActive,
data = bmiDataShuffled,
)
return(simResult$statistic)
}
# run function 5000 times and save output
nRuns <- 5000
meanDiffSimDf <-
data.frame(
meanDiffSim = replicate(nRuns, shuffleBMIstat())
)
# compute the empirical probability of t values larger than observed
# value under the randomization null
bmtTTest <-
t.test(
BMI ~ PhysActive,
data = NHANES_sample,
alternative = "greater"
)
bmiPvalRand <-
mean(meanDiffSimDf$meanDiffSim >= bmtTTest$statistic)
```
Now let's use randomization to compute the p-value for the BMI/activity example. In this case, we will randomly shuffle the `PhysActive` variable and compute the difference between groups after each shuffle, and then compare our observed *t* statistic to the distribution of *t* statistics from the shuffled datasets. Figure \@ref(fig:simDiff) shows the distribution of *t* values from the shuffled samples, and we can also compute the probability of finding a value as large or larger than the observed value. The p-value obtained from randomization (`r sprintf("%.6f",bmiPvalRand)`) is very similar to the one obtained using the *t* distribution (`r sprintf("%.6f",bmtTTest$p.value)`). The advantage of the randomization test is that it doesn't require that we assume that the data from each of the groups are normally distributed, though the t-test is generally quite robust to violations of that assumption. In addition, the randomization test can allow us to compute p-values for statistics when we don't have a theoretical distribution like we do for the t-test.
```{r simDiff,echo=FALSE,fig.cap="Histogram of t statistics after shuffling of group labels, with the observed value of the t statistic shown in the vertical line, and values at least as extreme as the observed value shown in lighter gray",fig.width=4,fig.height=4,out.height='50%'}
meanDiffSimDf %>%
ggplot(aes(meanDiffSim)) +
geom_histogram(bins = 200) +
geom_vline(xintercept = bmtTTest$statistic, color = "blue") +
xlab("T stat: BMI difference between groups") +
geom_histogram(
data = meanDiffSimDf %>%
filter(meanDiffSim >= bmtTTest$statistic),
aes(meanDiffSim),
bins = 200,
fill = "gray"
)
```
We do have to make one main assumption when we use the randomization test, which we refer to as *exchangeability*. This means that all of the observations are distributed in the same way, such that we can interchange them without changing the overall distribution. The main place where this can break down is when there are related observations in the data; for example, if we had data from individuals in 4 different families, then we couldn't assume that individuals were exchangeable, because siblings would be closer to each other than they are to individuals from other families. In general, if the data were obtained by random sampling, then the assumption of exchangeability should hold.
### Step 6: Assess the "statistical significance" of the result
The next step is to determine whether the p-value that results from the previous step is small enough that we are willing to reject the null hypothesis and conclude instead that the alternative is true. How much evidence do we require? This is one of the most controversial questions in statistics, in part because it requires a subjective judgment -- there is no "correct" answer.
Historically, the most common answer to this question has been that we should reject the null hypothesis if the p-value is less than 0.05. This comes from the writings of Ronald Fisher, who has been referred to as "the single most important figure in 20th century statistics" [@efron1998]:
> “If P is between .1 and .9 there is certainly no reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 ... it is convenient to draw the line at about the level at which we can say: Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials” [@fisher1925statistical]
However, Fisher never intended $p < 0.05$ to be a fixed rule:
> "no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas" [@fish:1956]
Instead, it is likely that p < .05 became a ritual due to the reliance upon tables of p-values that were used before computing made it easy to compute p values for arbitrary values of a statistic. All of the tables had an entry for 0.05, making it easy to determine whether one's statistic exceeded the value needed to reach that level of significance.
The choice of statistical thresholds remains deeply controversial, and recently (Benjamin et al., 2018) it has been proposed that the default threshold be changed from .05 to .005, making it substantially more stringent and thus more difficult to reject the null hypothesis. In large part this move is due to growing concerns that the evidence obtained from a significant result at $p < .05$ is relatively weak; we will return to this in our later discussion of reproducibility in Chapter \@ref(doing-reproducible-research).
#### Hypothesis testing as decision-making: The Neyman-Pearson approach
Whereas Fisher thought that the p-value could provide evidence regarding a specific hypothesis, the statisticians Jerzy Neyman and Egon Pearson disagreed vehemently. Instead, they proposed that we think of hypothesis testing in terms of its error rate in the long run:
> "no test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not often be wrong" [@Neyman289]
That is: We can’t know which specific decisions are right or wrong, but if we follow the rules, we can at least know how often our decisions will be wrong in the long run.
To understand the decision making framework that Neyman and Pearson developed, we first need to discuss statistical decision making in terms of the kinds of outcomes that can occur. There are two possible states of reality ($H_0$ is true, or $H_0$ is false), and two possible decisions (reject $H_0$, or retain $H_0$). There are two ways in which we can make a correct decision:
- We can reject $H_0$ when it is false (in the language of signal detection theory, we call this a *hit*)
- We can retain $H_0$ when it is true (somewhat confusingly in this context, this is called a *correct rejection*)
There are also two kinds of errors we can make:
- We can reject $H_0$ when it is actually true (we call this a *false alarm*, or *Type I error*)
- We can retain $H_0$ when it is actually false (we call this a *miss*, or *Type II error*)
Neyman and Pearson coined two terms to describe the probability of these two types of errors in the long run:
- P(Type I error) = $\alpha$
- P(Type II error) = $\beta$
That is, if we set $\alpha$ to .05, then in the long run we should make a Type I error 5% of the time. Whereas it's common to set $\alpha$ as .05, the standard value for an acceptable level of $\beta$ is .2 - that is, we are willing to accept that 20% of the time we will fail to detect a true effect when it truly exists. We will return to this later when we discuss statistical power in Section \@ref(statistical-power), which is the complement of Type II error.
### What does a significant result mean?
There is a great deal of confusion about what p-values actually mean (Gigerenzer, 2004). Let's say that we do an experiment comparing the means between conditions, and we find a difference with a p-value of .01. There are a number of possible interpretations that one might entertain.
#### Does it mean that the probability of the null hypothesis being true is .01?
No. Remember that in null hypothesis testing, the p-value is the probability of the data given the null hypothesis ($P(data|H_0)$). It does not warrant conclusions about the probability of the null hypothesis given the data ($P(H_0|data)$). We will return to this question when we discuss Bayesian inference in a later chapter, as Bayes theorem lets us invert the conditional probability in a way that allows us to determine the probability of the hypothesis given the data.
#### Does it mean that the probability that you are making the wrong decision is .01?
No. This would be $P(H_0|data)$, but remember as above that p-values are probabilities of data under $H_0$, not probabilities of hypotheses.
#### Does it mean that if you ran the study again, you would obtain the same result 99% of the time?
No. The p-value is a statement about the likelihood of a particular dataset under the null; it does not allow us to make inferences about the likelihood of future events such as replication.
#### Does it mean that you have found a practically important effect?
No. There is an essential distinction between *statistical significance* and *practical significance*. As an example, let's say that we performed a randomized controlled trial to examine the effect of a particular diet on body weight, and we find a statistically significant effect at p<.05. What this doesn't tell us is how much weight was actually lost, which we refer to as the *effect size* (to be discussed in more detail in Chapter \@ref(ci-effect-size-power)). If we think about a study of weight loss, then we probably don't think that the loss of one ounce (i.e. the weight of a few potato chips) is practically significant. Let's look at our ability to detect a significant difference of 1 ounce as the sample size increases.
```{r echo=FALSE, warning=FALSE, message=FALSE}
# create simulated data for weight loss trial
weightLossTrial <- function(nPerGroup, weightLossOz = 1) {
# mean and SD in Kg based on NHANES adult dataset
kgToOz <- 35.27396195 # conversion constant for Kg to Oz
meanOz <- 81.78 * kgToOz
sdOz <- 21.29 * kgToOz
# create data
controlGroup <- rnorm(nPerGroup) * sdOz + meanOz
expGroup <- rnorm(nPerGroup) * sdOz + meanOz - weightLossOz
ttResult <- t.test(expGroup, controlGroup)
return(c(
nPerGroup, weightLossOz, ttResult$p.value,
diff(ttResult$estimate)
))
}
nRuns <- 1000
sampSizes <- 2**seq(5,17) # powers of 2
simResults <- c() ## create an empty list to add results onto
for (i in 1:length(sampSizes)) {
tmpResults <- replicate(
nRuns,
weightLossTrial(sampSizes[i], weightLossOz = 10)
)
summaryResults <- c(
tmpResults[1, 1], tmpResults[2, 1],
sum(tmpResults[3, ] < 0.05),
mean(tmpResults[4, ])
)
simResults <- rbind(simResults, summaryResults)
}
simResultsDf <-
as.tibble(simResults) %>%
rename(
sampleSize = V1,
effectSizeLbs = V2,
nSigResults = V3,
meanEffect = V4
) %>%
mutate(pSigResult = nSigResults / nRuns)
```
Figure \@ref(fig:sigResults) shows how the proportion of significant results increases as the sample size increases, such that with a very large sample size (about 262,000 total subjects), we will find a significant result in more than 90% of studies when there is a 1 ounce difference in weight loss between the diets. While these are statistically significant, most physicians would not consider a weight loss of one ounce to be practically or clinically significant. We will explore this relationship in more detail when we return to the concept of *statistical power* in Section \@ref(statistical-power), but it should already be clear from this example that statistical significance is not necessarily indicative of practical significance.
```{r sigResults, echo=FALSE,fig.cap="The proportion of signifcant results for a very small change (1 ounce, which is about .001 standard deviations) as a function of sample size.",fig.width=8,fig.height=4,out.height='50%'}
ggplot(simResultsDf,aes(sampleSize,pSigResult)) +
geom_line() +
scale_x_continuous(trans='log2',breaks=simResultsDf$sampleSize) +
theme(axis.text.x = element_text( angle=45,vjust=0.5)) +
ylim(0,1) + ylab('Proportion of significant results') +
xlab('sample size per group') +
theme(panel.grid.major = element_line(colour = "gray",size=0.25)) +
geom_hline(yintercept = 0.05,linetype='dashed')
```
## NHST in a modern context: Multiple testing
So far we have discussed examples where we are interested in testing a single statistical hypothesis, and this is consistent with traditional science which often measured only a few variables at a time. However, in modern science we can often measure millions of variables per individual. For example, in genetic studies that quantify the entire genome, there may be many millions of measures per individual, and in the brain imaging research that my group does, we often collect data from more than 100,000 locations in the brain at once. When standard hypothesis testing is applied in these contexts, bad things can happen unless we take appropriate care.
Let's look at an example to see how this might work. There is great interest in understanding the genetic factors that can predispose individuals to major mental illnesses such as schizophrenia, because we know that about 80% of the variation between individuals in the presence of schizophrenia is due to genetic differences. The Human Genome Project and the ensuing revolution in genome science has provided tools to examine the many ways in which humans differ from one another in their genomes. One approach that has been used in recent years is known as a *genome-wide association study* (GWAS), in which the genome of each individual is characterized at one million or more places to determine which letters of the genetic code they have at each location, focusing on locations where humans tend to differ frequently. After these have been determined, the researchers perform a statistical test at each location in the genome to determine whether people diagnosed with schizoprenia are more or less likely to have one specific version of the genetic sequence at that location.
Let's imagine what would happen if the researchers simply asked whether the test was significant at p<.05 at each location, when in fact there is no true effect at any of the locations. To do this, we generate a large number of simulated *t* values from a null distribution, and ask how many of them are significant at p<.05. Let's do this many times, and each time count up how many of the tests come out as significant (see Figure \@ref(fig:nullSim)).
```{r echo=FALSE}
set.seed(5)
# simulate 1500 studies with 10,000 tests each, thresholded at p < .05
nRuns <- 1000 # number of simulated studies to run
nTests <- 1000000 # number of simulated genes to test in each run
uncAlpha <- 0.05 # alpha level
uncOutcome <- replicate(nRuns, sum(rnorm(nTests) < qnorm(uncAlpha)))
#sprintf("mean proportion of significant tests per run: %0.2f", mean(uncOutcome) / nTests)
# compute proportion of studies with at least one false positive result,
# known as the familywise error rate
#sprintf("familywise error rate: %0.3f", mean(uncOutcome > 0))
# compute Bonferroni-corrected alpha
corAlpha <- 0.05 / nTests
corOutcome <- replicate(nRuns, sum(rnorm(nTests) < (qnorm(corAlpha))))
# sprintf("corrected familywise error rate: %0.3f", mean(corOutcome > 0))
```
```{r nullSim,echo=FALSE,fig.cap="Left: A histogram of the number of significant results in each set of one million statistical tests, when there is in fact no true effect. Right: A histogram of the number of significant results across all simulation runs after applying the Bonferroni correction for multiple tests.",fig.width=8,fig.height=4,out.height='50%'}
p1 <- data.frame(nsig=uncOutcome) %>%
ggplot(aes(nsig)) +
geom_histogram(bins=50) +
xlab(sprintf('Number of significant results (out of %d)',nTests)) +
theme(plot.margin = unit(c(0,1.5,0,0), "cm"))
p2 <- ggplot(data.frame(nsig=corOutcome),aes(nsig)) +
geom_histogram(bins=50) +
xlab(sprintf('Number of significant results (out of %d)',nTests)) +
theme(plot.margin = unit(c(0,1,0,0), "cm"))
plot_grid(p1, p2)
```
This shows that about 5% of all of the tests were significant in each run, meaning that if we were to use p < .05 as our threshold for statistical significance, then even if there were no truly significant relationships present, we would still "find" about 500 genes that were seemingly significant in each study (the expected number of significant results is simply $n * \alpha$). That is because while we controlled for the error per test, we didn't control the error rate across our entire *family* of tests (known as the *familywise error*), which is what we really want to control if we are going to be looking at the results from a large number of tests. Using p<.05, our familywise error rate in the above example is one -- that is, we are pretty much guaranteed to make at least one error in any particular study.
A simple way to control for the familywise error is to divide the alpha level by the number of tests; this is known as the *Bonferroni* correction, named after the Italian statistician Carlo Bonferroni. Using the data from our example above, we see in Figure \@ref(fig:nullSim) that only about 5 percent of studies show any significant results using the corrected alpha level of 0.000005 instead of the nominal level of .05. We have effectively controlled the familywise error, such that the probability of making *any* errors in our study is controlled at right around .05.
## Learning objectives
* Identify the components of a hypothesis test, including the parameter of interest, the null and alternative hypotheses, and the test statistic.
* Describe the proper interpretations of a p-value as well as common misinterpretations
* Distinguish between the two types of error in hypothesis testing, and the factors that determine them.
* Describe how resampling can be used to compute a p-value.
* Describe the problem of multiple testing, and how it can be addressed
* Describe the main criticisms of null hypothesis statistical testing
## Suggested readings
- [Mindless Statistics, by Gerd Gigerenzer](https://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf)