Sample and effect sizes in hypothesis testing

One of the most troubling thing about p-values is the fact that if you gather more data the p-value naturally decreases. So, give me enough data and I will reject any null hypothesis, sort of. It makes sense that with more data I would be able to measure smaller effects. However, the whole effect size - data size coupling is what bothers me the most. Let’s make a quick experiment.

First of all, the libraries we’ll use have to be loaded.

# Libraries

library(dplyr)
library(purrr)
library(ggplot2)

theme_set(theme_minimal())

Let’s create a function that, given a sample size and effect size, returns the p-value of the t-test that compares two normal distributions with variance 1 and difference in means equal to the given effect size.

# Given sample size n and effect, compute the p-value of 
# t-test generating the respective samples
t_test_value <- function(n, effect) {
  t.test(rnorm(n), rnorm(n, effect))$p.value
}

# Surprisingly logarithmic sequences don't exist in base R
lseq <- function(from, to, length.out) {
  # logarithmic spaced sequence
  # blatantly stolen from library("emdbook"), because need only this
  exp(seq(log(from), log(to), length.out = length.out))
}

We can run this function for different sample and effect sizes and see when we would reject assuming a type 1 error of 5%.

# Experiment 

# Create pairs of samples sizes and effects
results_tbl <- expand.grid(
  as.integer(lseq(50, 1e5, 100)),
  5:100 * 0.001
)
names(results_tbl) <- c("n", "effect")

# We make use of the elegant purrr
# to apply the funciton in the grid
results_tbl$p_values <- map2_dbl(
  .x = results_tbl$n,
  .y = results_tbl$effect,
  t_test_value
  )

results_tbl$hypothesis <- if_else(
  results_tbl$p_values < 0.05, 
  'rejects', 
  'accepts'
  )

# Plot
ggplot(results_tbl, aes(x = n, y = effect, fill = hypothesis)) + 
  geom_tile() + 
  scale_x_log10()

We can see that, given enough data, we reject for almost all the effect sizes. And conversely, given an effect size, it’s only a matter of getting enough data to reject the null hypothesis. I find this very troubling.

I even think one can analytically derive the equation of the curves in the effect and sample size space that have constant p-value, which should approximate the line that splits the red and blue dots in the picture, but I’ll leave that to the reader.