I’m still reading the great Statistical Rethinking book by Richard McElreath, and I’m learning plenty of things. One thing it mentions in passing is that you shouldn’t worry about having non-normal data when fitting a linear regression. Linear regression can be defined as

$$y \sim \mathcal{N}(a_0 + a_1 x_1 + \dots + a_n x_n, \sigma^2)$$

This doesn’t imply that y has to be normally distributed in our data. It only means that y needs to be normally distributed conditional on x.
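An equivalent way to write the same model makes it explicit that only the error term is assumed to be normal:

$$y = a_0 + a_1 x_1 + \dots + a_n x_n + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

Let’s run an experiment to show this.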

We’ll draw x from a beta distribution, which gives x a very skewed distribution, and then generate y by adding some small Gaussian noise to x.
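Written out, the simulated model is

$$x \sim \mathrm{Beta}(2, 5), \qquad y \mid x \sim \mathcal{N}(x, 0.01^2)$$

so y is exactly normal conditional on x by construction, while its marginal distribution inherits the skewness of the beta (the noise is far too small to wash it out).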

# Libraries

library(dplyr)
library(ggplot2)

theme_set(theme_minimal())

# Make the simulation reproducible
set.seed(1234)
n <- 100000

# Create x, noise and y
x <- rbeta(n, 2, 5)
noise_y <- rnorm(n, 0, 0.01)
y <- x + noise_y

# Prepare data and plot y
df <- data.frame(x = x, y = y)

ggplot(df) + 
  geom_density(aes(x = y))

As we can see, y is pretty skewed. A naive statistician might say that, since the data are not normal, we shouldn’t fit a plain linear model, because it supposedly assumes normality of the data. But keep in mind that the linear model only assumes normality of the data conditional on the predictors, so this might be a situation in which a linear model still makes sense.
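If we want a number rather than a plot, a quick moment-based skewness estimate confirms the asymmetry (a base-R sketch; packages such as moments or e1071 offer an equivalent function):

# Sample skewness: third standardised moment (positive = right-skewed)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
skewness(df$y)  # positive, inherited from the Beta(2, 5)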

# Fit linear regression 
model <- lm(y ~ x, data = df)
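A quick sanity check before moving on (exact digits will vary, but with the setup above the fit should recover an intercept near 0, a slope near 1, and a residual standard error near the 0.01 we used for the noise):

# Sanity check: intercept ~ 0, slope ~ 1, residual sd ~ 0.01
coef(model)
sigma(model)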

# Compute residuals
predictions <- as.vector(predict(model, df))
df$error <- df$y - predictions  # same as residuals(model)

# Show residuals
ggplot(df) + 
  geom_density(aes(x = error))

As we can see, the distribution of the error is no longer skewed; it looks like a normal distribution. Let’s check it with a Q-Q plot:

ggplot(df) + 
  stat_qq(aes(sample = error)) + 
  stat_qq_line(aes(sample = error))

Indeed, the residuals follow a normal distribution.
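If we want a formal test on top of the plots, a Shapiro-Wilk check works too; note that shapiro.test() accepts at most 5,000 values, so we test a random subsample:

# Shapiro-Wilk normality test (limited to 5000 observations)
shapiro.test(sample(df$error, 5000))

With truly normal residuals the p-value should usually be well above 0.05, i.e. no evidence against normality.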

To sum up, having a non-normal target doesn’t mean that a linear model is unsuitable. If the residuals of the linear model follow a normal distribution, then the linear model is an appropriate one.