Feature selection part 2
A critique of feature importance
In part 1, I showed some of the dangers of using univariate selection methods. In part 2, I want to focus on the pitfalls of feature importance in random forests and gradient boosting methods.
I'm writing about it in the feature selection chapter because feature importances can be used to select features, as in sklearn's SelectFromModel.
In order to select features using feature importances, one can (a minimal sketch follows below):
- Train a model that allows computing feature importances.
- Retrain the model with only the most important features.
In my experience, when retraining with only the most important features, the model usually degrades a little.
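As a minimal sketch of this procedure (the helper name and the cutoff k are arbitrary choices for illustration, not a standard API):
library("randomForest")
# Hypothetical helper: keep the k most important features, then retrain
select_by_importance <- function(X, y, k) {
  full_model <- randomForest(X, y, importance = TRUE)
  imp <- importance(full_model, type = 1)  # permutation importance (%IncMSE)
  top_features <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:k]
  list(features = top_features,
       model = randomForest(X[, top_features, drop = FALSE], y))
}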
The main issue with selecting features using feature importance is that, if a feature is highly correlated with others, its importance will be lower than if it weren't correlated with any of them. For this reason, I advise checking that your features are not strongly correlated before assessing them with feature importance.
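A minimal way to run that check (the 0.8 threshold is an arbitrary choice):
# Flag feature pairs whose absolute correlation exceeds a threshold
high_cor_pairs <- function(X, threshold = 0.8) {
  cm <- abs(cor(X))
  cm[upper.tri(cm, diag = TRUE)] <- 0  # keep each pair only once
  which(cm > threshold, arr.ind = TRUE)
}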
Experiment set-up
In order to show the issues of selecting features using feature importance, we'll use a rather contrived example. In this example, x1, x2, x3 and x4 are independent variables and the dependent variable is
y ~ x1 + (x2 + x3 + x4) * 0.5 + noise
When training a random forest, x1 should appear as the most important variable. If the feature selection method had to keep only one feature, x1 should be the one to select.
To see an example where feature importance might mislead you, we've created some brothers for x1: variables that are very correlated with x1 and that will be used to model y as well. These brothers are what cause the importance of x1 to be diminished.
library("dplyr")
library("randomForest")
library("glmnet")
set.seed(42)
len <- 5000
x1 <- rnorm(len)
x2 <- rnorm(len)
x3 <- rnorm(len)
x4 <- rnorm(len)
# The outcome is created without the brothers
y <- x1 + 0.5 * x2 + 0.5 * x3 + 0.5 * x4 + rnorm(len)
# x1i are x1's "brothers": variables that are mainly x1 but with some noise
x11 <- 0.95 * x1 + 0.05 * rnorm(len)
x12 <- 0.95 * x1 + 0.05 * rnorm(len)
x13 <- 0.95 * x1 + 0.05 * rnorm(len)
x14 <- 0.95 * x1 + 0.05 * rnorm(len)
x15 <- 0.95 * x1 + 0.05 * rnorm(len)
x16 <- 0.95 * x1 + 0.05 * rnorm(len)
x17 <- 0.95 * x1 + 0.05 * rnorm(len)
x18 <- 0.95 * x1 + 0.05 * rnorm(len)
Then we create the feature matrix, with x1 to x4 as well as x1's brothers.
model_tbl <- tibble(
y = y,
x1 = x1,
x2 = x2,
x3 = x3,
x4 = x4,
x11 = x11,
x12 = x12,
x13 = x13,
x14 = x14,
x15 = x15,
x16 = x16,
x17 = x17,
x18 = x18
)
X <- as.matrix(select(model_tbl, -y))
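As a sanity check (not part of the original analysis), we can confirm how strongly the brothers correlate with x1; with this construction, their correlation should be close to 1:
# The brothers are almost perfectly correlated with x1
round(cor(X[, c("x1", "x11", "x12")]), 2)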
Random forest importance
A random forest model is trained; with importance = TRUE, it computes feature importances during training.
rf <- randomForest(X, y, importance = TRUE)
And we show the importance of the features:
varImpPlot(rf, type = 1)
This shows the shortcomings of feature importance: x1 doesn't appear as the most important feature. If we were to select three variables, we would select x2, x3 and x4, and this would of course degrade the model's performance.
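The same point can be made numerically rather than visually; a small sketch that sorts the permutation importances returned by importance():
# Permutation importances (%IncMSE), sorted: x1 is diluted by its brothers
imp <- importance(rf, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]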
Thinking about it, it makes sense. In this random forest, to model the x1 contribution, some splits are done with x1 and some with its brothers. For this reason, when x1 is permuted, the increase in error is not as big as when x2 is permuted.
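A quick way to sanity-check this explanation (an extra experiment, not part of the original set-up) is to retrain the forest without the brothers; with no correlated copies to share its splits, x1 should come out on top:
# Without the brothers, x1 keeps all of its splits
rf_no_brothers <- randomForest(X[, c("x1", "x2", "x3", "x4")], y,
                               importance = TRUE)
varImpPlot(rf_no_brothers, type = 1)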
Lasso selection
On the other hand, the lasso mostly gets it right (recall that cv.glmnet's default, alpha = 1, is the lasso):
# Train lasso
lasso <- cv.glmnet(X, y)
# Mostly recovers the true coefficients
coef(lasso, s = "lambda.min")
## 13 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.01334871
## x1 0.90829549
## x2 0.49467345
## x3 0.49902516
## x4 0.48559226
## x11 0.01570838
## x12 .
## x13 .
## x14 .
## x15 .
## x16 0.02241166
## x17 0.06331782
## x18 .
It selects some of x1's brothers, but with very small coefficients. If we regularize a bit more, they'll probably vanish.
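One way to check this is to look at the coefficients at the stronger lambda.1se penalty instead of lambda.min; the brothers' coefficients should shrink away:
# More regularization: the brothers should vanish
coef(lasso, s = "lambda.1se")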
In fact, the next figure shows that the last feature to vanish is x1, which didn't happen in the random forest:
# x1 is the last one to go
plotmo::plot_glmnet(lasso$glmnet.fit)
Of course the lasso selects the variables better in this case, as the data are generated by a linear model. A case where feature importance might shine over the lasso is when the dependent variable is a non-linear function of the features.
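As a quick, hypothetical illustration of that point (the quadratic form of y_nl is an arbitrary choice): since x1 and x1^2 are uncorrelated for a standard normal variable, the lasso sees almost no linear signal in x1, while the forest can still split on it.
# Non-linear outcome: the lasso should miss x1 while the forest picks it up
y_nl <- x1^2 + 0.5 * x2 + rnorm(len)
X_nl <- cbind(x1 = x1, x2 = x2, x3 = x3, x4 = x4)
coef(cv.glmnet(X_nl, y_nl), s = "lambda.min")
varImpPlot(randomForest(X_nl, y_nl, importance = TRUE), type = 1)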
To sum up, be careful with feature importance when your features are highly correlated.