slider package

slider is an R package for performing sliding window calculations. In this post we’re going to see how slider can be used for feature engineering in sales forecasting problems.

Problem formulation

You are working on a demand prediction problem. You have sales data with the following structure:

##   shop product       date units_sold
## 1    A  iogurt 2020-02-01         11
## 2    A  iogurt 2020-02-02         15
## 3    A  iogurt 2020-02-03         14
## 4    B  iogurt 2020-02-01         25
## 5    B  iogurt 2020-02-02         33
## 6    B  iogurt 2020-02-03         33

And you want to predict, for each shop and each product, the units that will be sold during the following day.

To do so, you’ll create features like:

The average units of the product sold in that shop during the last week.
The average units of the product sold across all shops during the last week.
The average units sold by the shop during the last week.
Similar features using last month’s data instead of last week’s, or even the whole sales history that we have.
The same aggregates computed with the median, maximum or minimum units sold instead of the mean.

I think slider is one of the simplest ways of doing this.
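To follow along, the sample printed above can be rebuilt as a small data frame. This is only a sketch: the name sales_tbl matches the one used in the rest of the post, but the column types (in particular date as a Date) are an assumption, and a real data set would have many more rows and products.

```r
# Toy version of the sales data printed above (two shops, one product, three days)
sales_tbl <- data.frame(
  shop       = rep(c("A", "B"), each = 3),
  product    = "iogurt",
  date       = rep(as.Date("2020-02-01") + 0:2, times = 2),
  units_sold = c(11, 15, 14, 25, 33, 33)
)

sales_tbl
```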

Introduction to slider

slider has one main function, slide(), and several variants of it. According to the documentation,

slide() iterates through .x using a sliding window, applying .f to each sub-window of .x.

The sub-window of .x is highly customizable. The parameters to customize the sub-window are mainly .before, .after, .step and .complete.
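As a rough illustration of these parameters, here is a sketch on a plain numeric vector (x is made up for the example, it is not the sales data). It uses slide_vec(), the variant of slide() that returns a vector instead of a list.

```r
library(slider)

x <- c(11, 15, 14, 20, 18)

# Window = current element plus the one before it
before_1 <- slide_vec(x, sum, .before = 1)
before_1
# 11 26 29 34 38

# Same window, but only complete windows are evaluated (the first result is NA)
complete_1 <- slide_vec(x, sum, .before = 1, .complete = TRUE)
complete_1
# NA 26 29 34 38

# Window = current element plus the one after it
after_1 <- slide_vec(x, sum, .after = 1)
after_1
# 26 29 34 38 18

# Same window as the first call, but evaluated only on every 2nd element
step_2 <- slide_vec(x, sum, .before = 1, .step = 2)
step_2
# 11 NA 29 NA 38
```

Note that .step changes where the function is evaluated, not which elements fall inside each window.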

Let’s see it with some examples. Compute the overall sales in shop A up to today (this is not a feature we want to train our model on, just a way to see how slide behaves):

library(dplyr)  # %>%, filter(), mutate(), group_by()
library(slider) # slide(), slide_vec()

# Cumulative sold items until today
sales_tbl %>% 
  filter(shop == 'A') %>% 
  mutate(
    # (slightly technical warning) We're going to be using 
    # slide_vec instead of slide, they're basically the same, 
    # but slide_vec returns a vector, whereas slide returns a list
    sum_sold_wrong = slide_vec(.x = units_sold, .f = sum, .before = Inf)
  )

##   shop product       date units_sold sum_sold_wrong
## 1    A  iogurt 2020-02-01         11             11
## 2    A  iogurt 2020-02-02         15             26
## 3    A  iogurt 2020-02-03         14             40

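The .before parameter indicates how many elements before the current one are included in the window. Since this data has one row per day, that is the number of days we go back to aggregate the sold units. If we set it to Inf, the window covers all previous rows, so we get the units sold up to and including today.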

The issue is that this sum_sold_wrong has the units that have been sold including today. If we want to exclude today’s data, which makes sense as we want to predict without today’s information, slider has this nice trick (setting .after = -1):

# Here we are excluding today!
sales_tbl %>% 
  filter(shop == 'A') %>% 
  mutate(
    sum_sold_right = slide_vec(.x = units_sold, .f = sum, .before = Inf, .after = -1)
  )

##   shop product       date units_sold sum_sold_right
## 1    A  iogurt 2020-02-01         11              0
## 2    A  iogurt 2020-02-02         15             11
## 3    A  iogurt 2020-02-03         14             26

Setting .after to -1 looks a bit obscure, but it is used a lot when forecasting with slider. A negative .after ends the window before the current row, which is important since we don’t want to leak information from the future into our pipeline.
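One thing to keep in mind with a negative .after is that the first row has no past data, so its window is empty. That is why the sum above is 0 for the first day. Below is a minimal sketch of what this means for other summary functions; safe_mean is a hypothetical helper introduced here, not part of slider.

```r
library(slider)

units_sold <- c(11, 15, 14)

# Inspect the windows themselves: the first one is empty
windows <- slide(units_sold, identity, .before = Inf, .after = -1)
lengths(windows)
# 0 1 2

# mean() of an empty vector is NaN (and max() would return -Inf with a warning),
# so a small wrapper that returns NA for empty windows can be handy
safe_mean <- function(x) if (length(x) == 0) NA_real_ else mean(x)

mean_past <- slide_vec(units_sold, safe_mean, .before = Inf, .after = -1)
mean_past
# NA 11 13
```

Whether NA, 0 or some other default is the right value for a row with no history depends on the model, so it is worth deciding explicitly.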

Feature engineering with slider

Let’s say we want to compute features at shop level:

Mean of units sold during the last week for each shop.
Mean of units sold during the last month for each shop.
Mean of units sold over-all for each shop.
Max of units sold during the last week for each shop.
Max of units sold during the last month for each shop.
Max of units sold over-all for each shop.

What I like about slider is that explaining the features takes more time than coding them:

sales_tbl <- sales_tbl %>% 
  group_by(shop) %>% # Shop-level features
  mutate(
    # Mean of units sold during the last week
    mean_sold_shop_week = slide_vec(.x = units_sold, .f = mean, .before = 7, .after = -1),
    # Mean of units sold during the last month
    mean_sold_shop_month = slide_vec(.x = units_sold, .f = mean, .before = 30, .after = -1),
    # Mean of units sold over-all
    mean_sold_shop = slide_vec(.x = units_sold, .f = mean, .before = Inf, .after = -1),
    # Max of units sold during the last week
    max_sold_shop_week = slide_vec(.x = units_sold, .f = max, .before = 7, .after = -1),
    # Max of units sold during the last month
    max_sold_shop_month = slide_vec(.x = units_sold, .f = max, .before = 30, .after = -1),
    # Max of units sold over-all
    max_sold_shop = slide_vec(.x = units_sold, .f = max, .before = Inf, .after = -1)
  )

If we want to do the same at product level, we only have to change the grouping variable (and variable names):

sales_tbl <- sales_tbl %>% 
  group_by(product) %>% # Product-level features
  mutate(
    # Mean of units sold during the last week
    mean_sold_product_week = slide_vec(.x = units_sold, .f = mean, .before = 7, .after = -1),
    # Mean of units sold during the last month
    mean_sold_product_month = slide_vec(.x = units_sold, .f = mean, .before = 30, .after = -1),
    # Mean of units sold over-all
    mean_sold_product = slide_vec(.x = units_sold, .f = mean, .before = Inf, .after = -1),
    # Max of units sold during the last week
    max_sold_product_week = slide_vec(.x = units_sold, .f = max, .before = 7, .after = -1),
    # Max of units sold during the last month
    max_sold_product_month = slide_vec(.x = units_sold, .f = max, .before = 30, .after = -1),
    # Max of units sold over-all
    max_sold_product = slide_vec(.x = units_sold, .f = max, .before = Inf, .after = -1)
  )

The same goes for features at shop-and-product level:

sales_tbl <- sales_tbl %>% 
  group_by(shop, product) %>% # Shop-and-product-level features
  mutate(
    # Mean of units sold during the last week
    mean_sold_sh_product_week = slide_vec(.x = units_sold, .f = mean, .before = 7, .after = -1),
    # Mean of units sold during the last month
    mean_sold_sh_product_month = slide_vec(.x = units_sold, .f = mean, .before = 30, .after = -1),
    # Mean of units sold over-all
    mean_sold_sh_product = slide_vec(.x = units_sold, .f = mean, .before = Inf, .after = -1),
    # Max of units sold during the last week
    max_sold_sh_product_week = slide_vec(.x = units_sold, .f = max, .before = 7, .after = -1),
    # Max of units sold during the last month
    max_sold_sh_product_month = slide_vec(.x = units_sold, .f = max, .before = 30, .after = -1),
    # Max of units sold over-all
    max_sold_sh_product = slide_vec(.x = units_sold, .f = max, .before = Inf, .after = -1)
  )
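Note that slide_vec() counts rows, not days, so the calls above only cover exactly one week or one month when each group has one row per day with no gaps. When that assumption does not hold (missing days, or several rows per day in a group), slider also provides slide_index_vec(), which defines the window relative to an index such as the date. Here is a minimal sketch on made-up data with a gap in the dates; toy_tbl and safe_mean are hypothetical names used only for this example.

```r
library(dplyr)
library(slider)

# Made-up sales for one shop, with no rows between Feb 2 and Feb 10
toy_tbl <- data.frame(
  shop       = "A",
  date       = as.Date(c("2020-02-01", "2020-02-02", "2020-02-10", "2020-02-11")),
  units_sold = c(11, 15, 14, 20)
)

# Return NA instead of NaN when a window has no rows
safe_mean <- function(x) if (length(x) == 0) NA_real_ else mean(x)

toy_tbl <- toy_tbl %>%
  arrange(shop, date) %>% # the index must be in ascending order within each group
  group_by(shop) %>%
  mutate(
    # Previous 7 rows, whatever their dates are
    mean_prev_7_rows = slide_vec(units_sold, safe_mean, .before = 7, .after = -1),
    # Rows dated between 7 days ago and yesterday
    mean_prev_7_days = slide_index_vec(units_sold, date, safe_mean, .before = 7, .after = -1)
  ) %>%
  ungroup()

toy_tbl
```

For February 10th, the row-based window still averages the sales from February 1st and 2nd, whereas the date-based window contains no rows because there is no data for the previous seven days.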

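If we want to take the day of the week into consideration, the .step parameter looks tempting, but it does not do what we need: .step = 7 only evaluates the function on every 7th row (the other rows get NA), while the window itself still contains consecutive rows. A simple alternative is to add the weekday to the grouping, so that each window only contains rows for the same weekday. Assuming the rows are sorted by date within each group, the following call to slide_vec computes the mean of units sold on the last 4 occurrences of that weekday, for each shop and product:

sales_tbl <- sales_tbl %>% 
  # format(..., "%u") gives the ISO weekday number (1 = Monday, ..., 7 = Sunday)
  group_by(shop, product, weekday = format(as.Date(date), "%u")) %>% 
  mutate(
    day_of_week_effect = slide_vec(.x = units_sold, .f = mean, .before = 4, .after = -1)
  ) %>% 
  ungroup()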

With very few lines of code we’ve managed to build features that are likely to be predictive of our outcome. Moreover, we are not leaking information from the future. A supervised learning model can be trained on these features, which quickly gives us a decent baseline to start iterating on.

Why slider?

For some of these quantities, using slider may seem like overkill. For instance, the overall mean of units sold in a given shop can be computed in two different ways:

# Slider way
sales_tbl <- sales_tbl %>% 
  group_by(shop) %>% 
  mutate(
    mean_sold_shop = slide_vec(.x = units_sold, .f = mean, .before = Inf, .after = -1)
  )

# Simple way
sales_tbl <- sales_tbl %>% 
  group_by(shop) %>% 
  mutate(
    mean_sold_shop = mean(units_sold)
  )

Why would I rather use the slider way? The reason is that the simple way leaks information from the future. That is, it uses the target of a row to create a feature, and then we are going to use that feature to predict the target. We might end up over-trusting the mean_sold_shop feature. This might have two consequences:

If we do it right, computing mean_sold_shop from the training set only, the model might perform worse on the test set than on the training set. This is not ideal, but we can live with it.
If we do it wrong, computing mean_sold_shop with the test set included, the degradation will only show up in production, which is a bigger problem.

With slider, you don’t have to worry about any of the above, since you are only using information from the past.
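A small experiment makes the leak visible. In the sketch below (made-up numbers; safe_mean is a hypothetical helper, not part of slider) we change only the last day's sales and check which feature values move.

```r
library(slider)

# Return NA instead of NaN when a window has no rows
safe_mean <- function(x) if (length(x) == 0) NA_real_ else mean(x)

units_a <- c(11, 15, 14)
units_b <- c(11, 15, 140) # same history, different last day

# Simple way: one global mean repeated on every row
simple_a <- rep(mean(units_a), length(units_a))
simple_b <- rep(mean(units_b), length(units_b))

# Slider way: mean of strictly earlier rows
slider_a <- slide_vec(units_a, safe_mean, .before = Inf, .after = -1)
slider_b <- slide_vec(units_b, safe_mean, .before = Inf, .after = -1)

# The feature for day 1 changes with the simple way (it has seen day 3)...
simple_a[1] == simple_b[1]
# FALSE

# ...but no slider feature changes, because none of them uses day 3
identical(slider_a, slider_b)
# TRUE
```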