Debugging scikit-learn pipelines
Introduction
The idea of this post is to show how we can alter the behaviour of scikit-learn pipelines to make them easier to debug. For this purpose, we'll:
- Create a pipeline with a bug in it.
- Build a tool to debug such a pipeline.
- Fix the error and see what the debugging tool offers us.
We'll do this with the arrests data from the scikit-lego library and a very simple pipeline.
# Main tools
import numpy as np
import pandas as pd
import seaborn as sns
# Data
from sklego.datasets import load_arrests
# Scikit-learn classes
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
# Non-scikit learn classes
from category_encoders import OneHotEncoder
# fastcore helpers
from fastcore.dispatch import typedispatch
Also, let's load the data. The dataset contains data on arrests and is commonly used to study bias in ML. We use it here to show how to debug a pipeline, but any other dataset would work just as well. We can see that there are both numeric and categorical columns. We're going to try to predict whether an arrest ended in the person being released or not.
df_arrests = load_arrests(as_frame=True)
df_arrests.head()
Let's do some very basic preprocessing: split the dataset between X and y:
df_arrests_x = df_arrests.drop(columns='released')
df_arrests_y = df_arrests['released'].map({'Yes': 0, 'No': 1})
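If we also wanted a held-out test set, the train_test_split imported above could be used. A minimal sketch with made-up toy data (the column names are illustrative, not the arrests schema):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target, just to show the split mechanics
X = pd.DataFrame({"age": [21, 35, 44, 50], "checks": [0, 1, 2, 3]})
y = pd.Series([0, 1, 0, 1])

# Hold out 25% of the rows as a test set, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
```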
We want to train an ML model to predict released. For that purpose, we're going to use a scikit-learn pipeline that scales the data, imputes missing values and trains a logistic regression. Simple enough, it should work:
# Define pipeline
steps_list = [
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer()),
    ('learner', LogisticRegression())
]
pipe = Pipeline(steps_list)
We try to train the model, but unfortunately we get an error, and one that is hard to read:
ValueError: could not convert string to float: 'White'
We don't even know which step of the pipeline failed, and this makes it hard to debug:
pipe.fit(df_arrests_x, df_arrests_y)
To be able to debug this error, we are going to build a class derived from Pipeline. As a first step, we'll give DebugPipeline a debug_fit method that reproduces the same error as the standard fit. The idea is to mimic fit: the fit of a Pipeline runs fit_transform on every transformer and then fit on the final learner:
class DebugPipeline(Pipeline):
    def debug_fit(self, X, y):
        # Run every transformer in order, passing the output to the next step
        for transformer_name, transformer in self.steps[:-1]:
            X = transformer.fit_transform(X)
        # The last step is the learner: fit it on the transformed data
        return self.steps[-1][1].fit(X, y)
pipe = DebugPipeline(steps_list)
Indeed, we get the same error:
pipe.debug_fit(df_arrests_x, df_arrests_y)
Now we have a debug_fit method that we can control more easily than the fit method. The idea is to log several properties of the input data at each step, in order to understand where the pipeline is failing and why it fails at that step.
As a bonus trick, we'll use typedispatch from fastcore to log the dtypes of the input data, dispatching on whether it is a DataFrame or an ndarray.
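typedispatch selects an implementation based on the annotated argument type. The same idea exists in the standard library as functools.singledispatch; here is a minimal sketch of the mechanism (the names are illustrative, and singledispatch is used so the snippet runs without fastcore):

```python
from functools import singledispatch

@singledispatch
def log_kind(x):
    # Fallback for any type without a registered implementation
    return "unknown"

@log_kind.register
def _(x: list):
    # Picked automatically when x is a list
    return f"list of {len(x)} items"

@log_kind.register
def _(x: dict):
    # Picked automatically when x is a dict
    return f"dict with keys {sorted(x)}"

print(log_kind([1, 2, 3]))         # list of 3 items
print(log_kind({"b": 1, "a": 2}))  # dict with keys ['a', 'b']
```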
Let's create the logging functions:
@typedispatch
def log_dtypes(df: pd.DataFrame):
    types_dict = dict(df.dtypes.items())
    dtypes_str = ", ".join(
        "({}, {})".format(name, dtype) for name, dtype in types_dict.items()
    )
    return f"types=[{dtypes_str}]"

@typedispatch
def log_dtypes(df: np.ndarray):
    types_dict = {i: col.dtype for i, col in enumerate(df.T)}
    dtypes_str = ", ".join(
        "({}, {})".format(name, dtype) for name, dtype in types_dict.items()
    )
    return f"types=[{dtypes_str}]"

def log_shape(df):
    return f"n_obs={df.shape[0]} n_col={df.shape[1]}"
Let's use these functions before each fit_transform call:
class DebugPipeline(Pipeline):
    def debug_fit(self, X, y):
        for transformer_name, transformer in self.steps[:-1]:
            # Log the shape and dtypes of the data entering this step
            print(log_shape(X))
            print(log_dtypes(X))
            print(f"{transformer_name} to be applied")
            X = transformer.fit_transform(X)
        return self.steps[-1][1].fit(X, y)
And now let's use debug_fit. We can see that the pipeline fails at the scaler step, and that the input data at this step are of type object. It looks like we need to transform the categorical variables to numeric, since the standard scaler cannot handle categories.
pipe = DebugPipeline(steps_list)
pipe.debug_fit(df_arrests_x, df_arrests_y)
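To see in isolation why the scaler chokes on object columns, here is a minimal reproduction of the same ValueError on a toy array (not the arrests data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A tiny object-dtype array, mimicking the categorical columns in the dataset
X_str = np.array([["White"], ["Black"]], dtype=object)

try:
    StandardScaler().fit_transform(X_str)
    error_message = None
except ValueError as exc:
    # StandardScaler tries to cast the input to float and fails on strings
    error_message = str(exc)

print(error_message)
```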
Let's solve this issue by adding one-hot encoding as the first step of the pipeline:
steps_list = [
    ('one-hot', OneHotEncoder()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer()),
    ('learner', LogisticRegression())
]
pipe = DebugPipeline(steps_list)
pipe.debug_fit(df_arrests_x, df_arrests_y)
And now, for free, we get to see the columns of the data at each step of the pipeline. We can better understand the features our model is using and get a more intuitive picture of our pipeline.
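To recap, here is a self-contained sketch of the whole idea on a tiny synthetic numeric dataset (not the arrests data), showing debug_fit running end to end:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class DebugPipeline(Pipeline):
    def debug_fit(self, X, y):
        for name, transformer in self.steps[:-1]:
            # Log shape before each transformer so a failure points at the step
            print(f"n_obs={X.shape[0]} n_col={X.shape[1]} -> {name} to be applied")
            X = transformer.fit_transform(X)
        return self.steps[-1][1].fit(X, y)

# Toy numeric data with one missing value, as a stand-in for a real dataset
X = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [0.5, 1.5, 2.5, 3.5]})
y = pd.Series([0, 1, 0, 1])

pipe = DebugPipeline([
    ("scaler", StandardScaler()),
    ("imputer", SimpleImputer()),
    ("learner", LogisticRegression()),
])
model = pipe.debug_fit(X, y)
```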