Introduction

The idea of this post is to show how we can alter the behaviour of scikit-learn pipelines so that they become easier to debug. For this purpose, we'll:

  • Create a pipeline with a bug in it.
  • Create a tool to debug such a pipeline.
  • Solve the error and see what the debugging tool offers us.

We'll do this with the arrests data from the scikit-lego library and a very simple pipeline.

Data preparation

First of all, let's load the packages that we are going to use:

# Main tools
import numpy as np
import pandas as pd
import seaborn as sns

# Data
from sklego.datasets import load_arrests

# Scikit-learn classes
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Non-scikit-learn classes
from category_encoders import OneHotEncoder


# fastcore helpers
from fastcore.dispatch import typedispatch

Also, let's load the data. The data are on US arrests and are usually used to study ML bias. We use this dataset to show how to debug a pipeline, but any other dataset would be equally valid. We can see that there are both numeric and categorical columns. We're going to try to predict whether an arrested person was released or not.

df_arrests = load_arrests(as_frame=True)
df_arrests.head()
  released colour  year  age     sex employed citizen  checks
0      Yes  White  2002   21    Male      Yes     Yes       3
1       No  Black  1999   17    Male      Yes     Yes       3
2      Yes  White  2000   24    Male      Yes     Yes       3
3       No  Black  2000   46    Male      Yes     Yes       1
4      Yes  Black  1999   27  Female      Yes     Yes       1

Let's do some very basic preprocessing: split the dataset between X and y:

df_arrests_x = df_arrests.drop(columns='released')
df_arrests_y = df_arrests['released'].map({'Yes': 0, 'No': 1})

We want to train an ML model to predict released. For that purpose, we're going to use a scikit-learn pipeline, in which we scale the data, impute missing values and train a logistic regression. Simple enough, it should work:

# Define pipeline
steps_list = [
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer()),
    ('learner', LogisticRegression())
]
pipe = Pipeline(steps_list)

We try to train the model, but unfortunately we get an error, which is (kind of) hard to read:

ValueError: could not convert string to float: 'White'

We don't even know which step of the pipeline failed, and this makes it hard to debug:

pipe.fit(df_arrests_x, df_arrests_y)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-3ef09ec7e14e> in <module>
----> 1 pipe.fit(df_arrests_x, df_arrests_y)

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    328         """
    329         fit_params_steps = self._check_fit_params(**fit_params)
--> 330         Xt = self._fit(X, y, **fit_params_steps)
    331         with _print_elapsed_time('Pipeline',
    332                                  self._log_message(len(self.steps) - 1)):

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    294                 message_clsname='Pipeline',
    295                 message=self._log_message(step_idx),
--> 296                 **fit_params_steps[name])
    297             # Replace the transformer of the step with the fitted
    298             # transformer. This is necessary when loading the transformer

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    738     with _print_elapsed_time(message_clsname, message):
    739         if hasattr(transformer, 'fit_transform'):
--> 740             res = transformer.fit_transform(X, y, **fit_params)
    741         else:
    742             res = transformer.fit(X, y, **fit_params).transform(X)

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    691         else:
    692             # fit method of arity 2 (supervised transformation)
--> 693             return self.fit(X, y, **fit_params).transform(X)
    694 
    695 

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
    665         # Reset internal state before fitting
    666         self._reset()
--> 667         return self.partial_fit(X, y)
    668 
    669     def partial_fit(self, X, y=None):

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y)
    696         X = self._validate_data(X, accept_sparse=('csr', 'csc'),
    697                                 estimator=self, dtype=FLOAT_DTYPES,
--> 698                                 force_all_finite='allow-nan')
    699 
    700         # Even in the case of `with_mean=False`, we update the mean anyway

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    418                     f"requires y to be passed, but the target y is None."
    419                 )
--> 420             X = check_array(X, **check_params)
    421             out = X
    422         else:

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75 

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    597                     array = array.astype(dtype, casting="unsafe", copy=False)
    598                 else:
--> 599                     array = np.asarray(array, order=order, dtype=dtype)
    600             except ComplexWarning:
    601                 raise ValueError("Complex data not supported\n"

~/.local/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81 
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84 
     85 

~/.local/lib/python3.7/site-packages/pandas/core/generic.py in __array__(self, dtype)
   1776 
   1777     def __array__(self, dtype=None) -> np.ndarray:
-> 1778         return np.asarray(self._values, dtype=dtype)
   1779 
   1780     def __array_wrap__(self, result, context=None):

~/.local/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81 
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84 
     85 

ValueError: could not convert string to float: 'White'

To be able to debug this error, we are going to build a class derived from Pipeline. As a first step, we'll build a DebugPipeline with a debug_fit method that reproduces the same error as the standard fit. The idea is to mimic what the fit of a Pipeline does: run fit_transform for every transformer and then run fit with the learner:

class DebugPipeline(Pipeline):

    def debug_fit(self, X, y):
        for transformer_name, transformer in self.steps[:-1]:
            X = transformer.fit_transform(X)
        return self.steps[-1][1].fit(X, y)

pipe = DebugPipeline(steps_list)

Indeed, we get the same error:

pipe.debug_fit(df_arrests_x, df_arrests_y)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-04ff0f168997> in <module>
----> 1 pipe.debug_fit(df_arrests_x, df_arrests_y)

<ipython-input-6-c74b872fa0ed> in debug_fit(self, X, y)
      3     def debug_fit(self, X, y):
      4         for transformer_name, transformer in self.steps[:-1]:
----> 5             X = transformer.fit_transform(X)
      6         return self.steps[-1][1].fit(X, y)

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    688         if y is None:
    689             # fit method of arity 1 (unsupervised transformation)
--> 690             return self.fit(X, **fit_params).transform(X)
    691         else:
    692             # fit method of arity 2 (supervised transformation)

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
    665         # Reset internal state before fitting
    666         self._reset()
--> 667         return self.partial_fit(X, y)
    668 
    669     def partial_fit(self, X, y=None):

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y)
    696         X = self._validate_data(X, accept_sparse=('csr', 'csc'),
    697                                 estimator=self, dtype=FLOAT_DTYPES,
--> 698                                 force_all_finite='allow-nan')
    699 
    700         # Even in the case of `with_mean=False`, we update the mean anyway

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    418                     f"requires y to be passed, but the target y is None."
    419                 )
--> 420             X = check_array(X, **check_params)
    421             out = X
    422         else:

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75 

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    597                     array = array.astype(dtype, casting="unsafe", copy=False)
    598                 else:
--> 599                     array = np.asarray(array, order=order, dtype=dtype)
    600             except ComplexWarning:
    601                 raise ValueError("Complex data not supported\n"

~/.local/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81 
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84 
     85 

~/.local/lib/python3.7/site-packages/pandas/core/generic.py in __array__(self, dtype)
   1776 
   1777     def __array__(self, dtype=None) -> np.ndarray:
-> 1778         return np.asarray(self._values, dtype=dtype)
   1779 
   1780     def __array_wrap__(self, result, context=None):

~/.local/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81 
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84 
     85 

ValueError: could not convert string to float: 'White'

Now we have a debug_fit method that we can control more easily than the fit method. The idea is to log several properties of our input data at each step in order to understand where the pipeline is failing and why it fails at that step.

As a bonus trick, we'll use typedispatch from fastcore so that the dtype logging adapts to the type of the input data (pandas DataFrame or numpy array).

Let's create the logging functions:

@typedispatch
def log_dtypes(df: pd.DataFrame):
    # DataFrame version: report each column name together with its dtype
    types_dict = dict(df.dtypes.items())
    dtypes_str = ", ".join(
        "({}, {})".format(name, dtype) for name, dtype in types_dict.items()
    )

    return f"types=[{dtypes_str}]"


@typedispatch
def log_dtypes(df: np.ndarray):
    # ndarray version: columns have no names, so report positional indices
    types_dict = {i: col.dtype for i, col in enumerate(df.T)}
    dtypes_str = ", ".join(
        "({}, {})".format(name, dtype) for name, dtype in types_dict.items()
    )

    return f"types=[{dtypes_str}]"


def log_shape(df):
    # Works for both DataFrames and 2-D arrays
    return f"n_obs={df.shape[0]} n_col={df.shape[1]}"
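
As a quick sanity check (a toy example, not part of the original pipeline), we can confirm that typedispatch picks the right log_dtypes implementation depending on whether it receives a DataFrame or a numpy array:

# Hypothetical toy data, just to exercise the dispatch
toy_df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(log_dtypes(toy_df))         # DataFrame version: types=[(a, int64), (b, object)]
print(log_dtypes(toy_df.values))  # ndarray version: columns indexed by position
print(log_shape(toy_df))          # n_obs=2 n_col=2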

Let's call these functions right before each fit_transform:

class DebugPipeline(Pipeline):

    def debug_fit(self, X, y):
        for transformer_name, transformer in self.steps[:-1]:
            print(log_shape(X))
            print(log_dtypes(X))
            print(f"{transformer_name} to be applied")
            X = transformer.fit_transform(X)
        return self.steps[-1][1].fit(X, y)

And let's use debug_fit. We can see that the pipeline fails in the scaler step, and that several input columns at that step are of type object. It looks like we need to transform the categorical variables to numeric, as the standard scaler cannot handle categories.

pipe = DebugPipeline(steps_list)
pipe.debug_fit(df_arrests_x, df_arrests_y)
n_obs=5226 n_col=7
types=[(colour, object), (year, int64), (age, int64), (sex, object), (employed, object), (citizen, object), (checks, int64)]
scaler to be applied
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-2734d139122f> in <module>
      1 pipe = DebugPipeline(steps_list)
----> 2 pipe.debug_fit(df_arrests_x, df_arrests_y)

<ipython-input-10-65fdaf99a313> in debug_fit(self, X, y)
      6             print(log_dtypes(X))
      7             print(f"{transformer_name} to be applied")
----> 8             X = transformer.fit_transform(X)
      9         return self.steps[-1][1].fit(X, y)

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    688         if y is None:
    689             # fit method of arity 1 (unsupervised transformation)
--> 690             return self.fit(X, **fit_params).transform(X)
    691         else:
    692             # fit method of arity 2 (supervised transformation)

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
    665         # Reset internal state before fitting
    666         self._reset()
--> 667         return self.partial_fit(X, y)
    668 
    669     def partial_fit(self, X, y=None):

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y)
    696         X = self._validate_data(X, accept_sparse=('csr', 'csc'),
    697                                 estimator=self, dtype=FLOAT_DTYPES,
--> 698                                 force_all_finite='allow-nan')
    699 
    700         # Even in the case of `with_mean=False`, we update the mean anyway

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    418                     f"requires y to be passed, but the target y is None."
    419                 )
--> 420             X = check_array(X, **check_params)
    421             out = X
    422         else:

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75 

~/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    597                     array = array.astype(dtype, casting="unsafe", copy=False)
    598                 else:
--> 599                     array = np.asarray(array, order=order, dtype=dtype)
    600             except ComplexWarning:
    601                 raise ValueError("Complex data not supported\n"

~/.local/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81 
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84 
     85 

~/.local/lib/python3.7/site-packages/pandas/core/generic.py in __array__(self, dtype)
   1776 
   1777     def __array__(self, dtype=None) -> np.ndarray:
-> 1778         return np.asarray(self._values, dtype=dtype)
   1779 
   1780     def __array_wrap__(self, result, context=None):

~/.local/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81 
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84 
     85 

ValueError: could not convert string to float: 'White'

Let's solve this issue by adding a one-hot encoding step at the beginning of the pipeline:

steps_list = [
    ('one-hot', OneHotEncoder()),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer()),
    ('learner', LogisticRegression())
]
pipe = DebugPipeline(steps_list)
pipe.debug_fit(df_arrests_x, df_arrests_y)
n_obs=5226 n_col=7
types=[(colour, object), (year, int64), (age, int64), (sex, object), (employed, object), (citizen, object), (checks, int64)]
one-hot to be applied
n_obs=5226 n_col=11
types=[(colour_1, int64), (colour_2, int64), (year, int64), (age, int64), (sex_1, int64), (sex_2, int64), (employed_1, int64), (employed_2, int64), (citizen_1, int64), (citizen_2, int64), (checks, int64)]
scaler to be applied
n_obs=5226 n_col=11
types=[(0, float64), (1, float64), (2, float64), (3, float64), (4, float64), (5, float64), (6, float64), (7, float64), (8, float64), (9, float64), (10, float64)]
imputer to be applied
/Users/davidmasip/opt/anaconda3/envs/sk-experiments/lib/python3.7/site-packages/category_encoders/utils.py:21: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  elif pd.api.types.is_categorical(cols):
LogisticRegression()

And now, for free, we get to see the columns flowing through the pipeline at each step. We can better understand the features our model is using and get a more intuitive picture of our pipeline.
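
Since the log tells us the column names produced by the one-hot step, we can also pair them with the coefficients of the fitted learner. This is a small illustrative sketch (the feature names below are simply copied from the log above, and pipe is the DebugPipeline fitted with debug_fit):

feature_names = [
    "colour_1", "colour_2", "year", "age", "sex_1", "sex_2",
    "employed_1", "employed_2", "citizen_1", "citizen_2", "checks",
]
learner = pipe.steps[-1][1]  # the LogisticRegression fitted by debug_fit
for name, coef in zip(feature_names, learner.coef_[0]):
    print(f"{name}: {coef:.3f}")

Keep in mind that these coefficients refer to the scaled features, since the scaler runs before the learner.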

Summary

  • Thanks to its current implementation, we can easily extend scikit-learn's Pipeline to slightly change its behaviour and make it easier to debug.
  • With this implementation, we get logging for free.
  • We could even log the predictions by building a debug_predict, as sketched below.
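
As a final note, here is a minimal sketch of what such a debug_predict could look like (not part of the original post); it assumes the transformers have already been fitted, e.g. via debug_fit:

class DebugPipeline(Pipeline):

    def debug_fit(self, X, y):
        for transformer_name, transformer in self.steps[:-1]:
            print(log_shape(X))
            print(log_dtypes(X))
            print(f"{transformer_name} to be applied")
            X = transformer.fit_transform(X)
        return self.steps[-1][1].fit(X, y)

    def debug_predict(self, X):
        # Same logging as debug_fit, but only transforming (no refitting)
        for transformer_name, transformer in self.steps[:-1]:
            print(log_shape(X))
            print(log_dtypes(X))
            print(f"{transformer_name} to be applied")
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)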