Fun with pipelines and irises (python, sklearn, pipeline, categorical, xgboost)

Following is demonstration of a trick I've learned using pipelines with scikit-learn. Your machine-learning development flow probably looks something like this:
  1. explore data
  2. subset features
  3. transform features
  4. calculate new features
  5. split data into train-validate-test
  6. model fit using sklearn estimators
  7. repeat

In the past, I would functionalize each step above and think my code style was pretty good. Lately, I've been using scikit-learn pipelines as even better code style, templating the function calls! You may want to add/switch out different transformers, or add/switch out different models as you learn more - which you will!

Pipelines enable cleaner switching in/out of transforms and models, all you have to do is change out the pipes in your pipeline. Pipelines make your machine-learning workflow reproducible and easier to share with teammates. The notebook itself is attached here. Below I explain how it works.

Demonstration below from Jupyter notebook uses the classic, well-known iris dataset. Given features of irises (the Greek rainbow goddess flower), the goal is to classify samples according to 3 iris species. (apparently Fisher in 1936 didn't know about Iris douglasiana, a 4th species native to Northern CA region.)

First, we'll load the necessary libraries and load the data.
iris flower named after the Greek goddess of the rainbow
import sys
print('Python: {}'.format(sys.version))
import numpy as np
print('numpy: {}'.format(np.__version__))
import pandas as pd
print('pandas: {}'.format(pd.__version__))
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

# Read dataset from URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = pd.read_csv(url, names=names)

# Randomize the rows to help prevent sampling bias later when you split train/val/test
iris = iris.reindex(np.random.permutation(iris.index))

# Convert target column to pandas built-in type category
iris["class"] = iris["class"].astype("category")

# Check dataset
print("DATA SIZE")
print(iris.shape)
print("HEAD 2 ROWS:")
print(iris.head(2))
print("SUMMARY STATS:")
print(iris.describe())
Python: 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
numpy: 1.14.5
pandas: 0.23.1
sklearn: 0.19.1

DATA SIZE
(150, 5)
HEAD 2 ROWS:
   sepal-length   sepal-width   petal-length   petal-width        class
0           5.1           3.5            1.4           0.2  Iris-setosa
1           4.9           3.0            1.4           0.2  Iris-setosa
SUMMARY STATS:
       sepal-length   sepal-width   petal-length   petal-width
count    150.000000    150.000000     150.000000    150.000000
mean       5.843333      3.054000       3.758667      1.198667
std        0.828066      0.433594       1.764420      0.763161
min        4.300000      2.000000       1.000000      0.100000
25%        5.100000      2.800000       1.600000      0.300000
50%        5.800000      3.000000       4.350000      1.300000
75%        6.400000      3.300000       5.100000      1.800000
max        7.900000      4.400000       6.900000      2.500000

Explore the data by visualizing features, x's, compared to labels, y's. We'll make a segmented bar chart and check correlations. Intuition about the data will help us "vet" the machine learning results later.

# Matplotlib
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,3))   

# Pandas segmented bar chart of y-value means by feature X
mytable = iris.groupby(by=['class'])['sepal-length','sepal-width','petal-length','petal-width'].mean()
mytable.plot.bar(ax=axes[0]);
x_axis = axes[0].axes.get_xaxis()
x_label = x_axis.get_label()
x_label.set_visible(False)
axes[0].tick_params(axis = 'x', labelsize = 12, labelrotation='default')

# Seaborn box plot numeric x-variables
temp = iris.select_dtypes(include=[np.number])
axes = sns.boxplot(data=temp, ax=axes[1]);

# Print out pandas profile report
pandas_profiling.ProfileReport(iris)

What you should notice: 1) top left chart shows petal-length and petal-width are the most segmenting features between classes (most difference in values), so expect these to be top feature candidates in your classification model, 2) top right chart shows more variance in petal-length than petal-width, 3) middle chart shows equal sampling of 3 classes, 4) correlations show petal-length and petal-width are the most highly correlated (darkest red quadrants) as well as sepal-length.

Certain estimators such as XGBoost require only numeric inputs. This means we have to create dummy numbers from categorical targets. If the categorical data are in the features, we have to additionally use a process called "one-hot encoding". What's tricky about one-hot-encoded variables is you're actually changing the shape of your dataframe, which usually can't be handled in a single pipeline. One trick I've learned is to split up the processing into 2 Stages: Processing and Modeling, each with their own Pipelines. Here is a general pipeline programming flow that works for me and took a while (for me) to figure out.

PROCESSING STAGE
1. Define pipeline to transform categorical variables
2. Define pipeline to transform numeric variables
3. Define pipeline to transform boolean variables
4. Feature_pipeline = Pipeline([ 
        (FeatureUnion(transformer_list=[
         (transform_categoricals) 
         (transform_numerics) 
                (transform_booleans) 
                ])  #end FeatureUnion
 ])  #end Feature_pipeline
5. Target_pipeline = Pipeline([ 
        (FeatureUnion(transformer_list=[
         (transform_categoricals) 
         (transform_numerics) 
                (transform_booleans) 
                ])  #end FeatureUnion
 ])  #end Target_pipeline

SPLIT DATA INTRO TRAIN-VALIDATE-TEST
X_train, X_validate, X_test = Feature_pipeline.transform(df_features)
y_train, y_validate, y_test = Target_pipeline.transform(df_target)

MODELING STAGE
1. select X_train, y_train column you want, e.g. only select dummy-encoded columns leaving out original categorical columns
2. Define pipeline to transform and fit X,y which is default behavior of pipelines
model_pipeline = Pipeline([
 (better form is put your column selection into transform fn here)
        (model_1_of_ensemble)
        (model_2_of_ensemble)
        (more models... ) 
 ])  #end pipeline2
6. model_pipeline.fit_transform(X_train, y_train)
# STEP1: Define processing pipelines

# Encode categorical column
pd.options.mode.chained_assignment = None  # default='warn' due to chaining fn calling fn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
def encode_category(data) :
    data_le = data
    le = LabelEncoder()
    for col in range(0,len(data.columns)):
        # subset just a categorical column
        name = data.columns[col] + '_encoded'
        data_le[name] = le.fit_transform(data.iloc[:,col])
    return data_le


# Define custom sklearn transformer for categorical variables
from sklearn.base import BaseEstimator, TransformerMixin
class TransformCategoricals(BaseEstimator, TransformerMixin):
    """Prepares categorical features from iris data set.
    Args:
       X,y as numpy arrays
    Returns:
       X numpy array transformed, y as original numpy array
    """
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Encode categorical variable "class"
        iris_le = encode_category(X)
        return iris_le


# define pipeline to transform just categoric columns
pipeline_categorical = Pipeline([
    ("select_categorical", TypeSelector(dtype='category'))
    , ("encode", TransformCategoricals())
    # one-hot would go here...
    # add/replace more transforms here...
])

# define pipeline to transform just numeric columns
pipeline_numeric = Pipeline([
    ("select_numeric", TypeSelector(dtype='number'))
    # add/replace more transforms here...
])

# Putting it together, define Feature_pipeline, in case of mixed data types in features
from sklearn.pipeline import Pipeline, FeatureUnion
Feature_pipeline = Pipeline([
    ('union', FeatureUnion([
        ('categorical', Pipeline([
            ("select_categorical", TypeSelector(dtype='category'))            
            , ("encode", TransformCategoricals())
            # add/replace one hot transform here...
        ]))
        , ('numerical', Pipeline([
            ("select_numeric", TypeSelector(dtype='number'))
            # add/replace more numeric transforms here...
        ]))
    ]))  #end FeatureUnion
    #, ("debug", Debug())  #prints head() for debugging
    ])

# SPLIT DATA INTRO TRAIN-VALIDATE-TEST
# Choose the first 105 (out of 150, 70%) examples for training.
# X_train could call Feature_pipeline, but we only have 1 datatype in iris example, which is numerics
X_train = pipeline_numeric.transform(iris.head(105))
y_train = pipeline_categorical.transform(iris.head(105))

# Choose the last 45 (out of 150, 30%) examples for validation.
X_test = pipeline_numeric.transform(iris.tail(45))
y_test = pipeline_categorical.transform(iris.tail(45))

# Double-check that we've done the right thing.
print("Training examples summary:")
print(pd.concat([X_train,y_train], axis=1).describe())
print("Validation examples summary:")
print(pd.concat([X_test,y_test], axis=1).describe())
Training examples summary:
       sepal-length  sepal-width  petal-length  petal-width  class_encoded
count    105.000000   105.000000    105.000000   105.000000     105.000000
mean       5.742857     3.062857      3.576190     1.139048       0.933333
std        0.805970     0.426808      1.742651     0.765795       0.823532
min        4.300000     2.000000      1.000000     0.100000       0.000000
25%        5.100000     2.800000      1.500000     0.300000       0.000000
50%        5.700000     3.000000      4.200000     1.300000       1.000000
75%        6.400000     3.300000      5.000000     1.800000       2.000000
max        7.900000     4.400000      6.600000     2.500000       2.000000
Validation examples summary:
       sepal-length  sepal-width  petal-length  petal-width  class_encoded
count     45.000000    45.000000     45.000000    45.000000      45.000000
mean       6.077778     3.033333      4.184444     1.337778       1.155556
std        0.840424     0.453271      1.760547     0.746899       0.796457
min        4.400000     2.200000      1.200000     0.100000       0.000000
25%        5.500000     2.800000      3.000000     1.000000       1.000000
50%        6.200000     3.000000      4.700000     1.500000       1.000000
75%        6.600000     3.300000      5.500000     1.800000       2.000000
max        7.700000     4.200000      6.900000     2.500000       2.000000

Now we're ready for the modeling stage.

# MODELING STAGE
# drop the original categorical class, keep just the encoded class
y_train = y_train['class_encoded']
y_test = y_test['class_encoded']

# set seed
seed = 123

# Try XGB classifier
from xgboost.sklearn import XGBClassifier
model_pipeline = Pipeline([
    ('XGB', XGBClassifier(seed=seed))
])

# fit the model pipeline
model_pipeline.fit(X_train, y_train)
Pipeline(memory=None,
     steps=[('XGB', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=123,
       silent=True, subsample=1))])

# make predictions with test data
y_pred = model_pipeline.predict(X_test)
predictions = [round(value) for value in y_pred]

# do a quick evaluation of prediction
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print(classification_report(y_test, predictions))
Accuracy: 91.11%
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        11
          1       0.93      0.81      0.87        16
          2       0.85      0.94      0.89        18

avg / total       0.91      0.91      0.91        45

Try a different model, easy!

# define a simple model pipeline
model_pipeline = Pipeline([
    #('XGB', XGBClassifier(seed=seed))
    ('LDA', LinearDiscriminantAnalysis(seed=seed))
])

# fit the model pipeline
model_pipeline.fit(X_train, y_train)
Pipeline(memory=None,
     steps=[('LDA', LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
              solver='svd', store_covariance=False, tol=0.0001))])

# make predictions with test data
y_pred = model_pipeline.predict(X_test)
predictions = [round(value) for value in y_pred]

# do a quick evaluation of prediction
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print(classification_report(y_test, predictions))
Accuracy: 97.78%
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        11
          1       0.94      1.00      0.97        16
          2       1.00      0.94      0.97        18

avg / total       0.98      0.98      0.98        45

Setting up the pipelines might seem like a lot for just the iris, but now you have the process down. Now it will be easier to try out different transformers or model estimators.

NOTE TO SELF TODO: write Part2 of this blog article! Next up, I'll plug in a bunch of different model estimators, including non-scikit-learn such as Google's TensorFlow, and talk about model evaluation... Hint: I recently discovered automatic grid search and lime ...!