- explore data
- subset features
- transform features
- calculate new features
- split data into train-validate-test
- fit models using sklearn estimators
- repeat
In the past, I would functionalize each step above and consider my code style pretty good. Lately, I've been using scikit-learn pipelines as an even better code style, templating the function calls! As you learn more (and you will), you may want to add or switch out different transformers and models.
Pipelines enable cleaner switching of transforms and models in and out: all you have to do is change the pipes in your pipeline. Pipelines also make your machine-learning workflow reproducible and easier to share with teammates. The notebook itself is attached here. Below I explain how it works.
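To make the idea concrete before the iris walkthrough, here is a minimal sketch (not taken from the notebook) of what swapping a pipe looks like; the scaler and classifier here are just placeholder choices:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# each "pipe" is a (name, estimator) tuple; swap a single line to change the workflow
pipe = Pipeline([
    ('scale', StandardScaler()),      # transformer step: standardize features
    ('clf', LogisticRegression())     # estimator step: swap in any other classifier here
])
# pipe.fit(X_train, y_train) and pipe.predict(X_test) stay the same regardless of what's inside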
The demonstration below, from a Jupyter notebook, uses the classic, well-known iris dataset. Given features of irises (the flower named for the Greek rainbow goddess), the goal is to classify samples into 3 iris species. (Apparently Fisher in 1936 didn't know about Iris douglasiana, a 4th species native to the Northern California region.) First, we'll load the necessary libraries and the data.
import sys
print('Python: {}'.format(sys.version))
import numpy as np
print('numpy: {}'.format(np.__version__))
import pandas as pd
print('pandas: {}'.format(pd.__version__))
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

# Read dataset from URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = pd.read_csv(url, names=names)

# Randomize the rows to help prevent sampling bias later when you split train/val/test
iris = iris.reindex(np.random.permutation(iris.index))

# Convert target column to pandas built-in type category
iris["class"] = iris["class"].astype("category")

# Check dataset
print("DATA SIZE")
print(iris.shape)
print("HEAD 2 ROWS:")
print(iris.head(2))
print("SUMMARY STATS:")
print(iris.describe())
Python: 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
numpy: 1.14.5
pandas: 0.23.1
sklearn: 0.19.1
DATA SIZE
(150, 5)
HEAD 2 ROWS:
   sepal-length  sepal-width  petal-length  petal-width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
SUMMARY STATS:
       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
Explore the data by visualizing the features (x's) against the labels (y's). We'll make a segmented bar chart and check correlations. Intuition about the data will help us "vet" the machine-learning results later.
# Matplotlib
import pandas_profiling
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,3))

# Pandas segmented bar chart of y-value means by feature X
mytable = iris.groupby(by=['class'])['sepal-length','sepal-width','petal-length','petal-width'].mean()
mytable.plot.bar(ax=axes[0])
x_axis = axes[0].axes.get_xaxis()
x_label = x_axis.get_label()
x_label.set_visible(False)
axes[0].tick_params(axis='x', labelsize=12, labelrotation='default')

# Seaborn box plot of the numeric x-variables
temp = iris.select_dtypes(include=[np.number])
axes = sns.boxplot(data=temp, ax=axes[1])

# Print out pandas profile report
pandas_profiling.ProfileReport(iris)
What you should notice:
1) The left chart shows petal-length and petal-width are the most segmenting features between classes (the largest differences in mean values), so expect these to be top feature candidates in your classification model.
2) The right chart shows more variance in petal-length than in petal-width.
3) The profile report shows equal sampling of the 3 classes.
4) The profile report's correlations show petal-length and petal-width are the most highly correlated pair (darkest red quadrants), and both correlate strongly with sepal-length.
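If you prefer not to generate the full profile report, you can eyeball the same correlations directly; a minimal sketch using pandas corr() and a seaborn heatmap, reusing the libraries already imported above:

# correlation matrix of the numeric features; darker cells = stronger correlation
corr = iris.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=True, cmap='Reds')
plt.show()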
Certain estimators, such as XGBoost, accept only numeric inputs. This means we have to encode the categorical target as numbers. If categorical data appear in the features, we additionally have to use a process called "one-hot encoding". What's tricky about one-hot-encoded variables is that you're actually changing the shape of your dataframe, which usually can't be handled in a single pipeline (see the short illustration below). One trick I've learned is to split the processing into 2 stages, Processing and Modeling, each with its own pipelines.
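To see why, note how one-hot encoding widens the dataframe; a minimal sketch using pandas get_dummies (the column counts are specific to iris):

# one-hot encoding turns 1 categorical column into 1 column per category
print(iris.shape)                            # (150, 5)
iris_onehot = pd.get_dummies(iris, columns=['class'])
print(iris_onehot.shape)                     # (150, 7): 4 numeric + 3 dummy columns

With that caveat in mind, here is a general pipeline programming flow that works for me and took a while to figure out.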
PROCESSING STAGE
1. Define pipeline to transform categorical variables
2. Define pipeline to transform numeric variables
3. Define pipeline to transform boolean variables
4. Combine them into a feature pipeline:
   Feature_pipeline = Pipeline([
       FeatureUnion(transformer_list=[
           (transform_categoricals),
           (transform_numerics),
           (transform_booleans)
       ])  # end FeatureUnion
   ])  # end Feature_pipeline
5. Do the same for the target:
   Target_pipeline = Pipeline([
       FeatureUnion(transformer_list=[
           (transform_categoricals),
           (transform_numerics),
           (transform_booleans)
       ])  # end FeatureUnion
   ])  # end Target_pipeline

SPLIT DATA INTO TRAIN-VALIDATE-TEST
X_train, X_validate, X_test = Feature_pipeline.transform(df_features)
y_train, y_validate, y_test = Target_pipeline.transform(df_target)

MODELING STAGE
1. Select the X_train, y_train columns you want, e.g. only the dummy-encoded columns, leaving out the original categorical columns
2. Define a pipeline to transform and fit X, y (the default behavior of pipelines):
   model_pipeline = Pipeline([
       (better form: put your column selection into a transform fn here),
       (model_1_of_ensemble),
       (model_2_of_ensemble),
       (more models...)
   ])  # end model_pipeline
3. model_pipeline.fit_transform(X_train, y_train)
# STEP 1: Define processing pipelines

# Encode categorical column
pd.options.mode.chained_assignment = None  # default='warn'; silenced due to fn calling fn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
def encode_category(data):
    data_le = data
    le = LabelEncoder()
    for col in range(0, len(data.columns)):
        # subset just a categorical column
        name = data.columns[col] + '_encoded'
        data_le[name] = le.fit_transform(data.iloc[:, col])
    return data_le

# Define custom sklearn transformers
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion

class TypeSelector(BaseEstimator, TransformerMixin):
    """Selects columns of a given dtype from a pandas DataFrame
    (standard dtype-selector pattern, assumed by the pipelines below)."""
    def __init__(self, dtype):
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.select_dtypes(include=[self.dtype])

class TransformCategoricals(BaseEstimator, TransformerMixin):
    """Prepares categorical features from iris data set.
    Args: X, y as numpy arrays
    Returns: X numpy array transformed, y as original numpy array
    """
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Encode categorical variable "class"
        iris_le = encode_category(X)
        return iris_le

# define pipeline to transform just categoric columns
pipeline_categorical = Pipeline([
    ("select_categorical", TypeSelector(dtype='category')),
    ("encode", TransformCategoricals())
    # one-hot would go here...
    # add/replace more transforms here...
])

# define pipeline to transform just numeric columns
pipeline_numeric = Pipeline([
    ("select_numeric", TypeSelector(dtype='number'))
    # add/replace more transforms here...
])

# Putting it together, define Feature_pipeline, in case of mixed data types in features
Feature_pipeline = Pipeline([
    ('union', FeatureUnion([
        ('categorical', Pipeline([
            ("select_categorical", TypeSelector(dtype='category')),
            ("encode", TransformCategoricals())
            # add/replace one-hot transform here...
        ])),
        ('numerical', Pipeline([
            ("select_numeric", TypeSelector(dtype='number'))
            # add/replace more numeric transforms here...
        ]))
    ]))  # end FeatureUnion
    #, ("debug", Debug())  # prints head() for debugging
])
# SPLIT DATA INTO TRAIN-VALIDATE-TEST
# Choose the first 105 (out of 150, 70%) examples for training.
# X_train could call Feature_pipeline, but the iris features contain only 1 datatype (numeric)
X_train = pipeline_numeric.transform(iris.head(105))
y_train = pipeline_categorical.transform(iris.head(105))

# Choose the last 45 (out of 150, 30%) examples for validation.
X_test = pipeline_numeric.transform(iris.tail(45))
y_test = pipeline_categorical.transform(iris.tail(45))

# Double-check that we've done the right thing.
print("Training examples summary:")
print(pd.concat([X_train, y_train], axis=1).describe())
print("Validation examples summary:")
print(pd.concat([X_test, y_test], axis=1).describe())
Training examples summary:
       sepal-length  sepal-width  petal-length  petal-width  class_encoded
count    105.000000   105.000000    105.000000   105.000000     105.000000
mean       5.742857     3.062857      3.576190     1.139048       0.933333
std        0.805970     0.426808      1.742651     0.765795       0.823532
min        4.300000     2.000000      1.000000     0.100000       0.000000
25%        5.100000     2.800000      1.500000     0.300000       0.000000
50%        5.700000     3.000000      4.200000     1.300000       1.000000
75%        6.400000     3.300000      5.000000     1.800000       2.000000
max        7.900000     4.400000      6.600000     2.500000       2.000000
Validation examples summary:
       sepal-length  sepal-width  petal-length  petal-width  class_encoded
count     45.000000    45.000000     45.000000    45.000000      45.000000
mean       6.077778     3.033333      4.184444     1.337778       1.155556
std        0.840424     0.453271      1.760547     0.746899       0.796457
min        4.400000     2.200000      1.200000     0.100000       0.000000
25%        5.500000     2.800000      3.000000     1.000000       1.000000
50%        6.200000     3.000000      4.700000     1.500000       1.000000
75%        6.600000     3.300000      5.500000     1.800000       2.000000
max        7.700000     4.200000      6.900000     2.500000       2.000000
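As an aside, the head/tail split above only works because we shuffled the rows when loading the data; scikit-learn's train_test_split does the shuffling and splitting in one step (a sketch of the alternative, not what the notebook uses):

from sklearn.model_selection import train_test_split

# 70/30 split with shuffling built in; stratify keeps the class proportions equal
X_tr, X_te, y_tr, y_te = train_test_split(
    iris[['sepal-length', 'sepal-width', 'petal-length', 'petal-width']],
    iris['class'], test_size=0.3, random_state=123, stratify=iris['class'])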
Now we're ready for the modeling stage.
# MODELING STAGE
# drop the original categorical class, keep just the encoded class
y_train = y_train['class_encoded']
y_test = y_test['class_encoded']

# set seed
seed = 123

# Try XGB classifier
from xgboost.sklearn import XGBClassifier
model_pipeline = Pipeline([
    ('XGB', XGBClassifier(seed=seed))
])

# fit the model pipeline
model_pipeline.fit(X_train, y_train)
Pipeline(memory=None,
     steps=[('XGB', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=123, silent=True,
       subsample=1))])
# make predictions with test data
y_pred = model_pipeline.predict(X_test)
predictions = [round(value) for value in y_pred]

# do a quick evaluation of the predictions
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print(classification_report(y_test, predictions))
Accuracy: 91.11%
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        11
          1       0.93      0.81      0.87        16
          2       0.85      0.94      0.89        18

avg / total       0.91      0.91      0.91        45
Trying a different model is easy!
# define a simple model pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model_pipeline = Pipeline([
    #('XGB', XGBClassifier(seed=seed))
    ('LDA', LinearDiscriminantAnalysis())  # note: LDA takes no seed parameter
])

# fit the model pipeline
model_pipeline.fit(X_train, y_train)
Pipeline(memory=None,
     steps=[('LDA', LinearDiscriminantAnalysis(n_components=None, priors=None,
              shrinkage=None, solver='svd', store_covariance=False, tol=0.0001))])
# make predictions with test data
y_pred = model_pipeline.predict(X_test)
predictions = [round(value) for value in y_pred]

# do a quick evaluation of the predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print(classification_report(y_test, predictions))
Accuracy: 97.78%
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        11
          1       0.94      1.00      0.97        16
          2       1.00      0.94      0.97        18

avg / total       0.98      0.98      0.98        45
Setting up the pipelines might seem like a lot of work for just the iris dataset, but now you have the process down, and it will be much easier to try out different transformers or model estimators, as sketched below.
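For example, here is a minimal sketch of looping several candidate estimators through the same pipeline slot (the candidate list is arbitrary, and KNN/SVC are extra imports not used in the notebook above):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# reuse the same pipeline skeleton, swapping only the final pipe
candidates = [
    ('XGB', XGBClassifier(seed=seed)),
    ('LDA', LinearDiscriminantAnalysis()),
    ('KNN', KNeighborsClassifier()),
    ('SVC', SVC())
]
for name, estimator in candidates:
    pipe = Pipeline([(name, estimator)])
    pipe.fit(X_train, y_train)
    print(name, accuracy_score(y_test, pipe.predict(X_test)))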
Next up, in Part 2 of this article, I'll plug in a bunch of different model estimators, including non-scikit-learn ones such as Google's TensorFlow, and talk about model evaluation... Hint: I recently discovered automatic grid search and LIME!
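For the curious, here is a minimal sketch of what automatic grid search over a pipeline looks like (the grid values are just examples; note how pipeline step names become parameter prefixes):

from sklearn.model_selection import GridSearchCV

# pipeline parameters are addressed as '<step_name>__<param_name>'
grid = GridSearchCV(
    Pipeline([('XGB', XGBClassifier(seed=seed))]),
    param_grid={'XGB__max_depth': [2, 3, 4], 'XGB__n_estimators': [50, 100]},
    cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)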