Machine Learning Python Programming Tips, Best Practices

Content
How to apply the same data preprocessing steps to train and test data while working with scikit-learn?
Machine Learning in Python with Scikit-Learn
How Python Manages Memory and Creating Arrays With np.linspace()
Refactoring Python Applications for Simplicity
Python CI/CD using GitHub Actions
PyTorch GPU tuning
Pro-tip for pytest users:

How to apply the same data preprocessing steps to train and test data while working with scikit-learn?

The general idea is to save the fitted preprocessing steps in a .pkl file using joblib and reuse them during prediction. This ensures that new data goes through exactly the same transformations, with the same fitted parameters, as the training data. If you are using scikit-learn, there is an easy way to combine preprocessing and modelling in a single object.

Use Pipeline.

Example:

Say your data has both numerical and categorical columns. You need to apply some processing to them, and you want to make sure the same processing is applied during the prediction phase. The training and prediction phases also run as two separate pipelines. The example below covers the numeric columns; categorical columns can be added to the same ColumnTransformer as another transformer. In such a situation you can apply something like this:

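import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumes `data` is a pandas DataFrame that is already loaded and
# `feat_cols` is a list of its numeric feature column names.
X = data[feat_cols]
y = data["OUTCOME"]

# Numeric columns: fill missing values with the median, then standardize.
numeric_features = feat_cols
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features)])

# Preprocessing and model live in one object, so they are fitted and saved together.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=42, stratify=y)

# fit() learns the imputer medians and scaler statistics from the training data only.
clf.fit(X_train, y_train)

# predict_proba() applies the already-fitted preprocessing to X_test before predicting.
proba = clf.predict_proba(X_test)

# Persist the whole pipeline (preprocessing + classifier) for the prediction phase.
model_file = "../model/model_randomforest.pkl"
joblib.dump(clf, model_file)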
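
The example above handles only numeric columns and stops at saving the model. As a rough sketch of the other two pieces (a categorical branch and the separate prediction phase), something like the following could work. The toy DataFrame, the column names age, income and city, and the temporary file location are all made up for illustration; handle_unknown="ignore" is one possible way to deal with categories that were not seen during training, not the only one.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy training data: two numeric columns and one categorical column.
train = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "income": [40.0, 52.5, 61.0, np.nan, 88.0, 45.5],
    "city": ["A", "B", "A", np.nan, "C", "B"],
    "OUTCOME": [0, 1, 0, 1, 1, 0],
})
num_cols = ["age", "income"]
cat_cols = ["city"]

# Numeric branch: median imputation followed by standardization.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical branch: most-frequent imputation followed by one-hot encoding.
# Unseen categories at prediction time are encoded as all zeros.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, num_cols),
    ("cat", categorical_transformer, cat_cols),
])

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])
clf.fit(train[num_cols + cat_cols], train["OUTCOME"])

# Training phase ends here: save the fitted pipeline (temporary path for the demo).
model_file = os.path.join(tempfile.mkdtemp(), "model_randomforest.pkl")
joblib.dump(clf, model_file)

# Prediction phase (normally a separate script): load the pipeline and pass raw rows.
loaded = joblib.load(model_file)
new_rows = pd.DataFrame({
    "age": [29, np.nan],
    "income": [48.0, 70.0],
    "city": ["B", "D"],  # "D" never appeared in the training data
})
proba = loaded.predict_proba(new_rows)
print(proba.shape)
```

The prediction code never calls the imputer, scaler or encoder directly. It hands raw columns to the loaded pipeline, which is what keeps the two phases consistent.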