How and Why to Standardize Your Data: A Python Tutorial

 


In this post, I explain why and how to apply standardization using scikit-learn in Python.

Hi there.

This is my first Medium post. I am an electrical & computer engineer currently finishing my PhD studies in the biomedical engineering and computational neuroscience field. I have been working on machine learning problems for the past 4 years. A very common question that I see all around the web is how, and why, to standardize the data before fitting a machine learning model.

How does scikit-learn’s StandardScaler work?

The first question that comes to one’s mind is:

Why standardize in the first place?

Why standardize before fitting an ML model?

Well, the idea is simple. Variables that are measured at different scales do not contribute equally to the model fitting and the learned function, and might end up creating a bias. Thus, to deal with this potential problem, feature-wise standardization (μ=0, σ=1) is usually applied prior to model fitting.
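
Concretely, standardization maps each value x of a feature to z = (x − μ) / σ, where μ and σ are the mean and standard deviation of that feature. A minimal NumPy sketch of this computation on a single, made-up feature:

import numpy as np

# one hypothetical feature/column with 4 samples
x = np.array([10., 20., 30., 40.])
z = (x - x.mean()) / x.std()
# z now has mean 0 and standard deviation 1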

To do that using scikit-learn, we first need to construct an input array X containing the features and samples, with X.shape being [number_of_samples, number_of_features].

Keep in mind that all scikit-learn machine learning (ML) functions expect as input a NumPy array X with that shape, i.e. the rows are the samples and the columns are the features/variables. Having said that, let’s assume that we have a matrix X where each row is a sample/observation and each column is a variable/feature.
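
If your data starts out as a single feature in a 1-D array, it needs to be reshaped into a column before being passed to scikit-learn. A small sketch with made-up values:

import numpy as np

heights = np.array([1.60, 1.75, 1.82])  # 1-D array, shape (3,)
X_single = heights.reshape(-1, 1)       # shape (3, 1): 3 samples, 1 feature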

Note: Tree-based models are usually not dependent on scaling, but non-tree models such as SVM, LDA, etc. are often hugely dependent on it.
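
To illustrate that note, a common pattern is to bundle the scaler and a scale-sensitive model such as an SVM into a scikit-learn pipeline, so the scaling is always applied before fitting and predicting. A sketch on a toy dataset (the iris data and the train/test split are just placeholder choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, random_state=0)

# the scaler is fitted on the training data and applied before the SVM
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))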

Core of method

The main idea is to normalize/standardize (i.e. μ = 0 and σ = 1) the features/variables/columns of X, individually, before applying any machine learning model. Thus, StandardScaler() will normalize the features, i.e. each column of X, INDIVIDUALLY, so that each column/feature/variable will have μ = 0 and σ = 1.

Working Python code example:

from sklearn.preprocessing import StandardScaler
import numpy as np

# 4 samples/observations and 2 variables/features
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
# the scaler object (model)
scaler = StandardScaler()
# fit and transform the data
scaled_data = scaler.fit_transform(X)

print(X)
[[0 0]
 [1 0]
 [0 1]
 [1 1]]

print(scaled_data)
[[-1. -1.]
 [ 1. -1.]
 [-1.  1.]
 [ 1.  1.]]
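
The fitted scaler also stores the per-column statistics it estimated, in its mean_ and scale_ attributes, which you can inspect:

print(scaler.mean_)   # the mean of each column of X
[0.5 0.5]

print(scaler.scale_)  # the standard deviation of each column of X
[0.5 0.5]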

Verify that the mean of each feature (column) is 0:

scaled_data.mean(axis = 0)
array([0., 0.])

Verify that the std of each feature (column) is 1:

scaled_data.std(axis = 0)
array([1., 1.])
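
One practical point worth keeping in mind: in a real project the scaler should be fitted on the training data only, and the same fitted scaler should then be used to transform the test (or any new) data, so that no information from the test set leaks into the preprocessing. A minimal sketch, assuming X_train and X_test hold your split:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on the training data
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std on the test data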

The effect of the transform in a visual example

Summary

  • StandardScaler removes the mean and scales each feature/variable to unit variance. This operation is performed feature-wise in an independent way.
  • StandardScaler can be influenced by outliers (if they exist in the dataset) since it involves the estimation of the empirical mean and standard deviation of each feature.

How to deal with outliers

  • Manual way (not recommended): visually inspect the data and remove outliers using statistical methods such as the Interquartile Range (IQR) threshold method.
  • Recommended way: use the RobustScaler, which scales the features using statistics that are robust to outliers. This scaler removes the median and scales the data according to the quantile range (defaults to the IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). A short sketch follows below.
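
A minimal sketch of the RobustScaler on a single made-up feature that contains an obvious outlier:

from sklearn.preprocessing import RobustScaler
import numpy as np

# one feature with an outlier in the last sample
X_out = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# centers on the median and scales by the IQR, so the outlier barely affects the scaling
robust_scaler = RobustScaler()
print(robust_scaler.fit_transform(X_out))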

That’s all for today! Hope you liked this first post! Next story coming next week. Stay tuned & safe.

Stay tuned & support me

If you liked this article and found it useful, follow me and clap for my story to support me!

- My mailing list in just 5 seconds: https://seralouk.medium.com/subscribe

- Become a member and support me: https://seralouk.medium.com/membership

References

[1] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

[2] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html

