# The Ultimate and Practical Guide on Feature Scaling

Machine learning models are particular about the type and range of the values they receive as input.

With the exception of decision trees, most ML models expect you to scale the input features. With scaled input features, a model can converge faster than it would on unscaled features.

What is feature scaling?

Feature scaling is simply the transformation of numerical features into a small range of values. As the name implies, only the features are scaled. Labels (the output data) don't need to be scaled.

There are many feature scaling techniques, but the two most commonly used are normalization (a.k.a. MinMax scaling) and standardization. A third technique is robust scaling.

In many internet resources, you will see normalization and standardization used interchangeably, but they are different techniques and are suited for different datasets.

In this post, we will shed light on these 3 techniques and implement them with Scikit-Learn, a well-designed classical machine learning framework.

### 1. Normalization

Normalization is a scaling technique that rescales a numerical feature to the range between 0 and 1.

Here is the formula used to normalize the data. $$X_{min}$$ is the minimum value of feature X, and $$X_{max}$$ is the maximum value of X. Each value is shifted by the minimum and then divided by the range of the feature.

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
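To make the formula concrete, here is a minimal sketch that applies it by hand with NumPy and checks the result against Scikit-Learn's `MinMaxScaler`. The `ages` array is made up for illustration and does not come from any dataset in this post.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature values, chosen only for illustration
ages = np.array([5.0, 20.0, 35.0, 50.0, 80.0])

# Apply the formula by hand: (X - min) / (max - min)
ages_norm = (ages - ages.min()) / (ages.max() - ages.min())
print(ages_norm)  # values: 0, 0.2, 0.4, 0.6, 1

# MinMaxScaler expects a 2D array of shape (n_samples, n_features)
ages_sklearn = MinMaxScaler().fit_transform(ages.reshape(-1, 1)).ravel()
print(np.allclose(ages_norm, ages_sklearn))  # True
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.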

Be careful here: the minimum and maximum should be computed on the training data only, and those same values should then be used to transform the test data. Otherwise, you would be leaking test data statistics into the training process.

I didn't know about that kind of data leakage until my friend Santiago wrote this great tweet about pitfalls in feature scaling.

It's fairly easy to apply any technique to a dataset, but as engineers, we are expected to be deliberate about why we use certain techniques. And that brings me to this question: when should we scale our training data with normalization, or the MinMax scaler?
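Here is a minimal sketch of scaling without leakage, in line with the rule at the end of this post (fit on the training data, only transform the test data). The data is synthetic and the split sizes are arbitrary assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic single-feature data, used only for illustration
rng = np.random.default_rng(0)
X = rng.uniform(low=0, high=100, size=(200, 1))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min and max from the training set only
X_test_scaled = scaler.transform(X_test)        # reuse the training min and max

# The training data spans [0, 1]; the test data may fall slightly outside that range
print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled.min(), X_test_scaled.max())
```

Test values slightly below 0 or above 1 are expected and harmless: they simply mean the test set contains values the training set did not.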

#### When Should You Normalize the Features?

If you have used many datasets, you have very likely noticed that features usually have different ranges and scales (or units). For example, an age feature may vary from 5 to 80, a temperature feature from 20 to 500, a month feature from 1 to 12, and so forth.

With these features having different ranges, it's best to scale them into values between 0 and 1. Normalization can work pretty well in this case.
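To see why mismatched ranges matter, here is a small sketch with three made-up samples that use the age, temperature, and month ranges mentioned above. It compares Euclidean distances before and after min-max scaling. The samples and the `min_max` helper are hypothetical and exist only for this illustration.

```python
import numpy as np

# Feature ranges taken from the example above: age, temperature, month
mins = np.array([5.0, 20.0, 1.0])
maxs = np.array([80.0, 500.0, 12.0])

def min_max(x):
    """Rescale a sample to [0, 1] per feature using the known ranges."""
    return (x - mins) / (maxs - mins)

# Hypothetical samples: [age, temperature, month]
a = np.array([30.0, 100.0, 6.0])
b = np.array([70.0, 100.0, 6.0])   # 40 years older than a
c = np.array([30.0, 200.0, 6.0])   # 100 degrees hotter than a

raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
scaled_ab = np.linalg.norm(min_max(a) - min_max(b))
scaled_ac = np.linalg.norm(min_max(a) - min_max(c))

print(f"raw:    d(a, b) = {raw_ab:.3f}, d(a, c) = {raw_ac:.3f}")
print(f"scaled: d(a, b) = {scaled_ab:.3f}, d(a, c) = {scaled_ac:.3f}")
```

On the raw values, `c` looks farther from `a` than `b` does, simply because temperature has the widest range. After scaling, a 40-year age gap counts for more than a 100-degree temperature gap, so the ordering flips.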

More specifically, normalization is a preferable scaling technique when the data at hand doesn't have a normal (Gaussian) distribution. If the data's distribution is Gaussian, standardization is the preferable scaling technique (more on this later).

If you don't know the distribution of the data, normalization is still a good first choice.

As for particular machine learning models, normalization is a good choice for algorithms such as neural networks and K-Nearest Neighbors (KNN) because they don't make any assumptions about the distribution of the input data.

Most ML frameworks have direct and flexible support for feature scaling, so you don't have to implement it yourself.

#### Implementing Normalization in Scikit-Learn

For illustration purposes, I will use the tips dataset that is available in seaborn datasets.

from seaborn import load_dataset

tip_data = load_dataset('tips')

I will then select all the numerical features in the above dataset; these are the ones we will apply the scaler to. Remember, we can only scale numerical features.

num_feats = tip_data[['total_bill', 'tip', 'size']]


Now, let's scale those numerical features with Scikit-Learn's preprocessing module. We will use MinMaxScaler, which scales the data to the range between 0 and 1 by default. If you want a different range, you can set it with the feature_range parameter.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

num_scaled = scaler.fit_transform(num_feats)


The output of the scaler is a NumPy array. You can convert it back to a Pandas DataFrame.

import pandas as pd

num_scaled_df = pd.DataFrame(num_scaled, columns=num_feats.columns)


As you can tell from the above dataframe, all features are scaled to values between 0 and 1. That was it for normalization. Let's now turn the page to standardization.

### 2. Standardization

In standardization, the numerical features are rescaled to have zero mean and unit standard deviation.

Here is the formula of standardization. $$X_{std}$$ is the standardized feature, $$X$$ is the feature, $$\mu$$ is the mean of the feature, and $$\sigma$$ is its standard deviation (std).

$$X_{std} = \frac{X - \mu}{\sigma}$$
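As a quick sanity check of the formula, the sketch below standardizes a small made-up array by hand and compares the result with Scikit-Learn's `StandardScaler`. The `bills` values are hypothetical. Note that `StandardScaler` uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature values, chosen only for illustration
bills = np.array([10.0, 12.0, 15.0, 20.0, 43.0])

# Apply the formula by hand: subtract the mean, divide by the standard deviation
bills_std = (bills - bills.mean()) / bills.std()  # np.std defaults to ddof=0

# StandardScaler expects a 2D array of shape (n_samples, n_features)
bills_sklearn = StandardScaler().fit_transform(bills.reshape(-1, 1)).ravel()

print(np.allclose(bills_std, bills_sklearn))  # True
print(np.isclose(bills_std.mean(), 0.0), np.isclose(bills_std.std(), 1.0))  # True True
```

Unlike normalization, the result is not bounded to a fixed range, so unusually large values stay visibly large after scaling.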

#### When Should You Standardize the Features?

The standardization scaling technique is suitable for data that has a normal (Gaussian) distribution.

Some machine learning models, such as support vector machines with a Radial Basis Function (RBF) kernel and linear models (linear and logistic regression, especially when regularized), work best when the input features are centered around zero and have comparable variance, which is what standardization provides.

In most cases, the choice between normalization and standardization won't make much difference, but sometimes it can. So it makes sense to try both, especially if you are not sure about the distribution of the data.
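One way to "try both" is to put each scaler in a pipeline with the same model and compare cross-validated scores. The sketch below is only an illustration: the wine dataset that ships with Scikit-Learn and the KNN classifier with default settings are assumptions made for the example, not choices from this post.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Small built-in dataset whose features have very different ranges
X, y = load_wine(return_X_y=True)

candidates = {
    "no scaling": make_pipeline(KNeighborsClassifier()),
    "normalization": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
    "standardization": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

scores = {}
for name, model in candidates.items():
    # The pipeline refits the scaler on the training folds only, so nothing leaks
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Wrapping the scaler and the model in a pipeline keeps the comparison fair, because each scaler is fitted only on the training portion of every fold.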

#### Implementing Standardization in Scikit-Learn

from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
num_std = std_scaler.fit_transform(num_feats)


Let's convert the scaled features into a dataframe for an easy glance:

num_std_scaled_df = pd.DataFrame(num_std, columns=num_feats.columns)


Standardization scales the data to have a mean of 0 and a unit standard deviation. We can verify that.

import numpy as np

print(f'The mean of scaled data: {np.round(num_std.mean(axis=0))}')
print(f'The standard deviation of scaled data: {num_std.std(axis=0)}')

The mean of scaled data: [-0.  0. -0.]
The standard deviation of scaled data: [1. 1. 1.]


As shown above, all 3 features have a mean of 0 and a unit (1) standard deviation.

Before we wrap up, let's also look at another useful scaling technique.

### 3. Robust Scaling

Robust scaling is similar to standardization but is used when the data contains many outliers.

Instead of subtracting the mean, it subtracts the median, and instead of dividing by the standard deviation, it divides by the Interquartile Range (IQR). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Like normalization and standardization, the robust scaler is also easily implemented in Scikit-Learn.
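The computation can be sketched by hand with NumPy percentiles. The array below is made up and contains one obvious outlier. The result is compared with Scikit-Learn's `RobustScaler`, assuming its default settings (median centering and the 25th to 75th percentile range).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical values with one obvious outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Median and quartiles are barely affected by the outlier
q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

# Subtract the median and divide by the IQR
x_robust = (x - median) / iqr
print(x_robust)  # values: -1, -0.5, 0, 0.5, 48.5

# RobustScaler expects a 2D array of shape (n_samples, n_features)
x_sklearn = RobustScaler().fit_transform(x.reshape(-1, 1)).ravel()
print(np.allclose(x_robust, x_sklearn))  # True
```

The four ordinary values stay spread out around zero, while the outlier remains clearly identifiable instead of compressing everything else.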

from sklearn.preprocessing import RobustScaler

rob_scaler = RobustScaler()
num_rob_scaled = rob_scaler.fit_transform(num_feats)

# Display the first 5 rows

num_rob_scaled[:5]


After scaling the data with Robust Scaler, each feature will have a median of zero.

print(f'The median of scaled data: {np.round(np.median(num_rob_scaled, axis=0))}')

The median of scaled data: [-0.  0.  0.]


### The Bottom Line

Scaling the input data before feeding it to a machine learning model is a good practice for most models.

If you forget anything we talked about above (you can always revisit it), take the following important notes:

• Scaling the features helps the model to converge faster.
• Normalization is scaling the data to be between 0 and 1. It is preferred when the data doesn't have a normal distribution.
• Standardization is scaling the data to have 0 mean and unit standard deviation. It is preferred when the data has a normal or gaussian distribution.
• The robust scaling technique is used if the data has many outliers.
• In most cases, the choice of scaling technique won't make much difference, but sometimes it can. Try all of them and see what works best with your data.
• Only the features are scaled. The labels should not be scaled.
• Make sure not to fit the scaler on the test data. Only transform it.

# Don't:
# Only features are scaled, not labels. Don't fit on the test set.
scaler.fit_transform(y_train)
scaler.fit_transform(X_test)

# Do:
scaler.fit_transform(X_train)
scaler.transform(X_test)