A Comprehensive Guide for Handling Missing Values
A complete and step by step tutorial on how to deal with missing values
Real-world datasets are very messy and in many cases, they have many missing values. Often, missing values are one of the most problems that Data Scientists and Machine Learning Engineers are likely to deal with day to day, and it is not straightforward to know the right strategy.
There are many ways to deal with missing values, but there is no one fit all strategy. The right strategy depends on the dataset, its size or number of examples you have, the size of missing values in concerned features, what can be tolerated, and so on. Choosing the best strategy will help us to provide accurate insights, and avoid us from communicating wrong information. It can also save us time that we would spend tuning the model, from the fact that a good model comes from good (and clean) data.
There are many methods to handle missing values, but basically, everything we can do falls into the following:
* Removing the missing values * Filling the missing values * Leaving the missing values as they are.
In this article, I will walk through the common ways to handle missing values. For illustrations and code, I will use the California housing dataset available on Kaggle.
1. Removing the missing values
Before manipulating anything, this is the status of the missing values in the dataset.
The quickest way to remove missing values completely is done as follows:
We would be done with the NaNs work at this point, but we lost the data or perhaps there are other ways we can try.
If you wanted to be specific on removing rows or columns, you can specify it in the axis parameter, as follows. The entire column with missing values will be removed.
This is can however lead to the loss of data that could be helpful despite that it contains NaNs. If you wanted more control, then you can use thresh to specify how non-missing values to keep for a given column or row.
The above code won't change anything since non-missing values in the total bedrooms are more than 200. But if you change 200 to 20600, you will see that the total number of bedrooms will be removed.
Another interesting thing to try is to determine if the row or column will be removed from the dataframe when we have at least one missing value or all are missing.
- how is set to any, remove any column or row which has any missing value
- how is set to all, remove a column or row if all values are missing.
Above, we are removing any row which contains any missing values.
2. Filling the missing values
With Pandas, filling the missing values is very straightforward. Here is how you can fill them with a given number.
You can also use ffill (forward fill) or bfill(backward fill), where you fill the values preceding or following the missing value.
The downside of this is that it can mislead us. Let's take an example at index 2826 in the dataframe. The house with total rooms of 154 has 522 bedrooms, which is impossible.
Let's also look at other ways you can use to fill the missing values. Say we want to fill all missing values by mean or median of a given column. In this case, we will use Scikit-Learn imputer method to handle this.
from sklearn.impute import SimpleImputer housing_numeric=housing.drop('ocean_proximity', axis=1) mean_fill=SimpleImputer(missing_values=np.NaN,strategy='mean') mean_fill.fit(housing_numeric) mean_filled=mean_fill.transform(housing_numeric) mean_filled=pd.DataFrame(mean_filled, columns=housing_numeric.columns)
You can do the same thing about median, just replace median in SimpleImputer(missing_values=np.NaN,strategy='mean'). It will be:
You can also fill the missing values by the most frequent number in the feature. You will replace most_frequent with the Simple Imputer function. It will be like:
The last thing about using Simple Imputer is that you can use it to replace all missing values with a constant value. This is how you can apply it:
Like all previous strategies, you will have to inspect the results to be sure you are not over manipulating your data.
The last strategy to look at is Multivariate Imputation or iterative imputation. If you have few missing values, no doubt that it would be good to remove them completely to avoid imputing them with irrelevant values. "Quality over quantity".
But also you may wish to keep the data and perhaps find a better way to handle the missing values. One of the best ways out there (considering all the flaws of the above methods) is to fill the given missing value considering the values of other features.
In this case, we will use the Scikit-Learn method called Iterative Imputer. This works on the principle of Linear regression. I will not go deep, but below is how it is done. It estimates each feature from all the other features.
A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion. From Scikit-Learn documentation.
from sklearn.experimental import enable_iterative_imputer from sklearn.impute import IterativeImputer iter_imputer = IterativeImputer() housing_imputed=iter_imputer.fit_transform(housing_numeric) housing_imputed=pd.DataFrame(housing_imputed, columns=housing_numeric.columns)
By using this method, the missing values will be estimated from other features. As we said before, the right strategy will depend on your problem and the number of missing values you have, and the size of your dataset.
3. Leaving the missing values as they are
In this case, you will leave the missing values as they are. You will only have to ensure you don't have something like NaNs in your model input data because most machine learning models accept numeric inputs.
Though using this strategy you will have empty values, but at least you will not have introduced noise or eliminated important data. There is always a tradeoff!
This is the end!!
Thanks for finishing this tutorial. You can get a copy of the notebook and codes here.
Until the next time, may your models always converge!