How to Do Exploratory Data Analysis Effectively
Exploratory data analysis (EDA) is a critical step in a machine learning or data science project. This step is about learning the data, and when done properly, it can surface insights that later help you understand why certain predictions were made.
In this post, I want to share how I approach exploratory analysis. I usually start with simple things that you may already know, especially if you have done an end-to-end project.
Taking a Quick Look at the Dataset
This is the first step of exploratory analysis. Here you glance at the features, their data types, and their value ranges, and look over the dataset from head to tail.
These little things are easy to overlook, but understanding them is necessary for knowing the data at hand.
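A quick look like this can be sketched in a few lines of Pandas. The toy DataFrame and its column names below are made up for illustration; in practice the data would come from your own file or database.

```python
import pandas as pd

# Hypothetical toy dataset, standing in for whatever you actually load
df = pd.DataFrame({
    "battery_mah": [3000, 4500, 5000, 4000],
    "n_sim": [1, 2, 2, 3],
    "brand": ["A", "B", "A", "C"],
    "price_range": ["low", "mid", "high", "mid"],
})

print(df.shape)    # (number of rows, number of columns)
print(df.head(2))  # first rows
print(df.tail(2))  # last rows
print(df.dtypes)   # data type of each feature
df.info()          # data types plus non-null counts in one view

# Value range of every numeric feature
numeric = df.select_dtypes(include="number")
ranges = pd.DataFrame({"min": numeric.min(), "max": numeric.max()})
print(ranges)
```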
Checking Missing Values
Real-world datasets are rarely clean, and they often contain missing values.
Before getting to the cleaning process, I find it helpful to examine which features contain missing values, what type those features are, and how many values are missing. In most cases, that gives me a clear idea of how to handle them.
Take an example: let's say you're predicting the price range of a mobile phone. Among the features is the number of SIM cards supported, which has 3 unique values (1, 2, and 3). Some values are missing across all three categories. Will you fill them with the mean? The median? The most frequent value? In a case like this, the better option is often to remove the rows with missing values, or to study the data rigorously before filling them. You do not want to end up with a phone that everyone knows supports 1 SIM card but that supports 3 in your data.
You might think these details belong in data preprocessing, but I find it helpful to dig into the missing values and think about imputation strategies while exploring the data, so that preprocessing becomes the icing on the cake.
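As a rough sketch of that inspection, the snippet below summarizes missing values per feature with Pandas. The phone data and column names are invented for illustration, and dropping rows is only one possible choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical phone data with gaps in two features
df = pd.DataFrame({
    "n_sim": [1, np.nan, 2, 3, np.nan, 1],
    "ram_gb": [4, 8, np.nan, 6, 8, 4],
    "price_range": ["low", "mid", "mid", "high", "mid", "low"],
})

# Per-feature summary: how much is missing, and what kind of feature it is
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (df.isna().mean() * 100).round(1),
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),  # NaN is not counted as a unique value
})
print(missing)

# Why the mean is a poor fill value here: it is not a valid SIM count
print(df["n_sim"].mean())

# One conservative option: drop the rows where the SIM count is unknown
cleaned = df.dropna(subset=["n_sim"])
print(len(cleaned))
```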
Summary statistics are underrated, yet they give a broad overview of the dataset. In this step, I like to examine the maximum and minimum values of each feature, the most frequent values, medians, means, and so forth.
These can seem trivial because they are so easy to display (with tools like Pandas), but if you spend time thinking about what they mean for your analysis, you can learn a lot about the data.
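For instance, a minimal Pandas sketch of these statistics could look like the following. The columns and values are hypothetical.

```python
import pandas as pd

# Hypothetical data with one numeric and one categorical feature
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 35],
    "city": ["Kigali", "Nairobi", "Kigali", "Lagos", "Kigali"],
})

# Numeric features: count, mean, std, min, quartiles, max
numeric_stats = df.describe()
print(numeric_stats)

# A categorical feature: count, number of unique values, top value, its frequency
city_stats = df["city"].describe()
print(city_stats)

# Individual statistics are also available directly
print(df["age"].median())
print(df["city"].value_counts())
```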
Data leakage is one of the hidden pitfalls in the ML lifecycle. One cause of it is leaked features. What are they?
Leaked features are features that contain nearly the same information as the thing you're predicting. Take an example: if you're predicting the annual salary of an employee, a monthly salary feature is a huge leak. Annual salary is monthly salary times 12, so the model will pick this up quickly, look overly optimistic during evaluation, and rely on that feature alone to make predictions. The effect? The model will likely fail when deployed.
So during data analysis, it's worth inspecting the features to see whether any of them may be a leak.
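One simple heuristic, sketched below, is to flag features that correlate almost perfectly with the target. The salary data is invented and the 0.99 threshold is an arbitrary assumption. A near-perfect correlation only marks a feature as suspicious; deciding whether it is really a leak still takes domain knowledge.

```python
import pandas as pd

# Hypothetical employee data; annual_salary is the prediction target
df = pd.DataFrame({
    "years_experience": [1, 3, 5, 7, 10, 12],
    "monthly_salary": [2000, 3500, 2600, 5200, 4000, 5900],
    "annual_salary": [24000, 42000, 31200, 62400, 48000, 70800],
})
target = "annual_salary"

# Absolute correlation of every feature with the target, strongest first
corr_with_target = (
    df.corr()[target].drop(target).abs().sort_values(ascending=False)
)
print(corr_with_target)

# Assumed threshold: anything this close to 1 deserves a second look
suspects = corr_with_target[corr_with_target > 0.99]
print(list(suspects.index))
```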
In classification problems, imbalanced classes are very common. You may be building a classifier that recognizes cats and dogs, but you have 900 images of dogs and only 100 images of cats to build it. If you don't notice this early, your model can reach 90% accuracy simply by always predicting dogs, while missing nearly every cat. Why? The data is not balanced.
It is good practice to spot these kinds of things early, before they mislead your results once the project is complete.
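The cat/dog numbers above can be checked with a short sketch. The "model" here is just a hypothetical baseline that always predicts the majority class, which is enough to show why accuracy alone is misleading on imbalanced data.

```python
import pandas as pd

# The class distribution from the example: 900 dogs, 100 cats
labels = pd.Series(["dog"] * 900 + ["cat"] * 100, name="label")

counts = labels.value_counts()                # examples per class
shares = labels.value_counts(normalize=True)  # fraction per class
print(counts)
print(shares)

# Baseline that always predicts the most common class
majority_class = counts.idxmax()
predictions = pd.Series([majority_class] * len(labels))

accuracy = (predictions == labels).mean()
# Fraction of actual cats that the baseline labels as cats
cat_recall = (predictions[labels == "cat"] == "cat").mean()
print(accuracy, cat_recall)
```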
Checking Correlating Features
Correlating features carry overlapping information. The correlation between two features is a value between -1 and 1. When it is close to 1, the features move together and contain the same (or nearly the same) information. When it is close to -1, they move in opposite directions, which makes one just as predictable from the other. When it is close to 0, there is no linear relationship between them.
If two features contain the same information, you will likely want to remove one of them. The results are barely affected, and you have already reduced the size of your dataset.
It's important to check that! Below is a typical example of visualized correlations.
Image: Feature correlations computed on the Titanic dataset.
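A correlation check can be sketched with Pandas as follows. The weather-style columns are made up, and the 0.95 cutoff for calling a pair redundant is an assumption you would tune for your own data.

```python
import pandas as pd

# Hypothetical features: two temperature units, one inverse, one unrelated
df = pd.DataFrame({
    "temp_c": [10, 15, 20, 25, 30],
    "temp_f": [50, 59, 68, 77, 86],      # same information as temp_c
    "heating_hours": [8, 6, 4, 2, 0],    # falls as temperature rises
    "unrelated": [2, 5, 1, 5, 2],        # no linear link to temperature
})

corr = df.corr()  # Pearson correlation by default
print(corr.round(2))

# List each pair once, keeping only strong positive or negative correlations
pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.95
]
print(pairs)
```

Each flagged pair is a candidate for dropping one of its two features.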
Duplicate values are easy to overlook, especially if you are working with structured data. If the dataset contains duplicates, that is bad for the machine learning algorithm and will likely reduce its ability to generalize to new data.
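Checking for duplicates takes only a few lines in Pandas. The sketch below uses a made-up table and treats a row as a duplicate only when every column matches.

```python
import pandas as pd

# Hypothetical records; the third row repeats the first
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "age": [30, 25, 30, 41],
})

# Rows that repeat an earlier row exactly
n_duplicates = df.duplicated().sum()
print(n_duplicates)

# Show every copy of the duplicated rows, including the first occurrence
print(df[df.duplicated(keep=False)])

# Keep the first occurrence of each row and drop the rest
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```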
Biases and Outliers
Biases are not easy to spot. It takes rigorous work to go through the data and learn whether it is free of them. Bias is especially common where the data involves underrepresented groups, for example by race, country, or gender. To make sure we are building effective learning systems that don't exclude minorities, it's worth checking for these kinds of biases, as they can undermine the whole project.
As for outliers, you want to spot them too. Outliers (or anomalies) are unusual data points in the dataset. An outlier can be a value beyond the normal range, such as an age of 5000 in an age feature, or a value recorded in the wrong unit.
One way to spot anomalies is to go through the data, using either visualization or common sense. The right way to deal with outliers depends on the dataset. Sometimes removing them helps the analysis, and sometimes it harms it.
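Besides eyeballing the data, a common rule of thumb is the interquartile range (IQR) rule, sketched below. This is one technique among several and not necessarily the right one for every dataset. The ages are invented, and the 1.5 multiplier is the conventional default, not a law.

```python
import pandas as pd

# Hypothetical age feature with one impossible value
ages = pd.Series([23, 31, 27, 45, 38, 5000, 29, 52, 34, 41], name="age")

# IQR rule: flag values far outside the middle 50% of the data
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper)
print(outliers)
```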
Seasonalities and Trends
You don't usually come across these unless you are doing time series analysis, but any structured data with a time component can have them. Let me give you an example.
I was once working with structured data that had some numerical and time features. Among these features were demand and date (recorded over two years). When I analyzed the demand across those two years, I found both seasonalities and trends.
In time series analysis, seasonality is a pattern that repeats at a consistent interval. Take a different example: it has been observed that shopping sites experience a consistent increase in page visitors during the weekends.
A trend, on the other hand, is a prolonged increase or decrease in a given feature. A classic example is Moore's law, the observation that the number of transistors on a chip doubles roughly every two years, which has held approximately for the last 50 years or so. Moore's law was backed by the advent of microprocessors and the need for computation power.
In real life, you can have trends mixed with seasonalities, and sometimes noise comes into the picture. I find it a great insight when I can move from just looking at graphs to actually spotting things like these.
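To make the idea concrete, here is a rough sketch on synthetic data. It is not the dataset from my example: the two years of daily demand below are generated with an upward trend, a weekend bump, and random noise, all of which are assumptions. A rolling mean over whole weeks estimates the trend, and averaging what is left by day of week exposes the weekly seasonality.

```python
import numpy as np
import pandas as pd

# Synthetic daily demand over two years: trend + weekend bump + noise
dates = pd.date_range("2021-01-01", periods=730, freq="D")
rng = np.random.default_rng(0)
trend = np.linspace(100, 200, len(dates))
weekend_bump = np.where(dates.dayofweek >= 5, 30, 0)  # Saturday=5, Sunday=6
noise = rng.normal(0, 5, len(dates))
demand = pd.Series(trend + weekend_bump + noise, index=dates, name="demand")

# Trend: a 28-day (four full weeks) rolling mean smooths out the weekly pattern
trend_estimate = demand.rolling(window=28, center=True).mean()

# Seasonality: average of the detrended demand for each day of the week
detrended = demand - trend_estimate
weekly_profile = detrended.groupby(detrended.index.dayofweek).mean()
print(weekly_profile.round(1))
```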
More Exploratory Analysis
After learning all of the above about the data, it can also be good to visualize different features. For categorical features, that might mean looking at the categories and the number of examples in each. You can also plot the relationship between different features using scatter plots, or the distribution of individual features using histograms.
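The numbers behind such plots can be sketched without any plotting library. The phone data below is hypothetical, and the choice of 3 bins is arbitrary.

```python
import pandas as pd

# Hypothetical phone data
df = pd.DataFrame({
    "brand": ["A", "B", "A", "C", "A", "B"],
    "ram_gb": [2, 4, 4, 8, 6, 3],
    "price": [120, 200, 210, 450, 330, 160],
})

# Categorical feature: number of examples in each category
category_counts = df["brand"].value_counts()
print(category_counts)

# Distribution of a numeric feature: the counts a histogram would draw
histogram = pd.cut(df["price"], bins=3).value_counts().sort_index()
print(histogram)

# Relationship between two features: the number a scatter plot hints at
relationship = df["ram_gb"].corr(df["price"])
print(round(relationship, 2))
```

If matplotlib is installed, `df["price"].plot.hist()` and `df.plot.scatter(x="ram_gb", y="price")` draw the corresponding charts.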
There is no limit to what can be done, either here or in the earlier steps. The ultimate goal is to become one with the data (CC: Andrej).
The Bottom Line
When done well, exploratory data analysis can help us understand a lot about the dataset: missing values, correlating features, biases, outliers, and much more. As Andrej Karpathy said, the goal is to 'become one with the data'.
Thank you for reading!