Machine Learning 101: Outliers introduction
Building a Machine Learning model might become easier by the day, but one rule of thumb still holds: garbage input equals garbage predictions. Outliers are observations that differ significantly from the other observations in a dataset. Having outliers in your data will hinder your models, so let's discover what they are, how to detect them, and how to remove them.
A first encounter with Outliers
In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses. (Wikipedia, the free encyclopedia)
As mentioned, an outlier is something of an anomaly in the data. Most of the real-world data you will encounter during your journey in data science and machine learning will include some outliers. Many Machine Learning models struggle with outliers; models that are not sensitive to outliers are called robust models.
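To see what "robust" means in practice, here is a minimal sketch (my own addition, with made-up numbers) comparing the mean, which is sensitive to outliers, with the median, which is robust:
import numpy as np

# Five ordinary fish weights plus one extreme value (numbers are made up)
weights = np.array([120.0, 135.0, 140.0, 150.0, 160.0])
with_outlier = np.append(weights, 5000.0)

print(np.mean(weights), np.median(weights))            # 141.0 140.0
print(np.mean(with_outlier), np.median(with_outlier))  # ~950.8 145.0
A single extreme value drags the mean far away from the bulk of the data, while the median barely moves; robust models rely on statistics that behave more like the median.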
It is therefore essential for any data scientist to know what an outlier looks like, how to detect and remove outliers, and, most importantly, why this should be done. In the following notebook I will show you a common method used to detect (and remove) outliers. This method is based on quartiles and the Interquartile Range (IQR), it is integrated in most plotting libraries, and it is known as "Tukey's Fences".
Outliers in fish
The following notebook uses the Fish market dataset, available here; it is free and released under the GPL2 license. This dataset includes a number of fish species and, for each fish, some measurements such as weight and height.
(Basic) Exploratory Data Analysis
Every good ML project should start with an in-depth Exploratory Data Analysis (EDA). In the EDA you should always try to explore the data as much as possible; through exploration you can infer basic features of the data, and from those basic inferences you can start developing an intuition. From there you can formulate hypotheses and implement the algorithm you see fit.
As the purpose of this notebook is to illustrate what outliers are, it will only cover this topic.
Firstly, let's import basic utilities:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Some font improvements
plt.rc('font', size=14)
plt.rc('axes', titlesize=14)
plt.rc('axes', labelsize=14)
Let's now read the CSV file (keep in mind it should be in the same folder as the notebook!):
df = pd.read_csv('Fish.csv')
Let's take a look at the first rows of the dataset. It is important to do this in order to get a basic understanding.
df.head()
Now let's take a closer look at the dataset to get important statistical indicators such as the mean and standard deviation:
df.describe()
Let's now plot each numerical feature against every other one. To distinguish the species clearly, we use a high-contrast palette (Viridis) and color the points by species.
sns.pairplot(df, hue='Species', palette='viridis')
Do you notice something strange? Is there some observation that catches your eye?
Outlier detection using matplotlib/seaborn
Boxplots offer immediate visual feedback to isolate outliers. Outliers are plotted as little diamonds. Can you spot something in the following plots?
fig, axes = plt.subplots(ncols=2, nrows=3, figsize=(15, 18), sharey=False)
sns.boxplot(data=df, y='Weight', x='Species', ax=axes[0][0])
sns.boxplot(data=df, y='Width', x='Species', ax=axes[0][1])
sns.boxplot(data=df, y='Height', x='Species', ax=axes[1][0])
sns.boxplot(data=df, y='Length1', x='Species', ax=axes[1][1])
sns.boxplot(data=df, y='Length2', x='Species', ax=axes[2][0])
sns.boxplot(data=df, y='Length3', x='Species', ax=axes[2][1])
Outlier detection using Tukey's Fences
Tukey's Fences define a convenient interval for inliers (non-outliers): everything outside the interval $[Q_1 - k \cdot IQR,\ Q_3 + k \cdot IQR]$ should be considered an outlier. Here $Q_1$ is the first quartile, $Q_3$ is the third quartile, and $IQR = Q_3 - Q_1$ is the Interquartile Range. $k$ is an arbitrary constant: Tukey proposed 1.5 for outliers and 3 for "far" outliers. This is the same method used by matplotlib and seaborn (with $k=1.5$) when plotting boxplots.
Let's define the following function to achieve the desired behavior.
def outliers(df, index, k=1.5):
    """Return the rows of df whose value in column `index` lies outside Tukey's Fences."""
    q1 = df[index].quantile(0.25)  # first quartile
    q3 = df[index].quantile(0.75)  # third quartile
    iqr = q3 - q1                  # interquartile range
    # Rows below the lower fence or above the upper fence are outliers
    return df[(df[index] < q1 - k * iqr) | (df[index] > q3 + k * iqr)]
Let's now take a look at Roach weights.
sns.boxplot(data=df[df['Species'] == 'Roach'], y='Weight', width=0.2)
outliers(df[df['Species'] == 'Roach'], 'Weight')
As you can see there are three outliers among the Roach species when you consider their weight. The first one doesn't even have a weight, so it's an obvious outlier. But what happens if we take a look at the weight of all the species?
outliers(df, 'Weight')
The three Roach outliers are no longer detected. That is because, when all the species are considered together, they fall within the two Tukey's Fences. So are they actually outliers? The answer isn't easy. The fact is that you must always look at the data: if there were some magic algorithm that could delete all outliers without failing, it would be at the very base of every ML algorithm, right? You must not overdo it when discarding observations; your goal should not be to eliminate the diamonds from each and every boxplot you can think of. A sensible middle ground is to look for outliers within each species, as sketched below.
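Here is a minimal sketch (my own addition, reusing the outliers() function defined above) that runs the detection separately for each species instead of on the whole dataset at once:
# Detect weight outliers within each species rather than globally
for species, group in df.groupby('Species'):
    flagged = outliers(group, 'Weight')
    if not flagged.empty:
        print(f"{species}: {len(flagged)} weight outlier(s)")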
Let's take a look at the other dimensions and their outliers.
fig, axes = plt.subplots(ncols=2, figsize=(15, 5), sharey=False)
sns.boxplot(data=df.drop('Weight', axis=1), ax=axes[0])
sns.boxplot(data=df, y='Weight', x='Species', ax=axes[1])
Let's take a look once again at the Roach species, this time at their Length2 feature:
outliers(df[df['Species'] == 'Roach'], 'Length2')
As you can see, the fish that had no weight associated is not here; however, the fish with a weight of 390 is. Being flagged on more than one feature is an indicator that the observation is indeed an outlier.
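To make this kind of cross-check systematic, here is a small sketch (my own addition, again reusing outliers()) that counts, for each Roach, how many of the measurement columns flag it; a fish flagged on several features is a stronger outlier candidate:
# Count how many measurement columns flag each Roach as an outlier
roach = df[df['Species'] == 'Roach']
features = ['Weight', 'Width', 'Height', 'Length1', 'Length2', 'Length3']
flags = pd.Series(0, index=roach.index)
for feature in features:
    flags[outliers(roach, feature).index] += 1
print(roach[flags > 0].assign(n_flags=flags[flags > 0]))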
Removing outliers
You can use the following function if you want to remove the outliers of a feature across the entire dataset. It will return the DataFrame without those outliers.
def drop_outliers(df, index, k=1.5):
    """Return df without the rows flagged as outliers on the given column."""
    return df.drop(outliers(df, index, k).index)
df = drop_outliers(df, 'Weight')
If you want to remove outliers based on both a class and a feature (such as the weight of the Roach species) you will have to do something a bit more involved:
df = df.drop(outliers(df[df['Species'] == 'Roach'], 'Weight').index)
fig, axes = plt.subplots(ncols=2, figsize=(15, 5), sharey=False)
sns.boxplot(data=df.drop('Weight', axis=1), ax=axes[0])
sns.boxplot(data=df, y='Weight', x='Species', ax=axes[1])
Conclusion
As you can see, removing outliers in this way has created new outliers in the Pike species when it comes to weight: once some observations are removed, the quartiles and fences are recomputed on the reduced data, so points that were previously inliers can fall outside the new fences. Outliers can be harmful to Machine Learning models, but removing them without care may reduce your dataset considerably.
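Before committing to a removal, it can help to quantify its cost first. Here is a minimal sketch (my own addition, assuming the original df is reloaded and the helpers above are defined) that reports how many rows each per-feature removal would discard:
# How many rows would dropping the outliers of each feature discard?
features = ['Weight', 'Width', 'Height', 'Length1', 'Length2', 'Length3']
for feature in features:
    n_dropped = len(df) - len(drop_outliers(df, feature))
    print(f"{feature}: would drop {n_dropped} of {len(df)} rows")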