The following notebook uses the Fish market dataset, available here; it is free and released under the GPL2 license. The dataset contains several species of fish and, for each fish, a number of measurements such as weight and height.
Every good ML project should start with an in-depth Exploratory Data Analysis (EDA). During the EDA you should explore the data as much as possible: exploration reveals the basic features of the data, and from those you can start building an intuition. From there you can formulate hypotheses and implement the algorithm you see fit.
As the purpose of this notebook is to illustrate Linear Regression applied to a regression problem, the EDA performed here will only outline the basic features of the dataset.
Firstly, let's import basic utilities:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(101) # This is needed so that if you run this notebook again you will get the same results
Let's now read the csv file (keep in mind it should be in the same folder as the notebook!):
df = pd.read_csv('Fish.csv')
Let's take a look at the first rows of the dataset; this is important in order to get a basic understanding of its contents.
df.head()
Now let's take a closer look at the dataset to get important statistical indicators such as the mean and standard deviation:
df.describe()
Let's now plot each numerical feature against every other. To tell the species apart at a glance, use the Species column as the hue with a high-contrast palette (viridis):
sns.pairplot(df, hue='Species', palette='viridis')
As you can see there is quite a strong pattern between Length1 and Length2, Length2 and Length3, and Length3 and Length1. This is perfect to get started. What secrets do these three lengths hide? Are you able to speculate about what Length1 and Length2 are?
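Before moving on, one quick way to quantify those patterns (a small sketch, using only columns already in the dataset) is the pairwise Pearson correlation between the three length columns:
# Pairwise correlations between the three length measurements
df[['Length1', 'Length2', 'Length3']].corr()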
It is important to reshape the two arrays (X and y) into column vectors; scikit-learn expects X to be two-dimensional, and the model will throw an error otherwise.
X = np.array(df['Length1']).reshape(-1, 1)
y = np.array(df['Length2']).reshape(-1, 1)
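After the reshape, both arrays are column vectors of shape (n_samples, 1), which is the layout scikit-learn expects for X. A quick check:
X.shape, y.shape  # both should be (n_samples, 1) after the reshape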
Split the dataset in two parts: train and test. This is needed to calculate the accuracy (and many other metrics) of the model. We will use the train part during the training, and the test part during the evaluation. The model will not see the test part during its training.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
Import the model and instantiate it:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
Now let's train the model:
lr.fit(X_train, y_train)
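The fitted model is just a straight line, y ≈ slope · x + intercept. As an illustration, one can inspect the learned parameters and predict Length2 for a made-up Length1 value (25.0 here is purely hypothetical):
print(lr.coef_, lr.intercept_)  # slope and intercept of the fitted line
lr.predict(np.array([[25.0]]))  # predicted Length2 for a hypothetical Length1 of 25.0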
Let's review how the model is doing using R2, the coefficient of determination. R2 is a statistical indicator of whether the model is "a good fit" and how well it performs. In this case (a single independent variable fitted with ordinary least squares) the R2 on the training data equals the square of the Pearson correlation coefficient between the two variables. R2 is at most 1.0: 1 means a perfect fit, 0 means the model does no better than always predicting the mean, and on held-out data it can even be negative.
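As a quick sanity check of that claim (a sketch, assuming scipy is installed), compare the squared Pearson correlation on the training data with the model's R2 on the same data:
from scipy.stats import pearsonr
r, _ = pearsonr(X_train.ravel(), y_train.ravel())
print(r ** 2, lr.score(X_train, y_train))  # these two values should match
And now the score on the held-out test set: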
lr.score(X_test, y_test)
Pretty high! That's because the two variables (Length1 and Length2) lie along an almost straight line, as observed during the EDA. Let's now plot the predicted values against the test part of the dataset.
plt.scatter(X_test, y_test)
plt.plot(X_test, lr.predict(X_test), color='red')
plt.show()
As you can see the line is pretty close, although not perfect. But why these two variables in particular?
"Are you able to speculate about what Length1 and Lenght2 are?"
You followed the notebook up to now without knowing what Length1 and Length2 are. Let's take a step back.
Length1 is the vertical length in centimeters, while Length2 is the diagonal length in centimeters. Kudos if you got that right without reading the solution!
Looking at it now, it feels natural to think: "if the fish is inscribed within a rectangle, the more the diagonal grows, the more the height of the rectangle will grow."
The procedure to predict the weight of the fish using Linear Regression is very similar to the previous one. The only notable difference is that there are now multiple independent variables. Since explaining how to encode categorical variables for use with an ML model is not the purpose of this notebook, the Species column will be dropped entirely (it will not contribute to the model).
X = df.drop(['Weight', 'Species'], axis=1)
y = df['Weight']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)
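Before judging the score, it can be instructive to pair each learned coefficient with its feature name (a quick sketch; a positive coefficient means the predicted weight grows with that measurement):
pd.Series(lr.coef_, index=X.columns)  # one coefficient per measurement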
This R2 score is not as high as the earlier one, yet it is still pretty high. Since it is difficult to plot predictions against multiple independent variables, let's look at the predicted and actual values side by side.
df_t = df.copy()
# Note: these predictions cover the whole dataset, including rows the model was trained on
df_t['Predicted Weight'] = lr.predict(df_t.drop(['Weight', 'Species'], axis=1))
df_t['Difference'] = df_t['Weight'] - df_t['Predicted Weight']
df_t[['Weight', 'Predicted Weight', 'Difference']].head(20)
The model is not performing well (or rather, not as well as hoped) on this problem and data. The predicted weight (in grams) is sometimes really close (a difference of -4) and sometimes far off (-212).
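To summarize the typical error with a single number, one option (a sketch using sklearn.metrics) is the mean absolute error on the test set:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, lr.predict(X_test))  # average absolute error, in grams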
The EDA didn't cover some basics such as feature selection and outlier removal. Another important factor is the size of the dataset: larger datasets usually lead to more accurate results, provided the data quality is good. Finally, the model was deprived of a feature, Species, which, as you might imagine, can influence the weight of a fish.
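For instance, a possible next step (only a sketch, not pursued further here) would be to one-hot encode the Species column so the model could use it:
# One-hot encode Species so it can be fed to the model (a possible next step)
X_full = pd.get_dummies(df.drop('Weight', axis=1), columns=['Species'])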
Even though R2 is ~0.85, the model is not a good fit for predicting this value. Always observe the data and don't apply techniques blindly!