Linear Regression using fish (regression problem)

The following notebook uses the Fish market dataset available here; it is free and released under the GPL2 license. The dataset includes several species of fish and, for each fish, measurements such as weight and height.

(Basic) Exploratory Data Analysis

Every good ML project should start with an in-depth Exploratory Data Analysis (EDA). During the EDA you should explore the data as much as possible: exploration lets you infer basic features of the data, and from those inferences you can start building an intuition. From there you can formulate hypotheses and implement the algorithm you see fit.

As the purpose of this notebook is to illustrate Linear Regression applied to a regression problem, the EDA performed here will outline just the basic features of the dataset.

Firstly, let's import basic utilities:

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(101) # This is needed so that if you run this notebook again you will get the same results

Let's now read the CSV file (keep in mind it should be in the same folder as the notebook!):

In [2]:
df = pd.read_csv('Fish.csv')

Let's take a look at the first rows of the dataset to get a basic feel for its structure.

In [3]:
df.head()
Out[3]:
Species Weight Length1 Length2 Length3 Height Width
0 Bream 242.0 23.2 25.4 30.0 11.5200 4.0200
1 Bream 290.0 24.0 26.3 31.2 12.4800 4.3056
2 Bream 340.0 23.9 26.5 31.1 12.3778 4.6961
3 Bream 363.0 26.3 29.0 33.5 12.7300 4.4555
4 Bream 430.0 26.5 29.0 34.0 12.4440 5.1340
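
Before going further, it is also worth checking the column types and whether any values are missing; a quick check using standard pandas calls (output omitted here):

df.info()            # column names, dtypes, and non-null counts
df.isnull().sum()    # number of missing values per column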

Now let's take a closer look at the dataset to get important statistical indicators such as the mean and standard deviation:

In [4]:
df.describe()
Out[4]:
Weight Length1 Length2 Length3 Height Width
count 159.000000 159.000000 159.000000 159.000000 159.000000 159.000000
mean 398.326415 26.247170 28.415723 31.227044 8.970994 4.417486
std 357.978317 9.996441 10.716328 11.610246 4.286208 1.685804
min 0.000000 7.500000 8.400000 8.800000 1.728400 1.047600
25% 120.000000 19.050000 21.000000 23.150000 5.944800 3.385650
50% 273.000000 25.200000 27.300000 29.400000 7.786000 4.248500
75% 650.000000 32.700000 35.500000 39.650000 12.365900 5.584500
max 1650.000000 59.000000 63.400000 68.000000 18.957000 8.142000
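
One detail stands out: the minimum Weight is 0.0, which is physically impossible for a fish and likely a data-entry error. A quick way to isolate such rows (a sanity check, not pursued further in this notebook):

df[df['Weight'] == 0]  # rows with a recorded weight of zero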

Let's now plot each numerical feature against each other. To make the species easy to tell apart, use a high-contrast palette (viridis) and color the points by species.

In [5]:
sns.pairplot(df, hue='Species', palette='viridis')
Out[5]:
<seaborn.axisgrid.PairGrid at 0x1f7b30cd188>

As you can see there is quite a strong pattern between Length1 and Length2, Length2 and Length3, and Length3 and Length1. This is perfect to get started. What secrets do these three lengths hide? Can you speculate about what Length1 and Length2 are?
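
To back up what the pairplot suggests, you can quantify these patterns by computing the pairwise Pearson correlations of the numeric columns (a standard pandas call, output omitted):

df.drop('Species', axis=1).corr()  # pairwise correlations of the numeric features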

Linear Regression (one independent variable): Let's predict Length2 knowing Length1

It is important to reshape the arrays: scikit-learn expects the feature matrix X to be two-dimensional (one column per feature), so we reshape with reshape(-1, 1); if you pass a one-dimensional X the model will throw an error.

In [6]:
X = np.array(df['Length1']).reshape(-1, 1) 
y = np.array(df['Length2']).reshape(-1, 1)

Split the dataset into two parts: train and test. This is needed to measure the model's performance (via R2 and other metrics) on data it has never seen: we will use the train part during training and the test part only during evaluation.

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=101)

Import the model and instantiate it:

In [8]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

Now let's train the model:

In [9]:
lr.fit(X_train, y_train)
Out[9]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
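
Since there is a single independent variable, the fitted model is just a line y = a*x + b, and you can inspect the learned parameters directly via scikit-learn's standard attributes:

print(lr.coef_)       # slope a
print(lr.intercept_)  # intercept b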

Let's review how the model is doing using R2 (the coefficient of determination), a statistical indicator of whether the model is "a good fit" and how well it performs. In this case (one independent variable) R2 equals the square of the Pearson correlation coefficient between the two variables. R2 ranges from 0.0 (worst fit) to 1.0 (perfect fit); on unseen test data it can even become negative.

In [10]:
lr.score(X_test, y_test)
Out[10]:
0.9990508254380637
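
You can check the relationship with the Pearson coefficient yourself: on the training data, the R2 of a simple linear regression equals the squared Pearson correlation. A quick sanity check:

r = np.corrcoef(X_train.ravel(), y_train.ravel())[0, 1]
print(r ** 2)                      # squared Pearson correlation
print(lr.score(X_train, y_train))  # training R2 -- the two match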

Pretty high! That's because the two variables (Length1 and Length2) lie almost on a straight line, as observed during the EDA. Let's now plot the predicted values against the test part of the dataset.

In [11]:
plt.scatter(X_test, y_test)
plt.plot(X_test, lr.predict(X_test), color='red')
plt.show()
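
The plot gives a visual check; to also quantify the fit in the original units (centimeters), you can compute the mean absolute error and root mean squared error using sklearn.metrics:

from sklearn.metrics import mean_absolute_error, mean_squared_error

y_pred = lr.predict(X_test)
print(mean_absolute_error(y_test, y_pred))          # average absolute error in cm
print(np.sqrt(mean_squared_error(y_test, y_pred)))  # RMSE in cm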

As you can see from the plot, the line is pretty close, although not perfect. But why these two variables in particular?

Conclusion: Let's take a step back

"Are you able to speculate about what Length1 and Lenght2 are?"

You followed the notebook up to now without knowing what Length1 and Length2 are. Let's take a step back.

Length1 represents the vertical length in centimeters, while Length2 is the diagonal length in centimeters. Kudos if you got that right without reading the solution.

If you look at it now, it feels natural to think: "If the fish is inscribed within a rectangle, the more the diagonal grows, the more the height of the rectangle will grow."

Linear Regression (multiple independent variables): Let's predict weight

The procedure to predict the weight of the fish using Linear Regression is very similar to the previous one. The only notable difference is that there are multiple independent variables. Since explaining how to encode categorical variables for an ML model is beyond the purpose of this notebook, the Species variable will be dropped entirely (it will not contribute to the prediction).
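
For reference, if you did want to keep Species, a common approach is one-hot encoding; a minimal sketch with pandas (not used in the rest of the notebook):

X_with_species = pd.get_dummies(df.drop('Weight', axis=1), columns=['Species'])  # one 0/1 column per species

Below, we stick to the numeric features only.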

In [12]:
X = df.drop(['Weight', 'Species'], axis=1)
y = df['Weight']
In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=101) 
In [14]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
In [15]:
lr.fit(X_train, y_train)
Out[15]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [16]:
lr.score(X_test, y_test)
Out[16]:
0.8642224833122479
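
With multiple independent variables the model learns one coefficient per feature; pairing them with the column names makes them easier to read (a quick inspection, output omitted):

pd.Series(lr.coef_, index=X.columns)  # one learned coefficient per feature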

As you can see the R2 score is not as high as it was earlier, yet it is still pretty high. Since it is hard to plot predictions against five features at once, let's look at the predicted and actual values side by side.

In [17]:
df_t = df.copy()
In [18]:
df_t['Predicted Weight'] = lr.predict(df_t.drop(['Weight', 'Species'], axis=1))
In [19]:
df_t['Difference'] = df_t['Weight'] - df_t['Predicted Weight']
In [20]:
df_t[['Weight', 'Predicted Weight', 'Difference']].head(20)
Out[20]:
Weight Predicted Weight Difference
0 242.0 329.661992 -87.661992
1 290.0 374.756426 -84.756426
2 340.0 384.719326 -44.719326
3 363.0 427.226647 -64.226647
4 430.0 463.235780 -33.235780
5 450.0 465.648356 -15.648356
6 500.0 503.924833 -3.924833
7 390.0 470.949250 -80.949250
8 450.0 509.362339 -59.362339
9 500.0 541.636382 -41.636382
10 475.0 535.761445 -60.761445
11 500.0 542.386113 -42.386113
12 500.0 511.929040 -11.929040
13 340.0 551.704054 -211.704054
14 600.0 577.196006 22.803994
15 600.0 612.411903 -12.411903
16 700.0 600.716507 99.283493
17 700.0 593.223367 106.776633
18 610.0 625.015601 -15.015601
19 650.0 636.927079 13.072921
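
Although plotting predictions against five features at once is impractical, a predicted-versus-actual scatter plot is a handy substitute: points lying on the diagonal are perfect predictions. A minimal sketch:

plt.scatter(df_t['Weight'], df_t['Predicted Weight'])
plt.plot([0, 1650], [0, 1650], color='red')  # y = x reference line (1650 g is the max weight)
plt.xlabel('Weight')
plt.ylabel('Predicted Weight')
plt.show()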

Conclusion: The model is a good fit, but is it the best it can do?

The model is not performing well (or rather, not as well as hoped) on this problem and data: the predicted weight (in grams) is sometimes really close (about -4) and sometimes way off (about -212).

The EDA didn't cover some basics such as feature selection and outlier removal. The model was also deprived of a feature, Species, which, as you might imagine, may influence the weight of a fish. Another important factor is the size of the dataset: larger datasets usually lead to more accurate results, provided the data quality is good.
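
With only 159 rows, a single train/test split can also give a noisy performance estimate; averaging the score over several splits via cross-validation is one way to get a more stable number. A minimal sketch using scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores.mean())  # average R2 across 5 folds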

Even though R2 is ~0.86, the model is not a great fit for predicting the weight. Always look at the data and don't apply techniques blindly!