Machine Learning 101: Linear Regression in Python
Linear Regression is probably the first ML algorithm any Data Scientist encounters in their journey. As a matter of fact, it is the easiest ML algorithm to learn conceptually. Let’s take a look.
Classification vs Regression
Machine Learning essentially deals with two kinds of problems:
- Classification: predicting a class, for example whether a user is male or female (the two classes) given their history of purchased items.
- Regression: predicting a value, for example the price (the value) of a used car given the model, the age, the kilometers on the odometer.
It is important to remember that Machine Learning is no magic: ML algorithms are still algorithms, with multiple inputs and one output. The most important difference between a traditional algorithm and an ML one is the “experience” the ML algorithm gains during the training phase.
In Classification problems the algorithm tries to predict the class an entry will fall into; there may be two classes (as in the example above, male versus female) or more than two. The former case is often called Binary Classification, while the latter is referred to as Multiclass Classification.
In Regression there is no class to predict; instead there is a scale, and the algorithm tries to predict a value on that scale. In the example above the price is the sought value.
Linear Regression in Python
Linear Regression is the most basic algorithm of Machine Learning and it is usually the first one taught. Linear Regression is usually applied to Regression problems; you may also apply it to a Classification problem, but you will soon discover it is not a good idea. Although the term may seem fancy, the idea behind it is pretty easy to understand.
Let’s suppose we have two variables X and y, and imagine that X and y grow at the same rate: adding 1 to X means that y also grows by 1. If you plot these two variables you will get a straight line. Knowing the equation of that line (y = mX + q) enables you to compute y for any X.
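Here is a minimal sketch of that idea in plain Python (the slope m and intercept q are made-up numbers, purely for illustration):
m, q = 1.0, 0.5  # hypothetical slope and intercept
xs = [0, 1, 2, 3, 4]
ys = [m * x + q for x in xs]  # each +1 step in x adds m (here 1.0) to y
print(ys)  # [0.5, 1.5, 2.5, 3.5, 4.5]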
Linear Regression is about finding that line from a number of observations (X and y). As with every ML algorithm, the more observations you have, and the more accurate they are, the better the algorithm will be at predicting the outcome.
Linear Regression using fish (regression problem)
The following notebook uses the Fish market dataset available here; it is free and released under the GPL2 license. This dataset includes a number of species of fish and, for each fish, some measurements such as weight and height.
(Basic) Exploratory Data Analysis
Every good ML project should start with an in-depth Exploratory Data Analysis (EDA). In the EDA you should always try to explore the data as much as possible; through exploration it is possible to infer basic features of the data, and from those basic inferences you can start developing an intuition. From there you can formulate hypotheses and implement the algorithm you see fit.
As the purpose of this notebook is to illustrate Linear Regression applied to a Regression problem, the EDA performed here will outline just the basic features of the dataset.
Firstly, let's import basic utilities:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(101) # This is needed so that if you run this notebook again you will get the same results
Let's now read the csv file (keep in mind it should be in the same folder as the notebook!):
df = pd.read_csv('Fish.csv')
Let's take a look at the first rows of the dataset. It is important to do this in order to get a basic understanding.
df.head()
Now let's take a closer look at the dataset to get important statistical indicators such as the mean and standard deviation:
df.describe()
Let's now plot each numerical feature against each other; to get a clear distinction, use a high-contrast palette (Viridis) and use the species as the hue.
sns.pairplot(df, hue='Species', palette='viridis')
As you can see there is quite a strong pattern between Length1 and Length2, Length2 and Length3, and Length3 and Length1. This is perfect to get started. What kind of secrets do these three lengths hide? Are you able to speculate about what Length1 and Length2 are?
Linear Regression (one independent variable): Let's predict Length2 Knowing Length1
It is important to reshape the two variables (X and y) into single-column two-dimensional arrays; if you don't do this the model will throw an error.
X = np.array(df['Length1']).reshape(-1, 1)  # independent variable, as a single-column 2-D array
y = np.array(df['Length2']).reshape(-1, 1)  # dependent variable, as a single-column 2-D array
Split the dataset into two parts: train and test. This is needed to calculate the accuracy (and many other metrics) of the model. We will use the train part during training and the test part during evaluation. The model will not see the test part during its training.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
Import the model and instantiate it:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
Now let's train the model:
lr.fit(X_train, y_train)
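You can optionally peek at the parameters the model has learned: they are the m (slope) and q (intercept) of the line equation seen earlier.
print(lr.coef_)       # the learned slope m (a 2-D array here, since y was reshaped to a column)
print(lr.intercept_)  # the learned intercept q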
Let's review how the model is doing using the R2. R2 is a statistical indicator to know whether the model is "a good fit" and how well it performs. In this case (one independent variable) the R2 is equal to the square of the Pearson correlation coefficient. R2 usually assumes values between 0.0 and 1.0, where 0 means the worst fit and 1 a perfect fit (on unseen data it can even turn negative when the model performs worse than simply predicting the mean).
lr.score(X_test, y_test)
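If you want to demystify that number, here is a minimal sketch computing R2 by hand from its definition (1 - SS_res / SS_tot); it should match the value returned by lr.score:
y_pred = lr.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                      # same value as lr.score(X_test, y_test)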
Pretty high! That's due to the fact that the two variables (Length1 and Length2) form a straight line, as observed during the EDA. Let's now plot the predicted values against the test part of the dataset.
plt.scatter(X_test, y_test)
plt.plot(X_test, lr.predict(X_test), color='red')
plt.show()
As you can see the line is pretty close to the points, although not perfect. But why are these two lengths so closely related?
Conclusion: Let's take a step back
"Are you able to speculate about what Length1 and Lenght2 are?"
You followed the notebook up to now without knowing what Length1 and Length2 are. Let's take a step back.
Length1 represents the vertical length in centimeters, while Length2 is the diagonal length in centimeters. Kudos if you got that right without reading the solution.
If you look at it now, it is only natural to think: "If the fish is inscribed within a rectangle, the more the diagonal grows, the more the height of the rectangle will grow."
Linear Regression (multiple independent variables): Let's predict weight
The procedure to predict the weight of the fish using Linear Regression is pretty similar to the last one. The only notable difference is that there are multiple independent variables. Since it is not the purpose of this notebook to explain how to represent categorical variables for use with an ML model, the "Species" variable will be dropped entirely (it will not contribute to the prediction).
X = df.drop(['Weight', 'Species'], axis=1)
y = df['Weight']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)
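With multiple independent variables there is one learned coefficient per feature. As a quick, optional check (using the column names of X), you can pair them up:
print(pd.Series(lr.coef_, index=X.columns))  # one learned coefficient per feature
print(lr.intercept_)                         # the learned intercept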
The R2 score is not as high as it was earlier, yet it is still pretty high. Since it is pretty difficult to plot the predicted values against the actual values in multiple dimensions, let's take a look at them as they are.
df_t = df.copy()
df_t['Predicted Weight'] = lr.predict(df_t.drop(['Weight', 'Species'], axis=1))
df_t['Difference'] = df_t['Weight'] - df_t['Predicted Weight']
df_t[['Weight', 'Predicted Weight', 'Difference']].head(20)
Conclusion: The model is a good fit, but is it the best it can do?
The model is not performing well (or rather, not as well as hoped) for this problem and data. The predicted weight (in grams) is sometimes really close (-4) and sometimes far off (-212).
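To put a single number on these errors, a common choice is the mean absolute error. Here is a minimal sketch using scikit-learn's metrics, evaluated on the test split:
from sklearn.metrics import mean_absolute_error
y_pred = lr.predict(X_test)
print(mean_absolute_error(y_test, y_pred))  # average absolute error, in grams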
The EDA didn't cover some basics such as feature selection and removing outliers. Also, the model was deprived of a feature: the Species, which, as you might imagine, may influence the weight of a fish. Another important factor is the size of the dataset: usually larger datasets lead to more accurate results, provided the data is of good quality.
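For the curious, a minimal sketch of how the Species feature could be brought back in is one-hot encoding, for example with pandas' get_dummies (this is just the idea, it is not evaluated in this notebook):
X_full = pd.get_dummies(df.drop('Weight', axis=1), columns=['Species'])  # one 0/1 column per species
y_full = df['Weight']
# X_full can now be split and fitted exactly like before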
Even though R2 is ~0.85, the model is not doing a great job at predicting this value. Always observe the data and don't apply techniques blindly!
Conclusion
You have learnt what Linear Regression is, the intuition behind the technique, and how to apply it to one or multiple independent variables to predict a value. Yet the most important lesson is embedded in the last sentence of the notebook: “Always observe the data and don’t apply techniques blindly!“