A Brief Overview of Simple Linear Regression and its Implementation in Python
Hello everyone! In this article, I'm going to give a brief overview of Simple Linear Regression and the various ways it can be implemented in Python. First, we'll go through the basic math, and then we'll jump into the code. I assume you already have some prior knowledge of Python and a few of its libraries, such as NumPy, Pandas, and Matplotlib. If not, don't worry: they are not that difficult, and you can pick them up quickly with a little effort if you already know Python. You can still follow this tutorial even if you are unfamiliar with those libraries. All you need is some passion and desire! So, without further ado, let's get started!
Table of Contents
1. Introduction
2. Import Essential Libraries
3. Data Collection
4. Data Analysis
5. Data Visualization
6. Data Pre-Processing
7. Model Building and Training
8. Model Evaluation
9. Assumptions
What is Regression?
Regression is a statistical technique used to model the relationship between dependent and independent variables, aiding prediction and trend forecasting. It involves two types of variables: independent (regressor/predictor/feature) and dependent (target/output), with the latter being continuous. For instance, predicting a person's percentage based on study hours and assessment marks involves the percentage as the dependent variable and study hours and assessment marks as independent variables. Regression can be linear or non-linear; linear regression assumes a linear relationship between the variables.
Understanding Simple Linear Regression
Simple Linear Regression (SLR) is a type of linear regression that focuses on just one independent variable and one dependent variable. If you recall the equation of a straight line from your school days, it typically appears as y=mx+c, where m represents the slope and c is the y-intercept. This equation is nothing but a simple linear regression model. Here, x is an independent variable, and y is a dependent variable. If we know the values of the slope and y-intercept, then we can easily calculate the value of y for a given value of x. This is known as a prediction. Leveraging Python, we can implement SLR to make precise predictions based on a given dataset.
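To make this concrete, suppose (purely as a made-up example) that the slope is 2 and the y-intercept is 1. Then for x = 3, the model predicts y = 2 × 3 + 1 = 7. Everything that follows is about estimating that slope and intercept from data.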
Implementation of Simple Linear Regression in Python
Let’s embark on the practical implementation of SLR in Python, step by step:
Import Essential Libraries
Before you start any Data Science project, it is important to import the necessary libraries. Not only does this ensure that all of the necessary modules and libraries are available for your task, but it also makes it easier for other people to understand and replicate your project. Furthermore, it is a best practice to keep the imported libraries at the top of your code, as it makes it easier to keep track of which libraries are being used.
import os
import math
import warnings
import scipy
from scipy import stats
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model, model_selection, preprocessing, metrics
warnings.filterwarnings("ignore")
Data Collection
In this stage, collect and store the data that is needed to carry out your task. In this tutorial, I am generating sample data that includes both an independent (X) and a dependent (y) variable. Here, I am using the random module from the NumPy library to generate random data. The function random.rand() simply generates random data drawn from a uniform distribution, with the shape given by its arguments. We generate y from X by adding some random noise to it using random.randn(). The random noise is also known as the random error component, which is normally distributed as per the assumptions of linear regression. Here, the regressor variable X is 2-D in nature since it can have more than one feature (in the case of multiple linear regression); each row of X is a vector of features. The target y is usually a 1-D array since the regression is univariate (it involves one dependent variable). Since this is a univariate linear regression model, we use reshape() to change the shape of y from 2-D to 1-D.
# Generate random data for regression
X = np.random.rand(1000, 1) + 7
y = 6 * X + 7 + 2 * np.random.randn(1000, 1)
y = y.reshape(-1)
print(X.shape, y.shape)
(1000, 1) (1000,)
Data Analysis
After our data has been collected and stored, we need to analyse it to gain some insights from it. Data analysis is simply the process of transforming our data into a format from which we can derive useful insights or patterns for making decisions, and it helps us model the relationships more accurately. It involves inspecting our data, generating a summary, finding the correlation between the variables, and so on. Here, I have created a Pandas data frame from our original data, as it makes it easier to analyse the data by providing different methods for data manipulation. You can see the first and last 5 rows of the data frame, the type of data, and other information such as the number of non-null values in each column. We can get a summary of the data using the describe() method. Finally, we calculate the correlation between the variables using the corr() method, which uses Pearson's correlation coefficient. Correlation measures how strongly two variables are linearly related, and it ranges between -1 and 1, where -1 and 1 indicate a perfect negative and a perfect positive correlation, respectively. Our model tends to make better predictions if there is a strong correlation between the regressor and the target variable. We can also visualize the correlation matrix using a Seaborn heatmap (a colour-encoded matrix).
# Create a pandas data frame from data
df = pd.DataFrame(np.c_[X, y], columns = ["x", "y"])
print(df.shape)
print(df.dtypes)
# Top 5 rows
df.head()
# Bottom 5 rows
df.tail()
# Information about the data frame, such as data type and null-counts
df.info()
# Summary of the data frame
df.describe()
# Correlation matrix ( Pearson's Correlation Co-efficient )
corr_matrix = df.corr()
print(corr_matrix)
# Visualizing correlation matrix with a heatmap
sns.heatmap(corr_matrix, annot = True)
plt.show()
Data Visualization
For any project, we need to communicate our thoughts and insights to others effectively and clearly, even if they are not from a technical background. This is where data visualization comes in. With data visualization, we can not only communicate our views but also gain insights and detect patterns in the data that are not usually apparent. Various plots, such as scatter plots, histograms, box plots, violin plots, bar plots, pie charts, etc., can be used to visualize our data. In the scatter plot, we can see how the regressor and target variable are correlated; from this, we can say that there is a linear relationship between them. The histograms show how the regressor and target variables are distributed, and the density plots show the shape and density of those distributions. Finally, the box plots reveal any skewness in the data and also show the outliers, the minimum and maximum values, and the three quartiles Q1 (25%), Q2 (50%), and Q3 (75%). You can also make use of the pairplot() function in Seaborn to draw a plot between each pair of columns in the data frame.
### Visualizing data through various plots
# Scatter Plot
sns.scatterplot(x = 'x', y = 'y', data = df)
plt.show()
# Histogram
for var in ['x', 'y']:
    sns.histplot(x = var, data = df)
    plt.show()
# Density Plot
for var in ['x', 'y']:
    sns.kdeplot(x = var, data = df)
    plt.show()
# Box Plot
for var in ['x', 'y']:
    sns.boxplot(x = var, data = df)
    plt.show()
sns.pairplot(df)
plt.show()
Data Pre-Processing and Preparation
This stage involves transforming the raw data into features that the model can train on. It involves various phases such as data cleaning, data integration, data reduction, and data transformation. Since our data doesn't involve any missing values, we can skip the data cleaning process. Data reduction can be applied when there are multiple regressor variables, because training gets computationally more expensive as the number of features increases. I am not going to discuss these processes in depth; rather, I am only focusing on the aspects that we actually need. As a part of data transformation, we need to normalize our data so that it falls within a specific range and follows a specific pattern or distribution.
But before transforming the data, we need to split it into training and testing sets. This is because we need to evaluate the model's performance on unseen data, which is the test set. So we train the model on the training data, where it finds patterns, and then we evaluate it on the testing set. Generally, the train set comprises 70-80% of the data, and the test set comprises 20-30% of the data. We can use Sklearn's model_selection.train_test_split() to split the data into training and testing sets; the random_state keyword argument ensures that the same random split is generated on every run. After the data has been split, it needs to be scaled. Here, I am performing Z-score normalization (also known as standardization) using Sklearn's preprocessing.StandardScaler(). It simply transforms the data by subtracting the mean and dividing by the standard deviation. Feature scaling is not mandatory, but some algorithms, such as Gradient Descent, are sensitive to feature scaling and don't perform well or take more time if the features are not scaled. Another important thing to note is that the scaler should be fitted only on the training data, after splitting; otherwise, there is scope for information leakage from the testing set.
# Splitting the data into training and testing data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.2, random_state = 3)
print(X_train.shape, X_test.shape) ## 80% for training and 20% for testing
(800, 1) (200, 1)
# Feature Scaling ( Z-Score Normalization or Standardization )
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) ## use the training-set statistics; never fit the scaler on the test set
print(X_train.mean(), X_train.std()) ## Mean ~ 0 and Std ~ 1
print(X_test.mean(), X_test.std()) ## close to 0 and 1, but not exact, since the scaler was fitted only on the training set
Ok, we have done all that we need. Let’s hop into model building and training.
Model Building and Training
If we recall the equation of a straight line, it looks like this: y = ax + b, where a is the slope and b is the y-intercept. This is a simple linear regression model that can be used to predict y when x is given. Here, we need to compute the values of the unknowns, the slope and the intercept, so that we can make predictions. Note: the terms a and b are also known as the model's parameters.
Using Normal Equations:
The Ordinary Least Squares (OLS) method utilizes normal equations to directly compute the slope and intercept of the regression line. OLS aims to minimize the sum of squared residuals (SSE), which represents the disparity between the actual and estimated values of the target variable. Achieving the minimum SSE entails finding optimal slope and intercept values. To minimize SSE, we partially differentiate the SSE function with respect to the slope and the intercept and set these derivatives to zero. This process yields the normal equations, which are used to estimate the slope and intercept. You can see what the estimated slope and intercept look like below, along with some of the notation used.
Along with the residual sum of squares (SSE), we have the regression sum of squares (SSR) and the total sum of squares (SST). SSR is also known as the explained error, whereas SSE is also known as the unexplained error. The relation between SSR, SSE, and SST is given by the ANOVA decomposition of the total sum of squares: SST can be written as the sum of the regression sum of squares (SSR) and the residual sum of squares (SSE). You can also derive this if you want to. Along with SST, SSR, and SSE, we also have their means, i.e., MST, MSR, and MSE, obtained by dividing SST, SSR, and SSE by their corresponding degrees of freedom. In general, for SLR, the degrees of freedom for SST, SSR, and SSE are m - 1, 1, and m - 2, respectively. Degrees of freedom simply indicate the number of values of a variable that we can choose independently. Throughout this tutorial, I use m to represent the number of instances (tuples) and n to represent the number of features.
After deriving these equations, we simply code them in Python. As you can see, we get the values of the slope and intercept directly from the normal equations, and we can use them to predict the value of y for a given value of x. Let's plot the line of linear regression using Matplotlib and Seaborn: it is simply the line plotted between the regressor variable and the predicted target, and it always passes through the estimated values of the target. Next, we can compute metrics such as MST, MSR, and MSE to assess the performance of the model, as mentioned earlier. We will see some other metrics later.
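For reference, here are the standard OLS estimates and the sum-of-squares decomposition described above, written to match the quantities (sxx, sxy, beta, alpha, sse, ssr, sst) computed in the code below:
$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m}(x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$
$$SST = SSR + SSE, \qquad MST = \frac{SST}{m-1}, \quad MSR = \frac{SSR}{1}, \quad MSE = \frac{SSE}{m-2}$$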
Here is the implementation of SLR using Normal Equations:
# Compute sxx, sxy and syy
x_train = X_train[:, 0]
x_train_mean = x_train.mean()
y_train_mean = y_train.mean()
sxx = np.sum(np.square(x_train - x_train_mean))
sxy = np.sum((x_train - x_train_mean) * (y_train - y_train_mean))
syy = np.sum(np.square(y_train - y_train_mean))
print(sxx, sxy, syy)
# Computing beta (slope) and alpha (intercept)
beta = sxy / sxx
alpha = y_train_mean - x_train_mean * beta
print(beta, alpha)
# Plotting the line of linear regression
y_train_pred = beta * x_train + alpha
sns.scatterplot(x = x_train, y = y_train, label = "Data Point")
sns.lineplot(x = x_train, y = y_train_pred, color = "red", label = "Line of Regression")
plt.show()
# Compute mse, msr and mst
sse = np.sum(np.square(y_train - y_train_pred))
ssr = np.sum(np.square(y_train_mean - y_train_pred))
sst = np.sum(np.square(y_train - y_train_mean))
print(sse, ssr, sst)
m = len(x_train)
print(m)
mse = sse / (m - 2)
msr = ssr / 1
mst = sst / (m - 1)
print(mse, msr, mst)
Now, let’s look at how we can estimate the values of slope and intercept using another technique known as Gradient Descent.
Using Gradient Descent:
Gradient Descent (GD) is an iterative optimization approach used to find the minimum value of a function. It tweaks the model’s parameters, which are slope and intercept, and updates them on every iteration or epoch to minimize the cost function. Here, the cost function is used to represent the error. It is similar to that of SSE, which we saw earlier. Here are the components of Gradient Descent:
- Cost Function: It is a representation of the error. There are many cost functions that we can use for linear regression, such as mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), etc. In this tutorial, we're going to use MSE because of its simplicity and because it guarantees a global minimum: the cost function is bowl-shaped, so there is only one minimum, which is both local and global.
- Number of epochs: It is a hyperparameter that represents the number of iterations or epochs to be run, i.e., the number of times the cost function needs to be updated to become minimum. When all the training data passes through the algorithm once, then it is known as one epoch. The cost function keeps getting smaller after each epoch if we use a good learning rate.
- Learning Rate: It is a hyperparameter that refers to the size of the update step. If it is too large, the algorithm diverges instead of converging, i.e., the cost value increases. If it is too small, the updates become tiny (the gradients of the cost function contribute very little per step), so the algorithm makes barely any progress per epoch and requires a very large number of iterations to converge. Hence, we need to choose an optimal value for the learning rate hyperparameter.
Note: There is a difference between parameters and hyperparameters. Parameters are learned and updated during training and directly determine the model's predictions, whereas hyperparameters are set before training and are not learned from the data. We need to choose good values for these hyperparameters to get the best model, which can be done through a technique known as Hyperparameter Tuning.
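As a quick, illustrative sketch of hyperparameter tuning (not part of the model we build below), one could grid-search the learning rate and number of iterations of Sklearn's gradient-descent-based SGDRegressor; the grid values here are arbitrary choices:
# A sketch of hyperparameter tuning with a grid search ( illustrative values )
params = {"eta0": [0.001, 0.01, 0.1], "max_iter": [500, 1000, 2000]}
search = model_selection.GridSearchCV(linear_model.SGDRegressor(learning_rate = "constant"), params, cv = 5, scoring = "r2")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)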
Steps involved in Gradient Descent:
1. Initialise the values of the slope (weight) and intercept randomly.
2. Find the gradients of the cost function. The gradients are obtained by partially differentiating the cost function with respect to the slope and the intercept (they are written out explicitly after these steps).
3. Update the values of slope and intercept by subtracting the product of learning rate and gradients from them.
4. Repeat steps 2 and 3 for a fixed number of epochs.
Finally, the updated values of slope and intercept are obtained, and the cost function is at its minimum at those values of slope and intercept.
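Written out, the MSE cost, its gradients, and the update rule used in the implementation below are (where $\eta$ is the learning rate, eta in the code):
$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2, \qquad \hat{y}_i = w x_i + b$$
$$\frac{\partial J}{\partial w} = \frac{2}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)\,x_i, \qquad \frac{\partial J}{\partial b} = \frac{2}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)$$
$$w \leftarrow w - \eta\,\frac{\partial J}{\partial w}, \qquad b \leftarrow b - \eta\,\frac{\partial J}{\partial b}$$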
Note:
- Gradient Descent is sensitive to feature scaling. It converges faster if we perform feature scaling.
- There are many versions of Gradient Descent that we can use: i) Batch GD, ii) Stochastic GD, and iii) Mini-Batch GD. In this tutorial, we use Batch GD, which is so named because it uses the whole batch of training instances at every epoch to compute the gradients and update the parameters.
- We can also make many additions to plain GD, such as implementing early stopping ( stop training when there is no progress in the validation error for a fixed number of epochs ), including a tolerance hyperparameter to stop the training when the change in the update step is very minimal, etc.
Here is the implementation of SLR through Batch Gradient Descent:
# Implementing Linear Regression using Batch Gradient Descent
class LinearRegression:
    def __init__(self, epochs = 1000, eta = 0.01):
        self.epochs = epochs
        self.eta = eta
        self.weights = self.bias = None

    def fit(self, X, y):
        m, n = X.shape
        self.weights = np.random.randn(n)   # random initialization of the slope (weight)
        self.bias = np.random.rand(1)       # random initialization of the intercept (bias)
        costs = np.zeros(self.epochs)
        for i in range(self.epochs):
            y_pred = np.dot(X, self.weights) + self.bias
            cost = (1 / m) * np.sum(np.square(y_pred - y))   # MSE cost at this epoch
            costs[i] = cost
            dw = (2 / m) * np.dot(X.T, y_pred - y)           # gradient w.r.t. the weights
            db = (2 / m) * np.sum(y_pred - y)                # gradient w.r.t. the bias
            self.weights -= dw * self.eta
            self.bias -= db * self.eta
            plt.plot(X[:, 0], y_pred)                        # line of regression at the current epoch
        plt.plot(X[:, 0], y_pred, color = "black", lw = 2)   # final line of regression
        plt.show()
        plt.xlabel("Epoch")
        plt.ylabel("Cost (MSE)")
        plt.plot(np.arange(1, self.epochs + 1), costs, label = "Learning Curve")
        plt.legend()
        plt.show()

    def predict(self, X):
        y_pred = np.dot(X, self.weights) + self.bias
        return y_pred
reg = LinearRegression(eta = 0.01, epochs = 1000)
reg.fit(X_train, y_train)
print(reg.weights, reg.bias)
y_test_pred = reg.predict(X_test)
Here, I have plotted a curve between the number of epochs and the cost function at every epoch. This learning curve helps in choosing an optimal value for the number of epochs. You can also see how the line of linear regression is updating after each epoch. As I mentioned earlier, you can also experiment with it by including additional things like tolerance and early stopping.
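For instance, here is a minimal sketch of a tolerance-based stopping rule that could be dropped into the training loop of fit(); the tol value and the exact check are my own assumptions, not part of the class above:
# A sketch of tolerance-based early stopping inside the training loop ( assumed tol value )
tol = 1e-7
for i in range(self.epochs):
    y_pred = np.dot(X, self.weights) + self.bias
    cost = (1 / m) * np.sum(np.square(y_pred - y))
    costs[i] = cost
    if i > 0 and abs(costs[i - 1] - costs[i]) < tol:   # barely any improvement in the cost
        print("Stopping early at epoch", i)
        break
    dw = (2 / m) * np.dot(X.T, y_pred - y)
    db = (2 / m) * np.sum(y_pred - y)
    self.weights -= dw * self.eta
    self.bias -= db * self.eta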
Using Sklearn:
Implementing SLR through Sklearn’s API is quite simple and straightforward. We can make use of linear_model.LinearRegression() for a linear regression model. Then we call the fit() method to train the model, and then predict() to make predictions. Sklearn computes the parameters internally using a technique known as singular value decomposition (SVD), a matrix factorization technique used to compute the pseudo-inverse of a matrix directly.
Here is the implementation of SLR using Sklearn’s API:
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print(model.coef_, model.intercept_)
y_test_pred = model.predict(X_test)
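For intuition, the same parameters can also be recovered directly with a pseudo-inverse, which NumPy computes via SVD; this is a sketch of the underlying idea rather than a line-for-line account of what Sklearn does:
# Closed-form solution via the pseudo-inverse ( computed internally with SVD )
X_b = np.c_[np.ones((len(X_train), 1)), X_train] ## add a bias column of ones
theta = np.linalg.pinv(X_b) @ y_train ## theta = [intercept, slope]
print(theta)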
Now we know how linear regression can be implemented in different ways. Next, let's look at some metrics that are used to evaluate the performance of a regression model.
Model Evaluation
Model evaluation refers to evaluating the model on a testing set to assess its overall performance. After training the model, we need to know how good our model is and how accurate its predictions are. Here, we make use of an evaluation metric known as the Coefficient of Determination, aka R-square, to evaluate the model's performance. R-square is a statistical measure of how well the model captures the variability in the dependent variable. For example, if R² is 0.76, then we can say that about 76% of the variability in the target variable is explained by the regression model. R-square is simply the ratio of the regression sum of squares (SSR) to the total sum of squares (SST), which is equivalent to 1 - SSE / SST. We can make use of Sklearn's metrics.r2_score() to get the R-square value directly by passing the true and predicted values of the target variable as arguments.
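In equation form (matching the computation in the code below):
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$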
y_test_pred = model.predict(X_test)
sse = np.sum(np.square(y_test - y_test_pred))
sst = np.sum(np.square(y_test - y_test.mean()))
r2 = 1 - sse / sst
print(r2)
print(model.score(X_test, y_test))
print(metrics.r2_score(y_test, y_test_pred))
Apart from R-square, there are many other metrics, such as adjusted R-square, MSE, MAE, RMSE, the t-value, the F-value, Cook's statistic, etc., that can be used to evaluate our model's performance.
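A few of these can be computed directly with Sklearn's metrics module; here is a quick sketch using the test predictions from above:
# Additional regression metrics on the test set
print(metrics.mean_squared_error(y_test, y_test_pred)) ## MSE
print(metrics.mean_absolute_error(y_test, y_test_pred)) ## MAE
print(np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))) ## RMSE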
Hypothesis Testing:
Hypothesis testing is generally done to validate our assumptions. Through hypothesis testing, we can test whether there is a significant relationship between the regressor and the target variable in a linear regression model. Hypothesis testing involves a null hypothesis (H0) and an alternative hypothesis (H1); when the null hypothesis is rejected, the alternative is accepted. Here are the two types of tests that I have used:
i) t-test
ii) F-test
i) t-test: The t-statistic is compared to a critical value from the t-distribution with m - 2 degrees of freedom, where m is the number of observations or instances. If the absolute value of the t-statistic is greater than the critical value, the null hypothesis is rejected and the alternative hypothesis is accepted.
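Concretely, the statistics computed in the code below are
$$t_{\beta} = \frac{\hat{\beta}}{\sqrt{MSE / S_{xx}}}, \qquad t_{\alpha} = \frac{\hat{\alpha}}{\sqrt{MSE\left(\frac{1}{m} + \frac{\bar{x}^{2}}{S_{xx}}\right)}}$$
each compared with the two-sided critical value of the t-distribution at the 5% significance level with m - 2 degrees of freedom.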
# Hypothesis Testing ( t-test )
t_q = -stats.t.ppf(q = 0.05 / 2, df = m - 2)
var_beta = mse / sxx
t = beta / math.sqrt(var_beta)
print(t, t_q)
if abs(t) > t_q:
    print("Null hypothesis rejected! There is a significant relationship between x and y in terms of the slope.")
else:
    print("Failed to reject the null hypothesis! There is no significant relationship between x and y in terms of the slope.")
var_alpha = mse * (1 / m + x_train_mean ** 2 / sxx)
t = alpha / math.sqrt(var_alpha)
if abs(t) > t_q:
    print("Null hypothesis rejected! The intercept is significantly different from zero.")
else:
    print("Failed to reject the null hypothesis! The intercept is not significantly different from zero.")
ii) F-test: It compares MSR to MSE and is used to perform an overall test of significance for the regression. It becomes especially useful as the number of regressor variables increases.
# Hypothesis Test ( F-Test )
f = msr / mse
f_q = stats.f.ppf(q = 1 - 0.05, dfn = 1, dfd = m - 2) ## upper 5% critical value of F(1, m - 2)
print(f, f_q)
if f > f_q:
    print("Null hypothesis rejected! There is a significant relationship between x and y.")
else:
    print("Failed to reject the null hypothesis! There is no significant relationship between x and y.")
We can also get the whole regression summary using the statsmodels.formula API. It is as follows:
train_df = pd.DataFrame(np.c_[X_train, y_train], columns = ['x', 'y'])
stats_model = smf.ols(formula = "y ~ x", data = train_df).fit().summary()
print(stats_model)
The expected result will look something like this,
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.383
Model: OLS Adj. R-squared: 0.382
Method: Least Squares F-statistic: 494.5
Date: Sun, 19 Nov 2023 Prob (F-statistic): 1.24e-85
Time: 12:14:20 Log-Likelihood: -1705.4
No. Observations: 800 AIC: 3415.
Df Residuals: 798 BIC: 3424.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 51.9183 0.072 719.024 0.000 51.777 52.060
x 1.6057 0.072 22.238 0.000 1.464 1.747
==============================================================================
Omnibus: 0.721 Durbin-Watson: 2.046
Prob(Omnibus): 0.697 Jarque-Bera (JB): 0.622
Skew: -0.063 Prob(JB): 0.733
Kurtosis: 3.053 Cond. No. 1.00
==============================================================================
When can I apply Linear Regression? What are its assumptions?
Linear regression can’t be applied in all cases. There are some assumptions that need to be satisfied in order to apply linear regression. These include:
- Linearity: There should be a linear relationship between the regressor and the target variable, meaning that the change in y is proportional to the change in x. It can be tested by calculating the correlation coefficient or by plotting a scatter plot.
- Homoscedasticity: Also known as homogeneity of variance, this assumption states that the variance of the residuals (the differences between the observed and predicted values) should be constant across all levels of the independent variable x. In simpler terms, the spread of the residuals should remain consistent as x changes.
- Normality: It implies that the residuals are distributed normally. You can test this by plotting a density plot for residuals.
- No perfect multicollinearity: No perfect multicollinearity implies that there should not be a high correlation between the regressor variables. It is applicable in cases of multiple linear regression.
- No autocorrelation: Autocorrelation occurs when residuals are correlated with each other. In the context of simple linear regression, this assumption implies that the residuals should not show a pattern or correlation when plotted against time or the order of observations.
Even if a few of the assumptions are not satisfied, we can transform the data so that the assumptions can be satisfied. For instance, if the linearity assumption is not satisfied, we can apply polynomial features to transform our existing data into a new form.
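As a quick sketch of how a couple of these assumptions can be eyeballed using the residuals of the fitted Sklearn model (the plots below are illustrative checks, not formal tests):
# Residual diagnostics on the test set
residuals = y_test - y_test_pred
sns.scatterplot(x = X_test[:, 0], y = residuals) ## homoscedasticity: the spread should stay roughly constant
plt.axhline(0, color = "red")
plt.show()
sns.kdeplot(x = residuals) ## normality: the residuals should look roughly bell-shaped
plt.show()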
Conclusion:
In this article, we have learned about Simple Linear Regression and how it can be implemented in Python. Learning this concept opens the door to predictive modelling and paves the way for your Data Science journey. Whether forecasting trends or making predictions, the insights gleaned from this guide empower you to leverage data effectively. With this foundational knowledge, you can explore more advanced regression techniques and dive deeper into the world of Data Science.