Step-by-Step: Implementing Gradient Descent Variants in Python for Beginners — A Comprehensive Guide

13 min readNov 22, 2023

Hello everyone, Do you like math ? Whatever it may be, this article is for you. In this article, I’m going to give a brief overview of what Gradient Descent is, the different variants involved, and how it can be implemented in Python from scratch. Before reading this article, I’m assuming that you are somewhat familiar with linear regression and are aware of Python libraries such as Numpy. If not, don’t worry ! They are super easy to learn, and Numpy is really a good Python library to start with if you are into Data Science or Machine Learning. In this article, we are going to take a deep dive into an algorithm called Gradient Descent, and we are also going to implement it from scratch in Python. So why so late ? Let’s dive into the topic !

Table of Contents

1. Introduction

2. Batch Gradient Descent

3. Stochastic Gradient Descent

4. Mini-Batch Gradient Descent

What is Gradient Descent ?

Gradient Descent (GD) is an iterative optimization algorithm used to find the minimum value of a function. It tweaks the model’s parameters iteratively to minimize the cost function. Here, the cost function is like a metric used for representing error. The idea behind the Gradient Descent is to simply update the parameters iteratively until the cost function converges to a minimum value. It involves initializing the model’s parameters, computing the gradients, and updating the parameters. If you think it will be hard, don’t panic ! It won’t be that tough, and you will get clarity in a while.

Components of Gradient Descent:

Here are the components of Gradient Descent:

Cost Function: The cost function is a metric that represents an error. There are many cost functions that we can use for regression, such as mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), etc. In this tutorial, we’re going to use MSE because of its simplicity and because it guarantees a global minima ( there is only one minimum value for the function, which is both local and global, which we can say, in other words, that the cost function gives a bowl shape ). Moreover, the MSE cost is continuous, and the slope does not change abruptly.
Number of epochs or iterations: It is a hyperparameter that represents the number of iterations to be run, i.e., the number of times the gradients are to be computed for updating the model’s parameters. When all the training data passes through this entire process once, then it is known as one epoch. The cost function keeps getting smaller after each epoch if we use a good learning rate.
Learning Rate: It is a hyperparameter that refers to the size of the update step. If it is too large, the algorithm diverges instead of converging, i.e., the cost value tends to increase. If it is too small, the algorithm requires a very large number of iterations to converge, and it may suffer from the Vanishing Gradients Problem. In the vanishing gradients problem, the gradients of the cost function tend to become very smaller, and the update tends to become very minimal. Hence, we need to choose an optimal value for this learning rate hyperparameter in order to have a good model.

Note: There is a difference between parameters and hyperparameters. Parameters are updated while training, and they affect the model’s predictions, whereas hyperparameters are set before training and are not included while making predictions. We need to choose good values for these hyperparameters to have the best model. But, unfortunately, there is no such thing as good value. It depends on the specific application, and it is often a trial-and-error strategy. But generally, we take the learning rate hyperparameter within the range of 1e-2 and 1e-6. It can be done through Hyperparameter Tuning ( testing possible combinations of the hyperparameters and evaluating the model ).

Steps involved in Gradient Descent:

Initialize the model’s parameters (weights and bias) randomly. This is known as random initialization.
Find the gradients of the cost function. Gradients can be obtained by partially differentiating the cost function with respect to the model’s parameters. Instead of computing gradients separately for weights and the bias term, we can also compute them collectively by including the bias term in the feature vector itself.
Update the values of weights and bias term by subtracting the product of learning rate and gradients from them. This is the update step, and the parameters will get optimized after every iteration.
Repeat steps 2 and 3 for a fixed number of epochs until the cost function converges to the minimum possible value.

Finally, the updated values of parameters are obtained, and the cost function is at its minimum at those values of parameters.

Note:

Gradient Descent is sensitive to feature scaling. It converges faster if we perform feature scaling. If we don’t do so, the cost function (MSE) will be in the shape of an elongated bowl instead of a bowl. It increases the computational complexity of the algorithm.
We can also make many additions to plain GD, such as implementing early stopping ( stop training when there is no progress in the validation error for a fixed number of epochs ), including a tolerance hyperparameter to stop the training when the change in the update step is very minimal, etc.

Here, I have provided the markdown in jupyter notebook which may help you in understanding better, because in this article we are going to implement Gradient Descent for Linear Regression.

Data Collection, Pre-Processing and Preparation:

Here, for this tutorial, I have generated random sample data for multiple linear regression involving two features and a target variable using sklearn’s API. Just don’t bother too much about what Sklearn is. It is a Machine Learning library that provides the implementation of many algorithms. Not only that, it provides various utilities for data preprocessing, model building and training, model evaluation, etc. If you studied linear regression earlier, the equation would look like this: y = mx + c. The equation below also resembles the same, but rather it is a multiple linear regression model, i.e., it involves multiple independent variables and a single dependent variable. You can go through the below code for better understanding.

import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model, model_selection, preprocessing, metrics, datasets

warnings.filterwarnings("ignore")

# Generating random regression data

X, y = datasets.make_regression(n_samples = 1000, n_features = 2, noise = 7, random_state = 16)

print(X.shape, y.shape)

# Splitting the data into training and testing sets

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.2, random_state = 3)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Feature Scaling ( Standardization or Z-Score Normalization )

scaler = preprocessing.StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

print(X_train.mean(), X_train.std())
print(X_test.mean(), X_test.std())

Different variants of Gradient Descent:

There are three main variants of gradient descent that we can use. They are: i) Batch GD ii) Stochastic GD, and iii) Mini Batch GD. The difference between them lies mainly in the number of training instances or samples used per iteration, computational complexity, and likelihood of finding a global minima. Now, we will have a look into each one of them and learn how to implement them in Python from scratch.

i) Batch Gradient Descent: The name indicates that it uses a whole batch of training instances at each epoch to compute the gradients of the cost function. Although it is computationally expensive when compared to other variants, it guarantees finding an optimal solution. One of the main drawbacks of Batch GD comes into play when the cost function does not give a bowl shape or when it is discontinuous. In some cases, the cost function gives a bowl shape. There may be ridges, plateaus, and holes. And, in that case, Batch GD will be more likely to end up finding a local minima instead of a global minima.

Batch Gradient Descent with and without feature scaling

Pitfalls due to discontinuous cost function

Pros:

It guarantees finding the most optimal solution, especially when the cost function gives a bowl shape.
The bias and variance are lower when compared to other variants. Therefore, it gives rise to a very good model.

Cons:

It may get stuck in the local minima and end up finding a local optimal solution if the learning rate is not chosen properly.
It has more computational complexity, and it becomes extremely slower with the increase in the number of features and training samples.

Here is the implementation of Batch Gradient Descent in Python from scratch:

# Batch Gradient Descent

class BatchGD:

    def __init__(self, max_iter = 1000, eta = 0.001, tol = 1e-6):
        self.max_iter = max_iter
        self.eta = eta
        self.tol = tol
        self.weights = self.bias = None

    def fit(self, X, y):
        m, n = X.shape
        self.weights = np.random.rand(n)
        self.bias = np.random.rand(1)[0]
        costs = []
        
        for i in range(1, self.max_iter + 1):
            y_pred = np.dot(X, self.weights) + self.bias
            mse = (1 / m) * np.sum(np.square(y_pred - y))
            costs.append(mse)
            
            dw = (2 / m) * np.dot(X.T, y_pred - y)
            db = (2 / m) * np.sum(y_pred - y)
            
            temp = self.weights
            self.weights = self.weights - self.eta * dw
            self.bias = self.bias - self.eta * db
            
            if i % 100 == 0:
                print(f"Epoch: {i}\tCost: {mse}")
                
            if all(abs(self.weights - temp) <= self.tol):
                print(f"Stopped at iteration {i} !")
                break

    def predict(self, X):
        y_pred = np.dot(X, self.weights) + self.bias
        return y_pred

# Training and evaluation ( Batch GD )

lr = BatchGD(max_iter = 10000, eta = 0.01)

lr.fit(X_train, y_train)

print(f"Weights (Coefficients): {lr.weights}\nBias (Intercept): {lr.bias}")

y_test_pred = lr.predict(X_test)

r2_score = metrics.r2_score(y_test, y_test_pred)

print("R-Square (Coefficient of determination): {score}".format(score = r2_score))

# Implementation through Sklean API

model = linear_model.LinearRegression()

model.fit(X_train, y_train)

print(f"Weights (Coefficients): {model.coef_}\nBias (Intercept): {model.intercept_}")

r2_score = model.score(X_test, y_test)

print("R-Square (Coefficient of determination): {score}".format(score = r2_score))

You can see that I have also implemented linear regression using Sklearn’s API, which makes use of single value decomposition (matrix factorization technique) instead of gradient descent. Despite their different approaches, you may notice that both produce similar results.

ii) Stochastic Gradient Descent: Unlike Batch GD, it doesn’t use an entire batch of training samples to compute gradients. Instead, it uses only one sample randomly from all the training instances at a time. You may notice that, instead of cost, here I have mentioned loss. Many people use “cost” and “loss” interchangeably, but generally, cost is the error for all training instances, while loss is the error for a single training instance. The cost is actually the average of all the losses. Therefore, I’ve used squared error (SSE) instead of mean squared error (MSE). One excellent thing about SGD is that it is very fast when compared to its variants since it uses a single instance at a time. But the main drawback of this is its stochastic nature. This implies that the update is irregular, i.e., the cost function keeps bouncing up and down instead of settling down slowly. Hence, it can only guarantee a good solution, but it can’t guarantee the optimal solution. One solution to this dilemma is to gradually reduce the learning rate as the GD approaches global minima. We can do this by introducing a learning schedule instead of merely using a fixed learning rate. Surprisingly, there is also an advantage associated with this drawback, especially in cases where the cost function does not give a bowl shape. That means it can escape local minima and doesn’t end up finding a local optimal value due to its stochastic nature. Note: It is necessary to shuffle the training data, especially if it is sorted according to the labels (in classification problems) or in a way that it shows a particular pattern. This is because if we don’t do so, the SGD will keep optimising its parameters for one label and then the other. This causes the resulting model to have a very high bias or variance.

Pros:

It is computationally efficient and especially suitable for larger datasets.
It is very less likely to get stuck in the local minima.

Cons:

It has a high bias and variance compared to other variants. And in some cases, it overfits due to the high variance.
By the end, it only finds a good minimum value but not the best optimal value.

Here is the implementation of Stochastic Gradient Descent in Python from scratch:

# Stochastic Gradient Descent

class StochasticGD:

    def __init__(self, max_iter = 100, t0 = 5, t1 = 50, tol = 1e-7):
        self.max_iter = max_iter
        self.tol = tol
        self.t0, self.t1 = 5, 50
        self.weights = self.bias = None

    def fit(self, X, y):
        m, n = X.shape
        self.weights = np.random.rand(n)
        self.bias = np.random.rand(1)[0]
        losses, flag = [], False
        
        for epoch in range(1, self.max_iter + 1):
            perm_idx = np.random.permutation(m)
            X, y = X[perm_idx], y[perm_idx]
            
            if flag: break
                
            if epoch % 10 == 0:
                print(f"Epoch: {epoch}\tLoss: {losses[epoch - 1]}")
                
            for i in range(m):
                rand_idx = np.random.randint(m)
                X_i = X[rand_idx: rand_idx + 1]
                y_i = y[rand_idx: rand_idx + 1]
                
                y_pred = np.dot(X_i, self.weights) + self.bias
                loss = np.sum(np.square(y_pred - y_i))
                losses.append(loss)
                
                dw = np.dot(X_i.T, y_pred - y_i)
                db = np.sum(y_pred - y_i)
                
                temp = self.weights
                eta = self.learning_schedule(m * epoch + i)
                
                self.weights = self.weights - eta * dw
                self.bias = self.bias - eta * db
                
                if all(abs(self.weights - temp) <= self.tol):
                    print(f"Stopped at iteration {epoch} !")
                    flag = True
                    break

    def learning_schedule(self, t):
        return self.t0 / (t + self.t1)

    def predict(self, X):
        y_pred = np.dot(X, self.weights) + self.bias
        return y_pred

# Training and evaluation ( Stochastic GD )

lr = StochasticGD(max_iter = 50, tol = 1e-8)

lr.fit(X_train, y_train)

print(f"Weights (Coefficients): {lr.weights}\nBias (Intercept): {lr.bias}")

y_test_pred = lr.predict(X_test)

r2_score = metrics.r2_score(y_test, y_test_pred)

print("R-Square (Coefficient of determination): {score}".format(score = r2_score))

# Implementation through Sklean API

model = linear_model.SGDRegressor()

model.fit(X_train, y_train)

print(f"Weights (Coefficients): {model.coef_}\nBias (Intercept): {model.intercept_}")

r2_score = model.score(X_test, y_test)

print("R-Square (Coefficient of determination): {score}".format(score = r2_score))

You can see that I have also implemented SGD regression using Sklearn’s API, which makes use of Stochastic Gradient Descent internally. Additionally, it has some more parameters, such as a penalty (regularization), which can be used to prevent overfitting by constraining the model’s parameters with the addition of a penalty term.

iii) Mini-Batch Gradient Descent: You may easily guess what Mini-Batch GD is if you have gone through Batch GD and Stochastic GD. Like you guess, it is between Batch GD and Stochastic GD. It doesn’t use entire training instances, like Batch GD, or a single training instance at a time, like Stochastic GD. Instead, it uses a mini-batch, which is a random subset of the entire training set, for gradient computation. The size of the mini-batch is a variable, and a batch size of 32 is good and is most widely used. It outperforms SGD and boosts performance through hardware optimization, especially when using GPU’s. As with SGD, we need to shuffle the training instances to get rid of high bias or variance.

Comparison of three variants on how they approach global minima

Pros:

It provides a performance boost with hardware optimization if used with GPUs.
It has both the best properties of SGD and Batch GD, i.e., in terms of computational complexity as well as the likelihood of finding the global minima.

Cons:

It may be stuck in the local minima if the cost function is discontinuous or if it doesn’t give a bowl shape.

Here is the implementation of Mini-Batch Gradient Descent in Python from scratch:

# Mini-Batch Gradient Descent

class MiniBatchGD:

    def __init__(self, max_iter = 100, eta = 0.001, tol = 1e-7, batch_size = 32):
        self.max_iter = max_iter
        self.tol = tol
        self.eta = eta
        self.batch_size = batch_size
        self.weights = self.bias = None

    def fit(self, X, y):
        m, n = X.shape
        self.weights = np.random.rand(n)
        self.bias = np.random.rand(1)[0]
        costs, flag = [], False
        
        for epoch in range(1, self.max_iter + 1):
            perm_idx = np.random.permutation(m)
            X, y = X[perm_idx], y[perm_idx]
            
            if flag: break
                
            if epoch % 100 == 0:
                print(f"Epoch: {epoch}\tLoss: {costs[epoch - 1]}")
                
            for i in range(0, m, self.batch_size):
                X_mini = X[i: i + self.batch_size]
                y_mini = y[i: i + self.batch_size]
                
                y_pred = np.dot(X_mini, self.weights) + self.bias
                cost = 1 / self.batch_size * np.sum(np.square(y_pred - y_mini))
                costs.append(cost)
                
                dw = 2 / self.batch_size * np.dot(X_mini.T, y_pred - y_mini)
                db = 2 / self.batch_size * np.sum(y_pred - y_mini)
                
                temp = self.weights
                self.weights = self.weights - self.eta * dw
                self.bias = self.bias - self.eta * db
                
                if all(abs(self.weights - temp) <= self.tol):
                    print(f"Stopped at iteration {epoch} !")
                    flag = True
                    break

    def predict(self, X):
        y_pred = np.dot(X, self.weights) + self.bias
        return y_pred

# Training and evaluation ( MiniBatch GD )

lr = MiniBatchGD(max_iter = 1000, tol = 1e-7)

lr.fit(X_train, y_train)

print(f"Weights (Coefficients): {lr.weights}\nBias (Intercept): {lr.bias}")

y_test_pred = lr.predict(X_test)

r2_score = metrics.r2_score(y_test, y_test_pred)

print("R-Square (Coefficient of determination): {score}".format(score = r2_score))

# Implementation through Sklean API

def form_batches(X, y, batch_size = 32):
    m, n = X.shape
    X_batches, y_batches = [], []
    
    for i in range(0, m, batch_size):
        perm_idx = np.random.permutation(m)
        X, y = X[perm_idx], y[perm_idx]
        
        X_batch = X[i: i + batch_size]
        y_batch = y[i: i + batch_size]

        if X_batch.shape[0] < batch_size:
            continue
        
        X_batches.append(X_batch)
        y_batches.append(y_batch)
        
    return np.array(X_batches), np.array(y_batches)

model = linear_model.SGDRegressor()

X_batches, y_batches = form_batches(X_train, y_train)

for X, y in zip(X_batches, y_batches):
    model.partial_fit(X, y)

print(f"Weights (Coefficients): {model.coef_}\nBias (Intercept): {model.intercept_}")

r2_score = model.score(X_test, y_test)

print("R-Square (Coefficient of determination): {score}".format(score = r2_score))

Conclusion:

Gradient Descent is indeed a wonderful, simple, and valuable algorithm. It offers foundational knowledge and lays the groundwork for understanding numerous Machine Learning and Deep Learning algorithms, including linear regression, logistic regression, support vector machines, neural networks, and more. Through this tutorial, you’ve acquired a powerful tool to add to your machine learning toolkit.

Thanks for reading this article. If you found it useful, please don’t forget to share. Here is a book I recommend you read to get started with Machine Learning. It is really a great book to start with especially if you are a beginner.

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition

Step-by-Step: Implementing Gradient Descent Variants in Python for Beginners — A Comprehensive Guide

What is Gradient Descent ?

Different variants of Gradient Descent:

Conclusion:

Written by Prasad Meesala