
Understanding Gradient Descent in Linear Regression

Vadim Nicolai · Senior Software Engineer at Vitrifi · 5 min read

Introduction

Gradient descent is a fundamental optimization algorithm used in machine learning to minimize the cost function and find the optimal parameters of a model. In the context of linear regression, gradient descent helps in finding the best-fitting line by iteratively updating the model parameters. This article delves into the mechanics of gradient descent in linear regression, focusing on how the parameters are updated and the impact of the sign of the gradient.

The Linear Regression Model

Model Equation

Linear regression aims to model the relationship between an independent variable and a dependent variable by fitting a linear equation to observed data:

f_{w,b}(x) = w \cdot x + b

• x: The input feature or independent variable.
• w: The weight or coefficient, representing the slope of the line.
• b: The bias or intercept term, indicating where the line crosses the y-axis.
• f_{w,b}(x): The predicted output for a given input x.

This equation represents a straight line in two-dimensional space, where the goal is to find the optimal values of w and b that minimize the difference between the predicted outputs and the actual outputs.
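
To make this concrete, here is a minimal Python sketch of the prediction function. The name predict and the example values are illustrative, not taken from any particular library.

```python
# Minimal sketch of the linear model f_{w,b}(x) = w * x + b.
def predict(x, w, b):
    """Return the prediction w * x + b for a single input x."""
    return w * x + b


# Example: with w = 2.0 and b = 1.0, the input x = 3.0 maps to 7.0.
print(predict(3.0, w=2.0, b=1.0))  # 7.0
```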

Cost Function

To assess the accuracy of the model, we use a cost function J(w, b), commonly defined as the Mean Squared Error (MSE):

J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}\left( x^{(i)} \right) - y^{(i)} \right)^2

• m: The number of training examples.
• x^{(i)}: The i-th input feature.
• y^{(i)}: The actual output corresponding to x^{(i)}.

The goal is to find the parameters w and b that minimize this cost function.
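
As a quick sketch (not production code), the cost can be computed directly from this formula. The function compute_cost and the toy data below are illustrative.

```python
# Sketch of the MSE cost J(w, b) = (1 / (2m)) * sum_i (f_{w,b}(x_i) - y_i)^2,
# assuming x and y are equal-length Python lists of numbers.
def compute_cost(x, y, w, b):
    m = len(x)
    total = 0.0
    for x_i, y_i in zip(x, y):
        error = (w * x_i + b) - y_i  # f_{w,b}(x_i) - y_i
        total += error ** 2
    return total / (2 * m)


# Example: a perfect fit of y = 2x + 1 gives zero cost.
x_train = [1.0, 2.0, 3.0]
y_train = [3.0, 5.0, 7.0]
print(compute_cost(x_train, y_train, w=2.0, b=1.0))  # 0.0
```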

Gradient Descent Algorithm

Update Rules

Gradient descent minimizes the cost function by updating the parameters in the opposite direction of the gradient:

Update rule for w:

w \leftarrow w - \alpha \frac{\partial J(w, b)}{\partial w}

Update rule for b:

b \leftarrow b - \alpha \frac{\partial J(w, b)}{\partial b}

• α: The learning rate, controlling the step size during each iteration.
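
A single update step can be sketched as follows. The gradients dj_dw and dj_db are assumed to be computed already (the next section shows how), and all names are illustrative.

```python
# Sketch of one simultaneous gradient descent update for w and b,
# assuming the gradients dj_dw and dj_db have already been computed.
def update_parameters(w, b, dj_dw, dj_db, alpha):
    """Move w and b one step against their gradients with learning rate alpha."""
    w_new = w - alpha * dj_dw
    b_new = b - alpha * dj_db
    return w_new, b_new


# Example with illustrative numbers: alpha = 0.1, gradients 2.0 and -1.0.
print(update_parameters(0.5, 0.5, dj_dw=2.0, dj_db=-1.0, alpha=0.1))  # (0.3, 0.6)
```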

Computing the Gradients

Partial derivative with respect to w:

\frac{\partial J(w, b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}\left( x^{(i)} \right) - y^{(i)} \right) x^{(i)}

Partial derivative with respect to b:

\frac{\partial J(w, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}\left( x^{(i)} \right) - y^{(i)} \right)
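
These two derivatives translate directly into code. The sketch below (illustrative names, plain Python lists) computes both gradients over the full training set, i.e. batch gradient descent.

```python
# Sketch of the partial derivatives dJ/dw and dJ/db over the whole training set.
def compute_gradients(x, y, w, b):
    m = len(x)
    dj_dw = 0.0
    dj_db = 0.0
    for x_i, y_i in zip(x, y):
        error = (w * x_i + b) - y_i  # f_{w,b}(x_i) - y_i
        dj_dw += error * x_i         # contributes to dJ/dw
        dj_db += error               # contributes to dJ/db
    return dj_dw / m, dj_db / m


# Example with the toy data from the cost-function sketch, starting at w = 0, b = 0.
print(compute_gradients([1.0, 2.0, 3.0], [3.0, 5.0, 7.0], w=0.0, b=0.0))
# Approximately (-11.33, -5.0): both gradients are negative, so w and b will increase.
```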

Understanding Parameter Updates

Impact of a Negative Gradient on w

When the partial derivative ∂J(w, b)/∂w is a negative number (less than zero), what happens to w after one update step?

Explanation:

• The update rule for w is:

w \leftarrow w - \alpha \frac{\partial J(w, b)}{\partial w}

• If ∂J(w, b)/∂w < 0, then:

w \leftarrow w - \alpha (\text{negative number}) = w + \alpha |\text{negative number}|

• Since α > 0 and |negative number| > 0, the term α · |negative number| is positive.
• Therefore, w increases after the update.

Conclusion: When the gradient of the cost function with respect to w is negative, the parameter w increases during the gradient descent update.
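
A quick numeric check with made-up values: take w = 1.0, α = 0.1, and a gradient of -4.0.

```python
# Illustrative numbers only: a negative gradient makes w larger after one step.
w, alpha, dj_dw = 1.0, 0.1, -4.0
w_new = w - alpha * dj_dw  # 1.0 - 0.1 * (-4.0) = 1.4
print(w_new)               # 1.4 > 1.0, so w increased
```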

Update Step for Parameter bb

For linear regression, what is the update step for parameter b?

b \leftarrow b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}\left( x^{(i)} \right) - y^{(i)} \right)

Explanation:

• The gradient with respect to b is:

\frac{\partial J(w, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}\left( x^{(i)} \right) - y^{(i)} \right)

• Substituting this into the update rule:

b \leftarrow b - \alpha \frac{\partial J(w, b)}{\partial b}

• Thus, the update step for b is as given above.

Note: The update for b does not include the factor x^{(i)}, unlike the update for w.
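
Putting the two updates together, a full batch gradient descent loop might look like the sketch below. The data, starting values, and hyperparameters are illustrative.

```python
# Sketch of the full batch gradient descent loop for linear regression.
def gradient_descent(x, y, w, b, alpha, num_iters):
    m = len(x)
    for _ in range(num_iters):
        dj_dw = sum(((w * x_i + b) - y_i) * x_i for x_i, y_i in zip(x, y)) / m
        dj_db = sum(((w * x_i + b) - y_i) for x_i, y_i in zip(x, y)) / m
        w = w - alpha * dj_dw  # update w and b simultaneously,
        b = b - alpha * dj_db  # using gradients from the same (w, b)
    return w, b


x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [3.1, 4.9, 7.2, 8.8]  # roughly y = 2x + 1 with noise
w_fit, b_fit = gradient_descent(x_train, y_train, w=0.0, b=0.0,
                                alpha=0.05, num_iters=5000)
print(w_fit, b_fit)  # approaches roughly w ≈ 1.94, b ≈ 1.15 (the least-squares fit)
```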

Practical Implications

Effect of Gradient Sign on Parameter Updates

• Negative gradient (∂J/∂w < 0):
  • The parameter w increases.
  • Moves w in the direction that decreases the cost function.
• Positive gradient (∂J/∂w > 0):
  • The parameter w decreases.
  • Also aims to reduce the cost function.

Importance of the Learning Rate

• The learning rate α determines how big the update steps are.
• A small α may result in slow convergence.
• A large α may cause overshooting the minimum or divergence.
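
The toy experiment below illustrates both failure modes with made-up data and rates; it reuses the compute_cost and compute_gradients functions sketched earlier. A tiny α barely reduces the cost, while an overly large α makes it grow.

```python
# Illustrative comparison of three learning rates on a tiny dataset (y = 2x + 1),
# reusing compute_cost and compute_gradients from the earlier sketches.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [3.0, 5.0, 7.0, 9.0]

for alpha in (0.001, 0.05, 0.3):  # too small, reasonable, too large
    w, b = 0.0, 0.0
    for _ in range(100):
        dj_dw, dj_db = compute_gradients(x_train, y_train, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    print(alpha, compute_cost(x_train, y_train, w, b))

# Typical outcome:
#   alpha = 0.001 -> cost is still far from zero (slow convergence)
#   alpha = 0.05  -> cost is close to zero
#   alpha = 0.3   -> cost has grown very large (divergence)
```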

Conclusion

Understanding how gradient descent updates the parameters in linear regression is crucial for effectively training models. When the gradient with respect to a parameter is negative, the parameter increases; when the gradient is positive, the parameter decreases. The specific update rules for w and b reflect their roles in the model and drive the cost function toward its minimum.

By mastering these concepts, you can better tune your models and achieve higher predictive accuracy in your machine learning tasks.

Feel free to explore more about gradient descent variations, such as stochastic gradient descent and mini-batch gradient descent, to enhance your understanding and application of optimization algorithms in machine learning.