All about Gradient Descent

Linear regression is an important algorithm in machine learning. It models the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation (for example, a line) to observed data. This blog will walk through the mathematics behind linear regression models.

What is Linear Regression?

Linear regression finds a linear function that best predicts the target variable y based on predictor variables x. The model equation is:

y = m_0 + m_1 x_1 + m_2 x_2 + \ldots + m_n x_n

where m_0 is the intercept and m_1, m_2, …, m_n are the coefficients (weights) for the features x_1, x_2, …, x_n. In linear regression we aim to find the values of m_0, m_1, m_2, …, m_n that minimize the cost function.
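As a quick illustration (a minimal sketch of my own, not from the original post; the function and variable names are assumptions), the model's prediction is just the intercept plus a weighted sum of the features:

import numpy as np

def predict(x, m):
    # Linear-regression prediction: y = m_0 + m_1*x_1 + ... + m_n*x_n
    # x: feature vector of shape (n,); m: parameter vector of shape (n+1,),
    # where m[0] is the intercept and m[1:] are the feature weights.
    return m[0] + np.dot(m[1:], x)

# Example: intercept 1.0, weights 2.0 and -0.5, features [3.0, 4.0] -> 5.0
print(predict(np.array([3.0, 4.0]), np.array([1.0, 2.0, -0.5])))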

But, What is a Cost Function?

The cost function in linear regression is a mathematical measure of how well the model's predicted values match the actual target values in the training data. It quantifies the difference (error) between the predicted values ŷ and the true values y, representing this error as a single number that the learning algorithm tries to minimize. Now, how does the learning algorithm minimize it? The answer is gradient descent.
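For instance, the mean squared error cost that this post uses later can be written as a short Python function (a minimal sketch with names of my own; the 1/(2N) factor matches the formula given below):

import numpy as np

def mse_cost(y_pred, y_true):
    # Mean squared error with a 1/(2N) scaling, matching the J defined below.
    # Returns a single number: the lower it is, the better the fit.
    n = len(y_true)
    return np.sum((y_pred - y_true) ** 2) / (2 * n)

# Example: predictions [2, 3] vs. true values [1, 3] give a cost of 0.25
print(mse_cost(np.array([2.0, 3.0]), np.array([1.0, 3.0])))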

Gradient Descent

Gradient descent is used to find the global minimum of the cost function; the lower the cost, the better the model fits the data set. But how does it find the global minimum? Remember functions, differentiation, partial derivatives, etc.? We use these mathematical concepts to reach the global minimum. Let's understand the mathematics behind it.

Let’s consider a simple cost function, a parabolic function

y = x^2 - 4x - 3

[Figure: the parabolic cost function]
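As a quick analytical check (my own addition, not part of the original figure), setting the derivative of this parabola to zero confirms where its minimum lies:

\frac{dy}{dx} = 2x - 4 = 0 \;\Rightarrow\; x = 2, \qquad y = 2^2 - 4 \cdot 2 - 3 = -7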

Now from the graph it is clearly visible that at x = 2, y = -7 we have the global minimum. But it is not possible to trace a graph and spot the global minimum for every cost function, because cost functions can be as complex as the mean squared error function, where y depends not just on x but on multiple independent variables x_0, x_1, x_2, …, x_n. For example,

J(\hat{y}) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2

where ŷ_i is the predicted value, y_i is the true value, J(ŷ) is the cost function, N is the number of data points and n is the number of features in each data point. You see, it is very difficult to plot this cost function and observe the minimum; here mathematics comes to the rescue.

In gradient descent, the core idea is to take steps in the direction of steepest descent, that is, opposite to the gradient (the slope) at the current point. By walking down the hill step by step we will reach the minimum. Now, how do we find this mathematically? Let's take the example of the Mean Squared Error (MSE) function, our cost function for linear regression,

J(\hat{y}) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2

for the regression line y = mx + b. Our goal here is to find the optimal values of m (slope) and b (intercept) that minimize the cost function, so let's take the partial derivatives of J(m, b). Substituting ŷ_i = m x_i + b and taking the partial derivative with respect to m, we get,

\frac{\partial J(m,b)}{\partial m} = \frac{1}{N} \sum_{i=1}^{N} x_i \left( (m x_i + b) - y_i \right)

again, taking the partial derivative of J(m, b) with respect to b, we get,

\frac{\partial J(m,b)}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} \left( (m x_i + b) - y_i \right)

Let m_curr and b_curr be the current values of the slope and intercept, and let α be the learning rate; then our new slope and intercept will be,

m_{\text{new}} = m_{\text{curr}} - \alpha \, \frac{\partial J(m,b)}{\partial m}

b_{\text{new}} = b_{\text{curr}} - \alpha \, \frac{\partial J(m,b)}{\partial b}

We keep iterating over this process till the cost function converges, and once it converges we have our optimal values for the slope (m) and intercept (b).
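Here is a minimal sketch of this update loop in Python (the names, learning rate, iteration count and toy data are my own choices, not the original author's code):

import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=5000):
    # Fits y = m*x + b by repeatedly applying the update rules above.
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(iterations):
        y_pred = m * x + b
        # Partial derivatives of J(m, b) derived above
        dm = np.sum(x * (y_pred - y)) / n
        db = np.sum(y_pred - y) / n
        # Step opposite to the gradient, scaled by the learning rate alpha
        m -= alpha * dm
        b -= alpha * db
    return m, b

# Toy data lying close to the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
m, b = gradient_descent(x, y)
print(m, b)  # roughly 2 and 1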

Now, this was just to fit a simple straight line with one variable x. What if we have n variables? Well, in that case we make use of linear algebra along with multivariate calculus.

The General Case

In real life you will not get data whose outcome depends on a single parameter; instead there will be n independent parameters on which the outcome depends. So, how do we express this as a mathematical equation? Here come linear algebra and vectors. For our regression line y = mx + b, we can express each term in matrix form, for example,

X = \begin{bmatrix} x \\ 1 \end{bmatrix}, \quad M = \begin{bmatrix} m & b \end{bmatrix}

and, with Y being our outcome, we can now express our line as the dot product of two matrices,

Y = M \cdot X \quad \text{(Eq. 1)}

Expanding our idea further, we can now express a multivariate regression line

y = m_0 + m_1 x_1 + m_2 x_2 + \ldots + m_n x_n

as,

X = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{bmatrix}, \quad M = \begin{bmatrix} m_0 & m_1 & \cdots & m_{n-1} & m_n \end{bmatrix}

and to express our line we can take the dot product of M and X, just as in Eq. 1 above. Now let's express all of our calculations in terms of matrices.

\nabla_M J(M) = \frac{1}{N} \sum_{i=1}^{N} X_i^{T} \left( M \cdot X_i - Y_i \right)

where ∇_M J(M) is the gradient (Jacobian) expression that collects ∂J(m,b)/∂m and ∂J(m,b)/∂b into a single vector. Similarly, our new M would be

M_{\text{new}} = M_{\text{old}} - \alpha \nabla_M J(M)

We keep iterating over this process till the cost function converges, and once it converges we have our optimal value of M.
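A minimal vectorized sketch of this general case (a NumPy illustration with my own names, hyperparameters and toy data, not the original author's implementation):

import numpy as np

def gradient_descent_multi(X, y, alpha=0.1, iterations=10000):
    # X: data matrix of shape (N, n); y: targets of shape (N,).
    # A column of ones is prepended so that M[0] plays the role of m_0.
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])   # shape (N, n+1)
    M = np.zeros(Xb.shape[1])              # parameter vector, initialised to zero
    for _ in range(iterations):
        error = Xb @ M - y                 # M . X_i - Y_i for every data point
        grad = Xb.T @ error / N            # the gradient, nabla_M J(M)
        M -= alpha * grad                  # M_new = M_old - alpha * nabla_M J(M)
    return M

# Toy data generated exactly from y = 1 + 2*x1 - 3*x2
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [2.0, 3.0]])
y = 1 + 2 * X[:, 0] - 3 * X[:, 1]
print(gradient_descent_multi(X, y))  # roughly [1, 2, -3]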

So, this is it: we have covered the mathematics behind gradient descent and how to apply it to optimize the cost function of a model. We will further discuss its implementation using Python; till then, take care!
