All about Gradient Descent

Linear regression is an important algorithm in machine learning. It models the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation (for example, a line) to observed data. This blog will walk through the mathematics behind linear regression models.

What is Linear Regression?

Linear regression finds a linear function that best predicts the target variable y based on predictor variables x. The model equation is:

y = m_0 + m_1 x_1 + m_2 x_2 + \ldots + m_n x_n

where m_0 is the intercept and m_1, m_2, …, m_n are the coefficients (weights) for the features x_1, x_2, …, x_n. In linear regression we aim to find the values of m_0, m_1, m_2, …, m_n that minimize the cost function.
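As a quick illustration (a minimal sketch of my own, not from the original post; the function and variable names are assumptions), the model's prediction is just the intercept plus a weighted sum of the features:

import numpy as np

def predict(x, m):
    # Linear-regression prediction: y = m_0 + m_1*x_1 + ... + m_n*x_n
    # x: feature vector of shape (n,); m: parameter vector of shape (n+1,),
    # where m[0] is the intercept and m[1:] are the feature weights.
    return m[0] + np.dot(m[1:], x)

# Example: intercept 1.0, weights 2.0 and -0.5, features [3.0, 4.0] -> 5.0
print(predict(np.array([3.0, 4.0]), np.array([1.0, 2.0, -0.5])))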

But, What is a Cost Function?

The cost function in linear regression is a mathematical measure of how well the model's predicted values match the actual target values in the training data. It quantifies the difference (error) between the predicted values ŷ and the true values y, representing this error as a single number that the learning algorithm tries to minimize. Now, how does the learning algorithm minimize it? The answer is gradient descent.
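For instance, the mean squared error cost that this post uses later can be written as a short Python function (a minimal sketch with names of my own; the 1/(2N) factor matches the formula given below):

import numpy as np

def mse_cost(y_pred, y_true):
    # Mean squared error with a 1/(2N) scaling, matching the J defined below.
    # Returns a single number: the lower it is, the better the fit.
    n = len(y_true)
    return np.sum((y_pred - y_true) ** 2) / (2 * n)

# Example: predictions [2, 3] vs. true values [1, 3] give a cost of 0.25
print(mse_cost(np.array([2.0, 3.0]), np.array([1.0, 3.0])))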

Gradient Descent

Gradient descent is used to find the global minimum of the cost function; the lower the cost, the better the model fits the data set. But how does it find the global minimum? Remember functions, differentiation, partial derivatives, etc.? We use these mathematical concepts to reach the global minimum. Let's understand the mathematics behind it.

Let’s consider a simple cost function, a parabolic function

y = x^2 - 4x - 3

[Figure: the parabolic cost function]
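As a quick analytical check (my own addition, not part of the original figure), setting the derivative of this parabola to zero confirms where its minimum lies:

\frac{dy}{dx} = 2x - 4 = 0 \;\Rightarrow\; x = 2, \qquad y = 2^2 - 4 \cdot 2 - 3 = -7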

Now from the graph it is clearly visible that at x = 2, y = -7 we have the global minimum. But it is not possible to trace a graph and spot the global minimum for every cost function, because cost functions can be as complex as the mean squared error function, where y depends not just on x but on multiple independent variables x_0, x_1, x_2, …, x_n. For example,

J(\hat{y}) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2

where ŷ_i is the predicted value, y_i is the true value, J(ŷ) is the cost function, N is the number of data points and n is the number of features in each data point. You see, it is very difficult to plot this cost function and observe the minimum; here mathematics comes to the rescue.

In gradient descent, the core idea is to take steps in the direction of steepest descent, that is, opposite to the gradient (the slope) at the current point. By walking down the hill step by step we will reach the minimum. Now, how do we find this mathematically? Let's take the example of the Mean Squared Error (MSE) function, our cost function for linear regression,

J(\hat{y}) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2

for the regression line y = mx + b. Our goal here is to find the optimal values of m (slope) and b (intercept) that minimize the cost function, so let's take the partial derivatives of J(m, b). Substituting ŷ_i = m x_i + b and taking the partial derivative with respect to m, we get,

\frac{\partial J(m,b)}{\partial m} = \frac{1}{N} \sum_{i=1}^{N} x_i \left( (m x_i + b) - y_i \right)

again, taking the partial derivative of J(m, b) with respect to b, we get,

\frac{\partial J(m,b)}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} \left( (m x_i + b) - y_i \right)

Let m_curr and b_curr be the current values of the slope and intercept, and let α be the learning rate; then our new slope and intercept will be,

m_{\text{new}} = m_{\text{curr}} - \alpha \, \frac{\partial J(m,b)}{\partial m}

b_{\text{new}} = b_{\text{curr}} - \alpha \, \frac{\partial J(m,b)}{\partial b}

We keep iterating over this process till the cost function converges, and once it converges we have our optimal values for the slope (m) and intercept (b).
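Here is a minimal sketch of this update loop in Python (the names, learning rate, iteration count and toy data are my own choices, not the original author's code):

import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=5000):
    # Fits y = m*x + b by repeatedly applying the update rules above.
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(iterations):
        y_pred = m * x + b
        # Partial derivatives of J(m, b) derived above
        dm = np.sum(x * (y_pred - y)) / n
        db = np.sum(y_pred - y) / n
        # Step opposite to the gradient, scaled by the learning rate alpha
        m -= alpha * dm
        b -= alpha * db
    return m, b

# Toy data lying close to the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
m, b = gradient_descent(x, y)
print(m, b)  # roughly 2 and 1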

Now, this was just to fit a simple straight line with one variable x. What if we have n variables? Well, in that case we make use of linear algebra along with multivariate calculus.

The General Case

In real life you will not get data whose outcome depends on a single parameter; instead there will be n independent parameters on which the outcome depends. So, how do we express this as a mathematical equation? Here come linear algebra and vectors. For our regression line y = mx + b, we can express each term in matrix form, for example,

X = \begin{bmatrix} x \\ 1 \end{bmatrix}, \quad M = \begin{bmatrix} m & b \end{bmatrix}

and, with Y being our outcome, we can now express our line as the dot product of two matrices,

Y = M \cdot X \quad \text{(Eq. 1)}

Expanding our idea further, we can now express a multivariate regression line

y = m_0 + m_1 x_1 + m_2 x_2 + \ldots + m_n x_n

as,

X = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{bmatrix}, \quad M = \begin{bmatrix} m_0 & m_1 & \cdots & m_{n-1} & m_n \end{bmatrix}

and to express our line we can take the dot product of M and X, just as in Eq. 1 above. Now let's express all of our calculations in terms of matrices.

\nabla_M J(M) = \frac{1}{N} \sum_{i=1}^{N} X_i^{T} \left( M \cdot X_i - Y_i \right)

where ∇_M J(M) is the gradient (Jacobian) expression that collects ∂J(m,b)/∂m and ∂J(m,b)/∂b into a single vector. Similarly, our new M would be

M_{\text{new}} = M_{\text{old}} - \alpha \nabla_M J(M)

We keep iterating over this process till the cost function converges, and once it converges we have our optimal value of M.
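A minimal vectorized sketch of this general case (a NumPy illustration with my own names, hyperparameters and toy data, not the original author's implementation):

import numpy as np

def gradient_descent_multi(X, y, alpha=0.1, iterations=10000):
    # X: data matrix of shape (N, n); y: targets of shape (N,).
    # A column of ones is prepended so that M[0] plays the role of m_0.
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])   # shape (N, n+1)
    M = np.zeros(Xb.shape[1])              # parameter vector, initialised to zero
    for _ in range(iterations):
        error = Xb @ M - y                 # M . X_i - Y_i for every data point
        grad = Xb.T @ error / N            # the gradient, nabla_M J(M)
        M -= alpha * grad                  # M_new = M_old - alpha * nabla_M J(M)
    return M

# Toy data generated exactly from y = 1 + 2*x1 - 3*x2
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [2.0, 3.0]])
y = 1 + 2 * X[:, 0] - 3 * X[:, 1]
print(gradient_descent_multi(X, y))  # roughly [1, 2, -3]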

So, this is it: we have covered the mathematics behind gradient descent and how to apply it to optimize the cost function of a model. We will further discuss its implementation using Python; till then, take care!
