CS2109S AY 2024/25 Semester 2

Lecture 5 Math Explained - Linear Regression

Gradient Descent

Suppose we have $m$ examples $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$. Then, the mean squared error of a prediction function $h_{\boldsymbol{w}}$ is:

$$J_{MSE}(w) = \dfrac{1}{2m} \sum_{i=1}^m (h_{\boldsymbol{w}}(x^{(i)}) - y^{(i)})^2$$

Mean Squared Error

The factor of $2$ in the denominator simplifies the derivative calculation during gradient descent.

Differentiating the term $(h_{\boldsymbol{w}}(x^{(i)}) - y^{(i)})^2$ produces a constant factor of $2$, which cancels out exactly with this factor of $\dfrac{1}{2}$.
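To make the formula concrete, here is a minimal NumPy sketch of this loss; the function name mse_loss and the toy data are illustrative, not from the lecture.

```python
import numpy as np

def mse_loss(w, X, y):
    """J_MSE(w) = 1/(2m) * sum_i (w . x^(i) - y^(i))^2."""
    m = X.shape[0]            # number of examples
    residuals = X @ w - y     # h_w(x^(i)) - y^(i) for every example
    return (residuals @ residuals) / (2 * m)

# Toy data: 3 examples with 2 features each (illustrative values).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([5.0, 11.0, 17.0])
w = np.array([1.0, 2.0])
print(mse_loss(w, X, y))      # 0.0, since X @ w reproduces y exactly
```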

Suppose the hypothesis is $h_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}$, i.e. we can vary the weight $\boldsymbol{w}$ to minimize the loss. We should thus calculate the derivative of the loss function with respect to $\boldsymbol{w}$ (which tells us how changing $\boldsymbol{w}$ influences the loss), that is,

$$\begin{align} \dfrac{\partial J_{MSE}(w)}{\partial w} &= \dfrac{1}{2m} \sum_{i=1}^m \dfrac{\partial}{\partial w} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} - y^{(i)})^2 \\ &= \dfrac{1}{2m} \sum_{i=1}^m 2(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} - y^{(i)}) \dfrac{\partial}{\partial w} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} - y^{(i)}) \text{\ $\blacktriangleleft$\ chain rule} \\ &= \dfrac{1}{m} \sum_{i=1}^m (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} - y^{(i)}) \boldsymbol{x}^{(i)} \end{align}$$
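The derivation translates directly into a vectorized NumPy sketch; the helper names and the finite-difference check below are ours, and the data is illustrative.

```python
import numpy as np

def mse_loss(w, X, y):
    """J_MSE(w) = 1/(2m) * sum_i (w . x^(i) - y^(i))^2, as defined above."""
    r = X @ w - y
    return (r @ r) / (2 * X.shape[0])

def mse_gradient(w, X, y):
    """dJ_MSE/dw = (1/m) * sum_i (w . x^(i) - y^(i)) * x^(i), vectorized."""
    return X.T @ (X @ w - y) / X.shape[0]

# Sanity check: compare the formula against a finite-difference approximation.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([5.0, 11.0, 17.0])
w = np.array([0.5, -1.0])
eps = 1e-6
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = eps
    fd = (mse_loss(w + e, X, y) - mse_loss(w - e, X, y)) / (2 * eps)
    print(fd, mse_gradient(w, X, y)[j])   # the two values should match closely
```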

Remember that we want to minimize the loss. If $\dfrac{\partial J_{MSE}(w)}{\partial w} < 0$, we will want to increase $w$ such that the loss decreases. If $\dfrac{\partial J_{MSE}(w)}{\partial w} > 0$, we will want to decrease $w$ such that the loss decreases.

Therefore, we should update $\boldsymbol{w}$ as follows:

$$\boldsymbol{w} \leftarrow \boldsymbol{w} - \gamma \dfrac{\partial J_{MSE}(w)}{\partial w} \text{ where } \gamma > 0$$

Notice that the direction in which $\boldsymbol{w}$ is changed aligns with the intuition above. $\gamma$ is the step size, which controls the magnitude of the change.
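Putting the update rule into code, a minimal gradient-descent loop might look like the sketch below; the initialization, learning rate, iteration count, and data are arbitrary illustrations, not prescribed by the lecture.

```python
import numpy as np

def gradient_descent(X, y, gamma=0.1, num_iters=2000):
    """Repeatedly apply w <- w - gamma * dJ_MSE/dw."""
    m, d = X.shape
    w = np.zeros(d)                       # start from the zero vector
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y) / m      # gradient derived above
        w = w - gamma * grad              # step against the gradient
    return w

# Illustrative data generated from y = 2*x1 + 3*x2, so we expect w close to [2, 3].
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, 3.0])
print(gradient_descent(X, y))
```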

Note that $\dfrac{\partial J_{MSE}(w)}{\partial w}$ is only the derivative of the loss; its magnitude has nothing to do with the magnitude of $w$ itself. Therefore, we need a learning rate $\gamma$ to control the rate at which the weight $w$ is updated.

The only guarantee is that the direction of the update, given by this formula, is correct.

Since the update is only guaranteed to move in the correct direction, gradient descent assumes that the learning rate is small enough for the algorithm to converge. If the learning rate is too large, the update still points in the correct direction, but it may overshoot the minimum, which can cause the algorithm to diverge:

Learning Rate

Under a small enough learning rate $\gamma$, gradient descent is guaranteed to converge to the global minimum for convex functions and to a local minimum for non-convex functions.
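As a simple one-dimensional illustration of both behaviours, take $J(w) = w^2$, whose derivative is $2w$. The update becomes $w \leftarrow w - 2\gamma w = (1 - 2\gamma)w$. For $0 < \gamma < 1$ we have $|1 - 2\gamma| < 1$, so $w$ shrinks towards the minimum at $w = 0$; for $\gamma > 1$ we have $|1 - 2\gamma| > 1$, so $|w|$ grows at every step and the iterates diverge, even though each individual step points in the correct direction.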

Convex vs Non-convex Functions

Let's first look into the definition of a convex function:

A function is convex iff $f(\lambda \boldsymbol{x} + (1 - \lambda) \boldsymbol{x}') \leq \lambda f(\boldsymbol{x}) + (1 - \lambda) f(\boldsymbol{x}')$ for all $\boldsymbol{x}, \boldsymbol{x'}$ and $\lambda \in [0, 1]$.

Intuitively, if we draw a straight line segment between any two points on the graph of the function, the graph of $f$ must lie on or below this line segment.


Convex and Non-convex Functions (adapted from CS5339 Notes by Jonathan Scarlett)
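To see the definition in action on a familiar example, take $f(x) = x^2$. For any $x, x'$ and $\lambda \in [0, 1]$,

$$\lambda f(x) + (1 - \lambda) f(x') - f(\lambda x + (1 - \lambda) x') = \lambda x^2 + (1 - \lambda) x'^2 - (\lambda x + (1 - \lambda) x')^2 = \lambda (1 - \lambda)(x - x')^2 \geq 0,$$

so the inequality in the definition holds and $f$ is convex.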

This brings us to a key property of convex functions: any local minimum must also be a global minimum. To picture this, try to draw a convex function with a local minimum whose value is higher than the global minimum, then draw the line segment between the two minima. Convexity forces the function to lie on or below this segment, yet just beside the higher minimum the segment already drops below its value, so that point cannot actually be a local minimum, a contradiction.

Since gradient descent is guaranteed to bring us to a local minimum (under a small enough learning rate $\gamma$), this would also be a global minimum for convex functions.

For $d$-dimensional data, we will use an input vector $\boldsymbol{x}$ (instead of a scalar $x$) and a weight vector $\boldsymbol{w}$ (instead of a scalar $w$). The corresponding hypothesis would be $h_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}$ (the dot product between $\boldsymbol{w}$ and $\boldsymbol{x}$, i.e. multiplying each input feature $x_i$ by its corresponding weight $w_i$ and summing the products). Equivalently, this can be written as $h_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w}^{\top} \boldsymbol{x}$.

For ease of analysis, we usually append an extra dimension to the input vector $\boldsymbol{x}$ whose value is $1$ for all input samples. This works exactly the same as the bias term (the corresponding weight will be the magnitude of the bias term $b$).
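In code, this trick amounts to stacking a column of ones onto the data matrix; the variable names below are illustrative.

```python
import numpy as np

# Illustrative raw data: 4 examples with 2 features each.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.0],
                  [4.0, 2.5]])

# Append a constant feature of 1 to every example;
# its weight plays the role of the bias term b.
X = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])

w = np.array([2.0, -1.0, 0.5])   # the last entry acts as the bias b = 0.5
predictions = X @ w              # same as X_raw @ w[:2] + 0.5
print(predictions)
```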

Normal Equation

The normal equation gives an analytical solution to the linear regression problem. Let's first rewrite the loss function a little (we drop the constant factor $\dfrac{1}{2m}$, since scaling the loss by a positive constant does not change the minimizing $\boldsymbol{w}$):

$$\begin{align} J_{MSE}(\boldsymbol{w}) & = \sum_{i=1}^m (h_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}) - y^{(i)})^2 \\ & = \sum_{i=1}^m (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} - y^{(i)})^2 & \blacktriangleleft \text{ Our hypothesis is $h_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}) = \boldsymbol{w} \cdot \boldsymbol{x}^{(i)}$} \\ & = (\boldsymbol{X} \boldsymbol{w} - \boldsymbol{y})^{\top}(\boldsymbol{X} \boldsymbol{w} - \boldsymbol{y}) & \blacktriangleleft \text{ Dot product between the vector containing all $\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} - y^{(i)}$ and itself} \\ & = (\boldsymbol{X} \boldsymbol{w})^{\top} (\boldsymbol{X} \boldsymbol{w}) - \boldsymbol{y}^{\top} \boldsymbol{X} \boldsymbol{w} - (\boldsymbol{X} \boldsymbol{w})^{\top} \boldsymbol{y} + \boldsymbol{y}^{\top} \boldsymbol{y} \\ & = \boldsymbol{w}^{\top} \boldsymbol{X}^{\top} \boldsymbol{X} \boldsymbol{w} - 2 \boldsymbol{w}^{\top} \boldsymbol{X}^{\top} \boldsymbol{y} + \boldsymbol{y}^{\top} \boldsymbol{y} & \blacktriangleleft \text{ $(\boldsymbol{X}\boldsymbol{w})^{\top} = \boldsymbol{w}^{\top}\boldsymbol{X}^{\top}$, and the two cross terms are equal scalars} \end{align}$$

where $\boldsymbol{X}$ is the matrix whose rows are the input feature vectors $\boldsymbol{x}^{(i)\top}$ and $\boldsymbol{y}$ is the vector of all output values. Next, we set the derivative of the loss function to zero:

$$\begin{align} \dfrac{\partial J_{MSE}(\boldsymbol{w})}{\partial \boldsymbol{w}} = 2 \boldsymbol{X}^{\top} \boldsymbol{X} \boldsymbol{w} - 2 \boldsymbol{X}^{\top} \boldsymbol{y} & = 0 & \blacktriangleleft \text{ From matrix calculus cheatsheet: $\dfrac{\partial (\boldsymbol{x}^{\top} \boldsymbol{A} \boldsymbol{x})}{\partial \boldsymbol{x}} = 2 \boldsymbol{A} \boldsymbol{x}$ for symmetric $\boldsymbol{A}$ (here $\boldsymbol{A} = \boldsymbol{X}^{\top} \boldsymbol{X}$ is symmetric) and $\dfrac{\partial (\boldsymbol{x}^{\top} \boldsymbol{b})}{\partial \boldsymbol{x}} = \boldsymbol{b}$} \\ \boldsymbol{X}^{\top} \boldsymbol{X} \boldsymbol{w} & = \boldsymbol{X}^{\top} \boldsymbol{y} \\ \boldsymbol{w} & = (\boldsymbol{X}^{\top} \boldsymbol{X})^{-1} \boldsymbol{X}^{\top} \boldsymbol{y} \end{align}$$

Therefore, we have obtained a closed-form solution for $\boldsymbol{w}$, assuming $\boldsymbol{X}^{\top} \boldsymbol{X}$ is invertible.
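A direct NumPy translation of this result is sketched below (with illustrative data); it solves the linear system $\boldsymbol{X}^{\top} \boldsymbol{X} \boldsymbol{w} = \boldsymbol{X}^{\top} \boldsymbol{y}$ rather than forming the inverse explicitly, which is cheaper and numerically more stable, and it assumes $\boldsymbol{X}^{\top} \boldsymbol{X}$ is indeed invertible.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares solution w = (X^T X)^{-1} X^T y."""
    # Solve (X^T X) w = X^T y instead of inverting X^T X explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative data generated from y = 1.5*x1 - 2*x2 plus a little noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + 0.01 * rng.normal(size=200)
print(normal_equation(X, y))     # should be close to [1.5, -2.0]
```

Unlike gradient descent, this closed-form solution needs no learning rate or iterations, but solving the $d \times d$ system scales poorly when the number of features $d$ is very large.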


Last updated: 19 February 2025