CS2109S AY 2024/25 Semester 2

Lecture 6 Math Explained - Logistic Regression

Logistic Regression

We use logistic regression to perform classification tasks, i.e. estimating the probability of an event occurring. It can be viewed as a (linear) regression followed by a logistic function.


Formally, we define the hypothesis as follows:

$$h_{\boldsymbol{w}}(\boldsymbol{x}) = \sigma (\boldsymbol{x} \cdot \boldsymbol{w})$$

where $\sigma$ is the logistic function. We usually use the sigmoid function defined as follows:

$$\sigma(z) = \dfrac{1}{1 + e^{-z}}$$
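As a concrete illustration, here is a minimal NumPy sketch of this hypothesis (the weights and input below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(w, x):
    """Hypothesis: the linear combination x . w squashed through the sigmoid."""
    return sigmoid(np.dot(x, w))

w = np.array([0.5, -1.0, 2.0])  # illustrative weights
x = np.array([1.0, 3.0, 0.5])   # x[0] = 1 acts as the bias feature
print(h(w, x))                  # a probability in (0, 1), here about 0.18
```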

Remember that we want to predict the probability of an event occurring, say, $p$. However, we cannot build a linear regression model directly to predict $p$, since $p$ is capped within the range $[0, 1]$. Thus, we have to transform $p$ into a quantity that can take any real value.

Let's first define the odds $\dfrac{p}{1 - p}$, which is the ratio between the probability that the event occurs and the probability that it doesn't. Unlike $p$, which is capped within the range $[0, 1]$, the odds can take any non-negative value, i.e. $[0, +\infty)$.

Next, we take the natural log of the odds, which we call the logit function:

$$\text{logit}(p) = \log \left( \dfrac{p}{1 - p} \right), \quad \text{where } p = P(y = 1 \vert \boldsymbol{x})$$

The logit can take any real value, i.e. $(-\infty, \infty)$. That's exactly what we wanted. We can thus build a linear regression model relating our input variable $\boldsymbol{x}$ to the logit:

$$\log \left( \dfrac{p}{1 - p} \right) = \boldsymbol{x} \cdot \boldsymbol{w}$$

Remember that our goal is to predict the value of $p$, not $\log \left( \dfrac{p}{1 - p} \right)$, so we solve for $p$:

$$\begin{align} \log \left( \dfrac{p}{1 - p} \right) &= \boldsymbol{x} \cdot \boldsymbol{w} \\ \dfrac{p}{1 - p} &= e^{\boldsymbol{x} \cdot \boldsymbol{w}} \\ \dfrac{1 - p}{p} &= e^{-(\boldsymbol{x} \cdot \boldsymbol{w})} \\ \dfrac{1}{p} - 1 &= e^{-(\boldsymbol{x} \cdot \boldsymbol{w})} \\ \dfrac{1}{p} &= 1 + e^{-(\boldsymbol{x} \cdot \boldsymbol{w})} \\ p &= \dfrac{1}{1 + e^{-(\boldsymbol{x} \cdot \boldsymbol{w})}} \end{align}$$

This is exactly the sigmoid function!

This is just a quick intuition - optionally, you can watch this video for a more rigorous understanding.
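We can also sanity-check this algebra numerically - a small sketch confirming that the logit and the sigmoid are inverses of each other:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds of p: the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

# For any real z, logit(sigmoid(z)) recovers z, so modelling
# logit(p) = x . w is equivalent to p = sigmoid(x . w).
z = np.linspace(-5, 5, 11)
print(np.allclose(logit(sigmoid(z)), z))  # True
```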

Remember that gradient descent is only guaranteed to reach the global minimum when the loss function is convex, which is not the case if we use mean squared error with this hypothesis. Thus, we use the (binary) cross-entropy loss instead:

$$BCE(\hat{y}) = \begin{cases} - \log(\hat{y}) & \text{if } y = 1 \\ - \log(1 - \hat{y}) & \text{if } y = 0 \end{cases}$$

Or equivalently,

$$BCE(\hat{y}) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$$

where $\hat{y}$ is the predicted probability that $y = 1$.
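As a sketch, this loss is a few lines of NumPy (the eps clipping is a standard practical guard against $\log(0)$, not part of the formula above):

```python
import numpy as np

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy, averaged over samples.

    y     : true labels in {0, 1}
    y_hat : predicted probabilities that y = 1
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

y     = np.array([1, 0, 1, 0])
y_hat = np.array([0.9, 0.1, 0.6, 0.4])
print(bce(y, y_hat))  # small loss, since the predictions agree with the labels
```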

Recall that when we discussed entropy (for decision trees), we introduced the idea of information content - the "surprise" of encountering a certain value:

$$I(e) = \log \left( \dfrac{1}{P(e)} \right) = - \log P(e)$$

The same "surprise" appears when an incorrect prediction occurs. For example, when $y = 1$, we will be surprised when $\hat{y}$ is close to 0 (thus the "surprisal" approaches infinity), and we won't be surprised when $\hat{y}$ is close to 1 (thus the "surprisal" appproaches 0).

When $y = 0$, the same reasoning applies with $\hat{y}$ and $1 - \hat{y}$ swapped.
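The numbers below make the $y = 1$ case concrete:

```python
import numpy as np

# Surprisal -log(y_hat) when y = 1: confident correct predictions carry
# almost no surprise, while confident wrong ones blow up.
for y_hat in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"y_hat = {y_hat:4}: surprisal = {-np.log(y_hat):.3f}")
# y_hat = 0.99: surprisal = 0.010
# y_hat =  0.9: surprisal = 0.105
# y_hat =  0.5: surprisal = 0.693
# y_hat =  0.1: surprisal = 2.303
# y_hat = 0.01: surprisal = 4.605
```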

Again, this is just a quick intuition - optionally, you can watch this video for a more rigorous understanding.

Bias and Variance

Let's try to understand bias and variance formally. Suppose we train a model $h_{\boldsymbol{w}}$ based on a hypothesis class $H$. By sampling different subsets of the data as training samples, we obtain different trained models $h_{\boldsymbol{w}}$.



Figure: Two samples drawn from the same population and their respective models.

Plotting all the models (each trained on a different sample) together, we can see how much the trained models vary.

We are interested in the generalization ability of our model (can it handle unseen data well?), i.e. the expected squared error when we plug in an input $\boldsymbol{x}$:

$$\text{Err}(\boldsymbol{x}) = \mathbb{E}[(y - h_{\boldsymbol{w}}(\boldsymbol{x}))^2]$$

Note that $\boldsymbol{x}$ and $h_{\boldsymbol{w}}$ are both random variables - we may be given a random sample to train our model, and a random input to test it.

Equivalently, we can write the error as follows:

$$\text{Err}(\boldsymbol{x}) = \left(\underbrace{y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]}_{\text{bias}} \right) ^2 + \underbrace{\mathbb{E}\left[\left(\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x})\right)^2\right]}_{\text{variance}}$$

We first write

$$\begin{align} \text{Err}(\boldsymbol{x}) &= \mathbb{E}[(y - h_{\boldsymbol{w}}(\boldsymbol{x}))^2] \\ &= \mathbb{E}[(y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] + \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x}))^2] \\ &= \mathbb{E}[(y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})])^2] + 2 \mathbb{E}\left[(y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]) \cdot (\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x}))\right] + \mathbb{E}[(h_{\boldsymbol{w}}(\boldsymbol{x}) - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})])^2] \end{align}$$

We then examine each of the terms independently:

  • For the first term, since $y$ and $\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]$ are constants with respect to the expectation, we have $$\begin{align} \mathbb{E}[(y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})])^2] &= \mathbb{E}[y^2] - 2 \mathbb{E}[y \cdot \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]] + \mathbb{E}[\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]^2] \\ &= y^2 - 2 y \cdot \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] + \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]^2 \\ &= (y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})])^2 \end{align}$$ which is the square of the bias.
  • For the second term, we have $$\begin{align} \mathbb{E}\left[(y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]) \cdot (\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x}))\right] &= \mathbb{E}[y \cdot \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]] - \mathbb{E}[y \cdot h_{\boldsymbol{w}}(\boldsymbol{x})] - \mathbb{E}[\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]^2] + \mathbb{E}[\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] \cdot h_{\boldsymbol{w}}(\boldsymbol{x})] \\ &= y \cdot \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - y \cdot \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]^2 + \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]^2 \\ &= 0 \end{align}$$ so the cross term vanishes.
  • The third term is exactly the variance.

Therefore, we have

$$\text{Err}(\boldsymbol{x}) = \left(\underbrace{y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]}_{\text{bias}} \right) ^2 + \underbrace{\mathbb{E}\left[\left(\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x})\right)^2\right]}_{\text{variance}}$$

In practice, it is impossible to attain $\text{Err}(\boldsymbol{x}) = 0$ since the observed $y$ values themselves contain noise. We therefore model $y = f(\boldsymbol{x}) + \varepsilon$, where $f$ is the true underlying function and $\varepsilon$ is a zero-mean noise term with variance $\sigma^2$.

In this case, we can write

$$\text{Err}(\boldsymbol{x}) = \left(\underbrace{f(\boldsymbol{x}) - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]}_{\text{bias}} \right) ^2 + \underbrace{\mathbb{E}\left[\left(\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x})\right)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{true error}}$$

where $\sigma^2$ is the true (irreducible) error, and the bias is now measured against the true function $f(\boldsymbol{x})$ since $y$ itself is noisy.

Of these terms, only two - the bias and the variance - depend on our choice of model; the true error is irreducible.
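We can verify the decomposition empirically with a small Monte-Carlo sketch; the true function, noise level, and test point below are all made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: np.sin(x)  # assumed "true" function, for illustration only
sigma = 0.3              # assumed noise standard deviation
x0 = 1.5                 # a fixed test input

# Train many linear (degree-1) models, each on an independent noisy sample
preds = []
for _ in range(5000):
    x_train = rng.uniform(0, 3, size=20)
    y_train = f(x_train) + rng.normal(0, sigma, size=20)
    preds.append(np.polyval(np.polyfit(x_train, y_train, 1), x0))
preds = np.array(preds)

bias_sq  = (f(x0) - preds.mean()) ** 2
variance = preds.var()

# Empirical Err(x0): squared error against fresh noisy labels y = f(x0) + eps
y_test = f(x0) + rng.normal(0, sigma, size=preds.size)
err = np.mean((y_test - preds) ** 2)

print(f"bias^2 + variance + sigma^2 = {bias_sq + variance + sigma**2:.3f}")
print(f"empirical Err(x0)           = {err:.3f}")  # the two should match closely
```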

Let's now investigate the relationship between the bias and the variance:

Figure: Bias vs. variance.

This results in the bias-variance tradeoff: Trying to reduce the bias might lead to a higher variance, and vice versa. To minimize the generalization error $\text{Err}(\boldsymbol{x})$, we need to strike a balance between the bias and variance.
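To see the tradeoff numerically, we can extend the sketch above by varying the hypothesis-class complexity (here, the polynomial degree; the setup is again made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(x)  # same made-up setup as the previous sketch
sigma, x0 = 0.3, 1.5

for deg in [1, 3, 5]:    # increasing model complexity
    preds = []
    for _ in range(2000):
        x_train = rng.uniform(0, 3, size=20)
        y_train = f(x_train) + rng.normal(0, sigma, size=20)
        preds.append(np.polyval(np.polyfit(x_train, y_train, deg), x0))
    preds = np.array(preds)
    print(f"degree {deg}: bias^2 = {(f(x0) - preds.mean())**2:.4f}, "
          f"variance = {preds.var():.4f}")
# Typically, bias^2 shrinks as the degree grows, while the variance grows.
```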


Last updated: 14 February 2025