CS2109S AY 2024/25 Semester 2
Lecture 6 Math Explained - Logistic Regression
Logistic Regression
We use logistic regression to perform classification tasks, i.e. estimating the probability of an event occurring. It can be viewed as a (linear) regression followed by applying a logistic function.
Formally, we define the hypothesis as follows:
$$h_{\boldsymbol{w}}(\boldsymbol{x}) = \sigma (\boldsymbol{x} \cdot \boldsymbol{w})$$
where $\sigma$ is the logistic function. We usually use the sigmoid function defined as follows:
$$\sigma(z) = \dfrac{1}{1 + e^{-z}}$$
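To make this concrete, here is a minimal NumPy sketch of the hypothesis. The feature vector is assumed to include a leading bias feature of 1, and the weights are made up for illustration, not learned:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(x, w):
    """Logistic regression hypothesis: linear combination, then sigmoid."""
    return sigmoid(np.dot(x, w))

# Example: x includes a leading 1 as the bias feature (an assumption here).
x = np.array([1.0, 2.0, 0.5])
w = np.array([-1.0, 0.8, 0.3])   # illustrative weights
print(h(x, w))                   # a probability in (0, 1)
```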
Remember that we want to predict the probability of an event occurring, say, $p$. However, we cannot build a linear regression model directly to predict $p$, since $p$ is capped within the range $[0, 1]$. Thus, we have to transform $p$ into a quantity that can take any real value.
Let's first define the odds $\dfrac{p}{1 - p}$, which is the ratio between the probability that the event occurs and the probability that it doesn't. Unlike $p$, which is capped within the range $[0, 1]$, the odds can take any value in $[0, +\infty)$.
Next, we take the natural log of the odds, which gives the logit function:
$$\text{logit}(p) = \log \left( \dfrac{p}{1 - p} \right), \quad \text{where } p = P(y = 1 \vert \boldsymbol{x})$$
The logit function can take any real value, i.e. $(-\infty, \infty)$. That's exactly what we wanted. We could thus build a linear regression model to model the relationship between our input variable $\boldsymbol{x}$ and the logit function:
$$\log \left( \dfrac{p}{1 - p} \right) = \boldsymbol{x} \cdot \boldsymbol{w}$$
Remember our goal is to predict the value of $p$, not $\log \left( \dfrac{p}{1 - p} \right)$, thus we have
$$\begin{align} \log \left( \dfrac{p}{1 - p} \right) &= \boldsymbol{x} \cdot \boldsymbol{w} \\ \dfrac{p}{1 - p} &= e^{\boldsymbol{x} \cdot \boldsymbol{w}} \\ \dfrac{1 - p}{p} &= e^{-(\boldsymbol{x} \cdot \boldsymbol{w})} \\ \dfrac{1}{p} - 1 &= e^{-(\boldsymbol{x} \cdot \boldsymbol{w})} \\ \dfrac{1}{p} &= 1 + e^{-(\boldsymbol{x} \cdot \boldsymbol{w})} \\ p &= \dfrac{1}{1 + e^{-(\boldsymbol{x} \cdot \boldsymbol{w})}} \end{align}$$
This is exactly the sigmoid function!
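As a quick numeric sanity check (with an arbitrary $p$), the sigmoid indeed inverts the logit:

```python
import numpy as np

p = 0.8
z = np.log(p / (1 - p))          # logit: probability -> log-odds
p_back = 1 / (1 + np.exp(-z))    # sigmoid: log-odds -> probability
print(z, p_back)                 # z ≈ 1.386, p_back == 0.8 (up to float error)
```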
This is just a quick intuition - optionally, you can watch the videos in the references for a more rigorous understanding.
Remember that gradient descent is only guaranteed to reach the global minimum when the loss function is convex, which is not the case if we use the mean squared error here (composing MSE with the sigmoid makes the loss non-convex in $\boldsymbol{w}$). Thus, we use the (binary) cross-entropy loss instead:
$$BCE(\hat{y}) = \begin{cases} - \log(\hat{y}) & \text{if } y = 1 \\ - \log(1 - \hat{y}) & \text{if } y = 0 \end{cases}$$
Or equivalently,
$$BCE(\hat{y}) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$$
where $\hat{y}$ is the predicted probability that $y = 1$.
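As an illustration, here is a hedged sketch of BCE together with a plain gradient-descent training loop. It uses the fact that the gradient of BCE composed with the sigmoid is $(\hat{y} - y)\,\boldsymbol{x}$ per example; the toy dataset, learning rate, and iteration count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy; clip predictions to avoid log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).mean()

# Toy 1-D dataset with a bias column prepended (illustrative values).
X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.5], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
lr = 0.5                                # arbitrary learning rate
for _ in range(1000):
    y_hat = sigmoid(X @ w)
    grad = X.T @ (y_hat - y) / len(y)   # gradient of mean BCE w.r.t. w
    w -= lr * grad

print(w, bce(y, sigmoid(X @ w)))        # learned weights and final loss
```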
Recall that when we discussed entropy (for decision trees), we introduced the idea of information content - the "surprise" when encountering a certain value:
$$I(e) = \log \left( \dfrac{1}{P(e)} \right) = - \log P(e)$$
The same "surprise" appears when an incorrect prediction occurs. For example, when $y = 1$, we will be surprised when $\hat{y}$ is close to 0 (thus the "surprisal" approaches infinity), and we won't be surprised when $\hat{y}$ is close to 1 (thus the "surprisal" appproaches 0).
When $y = 0$, we have to invert our values accordingly.
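A quick sketch of how the surprisal behaves when $y = 1$ (the probed values are arbitrary):

```python
import numpy as np

# Surprisal -log(y_hat) when the true label is y = 1:
for y_hat in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"y_hat = {y_hat:>4}: surprisal = {-np.log(y_hat):.3f}")
# Confident correct predictions cost almost nothing;
# confident wrong predictions blow up toward infinity.
```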
Again, this is just a quick intuition - optionally, you can watch the videos in the references for a more rigorous understanding.
Bias and Variance
Let's try to understand bias and variance formally. Suppose we train a model $h_{\boldsymbol{w}}$ from a hypothesis class $H$. By sampling different subsets of the data as training samples, we obtain different models $h_{\boldsymbol{w}}$.
(Figure: two samples drawn from the same population and their respective models.)
Plotting all the models (trained on different samples) together, we can see how the fitted hypothesis varies from sample to sample.
We are interested in knowing the generalization ability of our model (can it handle unseen data well?), i.e. the expected squared error when we plug in an input $\boldsymbol{x}$.
$$\text{Err}(\boldsymbol{x}) = \mathbb{E}[(y - h_{\boldsymbol{w}}(\boldsymbol{x}))^2]$$
Note that $\boldsymbol{x}$ and $h_{\boldsymbol{w}}$ are both random variables - the training sample used to fit the model is random, and so is the input used to test it.
Equivalently, we can write the error as follows:
$$\text{Err}(\boldsymbol{x}) = \left(\underbrace{y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]}_{\text{bias}} \right) ^2 + \underbrace{\mathbb{E}\left[\left(\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x})\right)^2\right]}_{\text{variance}}$$
We first write
$$\begin{align} \text{Err}(\boldsymbol{x}) &= \mathbb{E}[(y - h_{\boldsymbol{w}}(\boldsymbol{x}))^2] \\ &= \mathbb{E}[(y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] + \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x}))^2] \\ &= \mathbb{E}\left[(y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})])^2\right] + 2 \mathbb{E}\left[(y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]) (\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x}))\right] + \mathbb{E}\left[(\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x}))^2\right] \end{align}$$
We then look into each of the terms independently:
- For the first term, note that $y$ and $\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]$ are both constants (we treat $y$ as deterministic for now), so $$\mathbb{E}\left[(y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})])^2\right] = (y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})])^2$$ which is the square of the bias.
- For the second term, we have $$\begin{align} \mathbb{E}\left[(y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]) \cdot (\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x}))\right] &= \mathbb{E}(y \cdot \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]) - \mathbb{E}(y \cdot h_{\boldsymbol{w}}(\boldsymbol{x})) - \mathbb{E}(\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]^2) + \mathbb{E}(\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] \cdot h_{\boldsymbol{w}}(\boldsymbol{x})) \\ &= y \cdot \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - y \cdot \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]^2 + \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]^2 \\ &= 0 \end{align}$$ which is 0.
- The third term is exactly the variance.
Therefore, we have
$$\text{Err}(\boldsymbol{x}) = \left(\underbrace{y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]}_{\text{bias}} \right) ^2 + \underbrace{\mathbb{E}\left[\left(\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x})\right)^2\right]}_{\text{variance}}$$
In practice, it is impossible to attain $\text{Err}(\boldsymbol{x}) = 0$ because the observed $y$ values themselves contain noise: $y = f(\boldsymbol{x}) + \varepsilon$, where $\varepsilon$ is a zero-mean noise term.
In that case, we can write
$$\text{Err}(\boldsymbol{x}) = \left(\underbrace{y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]}_{\text{bias}} \right) ^2 + \underbrace{\mathbb{E}\left[\left(\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x})\right)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{true error}}$$
where $\sigma^2 = \text{Var}(\varepsilon)$ is the irreducible ("true") error.
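This decomposition can be verified empirically. Below is a Monte Carlo sketch under assumed choices (true function $f(x) = x^2$, Gaussian noise, a linear hypothesis class): we fit a model on many random training sets and compare the simulated $\text{Err}(\boldsymbol{x})$ at a test point against $\text{bias}^2 + \text{variance} + \sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2            # arbitrary "true" function for the demo
sigma = 0.3                     # noise std; sigma**2 is the irreducible error
x_test, n_train, n_trials = 0.7, 20, 5000

preds = np.empty(n_trials)
for t in range(n_trials):
    # Draw a fresh training set and fit a degree-1 (linear) model.
    x = rng.uniform(-1, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    w = np.polyfit(x, y, deg=1)
    preds[t] = np.polyval(w, x_test)

bias_sq = (f(x_test) - preds.mean()) ** 2
variance = preds.var()
# Simulated generalization error at x_test (fresh noisy y each trial):
err = ((f(x_test) + rng.normal(0, sigma, n_trials) - preds) ** 2).mean()

print(f"bias^2 + variance + sigma^2 = {bias_sq + variance + sigma**2:.4f}")
print(f"simulated Err(x)            = {err:.4f}")
```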
Let's look at the bias and variance terms more closely:
- Bias: This is the expected error of the model even when trained with infinite training data.
  For example, consider the dataset shown above. If we restrict our hypothesis class to linear models, even the best linear model cannot fit the data perfectly.
  This best linear model corresponds to the expected prediction $\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]$.
  The bias is thus the true value $y$ minus the expected prediction, i.e. $y - \mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]$.
- Variance: This is the error of the model due to the sampling of the training data. It is the expected squared difference between the prediction of a model trained on a particular (finite) training set of size $N$ and the prediction of the best (expected) model.
  Taking the same dataset as an example, consider the prediction at a point $\boldsymbol{x}$ for each of the trained models: each prediction $h_{\boldsymbol{w}}(\boldsymbol{x})$ deviates from the best model's prediction $\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})]$.
  Mathematically, the variance is defined as $\mathbb{E}\left[\left(\mathbb{E}[h_{\boldsymbol{w}}(\boldsymbol{x})] - h_{\boldsymbol{w}}(\boldsymbol{x})\right)^2\right]$.
Let's now investigate the relationship between the bias and the variance:
- For the first figure, even the best (expected) model in the hypothesis class is far off from the true values. This means the model has a high bias. The variance is low since the models trained on different samples are all similar to each other.
- For the last figure, the predictions can change drastically across models trained on different samples. This means the model has a high variance. The bias is low since the best (expected) model is close to the true values.
This results in the bias-variance tradeoff: Trying to reduce the bias might lead to a higher variance, and vice versa. To minimize the generalization error $\text{Err}(\boldsymbol{x})$, we need to strike a balance between the bias and variance.
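To see the tradeoff numerically, we can rerun a simulation like the one above while varying the polynomial degree of the hypothesis class (again an illustrative setup, not from the lecture): low degrees give high bias and low variance, and high degrees the reverse.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)     # arbitrary nonlinear "true" function
sigma, x_test, n_train, n_trials = 0.3, 0.7, 20, 2000

for deg in [0, 1, 3, 9]:
    preds = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(-1, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        preds[t] = np.polyval(np.polyfit(x, y, deg), x_test)
    print(f"degree {deg}: bias^2 = {(f(x_test) - preds.mean())**2:.4f}, "
          f"variance = {preds.var():.4f}")
```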
References
- Logistic Regression
- Sigmoid and SoftMax Functions in 5 minutes
- A Visual Understanding of Bias and Variance
- Mathematical definition of the bias-variance tradeoff
- Bias-variance tradeoff - Derivation
Last updated: 14 February 2025