CS2109S AY 2024/25 Semester 2
Lecture 7 Math Explained - Support Vector Machines
The Problem
Calculating the Margin
Primal Formulation
Lagrange Multipliers and Duality
Dual Formulation
Comparing the two Formulations
Kernel Trick
- Before we start...
-
This topic is pretty mathematically heavy (the mathematical tools used are around the MA 2k to 3k level, and you would probably have to take a 5k CS course to deep dive into them in an ML context). I'll try my best to design this set of notes so that it is rigorous enough for a math student, yet understandable for a CS student without math inclinations. I'll put all optional content in gray, e.g. mathematical proofs. Please feel free to skip them.
This is an example of optional content.
The Problem
In the logistic regression lecture, we have used the following hypothesis for classification problems: $$h_{\boldsymbol{w}}(\boldsymbol{x}) = \sigma (\boldsymbol{w} \cdot \boldsymbol{x})$$
We classify points such that $\boldsymbol{w} \cdot \boldsymbol{x} > 0$ as positive, and points such that $\boldsymbol{w} \cdot \boldsymbol{x} < 0$ as negative. Here, we call the hyperplane $\boldsymbol{w} \cdot \boldsymbol{x} = 0$ the decision boundary - the boundary that separates the positive and negative samples.
A natural question arises: Can we choose a decision boundary that best separates the positive and negative data? Mathematically, we could compute the margin - the distance from the decision boundary to the nearest positive / negative sample. The best decision boundary would maximize the margin.

Calculating the Margin
Starting from now, we use $y \in \{-1, 1\}$ (instead of $y \in \{0, 1\}$) for the class labels. Negative samples are denoted by $-1$ to make things more "symmetric". We also write out the bias term explicitly (because we will need it later), so the hypothesis becomes $h_{\boldsymbol{w}}(\boldsymbol{x}) = \sigma(\boldsymbol{w} \cdot \boldsymbol{x} + b)$.
Recall that we're classifying the data according to $\boldsymbol{w} \cdot \boldsymbol{x} + b$. If we would like to create a margin between the positive and negative data, then we should require that $\boldsymbol{w} \cdot \boldsymbol{x} + b$ stays away from zero: $$\begin{align*} \boldsymbol{w} \cdot \boldsymbol{x}^{+} + b & \geq c \\ \boldsymbol{w} \cdot \boldsymbol{x}^{-} + b & \leq -c \end{align*} \ \text{ for some $c > 0$}$$
But then... notice we can scale the weight $\boldsymbol{w}$ and the bias $b$ arbitrarily. Instead of placing a free variable $c$ here, we can simply scale $\boldsymbol{w}$ and $b$ by a factor of $\frac{1}{c}$ to fix the right hand side to $1$: $$\begin{align*} \boldsymbol{w} \cdot \boldsymbol{x}^{+} + b & \geq 1 \\ \boldsymbol{w} \cdot \boldsymbol{x}^{-} + b & \leq -1 \end{align*}$$
Next, we take advantage that $y^{+} = 1$ and $y^{-} = -1$, and merge these two conditions to a single one: $$y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) \geq 1 \text{ for all samples $(\boldsymbol{x}^{(i)}, y^{(i)})$}$$
Ideally, we would want the decision boundary to be the "median" line separating the positive and negative samples. That is, the closest negative sample (whose value of $\boldsymbol{w} \cdot \boldsymbol{x} + b$ is negative but closest to $0$) should be as far from $0$ as the closest positive sample (whose value of $\boldsymbol{w} \cdot \boldsymbol{x} + b$ is positive and closest to $0$). Hence, we have the following: $$\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b = \pm 1 \text{ for support vectors $(\boldsymbol{x}^{(i)}, y^{(i)})$}$$
Due to this "median" argument, we should "tune" the bias $b$ such that the support vectors on the negative hyperplane satisfy $\boldsymbol{w} \cdot \boldsymbol{x} + b = -c$ and the support vectors on the positive hyperplane satisfy $\boldsymbol{w} \cdot \boldsymbol{x} + b = c$.
However, recall that we can scale $\boldsymbol{w}$ and $b$ accordingly. We scale $\boldsymbol{w}$ and $b$ by a factor of $\frac{1}{c}$ such that the right hand side becomes $\pm 1$. This is how we got this equation.
In summary, this equation is the result of adjusting the values of $b$ and $\boldsymbol{w}$.
With this formulation, we can calculate the size of the margin with a dot product.
- Math Refresher: Dot Product
-
Let $\boldsymbol{a}$ and $\boldsymbol{b}$ be vectors, and $\boldsymbol{u}$ be the unit vector with the same direction as $\boldsymbol{b}$. The length of the projection of $\boldsymbol{a}$ onto $\boldsymbol{b}$ would be: $$\lVert \boldsymbol{a} \rVert \cos \theta = \boldsymbol{a} \cdot \boldsymbol{u} = \boldsymbol{a} \cdot \dfrac{\boldsymbol{b}}{\lVert \boldsymbol{b} \rVert}$$
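For instance, a quick NumPy check of this projection formula (the vectors are made up):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([2.0, 0.0])
u = b / np.linalg.norm(b)   # unit vector along b
print(np.dot(a, u))         # length of the projection of a onto b: 3.0
```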
Note that $\boldsymbol{w}$ is a vector orthogonal (i.e. perpendicular) to the hyperplane $\boldsymbol{w} \cdot \boldsymbol{x} + b = 0$, so the margin would simply be a projection onto the vector $\boldsymbol{w}$.
Which vector to project though? Of course we're projecting the support vectors! We could pick two support vectors, one on the positive hyperplane and one on the negative hyperplane, and project the vector $\boldsymbol{x}^+ - \boldsymbol{x}^-$ on $\boldsymbol{w}$. This would be our margin!

Let's calculate the size of the margin now: $$\begin{align} \text{margin} & = (\boldsymbol{x}^+ - \boldsymbol{x}^-) \cdot \dfrac{\boldsymbol{w}}{\lVert \boldsymbol{w} \rVert} & \text{$\blacktriangleleft$ Projection (see above)} \\ & = \dfrac{\boldsymbol{w} \cdot \boldsymbol{x}^{+} - \boldsymbol{w} \cdot \boldsymbol{x}^{-}}{\lVert \boldsymbol{w} \rVert} & \text{$\blacktriangleleft$ Expand} \\ & = \dfrac{(1 - b) - (-1 - b)}{\lVert \boldsymbol{w} \rVert} & \text{$\blacktriangleleft$ Plug in $\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} = \pm 1 - b$ above}\\ & = \dfrac{2}{\lVert \boldsymbol{w} \rVert} \end{align}$$
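Here is a quick numerical sanity check of this formula (a minimal NumPy sketch; the weight, bias, and support vectors are made up, constructed so that $\boldsymbol{w} \cdot \boldsymbol{x} + b = \pm 1$):

```python
import numpy as np

w = np.array([3.0, 4.0])             # ||w|| = 5
b = -2.0
# Construct points on the two margin hyperplanes: w . x + b = +1 and -1.
x_pos = (1 - b) * w / np.dot(w, w)   # w . x_pos + b = +1
x_neg = (-1 - b) * w / np.dot(w, w)  # w . x_neg + b = -1

margin_by_projection = np.dot(x_pos - x_neg, w / np.linalg.norm(w))
margin_by_formula = 2 / np.linalg.norm(w)
print(margin_by_projection, margin_by_formula)   # both 0.4
```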
Primal Formulation
Summarizing what we did above, let's formulate SVM as an optimization problem!
We are finding a weight $\boldsymbol{w}$ that maximizes the margin, subject to the condition that all points are classified correctly.
- The size of the margin is given by $\dfrac{2}{\lVert \boldsymbol{w} \rVert}$. This would be our objective function (to be maximized/minimized).
- All points are classified correctly if and only if $y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) \geq 1$ for all samples $(\boldsymbol{x}^{(i)}, y^{(i)})$ (see the derivation above).
This gives us the following optimization problem: $$ \begin{align*} \text{maximize}_{\boldsymbol{w}, b} & \ \dfrac{2}{\lVert \boldsymbol{w} \rVert} \\ \text{subject to} & \ y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) \geq 1 \end{align*} $$
Dividing by $\lVert \boldsymbol{w} \rVert$ doesn't look nice here, so let's rewrite it a little. Notice that maximizing $\dfrac{2}{\lVert \boldsymbol{w} \rVert}$ is identical to minimizing $\lVert \boldsymbol{w} \rVert$, and hence identical to minimizing $\lVert \boldsymbol{w} \rVert^2$ (norms are nasty to differentiate due to the square root involved, so we'd better work with the square). Similar to mean squared error, we also add a constant factor of $\dfrac{1}{2}$ so that it cancels out in the derivative. Therefore, our optimization problem becomes:
$$ \begin{align*} \text{minimize}_{\boldsymbol{w}, b} & \ \dfrac{1}{2} \lVert \boldsymbol{w} \rVert^2 \\ \text{subject to} & \ y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) \geq 1 \end{align*} $$
This is the primal formulation of the SVM problem! The objective function is convex and the constraints are linear, so this optimization problem can be solved easily using off-the-shelf tools (it's a straightforward convex optimization problem).
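As a concrete illustration, here is a minimal sketch of solving this problem with an off-the-shelf convex solver (assuming the CVXPY library is available; the toy dataset is made up):

```python
import cvxpy as cp
import numpy as np

# Made-up linearly separable data: X is (n, d), y has labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1 for all i
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w)),
    [cp.multiply(y, X @ w + b) >= 1],
)
problem.solve()
print(w.value, b.value)
```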
Lagrange Multipliers and Duality
Let's digress a little to discuss the concept of duality. Here, we focus on general (convex) optimization problems with inequality constraints, which can be described as follows: $$ \begin{align*} \text{minimize}_{\boldsymbol{w}} & \ f(\boldsymbol{w}) \\ \text{subject to} & \ g_i(\boldsymbol{w}) \geq 0 \text{ for $i = 1, 2, \ldots, m$} \end{align*} $$
We call this the primal problem.
Let's consider a simple case where there is only one constraint: $$ \begin{align*} \text{minimize}_{\boldsymbol{w}} & \ f(\boldsymbol{w}) \\ \text{subject to} & \ g(\boldsymbol{w}) \geq 0 \end{align*} $$
Let $\boldsymbol{w}^*$ be a point that we believe to be a minimizer. If we could find a direction $\boldsymbol{v}$ such that $\nabla_{\boldsymbol{v}} \ g(\boldsymbol{w}^{*}) = 0$ but $\nabla_{\boldsymbol{v}} \ f(\boldsymbol{w}^{*}) \neq 0$, then we can move $\boldsymbol{w}$ slightly along that direction (or its opposite) so that $f(\boldsymbol{w})$ decreases while the constraint is still satisfied (to first order). However, if $\boldsymbol{w}^*$ is truly the minimizer, this should be impossible.
Therefore, if $\boldsymbol{w}^*$ is the minimizer, either (i) $\nabla f(\boldsymbol{w}^{*}) = \boldsymbol{0}$, or (ii) $\nabla f$ and $\nabla g$ are parallel, i.e. $\nabla f(\boldsymbol{w}^{*}) = \alpha \nabla g(\boldsymbol{w}^*)$.
- In case (i), $\boldsymbol{w}^*$ is a stationary point of $f$ itself. This suggests that the constraint is "inactive" (i.e. removing it makes no difference).
- In case (ii), we can rewrite the condition as $\nabla f(\boldsymbol{w}^*) - \alpha \nabla g(\boldsymbol{w}^*) = \boldsymbol{0}$, suggesting that $\boldsymbol{w}^*$ is a stationary point of the function $f(\boldsymbol{w}) - \alpha g(\boldsymbol{w})$ for a suitably chosen $\alpha$. This is the idea behind Lagrange multipliers.
For these types of optimization problems, we can define the Lagrangian of this optimization problem as $$\mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}) = f(\boldsymbol{w}) - \sum_i \alpha_i g_i(\boldsymbol{w})$$ where $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_m)$ are the Lagrange multipliers such that $\alpha_i \geq 0$. The idea is that we no longer insist that $g_i(\boldsymbol{w}) \geq 0$, but we pay a penalty (scaled by $\alpha_i$) if $g_i(\boldsymbol{w}) < 0$. Conversely, we are rewarded if $g_i(\boldsymbol{w}) > 0$.
If $\boldsymbol{w}$ satisfies all the constraints $g_i(\boldsymbol{w}) \geq 0$, then $\mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}) \leq f(\boldsymbol{w})$ for any $\boldsymbol{\alpha}$ with $\alpha_i \geq 0$ (every term we subtract is non-negative, so there are no penalties - only rewards!).
Let's try to minimize $\mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha})$ by varying $\boldsymbol{w}$, for a fixed value of $\boldsymbol{\alpha}$. This defines the Lagrange dual function: $$h(\boldsymbol{\alpha}) = \min_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha})$$
Consider the optimal solution to the optimization problem - we call it $\boldsymbol{w}^*$. Obviously, it satisfies all the constraints. By our observation above, we have $\mathcal{L}(\boldsymbol{w}^*, \boldsymbol{\alpha}) \leq f(\boldsymbol{w}^*)$. This means that minimizing the Lagrangian over $\boldsymbol{w}$ gives a value no greater than $f(\boldsymbol{w}^*)$ (the optimal value of our original optimization problem!).
Another way of phrasing this is that $h(\boldsymbol{\alpha})$ provides a lower bound for our optimal solution $f(\boldsymbol{w}^*)$. Note that $\boldsymbol{\alpha}$ is a free variable here, so it's natural to try finding the maximum possible lower bound (i.e. maximizing $h(\boldsymbol{\alpha})$ w.r.t. $\boldsymbol{\alpha}$): $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ h(\boldsymbol{\alpha}) \\ \text{subject to} & \ \alpha_i \geq 0 \text{ for $i = 1, 2, \ldots, m$} \end{align*} $$
Suppose $h(\boldsymbol{\alpha}^*)$ is the optimal solution to this optimization problem.
- Since $h(\boldsymbol{\alpha})$ is always a lower bound to $f(\boldsymbol{w}^*)$, we have: $$h(\boldsymbol{\alpha}^*) \leq f(\boldsymbol{w}^*)$$ this is known as weak duality.
- If the original optimization problem is convex (i.e. $f$ is convex and $g_i$ are all linear), and a mild regularity condition holds (we'll not go into this here), then $$h(\boldsymbol{\alpha}^*) = f(\boldsymbol{w}^*)$$ this is known as strong duality (a very important result in convex optimization).
When strong duality holds, the following optimization problem (called the dual problem) would be equivalent to the primal problem: $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ \min_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}) \\ \text{subject to} & \ \alpha_i \geq 0 \text{ for $i = 1, 2, \ldots, m$} \end{align*} $$
One way to understand duality is by the minimax theorem, which states that in fact $$\min_A \max_B f(A, B) = \max_B \min_A f(A, B)$$ in the case that $f(A, \cdot)$ is concave in $B$, $f(\cdot, B)$ is convex in $A$, and some other mild conditions hold (try to picture this in the figure below!).

It is very clear that the dual problem is a "max min": $$\max_{\boldsymbol{\alpha}} \min_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha})$$
Surprise surprise! The primal problem is in fact a "min max"! $$\min_{\boldsymbol{w}} \max_{\boldsymbol{\alpha}} \mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha})$$
- Picture this as a strategic game - the min player moves first and the max player moves next.
- Suppose $\boldsymbol{w}$ does not satisfy the constraints, i.e. $g_i(\boldsymbol{w}) < 0$ for some $i$. The max player can simply choose $\alpha_i$ to be arbitrarily large so that $\mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}) \rightarrow \infty$. Given that the min player wants to minimize the function, the min player will simply not choose such $\boldsymbol{w}$ (so that the max player cannot pull the value up to $\infty$).
- What if $\boldsymbol{w}$ satisfies the constraints? Recall that the min player gets a "reward" for satisfying the constraints, but surely the max player does not want this reward to happen. Therefore, the max player would simply set all $\alpha_i$ to $0$ to zero out the rewards. Essentially, the value of $\mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha})$ would be equal to $f(\boldsymbol{w})$.
- Since the min player always picks $\boldsymbol{w}$ that satisfies the constraints and then minimizes $f(\boldsymbol{w})$, the min player's choice is indeed identical to the primal problem.
It is always true that "min max" $\geq$ "max min" (this maps to weak duality). But "min max" $=$ "max min" is only guaranteed when the conditions of the minimax theorem are satisfied (this maps to strong duality).
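To see weak and strong duality numerically, here is a tiny grid-based sketch (NumPy; the saddle function is made up purely for illustration - it is convex in $w$ and linear, hence concave, in $\alpha$):

```python
import numpy as np

# L(w, alpha) = w^2 + alpha * w: convex in w, concave (linear) in alpha.
ws = np.linspace(-2, 2, 401)
alphas = np.linspace(0, 2, 201)
W, A = np.meshgrid(ws, alphas, indexing="ij")
L = W**2 + A * W

min_max = L.max(axis=1).min()   # min over w of (max over alpha)
max_min = L.min(axis=0).max()   # max over alpha of (min over w)
print(min_max >= max_min)       # True: weak duality
print(min_max, max_min)         # both ~0 here: strong duality holds
```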
In addition, when the primal problem is convex, the following conditions (known as the KKT conditions) will be necessary and sufficient for $(\boldsymbol{w}^*, \boldsymbol{\alpha}^*)$ to be the optimal solution to BOTH the primal and dual problem:
- Primal feasibility: $g_i(\boldsymbol{w}^*) \geq 0$ for $i = 1, 2, \ldots, m$. (This is straightforward - $\boldsymbol{w}^*$ must satisfy the constraints of the primal problem.)
- Dual feasibility: $\alpha_i^* \geq 0$ for $i = 1, 2, \ldots, m$. (This is also straightforward - $\boldsymbol{\alpha}^*$ must satisfy the constraints of the dual problem.)
- Complementary slackness: $\alpha_i^* g_i(\boldsymbol{w}^*) = 0$ for $i = 1, 2, \ldots, m$. (This condition tells us that either $\alpha_i^* = 0$ or $g_i(\boldsymbol{w}^*) = 0$ - the former happens when constraint $i$ is "inactive" and the latter happens when constraint $i$ is "active".)
- Vanishing gradient: $\nabla f(\boldsymbol{w}^*) - \sum_{i=1}^m \alpha_i^* \nabla g_i(\boldsymbol{w}^*) = \boldsymbol{0}$. (This is the stationarity condition behind Lagrange multipliers - note the minus sign, which matches our Lagrangian $\mathcal{L} = f - \sum_i \alpha_i g_i$.)
Dual Formulation
Let's get back to SVMs now. Recall the primal formulation is as follows:
$$ \begin{align*} \text{minimize}_{\boldsymbol{w}, b} & \ \dfrac{1}{2} \lVert \boldsymbol{w} \rVert^2 \\ \text{subject to} & \ y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) \geq 1 \end{align*} $$
To match the constraint with its generic form $g_i(\boldsymbol{w}) \geq 0$, we write it as: $$y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) - 1 \geq 0$$
There is one constraint for each data point, so we need to introduce a Lagrange multiplier $\alpha^{(i)}$ for each sample. The Lagrangian is $$\mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha}) = \dfrac{1}{2} \lVert \boldsymbol{w} \rVert^2 - \sum_i \alpha^{(i)} \left[ y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) - 1 \right]$$
Therefore, the dual formulation becomes: $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ \min_{\boldsymbol{w}, b} \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha}) \\ \text{subject to} & \ \alpha^{(i)} \geq 0 \end{align*} $$
To minimize $\mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha})$ w.r.t. $\boldsymbol{w}$ and $b$, we set the partial derivatives $\dfrac{\partial \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha})}{\partial \boldsymbol{w}}$ and $\dfrac{\partial \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha})}{\partial b}$ to 0: $$\dfrac{\partial \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha})}{\partial \boldsymbol{w}} = 0 \quad \Rightarrow \quad \boldsymbol{w} - \sum_i \alpha^{(i)} y^{(i)} \boldsymbol{x}^{(i)} = 0 \quad \Rightarrow \quad \boldsymbol{w} = \sum_i \alpha^{(i)} y^{(i)} \boldsymbol{x}^{(i)}$$ $$\dfrac{\partial \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha})}{\partial b} = 0 \quad \Rightarrow \quad \sum_{i} \alpha^{(i)} y^{(i)} = 0$$
Plugging the two conditions we got back into the definition of the Lagrangian, we have $$\begin{align} \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha}) & = \dfrac{1}{2} \lVert \boldsymbol{w} \rVert^2 - \sum_i \alpha^{(i)} \left[ y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) - 1 \right] \\ & = \dfrac{1}{2} \lVert \boldsymbol{w} \rVert^2 - \sum_i \alpha^{(i)} y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)}) - \sum_i \alpha^{(i)} y^{(i)} b + \sum_i \alpha^{(i)} \\ & = \dfrac{1}{2} \left\lVert \sum_i \alpha^{(i)} y^{(i)} \boldsymbol{x}^{(i)} \right\rVert ^2 - \sum_i \alpha^{(i)} y^{(i)} \left( \sum_j \alpha^{(j)} y^{(j)} \boldsymbol{x}^{(j)} \right) \cdot \boldsymbol{x}^{(i)} - \sum_i \alpha^{(i)} y^{(i)} b + \sum_i \alpha^{(i)} & \text{$\blacktriangleleft$ Plug } \boldsymbol{w} = \sum_i \alpha^{(i)} y^{(i)} \boldsymbol{x}^{(i)} \\ & = \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left( \boldsymbol{x}^{(i)} \cdot \boldsymbol{x}^{(j)} \right) - \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left( \boldsymbol{x}^{(i)} \cdot \boldsymbol{x}^{(j)} \right) - \sum_i \alpha^{(i)} y^{(i)} b + \sum_i \alpha^{(i)} \\ & = \sum_i \alpha^{(i)} - \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(\boldsymbol{x}^{(i)} \cdot \boldsymbol{x}^{(j)}\right) - b \sum_i \alpha^{(i)} y^{(i)}\\ & = \sum_i \alpha^{(i)} - \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(\boldsymbol{x}^{(i)} \cdot \boldsymbol{x}^{(j)}\right) & \text{$\blacktriangleleft$ Plug } \sum_{i} \alpha^{(i)} y^{(i)} = 0 \end{align}$$
Hence, we finally get the dual formulation of the SVM problem: $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ \sum_i \alpha^{(i)} - \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(\boldsymbol{x}^{(i)} \cdot \boldsymbol{x}^{(j)}\right) \\ \text{subject to} & \ \alpha^{(i)} \geq 0, \ \sum_i \alpha^{(i)} y^{(i)} = 0 \end{align*} $$ Note that the condition $\sum_i \alpha^{(i)} y^{(i)} = 0$ (from $\partial \mathcal{L} / \partial b = 0$) is carried over as a constraint - without it, the inner minimization over $b$ would be unbounded.
This is an equivalent formulation of the same SVM problem, yet we are optimizing over $\boldsymbol{\alpha}$ instead of $\boldsymbol{w}$ and $b$. Similar to our primal formulation, the optimal $\boldsymbol{\alpha}$ can be computed by an off-the-shelf solver (this is still a simple convex optimization problem).
After obtaining the solution for $\boldsymbol{\alpha}$, we can recover $\boldsymbol{w}$ using: $$\boldsymbol{w} = \sum_i \alpha^{(i)} y^{(i)} \boldsymbol{x}^{(i)} \quad \text{$\blacktriangleleft$ As derived above}$$
Using the fact that $\boldsymbol{w} \cdot \boldsymbol{x} + b = \pm 1$ for support vectors, we can find $b$ as follows: $$b = -\dfrac{\max_{i:y^{(i)}=-1} \boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + \min_{i:y^{(i)}=1} \boldsymbol{w} \cdot \boldsymbol{x}^{(i)}}{2} \quad \text{$\blacktriangleleft$ Find the support vectors and take the average}$$
How do we find the support vectors? The KKT conditions become useful here. Condition 3 (complementary slackness) tells us that $\alpha^{(i)} = 0$ if the constraint is "inactive" and $\alpha^{(i)} > 0$ if the constraint is "active". In the context of SVMs, $\alpha^{(i)} > 0$ if the sample is a support vector (constraint $i$ is needed), and $\alpha^{(i)} = 0$ otherwise (constraint $i$ can be removed).
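Putting the whole pipeline together, here is a minimal sketch (again assuming CVXPY; same made-up toy data as the primal example) that solves the dual, then recovers $\boldsymbol{w}$, $b$, and the support vectors:

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = X @ X.T              # Gram matrix: all pairwise dot products, precomputed
Q = np.outer(y, y) * G   # Q_ij = y_i y_j (x_i . x_j), PSD by construction

alpha = cp.Variable(n)
problem = cp.Problem(
    # psd_wrap just skips the numerical PSD check on Q
    cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, cp.psd_wrap(Q))),
    [alpha >= 0, y @ alpha == 0],
)
problem.solve()

a = alpha.value
w = (a * y) @ X                            # w = sum_i alpha_i y_i x_i
support = a > 1e-6                         # complementary slackness: alpha_i > 0
b = np.mean(y[support] - X[support] @ w)   # from y_i (w . x_i + b) = 1
print(w, b, np.where(support)[0])
```

Here $b$ is recovered from $y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) = 1$ on each support vector and averaged for numerical stability; this agrees with the max/min formula above.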
That's it! Hope you're still surviving here!
Comparing the two Formulations
Let's summarize the two formulations first.
Primal Formulation: $$ \begin{align*} \text{minimize}_{\boldsymbol{w}, b} & \ \dfrac{1}{2} \lVert \boldsymbol{w} \rVert^2 \\ \text{subject to} & \ y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) \geq 1 \end{align*} $$
Dual Formulation: $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ \sum_i \alpha^{(i)} - \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(\boldsymbol{x}^{(i)} \cdot \boldsymbol{x}^{(j)}\right) \\ \text{subject to} & \ \alpha^{(i)} \geq 0, \ \sum_i \alpha^{(i)} y^{(i)} = 0 \end{align*} $$
When we solve the primal formulation (using an off-the-shelf solver), we will likely have to compute dot products between $\boldsymbol{w}$ (what we're optimizing) and the samples $\boldsymbol{x}^{(i)}$ repeatedly, which is inefficient for high-dimensional data. However, note that in the dual formulation, the only part that involves the samples is $\boldsymbol{x}^{(i)} \cdot \boldsymbol{x}^{(j)}$. This can be precomputed across all pairs of samples (i.e. in $O(n^2 d)$ time) BEFORE doing the optimization. Hence, it is usually faster to solve the dual problem instead of the primal when it comes to high-dimensional data.
Besides, the fact that the dual formulation only involves the dot product of the samples would enable us to apply the kernel trick, which is essential in capturing non-linear relationships.
Kernel Trick
The Kernel Trick is used to handle non-linear decision boundaries. This is a trick based on the dual formulation.
Typically, we handle non-linear relationships by adding in transformed features. Let's say we apply a feature transformation $\phi$ to the data, so the input sample $\boldsymbol{x}$ becomes $\phi(\boldsymbol{x})$ instead. Here is an example $\phi$ that adds in the quadratic features to two-dimensional data: $$\phi(\boldsymbol{x}) = \begin{bmatrix} x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \\ x_1 x_2 \\ \end{bmatrix}$$
Then, we would solve this dual problem: $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ \sum_i \alpha^{(i)} - \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(\phi(\boldsymbol{x}^{(i)}) \cdot \phi(\boldsymbol{x}^{(j)})\right) \\ \text{subject to} & \ \alpha^{(i)} \geq 0, \ \sum_i \alpha^{(i)} y^{(i)} = 0 \end{align*} $$
Notice an important property here - the dual formulation only depends on the pairwise dot product of the samples. As long as we figure out a way to compute $\phi(x^{(i)}) \cdot \phi(x^{(j)})$ for all pairs of samples $(i, j)$, we're good to go!
This inspires the kernel trick: we replace the dot product $\phi(\boldsymbol{x}^{(i)}) \cdot \phi(\boldsymbol{x}^{(j)})$ with a kernel function $K(\boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)})$ that computes the dot product of two input samples after the feature transformation. It doesn't matter whether the kernel function actually performs the feature transformation - the kernel is valid as long as it computes the correct dot product.
This brings us to the important property of a kernel function: $K(\boldsymbol{u}, \boldsymbol{v}) = \phi(\boldsymbol{u}) \cdot \phi(\boldsymbol{v})$ for some $\phi$. This simplifies the dual problem to: $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ \sum_i \alpha^{(i)} - \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} K(\boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)}) \\ \text{subject to} & \ \alpha^{(i)} \geq 0, \ \sum_i \alpha^{(i)} y^{(i)} = 0 \end{align*} $$
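In practice, kernelized SVMs are rarely solved by hand; here is a minimal sketch using scikit-learn (assuming it is installed), which never materializes $\phi(\boldsymbol{x})$ explicitly:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data with a circular (non-linear) decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)

# A very large C approximates the hard-margin SVM described in these notes.
clf = SVC(kernel="rbf", C=1e6)
clf.fit(X, y)
print(len(clf.support_))   # number of support vectors (samples with alpha > 0)
print(clf.score(X, y))     # training accuracy
```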
Example 1: Quadratic Kernel
Consider the following kernel function, where $\boldsymbol{u}$ and $\boldsymbol{v}$ are three-dimensional vectors: $$K(\boldsymbol{u}, \boldsymbol{v}) = (\boldsymbol{u} \cdot \boldsymbol{v})^2$$
Note that this kernel function only takes $O(m)$ time to compute, where $m$ is the number of features (it squares after computing the dot product).
However, by expanding the dot product, we get $$K(\boldsymbol{u}, \boldsymbol{v}) = (u_1 v_1 + u_2 v_2 + u_3 v_3)^2 = u_1^2 v_1^2 + u_2^2 v_2^2 + u_3^2 v_3^2 + 2 (u_1 u_2) (v_1 v_2) + 2 (u_1 u_3) (v_1 v_3) + 2 (u_2 u_3) (v_2 v_3)$$
By separating the components involving $\boldsymbol{u}$ and $\boldsymbol{v}$, this can be written as $$K(\boldsymbol{u}, \boldsymbol{v}) = \begin{bmatrix} u_1^2 \\ u_2^2 \\ u_3^2 \\ \sqrt{2} u_1 u_2 \\ \sqrt{2} u_1 u_3 \\ \sqrt{2} u_2 u_3 \\ \end{bmatrix} \cdot \begin{bmatrix} v_1^2 \\ v_2^2 \\ v_3^2 \\ \sqrt{2} v_1 v_2 \\ \sqrt{2} v_1 v_3 \\ \sqrt{2} v_2 v_3 \\ \end{bmatrix} = \phi(\boldsymbol{u}) \cdot \phi(\boldsymbol{v})$$
These are the transformed features! Essentially, the kernel function implicitly considers the transformed features $u_1^2$, $u_2^2$, $u_3^2$, $u_1 u_2$, $u_1 u_3$ and $u_2 u_3$ for each input sample, without explicitly computing them. It only computes the dot product $\phi(\boldsymbol{u}) \cdot \phi(\boldsymbol{v})$ directly, which can be done in $O(m)$ time. (But we're handling $O(m^2)$ transformed features!)
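A quick NumPy check of this identity on made-up vectors:

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map for 3-dimensional inputs (see above).
    x1, x2, x3 = x
    return np.array([x1**2, x2**2, x3**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1 * x3,
                     np.sqrt(2) * x2 * x3])

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
print(np.dot(u, v) ** 2)         # kernel: O(m) work -> 1024.0
print(np.dot(phi(u), phi(v)))    # explicit O(m^2) features -> 1024.0
```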
See Task 0 of the bonus question for a sample implementation / visualization. Training the SVM with the transformed features is exactly identical to training the SVM with the quadratic kernel function! (Note: That implementation uses two-dimensional data for ease of visualization.)
We can extend this further to degree $d$ by simply taking $K(\boldsymbol{u}, \boldsymbol{v}) = (\boldsymbol{u} \cdot \boldsymbol{v})^d$. Computing the $d$-th power of a dot product is very cheap, yet we're implicitly considering $O(m^d)$ transformed features!
Example 2: Gaussian Kernel
Consider the following kernel function, where $u$ and $v$ are scalars (for simplicity): $$K(u, v) = \exp \left( -\dfrac{(u - v)^2}{2\sigma^2} \right)$$
By expanding using the Taylor series of $e^x$, and taking $2\sigma^2 = 1$ for simplicity, we have (see tutorial solutions for the complete derivation) $$\begin{align} K(u, v) & = \sum_{k=0}^{\infty} \left[ \sqrt{\dfrac{2^k}{k!}} \exp(-u^2) u^k \times \sqrt{\dfrac{2^k}{k!}} \exp(-v^2) v^k \right] \\ & = \begin{bmatrix} \sqrt{\dfrac{2^0}{0!}} \exp(-u^2) u^0 \\ \sqrt{\dfrac{2^1}{1!}} \exp(-u^2) u^1 \\ \sqrt{\dfrac{2^2}{2!}} \exp(-u^2) u^2 \\ \vdots \end{bmatrix} \cdot \begin{bmatrix} \sqrt{\dfrac{2^0}{0!}} \exp(-v^2) v^0 \\ \sqrt{\dfrac{2^1}{1!}} \exp(-v^2) v^1 \\ \sqrt{\dfrac{2^2}{2!}} \exp(-v^2) v^2 \\ \vdots \end{bmatrix} \\ & = \phi(u) \cdot \phi(v) \end{align}$$
Note that $\phi(u)$ has terms $u^0, u^1, u^2, \cdots$, which means the transformation is infinite dimensional.
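Since we cannot compute an infinite-dimensional $\phi$ explicitly, here is a small sketch that truncates the series at a (hypothetical) cutoff `k_max` and checks convergence to the kernel value, taking $2\sigma^2 = 1$ as in the derivation above:

```python
import numpy as np
from math import factorial

def gaussian_kernel(u, v):
    return np.exp(-(u - v) ** 2)    # 2 sigma^2 = 1

def phi(x, k_max=30):
    # Truncated feature map: sqrt(2^k / k!) * exp(-x^2) * x^k for k < k_max
    return np.array([np.sqrt(2.0**k / factorial(k)) * np.exp(-x**2) * x**k
                     for k in range(k_max)])

u, v = 0.7, -0.3
print(gaussian_kernel(u, v))     # exp(-1) ~= 0.36788
print(np.dot(phi(u), phi(v)))    # converges to the same value
```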
Last updated: 14 March 2025