CS2109S AY 2024/25 Semester 2

Lecture 7 Math Explained - Support Vector Machines

The Problem
Calculating the Margin
Primal Formulation
Lagrange Multipliers and Duality
Dual Formulation
Comparing the two Formulations
Kernel Trick

The Problem

In the logistic regression lecture, we used the following hypothesis for classification problems: $$h_{\boldsymbol{w}}(\boldsymbol{x}) = \sigma (\boldsymbol{w} \cdot \boldsymbol{x})$$

We classify points such that $\boldsymbol{w} \cdot \boldsymbol{x} > 0$ as positive, and points such that $\boldsymbol{w} \cdot \boldsymbol{x} < 0$ as negative. Here, we denote the line $\boldsymbol{w} \cdot \boldsymbol{x} = 0$ as the decision boundary, the boundary that separates the positive and negative samples.

A natural question arises: Can we choose a decision boundary that best separates the positive and negative data? Mathematically, we could compute the margin - the distance from the decision boundary to the nearest positive / negative sample. The best decision boundary would maximize the margin.

Calculating the Margin

From now on, we use $y \in \{-1, 1\}$ (instead of $y \in \{0, 1\}$) for the class labels. Negative samples are denoted by $-1$ to make the setup more "symmetric". We also write out the bias term explicitly (we will need it later), so the hypothesis becomes $h_{\boldsymbol{w}}(\boldsymbol{x}) = \sigma(\boldsymbol{w} \cdot \boldsymbol{x} + b)$.
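With these labels, prediction is simply the sign of $\boldsymbol{w} \cdot \boldsymbol{x} + b$. Here is a minimal NumPy sketch (the function name, weights and data are made up for illustration):

```python
import numpy as np

def predict(w, b, X):
    """Classify rows of X with labels in {-1, +1}: the sign of w . x + b."""
    return np.where(X @ w + b >= 0, 1, -1)

# Example: a hand-picked weight vector and two sample points
w, b = np.array([1.0, -2.0]), 0.5
X = np.array([[3.0, 1.0], [0.0, 2.0]])
print(predict(w, b, X))  # [ 1 -1]
```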

Recall that we're classifying the data according to $\boldsymbol{w} \cdot \boldsymbol{x} + b$. If we would like to create a margin between the positive and negative data, then we would condition that $\boldsymbol{w} \cdot \boldsymbol{x} + b$ cannot be "close to zero": $$\begin{align*} \boldsymbol{w} \cdot \boldsymbol{x}^{+} + b & \geq c \\ \boldsymbol{w} \cdot \boldsymbol{x}^{-} + b & \leq -c \end{align*} \ \text{\ for some $c$}$$

But then... notice we can scale the weight $\boldsymbol{w}$ and the bias $b$ arbitrarily. Instead of placing a free variable $c$ here, we can simply scale $\boldsymbol{w}$ and $b$ by a factor of $\frac{1}{c}$ to fix the right-hand side to $1$: $$\begin{align*} \boldsymbol{w} \cdot \boldsymbol{x}^{+} + b & \geq 1 \\ \boldsymbol{w} \cdot \boldsymbol{x}^{-} + b & \leq -1 \end{align*}$$

Next, we take advantage of the fact that $y^{+} = 1$ and $y^{-} = -1$, and merge these two conditions into a single one: $$y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) \geq 1 \text{ for all samples $(\boldsymbol{x}^{(i)}, y^{(i)})$}$$

Ideally, we would want the decision boundary to be the "median" line separating the positive and negative samples. That is, the closest negative sample (with the largest negative value of $\boldsymbol{w} \cdot \boldsymbol{x} + b$) would be as far from $0$ as the closest positive sample (with the smallest positive value of $\boldsymbol{w} \cdot \boldsymbol{x} + b$). Hence, we have the following: $$\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b = \pm 1 \text{ for support vectors $(\boldsymbol{x}^{(i)}, y^{(i)})$}$$

Due to this "median" argument, we should "tune" the bias $b$ such that the support vectors on the negative hyperplane satisfy $\boldsymbol{w} \cdot \boldsymbol{x} + b = -c$ and the support vectors on the positive hyperplane satisfy $\boldsymbol{w} \cdot \boldsymbol{x} + b = c$.

However, recall that we can scale $\boldsymbol{w}$ and $b$ freely. Scaling both by a factor of $\frac{1}{c}$ turns the right-hand side into $\pm 1$, which gives exactly the equation above. In short, the equation is the result of adjusting the values of $b$ and $\boldsymbol{w}$.

With this formulation, we can calculate the size of the margin with a dot product.

Note that $\boldsymbol{w}$ is a vector orthogonal (i.e. perpendicular) to the line $\boldsymbol{w} \cdot \boldsymbol{x} + b = 0$, so the margin would simply be a projection on the vector $\boldsymbol{w}$.

Which vector to project though? Of course we're projecting the support vectors! We could pick two support vectors, one on the positive hyperplane and one on the negative hyperplane, and project the vector $\boldsymbol{x}^+ - \boldsymbol{x}^-$ on $\boldsymbol{w}$. This would be our margin!

Let's calculate the size of the margin now: $$\begin{align} \text{margin} & = (\boldsymbol{x}^+ - \boldsymbol{x}^-) \cdot \dfrac{\boldsymbol{w}}{\lVert \boldsymbol{w} \rVert} & \text{$\blacktriangleleft$ Projection (see above)} \\ & = \dfrac{\boldsymbol{w} \cdot \boldsymbol{x}^{+} - \boldsymbol{w} \cdot \boldsymbol{x}^{-}}{\lVert \boldsymbol{w} \rVert} & \text{$\blacktriangleleft$ Expand} \\ & = \dfrac{(1 - b) - (-1 - b)}{\lVert \boldsymbol{w} \rVert} & \text{$\blacktriangleleft$ Plug in $\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} = \pm 1 - b$ above}\\ & = \dfrac{2}{\lVert \boldsymbol{w} \rVert} \end{align}$$
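Here is a minimal NumPy sketch that checks this formula on a hand-picked example (the weight vector and the two points below are assumptions chosen so that they lie exactly on the hyperplanes $\boldsymbol{w} \cdot \boldsymbol{x} + b = \pm 1$):

```python
import numpy as np

# Toy setup (illustrative values only)
w = np.array([1.0, 1.0])
b = 0.0
x_pos = np.array([1.0, 0.0])   # w . x_pos + b = +1  (positive support vector)
x_neg = np.array([-1.0, 0.0])  # w . x_neg + b = -1  (negative support vector)

# Margin via projection of (x_pos - x_neg) onto the unit vector w / ||w||
margin_projection = (x_pos - x_neg) @ (w / np.linalg.norm(w))

# Margin via the closed-form expression 2 / ||w||
margin_formula = 2 / np.linalg.norm(w)

print(margin_projection, margin_formula)  # both ~1.4142
```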

Primal Formulation

Summarizing what we did above, let's formulate SVM as an optimization problem!

We are finding a weight $\boldsymbol{w}$ that maximizes the margin, subject to the condition that all points are classified correctly.

This gives us the following optimization problem: $$ \begin{align*} \text{maximize}_{\boldsymbol{w}, b} & \ \dfrac{2}{\lVert \boldsymbol{w} \rVert} \\ \text{subject to} & \ y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) \geq 1 \end{align*} $$

Dividing by $\lVert \boldsymbol{w} \rVert$ doesn't look nice here, so let's rewrite it a little. Notice that maximizing $\dfrac{2}{\lVert \boldsymbol{w} \rVert}$ is equivalent to minimizing $\lVert \boldsymbol{w} \rVert$, and hence to minimizing $\lVert \boldsymbol{w} \rVert^2$ (norms are awkward to differentiate because of the square root, so we work with the square instead). As with mean-squared error, we add a $\dfrac{1}{2}$ constant factor so that it cancels out in the derivative. Therefore, our optimization problem becomes:

$$ \begin{align*} \text{minimize}_{\boldsymbol{w}, b} & \ \dfrac{1}{2} \lVert \boldsymbol{w} \rVert^2 \\ \text{subject to} & \ y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) \geq 1 \end{align*} $$

This is the primal formulation of the SVM problem! The objective function is convex and the constraints are linear, so this optimization problem can be solved easily using off-the-shelf tools (it's a straightforward convex optimization problem).
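For concreteness, here is a minimal sketch of how an off-the-shelf solver could be used, assuming the cvxpy library and a tiny made-up linearly separable dataset (all data and variable names here are purely illustrative):

```python
import cvxpy as cp
import numpy as np

# Tiny linearly separable toy dataset (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2) ||w||^2   subject to   y_i (w . x_i + b) >= 1 for all i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()

print(w.value, b.value)
```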

Lagrange Multipliers and Duality

Let's digress a little to discuss the concept of duality. Here, we focus on general (convex) optimization problems with inequality constraints. These can be described as follows: $$ \begin{align*} \text{minimize}_{\boldsymbol{w}} & \ f(\boldsymbol{w}) \\ \text{subject to} & \ g_i(\boldsymbol{w}) \geq 0 \text{ for $i = 1, 2, \ldots, m$} \end{align*} $$

We call this the primal problem.

For problems of this type, we can define the Lagrangian as $$\mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}) = f(\boldsymbol{w}) - \sum_i \alpha_i g_i(\boldsymbol{w})$$ where $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_m)$ are the Lagrange multipliers such that $\alpha_i \geq 0$. The idea is that we no longer insist that $g_i(\boldsymbol{w}) \geq 0$, but we pay a penalty (scaled by $\alpha_i$) if $g_i(\boldsymbol{w}) < 0$. Conversely, we are rewarded if $g_i(\boldsymbol{w}) > 0$.

If $\boldsymbol{w}$ satisfies all the constraints $g_i(\boldsymbol{w}) \geq 0$, then $\mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}) \leq f(\boldsymbol{w})$ for any $\boldsymbol{\alpha}$ with $\alpha_i \geq 0$ (the penalty term $-\sum_i \alpha_i g_i(\boldsymbol{w})$ can only decrease the value, never increase it).

When strong duality holds, the following optimization problem (called the dual problem) would be equivalent to the primal problem: $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ \min_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}) \\ \text{subject to} & \ \alpha_i \geq 0 \text{ for $i = 1, 2, \ldots, m$} \end{align*} $$

In addition, when the primal problem is convex, the following conditions (known as the KKT conditions) will be necessary and sufficient for $(\boldsymbol{w}^*, \boldsymbol{\alpha}^*)$ to be the optimal solution to BOTH the primal and dual problem:

  1. Primal feasibility: $g_i(\boldsymbol{w}^*) \geq 0$ for $i = 1, 2, \ldots, m$.
    (This is straightforward - $\boldsymbol{w}^*$ must satisfy the constraints of the primal problem.)
  2. Dual feasibility: $\alpha_i^* \geq 0$ for $i = 1, 2, \ldots, m$.
    (This is also straightforward - $\boldsymbol{\alpha}^*$ must satisfy the constraints of the dual problem.)
  3. Complementary slackness: $\alpha_i^* g_i(\boldsymbol{w}^*) = 0$ for $i = 1, 2, \ldots, m$.
    (This condition tells us that either $\alpha_i^* = 0$ or $g_i(\boldsymbol{w}^*) = 0$ - the former case happens when $g_i$ is an "inactive" constraint and the latter case happens when $g_i$ is an "active" constraint.)
  4. Vanishing gradient: $\nabla f(\boldsymbol{w}^*) - \sum_{i=1}^m \alpha_i^* \nabla g_i(\boldsymbol{w}^*) = 0$.
    (This is the stationarity condition behind Lagrange multipliers: the gradient of the Lagrangian with respect to $\boldsymbol{w}$ vanishes at the optimum.)
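To make this concrete, here is a tiny worked example. Consider minimizing $f(w) = \frac{1}{2} w^2$ subject to the single constraint $g_1(w) = w - 1 \geq 0$. The Lagrangian is $$\mathcal{L}(w, \alpha) = \frac{1}{2} w^2 - \alpha (w - 1)$$ Minimizing over $w$ gives $w = \alpha$, so the dual problem is to maximize $\alpha - \frac{1}{2} \alpha^2$ subject to $\alpha \geq 0$, whose optimum is $\alpha^* = 1$ with $w^* = 1$. Both problems attain the value $\frac{1}{2}$ (strong duality), and all four KKT conditions can be checked directly: $g_1(w^*) = 0 \geq 0$, $\alpha^* = 1 \geq 0$, $\alpha^* g_1(w^*) = 0$, and $f'(w^*) - \alpha^* g_1'(w^*) = 1 - 1 = 0$.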

Dual Formulation

Let's get back to SVMs now. Recall the primal formulation is as follows:

$$ \begin{align*} \text{minimize}_{\boldsymbol{w}, b} & \ \dfrac{1}{2} \lVert \boldsymbol{w} \rVert^2 \\ \text{subject to} & \ y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) \geq 1 \end{align*} $$

To match the constraint with its generic form $g_i(\boldsymbol{w}) \geq 0$, we write it as: $$y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) - 1 \geq 0$$

There is one constraint for each data point, so we need to introduce a Lagrange multiplier $\alpha^{(i)}$ for each sample. The Lagrangian is $$\mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha}) = \dfrac{1}{2} \lVert \boldsymbol{w} \rVert^2 - \sum_i \alpha^{(i)} \left[ y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) - 1 \right]$$

Therefore, the dual formulation becomes: $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ \min_{\boldsymbol{w}, b} \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha}) \\ \text{subject to} & \ \alpha^{(i)} \geq 0 \end{align*} $$

To minimize $\mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha})$ w.r.t. $\boldsymbol{w}$ and $b$, we have to set the partial derivatives $\dfrac{\partial \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha})}{\partial \boldsymbol{w}}$ and $\dfrac{\partial \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha})}{\partial b}$ to 0: $$\dfrac{\partial \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha})}{\partial \boldsymbol{w}} = 0 \quad \Rightarrow \quad \boldsymbol{w} - \sum_i \alpha^{(i)} y^{(i)} x^{(i)} = 0 \quad \Rightarrow \quad \boldsymbol{w} = \sum_i \alpha^{(i)} y^{(i)} x^{(i)}$$ $$\dfrac{\partial \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha})}{\partial b} = 0 \quad \Rightarrow \quad \sum_{i} \alpha^{(i)} y^{(i)} = 0$$

Plugging the two conditions we got back into the definition of the Lagrangian, we have $$\begin{align} \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha}) & = \dfrac{1}{2} \lVert \boldsymbol{w} \rVert^2 - \sum_i \alpha^{(i)} \left[ y^{(i)} (\boldsymbol{w} \cdot x^{(i)} + b) - 1 \right] \\ & = \dfrac{1}{2} \lVert \boldsymbol{w} \rVert^2 - \sum_i \alpha^{(i)} y^{(i)} (\boldsymbol{w} \cdot x^{(i)}) - \sum_i \alpha^{(i)} y^{(i)} b + \sum_i \alpha^{(i)} \\ & = \dfrac{1}{2} \left\lVert \sum_i \alpha^{(i)} y^{(i)} x^{(i)} \right\rVert ^2 - \sum_i \alpha^{(i)} y^{(i)} \left( \sum_j \alpha^{(j)} y^{(j)} x^{(j)} \right) \cdot x^{(i)} - \sum_i \alpha^{(i)} y^{(i)} b + \sum_i \alpha^{(i)} & \text{$\blacktriangleleft$ Plug } \boldsymbol{w} = \sum_i \alpha^{(i)} y^{(i)} x^{(i)} \\ & = \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left( x^{(i)} \cdot x^{(j)} \right) - \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left( x^{(i)} \cdot x^{(j)} \right) - \sum_i \alpha^{(i)} y^{(i)} b + \sum_i \alpha^{(i)} \\ & = \sum_i \alpha^{(i)} - \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(x^{(i)} \cdot x^{(j)}\right) - b \sum_i \alpha^{(i)} y^{(i)}\\ & = \sum_i \alpha^{(i)} - \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(x^{(i)} \cdot x^{(j)}\right) & \text{$\blacktriangleleft$ Plug } \sum_{i} \alpha^{(i)} y^{(i)} = 0 \end{align}$$

Hence, we finally get the dual formulation of the SVM problem. One subtlety: the simplification in the last step (and the inner minimization over $b$) is only valid when $\sum_i \alpha^{(i)} y^{(i)} = 0$; otherwise the minimum over $b$ is $-\infty$. This condition therefore carries over as a constraint in the dual: $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ \sum_i \alpha^{(i)} - \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(x^{(i)} \cdot x^{(j)}\right) \\ \text{subject to} & \ \alpha^{(i)} \geq 0 \ \text{ and } \ \sum_i \alpha^{(i)} y^{(i)} = 0 \end{align*} $$

This is an equivalent formulation of the same SVM problem, yet we are optimizing for $\boldsymbol{\alpha}$ instead of optimizing for $\boldsymbol{w}$ and $b$. Similar to our primal formulation, the optimal solution to $\boldsymbol{\alpha}$ can be calculated by an off-the-shelf solver (this is still a simple convex optimization problem).

After obtaining the solution for $\boldsymbol{\alpha}$, we can recover $\boldsymbol{w}$ using: $$\boldsymbol{w} = \sum_i \alpha^{(i)} y^{(i)} x^{(i)} \quad \text{$\blacktriangleleft$ As derived above}$$

Using the fact that $\boldsymbol{w} \cdot \boldsymbol{x} + b = \pm 1$ for support vectors, we can find $b$ by using the support vectors: $$b = -\dfrac{\max_{i:y^{(i)}=-1} \boldsymbol{w} \cdot x^{(i)} + \min_{i:y^{(i)}=1} \boldsymbol{w} \cdot x^{(i)}}{2} \quad \text{$\blacktriangleleft$ Find the support vectors and take the average}$$

How do we find the support vectors? The KKT conditions become useful here. Condition 3 (complementary slackness) tells us that $\alpha^{(i)} = 0$ whenever constraint $i$ is "inactive", so $\alpha^{(i)}$ can only be positive when constraint $i$ is "active". In the context of SVMs, $\alpha^{(i)} > 0$ means the sample is a support vector (constraint $i$ is needed), and $\alpha^{(i)} = 0$ means it is not (constraint $i$ can be removed).
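Putting everything together, here is a minimal sketch of the whole pipeline, again assuming cvxpy and the same made-up toy dataset as in the primal sketch (all names and values are illustrative):

```python
import cvxpy as cp
import numpy as np

# Same made-up toy dataset as in the primal sketch
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

alpha = cp.Variable(n)

# (1/2) sum_i sum_j alpha_i alpha_j y_i y_j (x_i . x_j) = (1/2) || sum_i alpha_i y_i x_i ||^2
quadratic_term = 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y))
problem = cp.Problem(cp.Maximize(cp.sum(alpha) - quadratic_term),
                     [alpha >= 0, cp.sum(cp.multiply(alpha, y)) == 0])
problem.solve()

a = alpha.value
support = np.where(a > 1e-6)[0]   # complementary slackness: alpha > 0 only for support vectors
w = (a * y) @ X                   # w = sum_i alpha_i y_i x_i
b = -(np.max(X[y == -1] @ w) + np.min(X[y == 1] @ w)) / 2

print(w, b, support)
```

Up to solver tolerance, the recovered $\boldsymbol{w}$ and $b$ should match what the primal sketch produces, since strong duality holds here.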

That's it! Hope you're still surviving here!

Comparing the two Formulations

Let's summarize the two formulations first.

Primal Formulation: $$ \begin{align*} \text{minimize}_{\boldsymbol{w}, b} & \ \dfrac{1}{2} \lVert \boldsymbol{w} \rVert^2 \\ \text{subject to} & \ y^{(i)} (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b) \geq 1 \end{align*} $$

Dual Formulation: $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ \sum_i \alpha^{(i)} - \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(x^{(i)} \cdot x^{(j)}\right) \\ \text{subject to} & \ \alpha^{(i)} \geq 0 \ \text{ and } \ \sum_i \alpha^{(i)} y^{(i)} = 0 \end{align*} $$

When we solve the primal formulation (using an off-the-shelf solver), we will probably have to compute dot products between $\boldsymbol{w}$ (what we're optimizing) and the samples $x^{(i)}$ repeatedly, which is inefficient for high-dimensional data. However, note that in the dual formulation, the only part that involves the samples themselves is $\boldsymbol{x}^{(i)} \cdot \boldsymbol{x}^{(j)}$. This can be precomputed across all pairs of samples (i.e. in $O(n^2 d)$ time) BEFORE doing the optimization. Hence, it is usually faster to solve the dual problem instead of the primal for high-dimensional data.
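For instance, the matrix of all pairwise dot products (the Gram matrix) can be built once with a single matrix product; the data shape below is just an illustrative assumption:

```python
import numpy as np

# X has shape (n, d): n samples, d features (illustrative sizes)
X = np.random.randn(100, 50)

# Gram matrix of all pairwise dot products x_i . x_j, built once in O(n^2 d) time
gram = X @ X.T   # shape (n, n)
```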

Moreover, the fact that the dual formulation only involves dot products between samples enables us to apply the kernel trick, which is essential for capturing non-linear relationships.

Kernel Trick

The Kernel Trick is used to handle non-linear decision boundaries. This is a trick based on the dual formulation.

Typically, we handle non-linear relationships by adding transformed features. Let's say we apply a feature transformation $\phi$ to the data, so the input sample $\boldsymbol{x}$ becomes $\phi(\boldsymbol{x})$ instead. Here is an example $\phi$ that adds the quadratic features to two-dimensional data: $$\phi(\boldsymbol{x}) = \begin{bmatrix} x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \\ x_1 x_2 \\ \end{bmatrix}$$

Then, we would solve this dual problem: $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ \sum_i \alpha^{(i)} - \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(\phi(x^{(i)}) \cdot \phi(x^{(j)})\right) \\ \text{subject to} & \ \alpha^{(i)} \geq 0 \ \text{ and } \ \sum_i \alpha^{(i)} y^{(i)} = 0 \end{align*} $$

Notice an important property here - the dual formulation only depends on the pairwise dot product of the samples. As long as we figure out a way to compute $\phi(x^{(i)}) \cdot \phi(x^{(j)})$ for all pairs of samples $(i, j)$, we're good to go!

This inspires the kernel trick: we replace the dot product $\phi(x^{(i)}) \cdot \phi(x^{(j)})$ with a kernel function $K(\boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)})$ that computes the dot product of two input samples after feature transformation. It doesn't matter whether the kernel function actually performs the feature transformation - the kernel is valid as long as it computes the correct dot product.

This brings us to the defining property of a kernel function: $K(\boldsymbol{u}, \boldsymbol{v}) = \phi(\boldsymbol{u}) \cdot \phi(\boldsymbol{v})$ for some $\phi$. This simplifies the dual problem to: $$ \begin{align*} \text{maximize}_{\boldsymbol{\alpha}} & \ \sum_i \alpha^{(i)} - \dfrac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} K(\boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)}) \\ \text{subject to} & \ \alpha^{(i)} \geq 0 \ \text{ and } \ \sum_i \alpha^{(i)} y^{(i)} = 0 \end{align*} $$
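In practice, kernelized SVMs are usually trained with a library. Here is a minimal sketch using scikit-learn's SVC, which accepts a callable that returns the kernel (Gram) matrix; the made-up dataset, the large value of C (to approximate a hard margin), and the choice of the quadratic kernel from Example 1 below are all assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up dataset: a circular boundary, not linearly separable in the original space
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

# A callable kernel returns the matrix K with K[i, j] = K(x_i, x_j)
quadratic_kernel = lambda A, B: (A @ B.T) ** 2

clf = SVC(kernel=quadratic_kernel, C=1e6)   # very large C approximates the hard-margin SVM
clf.fit(X, y)
print(clf.score(X, y))                      # should be (close to) 1.0 on this toy data
```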

Example 1: Quadratic Kernel

Consider the following kernel function, where $\boldsymbol{u}$ and $\boldsymbol{v}$ are three-dimensional vectors: $$K(\boldsymbol{u}, \boldsymbol{v}) = (\boldsymbol{u} \cdot \boldsymbol{v})^2$$

Note that this kernel function only takes $O(m)$ time to compute, where $m$ is the number of features (compute the dot product, then square it).

However, by expanding the dot product, we get $$K(\boldsymbol{u}, \boldsymbol{v}) = (u_1 v_1 + u_2 v_2 + u_3 v_3)^2 = u_1^2 v_1^2 + u_2^2 v_2^2 + u_3^2 v_3^2 + 2 (u_1 u_2) (v_1 v_2) + 2 (u_1 u_3) (v_1 v_3) + 2 (u_2 u_3) (v_2 v_3)$$

By separating the components involving $\boldsymbol{u}$ and $\boldsymbol{v}$, this can be written as $$K(\boldsymbol{u}, \boldsymbol{v}) = \begin{bmatrix} u_1^2 \\ u_2^2 \\ u_3^2 \\ \sqrt{2} u_1 u_2 \\ \sqrt{2} u_1 u_3 \\ \sqrt{2} u_2 u_3 \\ \end{bmatrix} \cdot \begin{bmatrix} v_1^2 \\ v_2^2 \\ v_3^2 \\ \sqrt{2} v_1 v_2 \\ \sqrt{2} v_1 v_3 \\ \sqrt{2} v_2 v_3 \\ \end{bmatrix} = \phi(\boldsymbol{u}) \cdot \phi(\boldsymbol{v})$$

These are the transformed features! Essentially, the kernel function implicitly considers the transformed features $u_1^2$, $u_2^2$, $u_3^2$, $u_1 u_2$, $u_1 u_3$ and $u_2 u_3$ for each input sample, without explicitly computing them. It obtains the value of $\phi(\boldsymbol{u}) \cdot \phi(\boldsymbol{v})$ directly from $(\boldsymbol{u} \cdot \boldsymbol{v})^2$, which takes only $O(m)$ time. (But we're handling $O(m^2)$ transformed features!)
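A quick NumPy check of this identity (the two vectors below are arbitrary illustrative values):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 3-dimensional input (6 transformed features)."""
    x1, x2, x3 = x
    return np.array([x1**2, x2**2, x3**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1 * x3,
                     np.sqrt(2) * x2 * x3])

u = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, -1.0, 2.0])

print((u @ v) ** 2)        # kernel: square of the dot product, O(m) work
print(phi(u) @ phi(v))     # explicit transformed features, same value (20.25)
```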

See Task 0 of the bonus question for a sample implementation / visualization. Training the SVM with the transformed features is exactly identical to training the SVM with the quadratic kernel function! (Note: That implementation uses two-dimensional data for ease of visualization.)

We can extend this further to degree $d$ by simply taking $K(\boldsymbol{u}, \boldsymbol{v}) = (\boldsymbol{u} \cdot \boldsymbol{v})^d$. Computing the $d$-th power of a dot product is still very cheap, yet we're implicitly considering $O(m^d)$ transformed features!

Example 2: Gaussian Kernel

Consider the following kernel function, where $u$ and $v$ are scalars (for simplicity): $$K(u, v) = \exp \left( -\dfrac{\lVert u - v \rVert^2}{2\sigma^2} \right)$$

By expanding using the Taylor series of $e^x$ (taking $2\sigma^2 = 1$ to keep the algebra clean; see the tutorial solutions for the complete derivation), we have $$\begin{align} K(u, v) & = \sum_{k=0}^{\infty} \left[ \sqrt{\dfrac{2^k}{k!}} \exp(-u^2) u^k \times \sqrt{\dfrac{2^k}{k!}} \exp(-v^2) v^k \right] \\ & = \begin{bmatrix} \sqrt{\dfrac{2^0}{0!}} \exp(-u^2) u^0 \\ \sqrt{\dfrac{2^1}{1!}} \exp(-u^2) u^1 \\ \sqrt{\dfrac{2^2}{2!}} \exp(-u^2) u^2 \\ \vdots \end{bmatrix} \cdot \begin{bmatrix} \sqrt{\dfrac{2^0}{0!}} \exp(-v^2) v^0 \\ \sqrt{\dfrac{2^1}{1!}} \exp(-v^2) v^1 \\ \sqrt{\dfrac{2^2}{2!}} \exp(-v^2) v^2 \\ \vdots \end{bmatrix} \\ & = \phi(u) \cdot \phi(v) \end{align}$$

Note that $\phi(u)$ has terms $u^0, u^1, u^2, \cdots$, which means the transformation is infinite-dimensional.
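We can sanity-check this numerically by truncating the infinite feature map after a handful of terms (again with $2\sigma^2 = 1$; the scalars and the truncation length below are arbitrary illustrative choices):

```python
import numpy as np
from math import factorial

def phi_truncated(x, num_terms=15):
    # First `num_terms` coordinates of the infinite feature map (with 2 * sigma^2 = 1)
    return np.array([np.sqrt(2**k / factorial(k)) * np.exp(-x**2) * x**k
                     for k in range(num_terms)])

u, v = 0.8, -0.3
exact = np.exp(-(u - v) ** 2)                 # the Gaussian kernel value
approx = phi_truncated(u) @ phi_truncated(v)  # dot product of the truncated feature maps
print(exact, approx)                          # the two agree to many decimal places
```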

Last updated: 14 March 2025