1. Opening Words
In the previous blog, the function basis was briefly discussed. We began with viewing a function as an infinite vector, and then defined the inner product of functions. Similar to Rn\mathcal{R}^n space, we can also find orthogonal function basis for a function space.
This blog will move a step further discussing about kernel functions and reproducing kernel Hilbert space (RKHS). Kernel methods have been widely used in a variety of data analysis techniques. The motivation of kernel method arises in mapping a vector in Rn\mathcal{R}^n space as another vector in a feature space. For example, imagine there are some red points and some blue points as the next figure shows, which are not easily separable in Rn\mathcal{R}^n space. However, if we map them into a high-dimension feature space, we may be able to seperate them easily. This article will not provide strict theoretical definition, but rather intuitive description on the basic ideas.
2. Eigen Decomposition
For a real symmetric matrix A\mathbf{A} , there exists real number λ\lambda and vector x\mathbf{x} so that
Ax=λx\mathbf{A} \mathbf{x} = \lambda \mathbf{x}
Then λ\lambda is an eigenvalue of A\mathbf{A} and x\mathbf{x} is the corresponding eigenvector. If A\mathbf{A} has two different eigenvalues λ1\lambda_1 and λ2\lambda_2 , λ1≠λ2\lambda_1 \neq \lambda_2 , with corresponding eigenvectors x1\mathbf{x}_1 and x2\mathbf{x}_2 respectively,
λ1x1Tx2=x1TATx2=x1TAx2=λ2x1Tx2\lambda_1 \mathbf{x}_1^T \mathbf{x}_2 = \mathbf{x}_1^T \mathbf{A}^T \mathbf{x}_2 = \mathbf{x}_1^T \mathbf{A} \mathbf{x}_2 = \lambda_2 \mathbf{x}_1^T \mathbf{x}_2
Since λ1≠λ2\lambda_1 \neq \lambda_2 , we have x1Tx2=0\mathbf{x}_1^T \mathbf{x}_2 = 0 , i.e., x1\mathbf{x}_1 and x2\mathbf{x}_2 are orthogonal.
For A∈Rn×n\mathbf{A} \in \mathcal{R}^{n \times n} , we can find nn eigenvalues a long with nn orthogonal eigenvectors. As a result, A\mathbf{A} can be decomposited as
A=QDQT\mathbf{A} = \mathbf{Q} \mathbf{D} \mathbf{Q}^T
where Q\mathbf{Q} is an orthogonal matrix (i.e., QQT=I\mathbf{Q} \mathbf{Q}^T = \mathbf{I} ) and D=diag(λ1,λ2,⋯ ,λn)\mathbf{D} = \text{diag} (\lambda_1, \lambda_2, \cdots, \lambda_n) . If we write Q\mathbf{Q} column by column
Q=(q1,q2,⋯ ,qn)\mathbf{Q}=\left( \mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_n \right)
then
A=QDQT=(q1,q2,⋯ ,qn)(λ1 λ2 ⋱ λn)(q1Tq2T⋮qnT)=(λ1q1,λ2q2,⋯ ,λnqn)(q1Tq2T⋮qnT)=∑i=1nλiqiqiT\begin{array}{rl} \mathbf{A}=\mathbf{Q} \mathbf{D} \mathbf{Q}^T &= \left( \mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_n \right) \begin{pmatrix} \lambda_1\ &&& \\ & \lambda_2\ && \\ && \ddots\ & \\ &&&\lambda_n \end{pmatrix} \begin{pmatrix} \mathbf{q}_1^T \\ \mathbf{q}_2^T \\ \vdots \\ \mathbf{q}_n^T \end{pmatrix} \\ &= \left( \lambda_1 \mathbf{q}_1, \lambda_2 \mathbf{q}_2, \cdots, \lambda_n \mathbf{q}_n \right) \begin{pmatrix} \mathbf{q}_1^T \\ \mathbf{q}_2^T \\ \vdots \\ \mathbf{q}_n^T \end{pmatrix} \\ &=\sum_{i=1}^n \lambda_i \mathbf{q}_i \mathbf{q}_i^T \end{array}
Here {qi}i=1n{ \{\mathbf{q}_i \} }_{i=1}^n is a set of orthogonal basis of Rn\mathcal{R}^n .
3. Kernel Function
A function f(x)f(\mathbf{x}) can be viewed as an infinite vector, then for a function with two independent variables K(x,y)K(\mathbf{x},\mathbf{y}) , we can view it as an infinite matrix. Among them, if K(x,y)=K(y,x)K(\mathbf{x},\mathbf{y}) = K(\mathbf{y},\mathbf{x}) and
∫∫f(x)K(x,y)f(y)dxdy≥0\int \int f(\mathbf{x}) K(\mathbf{x},\mathbf{y}) f(\mathbf{y}) d\mathbf{x} d\mathbf{y} \geq 0
for any function ff , then K(x,y)K(\mathbf{x},\mathbf{y}) is symmetric and positive definite, in which case K(x,y)K(\mathbf{x},\mathbf{y}) is a kernel function.
Similar to matrix eigenvalue and eigenvector, there exists eigenvalue λ\lambda and eigenfunction ψ(x)\psi(\mathbf{x}) so that
∫K(x,y)ψ(x)dx=λψ(y)\int K(\mathbf{x},\mathbf{y}) \psi(\mathbf{x}) d\mathbf{x} = \lambda \psi(\mathbf{y})
For different eigenvalues λ1\lambda_1 and λ2\lambda_2 with corresponding eigenfunctions ψ1(x)\psi_1(\mathbf{x}) and ψ2(x)\psi_2(\mathbf{x}) , it is easy to show that
∫λ1ψ1(x)ψ2(x)dx=∫∫K(y,x)ψ1(y)dyψ2(x)dx=∫∫K(x,y)ψ2(x)dxψ1(y)dy=∫λ2ψ2(y)ψ1(y)dy=∫λ2ψ2(x)ψ1(x)dx\begin{array}{rl} \int \lambda_1 \psi_1(\mathbf{x}) \psi_2(\mathbf{x}) d\mathbf{x} & = \int \int K(\mathbf{y},\mathbf{x}) \psi_1(\mathbf{y}) d\mathbf{y} \psi_2(\mathbf{x}) d\mathbf{x} \\ & = \int \int K(\mathbf{x},\mathbf{y}) \psi_2(\mathbf{x}) d\mathbf{x} \psi_1(\mathbf{y}) d\mathbf{y} \\ & = \int \lambda_2 \psi_2(\mathbf{y}) \psi_1(\mathbf{y}) d\mathbf{y} \\ & = \int \lambda_2 \psi_2(\mathbf{x}) \psi_1(\mathbf{x}) d\mathbf{x} \end{array}
Therefore,
<ψ1,ψ2>=∫ψ1(x)ψ2(x)dx=0< \psi_1, \psi_2 > = \int \psi_1(\mathbf{x}) \psi_2(\mathbf{x}) d\mathbf{x} = 0
Again, the eigenfunctions are orthogonal. Here ψ\psi denotes the function (the infinite vector) itself.
For a kernel function, infinite eigenvalues {λi}i=1∞{ \{\lambda_i\} }_{i=1}^{\infty} along with infinite eigenfunctions {ψi}i=1∞{ \{\psi_i\} }_{i=1}^{\infty} may be found. Similar to matrix case,
K(x,y)=∑i=0∞λiψi(x)ψi(y)K(\mathbf{x},\mathbf{y}) = \sum_{i=0}^{\infty} \lambda_i \psi_i (\mathbf{x}) \psi_i (\mathbf{y})
which is the Mercer's theorem. Here <ψi,ψj>=0< \psi_i, \psi_j > = 0 for i≠ji \neq j . Therefore, {ψi}i=1∞{ \{\psi_i\} }_{i=1}^{\infty} construct a set of orthogonal basis for a function space.
Here are some commonly used kernels:
- Polynomial kernel K(x,y)=(γxTy+C)dK(\mathbf{x},\mathbf{y}) = ( \gamma \mathbf{x}^T \mathbf{y} + C)^d
- Gaussian radial basis kernel K(x,y)=exp(−γ∥x−y∥2)K(\mathbf{x},\mathbf{y}) = \exp (-\gamma \Vert \mathbf{x} - \mathbf{y} \Vert^2 )
- Sigmoid kernel K(x,y)=tanh(γxTy+C)K(\mathbf{x},\mathbf{y}) = \tanh (\gamma \mathbf{x}^T \mathbf{y} + C )
4. Reproducing Kernel Hilbert Space
Treat {λiψi}i=1∞{ \{\sqrt{\lambda_i} \psi_i\} }_{i=1}^{\infty} as a set of orthogonal basis and construct a Hilbert space H\mathcal{H} . Any function or vector in the space can be represented as the linear combination of the basis. Suppose
f=∑i=1∞fiλiψif = \sum_{i=1}^{\infty} f_i \sqrt{\lambda_i} \psi_i
we can denote ff as an infinite vector in H\mathcal{H} :
f=(f1,f2,...)HTf = (f_1, f_2, ...)_\mathcal{H}^T
For another function g=(g1,g2,...)HTg = (g_1, g_2, ...)_\mathcal{H}^T , we have
<f,g>H=∑i=1∞figi< f,g >_\mathcal{H} = \sum_{i=1}^{\infty} f_i g_i
For the kernel function KK, here I use K(x,y)K(\mathbf{x},\mathbf{y}) to denote the evaluation of KK at point x,y\mathbf{x},\mathbf{y} which is a scalar, use K(⋅,⋅)K(\cdot,\cdot) to denote the function (the infinite matrix) itself, and use K(x,⋅)K(\mathbf{x},\cdot) to denote the x\mathbf{x}th "row" of the matrix, i.e., we fix one parameter of the kernel function to be x\mathbf{x} then we can regard it as a function with one parameter or as an infinite vector. Then
K(x,⋅)=∑i=0∞λiψi(x)ψiK(\mathbf{x},\cdot) = \sum_{i=0}^{\infty} \lambda_i \psi_i (\mathbf{x}) \psi_i
In space H\mathcal{H} , we can denote
K(x,⋅)=(λ1ψ1(x),λ2ψ2(x),⋯ )HTK(\mathbf{x},\cdot) = (\sqrt{\lambda_1} \psi_1 (\mathbf{x}), \sqrt{\lambda_2} \psi_2 (\mathbf{x}), \cdots )_\mathcal{H}^T
Therefore
<K(x,⋅),K(y,⋅)>H=∑i=0∞λiψi(x)ψi(y)=K(x,y)< K(\mathbf{x},\cdot), K(\mathbf{y},\cdot) >_\mathcal{H} = \sum_{i=0}^{\infty} \lambda_i \psi_i (\mathbf{x}) \psi_i(\mathbf{y}) = K(\mathbf{x},\mathbf{y})
This is the reproducing property, thus H\mathcal{H} is called reproducing kernel Hilbert space (RKHS).
Now it is time to return to the problem from the beginning of this article: how to map a point into a feature space? If we define a mapping
Φ(x)=K(x,⋅)=(λ1ψ1(x),λ2ψ2(x),⋯ )T\bold{\Phi} (\mathbf{x}) = K(\mathbf{x},\cdot) = (\sqrt{\lambda_1} \psi_1 (\mathbf{x}), \sqrt{\lambda_2} \psi_2 (\mathbf{x}), \cdots )^T
then we can map the point x\mathbf{x} to H\mathcal{H} . Here Φ\bold{\Phi} is not a function, since it points to a vector or a funtion in the feature space H\mathcal{H} . Then
<Φ(x),Φ(y)>H=<K(x,⋅),K(y,⋅)>H=K(x,y)< \bold{\Phi} (\mathbf{x}), \bold{\Phi} (\mathbf{y}) >_\mathcal{H} = < K(\mathbf{x},\cdot), K(\mathbf{y},\cdot) >_\mathcal{H} = K(\mathbf{x},\mathbf{y})
As a result, we do not need to actually know what is the mapping, where is the feature space, or what is the basis of the feature space. For a symmetric positive-definite function KK , there must exist at least one mapping Φ\bold{\Phi} and one feature space H\mathcal{H} so that
<Φ(x),Φ(y)>=K(x,y)< \bold{\Phi} (\mathbf{x}), \bold{\Phi} (\mathbf{y}) > = K(\mathbf{x},\mathbf{y})
which is the so-called kernel trick.
5. A Simple Example
Consider kernel function
K(x,y)=(x1,x2,x1x2)(y1y2y1y2)=x1y1+x2y2+x1x2y1y2K(\mathbf{x},\mathbf{y}) = \left( x_1, x_2, x_1 x_2 \right) \begin{pmatrix} y_1 \\ y_2 \\ y_1 y_2 \end{pmatrix} = x_1 y_1 + x_2 y_2 + x_1 x_2 y_1 y_2
where x=(x1,x2)T,y=(y1,y2)T\mathbf{x}=(x_1,x_2)^T, \mathbf{y}=(y_1,y_2)^T . Let λ1=λ2=λ3=1\lambda_1=\lambda_2=\lambda_3=1 , ψ1(x)=x1\psi_1(\mathbf{x})=x_1 , ψ2(x)=x2\psi_2(\mathbf{x})=x_2 , ψ3(x)=x1x2\psi_3(\mathbf{x})=x_1 x_2 . We can define the mapping as
(x1x2)⟶Φ(x1x2x1x2)\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \overset{\bold{\Phi}}{\longrightarrow} \begin{pmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{pmatrix}
Then
<Φ(x),Φ(y)>=(x1,x2,x1x2)(y1y2y1y2)=K(x,y)< \bold{\Phi} (\mathbf{x}), \bold{\Phi}(\mathbf{y}) > = \left( x_1, x_2, x_1 x_2 \right) \begin{pmatrix} y_1 \\ y_2 \\ y_1 y_2 \end{pmatrix} = K(\mathbf{x},\mathbf{y})
6. Support Vector Machine
Support vector machine (SVM) is one of the most widely known application of RKHS. Suppose we have data pairs (xi,yi)i=1n{ (\mathbf{x}_i, y_i) }_{i=1}^n where yiy_i is either 1 or -1 denoting the class of the point xi\mathbf{x}_i. SVM assumes a hyperplane to best seperate the two classes.
minβ,β012∥β∥2+C∑i=1nξi\min_{\boldsymbol{\beta}, \beta_0} \frac{1}{2} \Vert \boldsymbol{\beta} \Vert^2 + C \sum_{i=1}^n \xi_i
subject to ξi≥0,yi(xiTβ+β0)≥1−ξi,∀i\text{subject to } \xi_i \geq 0, y_i (\mathbf{x}_i^T \boldsymbol{\beta} + \beta_0 ) \geq 1 - \xi_i, \forall i
Sometimes the two classes cannot be easily seperated in Rn\mathcal{R}^n space, thus we can map xi\mathbf{x}_i into a high-dimension feature space where the two classes may be easily seperated. The original problem can be reformulated as
minβ,β012∥β∥2+C∑i=1nξi\min_{\boldsymbol{\beta}, \beta_0} \frac{1}{2} \Vert \boldsymbol{\beta} \Vert^2 + C \sum_{i=1}^n \xi_i
subject to ξi≥0,yi(Φ(xi)Tβ+β0)≥1−ξi,∀i\text{subject to } \xi_i \geq 0, y_i (\bold{\Phi}(\mathbf{x}_i)^T \boldsymbol{\beta} + \beta_0 ) \geq 1 - \xi_i, \forall i
The Lagrange function is
Lp=12∥β∥2+C∑i=1nξi−∑i=1nαi[yi(Φ(xi)Tβ+β0)−(1−ξi)]−∑i=1nμiξiL_p = \frac{1}{2} \Vert \boldsymbol{\beta} \Vert^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i [y_i (\bold{\Phi}(\mathbf{x}_i)^T \boldsymbol{\beta} + \beta_0) - (1-\xi_i)] -\sum_{i=1}^n \mu_i \xi_i
Since
∂Lp∂β=0\frac{\partial L_p}{\partial \boldsymbol{\beta}} = \mathbf{0}
we get
β=∑i=1nαiyiΦ(xi)\boldsymbol{\beta} = \sum_{i=1}^n \alpha_i y_i \bold{\Phi}(\mathbf{x}_i)
That is, β\boldsymbol{\beta} can be writen as the linear combination of xi\mathbf{x}_is! We can substitute β\boldsymbol{\beta} and get the new optimization problem. The objective function changes to:
12∥∑i=1nαiyiΦ(xi)∥2+C∑i=1nξi=12<∑i=1nαiyiΦ(xi),∑j=1nαjyjΦ(xj)>+C∑i=1nξi=12∑i=1n∑j=1nαiαjyiyj<Φ(xi),Φ(xj)>+C∑i=1nξi=12∑i=1n∑j=1nαiαjyiyjK(xi,xj)+C∑i=1nξi \begin{array}{rl} &\frac{1}{2} \Vert \sum_{i=1}^n \alpha_i y_i \bold{\Phi} (\mathbf{x}_i) \Vert^2 + C \sum_{i=1}^n \xi_i \\ =& \frac{1}{2} < \sum_{i=1}^n \alpha_i y_i \bold{\Phi} (\mathbf{x}_i), \sum_{j=1}^n \alpha_j y_j \bold{\Phi} (\mathbf{x}_j) > + C \sum_{i=1}^n \xi_i \\ =& \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j < \bold{\Phi} (\mathbf{x}_i), \bold{\Phi} (\mathbf{x}_j) > + C \sum_{i=1}^n \xi_i \\ = & \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) + C \sum_{i=1}^n \xi_i \end{array}
The constraints changes to:
yi[Φ(xi)T(∑j=1nαjyjΦ(xj))+β0]=yi[(∑j=1nαjyj<Φ(xi),Φ(xj)>)+β0]=yi[(∑j=1nαjyjK(xi,xj))+β0]≥1−ξi,∀i \begin{array}{rl} & y_i \left[\bold{\Phi}(\mathbf{x}_i)^T \left( \sum_{j=1}^n \alpha_j y_j \bold{\Phi}(\mathbf{x}_j) \right) + \beta_0 \right] \\ =& y_i \left[ \left( \sum_{j=1}^n \alpha_j y_j < \bold{\Phi}(\mathbf{x}_i), \bold{\Phi}(\mathbf{x}_j) > \right) + \beta_0 \right] \\ =& y_i \left[ \left( \sum_{j=1}^n \alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j) \right) + \beta_0 \right] \geq 1 - \xi_i, \forall i \end{array}
What we need to do is determining a kernel function and solve for α,β0,ξi\boldsymbol{\alpha}, \beta_0, \xi_i . We do not need to actually construct the feature space. For a new data x\mathbf{x} with unknown class, we can predict its class by
y^=sign[Φ(x)Tβ+β0]=sign[Φ(x)T(∑i=1nαiyiΦ(xi))+β0]=sign(∑i=1nαiyi<Φ(x),Φ(xi)>+β0)=sign(∑i=1nαiyiK(x,xi)+β0)\begin{array}{ccl} \hat{y} &=& \text{sign} \left[ \bold{\Phi} (\mathbf{x})^T \boldsymbol{\beta} + \beta_0 \right] \\ &=& \text{sign} \left[ \bold{\Phi} (\mathbf{x})^T \left( \sum_{i=1}^n \alpha_i y_i \bold{\Phi}(\mathbf{x}_i) \right) + \beta_0 \right] \\ &=& \text{sign} \left( \sum_{i=1}^n \alpha_i y_i < \bold{\Phi} (\mathbf{x}), \bold{\Phi}(\mathbf{x}_i) > + \beta_0 \right) \\ &=& \text{sign} \left( \sum_{i=1}^n \alpha_i y_i K(\mathbf{x},\mathbf{x}_i) + \beta_0 \right) \end{array}
Kernel methods greatly strengthen the discriminative power of SVM.
7. Summary and Reference
Kernel method has been widely utilized in data analytics. Here, the fundamental property of RKHS is introduced. With kernel trick, we can easily map the data to a feature space and do analysis. Here is a video with nice demonstration on why we can easily do classification with kernel SVM in a high-dimension feature space.
The example in Section 5 is from
- Gretton A. (2015): Introduction to RKHS, and some simple kernel algorithms, Advanced Topics in Machine Learning, Lecture conducted from University College London.
Other reference includes
- Paulsen, V. I. (2009). An introduction to the theory of reproducing kernel Hilbert spaces. Lecture Notes.
- Daumé III, H. (2004). From zero to reproducing kernel hilbert spaces in twelve pages or less.
- Friedman, J., Hastie, T., and Tibshirani, R. (2001). The elements of statistical learning. Springer, Berlin: Springer series in statistics.)