The aim of the following sections is to learn a target function from given training samples $(x_i, y_i)$, $i = 1, \ldots, m$. A distinction is made between classification and regression tasks. In classification, the set of target values is discrete, in general as large as the number of classes the samples belong to. Binary classification with just two labels, say $\{-1, 1\}$, has been studied most extensively. An example where binary classification is useful is spam detection. Another example, from medical diagnostics, is given in Fig. 1. It should be mentioned that in many practical applications, the original input variables are pre-processed and transformed into a new space which is often easier to handle but preserves the necessary discriminatory information. This process is also known as feature extraction.
Learning by Support Vector Machines
Fig. 1
Examples of physiological (SR) and pathological (VT) electrical heart activity curves measured by an implanted cardioverter-defibrillator. For the classification of such signals, see Strauss and Steidl [95]
In contrast to classification, regression aims at approximating the “whole” real-valued function from some function values, so that the set of target values is not countable here. The above examples, as all problems considered in this paper, belong to the area of supervised learning. This means that all input vectors come along with their corresponding target function values (labeled data). In contrast, semi-supervised learning makes use of both labeled and unlabeled data, and in unsupervised learning labeled data are not available at all, so that one can only exploit the input vectors $x_i$. The latter methods can be applied, for example, to discover groups of similar exemplars within the data (clustering), to determine the distribution of the data within the input space (density estimation), or to project data from high-dimensional spaces to lower-dimensional spaces. There are also learning models which involve more complex interactions between the learner and the environment. An example is reinforcement learning, which is concerned with the problem of finding suitable actions in a given situation in order to maximize a reward. In contrast to supervised learning, reinforcement learning does not start from given optimal (labeled) outputs but must instead find them by a process of trial and error. For reinforcement learning, the reader may consult the book of Sutton and Barto [97].
Learning models can also differ in the way in which the training data are generated and presented to the learner. For example, a distinction can be made between batch learning, where all the data are given at the start of the training phase, and online learning, where the learner receives one example at a time and updates the hypothesis function in response to each new example.
This paper is organized as follows: An overview of the historical background is given in Sect. 2. Section 3 contains an introduction to classical SVM methods. It starts with linear methods for (binary) support vector classification and regression and also considers linear least squares classification/regression. Then the kernel trick is explained and used to transfer the linear models into so-called feature spaces, which results in nonlinear learning methods. Some other models related to SVMs as well as multi-class classification and multitask learning are addressed at the end of the section. Section 4 provides some mathematical background concerning RKHSs and quadratic optimization. The last subsection very briefly sketches some results in statistical learning theory. Numerical methods which make the classical SVMs efficient, in particular for large data sets, are presented in Sect. 5. The paper ends with some conclusions in Sect. 6.
Modern learning theory has a long and interesting history going back as far as Gauss and Legendre, but it got its enormous impetus from the advent of computing machines. In the 1930s, revolutionary changes took place in understanding the principles of inductive inference from a philosophical perspective, e.g., by Popper, from the point of view of statistical theory, e.g., by Kolmogorov, Glivenko, and Cantelli, and from applied statistics, e.g., by Fisher. A good overview of the leading ideas and developments of this time can be found in the comments and bibliographical remarks of Vapnik’s book [105]. The starting point of statistical learning theory, which considers the task of minimizing a risk functional based on empirical data, dates back to the 1960s. Support vector machines, including their RKHS interpretation, were only discovered in the 1990s and led to an explosion in applications and theoretical analysis.
Let us start with the problem of linear regression, which is much older than linear classification. The method of least squares was first published by Legendre [63]. It was considered as a statistical procedure by Gauss [43], who claimed, to the annoyance of Legendre but in accordance with most historians, to have applied this method since 1795. For given points $x_i$ and corresponding values $y_i$, $i = 1, \ldots, m$, the original least squares approach finds a hyperplane having minimal least squares distance from the points $(x_i, y_i)$:
$$\min_{w, b} \sum_{i=1}^m \big(\langle w, x_i\rangle + b - y_i\big)^2. \qquad (1)$$
This leads to the solution of a linear system of equations, which can be ill conditioned or possess several solutions. Therefore, regularized versions were introduced later. The linear least squares approach is optimal in the case of linear targets corrupted by Gaussian noise. Sometimes it is useful to find a linear function which does not minimize the least squares error but, for example, the $\ell_1$-error
$$\sum_{i=1}^m \big|\langle w, x_i\rangle + b - y_i\big|,$$
which is more robust against outliers. This model, with the constraint that the sum of the errors is equal to zero, was already studied by Laplace in 1799; see Laplace [61]. Another popular choice is the $\ell_\infty$-error
$$\max_{i=1,\ldots,m} \big|\langle w, x_i\rangle + b - y_i\big|,$$
which, in contrast, puts more emphasis on outliers. Unlike the least squares method, the solutions of the $\ell_1$- and $\ell_\infty$-problems cannot be computed via linear systems of equations but require solving linear optimization problems. Figure 2 shows a one-dimensional example, where data are approximated by lines with minimal $\ell_2$-, $\ell_1$-, and $\ell_\infty$-error norm, respectively. For more information on (regularized) least squares problems, the reader may consult, e.g., the books of Golub and Van Loan [45] and of Björck [10].
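To make the difference between the three error measures concrete, the following Python sketch fits a line to hypothetical one-dimensional data containing a single outlier: the least squares fit is obtained via np.linalg.lstsq, while the $\ell_1$- and $\ell_\infty$-fits are computed as linear programs (the data and parameter values are illustrative only).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 1.5 * x + 2 + rng.normal(0, 0.3, x.size)
y[3] += 6.0                                   # a single outlier
A = np.column_stack([x, np.ones_like(x)])     # rows [x_i, 1] for the model w*x + b
m = x.size

# l2: ordinary least squares
w_l2 = np.linalg.lstsq(A, y, rcond=None)[0]

# l1: min sum_i t_i  s.t.  -t_i <= <a_i,(w,b)> - y_i <= t_i, variables (w, b, t_1..t_m)
c = np.r_[0.0, 0.0, np.ones(m)]
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.r_[y, -y]
bounds = [(None, None)] * 2 + [(0, None)] * m
w_l1 = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds).x[:2]

# l_inf: min t  s.t.  |<a_i,(w,b)> - y_i| <= t, variables (w, b, t)
c = np.r_[0.0, 0.0, 1.0]
A_ub = np.block([[A, -np.ones((m, 1))], [-A, -np.ones((m, 1))]])
bounds = [(None, None), (None, None), (0, None)]
w_linf = linprog(c, A_ub=A_ub, b_ub=np.r_[y, -y], bounds=bounds).x[:2]

print("l2  :", w_l2)    # affected moderately by the outlier
print("l1  :", w_l1)    # most robust against the outlier
print("linf:", w_linf)  # pulled strongly towards the outlier
```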
Fig. 2
Linear approximation with respect to the $\ell_2$-, $\ell_1$-, and $\ell_\infty$-norm of the error (left to right). The $\ell_1$ approximation is more robust against outliers, while the $\ell_\infty$-norm takes them more strongly into account
Regularized least squares methods which penalize the quadratic weight as in section “Linear Least Squares Classification and Regression” were examined under the name ridge regression by Hoerl and Kennard [49]. This method can be considered as a special case of the regularization theory for ill-posed problems developed by Tikhonov and Arsenin [102]. Loss functions other than the least squares loss, such as the ε-insensitive loss, were brought into play by Vapnik [105]. This loss function enforces a sparse representation of the weights in terms of so-called support vectors, which form a (small) subset of the training samples $\{x_i : i = 1, \ldots, m\}$.
The simplest form of classification is binary classification, where one has to separate just two classes. Linear hyperplanes $H(w, b)$ separating points, also called linear discriminants or perceptrons, were already studied by Fisher [41] and became interesting for neural network researchers in the early 1960s. One of the first algorithms for constructing a separating hyperplane for linearly separable points was Rosenblatt's perceptron; see Rosenblatt [84]. It is an iterative, online, and mistake-driven algorithm which starts with an initial weight guess $w$ for the hyperplane and adapts the weight each time a training point is misclassified by the current weight. If the data are linearly separable, the procedure converges, and the number of mistakes (number of updates of $w$) does not exceed $(2R/\gamma)^2$, where $R$ is the radius of a ball containing the training points and $\gamma$ is the smallest distance between a training point and the hyperplane. For linearly separable points, there exist various separating hyperplanes.
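The following minimal Python sketch of Rosenblatt's mistake-driven update illustrates the procedure on a small, hypothetical linearly separable data set; it is not the historical implementation.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt-type perceptron: X has shape (m, d), y contains labels in {-1, +1}.
    Returns the weight w, the offset b, and the number of updates (mistakes)."""
    w, b, mistakes = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # x_i is misclassified by the current hyperplane
                w += yi * xi             # adapt the weight ...
                b += yi                  # ... and the offset
                mistakes += 1
                converged = False
        if converged:                    # no mistake in a full pass: done
            break
    return w, b, mistakes

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])      # toy data, linearly separable
print(perceptron(X, y))
```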
An optimal hyperplane for linearly separable points, in the sense that the minimal distance of the points from the plane becomes maximal, was constructed by the so-called generalized portrait algorithm of Vapnik and Lerner [108]. This learning method is also known as the linear hard margin support vector classifier. The method was generalized to nonseparable points by Cortes and Vapnik [26], which leads to soft margin classifiers. Finally, the step from linear to nonlinear classifiers via feature maps was taken by Boser et al. [12]. Their idea of combining a linear algorithm with a kernel approach inspired the further examination of specific kernels for applications.
However, the theory of kernels and their applications is older than SVMs. Aronszajn [5] systematically developed the theory of RKHSs in the 1940s, though it was later discovered that many results had been obtained independently by Povzner [83]. The work of Parzen [78] brought RKHSs to the fore in statistical problems; see also Kailath [54]. Kernels in pattern recognition were already applied by Aizerman et al. [1]. Empirical risk minimization over RKHSs was considered by Wahba [112] in connection with splines and by Poggio and Girosi [81] in relation to neural networks. Schölkopf et al. [89] realized that the kernel trick works not only for SVMs but also for many other methods, such as principal component analysis in unsupervised learning.
The invention of SVMs has led to a gigantic number of developments in learning theory and practice; this paper is far too short to list even the references on this topic. Beyond various applications, advanced generalization results, suitable choices of kernels, efficient numerical methods in particular for large data sets, relations to other sparse representation methods, multi-class classification, and multitask learning have also been addressed. The reader will find some references in the corresponding sections.
This section starts with linear classification and regression, which provide the easiest algorithms for understanding some of the main building blocks that also appear in the more sophisticated nonlinear support vector machines. Moreover, concerning the classification task, this seems to be the best way to explain its geometrical background. The simplest function to feed a classifier with, or to use as an approximation of some unknown function in regression tasks, is a linear (multivariate) function
$$f(x) = \langle w, x\rangle.$$
Often it is combined with some appropriate real number $b$, i.e., one considers the linear polynomial
$$f(x) = \langle w, x\rangle + b. \qquad (2)$$
In the context of learning, $w$ is called the weight vector and $b$ the offset, intercept, or bias.
Let us consider binary classification first and postpone multi-class classification to section “Multi-class Classification and Multitask Learning.” As binary classifier, one can use
$$F(x) := \operatorname{sgn}\big(\langle w, x\rangle + b\big)$$
with the agreement that $\operatorname{sgn}(0) := 0$. The hyperplane
$$H(w, b) := \{x : \langle w, x\rangle + b = 0\}$$
has the normal vector $w/\|w\|_2$, and the distance of a point $x$ to the hyperplane is given by
$$\operatorname{dist}\big(x, H(w, b)\big) = \frac{|\langle w, x\rangle + b|}{\|w\|_2};$$
see Fig. 3, left. In particular, $|b|/\|w\|_2$ is the distance of the hyperplane from the origin.
Fig. 3
Left: hyperplane $H(w, b)$ with its normal vector and its distance from the origin, together with the distance of a point from the hyperplane. Right: linearly separable training set together with a separating hyperplane and the corresponding margin γ
The training set $Z$ consists of two classes labeled by $\pm 1$ with index sets $I_+ := \{i : y_i = 1\}$ and $I_- := \{i : y_i = -1\}$. The training set is said to be separable by the hyperplane $H(w, b)$ if $\langle w, x_i\rangle + b > 0$ for $i \in I_+$ and $\langle w, x_i\rangle + b < 0$ for $i \in I_-$, i.e.,
$$y_i\big(\langle w, x_i\rangle + b\big) > 0, \qquad i = 1, \ldots, m.$$
The smallest distance of a point from the hyperplane,
$$\gamma := \min_{i=1,\ldots,m} \frac{y_i\big(\langle w, x_i\rangle + b\big)}{\|w\|_2}, \qquad (3)$$
is called the margin. Figure 3, right, shows a separating hyperplane of two classes together with its margin. Of course, for a separable training set there may exist various separating hyperplanes. One way to ensure a unique solution is to pose additional requirements on the hyperplane in the form of minimizing a cost functional.
Hard Margin Classifier
One obvious way is to choose the separating hyperplane which has the maximal distance from the data, i.e., a maximal margin. The corresponding classifiers are called maximal margin classifiers or hard margin classifiers. The hyperplane and the corresponding half-spaces do not change if the defining parameters $(w, b)$ are rescaled to $(cw, cb)$, $c > 0$. The so-called generalized portrait algorithm of Vapnik and Lerner [108] constructs a hyperplane that maximizes $\gamma$ under the constraint $\|w\|_2 = 1$. The same hyperplane can be obtained as follows: by (3), it holds that
$$\gamma\,\|w\|_2 = \min_{i=1,\ldots,m} y_i\big(\langle w, x_i\rangle + b\big),$$
so that one can use the scaling
$$\min_{i=1,\ldots,m} y_i\big(\langle w, x_i\rangle + b\big) = 1.$$
Now $\gamma = 1/\|w\|_2$ becomes maximal if and only if $\|w\|_2$ becomes minimal, and the scaling means that $y_i(\langle w, x_i\rangle + b) \ge 1$ for all $i = 1, \ldots, m$. Therefore, the hard margin classifier aims to find parameters $w$ and $b$ solving the following quadratic optimization problem with linear constraints:
$$\min_{w, b} \frac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i\big(\langle w, x_i\rangle + b\big) \ge 1, \quad i = 1, \ldots, m.$$
If the training data are linearly separable, the problem has a unique solution. A brief introduction to quadratic programming methods is given in section “Quadratic Optimization.” To transform the problem into its dual form, consider the Lagrangian
$$L(w, b, \alpha) = \frac{1}{2}\|w\|_2^2 - \sum_{i=1}^m \alpha_i\Big(y_i\big(\langle w, x_i\rangle + b\big) - 1\Big), \qquad \alpha_i \ge 0.$$
Since
$$\frac{\partial L}{\partial w} = 0 \;\Longleftrightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i \qquad (4)$$
and
$$\frac{\partial L}{\partial b} = 0 \;\Longleftrightarrow\; \sum_{i=1}^m \alpha_i y_i = 0, \qquad (5)$$
the Lagrangian can be rewritten as
$$L(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle, \qquad (6)$$
and the dual problem becomes
$$\max_{\alpha \ge 0} \; \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \quad \text{subject to} \quad \sum_{i=1}^m \alpha_i y_i = 0.$$
Note that the dual maximization problem has been rewritten into a minimization problem by using that $\max_\alpha L(\alpha) = -\min_\alpha\big(-L(\alpha)\big)$. Let $1_m$ denote the vector with $m$ coefficients $1$, let $y := (y_1, \ldots, y_m)^{\mathrm{T}}$ and $Y := \operatorname{diag}(y_1, \ldots, y_m)$, and
$$K := \big(\langle x_i, x_j\rangle\big)_{i,j=1}^m.$$
Note that $K$ is symmetric and positive semi-definite. The dual problem can be rewritten in matrix-vector form as
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\,\alpha^{\mathrm{T}} Y K Y \alpha - \langle 1_m, \alpha\rangle \quad \text{subject to} \quad \langle y, \alpha\rangle = 0, \quad \alpha \ge 0. \qquad (7)$$
Let $\alpha^*$ be the minimizer of this dual problem. The intercept $b$ does not appear in the dual problem, and one has to determine its optimal value in another way. By the Kuhn–Tucker conditions, the equations
$$\alpha_i^*\Big(y_i\big(\langle w^*, x_i\rangle + b^*\big) - 1\Big) = 0, \qquad i = 1, \ldots, m,$$
hold true, so that $\alpha_i^* > 0$ is only possible for those training data with $y_i(\langle w^*, x_i\rangle + b^*) = 1$. These are exactly the (few) points having margin distance $\gamma$ from the hyperplane $H(w^*, b^*)$. Define
$$I_S := \{i : \alpha_i^* > 0\} \qquad (8)$$
and call the vectors $x_i$, $i \in I_S$, the support vectors.
By (8) and (4), the optimal weight $w^*$ and the optimal function have a sparse representation in terms of the support vectors:
$$w^* = \sum_{i \in I_S} \alpha_i^* y_i x_i, \qquad f^*(x) = \sum_{i \in I_S} \alpha_i^* y_i \langle x_i, x\rangle + b^*. \qquad (9)$$
Moreover,
and hence, using (5),
so that the intercept can be obtained, e.g., by averaging over the support vectors:
$$b^* = \frac{1}{|I_S|} \sum_{i \in I_S} \big(y_i - \langle w^*, x_i\rangle\big). \qquad (10)$$
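As an illustration of how (7), (9), and (10) fit together, the following Python sketch solves the hard margin dual for a small, hypothetical linearly separable data set with a general-purpose constrained optimizer (SLSQP from scipy); dedicated quadratic programming solvers, as discussed later, are preferable in practice.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],      # class -1 (toy data)
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])      # class +1
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
m = len(y)

K = X @ X.T                              # Gram matrix (<x_i, x_j>)
Q = (y[:, None] * y[None, :]) * K        # Q = Y K Y

# dual (7): minimize 1/2 a^T Q a - <1_m, a>  s.t.  <y, a> = 0, a >= 0
res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               np.zeros(m),
               jac=lambda a: Q @ a - np.ones(m),
               bounds=[(0, None)] * m,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],
               method="SLSQP")
alpha = res.x

support = alpha > 1e-6                    # support vectors: alpha_i^* > 0, cf. (8)
w = (alpha * y) @ X                       # (9): w^* = sum_i alpha_i^* y_i x_i
b = np.mean(y[support] - X[support] @ w)  # (10): average of y_i - <w^*, x_i> over I_S
print("support vectors:", np.where(support)[0])
print("w* =", w, " b* =", b, " margin =", 1 / np.linalg.norm(w))
```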
Soft Margin Classifier
If the training data are not linearly separable, which is the case in most applications, the hard margin classifier is not applicable. The extension of hard margin classifiers to the nonseparable case was done by Cortes and Vapnik [26] by introducing additional slack variables $\xi_i \ge 0$ and a parameter $C > 0$ into the constrained model:
$$\min_{w, b, \xi} \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \xi_i \quad \text{subject to} \quad y_i\big(\langle w, x_i\rangle + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, m.$$
For $C = \infty$, this is again the hard margin classifier model. As before, the margin is defined as $\gamma = 1/\|w^*\|_2$, where $w^*$ is the solution of the above problem. If the slack variable fulfills $0 \le \xi_i^* < 1$, then $x_i$ is correctly classified, and in the case $0 < \xi_i^* < 1$ its distance from the hyperplane is $(1 - \xi_i^*)/\|w^*\|_2$, i.e., smaller than the margin. If $\xi_i^* > 1$, then one has a misclassification. By penalizing the sum of the slack variables, one tries to keep them small.
The above constrained minimization model can be rewritten as an unconstrained one by using a margin-based loss function. A loss function $L(y, t)$ is called margin based if there exists a representing function $l: \mathbb{R} \to [0, \infty)$ such that
$$L(y, t) = l(y\,t), \qquad y \in \{-1, 1\},\; t \in \mathbb{R}.$$
In soft margin classification, the appropriate choice of a loss function is the hinge loss function $l_h$ determined by
$$l_h(t) := \max\{0,\, 1 - t\}.$$
Then the unconstrained primal problem reads
$$\min_{w, b} \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m l_h\Big(y_i\big(\langle w, x_i\rangle + b\big)\Big).$$
The Lagrangian of the linear constrained problem has the form
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i\Big(y_i\big(\langle w, x_i\rangle + b\big) - 1 + \xi_i\Big) - \sum_{i=1}^m \beta_i \xi_i,$$
where $\alpha_i, \beta_i \ge 0$. Partial differentiation of the Lagrangian with respect to $w$ and $b$ results in (4), (5), and with respect to $\xi$ in
$$C - \alpha_i - \beta_i = 0, \qquad i = 1, \ldots, m.$$
Using these relations, the Lagrangian can be rewritten in the same form as in (6), and the dual problem becomes in matrix-vector form
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\,\alpha^{\mathrm{T}} Y K Y \alpha - \langle 1_m, \alpha\rangle \quad \text{subject to} \quad \langle y, \alpha\rangle = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, m.$$
Let $\alpha^*$ be the minimizer of the dual problem. Then the optimal weight $w^*$ and the optimal function are again given by (9) and depend only on the few support vectors defined by (8). By the Kuhn–Tucker conditions, the equations
$$\alpha_i^*\Big(y_i\big(\langle w^*, x_i\rangle + b^*\big) - 1 + \xi_i^*\Big) = 0 \quad \text{and} \quad (C - \alpha_i^*)\,\xi_i^* = 0, \qquad i = 1, \ldots, m,$$
hold true. For $0 < \alpha_i^* < C$, it follows that $\xi_i^* = 0$ and $y_i(\langle w^*, x_i\rangle + b^*) = 1$, i.e., the points $x_i$ have margin distance $\gamma$ from $H(w^*, b^*)$. Moreover,
For $\alpha_i^* = C$, one concludes that $y_i(\langle w^*, x_i\rangle + b^*) = 1 - \xi_i^*$, i.e., $x_i$ has signed distance $(1 - \xi_i^*)/\|w^*\|_2 \le \gamma$ from the optimal hyperplane.
(11)
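In practice, one rarely implements the soft margin classifier from scratch; the following Python sketch applies scikit-learn's SVC with a linear kernel to hypothetical, overlapping two-dimensional classes and inspects the quantities discussed above (the attribute dual_coef_ stores $y_i \alpha_i^*$ for the support vectors).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),     # class -1 (toy data)
               rng.normal(2.0, 1.0, (50, 2))])    # class +1, overlapping
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0)                 # C is the slack penalty parameter
clf.fit(X, y)

print("number of support vectors:", clf.support_.size)
print("w* =", clf.coef_.ravel(), " b* =", clf.intercept_[0])
print("all alpha_i^* <= C:", bool(np.all(np.abs(clf.dual_coef_) <= 1.0 + 1e-8)))
```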
Of course one can also approximate unknown functions by linear (multivariate) polynomials of the form (2).
Hard Margin Regression
The model for linear hard margin regression is given by
$$\min_{w, b} \frac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad \big|\langle w, x_i\rangle + b - y_i\big| \le \varepsilon, \quad i = 1, \ldots, m.$$
The constraints ensure that the training data $y_i$ lie within an ε distance from the value $f(x_i) + b$ of the approximating linear polynomial. The Lagrangian reads
where α i ± ≥ 0. Setting partial derivatives to zero leads to
Using these relations and setting
the Lagrangian can be written as
(12)
(13)
and the dual problem becomes in matrix-vector form
This is a quadratic optimization problem with linear constraints. Let be the solution of this problem and . Then, by (12), the optimal weight and the optimal function have in general a sparse representation in terms of the support vectors x i , i ∈ I S , namely,
The corresponding Kuhn-Tucker conditions are
If (α i −)∗ > 0 or (α i +)∗ > 0, then
respectively. Since both conditions cannot be fulfilled for the same index, it follows that and consequently, either or . Thus, one can obtain the intercept by
(14)
(15)
(16)
Soft Margin Regression
Relaxing the constraints in the hard margin model leads to the following linear soft margin regression problem with $C > 0$:
$$\min_{w, b, \xi^\pm} \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m (\xi_i^+ + \xi_i^-) \quad \text{subject to} \quad \langle w, x_i\rangle + b - y_i \le \varepsilon + \xi_i^+, \quad y_i - \langle w, x_i\rangle - b \le \varepsilon + \xi_i^-, \quad \xi_i^\pm \ge 0.$$
For $C = \infty$, this recovers the linear hard margin regression problem. The above constrained model can be rewritten as an unconstrained one by using a distance-based loss function. A loss function $L(y, t)$ is called distance based if there exists a representing function $l: \mathbb{R} \to [0, \infty)$ such that
$$L(y, t) = l(y - t).$$
In soft margin regression, the appropriate choice is Vapnik's ε-insensitive loss function defined by
$$l_\varepsilon(t) := \max\{0,\, |t| - \varepsilon\}.$$
The function $l_\varepsilon$ is depicted in Fig. 4, left. Then the unconstrained primal model reads
$$\min_{w, b} \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m l_\varepsilon\big(y_i - \langle w, x_i\rangle - b\big).$$
Fig. 4
Vapnik’s ε-insensitive loss function for ε = 2 (left). Example of linear SV soft margin regression (right)
The Lagrangian of the constrained problem is given by
where α i ± ≥ 0 and β i ± ≥ 0, i = 1, …, m. Setting the partial derivatives to zero leads to (12), (13) and
Using these relations the Lagrangian can be written exactly as in the hard margin problem, and the dual problem becomes in matrix-vector form
If are the solution of this problem and , then the optimal weight w ∗ and the optimal function are given by (14). The corresponding Kuhn-Tucker conditions are
If 0 < (α i +)∗ or 0 < (α i −)∗, then
respectively. Both equations cannot be fulfilled at the same time so that one can conclude that either or . In case , this results in the intercept
(17)
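The linear soft margin regression model is implemented, e.g., in scikit-learn's SVR; the following sketch fits a noisy, hypothetical one-dimensional data set and shows that only the points lying on or outside the ε-tube become support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 40)
y = 0.7 * x + 1.0 + rng.normal(0, 0.2, x.size)    # noisy line (toy data)

reg = SVR(kernel="linear", C=10.0, epsilon=0.2)   # epsilon-insensitive tube of half-width 0.2
reg.fit(x[:, None], y)

print("slope  w* =", reg.coef_.ravel()[0])
print("offset b* =", reg.intercept_[0])
print("support vectors:", reg.support_.size, "of", x.size)
```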
Instead of the hinge loss function for classification and the ε-insensitive loss function for regression, other loss functions can be used. Popular margin-based and distance-based loss functions are the logistic loss for classification and regression
respectively. In contrast to the loss functions in the previous subsections, logistic loss functions are differentiable in $t$, so that standard methods such as gradient descent or Newton(-like) methods can often be applied for computing the minimizers of the corresponding problems. For details, see, e.g., the books of Mitchell [73] and of Hastie et al. [47]. Other loss functions for regression are the pinball loss, the Huber function, and the $p$-th power absolute distance loss $|y - t|^p$, $p > 0$. For $p = 2$, the latter is the least squares loss
$$L(y, t) = (y - t)^2.$$
Since $(y - t)^2 = (1 - yt)^2$ for $y \in \{-1, 1\}$, the least squares loss is also margin based, and one can handle least squares classification and regression using the same model, with $y \in \{-1, 1\}$ for classification and $y \in \mathbb{R}$ for regression. In the unconstrained form, one has to minimize
$$\frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \big(y_i - \langle w, x_i\rangle - b\big)^2.$$
This model was published as ridge regression by Hoerl and Kennard [49] and is a regularized version of the Gaussian model (1). Therefore, it can be considered as a special case of regularization theory introduced by Tikhonov and Arsenin [102]. The minimizer can be computed via a linear system of equations. To this end, rewrite the unconstrained problem in matrix-vector form
where
Setting the gradient (with respect to w and b) to zero, one obtains
where and . Hence b and w can be obtained by solving the linear system of equations
Instead of the above problem, one generally solves the “centered” problem
(18)
(19)
where $\tilde{x}_i := x_i - \frac{1}{m}\sum_{j=1}^m x_j$, $i = 1, \ldots, m$. The advantage is that the centered samples sum to the zero vector. Thus, (19) with $\tilde{x}_i$ instead of $x_i$ becomes a separable system: one obtains the intercept immediately, and the weight follows by solving the linear system with positive definite, symmetric coefficient matrix
This means that is just the solution of the centered primal problem without intercept. Finally, one can check by the following argument that indeed :
and with (18) on the other hand
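A minimal Python sketch of the centered least squares approach is given below; the objective is assumed to be $\frac{1}{2}\|w\|_2^2 + C\sum_i(\langle w, x_i\rangle + b - y_i)^2$, so the exact scaling of the regularization term relative to the text is an assumption.

```python
import numpy as np

def ridge_fit(X, y, C):
    """Centered regularized least squares (sketch, assumed scaling 1/(2C)).
    X has shape (m, d) with samples as rows; returns the weight w and intercept b."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                               # centered inputs x_i - mean
    d = X.shape[1]
    A = np.eye(d) / (2.0 * C) + Xc.T @ Xc         # positive definite, symmetric
    w = np.linalg.solve(A, Xc.T @ y)              # weight from the centered system
    b = y.mean() - x_mean @ w                     # intercept recovered afterwards
    return w, b

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.05 * rng.normal(size=100)
print(ridge_fit(X, y, C=10.0))                    # close to (1, -2, 0.5) and 0.3
```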
Note that this is not the coefficient matrix in (19). When turning to nonlinear methods in section “Nonlinear Learning,” it will be essential to work with $K = X^{\mathrm{T}}X$ instead of $XX^{\mathrm{T}}$. This can be achieved by switching to the dual setting. In the following, this dual approach is shown, although it often does not make sense in the linear setting, since the size of the matrix $K$ is in general larger than that of the coefficient matrix in (19). First, one reformulates the primal problem into a constrained one:
The Lagrangian reads
and
The equality constraint in the primal problem together with the above equalities leads to the following linear system of equations to determine the optimal α ∗ and b ∗:
The optimal weight and the corresponding optimal function read
In general there is no sparse representation with respect to the vectors $x_i$; see also Theorem 8.36 in the book of Steinwart and Christmann [93]. Therefore, this method is not called a support vector method in this paper. Finally, note that the centered approach also helps to avoid the intercept in the dual approach. Since this is no longer true when turning to the nonlinear setting, the intercept is kept here.
(20)
(21)
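The dual point of view can be sketched as follows: on centered data and without intercept, the coefficient vector is obtained from an $m \times m$ system involving the Gram matrix $K = (\langle x_i, x_j\rangle)_{i,j}$ rather than a $d \times d$ system; the scaling of the regularizer is again an assumption, and the system in (20)/(21), which keeps the intercept, may look slightly different.

```python
import numpy as np

def dual_ls_fit(X, y, C):
    """Dual regularized least squares on centered data, without intercept (sketch)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    K = Xc @ Xc.T                                        # m x m Gram matrix
    alpha = np.linalg.solve(K + np.eye(len(y)) / (2.0 * C), yc)
    w = Xc.T @ alpha                                     # w = sum_i alpha_i (x_i - mean)
    b = y_mean - x_mean @ w                              # intercept added back at the end
    return w, b

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.05 * rng.normal(size=50)
print(dual_ls_fit(X, y, C=10.0))
```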
A linear form of a decision or regression function may not be suitable for the task at hand. Figure 5 shows two examples, where a linear classifier is not appropriate.
Fig. 5
Two sets, where linear classification is not appropriate
A basic idea to handle such problems was proposed by Boser et al. [12] and consists of the following two steps which will be further explained in the rest of this subsection:
1.
Mapping of the input data into a feature space $\ell_2(I)$, where $I$ is a countable (possibly finite) index set, by a nonlinear feature map $\Phi$.
2.
Application of the linear classification/regression model to the feature set $\{(\Phi(x_i), y_i) : i = 1, \ldots, m\}$.
This means that instead of a linear function (2), one now searches for a function of the form
$$f(x) = \langle w, \Phi(x)\rangle + b. \qquad (22)$$
This nonlinear function on the original input space becomes linear in the feature space $\ell_2(I)$.
Figure 6 shows an example of a feature map. In this example, the set $\{(x_i, y_i) : i = 1, \ldots, 5\}$ is not linearly separable, while the mapped set $\{(\Phi(x_i), y_i) : i = 1, \ldots, 5\}$ is linearly separable. In practical applications, in contrast to this example, the feature map often maps into a higher-dimensional, possibly also infinite-dimensional space.
Fig. 6
Linearly nonseparable training data in the original space (left) become separable in the feature space (right)
Together with the so-called kernel trick, which avoids working directly with the feature map $\Phi$, this approach results in the successful support vector machine (SVM).
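The feature map used in Fig. 6 is not reproduced here; the following sketch uses the simple, hypothetical map $\Phi(x) = (x, x^2)$ on one-dimensional toy data, which are not separable on the real line but become linearly separable in the plane.

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])   # toy 1-D inputs
y = np.array([1, 1, -1, -1, -1, 1, 1])                  # inner class vs. outer class

def phi(x):
    """Hypothetical feature map Phi(x) = (x, x^2)."""
    return np.column_stack([x, x ** 2])

clf = SVC(kernel="linear", C=1e6)       # (nearly) hard margin classifier in feature space
clf.fit(phi(x), y)
print("training accuracy:", clf.score(phi(x), y))       # 1.0: separable after the mapping
```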
In general, one avoids working directly with the feature map by dealing with the dual problem and applying the so-called kernel trick. For a feature map $\Phi$, define the kernel associated with $\Phi$ by
$$K(x, t) := \langle \Phi(x), \Phi(t)\rangle. \qquad (23)$$
More precisely, in practice one often follows the opposite way, namely, one starts with a suitable kernel which is known to have a representation of the form (23) without knowing $\Phi$ explicitly.
A frequently applied group of kernels are continuous, symmetric, positive (semi-)definite kernels like the Gaussian
$$K(x, t) = \exp\big(-\|x - t\|_2^2 / c\big), \qquad c > 0.$$
These kernels, which are also called Mercer kernels, will be considered in detail in section “Reproducing Kernel Hilbert Spaces.” By Mercer's theorem, it can be shown that a Mercer kernel possesses a representation
$$K(x, t) = \sum_{j \in I} \lambda_j\, \psi_j(x)\, \psi_j(t)$$
with $L_2$-orthonormal functions $\psi_j$ and positive values $\lambda_j$, where the right-hand side converges uniformly. Note that the existence of such a representation is clear, but in general the functions $\psi_j$ are not known explicitly. The set $\{\sqrt{\lambda_j}\,\psi_j : j \in I\}$ forms an orthonormal basis of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$. These spaces are considered in more detail in section “Reproducing Kernel Hilbert Spaces.” The feature map is then defined by
$$\Phi(x) := \big(\sqrt{\lambda_j}\,\psi_j(x)\big)_{j \in I}.$$
Using this orthonormal basis, one knows that for any $f \in \mathcal{H}_K$ there exists a unique sequence $w = (w_j)_{j \in I} \in \ell_2(I)$ such that
$$f(x) = \sum_{j \in I} w_j \sqrt{\lambda_j}\,\psi_j(x) = \langle w, \Phi(x)\rangle, \qquad (24)$$
where the convergence is uniform. Conversely, every sequence $w \in \ell_2(I)$ defines a function $f_w$ lying in $\mathcal{H}_K$ by (24). Moreover, Parseval's equality says that
$$\|f_w\|_{\mathcal{H}_K} = \|w\|_{\ell_2(I)}. \qquad (25)$$
For nonlinear classification and regression purposes, one can follow exactly the lines of section “Linear Learning,” except that one has to work in the feature space $\ell_2(I)$ instead of the original input space. Using (22) instead of (2) and
$$K := \big(K(x_i, x_j)\big)_{i,j=1}^m \qquad (26)$$
instead of the kernel matrix in (7), the linear models from section “Linear Learning” can be rewritten as in the following subsections. Note again that K is positive semi-definite.
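A sketch of the kernel matrix (26) for the Gaussian kernel is given below; the parameterization $\exp(-\|x - t\|_2^2/c)$ is an assumption about the normalization used in the text.

```python
import numpy as np

def gaussian_kernel_matrix(X, Z, c=8.0):
    """Kernel matrix (K(x_i, z_j))_{i,j} for the (assumed) Gaussian exp(-||x - t||^2 / c)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / c)

X = np.random.default_rng(0).normal(size=(5, 2))
K = gaussian_kernel_matrix(X, X)
print("symmetric:", np.allclose(K, K.T))
print("positive semi-definite:", bool(np.all(np.linalg.eigvalsh(K) >= -1e-10)))
```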
In the following, the linear classifiers are generalized to feature spaces.
Hard Margin Classifier
The hard margin classification model in the feature space is
$$\min_{w, b} \frac{1}{2}\|w\|_{\ell_2(I)}^2 \quad \text{subject to} \quad y_i\big(\langle w, \Phi(x_i)\rangle + b\big) \ge 1, \quad i = 1, \ldots, m.$$
Interestingly, if $\Phi$ is associated with a Mercer kernel $K$, then $f_w$ lies in the RKHS $\mathcal{H}_K$, and the model can be rewritten, using (25), from the point of view of the RKHS as
$$\min_{f \in \mathcal{H}_K,\, b} \frac{1}{2}\|f\|_{\mathcal{H}_K}^2 \quad \text{subject to} \quad y_i\big(f(x_i) + b\big) \ge 1, \quad i = 1, \ldots, m.$$
The dual problem reads
$$\max_{\alpha \ge 0} \; \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad \sum_{i=1}^m \alpha_i y_i = 0,$$
and with $K$ defined by (26), the matrix-vector form of the dual problem looks like that for the linear hard margin classifier.
Let $\alpha^*$ be the minimizer of the dual problem. Then, by (9) together with the feature space modification, the optimal weight and the optimal function become
$$w^* = \sum_{i \in I_S} \alpha_i^* y_i \Phi(x_i), \qquad f^*(x) = \sum_{i \in I_S} \alpha_i^* y_i K(x_i, x) + b^*.$$
Thus, one can compute the optimal function knowing only the kernel and not the feature map itself. One property a Mercer kernel used for learning purposes should have is that it can be evaluated easily at given points; this is, for example, the case for the Gaussian. Finally, using (10) in the feature space, the intercept can be computed by
and the margin by $\gamma = 1/\|w^*\|_{\ell_2(I)}$, where $\|w^*\|_{\ell_2(I)}^2 = \sum_{i, j \in I_S} \alpha_i^* \alpha_j^* y_i y_j K(x_i, x_j)$.
(27)
Soft Margin Classifier
The soft margin classification model in the feature space is
$$\min_{w, b, \xi} \frac{1}{2}\|w\|_{\ell_2(I)}^2 + C\sum_{i=1}^m \xi_i \quad \text{subject to} \quad y_i\big(\langle w, \Phi(x_i)\rangle + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
If $\Phi$ is associated with a Mercer kernel $K$, the corresponding unconstrained version reads in the RKHS $\mathcal{H}_K$ as
$$\min_{f \in \mathcal{H}_K,\, b} \frac{1}{2}\|f\|_{\mathcal{H}_K}^2 + C\sum_{i=1}^m l_h\Big(y_i\big(f(x_i) + b\big)\Big).$$
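For data such as those in Fig. 5, a nonlinear soft margin classifier is obtained by simply exchanging the kernel; the following sketch uses scikit-learn's SVC with a Gaussian kernel on hypothetical ring-shaped classes (scikit-learn writes the Gaussian as $\exp(-\gamma\|x - t\|^2)$, so $\gamma$ corresponds to $1/c$ under the parameterization assumed above).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
r = np.r_[rng.normal(1.0, 0.1, 100), rng.normal(3.0, 0.1, 100)]   # two rings (toy data)
X = np.column_stack([r * np.cos(t), r * np.sin(t)])
y = np.r_[-np.ones(100), np.ones(100)]

clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / 8.0)    # Gaussian kernel soft margin classifier
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors:", clf.support_.size, "of", len(y))
```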
In the following, the linear regression models are generalized to feature spaces.
Hard Margin Regression
One obtains
$$\min_{w, b} \frac{1}{2}\|w\|_{\ell_2(I)}^2 \quad \text{subject to} \quad \big|\langle w, \Phi(x_i)\rangle + b - y_i\big| \le \varepsilon, \quad i = 1, \ldots, m.$$
If $\Phi$ is associated with a Mercer kernel $K$, then $f_w$ lies in the RKHS $\mathcal{H}_K$, and the model can be rewritten, using (25), from the point of view of the RKHS as
$$\min_{f \in \mathcal{H}_K,\, b} \frac{1}{2}\|f\|_{\mathcal{H}_K}^2 \quad \text{subject to} \quad \big|f(x_i) + b - y_i\big| \le \varepsilon, \quad i = 1, \ldots, m.$$
The dual problem reads in matrix-vector form like the dual problem for linear SV hard margin regression, except that one has to use the kernel matrix defined by (26). Let $(\alpha^+)^*, (\alpha^-)^*$ be the solution of this dual problem. Then one can compute the optimal function using (14) with the corresponding feature space modification as
One obtains sparse representations in terms of the support vectors. By (16), the intercept can be computed by
(28)
Soft Margin Regression
In the feature space, the soft margin regression model is
$$\min_{w, b, \xi^\pm} \frac{1}{2}\|w\|_{\ell_2(I)}^2 + C\sum_{i=1}^m (\xi_i^+ + \xi_i^-) \quad \text{subject to} \quad \langle w, \Phi(x_i)\rangle + b - y_i \le \varepsilon + \xi_i^+, \quad y_i - \langle w, \Phi(x_i)\rangle - b \le \varepsilon + \xi_i^-, \quad \xi_i^\pm \ge 0.$$
Given a feature map associated with a Mercer kernel $K$, the corresponding unconstrained problem can be written as the following minimization problem in the RKHS $\mathcal{H}_K$:
$$\min_{f \in \mathcal{H}_K,\, b} \frac{1}{2}\|f\|_{\mathcal{H}_K}^2 + C\sum_{i=1}^m l_\varepsilon\big(y_i - f(x_i) - b\big).$$
The dual problem looks like the dual problem for linear SV soft margin regression but with the kernel matrix (26). From the solution of the dual problem, one can compute the optimal function as in (28) and the optimal intercept using (17) as
Figure 7, left, shows an SVM soft margin regression function for the data in Fig. 2.
Fig. 7
SVM regression using the Gaussian with c = 8 for the data in Fig. 2. SVM soft margin regression curve with C = 0.5 and ε = 0.2 (left). Least squares SVM regression curve with C = 40 (right)
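The data of Figs. 2 and 7 are not reproduced here; the following sketch performs Gaussian-kernel soft margin regression on hypothetical data with the parameter values quoted in the caption ($C = 0.5$, $\varepsilon = 0.2$, and $\gamma = 1/c$ with $c = 8$, under the kernel parameterization assumed earlier).

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 30))
y = np.sin(x) + 0.1 * rng.normal(size=x.size)          # toy regression data

svr = SVR(kernel="rbf", C=0.5, epsilon=0.2, gamma=1.0 / 8.0)
svr.fit(x[:, None], y)
print("support vectors:", svr.support_.size, "of", x.size)
print("max training deviation:", float(np.max(np.abs(svr.predict(x[:, None]) - y))))
```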
SVM regression is related to various tasks in approximation theory. Some of them will be sketched in the following.
Relation to Sparse Approximation in RKHSs
Let $\mathcal{H}_K$ be an RKHS with kernel $K$. Consider the problem of finding, for an unknown function $f$ with given values $f(x_i)$, $i = 1, \ldots, m$, an approximating function of the form
$$\tilde{f} = \sum_{i=1}^m \alpha_i K(x_i, \cdot)$$
with only a few summands. A starting point would be to minimize the $\mathcal{H}_K$-norm of the error and to penalize the so-called $\ell_0$-“norm” of $\alpha$, given by the number of its nonzero components, in order to enforce sparsity. Unfortunately, the complexity of solving such $\ell_0$-penalized problems increases exponentially with $m$. One remedy is to replace the $\ell_0$-norm by the $\ell_1$-norm, i.e., to deal with
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\Big\|f - \sum_{i=1}^m \alpha_i K(x_i, \cdot)\Big\|_{\mathcal{H}_K}^2 + \varepsilon \|\alpha\|_1,$$
where $\varepsilon > 0$. This problem and its relation to support vector regression were considered by Girosi [44] and Evgeniou et al. [37]. Using the relations in RKHSs from section “Reproducing Kernel Hilbert Spaces,” in particular the reproducing property (H2), this problem becomes, up to a constant,
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\,\alpha^{\mathrm{T}} K \alpha - \sum_{i=1}^m \alpha_i f(x_i) + \varepsilon \|\alpha\|_1,$$
where $K$ is defined by (26). With the splitting
$$\alpha_i = \alpha_i^+ - \alpha_i^-, \qquad \alpha_i^+, \alpha_i^- \ge 0,$$
and consequently $|\alpha_i| = \alpha_i^+ + \alpha_i^-$, the sparse approximation model (30) finally takes the form of the dual problem of SVM hard margin regression without intercept:
$$\min_{\alpha^+, \alpha^- \ge 0} \frac{1}{2}(\alpha^+ - \alpha^-)^{\mathrm{T}} K (\alpha^+ - \alpha^-) - \sum_{i=1}^m (\alpha_i^+ - \alpha_i^-) f(x_i) + \varepsilon \sum_{i=1}^m (\alpha_i^+ + \alpha_i^-).$$
(29)
(30)
(31)
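A closely related and easily computable variant of this $\ell_1$-penalized problem replaces the RKHS-norm data fit by a plain least squares fit on the kernel matrix, i.e., a Lasso problem with design matrix $K$; the following sketch (with hypothetical data and an illustrative penalty) shows the resulting sparsity of the coefficient vector $\alpha$.

```python
import numpy as np
from sklearn.linear_model import Lasso

def gaussian_kernel_matrix(X, Z, c=8.0):
    """(Assumed) Gaussian kernel exp(-||x - t||^2 / c)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / c)

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 10, 40))[:, None]            # toy sample points
f_values = np.sin(X[:, 0]) + 0.05 * rng.normal(size=40) # noisy function values f(x_i)

# Lasso: min ||K alpha - f||_2^2 / (2m) + lam * ||alpha||_1  -- an l1-regularized
# quadratic problem of the same flavor as the sparse approximation model above
K = gaussian_kernel_matrix(X, X)
lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=50_000)
lasso.fit(K, f_values)
print("nonzero coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-8)), "of", K.shape[0])
```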