The aim of the following sections is to learn a target function from given training samples $(x_i, y_i)$, $i = 1, \ldots, m$. A distinction is made between classification and regression tasks. In classification, the set of target values is discrete, in general as large as the number of classes the samples belong to. Binary classification with just two labels, say $\{-1, 1\}$, has been studied most extensively. An example where binary classification is useful is spam detection. Another example, from medical diagnostics, is given in Fig. 1. It should be mentioned that in many practical applications, the original input variables are pre-processed and transformed into a new space which is often easier to handle but preserves the necessary discriminatory information. This process is also known as feature extraction.
Learning by Support Vector Machines
Fig. 1
Examples of physiological (SR) and pathological (VT) electrical heart activity curves measured by an implanted cardioverter-defibrillator. For the classification of such signals, see Strauss and Steidl [95]
In contrast to classification, regression aims at approximating the “whole” real-valued function from some function values, so that the set of target values is not countable here. The above examples, as all problems considered in this paper, belong to the area of supervised learning. This means that all input vectors come along with their corresponding target function values (labeled data). In contrast, semi-supervised learning makes use of both labeled and unlabeled data, and in unsupervised learning labeled data are not available at all, so that one can only exploit the input vectors $x_i$. The latter methods can be applied, for example, to discover groups of similar exemplars within the data (clustering), to determine the distribution of the data within the input space (density estimation), or to project data from high-dimensional spaces to lower-dimensional spaces. There are also learning models which involve more complex interactions between the learner and the environment. An example is reinforcement learning, which is concerned with the problem of finding suitable actions in a given situation in order to maximize a reward. In contrast to supervised learning, reinforcement learning does not start from given optimal (labeled) outputs but must instead find them by a process of trial and error. For reinforcement learning, the reader may consult the book of Sutton and Barto [97].
Learning models can also differ in the way in which the training data are generated and presented to the learner. For example, a distinction can be made between batch learning, where all the data are given at the start of the training phase, and online learning, where the learner receives one example at a time and updates the hypothesis function in response to each new example.
This paper is organized as follows: An overview of the historical background is given in Sect. 2. Section 3 contains an introduction to classical SVM methods. It starts with linear methods for (binary) support vector classification and regression and also considers linear least squares classification/regression. Then the kernel trick is explained and used to transfer the linear models into so-called feature spaces, which results in nonlinear learning methods. Some other models related to SVMs as well as multi-class classification and multitask learning are addressed at the end of the section. Section 4 provides some mathematical background concerning RKHSs and quadratic optimization. The last subsection very briefly sketches some results in statistical learning theory. Numerical methods which make the classical SVMs efficient, in particular for large data sets, are presented in Sect. 5. The paper ends with some conclusions in Sect. 6.
Modern learning theory has a long and interesting history going back as far as Gauss and Legendre, but it got its enormous impetus from the advent of computing machines. In the 1930s, revolutionary changes took place in understanding the principles of inductive inference from a philosophical perspective, e.g., by Popper, from the point of view of statistical theory, e.g., by Kolmogorov, Glivenko, and Cantelli, and from applied statistics, e.g., by Fisher. A good overview of the leading ideas and developments of this time can be found in the comments and bibliographical remarks of Vapnik’s book [105]. The starting point of statistical learning theory, which considers the task of minimizing a risk functional based on empirical data, dates back to the 1960s. Support vector machines, including their RKHS interpretation, were only discovered in the 1990s and led to an explosion in applications and theoretical analysis.
Let us start with the problem of linear regression, which is much older than linear classification. The method of least squares was first published by Legendre [63]. It was considered as a statistical procedure by Gauss [43], who claimed, to the annoyance of Legendre but in accordance with most historians, to have applied this method since 1795. For given points $x_i$ and corresponding values $y_i$, $i = 1, \ldots, m$, the original least squares approach finds a hyperplane having minimal least squares distance from the points $(x_i, y_i)$:
$$\min_{w, b} \sum_{i=1}^m \big(\langle w, x_i\rangle + b - y_i\big)^2. \qquad (1)$$
This leads to the solution of a linear system of equations, which can be ill conditioned or possess several solutions. Therefore, regularized versions were introduced later. The linear least squares approach is optimal in the case of linear targets corrupted by Gaussian noise. Sometimes it is useful to find a linear function which does not minimize the least squares error but, for example, the $\ell_1$-error
$$\sum_{i=1}^m \big|\langle w, x_i\rangle + b - y_i\big|,$$
which is more robust against outliers. This model, with the constraint that the sum of the errors is equal to zero, was already studied by Laplace in 1799; see Laplace [61]. Another popular choice is the $\ell_\infty$-error
$$\max_{i=1,\ldots,m} \big|\langle w, x_i\rangle + b - y_i\big|,$$
which, in contrast, puts more emphasis on outliers. Unlike the least squares method, the solutions of the $\ell_1$- and $\ell_\infty$-problems cannot be computed via linear systems of equations but require solving linear optimization problems. Figure 2 shows a one-dimensional example, where data are approximated by lines with minimal $\ell_2$-, $\ell_1$-, and $\ell_\infty$-error norm, respectively. For more information on (regularized) least squares problems, the reader may consult, e.g., the books of Golub and Van Loan [45] and of Björck [10].
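To make the difference between the three error measures concrete, the following Python sketch fits a line to hypothetical one-dimensional data containing a single outlier: the least squares fit is obtained via np.linalg.lstsq, while the $\ell_1$- and $\ell_\infty$-fits are computed as linear programs (the data and parameter values are illustrative only).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 1.5 * x + 2 + rng.normal(0, 0.3, x.size)
y[3] += 6.0                                   # a single outlier
A = np.column_stack([x, np.ones_like(x)])     # rows [x_i, 1] for the model w*x + b
m = x.size

# l2: ordinary least squares
w_l2 = np.linalg.lstsq(A, y, rcond=None)[0]

# l1: min sum_i t_i  s.t.  -t_i <= <a_i,(w,b)> - y_i <= t_i, variables (w, b, t_1..t_m)
c = np.r_[0.0, 0.0, np.ones(m)]
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.r_[y, -y]
bounds = [(None, None)] * 2 + [(0, None)] * m
w_l1 = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds).x[:2]

# l_inf: min t  s.t.  |<a_i,(w,b)> - y_i| <= t, variables (w, b, t)
c = np.r_[0.0, 0.0, 1.0]
A_ub = np.block([[A, -np.ones((m, 1))], [-A, -np.ones((m, 1))]])
bounds = [(None, None), (None, None), (0, None)]
w_linf = linprog(c, A_ub=A_ub, b_ub=np.r_[y, -y], bounds=bounds).x[:2]

print("l2  :", w_l2)    # affected moderately by the outlier
print("l1  :", w_l1)    # most robust against the outlier
print("linf:", w_linf)  # pulled strongly towards the outlier
```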
Fig. 2
Linear approximation with respect to the $\ell_2$-, $\ell_1$-, and $\ell_\infty$-norm of the error (left to right). The $\ell_1$ approximation is more robust against outliers, while the $\ell_\infty$-norm takes them more strongly into account
Regularized least squares methods which penalize the quadratic weight as in section “Linear Least Squares Classification and Regression” were examined under the name ridge regression by Hoerl and Kennard [49]. This method can be considered as a special case of the regularization theory for ill-posed problems developed by Tikhonov and Arsenin [102]. Loss functions other than the least squares loss, such as the ε-insensitive loss, were brought into play by Vapnik [105]. This loss function enforces a sparse representation of the weights in terms of so-called support vectors, which form a (small) subset of the training samples $\{x_i : i = 1, \ldots, m\}$.
The simplest form of classification is binary classification, where one has to separate just two classes. Linear hyperplanes $H(w, b)$ separating points, also called linear discriminants or perceptrons, were already studied by Fisher [41] and became interesting for neural network researchers in the early 1960s. One of the first algorithms for constructing a separating hyperplane for linearly separable points was Rosenblatt's perceptron; see Rosenblatt [84]. It is an iterative, online, and mistake-driven algorithm which starts with an initial weight guess $w$ for the hyperplane and adapts the weight each time a training point is misclassified by the current weight. If the data are linearly separable, the procedure converges, and the number of mistakes (number of updates of $w$) does not exceed $(2R/\gamma)^2$, where $R$ is the radius of a ball containing the training points and $\gamma$ is the smallest distance between a training point and the hyperplane. For linearly separable points, there exist various separating hyperplanes.
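The following minimal Python sketch of Rosenblatt's mistake-driven update illustrates the procedure on a small, hypothetical linearly separable data set; it is not the historical implementation.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt-type perceptron: X has shape (m, d), y contains labels in {-1, +1}.
    Returns the weight w, the offset b, and the number of updates (mistakes)."""
    w, b, mistakes = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # x_i is misclassified by the current hyperplane
                w += yi * xi             # adapt the weight ...
                b += yi                  # ... and the offset
                mistakes += 1
                converged = False
        if converged:                    # no mistake in a full pass: done
            break
    return w, b, mistakes

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])      # toy data, linearly separable
print(perceptron(X, y))
```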
An optimal hyperplane for linearly separable points, in the sense that the minimal distance of the points from the plane becomes maximal, was constructed by the so-called generalized portrait algorithm of Vapnik and Lerner [108]. This learning method is also known as the linear hard margin support vector classifier. The method was generalized to nonseparable points by Cortes and Vapnik [26], which leads to soft margin classifiers. Finally, the step from linear to nonlinear classifiers via feature maps was taken by Boser et al. [12]. Their idea of combining a linear algorithm with a kernel approach inspired the further examination of specific kernels for applications.
However, the theory of kernels and their applications is older than SVMs. Aronszajn [5] systematically developed the theory of RKHSs in the 1940s, though it was later discovered that many results had been obtained independently by Povzner [83]. The work of Parzen [78] brought RKHSs to the fore in statistical problems; see also Kailath [54]. Kernels in pattern recognition were already applied by Aizerman et al. [1]. Empirical risk minimization over RKHSs was considered by Wahba [112] in connection with splines and by Poggio and Girosi [81] in relation to neural networks. Schölkopf et al. [89] realized that the kernel trick works not only for SVMs but also for many other methods, such as principal component analysis in unsupervised learning.
The invention of SVMs has led to a gigantic number of developments in learning theory and practice; this paper is far too short to list even the references on this topic. Beyond various applications, advanced generalization results, suitable choices of kernels, efficient numerical methods in particular for large data sets, relations to other sparse representation methods, multi-class classification, and multitask learning have also been addressed. The reader will find some references in the corresponding sections.
This section starts with linear classification and regression, which provide the easiest algorithms for understanding some of the main building blocks that also appear in the more sophisticated nonlinear support vector machines. Moreover, concerning the classification task, this seems to be the best way to explain its geometrical background. The simplest function to feed a classifier with, or to use as an approximation of some unknown function in regression tasks, is a linear (multivariate) function
$$f(x) = \langle w, x\rangle.$$
Often it is combined with some appropriate real number $b$, i.e., one considers the linear polynomial
$$f(x) = \langle w, x\rangle + b. \qquad (2)$$
In the context of learning, $w$ is called the weight vector and $b$ the offset, intercept, or bias.
Let us consider binary classification first and postpone multi-class classification to section “Multi-class Classification and Multitask Learning.” As binary classifier, one can use
$$F(x) := \operatorname{sgn}\big(\langle w, x\rangle + b\big)$$
with the agreement that $\operatorname{sgn}(0) := 0$. The hyperplane
$$H(w, b) := \{x : \langle w, x\rangle + b = 0\}$$
has the normal vector $w/\|w\|_2$, and the distance of a point $x$ to the hyperplane is given by
$$\operatorname{dist}\big(x, H(w, b)\big) = \frac{|\langle w, x\rangle + b|}{\|w\|_2};$$
see Fig. 3, left. In particular, $|b|/\|w\|_2$ is the distance of the hyperplane from the origin.
Fig. 3
Left: hyperplane $H(w, b)$ with its normal vector and its distance from the origin, together with the distance of a point from the hyperplane. Right: linearly separable training set together with a separating hyperplane and the corresponding margin γ
The training set $Z$ consists of two classes labeled by $\pm 1$ with index sets $I_+ := \{i : y_i = 1\}$ and $I_- := \{i : y_i = -1\}$. The training set is said to be separable by the hyperplane $H(w, b)$ if $\langle w, x_i\rangle + b > 0$ for $i \in I_+$ and $\langle w, x_i\rangle + b < 0$ for $i \in I_-$, i.e.,
$$y_i\big(\langle w, x_i\rangle + b\big) > 0, \qquad i = 1, \ldots, m.$$
The smallest distance of a point from the hyperplane,
$$\gamma := \min_{i=1,\ldots,m} \frac{y_i\big(\langle w, x_i\rangle + b\big)}{\|w\|_2}, \qquad (3)$$
is called the margin. Figure 3, right, shows a separating hyperplane of two classes together with its margin. Of course, for a separable training set there may exist various separating hyperplanes. One way to ensure a unique solution is to pose additional requirements on the hyperplane in the form of minimizing a cost functional.
Hard Margin Classifier
One obvious way is to choose the separating hyperplane which has the maximal distance from the data, i.e., a maximal margin. The corresponding classifiers are called maximal margin classifiers or hard margin classifiers. The hyperplane and the corresponding half-spaces do not change if the defining parameters $(w, b)$ are rescaled to $(cw, cb)$, $c > 0$. The so-called generalized portrait algorithm of Vapnik and Lerner [108] constructs a hyperplane that maximizes $\gamma$ under the constraint $\|w\|_2 = 1$. The same hyperplane can be obtained as follows: by (3), it holds that
$$\gamma\,\|w\|_2 = \min_{i=1,\ldots,m} y_i\big(\langle w, x_i\rangle + b\big),$$
so that one can use the scaling
$$\min_{i=1,\ldots,m} y_i\big(\langle w, x_i\rangle + b\big) = 1.$$
Now $\gamma = 1/\|w\|_2$ becomes maximal if and only if $\|w\|_2$ becomes minimal, and the scaling means that $y_i(\langle w, x_i\rangle + b) \ge 1$ for all $i = 1, \ldots, m$. Therefore, the hard margin classifier aims to find parameters $w$ and $b$ solving the following quadratic optimization problem with linear constraints:
$$\min_{w, b} \frac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i\big(\langle w, x_i\rangle + b\big) \ge 1, \quad i = 1, \ldots, m.$$
If the training data are linearly separable, the problem has a unique solution. A brief introduction to quadratic programming methods is given in section “Quadratic Optimization.” To transform the problem into its dual form, consider the Lagrangian
$$L(w, b, \alpha) = \frac{1}{2}\|w\|_2^2 - \sum_{i=1}^m \alpha_i\Big(y_i\big(\langle w, x_i\rangle + b\big) - 1\Big), \qquad \alpha_i \ge 0.$$
Since
$$\frac{\partial L}{\partial w} = 0 \;\Longleftrightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i \qquad (4)$$
and
$$\frac{\partial L}{\partial b} = 0 \;\Longleftrightarrow\; \sum_{i=1}^m \alpha_i y_i = 0, \qquad (5)$$
the Lagrangian can be rewritten as
$$L(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle, \qquad (6)$$
and the dual problem becomes
$$\max_{\alpha \ge 0} \; \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \quad \text{subject to} \quad \sum_{i=1}^m \alpha_i y_i = 0.$$
Note that the dual maximization problem has been rewritten into a minimization problem by using that $\max_\alpha L(\alpha) = -\min_\alpha\big(-L(\alpha)\big)$. Let $1_m$ denote the vector with $m$ coefficients $1$, let $y := (y_1, \ldots, y_m)^{\mathrm{T}}$ and $Y := \operatorname{diag}(y_1, \ldots, y_m)$, and
$$K := \big(\langle x_i, x_j\rangle\big)_{i,j=1}^m.$$
Note that $K$ is symmetric and positive semi-definite. The dual problem can be rewritten in matrix-vector form as
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\,\alpha^{\mathrm{T}} Y K Y \alpha - \langle 1_m, \alpha\rangle \quad \text{subject to} \quad \langle y, \alpha\rangle = 0, \quad \alpha \ge 0. \qquad (7)$$
Let $\alpha^*$ be the minimizer of this dual problem. The intercept $b$ does not appear in the dual problem, and one has to determine its optimal value in another way. By the Kuhn–Tucker conditions, the equations
$$\alpha_i^*\Big(y_i\big(\langle w^*, x_i\rangle + b^*\big) - 1\Big) = 0, \qquad i = 1, \ldots, m,$$
hold true, so that $\alpha_i^* > 0$ is only possible for those training data with $y_i(\langle w^*, x_i\rangle + b^*) = 1$. These are exactly the (few) points having margin distance $\gamma$ from the hyperplane $H(w^*, b^*)$. Define
$$I_S := \{i : \alpha_i^* > 0\} \qquad (8)$$
and call the vectors $x_i$, $i \in I_S$, the support vectors.
By (8) and (4), the optimal weight $w^*$ and the optimal function have a sparse representation in terms of the support vectors:
$$w^* = \sum_{i \in I_S} \alpha_i^* y_i x_i, \qquad f^*(x) = \sum_{i \in I_S} \alpha_i^* y_i \langle x_i, x\rangle + b^*. \qquad (9)$$
Moreover,
and hence, using (5),
so that the intercept can be obtained, e.g., by averaging over the support vectors:
$$b^* = \frac{1}{|I_S|} \sum_{i \in I_S} \big(y_i - \langle w^*, x_i\rangle\big). \qquad (10)$$
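As an illustration of how (7), (9), and (10) fit together, the following Python sketch solves the hard margin dual for a small, hypothetical linearly separable data set with a general-purpose constrained optimizer (SLSQP from scipy); dedicated quadratic programming solvers, as discussed later, are preferable in practice.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],      # class -1 (toy data)
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])      # class +1
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
m = len(y)

K = X @ X.T                              # Gram matrix (<x_i, x_j>)
Q = (y[:, None] * y[None, :]) * K        # Q = Y K Y

# dual (7): minimize 1/2 a^T Q a - <1_m, a>  s.t.  <y, a> = 0, a >= 0
res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               np.zeros(m),
               jac=lambda a: Q @ a - np.ones(m),
               bounds=[(0, None)] * m,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],
               method="SLSQP")
alpha = res.x

support = alpha > 1e-6                    # support vectors: alpha_i^* > 0, cf. (8)
w = (alpha * y) @ X                       # (9): w^* = sum_i alpha_i^* y_i x_i
b = np.mean(y[support] - X[support] @ w)  # (10): average of y_i - <w^*, x_i> over I_S
print("support vectors:", np.where(support)[0])
print("w* =", w, " b* =", b, " margin =", 1 / np.linalg.norm(w))
```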
Soft Margin Classifier
If the training data are not linearly separable, which is the case in most applications, the hard margin classifier is not applicable. The extension of hard margin classifiers to the nonseparable case was done by Cortes and Vapnik [26] by introducing additional slack variables $\xi_i \ge 0$ and a parameter $C > 0$ into the constrained model:
$$\min_{w, b, \xi} \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \xi_i \quad \text{subject to} \quad y_i\big(\langle w, x_i\rangle + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, m.$$
For $C = \infty$, this is again the hard margin classifier model. As before, the margin is defined as $\gamma = 1/\|w^*\|_2$, where $w^*$ is the solution of the above problem. If the slack variable fulfills $0 \le \xi_i^* < 1$, then $x_i$ is correctly classified, and in the case $0 < \xi_i^* < 1$ its distance from the hyperplane is $(1 - \xi_i^*)/\|w^*\|_2$, i.e., smaller than the margin. If $\xi_i^* > 1$, then one has a misclassification. By penalizing the sum of the slack variables, one tries to keep them small.
The above constrained minimization model can be rewritten as an unconstrained one by using a margin-based loss function. A loss function $L(y, t)$ is called margin based if there exists a representing function $l: \mathbb{R} \to [0, \infty)$ such that
$$L(y, t) = l(y\,t), \qquad y \in \{-1, 1\},\; t \in \mathbb{R}.$$
In soft margin classification, the appropriate choice of a loss function is the hinge loss function $l_h$ determined by
$$l_h(t) := \max\{0,\, 1 - t\}.$$
Then the unconstrained primal problem reads
$$\min_{w, b} \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m l_h\Big(y_i\big(\langle w, x_i\rangle + b\big)\Big).$$
The Lagrangian of the linear constrained problem has the form
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i\Big(y_i\big(\langle w, x_i\rangle + b\big) - 1 + \xi_i\Big) - \sum_{i=1}^m \beta_i \xi_i,$$
where $\alpha_i, \beta_i \ge 0$. Partial differentiation of the Lagrangian with respect to $w$ and $b$ results in (4), (5), and with respect to $\xi$ in
$$C - \alpha_i - \beta_i = 0, \qquad i = 1, \ldots, m.$$
Using these relations, the Lagrangian can be rewritten in the same form as in (6), and the dual problem becomes in matrix-vector form
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\,\alpha^{\mathrm{T}} Y K Y \alpha - \langle 1_m, \alpha\rangle \quad \text{subject to} \quad \langle y, \alpha\rangle = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, m.$$
Let $\alpha^*$ be the minimizer of the dual problem. Then the optimal weight $w^*$ and the optimal function are again given by (9) and depend only on the few support vectors defined by (8). By the Kuhn–Tucker conditions, the equations
$$\alpha_i^*\Big(y_i\big(\langle w^*, x_i\rangle + b^*\big) - 1 + \xi_i^*\Big) = 0 \quad \text{and} \quad (C - \alpha_i^*)\,\xi_i^* = 0, \qquad i = 1, \ldots, m,$$
hold true. For $0 < \alpha_i^* < C$, it follows that $\xi_i^* = 0$ and $y_i(\langle w^*, x_i\rangle + b^*) = 1$, i.e., the points $x_i$ have margin distance $\gamma$ from $H(w^*, b^*)$. Moreover,
For $\alpha_i^* = C$, one concludes that $y_i(\langle w^*, x_i\rangle + b^*) = 1 - \xi_i^*$, i.e., $x_i$ has signed distance $(1 - \xi_i^*)/\|w^*\|_2 \le \gamma$ from the optimal hyperplane.
(11)
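In practice, one rarely implements the soft margin classifier from scratch; the following Python sketch applies scikit-learn's SVC with a linear kernel to hypothetical, overlapping two-dimensional classes and inspects the quantities discussed above (the attribute dual_coef_ stores $y_i \alpha_i^*$ for the support vectors).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),     # class -1 (toy data)
               rng.normal(2.0, 1.0, (50, 2))])    # class +1, overlapping
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0)                 # C is the slack penalty parameter
clf.fit(X, y)

print("number of support vectors:", clf.support_.size)
print("w* =", clf.coef_.ravel(), " b* =", clf.intercept_[0])
print("all alpha_i^* <= C:", bool(np.all(np.abs(clf.dual_coef_) <= 1.0 + 1e-8)))
```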
Of course one can also approximate unknown functions by linear (multivariate) polynomials of the form (2).
Hard Margin Regression
The model for linear hard margin regression is given by
$$\min_{w, b} \frac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad \big|\langle w, x_i\rangle + b - y_i\big| \le \varepsilon, \quad i = 1, \ldots, m.$$
The constraints ensure that the training data $y_i$ lie within an ε distance from the value $f(x_i) + b$ of the approximating linear polynomial. The Lagrangian reads
where α i ± ≥ 0. Setting partial derivatives to zero leads to
Using these relations and setting
the Lagrangian can be written as
(12)
(13)
and the dual problem becomes in matrix-vector form
This is a quadratic optimization problem with linear constraints. Let be the solution of this problem and . Then, by (12), the optimal weight and the optimal function have in general a sparse representation in terms of the support vectors x i , i ∈ I S , namely,
The corresponding Kuhn-Tucker conditions are
If (α i −)∗ > 0 or (α i +)∗ > 0, then
respectively. Since both conditions cannot be fulfilled for the same index, it follows that and consequently, either or . Thus, one can obtain the intercept by
(14)
(15)
(16)
Soft Margin Regression
Relaxing the constraints in the hard margin model leads to the following linear soft margin regression problem with $C > 0$:
$$\min_{w, b, \xi^\pm} \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m (\xi_i^+ + \xi_i^-) \quad \text{subject to} \quad \langle w, x_i\rangle + b - y_i \le \varepsilon + \xi_i^+, \quad y_i - \langle w, x_i\rangle - b \le \varepsilon + \xi_i^-, \quad \xi_i^\pm \ge 0.$$
For $C = \infty$, this recovers the linear hard margin regression problem. The above constrained model can be rewritten as an unconstrained one by using a distance-based loss function. A loss function $L(y, t)$ is called distance based if there exists a representing function $l: \mathbb{R} \to [0, \infty)$ such that
$$L(y, t) = l(y - t).$$
In soft margin regression, the appropriate choice is Vapnik's ε-insensitive loss function defined by
$$l_\varepsilon(t) := \max\{0,\, |t| - \varepsilon\}.$$
The function $l_\varepsilon$ is depicted in Fig. 4, left. Then the unconstrained primal model reads
$$\min_{w, b} \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m l_\varepsilon\big(y_i - \langle w, x_i\rangle - b\big).$$
Fig. 4
Vapnik’s ε-insensitive loss function for ε = 2 (left). Example of linear SV soft margin regression (right)
The Lagrangian of the constrained problem is given by
where α i ± ≥ 0 and β i ± ≥ 0, i = 1, …, m. Setting the partial derivatives to zero leads to (12), (13) and
Using these relations the Lagrangian can be written exactly as in the hard margin problem, and the dual problem becomes in matrix-vector form
If are the solution of this problem and , then the optimal weight w ∗ and the optimal function are given by (14). The corresponding Kuhn-Tucker conditions are
If 0 < (α i +)∗ or 0 < (α i −)∗, then
respectively. Both equations cannot be fulfilled at the same time so that one can conclude that either or . In case , this results in the intercept
(17)
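The linear soft margin regression model is implemented, e.g., in scikit-learn's SVR; the following sketch fits a noisy, hypothetical one-dimensional data set and shows that only the points lying on or outside the ε-tube become support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 40)
y = 0.7 * x + 1.0 + rng.normal(0, 0.2, x.size)    # noisy line (toy data)

reg = SVR(kernel="linear", C=10.0, epsilon=0.2)   # epsilon-insensitive tube of half-width 0.2
reg.fit(x[:, None], y)

print("slope  w* =", reg.coef_.ravel()[0])
print("offset b* =", reg.intercept_[0])
print("support vectors:", reg.support_.size, "of", x.size)
```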
Instead of the hinge loss function for classification and the ε-insensitive loss function for regression, other loss functions can be used. Popular margin-based and distance-based loss functions are the logistic loss for classification and regression
respectively. In contrast to the loss functions in the previous subsections, logistic loss functions are differentiable in $t$, so that standard methods such as gradient descent or Newton(-like) methods can often be applied for computing the minimizers of the corresponding problems. For details, see, e.g., the books of Mitchell [73] and of Hastie et al. [47]. Other loss functions for regression are the pinball loss, the Huber function, and the $p$-th power absolute distance loss $|y - t|^p$, $p > 0$. For $p = 2$, the latter is the least squares loss
$$L(y, t) = (y - t)^2.$$
Since $(y - t)^2 = (1 - yt)^2$ for $y \in \{-1, 1\}$, the least squares loss is also margin based, and one can handle least squares classification and regression using the same model, with $y \in \{-1, 1\}$ for classification and $y \in \mathbb{R}$ for regression. In the unconstrained form, one has to minimize
$$\frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \big(y_i - \langle w, x_i\rangle - b\big)^2.$$
This model was published as ridge regression by Hoerl and Kennard [49] and is a regularized version of the Gaussian model (1). Therefore, it can be considered as a special case of regularization theory introduced by Tikhonov and Arsenin [102]. The minimizer can be computed via a linear system of equations. To this end, rewrite the unconstrained problem in matrix-vector form
where
Setting the gradient (with respect to w and b) to zero, one obtains
where and . Hence b and w can be obtained by solving the linear system of equations
Instead of the above problem, one generally solves the “centered” problem
(18)
(19)
where $\tilde{x}_i := x_i - \frac{1}{m}\sum_{j=1}^m x_j$, $i = 1, \ldots, m$. The advantage is that the centered samples sum to the zero vector. Thus, (19) with $\tilde{x}_i$ instead of $x_i$ becomes a separable system: one obtains the intercept immediately, and the weight follows by solving the linear system with positive definite, symmetric coefficient matrix
This means that is just the solution of the centered primal problem without intercept. Finally, one can check by the following argument that indeed :
and with (18) on the other hand
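A minimal Python sketch of the centered least squares approach is given below; the objective is assumed to be $\frac{1}{2}\|w\|_2^2 + C\sum_i(\langle w, x_i\rangle + b - y_i)^2$, so the exact scaling of the regularization term relative to the text is an assumption.

```python
import numpy as np

def ridge_fit(X, y, C):
    """Centered regularized least squares (sketch, assumed scaling 1/(2C)).
    X has shape (m, d) with samples as rows; returns the weight w and intercept b."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                               # centered inputs x_i - mean
    d = X.shape[1]
    A = np.eye(d) / (2.0 * C) + Xc.T @ Xc         # positive definite, symmetric
    w = np.linalg.solve(A, Xc.T @ y)              # weight from the centered system
    b = y.mean() - x_mean @ w                     # intercept recovered afterwards
    return w, b

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.05 * rng.normal(size=100)
print(ridge_fit(X, y, C=10.0))                    # close to (1, -2, 0.5) and 0.3
```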
Note that this is not the coefficient matrix in (19). When turning to nonlinear methods in section “Nonlinear Learning,” it will be essential to work with $K = X^{\mathrm{T}}X$ instead of $XX^{\mathrm{T}}$. This can be achieved by switching to the dual setting. In the following, this dual approach is shown, although it often does not make sense in the linear setting, since the size of the matrix $K$ is in general larger than that of the coefficient matrix in (19). First, one reformulates the primal problem into a constrained one:
The Lagrangian reads
and
The equality constraint in the primal problem together with the above equalities leads to the following linear system of equations to determine the optimal α ∗ and b ∗:
The optimal weight and the corresponding optimal function read
In general there is no sparse representation with respect to the vectors $x_i$; see also Theorem 8.36 in the book of Steinwart and Christmann [93]. Therefore, this method is not called a support vector method in this paper. Finally, note that the centered approach also helps to avoid the intercept in the dual approach. Since this is no longer true when turning to the nonlinear setting, the intercept is kept here.
(20)
(21)
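The dual point of view can be sketched as follows: on centered data and without intercept, the coefficient vector is obtained from an $m \times m$ system involving the Gram matrix $K = (\langle x_i, x_j\rangle)_{i,j}$ rather than a $d \times d$ system; the scaling of the regularizer is again an assumption, and the system in (20)/(21), which keeps the intercept, may look slightly different.

```python
import numpy as np

def dual_ls_fit(X, y, C):
    """Dual regularized least squares on centered data, without intercept (sketch)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    K = Xc @ Xc.T                                        # m x m Gram matrix
    alpha = np.linalg.solve(K + np.eye(len(y)) / (2.0 * C), yc)
    w = Xc.T @ alpha                                     # w = sum_i alpha_i (x_i - mean)
    b = y_mean - x_mean @ w                              # intercept added back at the end
    return w, b

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.05 * rng.normal(size=50)
print(dual_ls_fit(X, y, C=10.0))
```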
A linear form of a decision or regression function may not be suitable for the task at hand. Figure 5 shows two examples, where a linear classifier is not appropriate.
Fig. 5
Two sets, where linear classification is not appropriate
A basic idea to handle such problems was proposed by Boser et al. [12] and consists of the following two steps which will be further explained in the rest of this subsection:
1.
Mapping of the input data into a feature space $\ell_2(I)$, where $I$ is a countable (possibly finite) index set, by a nonlinear feature map $\Phi$.
2.
Application of the linear classification/regression model to the feature set $\{(\Phi(x_i), y_i) : i = 1, \ldots, m\}$.
This means that instead of a linear function (2), one now searches for a function of the form
$$f(x) = \langle w, \Phi(x)\rangle + b. \qquad (22)$$
This nonlinear function on the original input space becomes linear in the feature space $\ell_2(I)$.
Figure 6 shows an example of a feature map. In this example, the set $\{(x_i, y_i) : i = 1, \ldots, 5\}$ is not linearly separable, while the mapped set $\{(\Phi(x_i), y_i) : i = 1, \ldots, 5\}$ is linearly separable. In practical applications, in contrast to this example, the feature map often maps into a higher-dimensional, possibly also infinite-dimensional space.
Fig. 6
Linearly nonseparable training data in the original space (left) become separable in the feature space (right)
Together with the so-called kernel trick, which avoids working directly with the feature map $\Phi$, this approach results in the successful support vector machine (SVM).
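The feature map used in Fig. 6 is not reproduced here; the following sketch uses the simple, hypothetical map $\Phi(x) = (x, x^2)$ on one-dimensional toy data, which are not separable on the real line but become linearly separable in the plane.

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])   # toy 1-D inputs
y = np.array([1, 1, -1, -1, -1, 1, 1])                  # inner class vs. outer class

def phi(x):
    """Hypothetical feature map Phi(x) = (x, x^2)."""
    return np.column_stack([x, x ** 2])

clf = SVC(kernel="linear", C=1e6)       # (nearly) hard margin classifier in feature space
clf.fit(phi(x), y)
print("training accuracy:", clf.score(phi(x), y))       # 1.0: separable after the mapping
```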
In general, one avoids working directly with the feature map by dealing with the dual problem and applying the so-called kernel trick. For a feature map $\Phi$, define the kernel associated with $\Phi$ by
$$K(x, t) := \langle \Phi(x), \Phi(t)\rangle. \qquad (23)$$
More precisely, in practice one often follows the opposite way, namely, one starts with a suitable kernel which is known to have a representation of the form (23) without knowing $\Phi$ explicitly.
A frequently applied group of kernels are continuous, symmetric, positive (semi-)definite kernels like the Gaussian
$$K(x, t) = \exp\big(-\|x - t\|_2^2 / c\big), \qquad c > 0.$$
These kernels, which are also called Mercer kernels, will be considered in detail in section “Reproducing Kernel Hilbert Spaces.” By Mercer's theorem, it can be shown that a Mercer kernel possesses a representation
$$K(x, t) = \sum_{j \in I} \lambda_j\, \psi_j(x)\, \psi_j(t)$$
with $L_2$-orthonormal functions $\psi_j$ and positive values $\lambda_j$, where the right-hand side converges uniformly. Note that the existence of such a representation is clear, but in general the functions $\psi_j$ are not known explicitly. The set $\{\sqrt{\lambda_j}\,\psi_j : j \in I\}$ forms an orthonormal basis of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$. These spaces are considered in more detail in section “Reproducing Kernel Hilbert Spaces.” The feature map is then defined by
$$\Phi(x) := \big(\sqrt{\lambda_j}\,\psi_j(x)\big)_{j \in I}.$$
Using this orthonormal basis, one knows that for any $f \in \mathcal{H}_K$ there exists a unique sequence $w = (w_j)_{j \in I} \in \ell_2(I)$ such that
$$f(x) = \sum_{j \in I} w_j \sqrt{\lambda_j}\,\psi_j(x) = \langle w, \Phi(x)\rangle, \qquad (24)$$
where the convergence is uniform. Conversely, every sequence $w \in \ell_2(I)$ defines a function $f_w$ lying in $\mathcal{H}_K$ by (24). Moreover, Parseval's equality says that
$$\|f_w\|_{\mathcal{H}_K} = \|w\|_{\ell_2(I)}. \qquad (25)$$
For nonlinear classification and regression purposes, one can follow exactly the lines of section “Linear Learning,” except that one has to work in the feature space $\ell_2(I)$ instead of the original input space. Using (22) instead of (2) and
$$K := \big(K(x_i, x_j)\big)_{i,j=1}^m \qquad (26)$$
instead of the kernel matrix in (7), the linear models from section “Linear Learning” can be rewritten as in the following subsections. Note again that K is positive semi-definite.
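A sketch of the kernel matrix (26) for the Gaussian kernel is given below; the parameterization $\exp(-\|x - t\|_2^2/c)$ is an assumption about the normalization used in the text.

```python
import numpy as np

def gaussian_kernel_matrix(X, Z, c=8.0):
    """Kernel matrix (K(x_i, z_j))_{i,j} for the (assumed) Gaussian exp(-||x - t||^2 / c)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / c)

X = np.random.default_rng(0).normal(size=(5, 2))
K = gaussian_kernel_matrix(X, X)
print("symmetric:", np.allclose(K, K.T))
print("positive semi-definite:", bool(np.all(np.linalg.eigvalsh(K) >= -1e-10)))
```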
In the following, the linear classifiers are generalized to feature spaces.
Hard Margin Classifier
The hard margin classification model in the feature space is
$$\min_{w, b} \frac{1}{2}\|w\|_{\ell_2(I)}^2 \quad \text{subject to} \quad y_i\big(\langle w, \Phi(x_i)\rangle + b\big) \ge 1, \quad i = 1, \ldots, m.$$
Interestingly, if $\Phi$ is associated with a Mercer kernel $K$, then $f_w$ lies in the RKHS $\mathcal{H}_K$, and the model can be rewritten, using (25), from the point of view of the RKHS as
$$\min_{f \in \mathcal{H}_K,\, b} \frac{1}{2}\|f\|_{\mathcal{H}_K}^2 \quad \text{subject to} \quad y_i\big(f(x_i) + b\big) \ge 1, \quad i = 1, \ldots, m.$$
The dual problem reads
$$\max_{\alpha \ge 0} \; \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad \sum_{i=1}^m \alpha_i y_i = 0,$$
and with $K$ defined by (26), the matrix-vector form of the dual problem looks like that for the linear hard margin classifier.
Let $\alpha^*$ be the minimizer of the dual problem. Then, by (9) together with the feature space modification, the optimal weight and the optimal function become
$$w^* = \sum_{i \in I_S} \alpha_i^* y_i \Phi(x_i), \qquad f^*(x) = \sum_{i \in I_S} \alpha_i^* y_i K(x_i, x) + b^*.$$
Thus, one can compute the optimal function knowing only the kernel and not the feature map itself. One property a Mercer kernel used for learning purposes should have is that it can be evaluated easily at given points; this is, for example, the case for the Gaussian. Finally, using (10) in the feature space, the intercept can be computed by
and the margin by $\gamma = 1/\|w^*\|_{\ell_2(I)}$, where $\|w^*\|_{\ell_2(I)}^2 = \sum_{i, j \in I_S} \alpha_i^* \alpha_j^* y_i y_j K(x_i, x_j)$.
(27)
Soft Margin Classifier
The soft margin classification model in the feature space is
$$\min_{w, b, \xi} \frac{1}{2}\|w\|_{\ell_2(I)}^2 + C\sum_{i=1}^m \xi_i \quad \text{subject to} \quad y_i\big(\langle w, \Phi(x_i)\rangle + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
If $\Phi$ is associated with a Mercer kernel $K$, the corresponding unconstrained version reads in the RKHS $\mathcal{H}_K$ as
$$\min_{f \in \mathcal{H}_K,\, b} \frac{1}{2}\|f\|_{\mathcal{H}_K}^2 + C\sum_{i=1}^m l_h\Big(y_i\big(f(x_i) + b\big)\Big).$$
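For data such as those in Fig. 5, a nonlinear soft margin classifier is obtained by simply exchanging the kernel; the following sketch uses scikit-learn's SVC with a Gaussian kernel on hypothetical ring-shaped classes (scikit-learn writes the Gaussian as $\exp(-\gamma\|x - t\|^2)$, so $\gamma$ corresponds to $1/c$ under the parameterization assumed above).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
r = np.r_[rng.normal(1.0, 0.1, 100), rng.normal(3.0, 0.1, 100)]   # two rings (toy data)
X = np.column_stack([r * np.cos(t), r * np.sin(t)])
y = np.r_[-np.ones(100), np.ones(100)]

clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / 8.0)    # Gaussian kernel soft margin classifier
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors:", clf.support_.size, "of", len(y))
```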
In the following, the linear regression models are generalized to feature spaces.
Hard Margin Regression
One obtains
$$\min_{w, b} \frac{1}{2}\|w\|_{\ell_2(I)}^2 \quad \text{subject to} \quad \big|\langle w, \Phi(x_i)\rangle + b - y_i\big| \le \varepsilon, \quad i = 1, \ldots, m.$$
If $\Phi$ is associated with a Mercer kernel $K$, then $f_w$ lies in the RKHS $\mathcal{H}_K$, and the model can be rewritten, using (25), from the point of view of the RKHS as
$$\min_{f \in \mathcal{H}_K,\, b} \frac{1}{2}\|f\|_{\mathcal{H}_K}^2 \quad \text{subject to} \quad \big|f(x_i) + b - y_i\big| \le \varepsilon, \quad i = 1, \ldots, m.$$
The dual problem reads in matrix-vector form like the dual problem for linear SV hard margin regression, except that one has to use the kernel matrix defined by (26). Let $(\alpha^+)^*, (\alpha^-)^*$ be the solution of this dual problem. Then one can compute the optimal function using (14) with the corresponding feature space modification as
One obtains sparse representations in terms of the support vectors. By (16), the intercept can be computed by
(28)
Soft Margin Regression
In the feature space, the soft margin regression model is
$$\min_{w, b, \xi^\pm} \frac{1}{2}\|w\|_{\ell_2(I)}^2 + C\sum_{i=1}^m (\xi_i^+ + \xi_i^-) \quad \text{subject to} \quad \langle w, \Phi(x_i)\rangle + b - y_i \le \varepsilon + \xi_i^+, \quad y_i - \langle w, \Phi(x_i)\rangle - b \le \varepsilon + \xi_i^-, \quad \xi_i^\pm \ge 0.$$
Given a feature map associated with a Mercer kernel $K$, the corresponding unconstrained problem can be written as the following minimization problem in the RKHS $\mathcal{H}_K$:
$$\min_{f \in \mathcal{H}_K,\, b} \frac{1}{2}\|f\|_{\mathcal{H}_K}^2 + C\sum_{i=1}^m l_\varepsilon\big(y_i - f(x_i) - b\big).$$
The dual problem looks like the dual problem for linear SV soft margin regression but with the kernel matrix (26). From the solution of the dual problem, one can compute the optimal function as in (28) and the optimal intercept using (17) as
Figure 7, left, shows an SVM soft margin regression function for the data in Fig. 2.
Fig. 7
SVM regression using the Gaussian with c = 8 for the data in Fig. 2. SVM soft margin regression curve with C = 0.5 and ε = 0.2 (left). Least squares SVM regression curve with C = 40 (right)
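The data of Figs. 2 and 7 are not reproduced here; the following sketch performs Gaussian-kernel soft margin regression on hypothetical data with the parameter values quoted in the caption ($C = 0.5$, $\varepsilon = 0.2$, and $\gamma = 1/c$ with $c = 8$, under the kernel parameterization assumed earlier).

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 30))
y = np.sin(x) + 0.1 * rng.normal(size=x.size)          # toy regression data

svr = SVR(kernel="rbf", C=0.5, epsilon=0.2, gamma=1.0 / 8.0)
svr.fit(x[:, None], y)
print("support vectors:", svr.support_.size, "of", x.size)
print("max training deviation:", float(np.max(np.abs(svr.predict(x[:, None]) - y))))
```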
SVM regression is related to various tasks in approximation theory. Some of them will be sketched in the following.
Relation to Sparse Approximation in RKHSs
Let $\mathcal{H}_K$ be an RKHS with kernel $K$. Consider the problem of finding, for an unknown function $f$ with given values $f(x_i)$, $i = 1, \ldots, m$, an approximating function of the form
$$\tilde{f} = \sum_{i=1}^m \alpha_i K(x_i, \cdot)$$
with only a few summands. A starting point would be to minimize the $\mathcal{H}_K$-norm of the error and to penalize the so-called $\ell_0$-“norm” of $\alpha$, given by the number of its nonzero components, in order to enforce sparsity. Unfortunately, the complexity of solving such $\ell_0$-penalized problems increases exponentially with $m$. One remedy is to replace the $\ell_0$-norm by the $\ell_1$-norm, i.e., to deal with
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\Big\|f - \sum_{i=1}^m \alpha_i K(x_i, \cdot)\Big\|_{\mathcal{H}_K}^2 + \varepsilon \|\alpha\|_1,$$
where $\varepsilon > 0$. This problem and its relation to support vector regression were considered by Girosi [44] and Evgeniou et al. [37]. Using the relations in RKHSs from section “Reproducing Kernel Hilbert Spaces,” in particular the reproducing property (H2), this problem becomes, up to a constant,
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\,\alpha^{\mathrm{T}} K \alpha - \sum_{i=1}^m \alpha_i f(x_i) + \varepsilon \|\alpha\|_1,$$
where $K$ is defined by (26). With the splitting
$$\alpha_i = \alpha_i^+ - \alpha_i^-, \qquad \alpha_i^+, \alpha_i^- \ge 0,$$
and consequently $|\alpha_i| = \alpha_i^+ + \alpha_i^-$, the sparse approximation model (30) finally takes the form of the dual problem of SVM hard margin regression without intercept:
$$\min_{\alpha^+, \alpha^- \ge 0} \frac{1}{2}(\alpha^+ - \alpha^-)^{\mathrm{T}} K (\alpha^+ - \alpha^-) - \sum_{i=1}^m (\alpha_i^+ - \alpha_i^-) f(x_i) + \varepsilon \sum_{i=1}^m (\alpha_i^+ + \alpha_i^-).$$
(29)
(30)
(31)
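A closely related and easily computable variant of this $\ell_1$-penalized problem replaces the RKHS-norm data fit by a plain least squares fit on the kernel matrix, i.e., a Lasso problem with design matrix $K$; the following sketch (with hypothetical data and an illustrative penalty) shows the resulting sparsity of the coefficient vector $\alpha$.

```python
import numpy as np
from sklearn.linear_model import Lasso

def gaussian_kernel_matrix(X, Z, c=8.0):
    """(Assumed) Gaussian kernel exp(-||x - t||^2 / c)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / c)

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 10, 40))[:, None]            # toy sample points
f_values = np.sin(X[:, 0]) + 0.05 * rng.normal(size=40) # noisy function values f(x_i)

# Lasso: min ||K alpha - f||_2^2 / (2m) + lam * ||alpha||_1  -- an l1-regularized
# quadratic problem of the same flavor as the sparse approximation model above
K = gaussian_kernel_matrix(X, X)
lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=50_000)
lasso.fit(K, f_values)
print("nonzero coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-8)), "of", K.shape[0])
```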