In contrast to classification, regression aims at approximating the “whole” real-valued function from some function values, so that the set of target values $\mathcal{Y} \subseteq \mathbb{R}$ is not countable here. The above examples, as all problems considered in this paper, are from the area of supervised learning. This means that all input vectors come along with their corresponding target function values (labeled data). In contrast, semi-supervised learning makes use of labeled and unlabeled data, and in unsupervised learning labeled data are not available, so that one can only exploit the input vectors $x_i$. The latter methods can be applied, for example, to discover groups of similar exemplars within the data (clustering), to determine the distribution of the data within the input space (density estimation), or to perform projections of data from high-dimensional spaces to lower-dimensional spaces. There are also learning models which involve more complex interactions between the learner and the environment. An example is reinforcement learning, which is concerned with the problem of finding suitable actions in a given situation in order to maximize the reward. In contrast to supervised learning, reinforcement learning does not start from given optimal (labeled) outputs but must instead find them by a process of trial and error. For reinforcement learning, the reader may consult the book of Sutton and Barto [97].
Learning models can also differ in the way in which the training data are generated and presented to the learner. For example, a distinction can be made between batch learning, where all the data are given at the start of the training phase, and online learning, where the learner receives one example at a time and updates the hypothesis function in response to each new example.
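To make the distinction concrete, here is a minimal Python sketch (an illustration, not code from the paper) contrasting a batch least squares fit, which sees all data at once, with an online least-mean-squares update that processes one example at a time; the synthetic data and the step size eta are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                    # input vectors x_i
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)  # noisy linear targets

# Batch learning: all training data are available at the start,
# so the weights come from a single least squares fit.
w_batch, *_ = np.linalg.lstsq(X, y, rcond=None)

# Online learning: examples arrive one at a time and the current
# hypothesis is updated after each of them (least-mean-squares rule).
w_online = np.zeros(3)
eta = 0.05                                                       # step size (illustrative choice)
for x_i, y_i in zip(X, y):
    w_online += eta * (y_i - w_online @ x_i) * x_i

print(w_batch, w_online)   # both approximate the true weights (1, -2, 0.5)
```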
This paper is organized as follows: An overview of the historical background is given in Sect. 2. Section 3 contains an introduction to classical SVM methods. It starts with linear methods for (binary) support vector classification and regression and also considers linear least squares classification/regression. Then the kernel trick is explained and used to transfer the linear models into so-called feature spaces, which results in nonlinear learning methods. Some other models related to SVMs as well as multi-class classification and multitask learning are addressed at the end of the section. Section 4 provides some mathematical background concerning RKHSs and quadratic optimization. The last subsection sketches very briefly some results in statistical learning theory. Numerical methods to make the classical SVMs efficient, in particular for large data sets, are presented in Sect. 5. The paper ends with some conclusions in Sect. 6.
2 Historical Background
Modern learning theory has a long and interesting history going back as far as Gauss and Legendre, but it got its enormous impetus from the advent of computing machines. In the 1930s revolutionary changes took place in understanding the principles of inductive inference from a philosophical perspective, e.g., by Popper, from the point of view of statistical theory, e.g., by Kolmogorov, Glivenko, and Cantelli, and from applied statistics, e.g., by Fisher. A good overview of the leading ideas and developments of this time can be found in the comments and bibliographical remarks of Vapnik’s book [105]. The starting point of statistical learning theory, which considers the task of minimizing a risk functional based on empirical data, dates back to the 1960s. Support vector machines, including their RKHS interpretation, were only discovered in the 1990s and led to an explosion in applications and theoretical analysis.
Let us start with the problem of linear regression, which is much older than linear classification. The method of least squares was first published by Legendre [63]. It was considered as a statistical procedure by Gauss [43], who claimed, to the annoyance of Legendre but in accordance with most historians, to have applied this method since 1795. The original least squares approach finds, for given points $x_i \in \mathbb{R}^d$ and corresponding values $y_i \in \mathbb{R}$, $i = 1, \ldots, m$, a hyperplane $f(x) = \langle w, x \rangle + b$ having minimal least squares distance from the points $(x_i, y_i)$:
$$\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}}\ \sum_{i=1}^{m} \bigl( \langle w, x_i \rangle + b - y_i \bigr)^2.$$
This leads to the solution of a linear system of equations which can be ill-conditioned or possess several solutions. Therefore, regularized versions were introduced later. The linear least squares approach is optimal in the case of linear targets corrupted by Gaussian noise. Sometimes it is useful to find a linear function which does not minimize the least squares error but, for example, the $\ell_1$-error, which is more robust against outliers. This model, with the constraint that the sum of the errors is equal to zero, was already studied by Laplace in 1799; see Laplace [61]. Another popular choice is the $\ell_\infty$-error, which puts more emphasis on outliers. In contrast to the least squares method, the solutions of the $\ell_1$- and $\ell_\infty$-problems cannot be computed via linear systems of equations but require solving linear optimization problems. Figure 2 shows a one-dimensional example, where data are approximated by lines with minimal $\ell_2$-, $\ell_1$-, and $\ell_\infty$-error norm, respectively. For more information on (regularized) least squares problems, the reader may consult, e.g., the books of Golub and Van Loan [45] and of Björck [10].
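As an illustration of this difference, the following Python sketch (assuming the standard formulations of the three fitting problems; not code from the paper) computes an affine fit $f(x) = \langle w, x\rangle + b$ with minimal $\ell_2$-, $\ell_1$-, and $\ell_\infty$-error. The $\ell_2$ fit reduces to a least squares (linear) system, while the $\ell_1$ and $\ell_\infty$ fits are rewritten as linear programs by introducing auxiliary variables that bound the residuals.

```python
import numpy as np
from scipy.optimize import linprog

def fit_l2(X, y):
    """Least squares fit: reduces to a linear system (solved here via lstsq)."""
    A = np.hstack([X, np.ones((len(y), 1))])           # last column models the offset b
    sol, *_ = np.linalg.lstsq(A, y, rcond=None)
    return sol[:-1], sol[-1]

def fit_l1(X, y):
    """l1 fit: min sum_i t_i  s.t.  |<w,x_i> + b - y_i| <= t_i  (a linear program)."""
    m, d = X.shape
    A = np.hstack([X, np.ones((m, 1))])
    c = np.concatenate([np.zeros(d + 1), np.ones(m)])  # variables: (w, b, t_1..t_m)
    A_ub = np.block([[ A, -np.eye(m)],
                     [-A, -np.eye(m)]])                # encodes both signs of the residuals
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * (d + 1) + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d], res.x[d]

def fit_linf(X, y):
    """l_inf fit: min t  s.t.  |<w,x_i> + b - y_i| <= t  (a linear program)."""
    m, d = X.shape
    A = np.hstack([X, np.ones((m, 1))])
    c = np.concatenate([np.zeros(d + 1), [1.0]])       # variables: (w, b, t)
    A_ub = np.block([[ A, -np.ones((m, 1))],
                     [-A, -np.ones((m, 1))]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * (d + 1) + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d], res.x[d]
```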
Regularized least squares methods which penalize the squared weight norm $\|w\|_2^2$, as in section “Linear Least Squares Classification and Regression”, were examined under the name ridge regression by Hoerl and Kennard [49]. This method can be considered as a special case of the regularization theory for ill-posed problems developed by Tikhonov and Arsenin [102]. Loss functions other than the least squares loss, such as the $\varepsilon$-insensitive loss, were brought into play by Vapnik [105]. This loss function enforces a sparse representation of the weights in terms of so-called support vectors, which form a (small) subset of the training samples $\{x_i : i = 1, \ldots, m\}$.
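In common notation (a sketch only; the precise definitions appear in the sections referred to above, with $\lambda > 0$ and $\varepsilon \ge 0$ as free parameters), ridge regression minimizes the least squares error penalized by the squared weight norm,
$$\min_{w,\,b}\ \sum_{i=1}^{m} \bigl(\langle w, x_i\rangle + b - y_i\bigr)^2 + \lambda\,\|w\|_2^2,$$
while the $\varepsilon$-insensitive loss leaves residuals smaller than $\varepsilon$ unpenalized, which is what produces the sparse support vector representation:
$$L_\varepsilon\bigl(y, f(x)\bigr) = \max\bigl\{0,\ |y - f(x)| - \varepsilon\bigr\}.$$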
The simplest form of classification is binary classification, where one just has to separate between two classes. Linear hyperplanes $H(w, b)$ separating points, also called linear discriminants or perceptrons, were already studied by Fisher [41] and became interesting for neural network researchers in the early 1960s. One of the first algorithms that constructs a separating hyperplane for linearly separable points was Rosenblatt’s perceptron; see Rosenblatt [84]. It is an iterative, online, and mistake-driven algorithm which starts with an initial weight guess $w$ for the hyperplane and adapts the weight each time a training point is misclassified by the current weight. If the data are linearly separable, the procedure converges, and the number of mistakes (number of updates of $w$) does not exceed $(2R/\gamma)^2$, where $R := \max_{i=1,\ldots,m} \|x_i\|$ and $\gamma$ is the smallest distance between a training point and the hyperplane. For linearly separable points, there exist various hyperplanes separating them.
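The update rule can be sketched in a few lines of Python (an illustration of the classical mistake-driven scheme for labels $y_i \in \{-1, +1\}$, not code from the paper; the cap on the number of passes over the data is an added safeguard):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt-style perceptron for linearly separable data.

    X : (m, d) array of input vectors, y : (m,) array of labels in {-1, +1}.
    Returns the weight vector w and offset b of a separating hyperplane.
    """
    m, d = X.shape
    w, b = np.zeros(d), 0.0                      # initial weight guess
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):               # examples are presented one at a time
            if y_i * (w @ x_i + b) <= 0:         # current weight misclassifies x_i
                w += y_i * x_i                   # update only on mistakes
                b += y_i
                mistakes += 1
        if mistakes == 0:                        # converged: all points separated
            break
    return w, b
```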
An optimal hyperplane for linearly separable points, in the sense that the minimal distance of the points from the plane becomes maximal, was constructed as the so-called generalized portrait algorithm by Vapnik and Lerner [108]. This learning method is also known as the linear hard margin support vector classifier. The method was generalized to nonseparable points by Cortes and Vapnik [26], which leads to soft margin classifiers. Finally, the step from linear to nonlinear classifiers via feature maps was taken by Boser et al. [12]. Their idea to combine a linear algorithm with a kernel approach inspired the further examination of specific kernels for applications.
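For orientation (a sketch in common notation; the precise derivation follows in Sect. 3), the hard margin classifier for labels $y_i \in \{-1, +1\}$ maximizes the margin by solving
$$\min_{w,\,b}\ \tfrac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1, \quad i = 1, \ldots, m,$$
while the soft margin classifier admits violations via slack variables $\xi_i \ge 0$, weighted by a regularization parameter $C > 0$:
$$\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1 - \xi_i,\ \ \xi_i \ge 0.$$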
However, the theory of kernels and their applications is older than SVMs. Aronszajn [5] systematically developed the theory of RKHSs in the 1940s, though it was later found that many results had been obtained independently by Povzner [83]. The work of Parzen [78] brought the RKHS to the fore in statistical problems; see also Kailath [54]. Kernels in pattern recognition were already applied by Aizerman et al. [1]. Empirical risk minimization over RKHSs was considered by Wahba [112] in connection with splines and by Poggio and Girosi [81] in relation to neural networks. Schölkopf et al. [89] realized that the kernel trick works not only for SVMs but also for many other methods, such as principal component analysis in unsupervised learning.
The invention of SVMs has led to a gigantic amount of developments in learning theory and practice; the size of this paper would not suffice to list all references on this topic. Beyond various applications, advanced generalization results, suitable choices of kernels, efficient numerical methods (in particular for large data sets), relations to other sparse representation methods, multi-class classification, and multitask learning have also been addressed. The reader will find some references in the corresponding sections.