with labels $y_i$. If the number of possible labels is finite and small, then we are interested in classification, i.e. in finding the label to assign to a new object based on the given examples. Otherwise, if the labels are values in a vector space, we are interested in regression, i.e. in extrapolating previously observed values to any new object. Thus classification and regression tasks can be embedded in a similar framework, one aiming to predict discrete labels and the other continuous values.
The objects $x_i$ are often named patterns, or cases, inputs, instances, or observations. The $y_i$ are called labels, or targets, outputs, or sometimes also observations. The set of all correspondences given as examples is called the training set, whereas the test set is the set of new objects whose labels we would like to guess by extracting knowledge from the examples in the training set.
In both cases, classification or regression, we aim to generalize the correspondences $(x_i, y_i)$ to a function f defined on the set $\mathcal{X}$ of all possible objects and with values in the set $\mathcal{Y}$ of all possible labels. The label predicted for a new test object x is then f(x). Here we make no particular assumption on the spaces $\mathcal{X}$ and $\mathcal{Y}$, except that $\mathcal{Y}$ should be a vector space if we are interested in regression (in order to extrapolate continuously between any two values). But we have a strong intuitive assumption on f: it should generalize the given examples as well as possible, i.e. if x is close to an already observed input $x_i$, its output f(x) should be close to the already observed output $y_i$. The whole difficulty consists in defining precisely what we mean by “close” in the spaces $\mathcal{X}$ and $\mathcal{Y}$. More precisely, we need to quantify the similarity of inputs in $\mathcal{X}$ and the cost of assigning wrong outputs in $\mathcal{Y}$.
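To make this closeness intuition concrete, here is a minimal Python sketch of the simplest such predictor, a nearest-neighbour rule; the Euclidean distance and the names used (predict_nearest, etc.) are illustrative assumptions, not the methods developed in this chapter.

```python
import numpy as np

def predict_nearest(x_new, X_train, y_train):
    """Assign to x_new the label of its closest training input.

    Closeness is measured here with the Euclidean distance; any other
    similarity measure on the input space could be substituted.
    """
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return y_train[np.argmin(distances)]

# Toy usage: two 2-D training points with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([0, 1])
print(predict_nearest(np.array([0.9, 0.8]), X_train, y_train))  # -> 1
```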
Loss function
Generally, expressing a distance or similarity measure in $\mathcal{Y}$ is easy. In the case of regression, the Euclidean distance in $\mathcal{Y}$ is often a simple, convenient choice. However we can consider functions other than distances, provided they express the cost of assigning a wrong label. We call loss function the sum of the costs (or losses) of all mistakes made when we consider a particular candidate solution f and apply it to all known examples. For instance we can choose the quadratic loss

$$L(f) = \sum_{i} \left\| f(x_i) - y_i \right\|^2 .$$
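As a sketch of how such a loss is evaluated in practice, the snippet below computes the quadratic loss of an arbitrary candidate f over the training set; the helper names and the toy data are illustrative.

```python
import numpy as np

def quadratic_loss(f, X_train, y_train):
    """Sum of the squared errors of a candidate solution f over the training set."""
    return sum(np.sum((f(x) - y) ** 2) for x, y in zip(X_train, y_train))

# Example with a trivial candidate that always predicts the mean training label.
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.1, 0.9, 2.2])
f_mean = lambda x: y_train.mean()
print(quadratic_loss(f_mean, X_train, y_train))
```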
Duality between features and similarity measures
On the other hand, expressing a similarity measure in $\mathcal{X}$ is much more difficult and lies at the core of machine learning. Either the space $\mathcal{X}$ has been carefully chosen so that the representations of the observed objects $x_i$ are meaningful, in the sense that their “natural” distance in $\mathcal{X}$ (say the Euclidean distance if $\mathcal{X}$ is a vector space) is meaningful, in which case learning will be easy; or $\mathcal{X}$ is non-trivial and we need to choose a set of N sensible features (seen as a function Φ from $\mathcal{X}$ to $\mathbb{R}^N$), so that if we compute these features $\Phi(x_i)$ for each $x_i$, we can consider a more natural distance in the feature space $\mathbb{R}^N$. From a certain point of view, choosing a sensible feature map Φ and choosing a sensible distance in $\mathcal{X}$ (or in the feature space) are equivalent problems, and hence equally hard in the general case.
Optimization problem over functions
The problem of classification or regression can be written as an optimization problem over all possible functions f: find the best function f from $\mathcal{X}$ to $\mathcal{Y}$, i.e. the one that minimizes

$$L(f) + R(f),$$

where R(f) is a regularizer constraining f to be smooth in some way with respect to the similarity measure chosen in $\mathcal{X}$. Note that we could also have restricted f to be a member of a small function space. There are very nice theoretical results concerning this function space in the kernel case (see for example Sect. 2.3 about ridge regression).
2.2 Kernels
This section aims to define kernels and to explain all facets of the concept. It is a preliminary step to the following sections dedicated to kernel algorithms themselves.
A kernel is any symmetric similarity measure on $\mathcal{X}$:

$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x'),$$

that is, a symmetric function that, given two inputs x and x′, returns a real number characterizing their similarity (cf. [1, 3, 4, 7, 10]).
Kernels as inner products in the feature space
In the general case, either $\mathcal{X}$ is not a vector space, or the natural Euclidean inner product in $\mathcal{X}$ is not particularly relevant as a similarity measure. Most often, a set of possibly-meaningful features is available, and we can consequently use the feature map

$$\Phi : \mathcal{X} \to \mathcal{F}, \qquad x \mapsto \mathbf{x} := \Phi(x).$$

Φ will typically be a nonlinear map with values in a vector space $\mathcal{F}$. It could for example compute products of components of the input x. We have used bold face $\mathbf{x}$ to denote the vectorial representation of x in the feature space $\mathcal{F}$. We will follow this convention throughout the chapter.
We can use the non-linear embedding of the data into the linear space $\mathcal{F}$ via Φ to define a similarity measure from the dot product in $\mathcal{F}$:

$$k(x, x') := \langle \mathbf{x}, \mathbf{x}' \rangle = \langle \Phi(x), \Phi(x') \rangle . \qquad (1)$$

The freedom to choose the mapping Φ will enable us to design a large variety of similarity measures and learning algorithms. The transformation of $x_i$ into $\mathbf{x}_i = \Phi(x_i)$ can be seen as a change of representation of the inputs, i.e. as a new model of the initial problem. However, we will see later that, in some cases, we will not need to perform this transformation explicitly, which is very convenient when the number of features considered (or the dimension of $\mathcal{F}$) is high.
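To illustrate Eq. (1), the following sketch uses an explicit degree-2 feature map on $\mathbb{R}^2$ (one possible choice, assumed here purely for illustration) and checks that its inner product coincides with the degree-2 polynomial kernel.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map on R^2: one of several maps realizing k."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k_poly2(x, xp):
    """Degree-2 polynomial kernel k(x, x') = <x, x'>^2."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)))  # 1.0 = (1*3 + 2*(-1))^2
print(k_poly2(x, xp))           # same value, computed without phi
```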
Geometrical interpretation and kernel trick
Through the definition of k, we can provide a geometric interpretation of the input data:

$$\sqrt{k(x, x)} = \sqrt{\langle \Phi(x), \Phi(x) \rangle} = \| \mathbf{x} \|$$

is the length (or norm) of x in the feature space. Similarly, k(x, x′) computes the cosine of the angle between the vectors $\mathbf{x}$ and $\mathbf{x}'$, provided they are normalized to length 1. Likewise, the distance between two vectors is computed as the length of the difference vector:

$$\| \mathbf{x} - \mathbf{x}' \|^2 = k(x, x) - 2\, k(x, x') + k(x', x') .$$
The interesting point is that we could consider any such similarity measure k and forget about the associated Φ: we would still be able to compute lengths, distances and angles with the knowledge of k alone, thanks to these formulas. This framework allows us to deal with the patterns geometrically through an implicit non-linear embedding, and thus lets us study learning algorithms using linear algebra and analytic geometry. This is known as the kernel trick: any algorithm dedicated to Euclidean geometry that involves only distances, lengths and angles can be kernelized by replacing all occurrences of these geometric quantities with their expressions as functions of k. The next section is dedicated to such kernelizations.
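A minimal sketch of these formulas: norms, distances and cosines in the feature space computed from k alone, with a Gaussian kernel chosen purely as an example.

```python
import numpy as np

def k_gauss(x, xp, sigma=1.0):
    """Gaussian kernel, used here only as an example of a kernel k."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))

def feature_distance(x, xp, k):
    """||Phi(x) - Phi(x')||, computed from k alone."""
    return np.sqrt(k(x, x) - 2 * k(x, xp) + k(xp, xp))

def feature_cosine(x, xp, k):
    """Cosine of the angle between Phi(x) and Phi(x'), from k alone."""
    return k(x, xp) / np.sqrt(k(x, x) * k(xp, xp))

x, xp = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(feature_distance(x, xp, k_gauss), feature_cosine(x, xp, k_gauss))
```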
Examples of kernels
Let us introduce the most commonly used kernels, namely the polynomial kernel

$$k(x, x') = \langle x, x' \rangle^{d}$$

and the Gaussian kernel

$$k(x, x') = \exp\left( - \frac{\| x - x' \|^2}{2 \sigma^2} \right)$$

for suitable choices of d and σ. Let us focus on the Gaussian case: the similarity measure k(x, x′) between x and x′ is always positive, and is maximal when x = x′. All points have the same unit norm (since $k(x, x) = 1$), and consequently the images of all points x in the associated feature space lie on the unit sphere.
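A short sketch of these two kernels; in particular it checks the unit-norm property k(x, x) = 1 of the Gaussian case (the parameter values d and σ below are arbitrary).

```python
import numpy as np

def k_poly(x, xp, d=3):
    """Polynomial kernel of degree d."""
    return np.dot(x, xp) ** d

def k_gauss(x, xp, sigma=0.5):
    """Gaussian kernel with standard deviation sigma."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))

x = np.array([0.3, -1.2])
# For the Gaussian kernel every point has unit norm in the feature space:
print(k_gauss(x, x))   # 1.0, i.e. ||Phi(x)|| = 1
print(k_poly(x, x))    # generally != 1 for the polynomial kernel
```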
Reproducing kernels as feature maps
One could wonder what feature map Φ was used to build the Gaussian kernel. In fact kernel theory goes far beyond the way we introduced kernels. Let us consider any symmetric function k, not necessarily related to a feature map. Let us suppose also that k, seen as an operator, is positive definite, that is to say that for any non-zero $L^2$ function f:

$$\iint_{\mathcal{X} \times \mathcal{X}} f(x)\, k(x, x')\, f(x')\, d\mu(x)\, d\mu(x') > 0 .$$

To define square-integrable functions, we need a measure $\mu$ on $\mathcal{X}$. This measure is often thought of as a probability measure over $\mathcal{X}$, giving more weight to objects that are more likely to appear.
Then we can define from this kernel k an associated feature map Φ by:

$$\Phi : x \mapsto k(x, \cdot\,) . \qquad (2)$$

The image of any input x by Φ is the function

$$k(x, \cdot\,) : x' \mapsto k(x, x') . \qquad (3)$$

Φ now has values in the space of functions over $\mathcal{X}$, instead of having values in a finite-dimensional vector space like $\mathbb{R}^N$.
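As a sketch of Eqs. (2) and (3), Φ(x) can be represented programmatically as the function k(x, ·), i.e. a closure that still waits for its second argument; the Gaussian kernel is used here only as an example.

```python
import numpy as np

def k_gauss(x, xp, sigma=1.0):
    """Gaussian kernel."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(xp)) ** 2) / (2 * sigma**2))

def phi(x):
    """Feature map Phi: x -> k(x, .), i.e. Phi(x) is itself a function."""
    return lambda xp: k_gauss(x, xp)

f = phi(np.array([0.0, 0.0]))    # f = k(x, .) lives in a space of functions
print(f(np.array([1.0, 0.0])))   # evaluating the image of x at another point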
The magic comes from the Moore-Aronszajn theorem [2], which states that it is always possible, for any symmetric positive definite function k, to build a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ of functions over $\mathcal{X}$, with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, so that

$$k(x, x') = \langle k(x, \cdot\,),\, k(x', \cdot\,) \rangle_{\mathcal{H}} = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}} . \qquad (4)$$
Because of such a property, symmetric positive definite kernels are also called reproducing kernels. This theorem highlights the duality between reproducing kernels k and feature maps Φ: choosing the feature space or choosing the kernel is equivalent, since one determines the other.
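One consequence of positive definiteness that is easy to check numerically: the Gram matrix of a reproducing kernel over any finite sample has non-negative eigenvalues. The sketch below performs this check for the Gaussian kernel on random points (the sample size and bandwidth are arbitrary).

```python
import numpy as np

def gram_matrix(X, k):
    """Gram matrix K[i, j] = k(x_i, x_j) over a finite sample X."""
    n = len(X)
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

k_gauss = lambda x, xp: np.exp(-np.sum((x - xp) ** 2) / 2.0)

X = np.random.default_rng(0).normal(size=(5, 3))   # 5 random points in R^3
eigvals = np.linalg.eigvalsh(gram_matrix(X, k_gauss))
print(np.all(eigvals >= -1e-12))   # True: the Gram matrix is positive semidefinite
```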
The Gaussian case (details)
We can make explicit the inner product on $\mathcal{H}$ in the Gaussian case. The associated norm is, up to a constant factor,

$$\| f \|_{\mathcal{H}}^2 \;\propto\; \sum_{n \geq 0} \frac{\sigma^{2n}}{2^n\, n!} \int \left\| D^n f(x) \right\|^2 dx ,$$

which penalizes all fast variations of f at all derivative orders. We refer to [6] for a more general mathematical study of radial basis functions. Intuitively, consider the operator $A = e^{-\frac{\sigma^2}{2} \Delta}$. In the Fourier domain, it acts as the multiplication by $e^{\frac{\sigma^2}{2} \|\omega\|^2}$, whereas $k(x, \cdot\,)$ becomes, up to a normalization constant, $e^{-\frac{\sigma^2}{2} \|\omega\|^2} e^{-i \langle \omega, x \rangle}$. Thus $A\, k(x, \cdot\,)$ is, up to that constant, a Dirac peak $\delta_x$ in the space of distributions. The inner product on $\mathcal{H}$ will therefore satisfy:

$$\langle f, g \rangle_{\mathcal{H}} \;\propto\; \int f(x)\, (A g)(x)\, dx ,$$

hence, for the particular case $g = k(x, \cdot\,)$,

$$\langle f, k(x, \cdot\,) \rangle_{\mathcal{H}} \;\propto\; \int f\, \delta_x = f(x) ,$$

which is, up to the same normalization, the reproducing property underlying (4).
The overfitting problem
The kernel k should be chosen carefully, since it is the core of the generalization process: if the neighborhood induced by k is too small (for instance if k is a Gaussian with a tiny standard deviation σ), then we will overfit the given examples without being able to generalize to new points (which would be found very dissimilar to all examples). Conversely, if the neighborhood is too large (for instance if k is a Gaussian with a standard deviation so huge that all examples are considered very similar), then it is not possible to distinguish any clusters or classes.
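A sketch of the two extremes for the Gaussian bandwidth: with a tiny σ the Gram matrix of a sample is close to the identity (every point is similar only to itself, the overfitting regime), while with a huge σ it is close to the all-ones matrix (all points look alike); the sample and the values of σ below are arbitrary.

```python
import numpy as np

def gram(X, sigma):
    """Gaussian Gram matrix on the sample X for a given bandwidth sigma."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.random.default_rng(1).normal(size=(4, 2))
print(np.round(gram(X, sigma=0.01), 2))   # ~ identity: overfitting regime
print(np.round(gram(X, sigma=100.0), 2))  # ~ all ones: everything looks similar
```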