Sensing

from only 30 samples (indicated by the red dots in Fig. 1b). From a first look at the time-domain signal, one would rather believe that reconstruction from only 30 samples is impossible. Indeed, the spectrum reconstructed by traditional $$\ell_2$$-minimization is very different from the true spectrum. Quite surprisingly, $$\ell_1$$-minimization nevertheless performs an exact reconstruction, that is, with no recovery error at all!




Fig. 1
(a) 10-sparse Fourier spectrum, (b) time-domain signal of length 300 with 30 samples, (c) reconstruction via $$\ell_2$$-minimization, (d) exact reconstruction via $$\ell_1$$-minimization
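
The experiment behind Fig. 1 is easy to repeat numerically. The following is a minimal sketch, not the code used to produce the figure; it assumes NumPy and the convex-optimization package cvxpy are available, draws a random 10-sparse spectrum, and compares the minimum-$$\ell_2$$-norm solution with basis pursuit.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, m, k = 300, 30, 10

# k-sparse Fourier spectrum with random support and random complex amplitudes
x = np.zeros(N, dtype=complex)
x[rng.choice(N, size=k, replace=False)] = (rng.standard_normal(k)
                                           + 1j * rng.standard_normal(k))

# time-domain signal = unitary inverse DFT of x; keep only m random time samples
Finv = np.fft.ifft(np.eye(N), norm="ortho")   # unitary inverse DFT matrix
samples = rng.choice(N, size=m, replace=False)
A = Finv[samples, :]                          # m x N measurement matrix
y = A @ x

# l2-minimization: minimum-norm solution A^*(AA^*)^{-1} y
x_l2 = A.conj().T @ np.linalg.solve(A @ A.conj().T, y)

# l1-minimization (basis pursuit)
z = cp.Variable(N, complex=True)
cp.Problem(cp.Minimize(cp.norm1(z)), [A @ z == y]).solve()
x_l1 = z.value

print("l2 reconstruction error:", np.linalg.norm(x_l2 - x))   # large
print("l1 reconstruction error:", np.linalg.norm(x_l1 - x))   # tiny when recovery succeeds
```

With these proportions recovery is not guaranteed for every draw, so the $$\ell_1$$ error may occasionally be nonzero for an unlucky support and sign pattern; Fig. 1 shows an instance where it is exactly zero.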


An example from nuclear magnetic resonance imaging serves as a second illustration. Here, the device scans a patient by taking 2D or 3D frequency measurements within a radial geometry. Figure 2a describes such a sampling set of a 2D Fourier transform. Since a lengthy scanning procedure is very uncomfortable for the patient, it is desirable to take only a minimal number of measurements. Total variation minimization, which is closely related to $$\ell_1$$-minimization, is then considered as the recovery method. For comparison, Fig. 2b shows the recovery by a traditional back-projection algorithm. Figure 2c, d displays iterations of an algorithm, which was proposed and analyzed in [72] to perform efficient large-scale total variation minimization. The reconstruction in Fig. 2d is again exact!



Fig. 2
(a) Sampling data of the NMR image in the Fourier domain which corresponds to only 0.11 % of all samples. (b) Reconstruction by back projection. (c) Intermediate iteration of an efficient algorithm for large-scale total variation minimization. (d) The final reconstruction is exact



2 Background


Although the term compressed sensing (compressive sensing) was coined only recently with the paper by Donoho [47], followed by a huge research activity, such a development did not start out of thin air. There were certain roots and predecessors in application areas such as image processing, geophysics, medical imaging, computer science, as well as in pure mathematics. An attempt is made below to put such roots and current developments into context, although only a partial overview can be given due to the numerous and diverse connections.


Early Developments in Applications


Presumably the first algorithm which can be connected to sparse recovery is due to the French mathematician de Prony [127]. The so-called Prony method, which has found numerous applications [109], estimates nonzero amplitudes and corresponding frequencies of a sparse trigonometric polynomial from a small number of equispaced samples by solving an eigenvalue problem. The use of $$\ell_1$$-minimization appears already in the Ph.D. thesis of B. Logan [106] in connection with sparse frequency estimation, where he observed that $$L_1$$-minimization may recover exactly a frequency-sparse signal from undersampled data provided the sparsity is small enough. The paper by Donoho and Logan [52] is perhaps the earliest theoretical work on sparse recovery using $$L_1$$-minimization. Nevertheless, geophysicists observed in the late 1970s and 1980s that $$\ell_1$$-minimization can be successfully employed in reflection seismology, where a sparse reflection function indicating changes between subsurface layers is sought [140, 148]. In NMR spectroscopy the idea to recover sparse Fourier spectra from undersampled non-equispaced samples was first introduced in the 1990s [158] and has seen a significant development since then.

In image processing the use of total variation minimization, which is closely connected to $$\ell_1$$-minimization and compressive sensing, first appears in the 1990s in the work of Rudin, Osher, and Fatemi [139] and was widely applied later on. In statistics, where the corresponding area is usually called model selection, the use of $$\ell_1$$-minimization and related methods was greatly popularized with the work of Tibshirani [149] on the so-called LASSO (Least Absolute Shrinkage and Selection Operator).


Sparse Approximation



Many lossy compression techniques such as JPEG, JPEG-2000, MPEG, or MP3 rely on the empirical observation that audio signals and digital images have a sparse representation in terms of a suitable basis. Roughly speaking, one compresses the signal by simply keeping only the largest coefficients. In certain scenarios such as audio signal processing, one considers the generalized situation where sparsity appears in terms of a redundant system – a so-called dictionary or frame [36] – rather than a basis. The problem of finding the sparsest representation/approximation in terms of the given dictionary turns out to be significantly harder than in the case of sparsity with respect to a basis, where the expansion coefficients are unique. Indeed, in [108, 114], it was shown that the general $$\ell_0$$-problem of finding the sparsest solution of an underdetermined system is NP-hard. Greedy strategies such as matching pursuit algorithms [108] and FOCUSS [86], as well as $$\ell_1$$-minimization [35], were subsequently introduced as tractable alternatives. The theoretical understanding of the conditions under which greedy methods and $$\ell_1$$-minimization recover the sparsest solutions began to develop with the work in [50, 51, 62, 78, 81, 87, 151, 152].


Information-Based Complexity and Gelfand Widths


Information-based complexity (IBC) considers the general question of how well a function f belonging to a certain class $$\mathcal{F}$$ can be recovered from n sample values or, more generally, from the evaluation of n linear or nonlinear functionals applied to f [150]. The optimal recovery error, defined as the maximal reconstruction error for the “best” sampling method and “best” recovery method (within a specified class of methods) over all functions in the class $$\mathcal{F}$$, is closely related to the so-called Gelfand width of $$\mathcal{F}$$ [38, 47, 117]. Of particular interest for compressive sensing is $$\mathcal{F} = B_{1}^{N}$$, the $$\ell_1$$-ball in $$\mathbb{R}^{N}$$, since its elements can be well approximated by sparse ones. A famous result due to Kashin [96] and Gluskin and Garnaev [79, 84] sharply bounds the Gelfand widths of $$B_1^N$$ (as well as their duals, the Kolmogorov widths) from above and below; see also [77]. While the original interest of Kashin was in the estimate of n-widths of Sobolev classes, these results give precise performance bounds in compressive sensing on how well any method may recover (approximately) sparse vectors from linear measurements [38, 47]. The upper bounds on Gelfand widths were derived in [96] and [79] using (Bernoulli and Gaussian) random matrices (see also [107]), and in fact such types of matrices have become very useful also in compressive sensing [26, 47].


Compressive Sensing


The numerous developments in compressive sensing began with the seminal work [30, 47]. Although key ingredients were already in the air at that time, as mentioned above, the major contribution of these papers was to realize that one can combine the power of $$\ell_1$$-minimization and random matrices in order to show optimal results on the ability of $$\ell_1$$-minimization to recover (approximately) sparse vectors. Moreover, the authors made very clear that such ideas have strong potential for numerous application areas. In their work [26, 30] Candès, Romberg, and Tao introduced the restricted isometry property (which they initially called the uniform uncertainty principle), which is a key property of compressive sensing matrices. It was shown that Gaussian, Bernoulli, and partial random Fourier matrices [26, 129, 138] possess this important property. These results require many tools from probability theory and finite-dimensional Banach space geometry, which have been developed for a rather long time now; see, e.g., [95, 103].

Donoho [49] developed a different path and approached the problem of characterizing sparse recovery by $$\ell_1$$-minimization via polytope geometry, more precisely, via the notion of k-neighborliness. In several papers sharp phase transition curves were shown for Gaussian random matrices separating regions where recovery fails or succeeds with high probability [49, 53, 54]. These results build on previous work in pure mathematics by Affentranger and Schneider [2] on randomly projected polytopes.


Developments in Computer Science


In computer science the related area is usually addressed as heavy hitters detection or sketching. Here one is interested not only in recovering signals (such as huge data streams on the Internet) from vastly undersampled data, but one also requires the recovery algorithm to run in time sublinear in the signal length N. This is not impossible, since one only has to report the locations and values of the nonzero (most significant) coefficients of the sparse vector. Quite remarkably, sublinear algorithms are available for sparse Fourier recovery [80]. Such algorithms use ideas from group testing which date back to World War II, when Dorfman [56] invented an efficient method for detecting draftees with syphilis.

In sketching algorithms from computer science, one actually designs the matrix and the fast algorithm simultaneously [42, 82]. More recently, bipartite expander graphs have been successfully used in order to construct good compressed sensing matrices together with associated fast reconstruction algorithms [11].


3 Mathematical Modelling and Analysis


This section introduces the concept of sparsity and the recovery of sparse vectors from incomplete linear and nonadaptive measurements. In particular, an analysis of $$\ell_1$$-minimization as a recovery method is provided. The null space property and the restricted isometry property are introduced, and it is shown that they ensure robust sparse recovery. It is actually difficult to establish these properties for deterministic matrices with the optimal number m of measurements; the major breakthrough in compressive sensing results is obtained for random matrices. Examples of several types of random matrices which ensure sparse recovery are given, such as Gaussian, Bernoulli, and partial random Fourier matrices.


Preliminaries and Notation


This exposition mostly treats complex vectors in $$\mathbb{C}^{N}$$, although sometimes the considerations will be restricted to the real case $$\mathbb{R}^{N}$$. The $$\ell_p$$-norm of a vector $$x \in \mathbb{C}^{N}$$ is defined as


$$\displaystyle\begin{array}{rcl} \|x\|_{p}&:=& \left (\sum _{j=1}^{N}\vert x_{ j}\vert ^{p}\right )^{1/p},\quad 0 < p < \infty, \\ \|x\|_{\infty }&:=& \max _{j=1,\mathop{\ldots },N}\vert x_{j}\vert. {}\end{array}$$

(1)
For $$1 \leq p \leq \infty$$, it is indeed a norm, while for 0 < p < 1, it is only a quasi-norm. When emphasizing the norm, the term $$\ell_p^N$$ is used instead of $$\mathbb{C}^{N}$$ or $$\mathbb{R}^{N}$$. The unit ball in $$\ell_p^N$$ is $$B_{p}^{N} =\{ x \in \mathbb{C}^{N},\|x\|_{p} \leq 1\}$$. The operator norm of a matrix $$A \in \mathbb{C}^{m\times N}$$ from $$\ell_p^N$$ to $$\ell_p^m$$ is denoted


$$\displaystyle\begin{array}{rcl} \|A\|_{p\rightarrow p} =\max _{\|x\|_{p}=1}\|Ax\|_{p}.& &{}\end{array}$$

(2)
In the important special case p = 2, the operator norm is the maximal singular value $$\sigma_{\max}(A)$$ of A.

For a subset $$T \subset \{ 1,\mathop{\ldots },N\}$$, one denotes by $$x_{T} \in \mathbb{C}^{N}$$ the vector which coincides with $$x \in \mathbb{C}^{N}$$ on the entries in T and is zero outside T. Similarly, $$A_T$$ denotes the column submatrix of A corresponding to the columns indexed by T. Further, $$T^{c} =\{ 1,\mathop{\ldots },N\}\setminus T$$ denotes the complement of T, and $$\#T$$ or $$\vert T\vert$$ indicates the cardinality of T. The kernel of a matrix A is denoted by $$\ker A =\{ x,Ax = 0\}$$.


Sparsity and Compression



Compressive sensing is based on the empirical observation that many types of real-world signals and images have a sparse expansion in terms of a suitable basis or frame, for instance, a wavelet expansion. This means that the expansion has only a small number of significant terms, or, in other words, that the coefficient vector can be well approximated with one having only a small number of nonvanishing entries.

The support of a vector x is denoted $$\mathop{{\mathrm{supp}}}\nolimits (x) =\{ j:\, x_{j}\neq 0\}$$ and


$$\displaystyle{\|x\|_{0}:= \vert \mathop{{\mathrm{supp}}}\nolimits (x)\vert.}$$
It has become common to call $$\|\cdot \|_{0}$$ the $$\ell_0$$-norm, although it is not even a quasi-norm. A vector x is called k-sparse if $$\|x\|_{0} \leq k$$. For $$k \in \{ 1,2,\mathop{\ldots },N\}$$,


$$\displaystyle{\Sigma _{k}:=\{ x \in \mathbb{C}^{N}:\| x\|_{ 0} \leq k\}}$$
denotes the set of k-sparse vectors. Furthermore, the best k-term approximation error of a vector $$x \in \mathbb{C}^{N}$$ in $$\ell_p$$ is defined as


$$\displaystyle{\sigma _{k}(x)_{p} =\inf _{z\in \Sigma _{k}}\|x - z\|_{p}.}$$
If $$\sigma_k(x)_p$$ decays quickly in k, then x is called compressible. Indeed, in order to compress x, one may simply store only the k largest entries. When reconstructing x from its compressed version, the non-stored entries are simply set to zero, and the reconstruction error is $$\sigma_k(x)_p$$. It is emphasized at this point that the procedure of obtaining the compressed version of x is adaptive and nonlinear since it requires the search of the largest entries of x in absolute value. In particular, the location of the nonzeros is a nonlinear type of information.

The best k-term approximation of x can be obtained using the nonincreasing rearrangement $$r(x) = (\vert x_{i_{1}}\vert,\ldots,\vert x_{i_{N}}\vert )^{T}$$, where $$(i_j)$$ denotes a permutation of the indices such that $$\vert x_{i_{j}}\vert \geq \vert x_{i_{j+1}}\vert $$ for $$j = 1,\ldots,N - 1$$. Then it is straightforward to check that


$$\displaystyle{\sigma _{k}(x)_{p}:= \left (\sum _{j=k+1}^{N}r_{ j}(x)^{p}\right )^{1/p},\quad 0 < p < \infty.}$$
The vector $$x_{[k]}$$ derived from x by setting to zero the $$N-k$$ smallest entries in absolute value is the best k-term approximation,


$$\displaystyle{x_{[k]} =\arg \min _{z\in \Sigma _{k}}\|x - z\|_{p},}$$
for any $$0 < p \leq \infty$$.
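
For concreteness, here is a minimal NumPy sketch of the best k-term approximation and the error $$\sigma_k(x)_p$$; the function names are ours and chosen only for this illustration.

```python
import numpy as np

def best_k_term(x, k):
    """Keep the k largest-magnitude entries of x, set the rest to zero."""
    x_k = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[::-1][:k]   # indices of the k largest |x_j|
    x_k[idx] = x[idx]
    return x_k

def sigma_k(x, k, p):
    """Best k-term approximation error sigma_k(x)_p."""
    return np.linalg.norm(x - best_k_term(x, k), ord=p)

x = np.array([4.0, -0.1, 2.0, 0.05, -3.0, 0.2])
print(best_k_term(x, 3))     # [ 4.  0.  2.  0. -3.  0.]
print(sigma_k(x, 3, 1))      # ~0.35 = |-0.1| + |0.05| + |0.2|
```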

The next lemma states essentially that $$\ell_q$$-balls with small q (ideally q ≤ 1) are good models for compressible vectors.


Lemma 1.

Let 0 < q < p ≤∞ and set $$r = \frac{1} {q} -\frac{1} {p}$$. Then


$$\displaystyle{\sigma _{k}(x)_{p} \leq k^{-r},\quad k = 1,2,\ldots,N\quad \mbox{ for all }x \in B_{ q}^{N}.}$$


Proof.

Let T be the set of indices of the k largest entries of x in absolute value. The nonincreasing rearrangement satisfies $$r_{k}(x) \leq \vert x_{j}\vert$$ for all $$j \in T$$, and therefore


$$\displaystyle{kr_{k}(x)^{q} \leq \sum _{ j\in T}\vert x_{j}\vert ^{q} \leq \| x\|_{ q}^{q} \leq 1.}$$
Hence, $$r_{k}(x) \leq k^{-\frac{1} {q} }$$. Therefore,


$$\displaystyle\begin{array}{rcl} \sigma _{k}(x)_{p}^{p} =\sum _{ j\notin T}\vert x_{j}\vert ^{p} \leq \sum _{ j\notin T}r_{k}(x)^{p-q}\vert x_{ j}\vert ^{q} \leq k^{-\frac{p-q} {q} }\|x\|_{q}^{q} \leq k^{-\frac{p-q} {q} },& & {}\\ \end{array}$$
which implies $$\sigma _{k}(x)_{p} \leq k^{-r}.$$



Compressive Sensing


The above outlined adaptive strategy of compressing a signal x by only keeping its largest coefficients is certainly valid when full information on x is available. If, however, the signal first has to be acquired or measured by a somewhat costly or lengthy procedure, then this seems to be a waste of resources: At first, large efforts are made to acquire the full signal and then most of the information is thrown away when compressing it. One may ask whether it is possible to obtain more directly a compressed version of the signal by taking only a small amount of linear and nonadaptive measurements. Since one does not know a priori the large coefficients, this seems a daunting task at first sight. Quite surprisingly, compressive sensing nevertheless predicts that reconstruction from vastly undersampled nonadaptive measurements is possible – even by using efficient recovery algorithms.

Taking m linear measurements of a signal $$x \in \mathbb{C}^{N}$$ corresponds to applying a matrix $$A \in \mathbb{C}^{m\times N}$$ – the measurement matrix –


$$\displaystyle{ y = Ax. }$$

(3)
The vector $$y \in \mathbb{C}^{m}$$ is called the measurement vector. The main interest is in the vastly undersampled case $$m \ll N$$. Without further information, it is, of course, impossible to recover x from y since the linear system (3) is highly underdetermined and has therefore infinitely many solutions. However, if the additional assumption that the vector x is k-sparse is imposed, then the situation dramatically changes as will be outlined.

The approach for a recovery procedure that probably comes first to mind is to search for the sparsest vector x which is consistent with the measurement vector y = Ax. This leads to solving the $$\ell_0$$-minimization problem


$$\displaystyle{ \min \|z\|_{0}\quad \mbox{ subject to }Az = y. }$$

(4)
Unfortunately, this combinatorial minimization problem is NP-hard in general [108, 114]. In other words, an algorithm that solves (4) for any matrix A and any right-hand side y is necessarily computationally intractable. Therefore, essentially two practical and tractable alternatives to (4) have been proposed in the literature: convex relaxation leading to $$\ell_1$$-minimization – also called basis pursuit [35] – and greedy algorithms, such as various matching pursuits [151, 153]. Quite surprisingly, for both types of approaches various recovery results are available which provide conditions on the matrix A and on the sparsity $$\|x\|_{0}$$ such that the recovered solution coincides with the original x and consequently also with the solution of (4). This is no contradiction to the NP-hardness of (4) since these results apply only to a subclass of matrices A and right-hand sides y.

The $$\ell_1$$-minimization approach considers the solution of


$$\displaystyle{ \min \|z\|_{1}\quad \mbox{ subject to }Az = y, }$$

(5)
which is a convex optimization problem and can be seen as a convex relaxation of (4). Various efficient convex optimization techniques apply for its solution [17]. In the real-valued case, (5) is equivalent to a linear program, and in the complex-valued case, it is equivalent to a second-order cone program. Therefore, standard software applies for its solution – although algorithms which are specialized to (5) outperform such standard software; see Sect. 4.
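
To make the linear-programming reformulation in the real-valued case concrete, one can introduce an auxiliary variable t with $$-t \leq z \leq t$$ componentwise and minimize $$\sum_j t_j$$. The following is a hedged sketch (the function name basis_pursuit is ours, and SciPy's generic linprog solver is assumed available); it is meant as an illustration, not as a competitive solver.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||z||_1 subject to Az = y (real case) as a linear program.

    Variables are stacked as [z; t] with the constraints -t <= z <= t,
    so that minimizing sum(t) minimizes the l1-norm of z.
    """
    m, N = A.shape
    c = np.concatenate([np.zeros(N), np.ones(N)])            # objective: sum of t
    # inequality constraints:  z - t <= 0  and  -z - t <= 0
    A_ub = np.block([[np.eye(N), -np.eye(N)],
                     [-np.eye(N), -np.eye(N)]])
    b_ub = np.zeros(2 * N)
    A_eq = np.hstack([A, np.zeros((m, N))])                   # Az = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * N + [(0, None)] * N)
    return res.x[:N]

# small sanity check: recover a 2-sparse vector from 10 Gaussian measurements
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 40)) / np.sqrt(10)
x = np.zeros(40)
x[[5, 17]] = [1.5, -2.0]
print(np.linalg.norm(basis_pursuit(A, A @ x) - x))            # typically close to 0
```

Specialized first-order methods, such as the ones discussed in Sect. 4, scale far better than this generic LP formulation.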

The hope is, of course, that the solution of (5) coincides with the solution of (4) and with the original sparse vector x. Figure 3 provides an intuitive explanation why $$\ell_1$$-minimization promotes sparse solutions. Here, N = 2 and m = 1, so one deals with a line of solutions $$\mathcal{F}(y) =\{ z: Az = y\}$$ in $$\mathbb{R}^{2}$$. Except for pathological situations where $$\ker A$$ is parallel to one of the faces of the polytope $$B_1^2$$, there is a unique solution of the $$\ell_1$$-minimization problem, which has minimal sparsity, i.e., only one nonzero entry.



Fig. 3
The $$\ell_1$$-minimizer within the affine space of solutions of the linear system Az = y coincides with a sparsest solution

Recovery results in the next sections make rigorous the intuition that $$\ell_1$$-minimization indeed promotes sparsity.

For sparse recovery via greedy algorithms, one refers to the literature [151, 153].


The Null Space Property



The null space property is fundamental in the analysis of $$\ell_1$$-minimization.


Definition 1.

A matrix $$A \in \mathbb{C}^{m\times N}$$ is said to satisfy the null space property (NSP) of order k with constant γ ∈ (0, 1) if


$$\displaystyle{\|\eta _{T}\|_{1} \leq \gamma \|\eta _{T^{c}}\|_{1},}$$
for all sets $$T \subset \{ 1,\ldots,N\}$$, $$\#T \leq k$$, and for all $$\eta \in \ker A$$.

The following sparse recovery result is based on this notion.


Theorem 1.

Let $$A \in \mathbb{C}^{m\times N}$$ be a matrix that satisfies the NSP of order k with constant γ ∈ (0,1). Let $$x \in \mathbb{C}^{N}$$, let y = Ax, and let $$x^{\ast}$$ be a solution of the $$\ell_1$$-minimization problem (5). Then


$$\displaystyle{ \|x - x^{{\ast}}\|_{ 1} \leq \frac{2(1+\gamma )} {1-\gamma } \sigma _{k}(x)_{1}. }$$

(6)
In particular, if x is k-sparse, then $$x^{\ast} = x$$.


Proof.

Let $$\eta = x^{\ast} - x$$. Then $$\eta \in \ker A$$ and


$$\displaystyle{\|x^{{\ast}}\|_{ 1} \leq \| x\|_{1}}$$
because $$x^{\ast}$$ is a solution of the $$\ell_1$$-minimization problem (5). Let T be the set of the k largest entries of x in absolute value. One has


$$\displaystyle{\|x_{T}^{{\ast}}\|_{ 1} +\| x_{T^{c}}^{{\ast}}\|_{ 1} \leq \| x_{T}\|_{1} +\| x_{T^{c}}\|_{1}.}$$
It follows immediately from the triangle inequality that


$$\displaystyle{\|x_{T}\|_{1} -\|\eta _{T}\|_{1} +\|\eta _{T^{c}}\|_{1} -\| x_{T^{c}}\|_{1} \leq \| x_{T}\|_{1} +\| x_{T^{c}}\|_{1}.}$$
Hence,


$$\displaystyle{\|\eta _{T^{c}}\|_{1} \leq \|\eta _{T}\|_{1} + 2\|x_{T^{c}}\|_{1} \leq \gamma \|\eta _{T^{c}}\|_{1} + 2\sigma _{k}(x)_{1},}$$
or, equivalently,


$$\displaystyle{ \|\eta _{T^{c}}\|_{1} \leq \frac{2} {1-\gamma }\sigma _{k}(x)_{1}. }$$

(7)
Finally,


$$\displaystyle\begin{array}{rcl} \|x - x^{{\ast}}\|_{ 1} =\|\eta _{T}\|_{1} +\|\eta _{T^{c}}\|_{1} \leq (\gamma +1)\|\eta _{T^{c}}\|_{1} \leq \frac{2(1+\gamma )} {1-\gamma } \sigma _{k}(x)_{1}& & {}\\ \end{array}$$
and the proof is completed. ■

One can also show that if all k-sparse x can be recovered from y = Ax using $$\ell_1$$-minimization, then necessarily A satisfies the NSP of order k with some constant γ ∈ (0, 1) [38, 87]. Therefore, the NSP is actually equivalent to sparse $$\ell_1$$-recovery.
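
Verifying the NSP computationally is hard in general, but for a toy matrix whose kernel is one-dimensional it reduces to inspecting a single kernel vector. The following hedged sketch (the helper name is ours; SciPy is assumed for the kernel computation) merely illustrates the definition and does not cover matrices with higher-dimensional kernels.

```python
import numpy as np
from scipy.linalg import null_space

def nsp_constant_1d_kernel(A, k):
    """Smallest admissible NSP constant gamma of order k when dim ker A == 1.

    Returns max over |T| <= k of ||eta_T||_1 / ||eta_{T^c}||_1 for the single
    kernel direction eta; A satisfies the NSP of order k iff this value is < 1.
    """
    kern = null_space(A)
    assert kern.shape[1] == 1, "this toy check only covers a one-dimensional kernel"
    eta = np.abs(kern[:, 0])
    # the worst T of size <= k consists of the k largest entries of |eta|
    top = np.sort(eta)[::-1][:k]
    return top.sum() / (eta.sum() - top.sum())

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 8))          # generic: kernel has dimension 1
print(nsp_constant_1d_kernel(A, k=2))    # a value < 1 means the NSP of order 2 holds
```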


The Restricted Isometry Property



The NSP is somewhat difficult to show directly. The restricted isometry property (RIP) is easier to handle, and it also implies stability under noise as stated below.


Definition 2.

The restricted isometry constant δ k of a matrix $$A \in \mathbb{C}^{m\times N}$$ is the smallest number such that


$$\displaystyle{ (1 -\delta _{k})\|z\|_{2}^{2} \leq \| Az\|_{ 2}^{2} \leq (1 +\delta _{ k})\|z\|_{2}^{2}, }$$

(8)
for all $$z \in \Sigma _{k}$$.

A matrix A is said to satisfy the restricted isometry property of order k with constant $$\delta_k$$ if $$\delta_k \in (0, 1)$$. It is easily seen that $$\delta_k$$ can be equivalently defined as


$$\displaystyle{\delta _{k} =\max _{T\subset \{1,\mathop{\ldots },N\},\#T\leq k}\|A_{T}^{{\ast}}A_{ T} -\mathop{{\mathrm{Id}}}\nolimits \|_{2\rightarrow 2},}$$
which means that all column submatrices of A with at most k columns are required to be well conditioned. The RIP implies the NSP as shown in the following lemma.


Lemma 2.

Assume that $$A \in \mathbb{C}^{m\times N}$$ satisfies the RIP of order K = k + h with constant $$\delta_K \in (0,1)$$. Then A has the NSP of order k with constant $$\gamma = \sqrt{\frac{k} {h} \frac{1+\delta _{K}} {1-\delta _{K}}}$$.


Proof.

Let $$\eta \in \mathcal{N} =\ker A$$ and $$T \subset \{ 1,\ldots,N\}$$, $$\#T \leq k$$. Define $$T_0 = T$$ and $$T_{1},T_{2},\ldots,T_{s}$$ to be disjoint sets of indices of size at most h, associated to a nonincreasing rearrangement of the entries of $$\eta \in \mathcal{N}$$, i.e.,


$$\displaystyle{ \vert \eta _{j}\vert \leq \vert \eta _{i}\vert \quad \mbox{ for all }j \in T_{\ell},i \in T_{\ell'},\ell\geq \ell'\geq 1. }$$

(9)
Note that A η = 0 implies $$A\eta _{T_{0}\cup T_{1}} = -\sum _{j=2}^{s}A\eta _{T_{j}}$$. Then, from the Cauchy–Schwarz inequality, the RIP, and the triangle inequality, the following sequence of inequalities is deduced:


$$\displaystyle\begin{array}{rcl} \|\eta _{T}\|_{1}& \leq & \sqrt{k}\|\eta _{T}\|_{2} \leq \sqrt{k}\|\eta _{T_{0}\cup T_{1}}\|_{2} \\ & \leq & \sqrt{ \frac{k} {1 -\delta _{K}}}\|A\eta _{T_{0}\cup T_{1}}\|_{2} = \sqrt{ \frac{k} {1 -\delta _{K}}}\|A\eta _{T_{2}\cup T_{3}\cup \ldots \cup T_{s}}\|_{2} \\ & \leq & \sqrt{ \frac{k} {1 -\delta _{K}}}\sum _{j=2}^{s}\|A\eta _{ T_{j}}\|_{2} \leq \sqrt{\frac{1 +\delta _{K } } {1 -\delta _{K}}}\sqrt{k}\sum _{j=2}^{s}\|\eta _{ T_{j}}\|_{2}.{}\end{array}$$

(10)
It follows from (9) that $$\vert \eta_{i}\vert \leq \vert \eta_{\ell}\vert$$ for all $$i \in T_{j+1}$$ and $$\ell \in T_{j}$$. Taking the sum over $$\ell \in T_{j}$$ first and then the $$\ell_2$$-norm over $$i \in T_{j+1}$$ yields


$$\displaystyle{\vert \eta _{i}\vert \leq h^{-1}\|\eta _{ T_{j}}\|_{1},\quad \mbox{ and }\quad \|\eta _{T_{j+1}}\|_{2} \leq h^{-1/2}\|\eta _{ T_{j}}\|_{1}.}$$
Using the latter estimates in (10) gives


$$\displaystyle{ \|\eta _{T}\|_{1} \leq \sqrt{\frac{1 +\delta _{K } } {1 -\delta _{K}} \frac{k} {h}}\sum _{j=1}^{s-1}\|\eta _{ T_{j}}\|_{1} \leq \sqrt{\frac{1 +\delta _{K } } {1 -\delta _{K}} \frac{k} {h}}\|\eta _{T^{c}}\|_{1}, }$$

(11)
and the proof is finished. ■

Taking h = 2k above shows that $$\delta_{3k} < 1/3$$ implies γ < 1 (indeed, $$\gamma^2 = \frac{1+\delta_{3k}}{2(1-\delta_{3k})} < 1$$ is equivalent to $$\delta_{3k} < 1/3$$). By Theorem 1, recovery of all k-sparse vectors by $$\ell_1$$-minimization is then guaranteed. Additionally, stability in $$\ell_1$$ is also ensured. The next theorem shows that the RIP implies also a bound on the reconstruction error in $$\ell_2$$.


Theorem 2.

Assume $$A \in \mathbb{C}^{m\times N}$$ satisfies the RIP of order 3k with $$\delta_{3k} < 1/3$$. For $$x \in \mathbb{C}^{N}$$, let y = Ax and let $$x^{\ast}$$ be the solution of the $$\ell_1$$-minimization problem (5). Then


$$\displaystyle{\|x - x^{{\ast}}\|_{ 2} \leq C\frac{\sigma _{k}(x)_{1}} {\sqrt{k}} }$$
with $$C = \frac{2} {1-\gamma }\left (\frac{\gamma +1} {\sqrt{2}}+\gamma \right )$$, $$\gamma = \sqrt{ \frac{1+\delta _{3k } } {2(1-\delta _{3k})}}$$.


Proof.

Similarly as in the proof of Lemma 2, denote $$\eta = x^{{\ast}}- x \in \mathcal{N} =\ker A$$, let $$T_0 = T$$ be the set of the 2k largest entries of η in absolute value, and let the $$T_j$$ of size at most k correspond to the nonincreasing rearrangement of η. Then, using (10) and (11) with h = 2k of the previous proof,


$$\displaystyle{\|\eta _{T}\|_{2} \leq \sqrt{ \frac{1 +\delta _{3k } } {2(1 -\delta _{3k})}}k^{-1/2}\|\eta _{ T^{c}}\|_{1}.}$$
From the assumption $$\delta_{3k} < 1/3$$, it follows that $$\gamma:= \sqrt{ \frac{1+\delta _{3k } } {2(1-\delta _{3k})}} < 1$$. Lemmas 1 and 2 yield


$$\displaystyle\begin{array}{rcl} \|\eta _{T^{c}}\|_{2}& =& \sigma _{2k}(\eta )_{2} \leq (2k)^{-\frac{1} {2} }\|\eta \|_{1} = (2k)^{-1/2}\left (\|\eta _{T}\|_{1} +\|\eta _{T^{c}}\|_{1}\right ) {}\\ & \leq & (2k)^{-1/2}\left (\gamma \|\eta _{ T^{c}}\|_{1} +\|\eta _{T^{c}}\|_{1}\right ) \leq \frac{\gamma +1} {\sqrt{2}}k^{-1/2}\|\eta _{ T^{c}}\|_{1}. {}\\ \end{array}$$
Since T is the set of 2k-largest entries of η in absolute value, it holds


$$\displaystyle{ \|\eta _{T^{c}}\|_{1} \leq \|\eta _{(\mathop{{\mathrm{supp}}}\nolimits x_{[2k]})^{c}}\|_{1} \leq \|\eta _{(\mathop{{\mathrm{supp}}}\nolimits x_{[k]})^{c}}\|_{1}, }$$

(12)
where $$x_{[k]}$$ is the best k-term approximation to x. The use of this latter estimate, combined with inequality (7), finally gives


$$\displaystyle\begin{array}{rcl} \|x - x^{{\ast}}\|_{ 2}& \leq & \|\eta _{T}\|_{2} +\|\eta _{T^{c}}\|_{2} \leq \left (\frac{\gamma +1} {\sqrt{2}}+\gamma \right )k^{-1/2}\|\eta _{ T^{c}}\|_{1} {}\\ & \leq & \frac{2} {1-\gamma }\left (\frac{\gamma +1} {\sqrt{2}}+\gamma \right )k^{-1/2}\sigma _{ k}(x)_{1}. {}\\ \end{array}$$
This concludes the proof. ■

The restricted isometry property implies also robustness under noise on the measurements. This fact was first noted in [26, 30].


Theorem 3.

Assume that the restricted isometry constant $$\delta_{2k}$$ of the matrix $$A \in \mathbb{C}^{m\times N}$$ satisfies


$$\displaystyle{ \delta _{2k} < 1/\sqrt{2} \approx 0.7071 }$$

(13)
Then the following holds for all $$x \in \mathbb{C}^{N}$$. Let noisy measurements y = Ax + e be given with $$\|e\|_{2} \leq \eta$$. Let $$x^{\ast}$$ be the solution of


$$\displaystyle{ \min \|z\|_{1}\quad \mathrm{\ subject\ to\ }\|Az - y\|_{2} \leq \eta. }$$

(14)
Then


$$\displaystyle{\|x - x^{{\ast}}\|_{ 2} \leq C_{1}\eta + C_{2}\frac{\sigma _{k}(x)_{1}} {\sqrt{k}} }$$
for some constants $$C_1, C_2 > 0$$ that depend only on $$\delta_{2k}$$.

The constant in (13) was improved several times [22, 38, 74–76] until the present statement was reached in [19], which is actually optimal [45].
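
For small dimensions, the equivalent formulation $$\delta_k =\max_{\#T\leq k}\|A_T^{\ast}A_T -\mathrm{Id}\|_{2\rightarrow 2}$$ can be evaluated directly by enumerating supports. A hedged sketch (the function name is ours; the exhaustive search is feasible only for tiny N and k):

```python
import numpy as np
from itertools import combinations

def restricted_isometry_constant(A, k):
    """delta_k = max over supports T with |T| = k of ||A_T^* A_T - Id||_{2->2}.

    Supports of size exactly k suffice: the spectral norm of a principal
    submatrix of a Hermitian matrix never exceeds that of the full matrix.
    Brute force over all supports -- only feasible for small N and k.
    """
    N = A.shape[1]
    delta = 0.0
    for T in combinations(range(N), k):
        cols = A[:, list(T)]
        G = cols.conj().T @ cols - np.eye(k)
        delta = max(delta, np.linalg.norm(G, 2))   # spectral norm
    return delta

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 40)) / np.sqrt(20)    # Gaussian entries, variance 1/m
print(restricted_isometry_constant(A, k=2))        # typically well below 1
```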


Coherence



The coherence is a by now classical way of analyzing the recovery abilities of a measurement matrix [50, 151]. For a matrix $$A = (a_{1}\vert a_{2}\vert \cdots \vert a_{N}) \in \mathbb{C}^{m\times N}$$ with normalized columns, $$\|a_{\ell}\|_{2} = 1$$, it is defined as


$$\displaystyle{\mu:=\max _{\ell\neq k}\vert \langle a_{\ell},a_{k}\rangle \vert.}$$
Applying Gershgorin’s disc theorem [93] to $$A_T^{\ast}A_T -\mathrm{Id}$$ with $$\#T = k$$ shows that


$$\displaystyle{ \delta _{k} \leq (k - 1)\mu. }$$

(15)
Several explicit examples of matrices are known which have small coherence $$\mu = \mathcal{O}(1/\sqrt{m})$$. A simple one is the concatenation $$A = (I\vert F) \in \mathbb{C}^{m\times 2m}$$ of the identity matrix and the unitary Fourier matrix $$F \in \mathbb{C}^{m\times m}$$ with entries $$F_{j,k} = m^{-1/2}e^{2\pi i\,jk/m}$$. It is easily seen that $$\mu = 1/\sqrt{m}$$ in this case. Furthermore, [143] gives several matrices $$A \in \mathbb{C}^{m\times m^{2} }$$ with coherence $$\mu = 1/\sqrt{m}$$. In all these cases, $$\delta _{k} \leq C \frac{k} {\sqrt{m}}$$. Combining this estimate with the recovery results for $$\ell_1$$-minimization above shows that all k-sparse vectors x can be (stably) recovered from y = Ax via $$\ell_1$$-minimization provided


$$\displaystyle{ m \geq C'k^{2}. }$$

(16)
At first sight, one might be satisfied with this condition since if k is very small compared to N, then m can still be chosen smaller than N and all k-sparse vectors can be recovered from the undersampled measurements y = Ax. Although this is great news for a start, one might nevertheless hope that (16) can be improved. In particular, one may expect that actually a linear scaling of m in k should be enough to guarantee sparse recovery by $$\ell_1$$-minimization. The existence of matrices which indeed provide recovery conditions of the form $$m \geq Ck\log^{\alpha}(N)$$ (or similar) with some α ≥ 1 is shown in the next section. Unfortunately, such results cannot be shown using the coherence alone because of the general lower bound [143]


$$\displaystyle{\mu \geq \sqrt{ \frac{N - m} {m(N - 1)}} \sim \frac{1} {\sqrt{m}}\qquad (N\mbox{ sufficiently large}).}$$
In particular, it is not possible to overcome the “quadratic bottleneck” in (16) by using Gershgorin’s theorem or Riesz–Thorin interpolation between $$\|\cdot \|_{1\rightarrow 1}$$ and $$\|\cdot \|_{\infty \rightarrow \infty }$$; see also [131, 141]. In order to improve on (16), one has to take into account also cancellations in the Gramian $$A_T^{\ast}A_T -\mathrm{Id}$$, and this task seems to be quite difficult using deterministic methods. Therefore, it will not come as a surprise that the major breakthrough in compressive sensing was obtained with random matrices. It is indeed easier to deal with cancellations in the Gramian using probabilistic techniques.
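
A short numerical illustration of these quantities for the concatenation $$A = (I\vert F)$$, given as a hedged sketch assuming NumPy (the helper name is ours):

```python
import numpy as np

def coherence(A):
    """Maximal inner product (in modulus) between distinct normalized columns."""
    A = A / np.linalg.norm(A, axis=0)            # normalize columns
    G = np.abs(A.conj().T @ A)                   # moduli of the Gram matrix
    np.fill_diagonal(G, 0.0)
    return G.max()

m = 64
F = np.fft.fft(np.eye(m), norm="ortho")          # unitary Fourier matrix
A = np.hstack([np.eye(m, dtype=complex), F])     # A = (I | F), size m x 2m

mu = coherence(A)
print(mu, 1 / np.sqrt(m))                        # both equal 0.125 here
k = 5
print("Gershgorin bound (15) on delta_k:", (k - 1) * mu)
```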


RIP for Gaussian and Bernoulli Random Matrices


Optimal estimates for the RIP constants in terms of the number m of measurements can be obtained for Gaussian, Bernoulli, or more general subgaussian random matrices.

Let X be a random variable. Then one defines a random matrix A = A(ω), $$\omega \in \Omega $$, as the matrix whose entries are independent realizations of X, where $$(\Omega,\Sigma, \mathbb{P})$$ is their common probability space. One assumes further that for any $$x \in \mathbb{R}^{N}$$ one has the identity $$\mathbb{E}\|Ax\|_{2}^{2} =\| x\|_{2}^{2}$$, $$\mathbb{E}$$ denoting expectation.

The starting point for the simple approach in [7] is a concentration inequality of the form


$$\displaystyle{ \mathbb{P}\left (\left \vert \|Ax\|_{2}^{2} -\| x\|_{ 2}^{2}\right \vert \geq \delta \| x\|_{ 2}^{2}\right ) \leq 2e^{-c_{0}\delta ^{2}m },\quad 0 <\delta < 1, }$$

(17)
where $$c_0 > 0$$ is some constant.

The two most relevant examples of random matrices which satisfy the above concentration are the following:

1.

Gaussian matrices. Here the entries of A are chosen as i.i.d. Gaussian random variables with expectation 0 and variance 1∕m. As shown in [1], Gaussian matrices satisfy (17).

 

2.

Bernoulli matrices. The entries of a Bernoulli matrix are independent realizations of $$\pm 1/\sqrt{m}$$ Bernoulli random variables; that is, each entry takes the value $$+1/\sqrt{m}$$ or $$-1/\sqrt{m}$$ with equal probability. Bernoulli matrices also satisfy the concentration inequality (17) [1].

 

Based on the concentration inequality (17), the following estimate on RIP constants can be shown [7, 26, 76, 110].


Theorem 4.

Assume $$A \in \mathbb{R}^{m\times N}$$ to be a random matrix satisfying the concentration property (17). Then there exists a constant C depending only on $$c_0$$ such that the restricted isometry constant of A satisfies $$\delta_k \leq \delta$$ with probability exceeding $$1-\varepsilon$$ provided


$$\displaystyle{m \geq C\delta ^{-2}(k\log (N/m) +\log (\varepsilon ^{-1})).}$$

Combining this RIP estimate with the recovery results for $$\ell_1$$-minimization shows that all k-sparse vectors $$x \in \mathbb{C}^{N}$$ can be stably recovered from a random draw of A satisfying (17) with high probability provided


$$\displaystyle{ m \geq Ck\log (N/m). }$$

(18)
Up to the logarithmic factor, this provides the desired linear scaling of the number m of measurements with respect to the sparsity k. Furthermore, as shown in Sect. 3 below, condition (18) cannot be further improved; in particular, the log-factor cannot be removed.
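
The concentration property (17) is easy to observe empirically. The following hedged sketch (assuming NumPy) draws many Gaussian and Bernoulli matrices with the normalization $$\mathbb{E}\|Ax\|_2^2 = \|x\|_2^2$$ and records how often $$\|Ax\|_2^2$$ deviates noticeably from $$\|x\|_2^2$$:

```python
import numpy as np

rng = np.random.default_rng(4)
m, N, trials = 100, 400, 200
x = rng.standard_normal(N)
x /= np.linalg.norm(x)                      # fixed unit-norm vector

def deviations(draw):
    """| ||Ax||_2^2 - ||x||_2^2 | over independent draws of the random matrix A."""
    return np.array([abs(np.linalg.norm(draw() @ x) ** 2 - 1.0) for _ in range(trials)])

gauss = deviations(lambda: rng.standard_normal((m, N)) / np.sqrt(m))
bern = deviations(lambda: rng.choice([-1.0, 1.0], size=(m, N)) / np.sqrt(m))

for name, dev in [("Gaussian", gauss), ("Bernoulli", bern)]:
    print(name, "median deviation:", np.median(dev),
          " empirical P(deviation >= 0.3):", np.mean(dev >= 0.3))
```

Both empirical probabilities shrink rapidly as m grows, in line with the exponential bound in (17).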

It is useful to observe that the concentration inequality is invariant under unitary transforms. Indeed, suppose that z is not sparse with respect to the canonical basis but with respect to a different orthonormal basis. Then z = Ux for a sparse x and a unitary matrix $$U \in \mathbb{C}^{N\times N}$$. Applying the measurement matrix A yields


$$\displaystyle{Az = AUx,}$$
so that this situation is equivalent to working with the new measurement matrix A′ = AU and again sparsity with respect to the canonical basis. The crucial point is that A′ satisfies again the concentration inequality (17) once A does. Indeed, choosing $$x = U^{-1}x'$$ and using unitarity gives


$$\displaystyle\begin{array}{rcl} & & \mathbb{P}\left (\left \vert \|AUx\|_{2}^{2} -\| x\|_{ 2}^{2}\right \vert \geq \delta \| x\|_{ 2}^{2}\right ) = \mathbb{P}\left (\left \vert \|Ax'\|_{ 2}^{2} -\| U^{-1}x'\|_{ 2}^{2}\right \vert \geq \delta \| U^{-1}x'\|_{ 2}^{2}\right ) {}\\ & & \quad = \mathbb{P}\left (\left \vert \|Ax'\|_{2}^{2} -\| x'\|_{ 2}^{2}\right \vert \geq \delta \| x'\|_{ 2}^{2}\right ) \leq 2e^{-c_{0}\delta ^{2}m }. {}\\ \end{array}$$
Hence, Theorem 4 also applies to A′ = AU. This fact is sometimes referred to as the universality of the Gaussian or Bernoulli random matrices. It does not matter in which basis the signal x is actually sparse. At the coding stage, where one takes random measurements y = Az, knowledge of this basis is not even required. Only the decoding procedure needs to know U.
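
A hedged numerical sketch of this universality (assuming NumPy, SciPy's orthonormal DCT, and cvxpy; the variable names are ours): the coefficient vector is sparse in a DCT basis U, the Gaussian measurements are taken without knowledge of U, and only the decoder uses U.

```python
import numpy as np
import cvxpy as cp
from scipy.fft import idct

rng = np.random.default_rng(5)
N, m, k = 128, 40, 5

# coefficient vector x is sparse; the signal z = U x is not
x = np.zeros(N)
x[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)
U = idct(np.eye(N), norm="ortho", axis=0)     # orthonormal DCT synthesis matrix
z = U @ x

A = rng.standard_normal((m, N)) / np.sqrt(m)  # Gaussian matrix, drawn without knowing U
y = A @ z                                     # measurements of the non-sparse signal

# the decoder works with A' = A U and recovers the sparse coefficient vector
c = cp.Variable(N)
cp.Problem(cp.Minimize(cp.norm1(c)), [(A @ U) @ c == y]).solve()
print(np.linalg.norm(U @ c.value - z))        # typically close to 0
```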


Random Partial Fourier Matrices


While Gaussian and Bernoulli matrices provide optimal conditions for the minimal number of required samples for sparse recovery, they are of somewhat limited use for practical applications for several reasons. Often the application imposes physical or other constraints on the measurement matrix, so that assuming A to be Gaussian may not be justifiable in practice. One usually has only limited freedom to inject randomness in the measurements. Furthermore, Gaussian or Bernoulli matrices are not structured, so there is no fast matrix-vector multiplication available which may speed up recovery algorithms, such as the ones described in Sect. 4. Thus, Gaussian random matrices are not applicable in large-scale problems.

A very important class of structured random matrices that overcomes these drawbacks are random partial Fourier matrices, which were also the object of study in the very first papers on compressive sensing [26, 29, 128, 129]. A random partial Fourier matrix $$A \in \mathbb{C}^{m\times N}$$ is derived from the discrete Fourier matrix $$F \in \mathbb{C}^{N\times N}$$ with entries


$$\displaystyle{F_{j,k} = \frac{1} {\sqrt{N}}e^{2\pi i\,jk/N},}$$
by selecting m rows uniformly at random among all N rows. Taking measurements of a sparse $$x \in \mathbb{C}^{N}$$ corresponds then to observing m of the entries of its discrete Fourier transform $$\hat{x} = Fx$$. It is important to note that the fast Fourier transform may be used to compute matrix-vector multiplications with A and $$A^{\ast}$$ with complexity $$\mathcal{O}(N\log (N))$$. The following theorem concerning the RIP constant was proven in [131] and improves slightly on the results in [26, 129, 138].


Theorem 5.

Let $$A \in \mathbb{C}^{m\times N}$$ be the random partial Fourier matrix as just described. Then the restricted isometry constant of the rescaled matrix $$\sqrt{\frac{N} {m}}A$$ satisfies $$\delta_k \leq \delta$$ with probability at least $$1 - N^{-\gamma \log ^{3}(N) }$$ provided


$$\displaystyle\begin{array}{rcl} m \geq C\delta ^{-2}k\log ^{4}(N).& &{}\end{array}$$

(19)
The constants C,γ > 1 are universal.

Combining this estimate with the $$\ell_1$$-minimization results above shows that recovery with high probability can be ensured for all k-sparse x provided


$$\displaystyle{m \geq Ck\log ^{4}(N).}$$
The plots in Fig. 1 illustrate an example of successful recovery from partial Fourier measurements.

The proof of the above theorem is not straightforward and involves Dudley’s inequality as a main tool [131, 138]. Compared to the recovery condition (18) for Gaussian matrices, one suffers a higher exponent in the log-factor, but the linear scaling of m in k is preserved. Also a nonuniform recovery result for $$\ell_1$$-minimization is available [29, 128, 131], which states that each k-sparse x can be recovered using a random draw of the random partial Fourier matrix A with probability at least 1 −ɛ provided $$m \geq Ck\log (N/\varepsilon )$$. The difference from the statement in Theorem 5 is that, for each sparse x, recovery is ensured with high probability for a new random draw of A. It does not imply the existence of a matrix which allows recovery of all k-sparse x simultaneously. The proof of such recovery results does not make use of the restricted isometry property or the null space property.
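
The practical appeal of this matrix class comes from the FFT: A and $$A^{\ast}$$ never need to be stored explicitly. A hedged sketch assuming NumPy (the function names are ours) of the subsampled DFT and its adjoint as fast operators:

```python
import numpy as np

rng = np.random.default_rng(6)
N, m = 4096, 256
rows = np.sort(rng.choice(N, size=m, replace=False))   # random subset of frequencies

def A_apply(x):
    """y = A x: m randomly selected entries of the unitary DFT of x, in O(N log N)."""
    return np.fft.fft(x, norm="ortho")[rows]

def A_adjoint(y):
    """A^* y: place the m samples at their frequencies, apply the unitary inverse DFT."""
    xhat = np.zeros(N, dtype=complex)
    xhat[rows] = y
    return np.fft.ifft(xhat, norm="ortho")

# adjoint consistency check: <A x, y> should equal <x, A^* y>
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
y = rng.standard_normal(m) + 1j * rng.standard_normal(m)
print(np.allclose(np.vdot(A_apply(x), y), np.vdot(x, A_adjoint(y))))   # True
```

Iterative recovery algorithms, such as the ones described in Sect. 4, only require these two operations, which is what makes partial Fourier measurements attractive for large-scale problems.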

One may generalize the above results to a much broader class of structured random matrices which arise from random sampling in bounded orthonormal systems. The interested reader is referred to [128, 129, 131, 132].

Another class of structured random matrices for which recovery results are known consists of partial random circulant and Toeplitz matrices. These correspond to subsampling the convolution of x with a random vector b at m fixed (deterministic) entries. The reader is referred to [130, 131, 133] for detailed information. Near-optimal estimates of the RIP constants of such random matrices have been established in [101]. Further types of random measurement matrices are discussed in [101, 122, 124, 137, 154]; see also [99] for an overview.


Compressive Sensing and Gelfand Widths



In this section a quite general viewpoint is taken. The question of how well any measurement matrix and any reconstruction method – in this context usually called the decoder – may perform is investigated. This leads to the study of Gelfand widths, already mentioned in Sect. 2. The corresponding analysis allows one to draw the conclusion that Gaussian random matrices in connection with $$\ell_1$$-minimization provide optimal performance guarantees.

Following the tradition of the literature in this context, only the real-valued case will be treated. The complex-valued case is easily deduced from the real-valued case by identifying $$\mathbb{C}^{N}$$ with $$\mathbb{R}^{2N}$$ and by corresponding norm equivalences of $$\ell_p$$-norms.

The measurement matrix $$A \in \mathbb{R}^{m\times N}$$ is here also referred to as the encoder. The set $$\mathcal{A}_{m,N}$$ denotes all possible encoder/decoder pairs $$(A,\Delta )$$ where $$A \in \mathbb{R}^{m\times N}$$ and $$\Delta: \mathbb{R}^{m} \rightarrow \mathbb{R}^{N}$$ is any (nonlinear) function. Then, for 1 ≤ kN, the reconstruction errors over subsets $$K \subset \mathbb{R}^{N}$$, where $$\mathbb{R}^{N}$$ is endowed with a norm $$\|\cdot \|_{X}$$, are defined as


$$\displaystyle\begin{array}{rcl} \sigma _{k}(K)_{X}&:=& \sup _{x\in K}\sigma _{k}(x)_{X}, {}\\ E_{m}(K,X)&:=& \inf _{(A,\Delta )\in A_{m,N}}\sup _{x\in K}\|x - \Delta (Ax)\|_{X}. {}\\ \end{array}$$
In words, $$E_m(K, X)$$ is the worst reconstruction error for the best pair of encoder/decoder. The goal is to find the largest k such that


$$\displaystyle{E_{m}(K,X) \leq C_{0}\sigma _{k}(K)_{X}.}$$
Of particular interest for compressive sensing are the unit balls $$K = B_p^N$$ for 0 < p ≤ 1 and $$X = \ell_2^N$$ because the elements of $$B_p^N$$ are well approximated by sparse vectors due to Lemma 1. The proper estimate of $$E_m(K, X)$$ turns out to be linked to the geometrical concept of Gelfand width.


Definition 3.

Let K be a compact set in a normed space X. Then the Gelfand width of K of order m is


$$\displaystyle{d^{m}(K,X):=\inf _{\begin{array}{c} Y \subset X \\ \mathop{{\mathrm{codim}}}\nolimits (Y ) \leq m \end{array} }\sup \{\|x\|_{X}: x \in K\cap Y \},}$$
where the infimum is over all linear subspaces Y of X of codimension less than or equal to m.

The following fundamental relationship between $$E_m(K, X)$$ and the Gelfand widths holds.


Proposition 1.

Let $$K \subset \mathbb{R}^{N}$$ be a closed compact set such that K = −K and $$K + K \subset C_0K$$ for some constant $$C_0$$. Let $$X = (\mathbb{R}^{N},\|\cdot \|_{X})$$ be a normed space. Then


$$\displaystyle{d^{m}(K,X) \leq E_{ m}(K,X) \leq C_{0}d^{m}(K,X).}$$


Proof.

For a matrix $$A \in \mathbb{R}^{m\times N}$$, the subspace $$Y =\ker A$$ has codimension less or equal to m. Conversely, to any subspace $$Y \subset \mathbb{R}^{N}$$ of codimension less or equal to m, a matrix $$A \in \mathbb{R}^{m\times N}$$ can be associated, the rows of which form a basis for $$Y ^{\perp }$$. This identification yields


$$\displaystyle{d^{m}(K,X) =\inf _{ A\in \mathbb{R}^{m\times N}}\sup \{\|\eta \|_{X}:\eta \in \ker A \cap K\}.}$$
Let $$(A,\Delta )$$ be an encoder/decoder pair in $$\mathcal{A}_{m,N}$$ and $$z = \Delta (0)$$. Denote Y = ker(A). Then with ηY also −ηY, and either $$\|\eta -z\|_{X} \geq \|\eta \|_{X}$$ or $$\|-\eta - z\|_{X} \geq \|\eta \|_{X}$$. Indeed, if both inequalities were false, then


$$\displaystyle{\|2\eta \|_{X} =\|\eta -z + z +\eta \| _{X} \leq \|\eta -z\|_{X} +\| -\eta - z\|_{X} < 2\|\eta \|_{X},}$$
a contradiction. Since K = −K, it follows that


$$\displaystyle\begin{array}{rcl} d^{m}(K,X)& =& \inf _{ A\in \mathbb{R}^{m\times N}}\sup \{\|\eta \|_{X}:\eta \in Y \cap K\} \leq \sup _{\eta \in Y \cap K}\|\eta - z\|_{X} {}\\ & =& \sup _{\eta \in Y \cap K}\|\eta - \Delta (A\eta )\|_{X} \leq \sup _{x\in K}\|x - \Delta (Ax)\|_{X}. {}\\ \end{array}$$
Taking the infimum over all $$(A,\Delta ) \in \mathcal{A}_{m,N}$$ yields


$$\displaystyle{d^{m}(K,X) \leq E_{ m}(K,X).}$$
To prove the converse inequality, choose an optimal Y such that


$$\displaystyle{d^{m}(K,X) =\sup \{\| x\|_{ X}: x \in Y \cap K\}.}$$
(An optimal subspace Y always exists [107].) Let A be a matrix whose rows form a basis for $$Y ^{\perp }$$. Denote the affine solution space $$\mathcal{F}(y):=\{ x: Ax = y\}$$. One defines then a decoder as follows. If $$\mathcal{F}(y) \cap K\neq \emptyset$$, then choose some $$\bar{x}(y) \in \mathcal{F}(y) \cap K$$ and set $$\Delta (y) =\bar{ x}(y)$$. If $$\mathcal{F}(y) \cap K =\emptyset$$, then $$\Delta (y) \in \mathcal{F}(y)$$. The following chain of inequalities is then deduced:


$$\displaystyle\begin{array}{rcl} E_{m}(K,X)& \leq & \sup _{y}\sup _{x,x'\in \mathcal{F}(y)\cap K}\|x - x'\|_{X} {}\\ & \leq & \sup _{\eta \in C_{0}(Y \cap K)}\|\eta \|_{X} \leq C_{0}d^{m}(K,X), {}\\ \end{array}$$
which concludes the proof. ■

The assumption K + KC 0 K clearly holds for norm balls with C 0 = 2 and for quasi-norm balls with some C 0 ≥ 2. The next theorem provides a two-sided estimate of the Gelfand widths d m (B p N , 2 N ) [48, 77, 157]. Note that the case p = 1 was considered much earlier in [77, 79, 96].
