Optimal Transport Averaging of Neuroimaging Data

Simply put, we consider normalized histograms on a grid of locations, and assume that the distance between any two locations is provided. We then extend Wasserstein distances to non-negative vectors of bounded mass.


Notations. Let d be the cardinality of $$\varOmega $$. Arbitrarily relabeling the elements of $$\varOmega $$ as $$\{1,\cdots , d\}$$, we represent the set of non-negative measures on $$\varOmega $$ by the non-negative orthant $$\mathbb {R}^{d}_+$$. Let $$\mathbf {1}_d$$ be the d-dimensional vector of ones. For any vector $$u\in \mathbb {R}^d$$ we write $$|u|_1$$ for the $$l_1$$ norm of u, $$\sum _{i=1}^d |u_i|$$. When $$u\in \mathbb {R}^{d}_+$$ we also call $$|u|_1$$ the (total) mass of vector u. Let $$M=[m_{ij}]\in \mathbb {R}^{d\times d}_+$$ be the matrix that describes the metric between all locations in $$\varOmega $$, namely $$m_{ij}=D(i,j), i,j\le d$$.
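To make these notations concrete, the following minimal sketch builds such a ground metric matrix M for a hypothetical 4x4 regular grid with Euclidean ground distance; both the grid and the choice of distance are illustrative assumptions, not fixed by the text.

```python
import numpy as np

# Hypothetical example: Omega is a 4x4 regular 2D grid, so d = 16; any other
# finite set of locations with a known metric D could be used instead.
grid = np.array([(i, j) for i in range(4) for j in range(4)], dtype=float)
d = len(grid)  # cardinality of Omega

# M[i, j] = D(i, j): here, the Euclidean distance between grid locations.
M = np.linalg.norm(grid[:, None, :] - grid[None, :, :], axis=-1)
assert M.shape == (d, d) and np.allclose(M, M.T) and np.all(np.diag(M) == 0)
```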

Wasserstein Distance for Normalized Histograms. Consider two vectors $$a,b\in \mathbb {R}^{d}_+$$ such that $$|a|_1=|b|_1$$. Both can be interpreted as histograms on $$\varOmega $$ of the same mass. A non-trivial example of such normalized data in medical imaging is the discretized ODF used for diffusion imaging data [10]. For $$p\ge 1$$, the p-Wasserstein distance $$W_p(a,b)$$ between a and b is the $$p^\text {th}$$ root of the optimum of a linear program, known as a transportation problem [4, §7.2]. A transportation problem is a network flow problem on a bipartite graph with cost $$M^p$$ (the pairwise distance matrix M raised element-wise to the power p) and feasible set of flows $$U(a,b)$$ (known as the transportation polytope of a and b), where $$U(a,b)$$ is the set of $$d\times d$$ non-negative matrices whose row and column marginals are equal to a and b respectively:


$$\begin{aligned} U(a,b) \overset{\text {def}}{=}\{ T\in \mathbb {R}_+^{d\times d}\; |\; T\mathbf {1}_d = a,\, T^T\mathbf {1}_d = b \}. \end{aligned}$$

(1)
Given the constraints induced by a and b, one naturally has that $$U(a,b)$$ is empty when $$|a|_1\ne |b|_1$$ and non-empty when $$|a|_1=|b|_1$$ (in which case one can easily check that the matrix $$ab^T/|a|_1$$ belongs to that set). The p-Wasserstein distance $$W_p(a,b)$$ raised to the power p (written $$W^p_p(a,b)$$ below) is equal to the optimum of a parametric Optimal Transport ($$\mathbf {OT}$$) problem on $$d^2$$ variables,


$$\begin{aligned} W_p^p(a,b) = \mathbf {OT}(a,b,M^p)\overset{\text {def}}{=}\min _{T\in U(a,b)}\langle T , M^p\,\rangle , \end{aligned}$$

(2)
parameterized by the marginals a, b and the matrix $$M^p$$.
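As an illustration of Eq. (2), the following minimal sketch solves the transportation problem with a generic linear programming solver; the helper name, the small example data, and the use of SciPy are illustrative choices of ours, and the approach used later in the paper relies instead on the entropic smoothing of [6] discussed in Sect. 3.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_p(a, b, M, p=1):
    """Return W_p(a, b)^p, the optimum of the transportation LP of Eq. (2).

    Minimal sketch with a generic LP solver; dedicated network-flow or
    entropic solvers scale much better than this d^2-variable LP.
    """
    d = len(a)
    cost = (M ** p).ravel()                       # cost of the flattened coupling T
    A_rows = np.kron(np.eye(d), np.ones((1, d)))  # encodes T 1_d = a
    A_cols = np.kron(np.ones((1, d)), np.eye(d))  # encodes T^T 1_d = b
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([a, b])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# The independent coupling a b^T / |a|_1 is feasible whenever |a|_1 = |b|_1.
a, b = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
T0 = np.outer(a, b) / a.sum()
assert np.allclose(T0.sum(axis=1), a) and np.allclose(T0.sum(axis=0), b)

M_small = np.abs(np.subtract.outer(np.arange(3.0), np.arange(3.0)))  # 1D grid metric
print(wasserstein_p(a, b, M_small, p=1))  # W_1(a, b)
```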

Optimal Transport for Unnormalized Measures. If the total masses of a and b differ, namely $$|a|_1\ne |b|_1$$, the definition provided above is not useful because $$U(a,b)=\emptyset $$. Several extensions of the OT problem have been proposed in that setting; we recall them here for the sake of completeness. In the computer vision literature, [23] proposed to handle that case by: (i) relaxing the equality constraints of $$U(a,b)$$ to inequality constraints $$T\mathbf {1}_d \le a,\, T^T\mathbf {1}_d \le b$$ in Eq. (1); (ii) adding an equality constraint on the total mass of the solution, $$\mathbf {1}_d^TT\mathbf {1}_d=\min (|a|_1,|b|_1)$$; (iii) dividing the minimum of $$\langle T , M\,\rangle $$ under constraints (i,ii) by $$\min (|a|_1,|b|_1)$$. This modification does not, however, result in a metric. [20] later proposed a variant of this approach, called EMD-hat, that incorporates constraints (i,ii) but (iii’) adds to the optimal cost $$\langle T^\star , M\,\rangle $$ a constant times $$\min (|a|_1,|b|_1)$$. When that constant is large enough, [20] claim that EMD-hat is a metric. We also note that [2] proposed a quadratic penalty on the difference of masses and made use of a dynamic formulation of the transportation problem.
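A minimal sketch of our reading of steps (i)-(iii), again with a generic LP solver; the helper name and the constraint encoding are illustrative assumptions, not code from [23].

```python
import numpy as np
from scipy.optimize import linprog

def emd_relaxed(a, b, M):
    """Relaxed transport cost in the spirit of [23] for |a|_1 != |b|_1.

    (i)   marginal constraints relaxed to T 1_d <= a and T^T 1_d <= b,
    (ii)  total transported mass fixed to min(|a|_1, |b|_1),
    (iii) optimal cost divided by that mass.
    """
    d = len(a)
    m = min(a.sum(), b.sum())
    A_rows = np.kron(np.eye(d), np.ones((1, d)))   # T 1_d   <= a
    A_cols = np.kron(np.ones((1, d)), np.eye(d))   # T^T 1_d <= b
    A_ub, b_ub = np.vstack([A_rows, A_cols]), np.concatenate([a, b])
    A_eq, b_eq = np.ones((1, d * d)), [m]          # 1^T T 1 = min(|a|_1, |b|_1)
    res = linprog(M.ravel(), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun / m
```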

Kantorovich Norms for Signed Measures. We propose to build on early contributions by Kantorovich to define a generalization of the optimal transport distance to unnormalized measures, making optimal transport applicable to a wider class of problems, such as the averaging of functional imaging data. [18] proposed such a generalization as an intermediate result towards a more general definition, the Kantorovich norm for signed measures on a compact metric space, which was itself extended to separable metric spaces by [15]. We summarize this idea here by simplifying it to the case of interest in this paper, where $$\varOmega $$ is a finite (of size d) probability space, in which case signed measures are equivalent to vectors in $$\mathbb {R}^d$$. [18] first propose a norm for vectors z in the orthogonal complement of $$\mathbf {1}_d$$ (vectors z such that $$z^T\mathbf {1}_d=0$$), defined as the 1-Wasserstein distance between the positive and negative parts of z, $$||z||_K= W_1(z_+,z_-)$$. A penalty vector $$\varDelta \in \mathbb {R}^{d}_+$$ is then introduced to define the norm $$||x||_{K}$$ of any vector x as the minimal value of $$||z||_K + \varDelta ^T|z-x|$$ when z ranges over all vectors with zero sum, and $$|z-x|$$ denotes the element-wise absolute value of the difference of vectors z and x. For this to define a true norm on $$\mathbb {R}^d$$, $$\varDelta $$ must be such that $$\varDelta _i\ge \max _j m_{ij}$$ and $$|\varDelta _i-\varDelta _j|\le m_{ij}$$. The distance between two arbitrary non-negative vectors a, b of different mass is then defined as $$||a-b||_K$$. As highlighted by [26, p.108], and writing $$e_i$$ for the $$i^{\text {th}}$$ vector of the canonical basis of $$\mathbb {R}^d$$, this norm is the maximal norm on $$\mathbb {R}^d$$ such that for any $$i,j \le d$$, $$||e_i-e_j||_K=m_{ij}$$, namely the maximal norm on the space of signed measures on $$\varOmega $$ such that the norm between two Dirac measures coincides with $$\varOmega $$’s metric between these points.
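For a vector z with zero total mass, $$||z||_K$$ is simply a 1-Wasserstein distance between its positive and negative parts; a minimal sketch, reusing the wasserstein_p helper sketched above (itself an illustrative assumption, not the paper's solver):

```python
import numpy as np

def kantorovich_norm_zero_sum(z, M):
    """||z||_K = W_1(z_+, z_-) for z such that z^T 1_d = 0.

    Relies on the wasserstein_p sketch above; with p = 1 the LP optimum
    is the 1-Wasserstein distance itself.
    """
    assert abs(z.sum()) < 1e-12, "z must have zero total mass"
    z_pos, z_neg = np.maximum(z, 0.0), np.maximum(-z, 0.0)
    return wasserstein_p(z_pos, z_neg, M, p=1)
```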

Kantorovich Distances for Unnormalized Nonnegative Measures. Reference [14] noticed that Kantorovich’s distance between unnormalized measures can be cast as a regular optimal transport problem. Indeed, one simply needs to: (i) add a virtual point $$\omega $$ to the set $$\varOmega =\{1,\cdots ,d\}$$ whose distance $$D(i,\omega )=D(\omega ,i)$$ to any element i in $$\varOmega $$ is set to $$\varDelta _i\,$$; (ii) use that point $$\omega $$ as a buffer when comparing two measures of different mass. The appeal of Kantorovich’s formulation in the context of this work is that it boils down to a classic optimal transport problem, which can be approximated efficiently using the smoothing approach of [6], as discussed in Sect. 3. To simplify our analysis in the next section, we only consider non-negative vectors (histograms) $$a\in \mathbb {R}^{d}_+$$ whose total mass is upper bounded by a known positive constant. This assumption simplifies the definition of our distance below, since it removes the need to treat separately the cases $$|a|_1 \ge |b|_1$$ and $$|a|_1 < |b|_1$$ when comparing $$a,b\in \mathbb {R}^{d}_+$$. Note also that this assumption always holds when dealing with finite collections of data, and that, after a simple rescaling of all vectors by that constant, it amounts without loss of generality to considering vectors a in $$\mathbb {R}^{d}_+$$ such that $$|a|_1\le 1$$. We define next the Kantorovich metric on $$S_d$$, where $$S_d=\{u\in \mathbb {R}^{d}_+, |u|_1 \le 1\}$$.
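A minimal sketch of the virtual-point construction (i)-(ii) described above; the helper name and the padding to unit mass (which assumes $$|a|_1,|b|_1\le 1$$ as in this paragraph) are ours, and the formal definition given in the next section may normalize things differently.

```python
import numpy as np

def kantorovich_unnormalized(a, b, M, Delta, p=1):
    """Compare two unnormalized histograms via the virtual-point construction.

    A buffer point omega is appended at distance Delta_i from location i,
    both histograms are padded so that their total mass equals 1, and a
    standard OT problem is solved (wasserstein_p is the sketch given above).
    Assumes |a|_1 <= 1 and |b|_1 <= 1.
    """
    d = len(a)
    M_hat = np.zeros((d + 1, d + 1))
    M_hat[:d, :d] = M
    M_hat[:d, d] = Delta                  # D(i, omega) = Delta_i
    M_hat[d, :d] = Delta                  # D(omega, i) = Delta_i
    a_hat = np.append(a, 1.0 - a.sum())   # mass routed to the buffer point
    b_hat = np.append(b, 1.0 - b.sum())
    return wasserstein_p(a_hat, b_hat, M_hat, p=p)
```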
