Semantic Descriptions from Medical Images with Convolutional Neural Networks

Fig. 1.

Multi-scale CNN architecture used in our experiments. One stack of three pairs of convolution (Conv) and max-pooling layers (MP) uses input image patches of size $35\times 35$ . The second stack comprises two pairs of convolution and max-pooling layers and uses input image patches of size $71\times 71$ (centered at the same position). The resulting outputs $$Output_k$$

of the k pairs of convolution $$Conv_k$$

and max-pooling $$MP_k$$

layers are the inputs for the succeeding layers, with corresponding convolution filter sizes $$f_s$$

and sizes of the pooling regions $$p_s$$

. Our CNN also comprises two fully connected layers (FC) and a terminal classification layer (CL). The outputs of both stacks are connected densely with all neurons of the first fully-connected layer. The location parameters are fed jointly with the activations of the second fully-connected layer into the classification layer.

2.1 Representing Inputs and Targets

The overall data comprises M tuples of medical imaging data, corresponding clinical reports and voxel-wise ground-truth class labels $\langle \mathbf {I}^m, \mathbf {T}^m, \mathbf {L}^m \rangle$ , with $m=1,2,\dots ,M$ , where $\mathbf {I}^m \in \mathbb {R}^{n \times n}$ is an intensity image (e.g., a slice of an SD-OCT volume scan of the retina) of size $n \times n$ , $\mathbf {L}^m \in \{1,\ldots ,K+1\}^{n \times n}$ is an array of the same size containing the corresponding ground-truth class labels and $\mathbf {T}^m$ is the corresponding textual report. During training we are only given $\langle \mathbf {I}^m, \mathbf {T}^m\rangle$ and train a classifier to predict $\mathbf {L}^m$ from $\mathbf {I}^m$ on new testing data. In this paper we propose a weakly supervised learning approach using semantic descriptions, where the voxel-level ground-truth class labels $\mathbf {L}^m$ are not used for training but only for evaluation of the voxel-wise prediction accuracy.

Visual and Coordinate Input Information. To capture visual information at different levels of detail we extract small square-shaped image patches $\mathbf {\dot{x}}_i^m \in \mathbb {R}^{\alpha \times \alpha }$ of size $\alpha$ and larger square-shaped image patches $\mathbf {\ddot{x}}_i^m \in \mathbb {R}^{\beta \times \beta }$ of size $\beta$ with $\alpha <\beta <n$ centered at the same spatial position $\mathbf {c}_i^m$ in volume $\mathbf {I}^m$ , where i is the index of the centroid of the image patch. For each image patch, we provide two additional quantitative location parameters to the network: (i) the 3D spatial coordinates $\mathbf {c}_i^m \in \varOmega \subset \mathbb {R}^3$ of the centroid i of the image patches and (ii) the Euclidean distance $d_i^m \in \varOmega \subset \mathbb {R}$ of the patch center i to a given reference structure (in our case: fovea) within the volume. We do not need to integrate these location parameters in the deep feature representation computation but inject them below the classification layer by concatenating the location parameters and activations of the fully-connected layer representing visual information (see Fig. 1). The same input information is provided for all experiments.

Semantic Target Labels. We assume that objects (e.g. pathology) are reported together with a textual description of their approximate spatial location. Thus a report $\mathbf {T}^m$ consists of K pairs of text snippets $\langle t_{P}^{m,k}, t_{Loc}^{m,k}\rangle$ , with $k=1,2,\dots ,K$ , where $t_{P}^{m,k} \in \mathcal {P}$ describes the occurrence of a specific object class term and $t_{Loc}^{m,k} \in \mathcal {L}$ represents the semantic description of its spatial locations. These spatial locations can be both abstract subregions (e.g., centrally located) of the volume or concrete anatomical structures. Note that $t_{Loc}^{m,k}$ does not contain quantitative values, and we do not know the link between these descriptions and image coordinate information. This semantic information can come in $\varGamma$ orthogonal semantic groups (e.g., in (1) the lowest layer and (2) close to the fovea). That is, different groups represent different location concepts found in clinical reports. The extraction of these pairs from the textual document is based on semantic parsing [15] and is not subject of this paper. We decompose the textual report $\mathbf {T}^m$ into the corresponding semantic target label $\mathbf {s}^m \in \{0,1\}^{K \cdot \sum _{\gamma } n_{\gamma }}$ , with $\gamma =1,2,\dots ,\varGamma$ , where K is the number of different object classes which should be classified (e.g. cyst), and $n_{\gamma }$ is the number of nominal region classes in one semantic group $\gamma$ of descriptions (e.g., $n_{\gamma }=3$ for upper vs. central vs. lower layer, $n_{\gamma } = 2$ for close vs. far from reference structure). I.e., lets assume we have two groups, then $\mathbf {s}^m$ is a K-fold concatenation of pairs of a binary layer group $g_1^k \in \{0,1\}^{n_1}$ with

bits representing different layer classes and a binary reference location group $g_2^k \in \{0,1\}^{n_2}$ with

bits representing relative locations to a reference structure. For all object classes, all bits of the layer group, and all bits of the reference location group are set to 1, if they are mentioned mutually with the respective object class in the textual report. All bits of the corresponding layer group and all bits of the corresponding reference location group are set to 0, where the respective object class is not mentioned in the report. The vector $\mathbf {s}^m$ of semantic target labels is assigned to all input tuples $\langle \mathbf {\dot{x}}_i^m, \mathbf {\ddot{x}}_i^m, \mathbf {c}_i^m, d_i^m \rangle$ extracted from the corresponding volume $\mathbf {I}^m$ . Figure 2a shows an example of a semantic target label representation comprising two object classes. According to this binary representation the first object is mentioned mutually with layer classes 1, 2 and 3 and with reference location class 1 in the textual report. Figure 2c shows the corresponding volume information.

Fig. 2.

(a) Example of a semantic target label comprising two object classes (K = 2). Each of which comprises a layer group $$g_1^k$$

with 4 bits (layer 1,2,3, or 4) and a reference location group $$g_2^k$$

with 2 bits (close or distant). (b) Prediction of a semantic description $\hat{\mathbf {s}}_i^m$ that would lead to a corresponding object class label prediction $\hat{l}_i = 1$ . (c) Visualization of the volume information which could lead to the given semantic target label shown in (a). (Best viewed in color) (Colour figure online)

Voxel-wise Ground Truth Labels. To evaluate the algorithm, we use the ground-truth class label $l_i \in \{1,...,K+1\}$ from $\mathbf {L}^m$ at the center position $\mathbf {c}_i^m$ of the patches for every multi-scale image patch pair $\langle \mathbf {\dot{x}}_i^m, \mathbf {\ddot{x}}_i^m \rangle$ . Labels include the reported observations $t_{P}^{m,k}$ and a healthy background label.

is assigned to the whole multi-scale image patch pair $\langle \mathbf {\dot{x}}_i^m, \mathbf {\ddot{x}}_i^m \rangle$ centered at voxel position i.

2.2 Training to Predict Semantic Descriptors

We train a CNN to predict the semantic description from the imaging data and the corresponding location information provided for the patch center voxels. We use tuples $\langle \mathbf {\dot{x}}_i^m, \mathbf {\ddot{x}}_i^m, \mathbf {c}_i^m, d_i^m, \mathbf {s}_i^m \rangle$ for weakly supervised training of our model. The training objective is to learn the mapping

$\begin{aligned} f: \langle \mathbf {\dot{x}}_i^m, \mathbf {\ddot{x}}_i^m, \mathbf {c}_i^m, d_i^m \rangle \longmapsto \mathbf {s}_i^m \end{aligned}$