# Algorithms from a Non-stochastic Perspective

The problem is to maximize a nonnegative function $f:Z\to\mathbb{R}$, where Z is an arbitrary set. We assume that there is $z^* \in Z$ with $f(z^*) \ge f(z)$, for all $z \in Z$. We also assume that there is a nonnegative function $b(x, z)$ such that

$$f(z) = \int b(x, z)\,dx.$$

Having found $z^k$, we maximize the function

$$H(z^k, z) = \int b(x, z^k)\log b(x, z)\,dx \tag{1}$$

to get $z^{k+1}$. Adopting such an iterative approach presupposes that maximizing $H(z^k, z)$ is simpler than maximizing $f(z)$ itself. This is the case with the EM algorithm.

The cross-entropy or Kullback-Leibler distance [7] is a useful tool for analyzing the EM algorithm. For positive numbers u and v, the Kullback-Leibler distance from u to v is

$$KL(u, v) = u\log\frac{u}{v} + v - u. \tag{2}$$

We also define KL(0, 0) = 0, KL(0, v) = v, and KL(u, 0) = +∞. The KL distance is extended to nonnegative vectors component-wise, so that for nonnegative vectors a and b, we have

$$KL(a, b) = \sum_{n=1}^{N} KL(a_n, b_n). \tag{3}$$
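
These definitions translate directly into code; here is a minimal sketch (the function names `kl` and `kl_vec` are ours, chosen for illustration):

```python
import math

def kl(u, v):
    """Kullback-Leibler distance KL(u, v) = u*log(u/v) + v - u, for u, v >= 0."""
    if u == 0.0:
        return v          # KL(0, v) = v, which also covers KL(0, 0) = 0
    if v == 0.0:
        return math.inf   # KL(u, 0) = +infinity for u > 0
    return u * math.log(u / v) + v - u

def kl_vec(a, b):
    """Component-wise extension to nonnegative vectors, as in Eq. (3)."""
    return sum(kl(an, bn) for an, bn in zip(a, b))

# KL(u, v) >= 0, with equality only when u = v:
print(kl(2.0, 2.0))            # 0.0
print(kl_vec([1, 2], [2, 1]))  # strictly positive
```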
One of the most useful and easily proved facts about the KL distance is contained in the following lemma; we simplify the notation by writing $b(z)$ for the vector with entries $b(x, z)$.

Lemma 1.

For nonnegative vectors a and b, with $a_+ = \sum_{n=1}^{N} a_n > 0$ and $b_+ = \sum_{n=1}^{N} b_n > 0$, we have

$$KL(a, b) = KL(a_+, b_+) + KL\Big(a, \frac{a_+}{b_+}\,b\Big) \ge KL(a_+, b_+). \tag{4}$$
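
Lemma 1 is easy to check numerically. Reading the lemma with $a_+ = \sum_n a_n$ and $b_+ = \sum_n b_n$, the following sketch verifies the identity $KL(a,b) = KL(a_+, b_+) + KL(a, (a_+/b_+)\,b)$, and the resulting inequality, on random nonnegative vectors:

```python
import math, random

def kl(u, v):
    """KL(u, v) = u*log(u/v) + v - u for u, v >= 0."""
    if u == 0.0:
        return v
    if v == 0.0:
        return math.inf
    return u * math.log(u / v) + v - u

def kl_vec(a, b):
    """Component-wise extension to nonnegative vectors."""
    return sum(kl(an, bn) for an, bn in zip(a, b))

random.seed(1)
a = [random.uniform(0.1, 5) for _ in range(6)]
b = [random.uniform(0.1, 5) for _ in range(6)]
a_plus, b_plus = sum(a), sum(b)

lhs = kl_vec(a, b)
rhs = kl(a_plus, b_plus) + kl_vec(a, [(a_plus / b_plus) * bn for bn in b])
print(abs(lhs - rhs) < 1e-9)        # True: the identity holds
print(lhs >= kl(a_plus, b_plus))    # True: KL(a, b) >= KL(a_+, b_+)
```

The identity follows by expanding both sides; the inequality follows because the second KL term on the right is nonnegative.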

This lemma can be extended to obtain the following useful identity.

Lemma 2.

For f(z) and b(x, z) as above, and z and w in Z, with f(w) > 0, we have

$$KL(b(z), b(w)) = KL(f(z), f(w)) + KL\Big(b(z), \frac{f(z)}{f(w)}\,b(w)\Big). \tag{5}$$

Maximizing $H(z^k, z)$ is equivalent to minimizing

$$G(z^k, z) = KL(b(z^k), b(z)) - f(z), \tag{6}$$

where

$$KL(b(z^k), b(z)) = \int KL\big(b(x, z^k), b(x, z)\big)\,dx. \tag{7}$$

Therefore, since $G(z^k, z^{k+1}) \le G(z^k, z^k)$, we have

$$KL(b(z^k), b(z^{k+1})) - f(z^{k+1}) \le -f(z^k),$$

or

$$f(z^{k+1}) - f(z^k) \ge KL(b(z^k), b(z^{k+1})) \ge 0.$$

Consequently, the sequence $\{f(z^k)\}$ is increasing and bounded above by $f(z^*)$, so that it converges, and the sequence $\{KL(b(z^k), b(z^{k+1}))\}$ converges to zero. Without additional restrictions, we cannot conclude that $\{f(z^k)\}$ converges to $f(z^*)$.

We get $z^{k+1}$ by minimizing $G(z^k, z)$. When we minimize $G(z, z^{k+1})$, we get $z^{k+1}$ again. Therefore, we can put the NSEM algorithm into the alternating-minimization (AM) framework of Csiszár and Tusnády [12], to be discussed further in Sect. 11.

#### The Discrete Case

Again, the problem is to maximize a nonnegative function $f:Z\to\mathbb{R}$, where Z is an arbitrary set. As previously, we assume that there is $z^* \in Z$ with $f(z^*) \ge f(z)$, for all $z \in Z$. We also assume that there is a finite or countably infinite set B and a nonnegative function $b:B\times Z\to\mathbb{R}$ such that

$$f(z) = \sum_{x \in B} b(x, z).$$

Having found $z^k$, we maximize the function

$$H(z^k, z) = \sum_{x \in B} b(x, z^k)\log b(x, z) \tag{8}$$

to get $z^{k+1}$.

We again write $b(z)$ for the vector with entries $b(x, z)$. Maximizing $H(z^k, z)$ is equivalent to minimizing

$$G(z^k, z) = KL(b(z^k), b(z)) - f(z), \tag{9}$$

where

$$KL(b(z^k), b(z)) = \sum_{x \in B} KL\big(b(x, z^k), b(x, z)\big). \tag{10}$$

As previously, we find that the sequence $\{f(z^k)\}$ is increasing, and that the sequence $\{KL(b(z^k), b(z^{k+1}))\}$ converges to zero. Without additional restrictions, we cannot conclude that $\{f(z^k)\}$ converges to $f(z^*)$.

## 3 The Stochastic EM Algorithm

### The E-Step and M-Step

In statistical parameter estimation, one typically has an observable random vector Y taking values in $\mathbb{R}^N$ that is governed by a probability density function (pdf) or probability function (pf) of the form $f_Y(y\,|\,\theta)$, for some value of the parameter vector $\theta \in \Theta$, where $\Theta$ is the set of all legitimate values of θ. Our observed data consists of one realization y of Y; we do not exclude the possibility that the entries of y are independently obtained samples of a common real-valued random variable. The true vector of parameters is to be estimated by maximizing the likelihood function $L_y(\theta) = f_Y(y\,|\,\theta)$ over all $\theta \in \Theta$ to obtain a maximum likelihood estimate, $\theta_{ML}$.

To employ the EM algorithmic approach, it is assumed that there is another related random vector X, which we shall call the preferred data, such that, had we been able to obtain one realization x of X, maximizing the likelihood function $L_x(\theta) = f_X(x\,|\,\theta)$ would have been simpler than maximizing the likelihood function $L_y(\theta) = f_Y(y\,|\,\theta)$. Of course, we do not have a realization x of X. The basic idea of the EM approach is to estimate x using the current estimate of θ, denoted $\theta^k$, and to use each estimate $x^k$ of x to get the next estimate $\theta^{k+1}$.

The EM algorithm proceeds in two steps. Having selected the preferred data X, and having found $\theta^k$, we form the function of θ given by

$$Q(\theta\,|\,\theta^k) = E\big(\log f_X(x\,|\,\theta)\,\big|\,y, \theta^k\big); \tag{11}$$

this is the E-step of the EM algorithm. Then we maximize $Q(\theta\,|\,\theta^k)$ over all θ to get $\theta^{k+1}$; this is the M-step of the EM algorithm. In this way, the EM algorithm based on X generates a sequence $\{\theta^k\}$ of parameter vectors.

For the discrete case of probability functions, we have

$$Q(\theta\,|\,\theta^k) = \sum_{x} f_{X|Y}(x\,|\,y, \theta^k)\log f_X(x\,|\,\theta), \tag{12}$$

and for the continuous case of probability density functions, we have

$$Q(\theta\,|\,\theta^k) = \int f_{X|Y}(x\,|\,y, \theta^k)\log f_X(x\,|\,\theta)\,dx. \tag{13}$$
In decreasing order of importance and difficulty, the goals are these:

1. To have the sequence of parameters $\{\theta^k\}$ converging to $\theta_{ML}$;

2. To have the sequence of functions $\{f_X(x\,|\,\theta^k)\}$ converging to $f_X(x\,|\,\theta_{ML})$;

3. To have the sequence of numbers $\{L_y(\theta^k)\}$ converging to $L_y(\theta_{ML})$;

4. To have the sequence of numbers $\{L_y(\theta^k)\}$ non-decreasing.

Our focus here is mainly on the fourth goal, with some discussion of the third goal. We do present some examples for which all four goals are attained. Clearly, the first goal requires a topology on the set $\Theta$.

### Difficulties with the Conventional Formulation

In [1] we are told that

$$f_{X|Y}(x\,|\,y, \theta) = \frac{f_X(x\,|\,\theta)}{f_Y(y\,|\,\theta)}. \tag{14}$$

This is false; integrating with respect to x gives one on the left side and $1/f_Y(y\,|\,\theta)$ on the right side. Perhaps the equation is not meant to hold for all x, but just for some x. Indeed, if there is a function h such that Y = h(X), then Eq. (14) might hold for those x such that h(x) = y. This is what happens in the discrete case of probabilities; in that case we do have

$$f_Y(y\,|\,\theta) = \sum_{x \in h^{-1}\{y\}} f_X(x\,|\,\theta), \tag{15}$$

where

$$h^{-1}\{y\} = \{x\,|\,h(x) = y\}.$$

Consequently,

$$f_{X|Y}(x\,|\,y, \theta) = \frac{f_X(x\,|\,\theta)}{f_Y(y\,|\,\theta)}, \tag{16}$$

for $x \in h^{-1}\{y\}$, and zero otherwise. However, this modification of Eq. (14) fails in the continuous case of probability density functions, since $h^{-1}\{y\}$ is often a subset of zero measure. Even if the set $h^{-1}\{y\}$ has positive measure, integrating both sides of Eq. (14) over $x \in h^{-1}\{y\}$ tells us that $f_Y(y\,|\,\theta) = \int_{h^{-1}\{y\}} f_X(x\,|\,\theta)\,dx$, which need not hold for probability density functions.

### An Incorrect Proof

Everyone who works with the EM algorithm will say that the likelihood is non-decreasing for the EM algorithm. The proof of this fact usually proceeds as follows; we use the notation for the continuous case, but the proof for the discrete case is essentially the same. Use Eq. (14) to get

$$\log f_X(x\,|\,\theta) = \log f_{X|Y}(x\,|\,y, \theta) + \log f_Y(y\,|\,\theta). \tag{17}$$

Then replace the term $\log f_X(x\,|\,\theta)$ in Eq. (13) with the right side of Eq. (17), obtaining

$$Q(\theta\,|\,\theta^k) = \int f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta)\,dx + \log f_Y(y\,|\,\theta). \tag{18}$$

Jensen's Inequality tells us that

$$\int u(x)\log u(x)\,dx \ge \int u(x)\log v(x)\,dx, \tag{19}$$

for any probability density functions u(x) and v(x). Since $f_{X|Y}(x\,|\,y, \theta^k)$ is a probability density function, we have

$$\int f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta^k)\,dx \ge \int f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta)\,dx. \tag{20}$$

We conclude, therefore, that $-\int f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta)\,dx$ attains its minimum value at $\theta = \theta^k$. Then we have

$$\log f_Y(y\,|\,\theta^{k+1}) - \log f_Y(y\,|\,\theta^k) \ge Q(\theta^{k+1}\,|\,\theta^k) - Q(\theta^k\,|\,\theta^k) \ge 0. \tag{21}$$
This proof is incorrect; clearly it rests on the validity of Eq. (14), which is generally false. For the discrete case, with Y = h(X), this proof is valid, when we use Eq. (16), instead of Eq. (14). In all other cases, however, the proof is incorrect.
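
Jensen's Inequality (19) itself is not where the argument fails; it is the use of Eq. (14). In the discrete setting, (19) is just the nonnegativity of the KL distance between probability vectors, as this small sketch illustrates (the two probability vectors are arbitrary choices standing in for u and v):

```python
import math

u = [0.2, 0.5, 0.3]   # a probability vector, standing in for u(x)
v = [0.1, 0.6, 0.3]   # another probability vector, standing in for v(x)

lhs = sum(un * math.log(un) for un in u)                   # sum of u log u
rhs = sum(un * math.log(vn) for un, vn in zip(u, v))       # sum of u log v
print(lhs >= rhs)   # True; lhs - rhs = KL(u, v) >= 0, equal only when u = v
```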

### Acceptable Data

We turn now to the question of how to repair the incorrect proof. Equation (14) should read

$$f_{X|Y}(x\,|\,y, \theta) = \frac{f_{Y|X}(y\,|\,x, \theta)\, f_X(x\,|\,\theta)}{f_Y(y\,|\,\theta)}, \tag{22}$$

for all x. In order to replace $\log f_X(x\,|\,\theta)$ in Eq. (13), we write

$$f_{X,Y}(x, y\,|\,\theta) = f_{X|Y}(x\,|\,y, \theta)\, f_Y(y\,|\,\theta) \tag{23}$$

and

$$f_{X,Y}(x, y\,|\,\theta) = f_{Y|X}(y\,|\,x, \theta)\, f_X(x\,|\,\theta), \tag{24}$$

so that

$$\log f_X(x\,|\,\theta) = \log f_{X|Y}(x\,|\,y, \theta) + \log f_Y(y\,|\,\theta) - \log f_{Y|X}(y\,|\,x, \theta). \tag{25}$$

We say that the preferred data is acceptable if

$$f_{Y|X}(y\,|\,x, \theta) = f_{Y|X}(y\,|\,x); \tag{26}$$

that is, the dependence of Y on X is unrelated to the value of the parameter θ. This definition provides our generalization of the relationship Y = h(X).

When X is acceptable, the function $-\int f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta)\,dx$ again attains its minimum value at $\theta = \theta^k$, while the term involving $f_{Y|X}(y\,|\,x)$ no longer depends on θ. The assertion that the likelihood is non-decreasing then follows, using the same argument as in the previous incorrect proof.

## 4 The Discrete Case

In the discrete case, we assume that Y is a discrete random vector taking values in a finite or countably infinite set A, and governed by the probability function $f_Y(y\,|\,\theta)$. We assume, in addition, that there is a second discrete random vector X, taking values in a finite or countably infinite set B, and a function $h:B\to A$ such that Y = h(X). We define the set

$$h^{-1}\{y\} = \{x \in B\,|\,h(x) = y\}. \tag{27}$$

Then we have

$$f_Y(y\,|\,\theta) = \sum_{x \in h^{-1}\{y\}} f_X(x\,|\,\theta). \tag{28}$$

The conditional probability function for X, given Y = y, is

$$f_{X|Y}(x\,|\,y, \theta) = \frac{f_X(x\,|\,\theta)}{f_Y(y\,|\,\theta)}, \tag{29}$$

for $x \in h^{-1}\{y\}$, and zero otherwise. The so-called E-step of the EM algorithm is then to calculate

$$Q(\theta\,|\,\theta^k) = \sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta^k)\log f_X(x\,|\,\theta), \tag{30}$$

and the M-step is to maximize $Q(\theta\,|\,\theta^k)$ as a function of θ to obtain $\theta^{k+1}$.

Using Eq. (29) to write $\log f_X(x\,|\,\theta) = \log f_{X|Y}(x\,|\,y, \theta) + \log f_Y(y\,|\,\theta)$ for $x \in h^{-1}\{y\}$, we can write

$$Q(\theta\,|\,\theta^k) = \log f_Y(y\,|\,\theta) + \sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta). \tag{31}$$

Therefore,

$$\log f_Y(y\,|\,\theta) = Q(\theta\,|\,\theta^k) - \sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta).$$

Since

$$\sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta) = 1,$$

it follows from Jensen's Inequality that

$$\sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta^k) \ge \sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta).$$

Therefore, $-\sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta)$ attains its minimum at $\theta = \theta^k$. We have the following result.

Proposition 1.

The sequence $\{f_Y(y\,|\,\theta^k)\}$ is non-decreasing.

Proof.

We have

$$\log f_Y(y\,|\,\theta^{k+1}) - \log f_Y(y\,|\,\theta^k) \ge Q(\theta^{k+1}\,|\,\theta^k) - Q(\theta^k\,|\,\theta^k) \ge 0,$$

or

$$f_Y(y\,|\,\theta^{k+1}) \ge f_Y(y\,|\,\theta^k).$$

Let $\chi_{h^{-1}\{y\}}(x)$ be the characteristic function of the set $h^{-1}\{y\}$, that is,

$$\chi_{h^{-1}\{y\}}(x) = \begin{cases} 1, & x \in h^{-1}\{y\};\\ 0, & \text{otherwise}. \end{cases} \tag{32}$$

With the choices $z = \theta$, $f(z) = f_Y(y\,|\,\theta)$, and $b(x, z) = f_X(x\,|\,\theta)\chi_{h^{-1}\{y\}}(x)$, the discrete EM algorithm fits into the framework of the non-stochastic EM algorithm. Consequently, we see once again that the sequence $\{f_Y(y\,|\,\theta^k)\}$ is non-decreasing and also that the sequence

$$\{KL(b(\theta^k), b(\theta^{k+1}))\}$$

converges to zero.

## 5 Missing Data

We say that there is missing data if the preferred data X has the form X = (Y, W), so that Y = h(X), where h is the orthogonal projection onto the first component. The case of missing data for the discrete case is covered by the discussion in Sect. 4, so we consider here the continuous case in which probability density functions are involved.

Once again, the E-step is to calculate $Q(\theta\,|\,\theta^k)$ given by

$$Q(\theta\,|\,\theta^k) = E\big(\log f_X(x\,|\,\theta)\,\big|\,y, \theta^k\big). \tag{33}$$

Since X = (Y, W), we have

$$f_X(x\,|\,\theta) = f_{Y,W}(y, w\,|\,\theta). \tag{34}$$

Since the set $h^{-1}\{y\}$ has measure zero, we cannot write

$$f_Y(y\,|\,\theta) = \int_{h^{-1}\{y\}} f_X(x\,|\,\theta)\,dx; \tag{35}$$

the marginal pdf is, instead, $f_Y(y\,|\,\theta) = \int f_{Y,W}(y, w\,|\,\theta)\,dw$. Consequently, maximizing $Q(\theta\,|\,\theta^k)$ is equivalent to maximizing

$$\int f_{W|Y}(w\,|\,y, \theta^k)\log f_{Y,W}(y, w\,|\,\theta)\,dw.$$

With $f(\theta) = f_Y(y\,|\,\theta)$ and

$$b(w, \theta) = f_{Y,W}(y, w\,|\,\theta),$$

we find that maximizing $Q(\theta\,|\,\theta^k)$ is equivalent to minimizing $G(\theta^k, \theta) = KL(b(\theta^k), b(\theta)) - f(\theta)$. Therefore, the EM algorithm for the case of missing data falls into the framework of the non-stochastic EM algorithm. We conclude that the sequence $\{f(\theta^k)\}$ is non-decreasing and that the sequence $\{KL(b(\theta^k), b(\theta^{k+1}))\}$ converges to zero.

Most other instances of the continuous case in which we have Y = h(X) can be handled using the missing-data model. For example, suppose that $Z_1$ and $Z_2$ are uniformly distributed on the interval $[0, \theta]$, for some positive θ, and that $Y = Z_1 + Z_2$. We may, for example, then take W to be $Z_2$ and X = (Y, W) as the preferred data. We shall discuss these instances further in Sect. 7.

## 6 The Continuous Case

We turn now to the general continuous case. We have a random vector Y taking values in $\mathbb{R}^N$ and governed by the probability density function $f_Y(y\,|\,\theta)$. The objective, once again, is to maximize the likelihood function $L_y(\theta) = f_Y(y\,|\,\theta)$ to obtain the maximum likelihood estimate of θ.

### Acceptable Preferred Data

For the continuous case, the vector $\theta^{k+1}$ is obtained from $\theta^k$ by maximizing the conditional expected value

$$Q(\theta\,|\,\theta^k) = E\big(\log f_X(x\,|\,\theta)\,\big|\,y, \theta^k\big) = \int f_{X|Y}(x\,|\,y, \theta^k)\log f_X(x\,|\,\theta)\,dx. \tag{36}$$

Assuming the acceptability condition and using

$$f_{X,Y}(x, y\,|\,\theta) = f_{X|Y}(x\,|\,y, \theta)\, f_Y(y\,|\,\theta)$$

and

$$f_{X,Y}(x, y\,|\,\theta) = f_{Y|X}(y\,|\,x)\, f_X(x\,|\,\theta),$$

we find that maximizing $Q(\theta\,|\,\theta^k)$ is equivalent to minimizing

$$G(\theta^k, \theta) = KL(b(\theta^k), b(\theta)) - f(\theta), \tag{37}$$

where $f(\theta) = f_Y(y\,|\,\theta)$ and $b(\theta) = b(x, \theta) = f_{Y|X}(y\,|\,x)\, f_X(x\,|\,\theta)$. With these choices, the problem fits the framework of the non-stochastic EM algorithm. Once again, we may conclude that the likelihood function is non-decreasing and that the sequence $\{KL(b(\theta^k), b(\theta^{k+1}))\}$ converges to zero.

In the discrete case in which Y = h(X), the conditional probability $f_{Y|X}(y\,|\,x, \theta)$ is, as a function of y, for given x, equal to one when y = h(x) and zero otherwise, and is the characteristic function of the set $h^{-1}(y)$, as a function of x, for given y. Therefore, we can write $f_{X|Y}(x\,|\,y, \theta)$ using Eq. (16). For the continuous case in which Y = h(X), the pdf $f_{Y|X}(y\,|\,x, \theta)$ is again a delta function of y, for given x; the difficulty arises when we need to view this as a function of x, for given y. The acceptability property helps us avoid this difficulty.

When X is acceptable, we have

$$f_{X|Y}(x\,|\,y, \theta) = \frac{f_{Y|X}(y\,|\,x)\, f_X(x\,|\,\theta)}{f_Y(y\,|\,\theta)}, \tag{38}$$

whenever $f_Y(y\,|\,\theta) \neq 0$, and zero otherwise. Consequently, when X is acceptable, we have a kernel model for $f_Y(y\,|\,\theta)$ in terms of the $f_X(x\,|\,\theta)$:

$$f_Y(y\,|\,\theta) = \int f_{Y|X}(y\,|\,x)\, f_X(x\,|\,\theta)\,dx; \tag{39}$$

for the continuous case we view this as a corrected version of Eq. (15). In the discrete case the integral is replaced by a summation, of course, but when we are speaking generally about either case, we shall use the integral sign.
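
In the discrete case, the kernel model (39) is simply matrix-vector multiplication: a matrix of transition probabilities $f_{Y|X}(y\,|\,x)$, independent of θ, with columns summing to one, maps the probability vector $f_X(\cdot\,|\,\theta)$ to $f_Y(\cdot\,|\,\theta)$. A small sketch (the kernel `K` and the θ-parameterized family `f_X` here are invented purely for illustration):

```python
# K[y][x] = f_{Y|X}(y | x); each column sums to one, and K does not depend on theta.
K = [[0.7, 0.2, 0.0],
     [0.3, 0.5, 0.4],
     [0.0, 0.3, 0.6]]

def f_X(theta):
    """A toy family of probability vectors on x in {0, 1, 2}."""
    p = [1.0, theta, theta ** 2]
    s = sum(p)
    return [pn / s for pn in p]

def f_Y(theta):
    """Kernel model: f_Y(y | theta) = sum_x f_{Y|X}(y | x) f_X(x | theta)."""
    fx = f_X(theta)
    return [sum(K[y][x] * fx[x] for x in range(3)) for y in range(3)]

fy = f_Y(0.5)
print(abs(sum(fy) - 1.0) < 1e-12)   # True: f_Y is again a probability vector
```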

The acceptability of the missing data W is used in [9], but more for computational convenience and to involve the Kullback-Leibler distance in the formulation of the EM algorithm. It is not necessary that W be acceptable in order for likelihood to be non-decreasing, as we have seen.

### Selecting Preferred Data

The popular example of multinomial data given below illustrates well the point that one can often choose to view the observed data as “incomplete” simply in order to introduce “complete” data that makes the calculations simpler, even when there is no suggestion, in the original problem, that the observed data is in any way inadequate or “incomplete.” It is in order to emphasize this desire for simplification that we refer to X as the preferred data, not the complete data.

In some applications, the preferred data X arises naturally from the problem, while in other cases the user must imagine preferred data. This choice in selecting the preferred data can be helpful in speeding up the algorithm (see [10]).
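
A standard illustration of imagined preferred data, treated at length in [1], is the genetic-linkage multinomial with observed counts y = (125, 18, 20, 34) and cell probabilities $(1/2 + \theta/4,\ (1-\theta)/4,\ (1-\theta)/4,\ \theta/4)$. Splitting the first cell into two, with probabilities $1/2$ and $\theta/4$, gives preferred data for which the M-step has a closed form; the sketch below runs this EM iteration and checks that the likelihood never decreases:

```python
import math

y1, y2, y3, y4 = 125, 18, 20, 34   # the classic observed multinomial counts

def log_likelihood(theta):
    """Observed-data log likelihood for the linkage model."""
    return (y1 * math.log(0.5 + theta / 4)
            + (y2 + y3) * math.log((1 - theta) / 4)
            + y4 * math.log(theta / 4))

theta = 0.5                         # initial guess
for k in range(20):
    # E-step: expected count in the theta/4 part of the split first cell
    x2 = y1 * (theta / 4) / (0.5 + theta / 4)
    # M-step: closed-form maximizer of the complete-data log likelihood
    new_theta = (x2 + y4) / (x2 + y2 + y3 + y4)
    assert log_likelihood(new_theta) >= log_likelihood(theta) - 1e-12
    theta = new_theta

print(round(theta, 4))   # approximately 0.6268
```

The iterates converge quickly to the well-known value $\theta_{ML} \approx 0.6268$, and the in-loop assertion confirms the monotonicity guaranteed by the theory.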

It is not always necessary to maximize $Q(\theta\,|\,\theta^k)$ exactly; at each M-step, we may simply select $\theta^{k+1}$ so that $Q(\theta^{k+1}\,|\,\theta^k) \ge Q(\theta^k\,|\,\theta^k)$, which is enough to keep the likelihood non-decreasing.

### Preferred Data as Missing Data

As we have seen, when the EM algorithm is applied to the missing-data model, the likelihood is non-decreasing, which suggests that, for an arbitrary preferred data X, we could imagine X as W, the missing data, and imagine applying the EM algorithm to Z = (Y, X). This approach would produce an EM sequence of parameter vectors for which likelihood is non-decreasing, but it need not be the same sequence as obtained by applying the EM algorithm to X directly. It is the same sequence, provided that X is acceptable. We are not suggesting that applying the EM algorithm to Z = (Y, X) would simplify calculations.

We know that, when the missing-data model is used and the M-step is defined as maximizing the function in (35), the likelihood is non-decreasing. It would seem then that for any choice of preferred data X, we could view this data as missing and take as our complete data the pair Z = (Y, X), with X now playing the role of W. Maximizing the function in (35) is then equivalent to maximizing

$$\int f_{X|Y}(x\,|\,y, \theta^k)\log f_{Y,X}(y, x\,|\,\theta)\,dx \tag{40}$$

to get $\theta^{k+1}$. It then follows that $f_Y(y\,|\,\theta^{k+1}) \ge f_Y(y\,|\,\theta^k)$. The obvious question is whether or not these two functions given in (11) and (40) have the same maximizers.

For acceptable X we have

$$\log f_{Y,X}(y, x\,|\,\theta) = \log f_{Y|X}(y\,|\,x) + \log f_X(x\,|\,\theta), \tag{41}$$

so the two functions given in (11) and (40) do have the same maximizers, since the term $\log f_{Y|X}(y\,|\,x)$ does not involve θ. It follows once again that whenever the preferred data is acceptable, we have $f_Y(y\,|\,\theta^{k+1}) \ge f_Y(y\,|\,\theta^k)$. Without additional assumptions, however, we cannot conclude that $\{\theta^k\}$ converges to $\theta_{ML}$, nor that $\{f_Y(y\,|\,\theta^k)\}$ converges to $f_Y(y\,|\,\theta_{ML})$.

## 7 The Continuous Case with Y = h(X)

In this section we consider the continuous case in which the observed random vector Y takes values in $\mathbb{R}^N$; the preferred random vector X takes values in $\mathbb{R}^M$; the random vectors are governed by probability density functions $f_Y(y\,|\,\theta)$ and $f_X(x\,|\,\theta)$, respectively; and there is a function $h:\mathbb{R}^M\to\mathbb{R}^N$ such that Y = h(X). In most cases, M > N and $h^{-1}\{y\}$ has measure zero in $\mathbb{R}^M$.

### An Example

For example, suppose that $Z_1$ and $Z_2$ are independent and uniformly distributed on the interval $[0, \theta]$, for some θ > 0 to be estimated. Let $Y = Z_1 + Z_2$. With $Z = (Z_1, Z_2)$, and $h:\mathbb{R}^2\to\mathbb{R}$ given by $h(z_1, z_2) = z_1 + z_2$, we have Y = h(Z). The pdf for Z is

$$f_Z(z_1, z_2\,|\,\theta) = \frac{1}{\theta^2}, \quad \text{for } 0 \le z_1 \le \theta,\ 0 \le z_2 \le \theta, \tag{42}$$

and zero otherwise. The pdf for Y is

$$f_Y(y\,|\,\theta) = \begin{cases} y/\theta^2, & 0 \le y \le \theta;\\ (2\theta - y)/\theta^2, & \theta \le y \le 2\theta, \end{cases} \tag{43}$$

and zero otherwise.
It is not the case that

$$f_Y(y\,|\,\theta) = \int_{h^{-1}\{y\}} f_Z(z\,|\,\theta)\,dz, \tag{44}$$

since $h^{-1}\{y\}$ has measure zero in $\mathbb{R}^2$.

The likelihood function is $f_Y(y\,|\,\theta)$, viewed as a function of θ, and is given by

$$L_y(\theta) = \begin{cases} (2\theta - y)/\theta^2, & y/2 \le \theta \le y;\\ y/\theta^2, & \theta \ge y, \end{cases} \tag{45}$$

and zero otherwise. Therefore, the maximum likelihood estimate of θ is $\theta_{ML} = y$.
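
The claim $\theta_{ML} = y$ is easy to check numerically: evaluate the likelihood (45) over a grid of θ values for a fixed observation y (a quick sketch):

```python
def likelihood(theta, y):
    """Triangular density of Y = Z1 + Z2, Z_i ~ Uniform[0, theta], as a function of theta."""
    if 0 <= y <= theta:
        return y / theta ** 2
    if theta < y <= 2 * theta:
        return (2 * theta - y) / theta ** 2
    return 0.0

y = 1.3
thetas = [0.01 * k for k in range(66, 300)]   # grid of theta values in [0.66, 2.99]
best = max(thetas, key=lambda t: likelihood(t, y))
print(abs(best - y) < 0.011)   # True: the likelihood peaks at theta = y
```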

Instead of using Z as our preferred data, suppose that we define the random variable $W = Z_2$, and let X = (Y, W), a missing-data model. We then have Y = h(X), where $h:\mathbb{R}^2\to\mathbb{R}$ is given by $h(x_1, x_2) = x_1$. The pdf for Y given in Eq. (43) can be written as

$$f_Y(y\,|\,\theta) = \int f_{Y,W}(y, w\,|\,\theta)\,dw. \tag{46}$$

The joint pdf is

$$f_{Y,W}(y, w\,|\,\theta) = f_Z(y - w, w\,|\,\theta) = \frac{1}{\theta^2}, \quad \text{for } \max(0, y - \theta) \le w \le \min(y, \theta), \tag{47}$$

and zero otherwise.
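
As a check on Eqs. (46) and (47): integrating the joint pdf over w should recover the triangular density (43). A numerical sketch, using a simple Riemann sum:

```python
def f_joint(y, w, theta):
    """Eq. (47): f_{Y,W}(y, w | theta) = 1/theta^2 on max(0, y - theta) <= w <= min(y, theta)."""
    if max(0.0, y - theta) <= w <= min(y, theta):
        return 1.0 / theta ** 2
    return 0.0

def f_Y_tri(y, theta):
    """Eq. (43): the triangular density of Y = Z1 + Z2."""
    if 0 <= y <= theta:
        return y / theta ** 2
    if theta < y <= 2 * theta:
        return (2 * theta - y) / theta ** 2
    return 0.0

theta, y = 2.0, 1.5
# Riemann sum approximating the integral over w in Eq. (46)
dw = 1e-4
integral = sum(f_joint(y, k * dw, theta) for k in range(int(2 * theta / dw))) * dw
print(abs(integral - f_Y_tri(y, theta)) < 1e-3)   # True
```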

### Censored Exponential Data

McLachlan and Krishnan [1] give the following example of a likelihood maximization problem involving probability density functions. This example provides a good illustration of the usefulness of the missing-data model.

Suppose that Z is the time until failure of a component, which we assume is governed by the exponential distribution

$$f(z\,|\,\theta) = \frac{1}{\theta} e^{-z/\theta}, \quad z \ge 0, \tag{48}$$

where the parameter θ > 0 is the expected time until failure.
where the parameter θ > 0 is the expected time until failure. We observe a random sample of N components and record their failure times, z n . On the basis of this data, we must estimate θ, the mean time until failure.

It may well happen, however, that during the time allotted for observing the components, only r of the N components fail, which, for convenience, are taken to be the first r items in the record. Rather than wait longer, we record the failure times of those that failed and record the elapsed time for the experiment, say T, for those that had not yet failed. The censored data is then $y = (y_1, \ldots, y_N)$, where $y_n = z_n$ is the time until failure for $n = 1, \ldots, r$, and $y_n = T$ for $n = r + 1, \ldots, N$. The censored data is reasonably viewed as incomplete, relative to the complete data we would have had, had the trial lasted until all the components had failed.

Since the probability that a component will survive until time T is $e^{-T/\theta}$, the pdf for the vector y is

$$f_Y(y\,|\,\theta) = \theta^{-r}\exp\Big(-\sum_{n=1}^{r} y_n/\theta\Big)\exp\big(-(N - r)T/\theta\big), \tag{49}$$

and the log likelihood function for the censored, or incomplete, data is

$$\log L_y(\theta) = -r\log\theta - \frac{1}{\theta}\Big(\sum_{n=1}^{r} y_n + (N - r)T\Big). \tag{50}$$

In this particular example, we are fortunate, in that we can maximize $f_Y(y\,|\,\theta)$ easily, and find that the ML solution based on the incomplete, censored data is

$$\theta_{ML} = \frac{1}{r}\Big(\sum_{n=1}^{r} y_n + (N - r)T\Big). \tag{51}$$
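
The closed form (51) is easy to confirm numerically, and the EM iteration for this model is also simple: by the memorylessness of the exponential distribution, the E-step replaces each censored time by $E[Z\,|\,Z > T, \theta^k] = T + \theta^k$, and the M-step averages. A sketch (the simulated failure times are for illustration only):

```python
import math, random

random.seed(0)
theta_true, N, T = 3.0, 200, 2.5
z = [random.expovariate(1.0 / theta_true) for _ in range(N)]
failed = [zn for zn in z if zn <= T]    # observed (uncensored) failure times
r = len(failed)
S = sum(failed)

# Eq. (51): ML estimate from the censored data
theta_ml = (S + (N - r) * T) / r

def log_likelihood(theta):
    """Eq. (50): censored-data log likelihood."""
    return -r * math.log(theta) - (S + (N - r) * T) / theta

# theta_ml maximizes (50): nearby values give a smaller log likelihood
assert all(log_likelihood(theta_ml) >= log_likelihood(theta_ml + d)
           for d in (-0.5, -0.1, 0.1, 0.5))

# EM iteration: E-step fills in each censored time as T + theta_k; M-step averages
theta_k = 1.0
for k in range(100):
    theta_k = (S + (N - r) * (T + theta_k)) / N
print(abs(theta_k - theta_ml) < 1e-8)   # True: EM converges to the censored-data MLE
```

The EM update is a contraction with factor $(N - r)/N < 1$, and its fixed point is exactly the estimate in (51).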
In most cases in which our data is incomplete, finding the ML estimate from the incomplete data is difficult, while finding it for the complete data is relatively easy.

We say that the missing data are the times until failure of those components that did not fail during the observation time. The preferred data is the complete data of actual times until failure. The pdf for the preferred data X is

$$f_X(x\,|\,\theta) = \theta^{-N}\exp\Big(-\sum_{n=1}^{N} x_n/\theta\Big), \tag{52}$$

and the log likelihood function based on the complete data is

$$\log L_x(\theta) = -N\log\theta - \frac{1}{\theta}\sum_{n=1}^{N} x_n. \tag{53}$$
The ML estimate of θ