# Algorithms from a Non-stochastic Perspective

The problem is to maximize a nonnegative function $f:Z\to\mathbb{R}$, where Z is an arbitrary set. We assume that there is $z^* \in Z$ with $f(z^*) \ge f(z)$, for all $z \in Z$. We also assume that there is a nonnegative function $b(x, z)$ such that

$$f(z) = \int b(x, z)\,dx.$$

Having found $z^k$, we maximize the function

$$H(z^k, z) = \int b(x, z^k)\log b(x, z)\,dx \tag{1}$$

to get $z^{k+1}$. Adopting such an iterative approach presupposes that maximizing $H(z^k, z)$ is simpler than maximizing $f(z)$ itself. This is the case with the EM algorithm.

The cross-entropy or Kullback-Leibler distance [7] is a useful tool for analyzing the EM algorithm. For positive numbers u and v, the Kullback-Leibler distance from u to v is

$$KL(u, v) = u\log\frac{u}{v} + v - u. \tag{2}$$

We also define KL(0, 0) = 0, KL(0, v) = v, and KL(u, 0) = +∞. The KL distance is extended to nonnegative vectors component-wise, so that for nonnegative vectors a and b, we have

$$KL(a, b) = \sum_{n=1}^{N} KL(a_n, b_n). \tag{3}$$
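
These definitions translate directly into code; here is a minimal sketch (the function names `kl` and `kl_vec` are ours, chosen for illustration):

```python
import math

def kl(u, v):
    """Kullback-Leibler distance KL(u, v) = u*log(u/v) + v - u, for u, v >= 0."""
    if u == 0.0:
        return v          # KL(0, v) = v, which also covers KL(0, 0) = 0
    if v == 0.0:
        return math.inf   # KL(u, 0) = +infinity for u > 0
    return u * math.log(u / v) + v - u

def kl_vec(a, b):
    """Component-wise extension to nonnegative vectors, as in Eq. (3)."""
    return sum(kl(an, bn) for an, bn in zip(a, b))

# KL(u, v) >= 0, with equality only when u = v:
print(kl(2.0, 2.0))            # 0.0
print(kl_vec([1, 2], [2, 1]))  # strictly positive
```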
One of the most useful and easily proved facts about the KL distance is contained in the following lemma; we simplify the notation by writing $b(z)$ for the vector with entries $b(x, z)$.

Lemma 1.

For nonnegative vectors a and b, with $a_+ = \sum_{n=1}^{N} a_n > 0$ and $b_+ = \sum_{n=1}^{N} b_n > 0$, we have

$$KL(a, b) = KL(a_+, b_+) + KL\Big(a, \frac{a_+}{b_+}\,b\Big) \ge KL(a_+, b_+). \tag{4}$$
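
Lemma 1 is easy to check numerically. Reading the lemma with $a_+ = \sum_n a_n$ and $b_+ = \sum_n b_n$, the following sketch verifies the identity $KL(a,b) = KL(a_+, b_+) + KL(a, (a_+/b_+)\,b)$, and the resulting inequality, on random nonnegative vectors:

```python
import math, random

def kl(u, v):
    """KL(u, v) = u*log(u/v) + v - u for u, v >= 0."""
    if u == 0.0:
        return v
    if v == 0.0:
        return math.inf
    return u * math.log(u / v) + v - u

def kl_vec(a, b):
    """Component-wise extension to nonnegative vectors."""
    return sum(kl(an, bn) for an, bn in zip(a, b))

random.seed(1)
a = [random.uniform(0.1, 5) for _ in range(6)]
b = [random.uniform(0.1, 5) for _ in range(6)]
a_plus, b_plus = sum(a), sum(b)

lhs = kl_vec(a, b)
rhs = kl(a_plus, b_plus) + kl_vec(a, [(a_plus / b_plus) * bn for bn in b])
print(abs(lhs - rhs) < 1e-9)        # True: the identity holds
print(lhs >= kl(a_plus, b_plus))    # True: KL(a, b) >= KL(a_+, b_+)
```

The identity follows by expanding both sides; the inequality follows because the second KL term on the right is nonnegative.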

This lemma can be extended to obtain the following useful identity.

Lemma 2.

For f(z) and b(x, z) as above, and z and w in Z, with f(w) > 0, we have

$$KL(b(z), b(w)) = KL(f(z), f(w)) + KL\Big(b(z), \frac{f(z)}{f(w)}\,b(w)\Big). \tag{5}$$

Maximizing $H(z^k, z)$ is equivalent to minimizing

$$G(z^k, z) = KL(b(z^k), b(z)) - f(z), \tag{6}$$

where

$$KL(b(z^k), b(z)) = \int KL\big(b(x, z^k), b(x, z)\big)\,dx. \tag{7}$$

Therefore, since $G(z^k, z^{k+1}) \le G(z^k, z^k)$, we have

$$KL(b(z^k), b(z^{k+1})) - f(z^{k+1}) \le -f(z^k),$$

or

$$f(z^{k+1}) - f(z^k) \ge KL(b(z^k), b(z^{k+1})) \ge 0.$$

Consequently, the sequence $\{f(z^k)\}$ is increasing and bounded above by $f(z^*)$, so that it converges, and the sequence $\{KL(b(z^k), b(z^{k+1}))\}$ converges to zero. Without additional restrictions, we cannot conclude that $\{f(z^k)\}$ converges to $f(z^*)$.

We get $z^{k+1}$ by minimizing $G(z^k, z)$. When we minimize $G(z, z^{k+1})$, we get $z^{k+1}$ again. Therefore, we can put the NSEM algorithm into the alternating-minimization (AM) framework of Csiszár and Tusnády [12], to be discussed further in Sect. 11.

#### The Discrete Case

Again, the problem is to maximize a nonnegative function $f:Z\to\mathbb{R}$, where Z is an arbitrary set. As previously, we assume that there is $z^* \in Z$ with $f(z^*) \ge f(z)$, for all $z \in Z$. We also assume that there is a finite or countably infinite set B and a nonnegative function $b:B\times Z\to\mathbb{R}$ such that

$$f(z) = \sum_{x \in B} b(x, z).$$

Having found $z^k$, we maximize the function

$$H(z^k, z) = \sum_{x \in B} b(x, z^k)\log b(x, z) \tag{8}$$

to get $z^{k+1}$.

We again write $b(z)$ for the vector with entries $b(x, z)$. Maximizing $H(z^k, z)$ is equivalent to minimizing

$$G(z^k, z) = KL(b(z^k), b(z)) - f(z), \tag{9}$$

where

$$KL(b(z^k), b(z)) = \sum_{x \in B} KL\big(b(x, z^k), b(x, z)\big). \tag{10}$$

As previously, we find that the sequence $\{f(z^k)\}$ is increasing, and that the sequence $\{KL(b(z^k), b(z^{k+1}))\}$ converges to zero. Without additional restrictions, we cannot conclude that $\{f(z^k)\}$ converges to $f(z^*)$.

## 3 The Stochastic EM Algorithm

### The E-Step and M-Step

In statistical parameter estimation, one typically has an observable random vector Y taking values in $\mathbb{R}^N$ that is governed by a probability density function (pdf) or probability function (pf) of the form $f_Y(y\,|\,\theta)$, for some value of the parameter vector $\theta \in \Theta$, where $\Theta$ is the set of all legitimate values of θ. Our observed data consists of one realization y of Y; we do not exclude the possibility that the entries of y are independently obtained samples of a common real-valued random variable. The true vector of parameters is to be estimated by maximizing the likelihood function $L_y(\theta) = f_Y(y\,|\,\theta)$ over all $\theta \in \Theta$ to obtain a maximum likelihood estimate, $\theta_{ML}$.

To employ the EM algorithmic approach, it is assumed that there is another related random vector X, which we shall call the preferred data, such that, had we been able to obtain one realization x of X, maximizing the likelihood function $L_x(\theta) = f_X(x\,|\,\theta)$ would have been simpler than maximizing the likelihood function $L_y(\theta) = f_Y(y\,|\,\theta)$. Of course, we do not have a realization x of X. The basic idea of the EM approach is to estimate x using the current estimate of θ, denoted $\theta^k$, and to use each estimate $x^k$ of x to get the next estimate $\theta^{k+1}$.

The EM algorithm proceeds in two steps. Having selected the preferred data X, and having found $\theta^k$, we form the function of θ given by

$$Q(\theta\,|\,\theta^k) = E\big(\log f_X(x\,|\,\theta)\,\big|\,y, \theta^k\big); \tag{11}$$

this is the E-step of the EM algorithm. Then we maximize $Q(\theta\,|\,\theta^k)$ over all θ to get $\theta^{k+1}$; this is the M-step of the EM algorithm. In this way, the EM algorithm based on X generates a sequence $\{\theta^k\}$ of parameter vectors.

For the discrete case of probability functions, we have

$$Q(\theta\,|\,\theta^k) = \sum_{x} f_{X|Y}(x\,|\,y, \theta^k)\log f_X(x\,|\,\theta), \tag{12}$$

and for the continuous case of probability density functions, we have

$$Q(\theta\,|\,\theta^k) = \int f_{X|Y}(x\,|\,y, \theta^k)\log f_X(x\,|\,\theta)\,dx. \tag{13}$$
In decreasing order of importance and difficulty, the goals are these:

1. To have the sequence of parameters $\{\theta^k\}$ converging to $\theta_{ML}$;

2. To have the sequence of functions $\{f_X(x\,|\,\theta^k)\}$ converging to $f_X(x\,|\,\theta_{ML})$;

3. To have the sequence of numbers $\{L_y(\theta^k)\}$ converging to $L_y(\theta_{ML})$;

4. To have the sequence of numbers $\{L_y(\theta^k)\}$ non-decreasing.

Our focus here is mainly on the fourth goal, with some discussion of the third goal. We do present some examples for which all four goals are attained. Clearly, the first goal requires a topology on the set $\Theta$.

### Difficulties with the Conventional Formulation

In [1] we are told that

$$f_{X|Y}(x\,|\,y, \theta) = \frac{f_X(x\,|\,\theta)}{f_Y(y\,|\,\theta)}. \tag{14}$$

This is false; integrating with respect to x gives one on the left side and $1/f_Y(y\,|\,\theta)$ on the right side. Perhaps the equation is not meant to hold for all x, but just for some x. Indeed, if there is a function h such that Y = h(X), then Eq. (14) might hold for those x such that h(x) = y. This is what happens in the discrete case of probabilities; in that case we do have

$$f_Y(y\,|\,\theta) = \sum_{x \in h^{-1}\{y\}} f_X(x\,|\,\theta), \tag{15}$$

where

$$h^{-1}\{y\} = \{x\,|\,h(x) = y\}.$$

Consequently,

$$f_{X|Y}(x\,|\,y, \theta) = \frac{f_X(x\,|\,\theta)}{f_Y(y\,|\,\theta)}, \tag{16}$$

for $x \in h^{-1}\{y\}$, and zero otherwise. However, this modification of Eq. (14) fails in the continuous case of probability density functions, since $h^{-1}\{y\}$ is often a subset of zero measure. Even if the set $h^{-1}\{y\}$ has positive measure, integrating both sides of Eq. (14) over $x \in h^{-1}\{y\}$ tells us that $f_Y(y\,|\,\theta) = \int_{h^{-1}\{y\}} f_X(x\,|\,\theta)\,dx$, which need not hold for probability density functions.

### An Incorrect Proof

Everyone who works with the EM algorithm will say that the likelihood is non-decreasing for the EM algorithm. The proof of this fact usually proceeds as follows; we use the notation for the continuous case, but the proof for the discrete case is essentially the same. Use Eq. (14) to get

$$\log f_X(x\,|\,\theta) = \log f_{X|Y}(x\,|\,y, \theta) + \log f_Y(y\,|\,\theta). \tag{17}$$

Then replace the term $\log f_X(x\,|\,\theta)$ in Eq. (13) with the right side of Eq. (17), obtaining

$$Q(\theta\,|\,\theta^k) = \int f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta)\,dx + \log f_Y(y\,|\,\theta). \tag{18}$$

Jensen's Inequality tells us that

$$\int u(x)\log u(x)\,dx \ge \int u(x)\log v(x)\,dx, \tag{19}$$

for any probability density functions u(x) and v(x). Since $f_{X|Y}(x\,|\,y, \theta^k)$ is a probability density function, we have

$$\int f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta^k)\,dx \ge \int f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta)\,dx. \tag{20}$$

We conclude, therefore, that $-\int f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta)\,dx$ attains its minimum value at $\theta = \theta^k$. Then we have

$$\log f_Y(y\,|\,\theta^{k+1}) - \log f_Y(y\,|\,\theta^k) \ge Q(\theta^{k+1}\,|\,\theta^k) - Q(\theta^k\,|\,\theta^k) \ge 0. \tag{21}$$
This proof is incorrect; clearly it rests on the validity of Eq. (14), which is generally false. For the discrete case, with Y = h(X), this proof is valid, when we use Eq. (16), instead of Eq. (14). In all other cases, however, the proof is incorrect.
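
Jensen's Inequality (19) itself is not where the argument fails; it is the use of Eq. (14). In the discrete setting, (19) is just the nonnegativity of the KL distance between probability vectors, as this small sketch illustrates (the two probability vectors are arbitrary choices standing in for u and v):

```python
import math

u = [0.2, 0.5, 0.3]   # a probability vector, standing in for u(x)
v = [0.1, 0.6, 0.3]   # another probability vector, standing in for v(x)

lhs = sum(un * math.log(un) for un in u)                   # sum of u log u
rhs = sum(un * math.log(vn) for un, vn in zip(u, v))       # sum of u log v
print(lhs >= rhs)   # True; lhs - rhs = KL(u, v) >= 0, equal only when u = v
```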

### Acceptable Data

We turn now to the question of how to repair the incorrect proof. Equation (14) should read

$$f_{X|Y}(x\,|\,y, \theta) = \frac{f_{Y|X}(y\,|\,x, \theta)\, f_X(x\,|\,\theta)}{f_Y(y\,|\,\theta)}, \tag{22}$$

for all x. In order to replace $\log f_X(x\,|\,\theta)$ in Eq. (13), we write

$$f_{X,Y}(x, y\,|\,\theta) = f_{X|Y}(x\,|\,y, \theta)\, f_Y(y\,|\,\theta) \tag{23}$$

and

$$f_{X,Y}(x, y\,|\,\theta) = f_{Y|X}(y\,|\,x, \theta)\, f_X(x\,|\,\theta), \tag{24}$$

so that

$$\log f_X(x\,|\,\theta) = \log f_{X|Y}(x\,|\,y, \theta) + \log f_Y(y\,|\,\theta) - \log f_{Y|X}(y\,|\,x, \theta). \tag{25}$$

We say that the preferred data is acceptable if

$$f_{Y|X}(y\,|\,x, \theta) = f_{Y|X}(y\,|\,x); \tag{26}$$

that is, the dependence of Y on X is unrelated to the value of the parameter θ. This definition provides our generalization of the relationship Y = h(X).

When X is acceptable, the function $-\int f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta)\,dx$ again attains its minimum value at $\theta = \theta^k$, while the term involving $f_{Y|X}(y\,|\,x)$ no longer depends on θ. The assertion that the likelihood is non-decreasing then follows, using the same argument as in the previous incorrect proof.

## 4 The Discrete Case

In the discrete case, we assume that Y is a discrete random vector taking values in a finite or countably infinite set A, and governed by the probability function $f_Y(y\,|\,\theta)$. We assume, in addition, that there is a second discrete random vector X, taking values in a finite or countably infinite set B, and a function $h:B\to A$ such that Y = h(X). We define the set

$$h^{-1}\{y\} = \{x \in B\,|\,h(x) = y\}. \tag{27}$$

Then we have

$$f_Y(y\,|\,\theta) = \sum_{x \in h^{-1}\{y\}} f_X(x\,|\,\theta). \tag{28}$$

The conditional probability function for X, given Y = y, is

$$f_{X|Y}(x\,|\,y, \theta) = \frac{f_X(x\,|\,\theta)}{f_Y(y\,|\,\theta)}, \tag{29}$$

for $x \in h^{-1}\{y\}$, and zero otherwise. The so-called E-step of the EM algorithm is then to calculate

$$Q(\theta\,|\,\theta^k) = \sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta^k)\log f_X(x\,|\,\theta), \tag{30}$$

and the M-step is to maximize $Q(\theta\,|\,\theta^k)$ as a function of θ to obtain $\theta^{k+1}$.

Using Eq. (29) to write $\log f_X(x\,|\,\theta) = \log f_{X|Y}(x\,|\,y, \theta) + \log f_Y(y\,|\,\theta)$ for $x \in h^{-1}\{y\}$, we can write

$$Q(\theta\,|\,\theta^k) = \log f_Y(y\,|\,\theta) + \sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta). \tag{31}$$

Therefore,

$$\log f_Y(y\,|\,\theta) = Q(\theta\,|\,\theta^k) - \sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta).$$

Since

$$\sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta) = 1,$$

it follows from Jensen's Inequality that

$$\sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta^k) \ge \sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta).$$

Therefore, $-\sum_{x \in h^{-1}\{y\}} f_{X|Y}(x\,|\,y, \theta^k)\log f_{X|Y}(x\,|\,y, \theta)$ attains its minimum at $\theta = \theta^k$. We have the following result.

Proposition 1.

The sequence $\{f_Y(y\,|\,\theta^k)\}$ is non-decreasing.

Proof.

We have

$$\log f_Y(y\,|\,\theta^{k+1}) - \log f_Y(y\,|\,\theta^k) \ge Q(\theta^{k+1}\,|\,\theta^k) - Q(\theta^k\,|\,\theta^k) \ge 0,$$

or

$$f_Y(y\,|\,\theta^{k+1}) \ge f_Y(y\,|\,\theta^k).$$

Let $\chi_{h^{-1}\{y\}}(x)$ be the characteristic function of the set $h^{-1}\{y\}$, that is,

$$\chi_{h^{-1}\{y\}}(x) = \begin{cases} 1, & x \in h^{-1}\{y\};\\ 0, & \text{otherwise}. \end{cases} \tag{32}$$

With the choices $z = \theta$, $f(z) = f_Y(y\,|\,\theta)$, and $b(x, z) = f_X(x\,|\,\theta)\chi_{h^{-1}\{y\}}(x)$, the discrete EM algorithm fits into the framework of the non-stochastic EM algorithm. Consequently, we see once again that the sequence $\{f_Y(y\,|\,\theta^k)\}$ is non-decreasing and also that the sequence

$$\{KL(b(\theta^k), b(\theta^{k+1}))\}$$

converges to zero.

## 5 Missing Data

We say that there is missing data if the preferred data X has the form X = (Y, W), so that Y = h(X), where h is the orthogonal projection onto the first component. The case of missing data for the discrete case is covered by the discussion in Sect. 4, so we consider here the continuous case in which probability density functions are involved.

Once again, the E-step is to calculate $Q(\theta\,|\,\theta^k)$ given by

$$Q(\theta\,|\,\theta^k) = E\big(\log f_X(x\,|\,\theta)\,\big|\,y, \theta^k\big). \tag{33}$$

Since X = (Y, W), we have

$$f_X(x\,|\,\theta) = f_{Y,W}(y, w\,|\,\theta). \tag{34}$$

Since the set $h^{-1}\{y\}$ has measure zero, we cannot write

$$f_Y(y\,|\,\theta) = \int_{h^{-1}\{y\}} f_X(x\,|\,\theta)\,dx; \tag{35}$$

the marginal pdf is, instead, $f_Y(y\,|\,\theta) = \int f_{Y,W}(y, w\,|\,\theta)\,dw$. Consequently, maximizing $Q(\theta\,|\,\theta^k)$ is equivalent to maximizing

$$\int f_{W|Y}(w\,|\,y, \theta^k)\log f_{Y,W}(y, w\,|\,\theta)\,dw.$$

With $f(\theta) = f_Y(y\,|\,\theta)$ and

$$b(w, \theta) = f_{Y,W}(y, w\,|\,\theta),$$

we find that maximizing $Q(\theta\,|\,\theta^k)$ is equivalent to minimizing $G(\theta^k, \theta) = KL(b(\theta^k), b(\theta)) - f(\theta)$. Therefore, the EM algorithm for the case of missing data falls into the framework of the non-stochastic EM algorithm. We conclude that the sequence $\{f(\theta^k)\}$ is non-decreasing and that the sequence $\{KL(b(\theta^k), b(\theta^{k+1}))\}$ converges to zero.

Most other instances of the continuous case in which we have Y = h(X) can be handled using the missing-data model. For example, suppose that $Z_1$ and $Z_2$ are uniformly distributed on the interval $[0, \theta]$, for some positive θ, and that $Y = Z_1 + Z_2$. We may, for example, then take W to be $Z_2$ and X = (Y, W) as the preferred data. We shall discuss these instances further in Sect. 7.

## 6 The Continuous Case

We turn now to the general continuous case. We have a random vector Y taking values in $\mathbb{R}^N$ and governed by the probability density function $f_Y(y\,|\,\theta)$. The objective, once again, is to maximize the likelihood function $L_y(\theta) = f_Y(y\,|\,\theta)$ to obtain the maximum likelihood estimate of θ.

### Acceptable Preferred Data

For the continuous case, the vector $\theta^{k+1}$ is obtained from $\theta^k$ by maximizing the conditional expected value

$$Q(\theta\,|\,\theta^k) = E\big(\log f_X(x\,|\,\theta)\,\big|\,y, \theta^k\big) = \int f_{X|Y}(x\,|\,y, \theta^k)\log f_X(x\,|\,\theta)\,dx. \tag{36}$$

Assuming the acceptability condition and using

$$f_{X,Y}(x, y\,|\,\theta) = f_{X|Y}(x\,|\,y, \theta)\, f_Y(y\,|\,\theta)$$

and

$$f_{X,Y}(x, y\,|\,\theta) = f_{Y|X}(y\,|\,x)\, f_X(x\,|\,\theta),$$

we find that maximizing $Q(\theta\,|\,\theta^k)$ is equivalent to minimizing

$$G(\theta^k, \theta) = KL(b(\theta^k), b(\theta)) - f(\theta), \tag{37}$$

where $f(\theta) = f_Y(y\,|\,\theta)$ and $b(\theta) = b(x, \theta) = f_{Y|X}(y\,|\,x)\, f_X(x\,|\,\theta)$. With these choices, the problem fits the framework of the non-stochastic EM algorithm. Once again, we may conclude that the likelihood function is non-decreasing and that the sequence $\{KL(b(\theta^k), b(\theta^{k+1}))\}$ converges to zero.

In the discrete case in which Y = h(X), the conditional probability $f_{Y|X}(y\,|\,x, \theta)$ is, as a function of y, for given x, equal to one when y = h(x) and zero otherwise, and is the characteristic function of the set $h^{-1}(y)$, as a function of x, for given y. Therefore, we can write $f_{X|Y}(x\,|\,y, \theta)$ using Eq. (16). For the continuous case in which Y = h(X), the pdf $f_{Y|X}(y\,|\,x, \theta)$ is again a delta function of y, for given x; the difficulty arises when we need to view this as a function of x, for given y. The acceptability property helps us avoid this difficulty.

When X is acceptable, we have

$$f_{X|Y}(x\,|\,y, \theta) = \frac{f_{Y|X}(y\,|\,x)\, f_X(x\,|\,\theta)}{f_Y(y\,|\,\theta)}, \tag{38}$$

whenever $f_Y(y\,|\,\theta) \neq 0$, and zero otherwise. Consequently, when X is acceptable, we have a kernel model for $f_Y(y\,|\,\theta)$ in terms of the $f_X(x\,|\,\theta)$:

$$f_Y(y\,|\,\theta) = \int f_{Y|X}(y\,|\,x)\, f_X(x\,|\,\theta)\,dx; \tag{39}$$

for the continuous case we view this as a corrected version of Eq. (15). In the discrete case the integral is replaced by a summation, of course, but when we are speaking generally about either case, we shall use the integral sign.
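
In the discrete case, the kernel model (39) is simply matrix-vector multiplication: a matrix of transition probabilities $f_{Y|X}(y\,|\,x)$, independent of θ, with columns summing to one, maps the probability vector $f_X(\cdot\,|\,\theta)$ to $f_Y(\cdot\,|\,\theta)$. A small sketch (the kernel `K` and the θ-parameterized family `f_X` here are invented purely for illustration):

```python
# K[y][x] = f_{Y|X}(y | x); each column sums to one, and K does not depend on theta.
K = [[0.7, 0.2, 0.0],
     [0.3, 0.5, 0.4],
     [0.0, 0.3, 0.6]]

def f_X(theta):
    """A toy family of probability vectors on x in {0, 1, 2}."""
    p = [1.0, theta, theta ** 2]
    s = sum(p)
    return [pn / s for pn in p]

def f_Y(theta):
    """Kernel model: f_Y(y | theta) = sum_x f_{Y|X}(y | x) f_X(x | theta)."""
    fx = f_X(theta)
    return [sum(K[y][x] * fx[x] for x in range(3)) for y in range(3)]

fy = f_Y(0.5)
print(abs(sum(fy) - 1.0) < 1e-12)   # True: f_Y is again a probability vector
```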

The acceptability of the missing data W is used in [9], but more for computational convenience and to involve the Kullback-Leibler distance in the formulation of the EM algorithm. It is not necessary that W be acceptable in order for likelihood to be non-decreasing, as we have seen.

### Selecting Preferred Data

The popular example of multinomial data given below illustrates well the point that one can often choose to view the observed data as “incomplete” simply in order to introduce “complete” data that makes the calculations simpler, even when there is no suggestion, in the original problem, that the observed data is in any way inadequate or “incomplete.” It is in order to emphasize this desire for simplification that we refer to X as the preferred data, not the complete data.

In some applications, the preferred data X arises naturally from the problem, while in other cases the user must imagine preferred data. This choice in selecting the preferred data can be helpful in speeding up the algorithm (see [10]).
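
A standard illustration of imagined preferred data, treated at length in [1], is the genetic-linkage multinomial with observed counts y = (125, 18, 20, 34) and cell probabilities $(1/2 + \theta/4,\ (1-\theta)/4,\ (1-\theta)/4,\ \theta/4)$. Splitting the first cell into two, with probabilities $1/2$ and $\theta/4$, gives preferred data for which the M-step has a closed form; the sketch below runs this EM iteration and checks that the likelihood never decreases:

```python
import math

y1, y2, y3, y4 = 125, 18, 20, 34   # the classic observed multinomial counts

def log_likelihood(theta):
    """Observed-data log likelihood for the linkage model."""
    return (y1 * math.log(0.5 + theta / 4)
            + (y2 + y3) * math.log((1 - theta) / 4)
            + y4 * math.log(theta / 4))

theta = 0.5                         # initial guess
for k in range(20):
    # E-step: expected count in the theta/4 part of the split first cell
    x2 = y1 * (theta / 4) / (0.5 + theta / 4)
    # M-step: closed-form maximizer of the complete-data log likelihood
    new_theta = (x2 + y4) / (x2 + y2 + y3 + y4)
    assert log_likelihood(new_theta) >= log_likelihood(theta) - 1e-12
    theta = new_theta

print(round(theta, 4))   # approximately 0.6268
```

The iterates converge quickly to the well-known value $\theta_{ML} \approx 0.6268$, and the in-loop assertion confirms the monotonicity guaranteed by the theory.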

It is not always necessary to maximize $Q(\theta\,|\,\theta^k)$ exactly; at each M-step, we may simply select $\theta^{k+1}$ so that $Q(\theta^{k+1}\,|\,\theta^k) \ge Q(\theta^k\,|\,\theta^k)$, which is enough to keep the likelihood non-decreasing.

### Preferred Data as Missing Data

As we have seen, when the EM algorithm is applied to the missing-data model, the likelihood is non-decreasing, which suggests that, for an arbitrary preferred data X, we could imagine X as W, the missing data, and imagine applying the EM algorithm to Z = (Y, X). This approach would produce an EM sequence of parameter vectors for which likelihood is non-decreasing, but it need not be the same sequence as obtained by applying the EM algorithm to X directly. It is the same sequence, provided that X is acceptable. We are not suggesting that applying the EM algorithm to Z = (Y, X) would simplify calculations.

We know that, when the missing-data model is used and the M-step is defined as maximizing the function in (35), the likelihood is non-decreasing. It would seem then that for any choice of preferred data X, we could view this data as missing and take as our complete data the pair Z = (Y, X), with X now playing the role of W. Maximizing the function in (35) is then equivalent to maximizing

$$\int f_{X|Y}(x\,|\,y, \theta^k)\log f_{Y,X}(y, x\,|\,\theta)\,dx \tag{40}$$

to get $\theta^{k+1}$. It then follows that $f_Y(y\,|\,\theta^{k+1}) \ge f_Y(y\,|\,\theta^k)$. The obvious question is whether or not these two functions given in (11) and (40) have the same maximizers.

For acceptable X we have

$$\log f_{Y,X}(y, x\,|\,\theta) = \log f_{Y|X}(y\,|\,x) + \log f_X(x\,|\,\theta), \tag{41}$$

so the two functions given in (11) and (40) do have the same maximizers, since the term $\log f_{Y|X}(y\,|\,x)$ does not involve θ. It follows once again that whenever the preferred data is acceptable, we have $f_Y(y\,|\,\theta^{k+1}) \ge f_Y(y\,|\,\theta^k)$. Without additional assumptions, however, we cannot conclude that $\{\theta^k\}$ converges to $\theta_{ML}$, nor that $\{f_Y(y\,|\,\theta^k)\}$ converges to $f_Y(y\,|\,\theta_{ML})$.

## 7 The Continuous Case with Y = h(X)

In this section we consider the continuous case in which the observed random vector Y takes values in $\mathbb{R}^N$; the preferred random vector X takes values in $\mathbb{R}^M$; the random vectors are governed by probability density functions $f_Y(y\,|\,\theta)$ and $f_X(x\,|\,\theta)$, respectively; and there is a function $h:\mathbb{R}^M\to\mathbb{R}^N$ such that Y = h(X). In most cases, M > N and $h^{-1}\{y\}$ has measure zero in $\mathbb{R}^M$.

### An Example

For example, suppose that $Z_1$ and $Z_2$ are independent and uniformly distributed on the interval $[0, \theta]$, for some θ > 0 to be estimated. Let $Y = Z_1 + Z_2$. With $Z = (Z_1, Z_2)$, and $h:\mathbb{R}^2\to\mathbb{R}$ given by $h(z_1, z_2) = z_1 + z_2$, we have Y = h(Z). The pdf for Z is

$$f_Z(z_1, z_2\,|\,\theta) = \frac{1}{\theta^2}, \quad \text{for } 0 \le z_1 \le \theta,\ 0 \le z_2 \le \theta, \tag{42}$$

and zero otherwise. The pdf for Y is

$$f_Y(y\,|\,\theta) = \begin{cases} y/\theta^2, & 0 \le y \le \theta;\\ (2\theta - y)/\theta^2, & \theta \le y \le 2\theta, \end{cases} \tag{43}$$

and zero otherwise.
It is not the case that

$$f_Y(y\,|\,\theta) = \int_{h^{-1}\{y\}} f_Z(z\,|\,\theta)\,dz, \tag{44}$$

since $h^{-1}\{y\}$ has measure zero in $\mathbb{R}^2$.

The likelihood function is $f_Y(y\,|\,\theta)$, viewed as a function of θ, and is given by

$$L_y(\theta) = \begin{cases} (2\theta - y)/\theta^2, & y/2 \le \theta \le y;\\ y/\theta^2, & \theta \ge y, \end{cases} \tag{45}$$

and zero otherwise. Therefore, the maximum likelihood estimate of θ is $\theta_{ML} = y$.
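
The claim $\theta_{ML} = y$ is easy to check numerically: evaluate the likelihood (45) over a grid of θ values for a fixed observation y (a quick sketch):

```python
def likelihood(theta, y):
    """Triangular density of Y = Z1 + Z2, Z_i ~ Uniform[0, theta], as a function of theta."""
    if 0 <= y <= theta:
        return y / theta ** 2
    if theta < y <= 2 * theta:
        return (2 * theta - y) / theta ** 2
    return 0.0

y = 1.3
thetas = [0.01 * k for k in range(66, 300)]   # grid of theta values in [0.66, 2.99]
best = max(thetas, key=lambda t: likelihood(t, y))
print(abs(best - y) < 0.011)   # True: the likelihood peaks at theta = y
```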

Instead of using Z as our preferred data, suppose that we define the random variable $W = Z_2$, and let X = (Y, W), a missing-data model. We then have Y = h(X), where $h:\mathbb{R}^2\to\mathbb{R}$ is given by $h(x_1, x_2) = x_1$. The pdf for Y given in Eq. (43) can be written as

$$f_Y(y\,|\,\theta) = \int f_{Y,W}(y, w\,|\,\theta)\,dw. \tag{46}$$

The joint pdf is

$$f_{Y,W}(y, w\,|\,\theta) = f_Z(y - w, w\,|\,\theta) = \frac{1}{\theta^2}, \quad \text{for } \max(0, y - \theta) \le w \le \min(y, \theta), \tag{47}$$

and zero otherwise.
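
As a check on Eqs. (46) and (47): integrating the joint pdf over w should recover the triangular density (43). A numerical sketch, using a simple Riemann sum:

```python
def f_joint(y, w, theta):
    """Eq. (47): f_{Y,W}(y, w | theta) = 1/theta^2 on max(0, y - theta) <= w <= min(y, theta)."""
    if max(0.0, y - theta) <= w <= min(y, theta):
        return 1.0 / theta ** 2
    return 0.0

def f_Y_tri(y, theta):
    """Eq. (43): the triangular density of Y = Z1 + Z2."""
    if 0 <= y <= theta:
        return y / theta ** 2
    if theta < y <= 2 * theta:
        return (2 * theta - y) / theta ** 2
    return 0.0

theta, y = 2.0, 1.5
# Riemann sum approximating the integral over w in Eq. (46)
dw = 1e-4
integral = sum(f_joint(y, k * dw, theta) for k in range(int(2 * theta / dw))) * dw
print(abs(integral - f_Y_tri(y, theta)) < 1e-3)   # True
```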

### Censored Exponential Data

McLachlan and Krishnan [1] give the following example of a likelihood maximization problem involving probability density functions. This example provides a good illustration of the usefulness of the missing-data model.

Suppose that Z is the time until failure of a component, which we assume is governed by the exponential distribution

$$f(z\,|\,\theta) = \frac{1}{\theta} e^{-z/\theta}, \quad z \ge 0, \tag{48}$$

where the parameter θ > 0 is the expected time until failure.
where the parameter θ > 0 is the expected time until failure. We observe a random sample of N components and record their failure times, z n . On the basis of this data, we must estimate θ, the mean time until failure.

It may well happen, however, that during the time allotted for observing the components, only r of the N components fail, which, for convenience, are taken to be the first r items in the record. Rather than wait longer, we record the failure times of those that failed and record the elapsed time for the experiment, say T, for those that had not yet failed. The censored data is then $y = (y_1, \ldots, y_N)$, where $y_n = z_n$ is the time until failure for $n = 1, \ldots, r$, and $y_n = T$ for $n = r + 1, \ldots, N$. The censored data is reasonably viewed as incomplete, relative to the complete data we would have had, had the trial lasted until all the components had failed.

Since the probability that a component will survive until time T is $e^{-T/\theta}$, the pdf for the vector y is

$$f_Y(y\,|\,\theta) = \theta^{-r}\exp\Big(-\sum_{n=1}^{r} y_n/\theta\Big)\exp\big(-(N - r)T/\theta\big), \tag{49}$$

and the log likelihood function for the censored, or incomplete, data is

$$\log L_y(\theta) = -r\log\theta - \frac{1}{\theta}\Big(\sum_{n=1}^{r} y_n + (N - r)T\Big). \tag{50}$$

In this particular example, we are fortunate, in that we can maximize $f_Y(y\,|\,\theta)$ easily, and find that the ML solution based on the incomplete, censored data is

$$\theta_{ML} = \frac{1}{r}\Big(\sum_{n=1}^{r} y_n + (N - r)T\Big). \tag{51}$$
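
The closed form (51) is easy to confirm numerically, and the EM iteration for this model is also simple: by the memorylessness of the exponential distribution, the E-step replaces each censored time by $E[Z\,|\,Z > T, \theta^k] = T + \theta^k$, and the M-step averages. A sketch (the simulated failure times are for illustration only):

```python
import math, random

random.seed(0)
theta_true, N, T = 3.0, 200, 2.5
z = [random.expovariate(1.0 / theta_true) for _ in range(N)]
failed = [zn for zn in z if zn <= T]    # observed (uncensored) failure times
r = len(failed)
S = sum(failed)

# Eq. (51): ML estimate from the censored data
theta_ml = (S + (N - r) * T) / r

def log_likelihood(theta):
    """Eq. (50): censored-data log likelihood."""
    return -r * math.log(theta) - (S + (N - r) * T) / theta

# theta_ml maximizes (50): nearby values give a smaller log likelihood
assert all(log_likelihood(theta_ml) >= log_likelihood(theta_ml + d)
           for d in (-0.5, -0.1, 0.1, 0.5))

# EM iteration: E-step fills in each censored time as T + theta_k; M-step averages
theta_k = 1.0
for k in range(100):
    theta_k = (S + (N - r) * (T + theta_k)) / N
print(abs(theta_k - theta_ml) < 1e-8)   # True: EM converges to the censored-data MLE
```

The EM update is a contraction with factor $(N - r)/N < 1$, and its fixed point is exactly the estimate in (51).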
In most cases in which our data is incomplete, finding the ML estimate from the incomplete data is difficult, while finding it for the complete data is relatively easy.

We say that the missing data are the times until failure of those components that did not fail during the observation time. The preferred data is the complete data of actual times until failure. The pdf for the preferred data X is

$$f_X(x\,|\,\theta) = \theta^{-N}\exp\Big(-\sum_{n=1}^{N} x_n/\theta\Big), \tag{52}$$

and the log likelihood function based on the complete data is

$$\log L_x(\theta) = -N\log\theta - \frac{1}{\theta}\sum_{n=1}^{N} x_n. \tag{53}$$
The ML estimate of θ