### The E-Step and M-Step

In statistical parameter estimation, one typically has an observable random vector $Y$ taking values in $\mathbb{R}^N$ that is governed by a probability density function (pdf) or probability function (pf) of the form $f_Y(y\,|\,\theta)$, for some value of the parameter vector $\theta \in \Theta$, where $\Theta$ is the set of all legitimate values of $\theta$. Our observed data consists of one realization $y$ of $Y$; we do not exclude the possibility that the entries of $y$ are independently obtained samples of a common real-valued random variable. The true vector of parameters is to be estimated by maximizing the likelihood function $L_y(\theta) = f_Y(y\,|\,\theta)$ over all $\theta \in \Theta$ to obtain a maximum likelihood estimate, $\theta_{ML}$.

To employ the EM algorithmic approach, it is assumed that there is another related random vector $X$, which we shall call the *preferred data*, such that, had we been able to obtain one realization $x$ of $X$, maximizing the likelihood function $L_x(\theta) = f_X(x\,|\,\theta)$ would have been simpler than maximizing the likelihood function $L_y(\theta) = f_Y(y\,|\,\theta)$. Of course, we do not have a realization $x$ of $X$. The basic idea of the EM approach is to estimate $x$ using the current estimate of $\theta$, denoted $\theta^k$, and to use each estimate $x^k$ of $x$ to get the next estimate $\theta^{k+1}$.

The EM algorithm proceeds in two steps. Having selected the preferred data $X$, and having found $\theta^k$, we form the function of $\theta$ given by

$$Q(\theta\,|\,\theta^k) = E\big[\log f_X(X\,|\,\theta)\,\big|\,y,\theta^k\big];$$

this is the E-step of the EM algorithm. Then we maximize $Q(\theta\,|\,\theta^k)$ over all $\theta$ to get $\theta^{k+1}$; this is the M-step of the EM algorithm. In this way, the EM algorithm based on $X$ generates a sequence $\{\theta^k\}$ of parameter vectors.
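Although the discussion above is general, the two steps are easy to exhibit in a standard example. The following sketch is a hypothetical illustration, not drawn from the text: it applies the E-step and M-step to a two-component Gaussian mixture with known mixing weight and variance, the preferred data being $X = (Y, Z)$ with $Z$ the hidden component label.

```python
import math
import random

def em_two_gaussians(y, mu_init=(0.0, 1.0), n_iters=50, sigma=1.0, w1=0.5):
    """EM for a two-component Gaussian mixture with known mixing weight w1
    and known common standard deviation sigma; only the two means are unknown.
    The preferred data X = (Y, Z) augments Y with the hidden component label Z."""
    mu1, mu2 = mu_init
    for _ in range(n_iters):
        # E-step: responsibilities r_i = P(Z_i = 1 | y_i, theta^k); these
        # conditional expectations are all that Q(theta | theta^k) needs here.
        r = []
        for yi in y:
            p1 = w1 * math.exp(-0.5 * ((yi - mu1) / sigma) ** 2)
            p2 = (1.0 - w1) * math.exp(-0.5 * ((yi - mu2) / sigma) ** 2)
            r.append(p1 / (p1 + p2))
        # M-step: maximize Q(theta | theta^k); in this model the maximizers
        # are the responsibility-weighted sample means.
        mu1 = sum(ri * yi for ri, yi in zip(r, y)) / sum(r)
        mu2 = sum((1.0 - ri) * yi for ri, yi in zip(r, y)) / (len(r) - sum(r))
    return mu1, mu2
```

Here the M-step has a closed form, which is what makes the preferred data "preferred": maximizing $L_x$ is simple, while maximizing $L_y$ directly is not.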

For the discrete case of probability functions, we have

$$Q(\theta\,|\,\theta^k) = \sum_{x} f_{X|Y}(x\,|\,y,\theta^k)\,\log f_X(x\,|\,\theta),$$

and for the continuous case of probability density functions, we have

$$Q(\theta\,|\,\theta^k) = \int f_{X|Y}(x\,|\,y,\theta^k)\,\log f_X(x\,|\,\theta)\,dx. \tag{13}$$
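In small discrete models, the sum defining $Q$ can be carried out literally. The sketch below uses a hypothetical multinomial model, not taken from the text: $X = (x_1, x_2, x_3)$ has cell probabilities $(1/2,\ \theta/2,\ (1-\theta)/2)$ and we observe only $y_1 = x_1 + x_2$ and $y_3 = x_3$, so that conditionally $x_2 \sim \mathrm{Binomial}\big(y_1,\ \theta^k/(1+\theta^k)\big)$.

```python
import math

def binom_pmf(n, k, p):
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def q_discrete(theta, theta_k, y1, y3):
    """Q(theta | theta^k) computed literally as
    sum_x f_{X|Y}(x | y, theta^k) log f_X(x | theta),
    dropping the theta-independent multinomial coefficient in log f_X."""
    p = theta_k / (1.0 + theta_k)   # P(a merged-cell count falls in cell 2)
    q = 0.0
    for x2 in range(y1 + 1):
        w = binom_pmf(y1, x2, p)    # f_{X|Y}(x | y, theta^k)
        x1 = y1 - x2
        q += w * (x1 * math.log(1 / 2) + x2 * math.log(theta / 2)
                  + y3 * math.log((1.0 - theta) / 2))
    return q
```

Maximizing this $Q$ over a fine grid of $\theta$ values reproduces the closed-form M-step value $\theta^{k+1} = E[x_2\,|\,y,\theta^k] / (E[x_2\,|\,y,\theta^k] + y_3)$.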

In decreasing order of importance and difficulty, the goals are these:

1. To have the sequence of parameters $\{\theta^k\}$ converging to $\theta_{ML}$;

2. To have the sequence of functions $\{f_Y(\,\cdot\,|\,\theta^k)\}$ converging to $f_Y(\,\cdot\,|\,\theta_{ML})$;

3. To have the sequence of numbers $\{L_y(\theta^k)\}$ converging to $L_y(\theta_{ML})$;

4. To have the sequence of numbers $\{L_y(\theta^k)\}$ non-decreasing.

Our focus here is mainly on the fourth goal, with some discussion of the third goal. We do present some examples for which all four goals are attained. Clearly, the first goal requires a topology on the set $\Theta$.
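The fourth goal can be checked numerically in a toy model. In the hypothetical multinomial example below (not from the text), $X = (x_1, x_2, x_3)$ has cell probabilities $(1/2,\ \theta/2,\ (1-\theta)/2)$, we observe $y_1 = x_1 + x_2$ and $y_3 = x_3$, and the trace of $\log L_y(\theta^k)$ is non-decreasing, as goal 4 requires.

```python
import math

def em_likelihood_trace(y1, y3, theta=0.1, n_iters=30):
    """Returns log L_y(theta^k) (up to an additive constant) at each EM
    iteration for the model with cell probabilities (1/2, t/2, (1-t)/2)
    and observed data y1 = x1 + x2, y3 = x3."""
    trace = []
    for _ in range(n_iters):
        # log L_y(theta) = y1 log((1+theta)/2) + y3 log((1-theta)/2) + const
        trace.append(y1 * math.log((1.0 + theta) / 2)
                     + y3 * math.log((1.0 - theta) / 2))
        x2 = y1 * theta / (1.0 + theta)   # E-step: E[x2 | y, theta^k]
        theta = x2 / (x2 + y3)            # M-step: closed-form maximizer of Q
    return trace
```

In this model $\theta_{ML} = (y_1 - y_3)/(y_1 + y_3)$ in closed form, so the third goal can be checked as well.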

### Difficulties with the Conventional Formulation

In [1] we are told that

$$f_{X|Y}(x\,|\,y,\theta) = \frac{f_X(x\,|\,\theta)}{f_Y(y\,|\,\theta)}. \tag{14}$$

This is false; integrating with respect to $x$ gives one on the left side and $1/f_Y(y\,|\,\theta)$ on the right side. Perhaps the equation is not meant to hold for all $x$, but just for some $x$. In fact, if there is a function $h$ such that $Y = h(X)$, then Eq. (14) might hold for those $x$ such that $h(x) = y$. In fact, this is what happens in the discrete case of probabilities; in that case we do have

$$f_{X|Y}(x\,|\,y,\theta) = \frac{f_X(x\,|\,\theta)}{f_Y(y\,|\,\theta)}, \quad \text{for } x \in h^{-1}(y), \tag{16}$$

where

$$h^{-1}(y) = \{x : h(x) = y\} \quad \text{and} \quad f_Y(y\,|\,\theta) = \sum_{x \in h^{-1}(y)} f_X(x\,|\,\theta).$$

Consequently,

$$\sum_{x \in h^{-1}(y)} f_{X|Y}(x\,|\,y,\theta) = 1.$$

However, this modification of Eq. (14) fails in the continuous case of probability density functions, since $h^{-1}(y)$ is often a subset of zero measure. Even if the set $h^{-1}(y)$ has positive measure, integrating both sides of Eq. (14) over $x \in h^{-1}(y)$ tells us that

$$f_Y(y\,|\,\theta) = \int_{h^{-1}(y)} f_X(x\,|\,\theta)\,dx,$$

which need not hold for probability density functions.
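This failure is easy to confirm numerically. In the hypothetical model below (not taken from the text), $X \sim N(\theta, 1)$ and $Y = X + W$ with $W \sim N(0,1)$ independent, so $Y \sim N(\theta, \sqrt{2})$; integrating the right side of the claimed Eq. (14) over $x$ gives $1/f_Y(y\,|\,\theta)$, not the value $1$ that the left side requires.

```python
import math

def normal_pdf(t, mu, sigma):
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical continuous model: X ~ N(theta, 1), Y = X + W, W ~ N(0, 1),
# so marginally Y ~ N(theta, sqrt(2)).
theta, y = 0.3, 1.0
f_y = normal_pdf(y, theta, math.sqrt(2))

# Integrate the right side of the claimed identity, f_X(x|theta)/f_Y(y|theta),
# over x on a fine grid; if Eq. (14) held for all x, this would equal 1.
xs = [-10 + 20 * i / 20000 for i in range(20001)]
dx = 20 / 20000
rhs = sum(normal_pdf(x, theta, 1.0) / f_y for x in xs) * dx
```

The computed value of `rhs` agrees with $1/f_Y(y\,|\,\theta)$, which here is roughly $4$, far from $1$.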

### An Incorrect Proof

Everyone who works with the EM algorithm will say that the likelihood is non-decreasing for the EM algorithm. The proof of this fact usually proceeds as follows; we use the notation for the continuous case, but the proof for the discrete case is essentially the same. Use Eq. (14) to get

$$\log f_X(x\,|\,\theta) = \log f_{X|Y}(x\,|\,y,\theta) + \log f_Y(y\,|\,\theta). \tag{17}$$

Then replace the term $\log f_X(x\,|\,\theta)$ in Eq. (13) with the right side of Eq. (17), obtaining

$$Q(\theta\,|\,\theta^k) = \int f_{X|Y}(x\,|\,y,\theta^k)\,\log f_{X|Y}(x\,|\,y,\theta)\,dx + \log L_y(\theta).$$

Jensen's Inequality tells us that

$$\int u(x)\log u(x)\,dx \geq \int u(x)\log v(x)\,dx,$$

for any probability density functions $u(x)$ and $v(x)$. Since $f_{X|Y}(x\,|\,y,\theta^k)$ is a probability density function, we conclude that the function of $\theta$ given by

$$-\int f_{X|Y}(x\,|\,y,\theta^k)\,\log f_{X|Y}(x\,|\,y,\theta)\,dx$$

attains its minimum value at $\theta = \theta^k$. Then we have

$$\log L_y(\theta^{k+1}) - \log L_y(\theta^k) \geq Q(\theta^{k+1}\,|\,\theta^k) - Q(\theta^k\,|\,\theta^k) \geq 0.$$

This proof is incorrect; clearly it rests on the validity of Eq. (14), which is generally false. For the discrete case, with $Y = h(X)$, the proof is valid when we use Eq. (16) instead of Eq. (14). In all other cases, however, the proof is incorrect.
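The Jensen step itself is sound. For probability functions $u$ and $v$ we have $\sum_x u(x)\log u(x) \geq \sum_x u(x)\log v(x)$, with equality when $v = u$; this is the non-negativity of the Kullback-Leibler divergence. A randomized spot-check (a hypothetical discrete illustration, not from the text) confirms it:

```python
import math
import random

def cross_term(u, v):
    """sum_x u(x) log v(x): the quantity appearing in the Jensen step."""
    return sum(ui * math.log(vi) for ui, vi in zip(u, v))

def random_dist(n, rng):
    """A random probability vector of length n with strictly positive entries."""
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

rng = random.Random(1)
for _ in range(100):
    u, v = random_dist(5, rng), random_dist(5, rng)
    # Jensen's Inequality: sum u log u >= sum u log v for any pair of pfs.
    assert cross_term(u, u) >= cross_term(u, v)
```

What fails in the incorrect proof is not this inequality but the identity (14) that feeds into it.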

### Acceptable Data

We turn now to the question of how to repair the incorrect proof. Equation (14) should read

$$f_{X|Y}(x\,|\,y,\theta) = \frac{f_{X,Y}(x,y\,|\,\theta)}{f_Y(y\,|\,\theta)},$$

for all $x$. In order to replace $\log f_X(x\,|\,\theta)$ in Eq. (13), we write

$$f_{X,Y}(x,y\,|\,\theta) = f_{X|Y}(x\,|\,y,\theta)\,f_Y(y\,|\,\theta)$$

and

$$f_{X,Y}(x,y\,|\,\theta) = f_{Y|X}(y\,|\,x,\theta)\,f_X(x\,|\,\theta),$$

so that

$$\log f_X(x\,|\,\theta) = \log f_{X|Y}(x\,|\,y,\theta) + \log f_Y(y\,|\,\theta) - \log f_{Y|X}(y\,|\,x,\theta).$$

We say that the preferred data is *acceptable* if

$$f_{Y|X}(y\,|\,x,\theta) = f_{Y|X}(y\,|\,x);$$

that is, the dependence of $Y$ on $X$ is unrelated to the value of the parameter $\theta$. This definition provides our generalization of the relationship $Y = h(X)$.

When $X$ is acceptable, the term $\log f_{Y|X}(y\,|\,x)$ contributes to $Q(\theta\,|\,\theta^k)$ a quantity that does not depend on $\theta$, and the function of $\theta$ given by

$$-\int f_{X|Y}(x\,|\,y,\theta^k)\,\log f_{X|Y}(x\,|\,y,\theta)\,dx$$

again attains its minimum value at $\theta = \theta^k$. The assertion that the likelihood is non-decreasing then follows, using the same argument as in the previous incorrect proof.
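To make acceptability concrete, consider a hypothetical additive-noise model (not taken from the text): $X \sim N(\theta, 1)$ and $Y = X + W$ with $W \sim N(0,1)$ independent of $\theta$. Then $f_{Y|X}(y\,|\,x,\theta)$ is the $N(0,1)$ density evaluated at $y - x$, which involves no $\theta$, so $X$ is acceptable, and the likelihood trace along the EM iterates is non-decreasing:

```python
import math

def normal_pdf(t, mu, sigma):
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_acceptable(y, theta=-3.0, n_iters=40):
    """EM for theta in the model X ~ N(theta, 1), Y = X + W, W ~ N(0, 1)
    independent of theta, so X is acceptable.  Marginally Y ~ N(theta, sqrt(2)),
    hence L_y(theta) = normal_pdf(y, theta, sqrt(2)).  Returns the final
    estimate and the likelihood values L_y(theta^k)."""
    likelihoods = []
    for _ in range(n_iters):
        likelihoods.append(normal_pdf(y, theta, math.sqrt(2)))
        x_hat = (theta + y) / 2   # E-step: E[X | y, theta^k] (posterior mean)
        theta = x_hat             # M-step: argmax of E[log f_X(X|theta) | y, theta^k]
    return theta, likelihoods
```

Here $\theta^k \to y$, which is $\theta_{ML}$ for a single observation of $Y \sim N(\theta, \sqrt{2})$, so all four goals are attained in this example.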