\documentstyle[12pt]{article}
%\input{jour}
\topmargin 0in
\headheight 0in
\headsep 0in
\textheight 9in
\textwidth 6.25in
\oddsidemargin 0in
\evensidemargin 0in
\newcommand{\bz}{\mbox{\boldmath $z$}}
\newcommand{\bZ}{\mbox{\boldmath $Z$}}
\newcommand{\by}{\mbox{\boldmath $y$}}
\newcommand{\bY}{\mbox{\boldmath $Y$}}
\newcommand{\bx}{\mbox{\boldmath $x$}}
\newcommand{\obx}{\overline{\mbox{\boldmath $x$}}}
\newcommand{\ox}{\overline x}
\newcommand{\bS}{\mbox{\boldmath $S$}}
\newcommand{\bX}{\mbox{\boldmath $X$}}
\newcommand{\btheta}{\mbox{\boldmath $\theta$}}
\newcommand{\bbeta}{\mbox{\boldmath $\beta$}}
\newcommand{\bPsi}{\mbox{\boldmath $\Psi$}}
\newcommand{\bPhi}{\mbox{\boldmath $\Phi$}}
\newcommand{\bpi}{\mbox{\boldmath $\pi$}}
\newcommand{\bnu}{\mbox{\boldmath $\nu$}}
\newcommand{\hnu}{\hat\nu}
\newcommand{\sumj}{\sum_{j=1}^{n}}
\newcommand{\sumi}{\sum_{i=1}^{g}}
\newcommand{\bmu}{\mbox{\boldmath $\mu$}}
\newcommand{\bSigma}{\mbox{\boldmath $\Sigma$}}
\newcommand{\hbPsi}{\hat{\mbox{\boldmath $\Psi$}}}
\newcommand{\hbSigma}{\hat{\mbox{\boldmath $\Sigma$}}}
\setlength{\parskip}{.05 in}
\setlength{\topmargin}{1mm}
\setlength{\baselineskip}{1.5\baselineskip}
\renewcommand{\baselinestretch}{1.5}
\setlength{\parskip}{.16 in}
\input{epsf}
\begin{document}
\title{THE EMMIX SOFTWARE FOR THE FITTING OF MIXTURES OF NORMAL AND $t$-COMPONENTS}
\author{G.J. McLachlan, D. Peel, K.E. Basford*, and P. Adams\\
Department of Mathematics,\\
University of Queensland, St. Lucia, Queensland 4072, AUSTRALIA\\
$^{*}$School of Land and Food,\\
University of Queensland, St. Lucia, Queensland 4072, AUSTRALIA}
\date{}
\maketitle
\section*{Abstract}
{\it We consider the fitting of normal or t-component mixture models to
multivariate data, using maximum likelihood via
the EM algorithm. This approach requires
the specification of an initial estimate of
the vector of unknown parameters or, equivalently, of an initial
classification of the data with respect to the components of the mixture
model under fit. We describe an algorithm called EMMIX that
automatically undertakes this fitting, including the provision of suitable
initial values if not supplied by the user. The EMMIX algorithm has several
options, including the option to carry out a resampling-based test for
the number of components in the mixture model.}
\section{INTRODUCTION}
\label{intro}
Finite mixture models are being increasingly used to model the
distributions of a wide variety of random phenomena.
For multivariate data of a continuous
nature, attention has focussed on the use of multivariate normal components
because of their computational convenience.
They can be easily fitted iteratively by maximum likelihood (ML) via the
expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin
(1977), McLachlan and Krishnan (1997)), as the iterates on the
M-step are given in closed form. Also, in cluster analysis where a
mixture model-based approach is widely adopted,
the clusters in the data are often essentially elliptical in shape, so that it
is reasonable to consider fitting mixtures of elliptically symmetric component
densities. Within this class of component densities, the multivariate normal
density is a convenient choice given its above-mentioned computational
tractability.
However, for a set of data
containing a group, or groups, of observations with longer than normal tails
or atypical observations, the use of normal components may unduly affect
the fit of the mixture model. EMMIX therefore also provides a more robust
approach, in which the data are modelled by a mixture of $t$ distributions.
The use of the ECM algorithm to fit this $t$ mixture model is described
in McLachlan and Peel (1998).
We let $\by_1,\,\ldots,\,\by_n$ denote an observed $p$-dimensional
sample of size $n$. With a
mixture model-based approach to drawing inferences from these data,
each data point is assumed to be a realization of the random $p$-dimensional
vector $\bY$ with the $g$-component mixture probability density
function (p.d.f.),
\begin{equation}
f(\by;\,\bPsi)=\sum_{i=1}^g \pi_i c_i(\by;\,\btheta_i),
\label{eq:1}
\end{equation}
where the mixing proportions $\pi_i$ are nonnegative and sum to one,
$\bPsi=(\pi_1,\,\ldots,\,\pi_{g-1},\,\btheta^T)^T$ with
$\btheta=(\btheta_1^T,\,\ldots,\,\btheta_g^T)^T$, and
$\btheta_i$ denotes the vector of unknown parameters of the component density $c_i$.
In the case of multivariate normal mixture models, the $c_i(\by;\,\btheta_i)$
are taken to be
$\phi(\by;\,\bmu_i,\,\bSigma_i)$, the multivariate normal
p.d.f. with mean (vector) $\bmu_i$ and covariance matrix $\bSigma_i$.
Hence $\btheta$ contains the elements of the $\bmu_i$ and the distinct
elements of the $\bSigma_i \, (i=1,\,\ldots,\,g)$.
Often, in order to reduce
the number of unknown parameters, the component-covariance matrices are
restricted to being equal, or even diagonal as in the AutoClass program of Cheeseman and Stutz (1996).
Less restrictive constraints can be imposed by a reparameterization of the
component-covariance matrices in terms of their eigenvalue decompositions as,
for example, in Banfield and Raftery (1993).
In the latest version of AutoClass (http://ic.arc.nasa.gov/ic/projects/bayes-group/autoclass/autoclass-c-program.html), the covariance matrices are unrestricted.
Other software for the fitting of mixture models includes
MCLUST and EMCLUST, suites of S-PLUS functions for hierarchical
clustering, EM, and BIC, based on parameterized Gaussian
mixture models; see Banfield and Raftery (1993), Byers and Raftery
(1998), Campbell et al. (1998), DasGupta and Raftery (1998),
and Fraley and Raftery (1998).
Both MCLUST and EMCLUST (http://stat.washington.edu/fraley/software.shtml)
are written in
FORTRAN with an interface to the S-PLUS commercial package.
Some packages for the fitting of finite mixtures have been reviewed recently
by Haughton (1997).
Also, Wallace and Dowe (1994) have considered the application of their SNOB
(http://www.cs.monash.edu.au/~dld/Snob.html)
program to mixture modelling using the minimum message length principle of
Wallace and Boulton (1968).
More recently, Hunt and Jorgensen (1997) have developed the MULTIMIX
program for the fitting of mixture models to data sets that contain categorical
and continuous variables and that may have missing values.
Under the assumption that $\by_1,\,\ldots,\,\by_n$ are independent
realizations of the feature vector $\bY$, the log likelihood function
for $\bPsi$ is given by
\begin{equation}
\log L(\bPsi)=\sum_{j=1}^n\log\sum_{i=1}^g\pi_i
\phi(\by_j;\,\bmu_i,\,\bSigma_i).
\label{eq:2}
\end{equation}
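As an illustration (not part of the EMMIX FORTRAN code itself; all names below are our own), the log likelihood (\ref{eq:2}) can be evaluated stably with the log-sum-exp device, as in the following minimal NumPy sketch:

```python
import numpy as np

def mixture_log_likelihood(y, pi, mu, sigma):
    """Evaluate log L(Psi) = sum_j log sum_i pi_i phi(y_j; mu_i, Sigma_i)
    for an n x p data matrix y, using log-sum-exp for numerical stability."""
    n, p = y.shape
    g = len(pi)
    log_terms = np.empty((n, g))
    for i in range(g):
        diff = y - mu[i]
        inv = np.linalg.inv(sigma[i])
        _, logdet = np.linalg.slogdet(sigma[i])
        # quadratic form (y_j - mu_i)^T Sigma_i^{-1} (y_j - mu_i) for each j
        quad = np.einsum('nj,jk,nk->n', diff, inv, diff)
        # log pi_i + log phi(y_j; mu_i, Sigma_i)
        log_terms[:, i] = np.log(pi[i]) - 0.5 * (p * np.log(2 * np.pi) + logdet + quad)
    m = log_terms.max(axis=1, keepdims=True)
    return float((m.ravel() + np.log(np.exp(log_terms - m).sum(axis=1))).sum())
```

The log-sum-exp step matters in practice: for well-separated components the individual terms $\pi_i\phi(\by_j;\,\bmu_i,\,\bSigma_i)$ can underflow even though their logarithms are moderate.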
With the maximum likelihood approach to the estimation of $\bPsi$, an
estimate is provided by an appropriate root of the likelihood equation,
\begin{equation}
\partial\log L(\bPsi)/\partial\bPsi={\bf 0}.
\label{eq:3}
\end{equation}
In this paper, we describe an algorithm called EMMIX that has been developed
using the EM algorithm to find solutions of (\ref{eq:3}) corresponding to
local maxima.
In the appendix of their monograph, McLachlan and Basford (1988) gave the
listing of FORTRAN programs that they had written for the maximum
likelihood fitting of multivariate normal mixture models under a variety of
experimental conditions. Over the years, these programs have undergone
continued refinement and development, leading to an interim version
known as the NMM algorithm (McLachlan and Peel, 1996). Since then, there
has been much further development, culminating in the present version of the
algorithm known as EMMIX. The option in EMMIX that uses hierarchical
methods for the provision of an initial classification of the data relies on
the program HACLUS written by Dr I.\ De Lacy.
For the mixture programs of McLachlan and Basford (1988), an initial
specification had to be given by the user either for the parameter vector
$\bPsi$ or for the classification of the data with respect to the components of
the normal mixture model. With the EMMIX algorithm, the user does not have to
provide this specification. In the absence of a user-provided
specification, the EMMIX algorithm can be run for a specified number of
random starts and/or for starts corresponding to classifications of the data
by specified clustering procedures from a wide class that includes
$k$-means and commonly used hierarchical methods.
Another major option of the EMMIX algorithm allows the user to
automatically carry out a test for the smallest number of components
compatible with the data. This likelihood-based test uses the
resampling approach of McLachlan (1987)
to assess the associated $P$-value.
This option is based on the MMRESAMP subroutine of McLachlan et al. (1995).
The EMMIX algorithm also has several other options which are outlined in the
user's guide.
\section{APPLICATION OF EM ALGORITHM}
It is straightforward to find solutions of (\ref{eq:3})
using the EM algorithm of Dempster et al. (1977). For the
purpose of the application of the EM algorithm, the observed-data vector
$\by_{obs} =(\by_1^T,\,\ldots,\,\by_n^T)^T$
is regarded as being incomplete. The component-label variables
$z_{ij}$ are consequently introduced, where $z_{ij}$ is defined to be one or
zero according as $\by_j$ did or did not arise from the $i$th component of the
mixture model $(i=1,\,\ldots,\,g\,;\,j=1,\,\ldots,\,n)$.
This complete-data framework in which each observation is conceptualised as
having arisen from one of the components of the mixture is
directly applicable in those situations
where $\bY$ can be physically identified as having come from a population
which is a mixture of $g$ groups.
On putting $\bz_j=(z_{1j},\,\ldots,\,z_{gj})^T$, the complete-data vector
$\bx_{\rm c}$ is therefore given by
$$\bx_{\rm c}=(\bx_1^T,\,\ldots,\,\bx_n^T)^T,$$
where $\bx_1=(\by_1^T,\,\bz_1^T)^T,\,\ldots,\,\bx_n=(\by_n^T,\,\bz_n^T)^T$
are taken to be independent and identically distributed with
$\bz_1,\,\ldots,\,\bz_n$ being independent realizations from a multinomial
distribution consisting of one draw on $g$ categories with respective
probabilities $\pi_1,\,\ldots,\,\pi_g$. That is,
$$
\bz_1,\,\ldots,\,\bz_n\stackrel{iid}{\sim}\mbox{ Mult}_g(1,\,\bpi),
$$
where $\bpi=(\pi_1,\,\ldots,\,\pi_g)^T$.
For this specification, the complete-data log likelihood is
\begin{equation}
\log L_{\rm c}(\bPsi)=\sum_{i=1}^g \sum_{j=1}^n z_{ij}\log\{\pi_i
\phi(\by_j;\bmu_i,\bSigma_i)\}.
\label{eq:4}
\end{equation}
The EM algorithm is easy to program and proceeds iteratively in two
steps, E (for expectation) and M (for maximization); see McLachlan and Krishnan (1997)
for a recent account of the EM algorithm in a general context.
On the $(k+1)$th iteration, the E-step requires the calculation of
$$Q(\bPsi\,;\,\bPsi^{(k)})=E_{\bPsi^{(k)}}\{\log
L_{\rm c}(\bPsi)\mid\by_{obs}\},$$
the conditional expectation of the complete-data log likelihood
$\log L_{\rm c}(\bPsi)$, given the observed data $\by_{obs}$, using the
current fit $\bPsi^{(k)}$ for $\bPsi$. Since $\log L_{\rm c}(\bPsi)$
is a linear function of the unobservable component-label variables
$z_{ij}$, the E-step is effected simply by replacing $z_{ij}$ by its conditional
expectation given $\by_j$, using $\bPsi^{(k)}$ for $\bPsi$. That is,
$z_{ij}$ is replaced by
\begin{eqnarray*}
\tau_i(\by_j;\,\bPsi^{(k)}) &=& E_{\bPsi^{(k)}}\{Z_{ij}\mid\by_j\}\\
&=& {\rm pr}_{\bPsi^{(k)}}\{Z_{ij}=1\mid \by_j\} \\
&=&\frac{\pi_i \phi(\by_j;\,\bmu_i,\bSigma_i)}
{\sum_{h=1}^g \pi_h \phi(\by_j;\,\bmu_h,\bSigma_h)}\quad
(i=1,\,\ldots,\,g\,;\,j=1,\,\ldots,\,n),
\end{eqnarray*}
where $\tau_i(\by_j;\,\bPsi^{(k)})$ is the current estimate of the posterior
probability that the $j$th entity with feature vector $\by_j$ belongs to the
$i$th component $(i=1,\,\ldots,\,g\,;\,j=1,\,\ldots,\,n)$.
On the M-step of the $(k+1)$th iteration, the value of
$\bPsi$, say $\bPsi^{(k+1)}$, that maximizes $Q(\bPsi;\,\bPsi^{(k)})$ is chosen.
It follows that the updated estimates
of the mixing proportions, the component means, and the component-covariance
matrices are given explicitly by
$$\pi_i^{(k+1)}=\sum_{j=1}^n\tau_i(\by_j;\,\bPsi^{(k)})/n,$$
$$\bmu_{i}^{(k+1)}=\sum_{j=1}^n\tau_i
(\by_j;\,\bPsi^{(k)})\by_j/\sum_{j=1}^n\tau_i(\by_j;\,\bPsi^{(k)}),$$
and
\begin{equation}
\bSigma^{(k+1)}_i=
\sum_{j=1}^n\tau_i(\by_j;\,\bPsi^{(k)})(\by_j-\bmu_i^{(k+1)})
(\by_j-\bmu_i^{(k+1)})^T/\sum_{j=1}^n \tau_i(\by_j;\,\bPsi^{(k)})
\label{eq:6}
\end{equation}
for $i=1,\,\ldots,\,g$.
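For concreteness, one complete E- and M-step as given above can be sketched in NumPy. This is an illustrative translation only (EMMIX itself is a FORTRAN program, and all names here are our own):

```python
import numpy as np

def em_step(y, pi, mu, sigma):
    """One EM iteration for a g-component normal mixture model."""
    n, p = y.shape
    g = len(pi)
    # E-step: posterior probabilities tau_ij of component membership
    log_tau = np.empty((n, g))
    for i in range(g):
        diff = y - mu[i]
        inv = np.linalg.inv(sigma[i])
        _, logdet = np.linalg.slogdet(sigma[i])
        quad = np.einsum('nj,jk,nk->n', diff, inv, diff)
        log_tau[:, i] = np.log(pi[i]) - 0.5 * (p * np.log(2 * np.pi) + logdet + quad)
    log_tau -= log_tau.max(axis=1, keepdims=True)   # stabilize before exponentiating
    tau = np.exp(log_tau)
    tau /= tau.sum(axis=1, keepdims=True)
    # M-step: closed-form updates of pi_i, mu_i, Sigma_i
    new_pi, new_mu, new_sigma = [], [], []
    for i in range(g):
        t = tau[:, i]
        T = t.sum()
        m = (t[:, None] * y).sum(axis=0) / T
        d = y - m
        S = (t[:, None, None] * np.einsum('nj,nk->njk', d, d)).sum(axis=0) / T
        new_pi.append(T / n)
        new_mu.append(m)
        new_sigma.append(S)
    return np.array(new_pi), new_mu, new_sigma
```

Iterating `em_step` to convergence, from each of several starting values, yields the local maximizers among which a final fit is chosen.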
A nice feature of the EM algorithm is that the mixture likelihood
$L(\bPsi)$ is never decreased after an EM iteration. Hence
$$L(\bPsi^{(k+1)})\geq L(\bPsi^{(k)}),$$
which implies that $L(\bPsi^{(k)})$ converges to some $L^*$ whenever the
sequence of likelihood values is bounded above.
The E- and M-steps are alternated repeatedly until the
likelihood (or the parameter estimates) changes by an arbitrarily small
amount, at which point convergence is declared.
Let $\hbPsi$ be the chosen solution of the likelihood equation.
The likelihood function $L(\bPsi)$ tends to have multiple local maxima for
normal mixture models. In the case of unrestricted component covariance
matrices, $L(\bPsi)$ is unbounded, as each data point gives rise to a
singularity on the edge of the parameter space; see, for example,
McLachlan and Basford (1988, Chapter 2).
In practice, however,
consideration has to be given to the problem of relatively large local maxima
that occur as a consequence of a fitted component having a very small
(but nonzero) variance for univariate data or generalized variance
(the determinant of the covariance matrix) for multivariate data.
Such a component corresponds to a cluster containing a few data points either
relatively close together or almost lying in a lower dimensional subspace in
the case of multivariate data. There is thus a need to monitor
the relative size of the fitted mixing proportions and of the
component variances for univariate observations and of
the generalized component variances for multivariate data in an attempt to
identify these spurious local maximizers.
There is also a need to monitor the Euclidean distances between the fitted
component means to see if the implied clusters represent a real separation
between the means or whether they arise because one or more of the
clusters fall almost in a subspace of the original feature space.
\section{MIXTURES OF t-DISTRIBUTIONS}
As mentioned in Section~\ref{intro}, for many applied problems the tails of the normal distribution
are often shorter than required. Also, the estimates of the component
means and covariance matrices can be affected by observations that are
atypical of the components in the normal mixture model being fitted.
EMMIX provides a more robust approach by modelling the data by a mixture of
$t$ distributions.
The use of the ECM algorithm to fit this $t$ mixture model is described and
examples of its use are given in McLachlan and Peel (1998). With this $t$ mixture model-based approach, the normal distribution for each
component in the mixture is embedded in a wider
class of elliptically symmetric distributions with an additional parameter
called the degrees of freedom $\nu$. As $\nu$ tends to infinity, the
$t$ distribution approaches the normal distribution. Hence this parameter
$\nu$ may be viewed as a robustness tuning parameter. EMMIX has the option
either to fix the component $\nu$ parameters in advance or
to infer their values from the data for each component using the ECM algorithm.
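For completeness (the density is not displayed above; this is the standard form of the multivariate $t$ distribution used in this embedding), the $i$th component density is
$$
f(\by;\,\bmu_i,\,\bSigma_i,\,\nu_i)=
\frac{\Gamma(\frac{\nu_i+p}{2})\,|\bSigma_i|^{-1/2}}
{(\pi\nu_i)^{p/2}\,\Gamma(\frac{\nu_i}{2})
\{1+\delta(\by;\,\bmu_i,\,\bSigma_i)/\nu_i\}^{(\nu_i+p)/2}},
$$
where $\delta(\by;\,\bmu_i,\,\bSigma_i)=(\by-\bmu_i)^T\bSigma_i^{-1}(\by-\bmu_i)$
denotes the squared Mahalanobis distance between $\by$ and $\bmu_i$. As
$\nu_i\rightarrow\infty$, this density tends to the normal density
$\phi(\by;\,\bmu_i,\,\bSigma_i)$.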
\section{SPECIFICATION OF INITIAL VALUES}
It follows from the previous section that care must be taken in the choice of the root
of the likelihood equation in the case of unrestricted covariance matrices
where $L(\bPsi)$ is unbounded.
In order to fit a mixture model using the EM algorithm, an initial value
has to be specified for the vector $\bPsi$ of unknown parameters for use on the
E-step on the first iteration of the EM algorithm. Equivalently, initial values
must be specified for the posterior probabilities of component membership of
the mixture, $\tau_1(\by_j;\;\bPsi^{(0)}),\,\ldots,\,
\tau_g(\by_j;\;\bPsi^{(0)})$, for each $\by_j \,(j=1,\,\ldots,\,n)$ for use on
commencing the EM algorithm on the M-step the first time through. The latter
posterior probabilities can be specified as zero-one values, corresponding to
an outright classification of the data with respect to the $g$ components
of the mixture. In this case, it suffices
to specify the initial partition of the data. In a cluster analysis context
it is usually more appropriate to do this rather than specifying an initial
value for $\bPsi$.
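In a hypothetical implementation (names ours, not the EMMIX code), converting a specified initial partition into the zero-one posterior probabilities for use on the first M-step is immediate:

```python
import numpy as np

def partition_to_posteriors(labels, g):
    """Turn an outright classification (integer labels in 0..g-1) into the
    zero-one posterior probabilities tau_ij used to start the EM algorithm
    on the M-step the first time through."""
    n = len(labels)
    tau = np.zeros((n, g))
    tau[np.arange(n), labels] = 1.0   # one-hot row per observation
    return tau
```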
\section{EMMIX ALGORITHM}
We now give a general description of an algorithm called EMMIX, which
automatically provides a selection of starting values for this purpose if not
provided by the user.
More precise details on the EMMIX algorithm, including its
implementation, are given in the ``User's Guide to EMMIX''.
The EMMIX algorithm automatically provides
starting values for the application of the EM
algorithm by considering a selection obtained from three sources:
\begin{description}
\item [(a)] random starts,
\item [(b)] hierarchical clustering-based starts, and
\item [(c)] $k$-means clustering-based starts.
\end{description}
Concerning (a), for randomly selected starts there is an additional option
whereby the user can first subsample the data
before using a random start based on the subsample each time. This is done to
limit the effect of the central limit theorem, which would otherwise make the
randomly selected starts similar for each component in large
samples.
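One way such a subsampled random start might be implemented (an illustrative sketch of our own, not the EMMIX code; the 70 percent fraction is just an example) is to draw the subsample, partition it at random into $g$ groups, and take the group means as starting component means:

```python
import numpy as np

def random_subsample_start(y, g, frac=0.7, rng=None):
    """Draw a subsample of the data, partition it at random into g groups,
    and return the group means as starting values for the component means."""
    rng = np.random.default_rng() if rng is None else rng
    n = y.shape[0]
    idx = rng.choice(n, size=max(g, int(frac * n)), replace=False)
    sub = y[idx]
    # assign each subsampled point to a component at random
    labels = rng.integers(0, g, size=len(sub))
    means = []
    for i in range(g):
        members = sub[labels == i]
        # guard against an empty group by falling back to a random point
        means.append(members.mean(axis=0) if len(members) else
                     sub[rng.integers(len(sub))])
    return means
```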
Concerning (b), the user has the option of
using, in either standardized or unstandardized form, the results from seven
hierarchical methods (nearest neighbour, farthest neighbour, group average,
median, centroid, flexible sorting, and Ward's method).
There are several algorithm parameters that the user can optionally specify;
alternatively, default values are used.
The program fits the normal mixture model for each of the initial groupings
specified from the three sources (a) to (c).
All these computations are automatically carried out by the program. The user
only has to provide the data set, the restrictions on the component-covariance
matrices (equal, unequal, or diagonal), the extent of the selection of the
initial groupings to be used to determine starting values, and the number of
components that are to be fitted. Summary information is
automatically given as output for the final fit.
However, it is not suggested that the clustering of a data set should be
based solely on a single solution of the likelihood equation, but rather on
the various solutions considered collectively.
The default final fit is taken to be the one corresponding to the largest of
the local maxima located. However, the summary information can be recovered
for any distinct fit.
As well as the options pertaining to the automatic provision of starting
values covered above, several other options are available, including the
provision of standard errors for the fitted parameters in the mixture model,
and the bootstrapping of the likelihood ratio statistic
$\lambda$ for testing $g=g_0$ versus $g=g_0+1$ components in the mixture model,
where the value $g_0$ is specified by the user. With the latter option,
the bootstrap samples are generated parametrically from the
$g_0$-component normal mixture model with $\bPsi$ set equal to the fit
$\hbPsi_{g_0}$ for $\bPsi$ under the null hypothesis of $g_0$ components.
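The resampling step itself is simple to express. In the sketch below (our own; the model fitting is abstracted behind user-supplied callables `simulate` and `fit_stat`, which are assumptions of this sketch rather than EMMIX routines), the bootstrap P-value of the observed value of $-2\log\lambda$ is returned:

```python
import numpy as np

def bootstrap_pvalue(observed_stat, simulate, fit_stat, B=99, rng=None):
    """Resampling assessment of the P-value of -2 log lambda for g0 versus
    g0+1 components (McLachlan, 1987).  `simulate(rng)` draws a sample from
    the fitted g0-component null model; `fit_stat(sample)` fits both models
    to that sample and returns its value of -2 log lambda."""
    rng = np.random.default_rng() if rng is None else rng
    stats = np.array([fit_stat(simulate(rng)) for _ in range(B)])
    # P-value: proportion of bootstrap statistics at least as extreme as observed
    return (1 + (stats >= observed_stat).sum()) / (B + 1)
```

With $B=99$ bootstrap replications, the smallest attainable P-value is $1/100=0.01$, matching the entries reported in the example below.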
\section{EXAMPLE}
To illustrate the use of the EMMIX algorithm, we consider
a simulated bivariate sample generated from a normal mixture model with
parameters
$$\bmu_1=(0,\,0)^T,\quad
\bmu_2=(4,\,0)^T,\quad
\bmu_3=(-4,\,0)^T,$$
$$\bSigma_1=\left(\begin{array}{cc} 1 & 0\\ 0 & 1 \end{array} \right),\quad
\bSigma_2=\left(\begin{array}{cc} 1 & -0.4\\ -0.4 & 3 \end{array} \right),\quad
\bSigma_3=\left(\begin{array}{cc} 2 & 0.3\\ 0.3 & 0.5 \end{array} \right),$$
and mixing proportions $\pi_1=\pi_2=\pi_3=0.33$.
A plot of the sample with the true allocation shown is given in Figure~\ref{fig1}.
\setlength{\epsfxsize}{4in}
\begin{figure}[htb]
\begin{center}
\leavevmode
\
\epsfbox{fig1.eps}
\end{center}
\vskip -0.1in
\caption{Plot of the simulated sample with the true allocation shown}
\label{fig1}
\end{figure}
We now cluster these data, ignoring the known classification of the
data, by fitting a mixture of three normal components with 10 random starts
(using 70 percent sub-sampling of the data), 10 $k$-means starts and the default 6
hierarchical methods (with and without restrictions on the component-covariance
matrices). The resulting allocation when fitting unconstrained covariance
matrices is shown in Figure~\ref{fig2}.
\setlength{\epsfxsize}{4in}
\begin{figure}[htb]
\begin{center}
\leavevmode
\
\epsfbox{fig2.eps}
\end{center}
\vskip -0.1in
\caption{Plot of the allocation found by EMMIX with arbitrary covariance matrices for the simulated sample}
\label{fig2}
\end{figure}
When fitting unconstrained covariance matrices, EMMIX misallocated eight points
(5.3\%). The misallocation occurs on the boundary between the component
denoted by crosses and the component denoted by circles,
with EMMIX misallocating seven of the cross points as circles and one of the
circle points as a cross.
Similarly, the allocation produced when fitting equal covariance matrices is
given in Figure~\ref{fig3}. Fitting equal covariance matrices in this example results, as
would be expected, in a much larger number of misallocations.
\setlength{\epsfxsize}{4in}
\begin{figure}[htb]
\begin{center}
\leavevmode
\
\epsfbox{fig3.eps}
\end{center}
\vskip -0.1in
\caption{Plot of the allocation found by EMMIX with equal covariance matrices for the simulated sample}
\label{fig3}
\end{figure}
If the number of components is not specified, EMMIX can fit the mixture model
over a range of values of the number of components, utilizing
the bootstrap procedure to assess the P-values. The resulting output from EMMIX (fitting unconstrained
covariance matrices) is given in Table~\ref{tab1}.
\begin{table} [htb]
\centering
\begin{tabular}{|c|c|c|c|c|c|c|} \hline
No. of components $g$ & $\log L$ &$-2\log\lambda$ & AIC & BIC & AWE & $P$-value \\
\hline
1 & -636.76 & - & 1283.53 & 1298.58 & 1333.64 & - \\
2 & -612.31 & 48.92 & 1246.61 & 1279.73 & 1356.85 & 0.01\\
3 & -588.21 & 48.19 & 1210.42 & 1261.60 & 1380.78 & 0.01\\
4 & -580.79 & 14.84 & 1207.58 & 1276.82 & 1438.07 & 0.12\\
\hline
\end{tabular}
\caption{Analysis to determine the number of groups for the
simulated example}
\label{tab1}
\end{table}
The results produced by EMMIX shown in Table~\ref{tab1} concur with the true number of
components, three.
\clearpage
\section*{REFERENCES}
\noindent
Banfield, J.D., and Raftery, A. (1993). Model-based Gaussian
and non-Gaussian clustering. {\it Biometrics} {\bf 49}, 803--821.
\\
\vskip -0.6cm
\noindent
Byers, S. and Raftery, A.E. (1998). Nearest-neighbor clutter removal for
estimating features in spatial point processes.
{\it Journal of the American Statistical Association} {\bf 93}, 577--584.
\\
\vskip -0.6cm
\noindent
Campbell, J.G., Fraley, C., Murtagh, F., and Raftery, A.E. (1998).
Linear flaw detection in woven textiles using model-based clustering.
{\it Pattern Recognition Letters} {\bf 18}, 1539--1548.
\\
\vskip -0.6cm
\noindent
Cheeseman, P., and Stutz, J. (1996). Bayesian classification
(AutoClass): theory and results. In {\it Advances in Knowledge Discovery and
Data Mining}, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and
R. Uthurusamy (Eds.). Menlo Park, California: The AAAI Press,
pp. 61--83.
\\
\vskip -0.6cm
\noindent
DasGupta, A. and Raftery, A.E. (1998).
Detecting features in spatial point processes with clutter
via model-based clustering. {\it Journal of the American Statistical Association} {\bf 93}, 294--302.
\\
\vskip -0.6cm
\noindent
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977).
Maximum likelihood from incomplete data via the EM algorithm (with discussion).
{\it Journal of the Royal Statistical Society B} {\bf 39}, 1--38.
\\
\vskip -0.6cm
\noindent
Fraley, C. and Raftery, A.E. (1998).
How many clusters? Which clustering method? - Answers via model-based
cluster analysis. {\it Technical Report No. 329}. Seattle: Department
of Statistics, University of Washington.
\\
\vskip -0.6cm
\noindent
Hastie, T. and Tibshirani, R.J. (1996). Discriminant analysis
by Gaussian mixtures. {\it Journal of the Royal Statistical Society B} {\bf 58},
155--176.
\\
\vskip -0.6cm
\noindent
Haughton, D. (1997). Packages for estimating finite mixtures: a
review. {\it The American Statistician} {\bf 51}, 194--205.
\\
\vskip -0.6cm
\noindent
Hunt, L., and Jorgensen, J. (1997). Mixture model clustering: a brief
introduction to the MULTIMIX program. Unpublished manuscript.
\\
\vskip -0.6cm
\noindent
McLachlan, G.J. (1987). On bootstrapping the likelihood ratio test statistic
for the number of components in a normal mixture. {\it Applied Statistics}
{\bf 36}, 318--324.
\\
\vskip -0.6cm
\noindent
McLachlan, G.J. and Basford, K.E. (1988).
{\it Mixture Models: Inference and Applications to Clustering}.
New York: Marcel Dekker.
\\
\vskip -0.6cm
\noindent
McLachlan, G.J. and Krishnan, T. (1997).
{\it The EM Algorithm and Extensions}. New York: Wiley.
\\
\vskip -0.6cm
\noindent
McLachlan, G.J. and Peel, D. (1996). An algorithm for unsupervised
learning via normal mixture models. In {\it ISIS: Information, Statistics
and Induction in Science}, D.L. Dowe, K.B. Korb, and J.J. Oliver
(Eds.), pp. 354--363. Singapore: World Scientific.
\\
\vskip -0.6cm
\noindent
McLachlan, G.J., Peel D., Adams, P., and Basford, K.E. (1995). An algorithm
for assessing by resampling the $P$-value of the likelihood ratio test on the
number of components in normal mixture models. {\it Research Report No. 31}.
Brisbane: Centre for Statistics, The University of Queensland.
\\
\vskip -0.6cm
\noindent
McLachlan, G.J. and Peel, D. (1998). Robust cluster analysis via mixtures of
multivariate t-distributions. In {\it Lecture Notes in Computer Science}
Vol. 1451, A. Amin, D. Dori, P. Pudil, and H. Freeman (Eds.).
Berlin: Springer-Verlag, pp. 658--666.
\\
\vskip -0.6cm
\noindent
Wallace, C.S. and Boulton, D.M. (1968). An information measure for
classification. {\it Computer Journal} {\bf 11}, 185--194.
\\
\vskip -0.6cm
\noindent
Wallace, C.S. and Dowe, D.L. (1994). Intrinsic classification by MML -- the
Snob program. In {\it Proceedings of the 7th Australian Joint Conference on
Artificial Intelligence}, pp. 37--44.
\end{document}