2.3 Supplementary Remarks
2.3.1 Choice of the Trimming Constant
As was already mentioned, the trimming constant $h$ has to satisfy
$n/2 \le h \le n$ and indeed determines the breakdown point of LTS.
The choice of this constant depends mainly on the purpose for which we want
to use LTS. There is, of course, a trade-off involved:
lower values of $h$, which are close to the optimal breakdown point choice,
lead to a higher breakdown point, while higher values improve efficiency
(if the data are not too contaminated) since more of the information stored
in the data is utilized. The maximum breakdown point is attained for
$h = [n/2] + [(p+1)/2]$.
This choice is often employed when the LTS is used for
diagnostic purposes (see Subsection 2.3.2).
The most robust choice of $h$ may also be favored
when LTS is used for comparison with some less robust estimator, e.g., the
least squares, since comparison of these two estimators can serve as a
simple check of the data and the model--if the estimates are not similar to
each other, special care should be taken throughout the subsequent analysis.
On the other hand, it may be sensible to evaluate LTS for a wide range of
values of the trimming constant and to observe how the estimate
behaves with increasing $h$, because this can provide hints on the amount
of contamination and possibly on suspicious structures in a given data set
(for example, that the data set actually contains a mixture of two
different populations).
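To make this exploration concrete, the following sketch evaluates a simple LTS fit for every admissible trimming constant. It is written in Python rather than XploRe, and the crude random-restart search with concentration steps is an illustrative assumption, not the algorithm behind the quantlets used in this chapter.

import numpy as np

def lts_fit(X, y, h, n_starts=200, n_csteps=15, seed=None):
    """Crude LTS: random elemental starts refined by concentration steps."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=p, replace=False)           # random p-point start
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        for _ in range(n_csteps):                            # C-step: refit on the
            r2 = (y - X @ beta) ** 2                         # h smallest squared
            keep = np.argsort(r2)[:h]                        # residuals
            beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()         # LTS objective
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

def lts_path(X, y, seed=0):
    """Fit LTS for every h from n/2 + 1 to n; return the coefficient path."""
    n, _ = X.shape
    return {h: lts_fit(X, y, h, seed=seed) for h in range(n // 2 + 1, n + 1)}

Plotting the coefficients stored in the path against $h$ then reveals where the estimate jumps, which is exactly the kind of behavior discussed above.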
2.3.2 LTS as a Diagnostic Tool
We have several times advocated the use of the least trimmed squares
estimator for diagnostic purposes. Therefore, a brief guidance regarding
diagnostics is provided in this subsection via an example. Let us look at
the stackloss data, which have already been analyzed many times, for example
by Draper and Smith (1966), Daniel and Wood (1971),
Carroll and Ruppert (1985), and Rousseeuw and Leroy (1987). The data consist of 21
four-dimensional observations characterizing the production of nitric acid by
the oxidation of ammonia. The stackloss ($y$) is assumed to depend on the
rate of operation ($x_1$), on the cooling water inlet temperature ($x_2$),
and on the acid concentration ($x_3$). Most of the studies dealing with
this data set found, among other things, that data points 1, 3, 4, 21, and
possibly also 2 are outliers. First, the least squares regression result
ls03.xpl
,
is reported for comparison with LTS; the corresponding residual plot is shown
in Figure 2.4 (once again, the thin blue lines represent $\pm\hat\sigma$ and
the thick blue lines correspond to $\pm 2.5\hat\sigma$). There are no significantly
large residuals with respect to the standard deviation, so without any other
diagnostic statistics one would be tempted to believe that there are no outlying
observations. In contrast, if we inspect the least trimmed squares regression,
which produces
lts03.xpl
,
our conclusion will be different. To construct a residual plot for a robust
estimator, it is necessary to use a robust estimator of scale as well,
because the presence of outliers is presumed. In the case of LTS, such a
robust scale estimator can be based, for example, on the sum of the $h$
smallest squared residuals or on the median absolute deviation,
as is the case in Figure 2.5.
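For reference, one common form of these two estimators is
\[
  \hat\sigma_{\mathrm{LTS}} = c_{h,n}\sqrt{\frac{1}{h}\sum_{i=1}^{h} r_{(i)}^2},
  \qquad
  \hat\sigma_{\mathrm{MAD}} = 1.4826\,\mathop{\mathrm{med}}_{i} |r_i|,
\]
where $r_{(1)}^2 \le \dots \le r_{(n)}^2$ are the ordered squared residuals and $c_{h,n}$ is a consistency factor; the constants $1.4826$ and $c_{h,n}$ make both estimators consistent for normally distributed errors. (The exact constants used by the quantlets are an assumption here.)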
Inspecting the residual plot in Figure 2.5 (the blue lines again represent the
$\pm\hat\sigma$ and $\pm 2.5\hat\sigma$ levels, with $\hat\sigma$ taken as the
MAD-based estimate $\hat\sigma_{\mathrm{MAD}}$),
observations 1, 2, 3, 4, and 21 become suspicious, as their residuals
are very large in the sense that they lie outside the interval
$(-2.5\hat\sigma, 2.5\hat\sigma)$. Thus, the LTS estimate provides us at the
same time with a powerful diagnostic tool. One naturally has to decide which
ratios $|r_i|/\hat\sigma$ are already doubtful, but the value 2.5 is often used
as a decision point.
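Continuing the Python sketch from Subsection 2.3.1 (the function name and the default cutoff are illustrative assumptions), this standardization can be expressed as follows:

import numpy as np

def flag_outliers(residuals, cutoff=2.5):
    """Standardize residuals by the MAD-based scale and flag suspects."""
    r = np.asarray(residuals, dtype=float)
    sigma = 1.4826 * np.median(np.abs(r))   # MAD scale, consistent at the normal
    standardized = r / sigma
    # Flag residuals outside (-cutoff * sigma, cutoff * sigma).
    return standardized, np.abs(standardized) > cutoff

Applied to the LTS residuals of the stackloss data, the returned flags should single out exactly the suspicious observations listed above.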
2.3.3 High Subsample Sensitivity
The final note on LTS concerns a broader issue that we should be aware of
whenever such a robust estimator is employed. The already mentioned high
subsample sensitivity is caused by the fact that high breakdown point
estimators search for a ``core'' subset of the data that best follows
a certain model (with all its assumptions) without taking into account
the rest of the observations. A change of some observations may then lead to a
large swing in the composition of this core subset. This might happen, for
instance, if the data are actually a mixture of two (or several)
populations, i.e., one part of the data can be explained by one
regression line, another part of the same data by a quite different
regression function, and, in addition, some observations may suit
both models relatively well
(this can happen with a real data set too; see Benáček, Jarolím, and Víšek, 1998).
In such a situation, a small change of some
observations or some parameters of the estimator can bring the estimate
from one regression function to another. Moreover, applying several
(robust) estimators in such a situation is likely to produce several rather
different estimates--see Víšek (1999b)
for a detailed discussion. Still,
it is necessary to keep in mind that this is not a shortcoming of the discussed
estimators, but of the approach taken in this case--procedures
designed to suit some theoretical model are applied to an unknown sample,
and the procedures in question just try to explain it by means of a prescribed
model.
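This flipping behavior is easy to reproduce on synthetic data. The following sketch (assuming the lts_fit function from the sketch in Subsection 2.3.1 is in scope; all data-generating numbers are arbitrary) mixes two regression lines and refits LTS for several nearby values of $h$:

import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.uniform(0, 10, n)
# A mixture of two populations following different regression lines.
y = np.empty(n)
y[: n // 2] = 1 + 2 * x[: n // 2] + rng.normal(0, 0.5, n // 2)
y[n // 2:] = 20 - 2 * x[n // 2:] + rng.normal(0, 0.5, n // 2)
X = np.column_stack([np.ones(n), x])

# With h near n/2, the "core" subset can come from either population;
# a small change of h (or of a few observations) may flip the LTS fit
# from one regression line to the other.
for h in (21, 23, 25):
    print(h, lts_fit(X, y, h, seed=0).round(2))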