2.1 Robust Regression
2.1.1 Introduction
The classical least squares (LS) estimator is widely used in regression
analysis because of the ease of its computation and its long tradition.
Unfortunately, it is quite sensitive to data contamination, which is all the
more serious because outliers and other deviations from the standard linear
regression model (for which the least squares method is best suited) appear
quite frequently in real data. The danger that outlying observations, in the
direction of either the dependent or the explanatory variables, pose to least
squares regression is that they can have a strong adverse effect on the
estimate and yet remain unnoticed, especially when higher-dimensional data
are analyzed.
Therefore, statistical techniques that are able to cope with or detect
outlying observations have been developed. One of them is the least trimmed
squares estimator.
The methods designed to treat contaminated data follow one of two
principles. They either detect highly influential observations first and
then apply a classical estimation procedure to the ``cleaned'' data, or they
are designed so that the resulting regression estimates are not easily
influenced by contamination. Before discussing these methods, especially the
latter kind, let us exemplify the sensitivity of the least squares estimator
to outlying observations.
The data set phonecal serves this purpose well. The data set, which comes
from the Belgian Statistical Survey and was analyzed by Rousseeuw and Leroy
(1987), describes the number of international phone calls from Belgium in
the years 1950-1973. The result of the least squares regression is depicted
in Figure 2.1. There is heavy contamination caused by a different
measurement system used in the years 1964-1969 and in parts of 1963 and
1970: instead of the number of phone calls, the total number of minutes of
these calls was reported. The effect of this contamination is immediately
visible: the estimated regression line follows neither the mild upward trend
in the rest of the data nor any other recognizable pattern. One could argue
that the contamination was quite high and evident from a brief inspection of
the data. However, a similar effect can be caused even by a single
observation, and moreover, outlying observations need not be easily
recognizable when the analyzed data are multidimensional. To give an
example, an artificial data set consisting of 10 observations and one
outlier is used.
The effect of a single outlier can be seen in Figure 2.2: while the blue
line represents the underlying model, the thick red line shows the least
squares estimate. Moreover, the same figure demonstrates that the residual
plot need not have any outlier-detection power (the thin and thick blue
lines mark the corresponding residual tolerance bands).
Figure 2.2: Least squares regression with one outlier and the corresponding
residual plot (ls02.xpl).
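To illustrate the point numerically, the sketch below, a minimal stand-in
for the quantlet ls02.xpl rather than its reproduction, fits ordinary least
squares to ten points from a linear model and then to the same points
augmented by a single outlier; the coefficients, noise level, and outlier
position are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Ten "good" observations from an assumed model y = 1 + 2x + small noise.
x = np.linspace(0.0, 1.0, 10)
y = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=10)

# One outlying observation in the direction of the dependent variable.
x_all = np.append(x, 0.9)
y_all = np.append(y, 10.0)

def ols(x, y):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

print("clean fit:       ", ols(x, y))          # close to (1, 2)
print("fit with outlier:", ols(x_all, y_all))  # intercept and slope distorted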
As statisticians have long been aware of the threats posed by highly
influential observations, they have been trying to develop procedures that
help identify such observations and provide ``outlier-resistant'' estimates.
There are two ways this goal can be achieved. The first relies on some kind
of regression diagnostics to identify highly influential data points. Having
identified suspicious data points, one can remove them and subsequently
apply classical regression methods. These methods are not the focus of this
chapter. The other strategy, which will be discussed here, is to employ
estimation techniques based on so-called robust statistics. These robust
estimation methods are designed so that they are not easily endangered by
data contamination. Furthermore, a subsequent analysis of the regression
residuals from such a robust fit can hint at outlying observations.
Consequently, robust regression methods can serve as diagnostic tools as
well.
2.1.2 High Breakdown Point Estimators
Within the theory of robustness, several concepts exist. They range from the
original minimax approach introduced in Huber (1964) and the approach based
on the influence function (Hampel et al.; 1986) to high breakdown point
procedures (Hampel; 1971), that is, procedures that are able to handle
highly contaminated data. The last concept will be of interest here, as the
least trimmed squares estimator belongs to, and was developed as, a high
breakdown point method. To formalize the capability of an estimator to
resist some amount of contamination in the data, the breakdown point was
introduced. For simplicity of exposition, we present here one of its
finite-sample versions, suggested by Donoho and Huber (1983):
Take an arbitrary sample of $n$ data points,
$Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, and let $T$ be a regression
estimator; that is, applying $T$ to the sample $Z$ yields an estimate
$\hat{\beta} = T(Z)$ of the regression coefficients $\beta$. Then the
breakdown point of the estimator $T$ at $Z$ is defined by
$$
\varepsilon^*_n(T, Z) = \min\left\{ \frac{m}{n} :
\sup_{Z_m} \left\| T(Z_m) - T(Z) \right\| = \infty \right\},
\qquad (2.1)
$$
where the sample $Z_m$ is created from the original sample $Z$ by replacing
$m$ of its observations by arbitrary values.
The breakdown point usually does not depend on the sample $Z$. To give an
example, it follows immediately from the definition that in a
one-dimensional location model the finite-sample breakdown point of the
arithmetic mean equals $1/n$, which tends to zero, while for the median it
is approximately $1/2$. Actually, a breakdown point of $1/2$ is the highest
that can be achieved at all: if the amount of contamination is higher, it is
not possible to decide which part of the data is the correct one. Such a
result is proved, for example, in Theorem 4, Chapter 3 of Rousseeuw and
Leroy (1987) for the case of regression equivariant estimators (the upper
bound on $\varepsilon^*_n$ in this case is actually
$(\lfloor (n-p)/2 \rfloor + 1)/n$, where $\lfloor z \rfloor$ denotes the
integer part of $z$ and $p$ the number of regression coefficients).
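The following small sketch illustrates definition (2.1) in the
one-dimensional location model, where the estimator simply maps a sample to
a single number: replacing one observation by an arbitrary value already
carries the arithmetic mean away, while the median withstands the
replacement of almost half of the sample. The sample values are purely
illustrative.

# Breakdown behavior of the mean versus the median under replacement.
import numpy as np

z = np.arange(1.0, 11.0)      # a clean sample of n = 10 points

z_one_bad = z.copy()
z_one_bad[-1] = 1e9           # replace one observation by an arbitrary value
print(np.mean(z_one_bad))     # explodes: the mean has breakdown point 1/n
print(np.median(z_one_bad))   # still 5.5

z_half_bad = z.copy()
z_half_bad[-4:] = 1e9         # replace 4 of the 10 observations
print(np.median(z_half_bad))  # still 5.5; the median resists up to ~n/2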
Many estimators have been proposed with the aim of a high breakdown point,
that is, one close to the upper bound, although some of them were not
entirely successful in this respect because of their sensitivity to specific
kinds of data contamination. Among the truly high breakdown point estimators
that attain the above-mentioned upper bound are the least median of squares
(LMS) estimator (Rousseeuw; 1984), which minimizes the median of the squared
residuals, and the least trimmed squares (LTS) estimator (Rousseeuw; 1985),
which takes as its objective function the sum of the $h$ smallest squared
residuals and was proposed as a remedy for the low asymptotic efficiency of
LMS.
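To make the LTS objective concrete, the sketch below minimizes the sum of
the $h$ smallest squared residuals by drawing random elemental subsets and
applying concentration steps, that is, repeatedly refitting least squares on
the $h$ observations with the smallest squared residuals, in the spirit of
the FAST-LTS algorithm of Rousseeuw and Van Driessen. The function name, the
default $h$, and the numbers of starts and steps are illustrative choices,
not the exact algorithm behind the quantlets used in this chapter.

# A simplified LTS sketch via random elemental subsets and C-steps.
import numpy as np

def lts_fit(X, y, h=None, n_starts=200, n_csteps=10, seed=0):
    """LTS regression for y = X @ beta (X should include an intercept column)."""
    n, p = X.shape
    if h is None:
        h = (n + p + 1) // 2          # a common default, roughly n/2
    rng = np.random.default_rng(seed)

    def ls(idx):
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        return beta

    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        # Elemental start: exact fit through p randomly chosen observations.
        beta = ls(rng.choice(n, size=p, replace=False))
        for _ in range(n_csteps):     # concentration (C-)steps
            r2 = (y - X @ beta) ** 2
            beta = ls(np.argsort(r2)[:h])
        # LTS objective: sum of the h smallest squared residuals.
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta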
Before proceeding to the definition and a more detailed discussion of the
least trimmed squares estimator, let us show its behavior when applied to
the phonecal data used in the previous section. In Figure 2.3 we can see two
estimated regression lines: the thick red line corresponding to the LTS
estimate and, for comparison, the thin blue line depicting the least squares
result. While the least squares estimate is spoilt by the outliers from the
years 1963-1970, the least trimmed squares regression line is not affected
and outlines the trend one would consider the right one.
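Since the phonecal observations are not reproduced in this text, the
comparison of the two estimators can be mimicked on a synthetic series with
a contaminated block of years, reusing the lts_fit sketch from above; all
values below are artificial.

import numpy as np

rng = np.random.default_rng(1)
t = np.arange(1950, 1974, dtype=float)                         # years 1950-1973
calls = 0.1 * (t - 1950) + rng.normal(scale=0.1, size=t.size)  # mild upward trend
calls[(t >= 1964) & (t <= 1969)] += 15.0                       # gross outliers

X = np.column_stack([np.ones_like(t), t - 1950])
beta_ls, *_ = np.linalg.lstsq(X, calls, rcond=None)
beta_lts = lts_fit(X, calls)      # sketch defined above
print("LS: ", beta_ls)            # pulled toward the contaminated block
print("LTS:", beta_lts)           # close to the clean trend (0, 0.1)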