Robust Statistics
INTRODUCTION

In statistics, classical methods rely heavily on assumptions which are often not met in practice. In particular, it is often assumed that the data are normally distributed, at least approximately, or that the central limit theorem can be relied on to produce normally distributed estimates. Unfortunately, when there are outliers in the data, classical methods often perform very poorly. Robust statistics seeks to provide methods that emulate classical methods, but which are not unduly affected by outliers or other small departures from model assumptions.

In order to quantify the robustness of a method, it is necessary to define some measures of robustness. Perhaps the most common of these are the ''breakdown point'' and the ''influence function'', described below.

Good books on robust statistics include those by Huber (1981), Hampel et al. (1986) and Rousseeuw and Leroy (1987). A modern treatment is given by Maronna et al. (2006). Huber's book is quite theoretical, whereas the book by Rousseeuw and Leroy is very practical (although the sections discussing software are rather out of date, the bulk of the book is still very relevant). Hampel et al. (1986) and Maronna et al. (2006) fall somewhere in the middle ground. All four of these are recommended reading, though Maronna et al. is the most up to date.

Robust parametric statistics tends to rely on replacing the normal distribution in classical methods with the t-distribution with low degrees of freedom (high kurtosis; degrees of freedom between 4 and 6 have often been found to be useful in practice) or with a mixture of two or more distributions.

EXAMPLE: SPEED OF LIGHT DATA

Gelman et al. in Bayesian Data Analysis (2004) consider a data set relating to speed of light measurements made by Simon Newcomb. The data sets for that book can be found via the Classic Data Sets page, and the book's website contains more information on the data.
Although the bulk of the data look to be more or less normally distributed, there are two obvious outliers. These outliers have a large effect on the mean, dragging it towards them and away from the center of the bulk of the data. Thus, if the mean is intended as a measure of the location of the center of the data, it is, in a sense, biased when outliers are present.

Also, the distribution of the mean is known to be asymptotically normal due to the central limit theorem. However, outliers can make the distribution of the mean non-normal even for fairly large data sets. Besides this non-normality, the mean is also inefficient in the presence of outliers, and less variable measures of location are available.

Estimation of location

The plot below shows a density plot of the speed of light data, together with a rug plot (panel (a)). Also shown is a normal QQ-plot (panel (b)). The outliers are clearly visible in these plots.

Panels (c) and (d) of the plot show the bootstrap distribution of the mean (c) and the 10% trimmed mean (d). The trimmed mean is a simple robust estimator of location that deletes a certain percentage of observations (10% here) ''from each end'' of the data, then computes the mean in the usual way. The analysis was performed in R, and 10,000 bootstrap samples were used for each of the raw and trimmed means.

The distribution of the mean is clearly much wider than that of the 10% trimmed mean (the plots are on the same scale). Also note that whereas the distribution of the trimmed mean appears to be close to normal, the distribution of the raw mean is quite skewed to the left. So, in this sample of 66 observations, only two outliers are enough to make the normal approximation promised by the central limit theorem unreliable.

Robust statistical methods, of which the trimmed mean is a simple example, seek to outperform classical statistical methods in the presence of outliers or, more generally, when underlying parametric assumptions are not quite correct.
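As a rough illustration of the comparison just described, the following Python sketch bootstraps the mean and the 10% trimmed mean. It is not the original R analysis: the sample is synthetic (64 simulated "bulk" values plus two low outliers standing in for the speed of light data), the helper names `trimmed_mean` and `bootstrap` are hypothetical, and 2,000 resamples are used rather than 10,000 to keep it quick.

```python
import random
import statistics


def trimmed_mean(values, proportion=0.10):
    """Mean after dropping floor(proportion * n) observations from each end."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))  # number of points cut from each tail
    return statistics.fmean(ordered[k:len(ordered) - k])


def bootstrap(values, estimator, n_resamples=2000, seed=0):
    """Evaluate `estimator` on resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(values)
    return [estimator(rng.choices(values, k=n)) for _ in range(n_resamples)]


# Synthetic data (an assumption, not Newcomb's measurements):
# 64 roughly normal values plus two low outliers, 66 observations in total.
data_rng = random.Random(42)
sample = [data_rng.gauss(28.0, 5.0) for _ in range(64)] + [-2.0, -44.0]

boot_mean = bootstrap(sample, statistics.fmean)
boot_trim = bootstrap(sample, trimmed_mean)

print("mean:", statistics.fmean(sample))
print("10% trimmed mean:", trimmed_mean(sample))
print("bootstrap sd of mean:        ", statistics.stdev(boot_mean))
print("bootstrap sd of trimmed mean:", statistics.stdev(boot_trim))
```

With data of this shape, the bootstrap spread of the trimmed mean should come out noticeably smaller than that of the raw mean, since the two outliers are almost always among the observations trimmed from the lower tail.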
Whilst the trimmed mean performs well relative to the mean in this example, better robust estimates are available. In fact, the mean, median and trimmed mean are all special cases of ''M-estimators''. Details appear in the sections below.

Estimation of scale

The outliers in the speed of light data do not just have an adverse effect on the mean. The usual estimate of scale is the standard deviation, and this quantity is even more badly affected by outliers because the squares of the deviations from the mean go into the calculation, so the outliers' effects are exacerbated.

The plots below show the bootstrap distributions of the standard deviation, the Median Absolute Deviation (MAD) and the Qn estimator of scale (Rousseeuw and Croux, 1993). The plots are based on 10,000 bootstrap samples for each estimator, and some normal random noise was added to the resampled data (smoothed bootstrap). Panel (a) shows the distribution of the standard deviation, (b) of the MAD and (c) of Qn.

The distribution of the standard deviation is erratic and wide, a result of the outliers. The MAD is better behaved, and Qn is a little more efficient than the MAD. This simple example demonstrates that when outliers are present, the standard deviation cannot be recommended as an estimate of scale.

Manual screening for outliers

Traditionally, statisticians would manually screen data for outliers and remove them, usually checking the source of the data to see if the outliers were erroneously recorded. Indeed, in the speed of light example above, it is easy to see and remove the two outliers prior to proceeding with any further analysis. However, in modern times, data sets often consist of large numbers of variables being measured on large numbers of experimental units. As such, manual screening for outliers is impractical.

Outliers can often interact in such a way that they mask each other. As a simple example, consider a small univariate data set containing one modest and one large outlier.
The estimated standard deviation will be grossly inflated by the large outlier. The result is that the modest outlier looks relatively normal. As soon as the large outlier is removed, the estimated standard deviation shrinks, and the modest outlier now looks unusual.

This problem of masking gets worse as the complexity of the data increases. For example, in regression problems, diagnostic plots are used to identify outliers. However, it is common that once a few outliers have been removed, others become visible. The problem is even worse in higher dimensions.

Robust methods provide automatic ways of detecting, downweighting (or removing), and flagging outliers, largely removing the need for manual screening.

Variety of applications

Although this article deals with general principles for univariate statistical methods, robust methods also exist for regression problems, generalized linear models, and parameter estimation of various distributions.

MEASURES OF ROBUSTNESS

The basic tools used to describe and measure robustness are the ''breakdown point'', the ''influence function'' and the ''sensitivity curve''.

Breakdown point

Intuitively, the breakdown point of an estimator is the proportion of incorrect observations (i.e. arbitrarily large observations) an estimator can handle before giving an arbitrarily large result. For example, given independent random variables $X_1, \dots, X_n$ and the corresponding realizations $x_1, \dots, x_n$, we can use $\overline{X}_n := \frac{X_1 + \cdots + X_n}{n}$ to estimate the mean. Such an estimator has a breakdown point of 0 because we can make $\overline{x}_n$ arbitrarily large just by changing any one of $x_1, \dots, x_n$.

The higher the breakdown point of an estimator, the more robust it is. Intuitively, we can understand that a breakdown point cannot exceed 50% because if more than half of the observations are contaminated, it is not possible to distinguish between the underlying distribution and the contaminating distribution. Therefore, the maximum breakdown point is 0.5, and there are estimators which achieve such a breakdown point.
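The idea of a breakdown point can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the 20 observations are made up, the helpers `trimmed_mean` and `contaminate` are hypothetical names, and "arbitrarily large" is stood in for by the value 1e9. The sketch replaces an increasing number of observations with that value and reports the mean, the 10% trimmed mean and the median.

```python
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping floor(proportion * n) observations from each end."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))
    return statistics.fmean(ordered[k:len(ordered) - k])


def contaminate(values, m, bad_value=1e9):
    """Replace the first m observations with an arbitrarily large value."""
    return [bad_value] * m + list(values[m:])


clean = [float(v) for v in range(1, 21)]  # 20 made-up observations: 1, ..., 20

print("m  mean  trimmed(10%)  median")
for m in (0, 1, 2, 3, 9, 10):
    data = contaminate(clean, m)
    print(m,
          statistics.fmean(data),
          trimmed_mean(data, 0.10),
          statistics.median(data))
```

In this sketch a single bad value is enough to ruin the mean, the 10% trimmed mean tolerates two bad values out of 20 but not three, and the median stays among the clean values until half of the sample has been replaced.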
For example, the median has a breakdown point of 0.5. The X% trimmed mean has a breakdown point of X%, for the chosen level of X. Huber (1981) and Maronna et al. (2006) contain more details.

Example: speed of light data

In the speed of light example, removing the two lowest observations causes the mean to change from 26.2 to 27.75, a change of 1.55. The estimate of scale produced by the Qn method is 6.3. Intuitively, we can divide this by the square root of the sample size to get a robust standard error, and we find this quantity to be 0.78. Thus, the change in the mean resulting from removing two outliers is approximately twice the robust standard error.

The 10% trimmed mean for the speed of light data is 27.43. Removing the two lowest observations and recomputing gives 27.67. Clearly, the trimmed mean is less affected by the outliers and has a higher breakdown point.

Notice that if we replace the lowest observation, -44, by -1000, the mean becomes 11.73, whereas the 10% trimmed mean is still 27.43. In many areas of applied statistics, it is common for data to be log-transformed to make them near symmetrical. Very small values become large and negative when log-transformed, and zeroes become negatively infinite. Therefore, this example is of practical interest.

Empirical influence function

The empirical influence function gives us an idea of how an estimator behaves when we change one point in the sample, and it relies only on the data (i.e. no model assumptions). The following picture is Tukey's biweight function, which, as we will later see, is an example of what a "good" (in a sense defined later on) empirical influence function should look like:

The context is the following:
# $(\Omega, \mathcal{A}, P)$ is a probability space,
# $(\mathcal{X}, \Sigma)$ is a measure space (state space),
# $\Theta$ is a parameter space of dimension $p \in \mathbb{N}^*$,
# $(\Gamma, S)$ is a measure space,
# $\gamma : \Theta \to \Gamma$ is a projection,
# $\mathcal{F}(\Sigma)$ is the set of all possible distributions on $\Sigma$.

For example,
# $(\Omega, \mathcal{A}, P)$ is any probability space,
# $(\mathcal{X}, \Sigma) = (\mathbb{R}, \mathcal{B})$,
# $\Theta = \mathbb{R} \times \mathbb{R}^+$,
# $(\Gamma, S) = (\mathbb{R}, \mathcal{B})$,
# $\gamma : \mathbb{R} \times \mathbb{R}^+ \to \mathbb{R}$ is defined by $\gamma(x, y) = x$.

The definition of an empirical influence function is as follows. Let $n \in \mathbb{N}^*$, let $X_1, \dots, X_n : (\Omega, \mathcal{A}) \to (\mathcal{X}, \Sigma)$ be i.i.d., and let $(x_1, \dots, x_n)$ be a sample from these variables. Let $T_n : (\mathcal{X}^n, \Sigma^n) \to (\Gamma, S)$ be an estimator and let $i \in \{1, \dots, n\}$. The empirical influence function $EIF_i$ at observation $i$ is defined by:

$$EIF_i : x \in \mathcal{X} \mapsto n \cdot \left( T_n(x_1, \dots, x_{i-1}, x, x_{i+1}, \dots, x_n) - T_n(x_1, \dots, x_{i-1}, x_i, x_{i+1}, \dots, x_n) \right)$$
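A minimal Python sketch of this idea follows. It assumes the common scaled form of the empirical influence function, n times the difference between the estimate with observation i replaced by x and the original estimate; the function name `empirical_influence` and the five-point sample are made up for illustration.

```python
import statistics


def empirical_influence(sample, estimator, i, x):
    """n * (estimate with sample[i] replaced by x - original estimate)."""
    modified = list(sample)
    modified[i] = x
    return len(sample) * (estimator(modified) - estimator(sample))


sample = [2.0, 3.0, 5.0, 7.0, 11.0]  # made-up data

print("x  EIF(mean)  EIF(median)")
for x in (0.0, 10.0, 100.0, 1000.0):
    print(x,
          empirical_influence(sample, statistics.fmean, 0, x),
          empirical_influence(sample, statistics.median, 0, x))
```

For the mean, the value grows without bound as x moves away from the observation it replaces, whereas for the median it stays bounded however extreme x becomes.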
What this actually means is that we are replacing the i-th value in the sample by an arbitrary value and looking at the output of the estimator.

Influence function and sensitivity curve

Instead of relying solely on the data, we could use the distribution of the random variables. The approach is quite different from that of the previous paragraph. What we are now trying to do is to see what happens to an estimator when we change the distribution of the data slightly.

Let $A$ be a convex subset of the set of all finite signed measures on $\Sigma$. We want to estimate the parameter $\theta \in \Theta$ of a distribution $F$ in $A$. Let the functional $T : A \to \Gamma$ be the asymptotic value of some estimator sequence $(T_n)_{n \in \mathbb{N}}$. We will suppose that this functional is Fisher consistent, i.e. $\forall \theta \in \Theta,\ T(F_\theta) = \theta$. This means that at the model $F$, the estimator sequence asymptotically measures the right quantity.

Let $G$ be some distribution in $A$. What happens when the data do not follow the model $F$ exactly but another, slightly different one, "going towards" $G$? We are looking at:

$$dT_{G-F}(F) = \lim_{t \to 0^+} \frac{T(tG + (1 - t)F) - T(F)}{t},$$

which is the one-sided directional derivative of $T$ at $F$, in the direction of $G - F$.

Let $x \in \mathcal{X}$, and let $\Delta_x$ be the probability measure which gives mass 1 to $\{x\}$. We choose $G = \Delta_x$. The influence function is then defined by:

$$IF(x; T; F) := \lim_{t \to 0^+} \frac{T(t\Delta_x + (1 - t)F) - T(F)}{t}$$

It describes the effect of an infinitesimal contamination at the point $x$ on the estimate we are seeking, standardized by the mass $t$ of the contamination (the asymptotic bias caused by contamination in the observations).

DESIRABLE PROPERTIES

Properties of an influence function which bestow it with desirable performance are:
Rejection point