Regression dilution
Regression dilution, also known as regression attenuation, is the biasing of the estimated linear regression slope towards zero (the underestimation of its absolute value), caused by errors in the independent variable.
Consider fitting a straight line for the relationship of an outcome variable y to a predictor variable x, and estimating the slope of the line. Statistical variability, measurement error or random noise in the y variable causes uncertainty in the estimated slope, but not bias: on average, the procedure calculates the right slope. However, variability, measurement error or random noise in the x variable causes bias in the estimated slope. The greater the variance in the x measurement, the more the estimated slope is pulled towards zero, away from the true value.
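This attenuation is straightforward to see in simulation. The sketch below is illustrative only: the true slope, variances and sample size are arbitrary choices, not taken from the article. It fits ordinary least squares twice, once with noise added to y and once with noise added to x; under the classical error model the second slope is attenuated by the factor var(x) / (var(x) + var(error)).

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_slope = 100_000, 2.0

x = rng.normal(0.0, 1.0, n)                    # true predictor, variance 1
y = true_slope * x + rng.normal(0.0, 1.0, n)   # outcome with noise in y

def ols_slope(pred, out):
    """Slope of the least-squares line of `out` on `pred`."""
    return np.cov(pred, out)[0, 1] / np.var(pred, ddof=1)

# Noise in y only: the slope estimate is unbiased.
print(ols_slope(x, y))                          # ~2.0

# Noise in x (we observe w = x + u instead of x): the slope is attenuated.
w = x + rng.normal(0.0, 1.0, n)                 # measurement error, variance 1
print(ols_slope(w, y))                          # ~2.0 * 1/(1+1) = ~1.0
```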
It may seem counter-intuitive that noise in the predictor variable x induces a bias, but noise in the outcome variable y does not. Recall that linear regression is not symmetric: the line of best fit for predicting y from x is not the same as the line of best fit for predicting x from y.
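A quick numerical check of this asymmetry, in the same illustrative simulation style as above: the slope of y on x multiplied by the slope of x on y equals the squared correlation, so the two fitted lines coincide only when the correlation is perfect.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50_000)
y = 2.0 * x + rng.normal(size=50_000)

slope_yx = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # y regressed on x
slope_xy = np.cov(x, y)[0, 1] / np.var(y, ddof=1)   # x regressed on y

r = np.corrcoef(x, y)[0, 1]
print(slope_yx * slope_xy, r**2)   # equal: the lines differ unless |r| = 1
```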
How to correct for regression dilution
The case of a randomly distributed ''x'' variable
The case that the x variable arises randomly is known as the structural model or structural relationship. For example, in a medical study patients are recruited as a sample from a population, and their characteristics such as blood pressure may be viewed as arising from a random sample.

Under certain assumptions there is a known ratio between the true slope and the expected estimated slope. Frost and Thompson review several methods for estimating this ratio and hence correcting the estimated slope. The term regression dilution ratio, although not defined in quite the same way by all authors, is used for this general approach, in which the usual linear regression is fitted and a correction then applied. The reply to Frost & Thompson by Longford refers the reader to other methods, which expand the regression model to acknowledge the variability in the x variable, so that no bias arises. Fuller is one of the standard references for assessing and correcting for regression dilution.
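As a sketch of the simplest such correction, assuming the classical error model w = x + u with u independent of x and an error variance that is known or has been estimated separately: the expected naive slope equals the true slope times the reliability ratio λ = var(x) / (var(x) + var(u)), so dividing the naive slope by an estimate of λ undoes the attenuation. The function and variable names here are illustrative, not from any particular package.

```python
import numpy as np

def corrected_slope(w, y, error_var):
    """Method-of-moments correction for regression dilution.

    w         : observed (noisy) predictor values
    y         : outcome values
    error_var : variance of the measurement error in w,
                assumed known or estimated elsewhere
    """
    var_w = np.var(w, ddof=1)
    naive = np.cov(w, y)[0, 1] / var_w
    # reliability ratio: lambda = var(x) / var(w) = (var(w) - error_var) / var(w)
    reliability = (var_w - error_var) / var_w
    return naive / reliability

# Example: data generated with true slope 2 and unit measurement error.
rng = np.random.default_rng(2)
x = rng.normal(size=20_000)
w = x + rng.normal(size=20_000)
y = 2.0 * x + rng.normal(size=20_000)
print(corrected_slope(w, y, error_var=1.0))   # ~2.0
```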
Hughes shows that the regression dilution ratio methods apply approximately in survival models. Rosner shows that the ratio methods apply approximately to logistic regression models. Carroll et al. give more detail on regression dilution in nonlinear models, presenting the regression dilution ratio methods as the simplest case of regression calibration methods, in which additional covariates may also be incorporated.
In general, methods for the structural model require some estimate of the variability of the x variable. This will require repeated measurements of the x variable in the same individuals, either in a sub-study of the main data set, or in a separate data set. Without this information it will not be possible to make a correction.
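A minimal sketch of how such repeat data might be used, assuming a hypothetical design in which each individual is measured twice with independent errors of equal variance: half the variance of the within-pair differences estimates the measurement-error variance, which can then feed a correction such as the one sketched above.

```python
import numpy as np

def error_variance_from_duplicates(w1, w2):
    """Estimate the measurement-error variance from paired replicate
    measurements w1, w2 of the same underlying quantity.

    Under the classical model w = x + u, the difference w1 - w2
    has variance 2 * var(u), so half its variance estimates var(u).
    """
    return np.var(np.asarray(w1) - np.asarray(w2), ddof=1) / 2.0
```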
The case of a fixed ''x'' variable
The case that x is fixed, but measured with noise, is known as the functional model or functional relationship. See, for example, Riggs et al.
Multiple ''x'' variables
The case of multiple predictor variables subject to variability has been well-studied for linear regression, and for some non-linear regression models. Other non-linear models, such as proportional hazards models for survival analysis, have been considered only with a single predictor subject to variability.
Is correction necessary?
In statistical inference based on regression coefficients, yes; in predictive modelling applications, correction is neither necessary nor appropriate. To understand this, consider the measurement error as follows. Let y be the outcome variable, x be the true predictor variable, and w be an approximate observation of x. Frost and Thompson suggest, for example, that x may be the true, long-term blood pressure of a patient, and w may be the blood pressure observed on one particular clinic visit. Regression dilution arises if we are interested in the relationship between y and x, but estimate the relationship between y and w. Because w is measured with variability, the slope of a regression line of y on w is less than that of the regression line of y on x.

Does this matter? In predictive modelling, no. Standard methods can fit a regression of y on w without bias. There is bias only if we then use the regression of y on w as an approximation to the regression of y on x. In the example, assuming that blood pressure measurements are similarly variable in future patients, our regression line of y on w gives unbiased predictions.
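A brief simulated illustration of this point (all quantities arbitrary): a line fitted to noisy measurements w predicts outcomes for new, similarly noisy measurements better than a line using the true y-on-x slope applied to w, even though the fitted slope understates the relationship between y and the true x.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n=50_000):
    x = rng.normal(size=n)              # true predictor
    w = x + rng.normal(size=n)          # noisy measurement of x
    y = 2.0 * x + rng.normal(size=n)    # outcome, true slope 2
    return w, y

w, y = make_data()
slope = np.cov(w, y)[0, 1] / np.var(w, ddof=1)   # attenuated: ~1.0, not 2.0
intercept = y.mean() - slope * w.mean()

# Future observations measured in the same (noisy) way:
w_new, y_new = make_data()
mse_attenuated = np.mean((y_new - (intercept + slope * w_new)) ** 2)
mse_true_slope = np.mean((y_new - 2.0 * w_new) ** 2)
print(mse_attenuated, mse_true_slope)   # the attenuated fit predicts better
```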
An example of a circumstance in which correction is desired is prediction of change. Suppose the change in x is known under some new circumstance: to estimate the likely change in an outcome variable y, the slope of the regression of y on x is needed, not y on w. This arises in epidemiology. To continue the example in which x denotes blood pressure, perhaps a large clinical trial has provided an estimate of the change in blood pressure under a new treatment; then the possible effect on y, under the new treatment, should be estimated from the slope in the regression of y on x.
Another circumstance is predictive modelling in which future observations are also variable, but not "similarly variable". For example, this arises if the current data set includes blood pressure measured with greater precision than is common in clinical practice. One specific example arose when developing a regression equation based on a clinical trial, in which blood pressure was the average of six measurements, for use in clinical practice, where blood pressure is usually a single measurement.
Caveats
All of these results can be shown mathematically in the case of simple linear regression assuming normal distributions throughout.

It has been argued that a poorly executed correction for regression dilution, in particular one performed without checking the underlying assumptions, may do more damage to an estimate than no correction at all.