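Mallows’s Cp addresses the issue of overfitting, in which model selection statistics such as the residual sum of squares never increase as more variables are added to a model. Thus, if we aim to select the model giving the smallest residual sum of squares, the model including all variables would always be selected. Instead, the Cp statistic calculated on a sample of data estimates the mean squared prediction error (MSPE) as its population target

$$E \sum_j \frac{\left(\hat{Y}_j - E(Y_j \mid X_j)\right)^2}{\sigma^2},$$

where $\hat{Y}_j$ is the fitted value from the regression model for the jth case, $E(Y_j \mid X_j)$ is the expected value for the jth case, and $\sigma^2$ is the error variance. The MSPE will not automatically get smaller as more variables are added. The optimum model under this criterion is a compromise influenced by the sample size, the effect sizes of the different predictors, and the degree of collinearity between them.

If P regressors are selected from a set of K > P, the Cp statistic for that particular set of regressors is defined as

$$C_p = \frac{SSE_P}{S^2} - N + 2P,$$

where $SSE_P$ is the error sum of squares for the model with the P regressors, $S^2$ is the residual mean square after regression on the complete set of K regressors, and N is the sample size.

An alternative definition, for a linear model with p predictors fitted to n observations, is

$$C_p = \frac{1}{n}\left(RSS + 2p\hat{\sigma}^2\right),$$

where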
RSS is the residual sum of squares on a training set of data,
p is the number of predictors,
and $\hat{\sigma}^2$ refers to an estimate of the variance associated with each response in the linear model.
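Note that this version of Cp does not give the same values as the earlier version, but the model with the smallest Cp under this definition is also the model with the smallest Cp under the earlier definition. When both versions use the same variance estimate, they differ only by a positive scale factor and an additive constant, neither of which depends on the subset being evaluated.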
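The relationship between the two versions can be checked numerically. The following is a minimal sketch, assuming the earlier definition is Cp = SSE_P / S² − N + 2P and the alternative is Cp = (RSS + 2p·σ̂²) / n, with the same count of fitted terms and the same variance estimate used in both. The function names and the residual sums of squares are hypothetical illustration values, not results from any real data set.

```python
def mallows_cp(sse_p, s2, n, p):
    """Earlier form (assumed): SSE_P / S^2 - N + 2P."""
    return sse_p / s2 - n + 2 * p


def mallows_cp_alt(rss, sigma2_hat, n, p):
    """Alternative form (assumed): (RSS + 2 * p * sigma_hat^2) / n."""
    return (rss + 2 * p * sigma2_hat) / n


# Hypothetical residual sums of squares for nested candidate models,
# keyed by the number of fitted terms p. Illustration values only.
n = 50
rss_by_p = {1: 400.0, 2: 150.0, 3: 96.0, 4: 93.0, 5: 90.0}

# Variance estimate taken from the largest candidate model (K = 5 terms).
K = max(rss_by_p)
s2 = rss_by_p[K] / (n - K)

cp = {p: mallows_cp(rss, s2, n, p) for p, rss in rss_by_p.items()}
cp_alt = {p: mallows_cp_alt(rss, s2, n, p) for p, rss in rss_by_p.items()}

best = min(cp, key=cp.get)
best_alt = min(cp_alt, key=cp_alt.get)

for p in sorted(rss_by_p):
    print(f"p={p}  Cp={cp[p]:7.2f}  alternative Cp={cp_alt[p]:6.3f}")
print("selected by earlier form:", best)
print("selected by alternative form:", best_alt)
```

The values on the two scales differ, but under these assumptions the alternative form equals (S² / n) · (Cp + N), so both forms rank the candidate models identically.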
Limitations
The Cp criterion suffers from two main limitations:
the Cp approximation is only valid for large sample sizes;
the Cp cannot handle complex collections of models, as in the variable selection problem.
Practical use
The Cp statistic is often used as a stopping rule for various forms of stepwise regression. Mallows proposed the statistic as a criterion for selecting among many alternative subset regressions. Under a model not suffering from appreciable lack of fit, Cp has expectation nearly equal to P; otherwise the expectation is roughly P plus a positive bias term. Nevertheless, even though its expectation is greater than or equal to P, there is nothing to prevent Cp < P, or even Cp < 0, in extreme cases. It is suggested that one should choose a subset that has Cp approaching P from above, for a list of subsets ordered by increasing P. In practice, the positive bias can be adjusted for by selecting a model from the ordered list of subsets such that Cp < 2P.

Since the sample-based Cp statistic is an estimate of the MSPE, using Cp for model selection does not completely guard against overfitting. For instance, it is possible that the selected model will be one in which the sample Cp was a particularly severe underestimate of the MSPE.

Model selection statistics such as Cp are generally not used blindly; rather, information about the field of application, the intended use of the model, and any known biases in the data is taken into account in the process of model selection.
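As an illustration of comparing subsets by Cp, the sketch below enumerates every subset of four candidate predictors on synthetic data and reports, for each model size, the subset with the smallest Cp alongside the Cp < 2P check. It assumes the form Cp = SSE_P / S² − N + 2P with P counting all fitted coefficients including the intercept; the data, the helper `ols_sse`, and the hand-written normal-equations solver are hypothetical stand-ins for whatever regression routine is actually in use.

```python
import itertools
import random


def ols_sse(columns, y):
    """Error sum of squares for least squares of y on an intercept plus columns."""
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in columns]
    k = len(cols)
    # Augmented normal equations [X'X | X'y].
    M = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols]
         + [sum(a * b for a, b in zip(ci, y))] for ci in cols]
    # Gaussian elimination with partial pivoting.
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, k):
            f = M[r][i] / M[i][i]
            for c in range(i, k + 1):
                M[r][c] -= f * M[i][c]
    # Back substitution for the coefficient vector.
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(M[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (M[i][k] - tail) / M[i][i]
    fitted = [sum(beta[j] * cols[j][t] for j in range(k)) for t in range(n)]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))


# Synthetic data: only predictors 0 and 1 affect the response.
random.seed(0)
n, n_pred = 60, 4
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n_pred)]
y = [1.0 + 3.0 * X[0][t] - 2.0 * X[1][t] + random.gauss(0, 1) for t in range(n)]

K = n_pred + 1                       # coefficients in the full model
s2 = ols_sse(X, y) / (n - K)         # residual mean square of the full model

results = []
for size in range(n_pred + 1):
    for subset in itertools.combinations(range(n_pred), size):
        P = size + 1                 # selected predictors plus intercept
        sse = ols_sse([X[j] for j in subset], y)
        results.append((subset, P, sse / s2 - n + 2 * P))

best = min(results, key=lambda r: r[2])

# Smallest-Cp subset for each model size, ordered by increasing P.
best_by_size = {}
for subset, P, cp in results:
    if P not in best_by_size or cp < best_by_size[P][1]:
        best_by_size[P] = (subset, cp)
for P in sorted(best_by_size):
    subset, cp = best_by_size[P]
    print(f"P={P}  subset={subset}  Cp={cp:.2f}  Cp<2P: {cp < 2 * P}")
print("smallest Cp overall:", best[0])
```

With this form of the statistic, the full model always has Cp exactly equal to K, so it carries no information about lack of fit; the comparison is only meaningful for the smaller subsets.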