Model diagnostics
Having specified a model, one can do inference to determine its parameters from the data. However, in the general paradigm of statistical inference we have no guarantee that the specified model itself is valid (in the full paradigm of Solomonoff Induction, the model is also inferred statistically). With simple data of low dimensionality, we often “eyeball” the data to choose the model. Heuristics that help us choose or evaluate a model are called model diagnostics.
We earlier discussed ANOVA, a diagnostic for comparing sub-models of a normal linear model. A related approach, which evaluates the suitability of a normal linear model without reference to a super-model, is the Coefficient of Determination \(R^2\), defined as the fraction of variance in the response variable explained by the model.
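Concretely, for a least-squares fit that includes an intercept term this can be written (introducing notation not used above: \(\hat{Y}_i\) for the fitted values, \(\bar{Y}\) for the sample mean of the response, and \(\mathrm{SS}_{\mathrm{res}}\), \(\mathrm{SS}_{\mathrm{tot}}\) for the residual and total sums of squares) as
\[
R^2 \;=\; 1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} \;=\; 1 - \frac{\mathrm{SS}_{\mathrm{res}}}{\mathrm{SS}_{\mathrm{tot}}},
\]
so \(R^2 = 1\) corresponds to an exact fit and \(R^2 = 0\) to a model that does no better than the constant fit \(\bar{Y}\).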
Well, even an explanatory variable uncorrelated with the response will spuriously explain some of the variance in \(Y\), because the fit exploits whatever insignificant sample correlation happens to be observed – in particular, if the number of parameters \(p\) equals the number of data points \(n\), \(Y\) is fitted exactly and \(R^2 = 1\). This is not a sign of an accurate model – the model isn’t predicting anything; the fit simply has no freedom to be anything other than the observed data. So, analogous to our degree-of-freedom argument for scaling the sample variance of the residuals by \(1/(n-p)\), one may want to divide the residual and total sums of squares in \(R^2\) by their respective degrees of freedom, \(n-p\) and \(n-1\), so that each is an unbiased estimate of the corresponding error variance.
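Carrying out this rescaling (in the notation introduced above) gives the usual adjusted coefficient of determination,
\[
\bar{R}^2 \;=\; 1 - \frac{\mathrm{SS}_{\mathrm{res}}/(n-p)}{\mathrm{SS}_{\mathrm{tot}}/(n-1)} \;=\; 1 - (1 - R^2)\,\frac{n-1}{n-p},
\]
which no longer increases automatically as irrelevant explanatory variables are added, and which can even go negative when the extra parameters explain less than they would be expected to by chance.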
+leverages etc.