ANOVA and model testing
Statistical modeling is closely linked to decision theory – exogenous variables are those we have “direct control” over, so the effect of a decision can be seen by suppressing the distribution of, and the internal correlations between, these controllable variables. The correlations that survive this “projection” are the causal ones.
This problem is related to the problem of explaining variance. Statistical modeling is more general, to be sure – the variability of a random variable involves more than just its variance – but in the special case of a normal linear model the distribution is summarized by the mean and the variance, so explaining the variance in \(Y\) through exogenous variables is equivalent to determining a statistical model.
The basic motivating fact behind ANOVA is the law of total variance: the variance of a dependent variable can be broken down, in a Pythagorean fashion, into a component explained by the exogenous variables and a residual component. This works because the explained part \(\mathrm{E}(Y\mid X)\) is uncorrelated with the residual \(Y-\mathrm{E}(Y\mid X)\).
\[\mathrm{Var}(Y)=\mathrm{Var}\left(\mathrm{E}\left(Y\mid X\right)\right)+\mathrm{E}\left(\mathrm{Var}\left(Y\mid X\right)\right)\]
This simplifies rather nicely in the case of a normal linear model, where the errors are assumed to be independent of the exogenous variables and to have constant variance.
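Spelling the simplification out (with \(\beta\) and \(\sigma^2\) denoting the coefficient vector and error variance – notation not used above, introduced here for concreteness): if \(Y=X^{\mathsf{T}}\beta+\varepsilon\) with \(\varepsilon\sim N(0,\sigma^2)\) independent of the random vector \(X\) of exogenous variables, then \(\mathrm{E}(Y\mid X)=X^{\mathsf{T}}\beta\) and \(\mathrm{Var}(Y\mid X)=\sigma^2\), so the decomposition reads
\[\mathrm{Var}(Y)=\beta^{\mathsf{T}}\,\mathrm{Var}(X)\,\beta+\sigma^{2}.\]
The explained variance is exactly the variance contributed by the exogenous variables through the regression coefficients.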
ANOVA for model testing
A very simple application of ANOVA is in assessing the “importance” of a particular exogenous variable to \(Y\), by looking at the fractions of variance explained by each exogenous variable. More generally, ANOVA can be used to test the validity of any sub-model – if a particular factor doesn’t explain much of the variance in a variable \(Y\), it can probably be discarded while still retaining a suitable model.
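To make “fraction of variance explained” precise (a standard definition, not spelled out above): the population quantity is the ratio of explained to total variance, and its sample analogue, for a model with an intercept, is the coefficient of determination,
\[\frac{\mathrm{Var}\left(\mathrm{E}\left(Y\mid X\right)\right)}{\mathrm{Var}(Y)},\qquad R^{2}=1-\frac{\mathrm{RSS}}{\mathrm{TSS}},\]
where \(\mathrm{RSS}\) is the residual sum of squares of the fitted model and \(\mathrm{TSS}\) is the total sum of squares of \(Y\) about its mean.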
Any linear model can be represented as \(\mathrm{E}(Y)\in\mathrm{span}(X)\), representing the hypothesis that the mean of \(Y\) lies in a plane, i.e. that the mean of \(Y\) is a linear function of the columns of \(X\). A sub-model \(\mathrm{E}(Y)\in\mathrm{span}(X_0)\) (where \(X_0\) is a submatrix of columns of \(X\)) represents the further hypothesis that the mean of \(Y\) does not depend on any of the variables in \(X\) except those in \(X_0\).
The extra variation in \(Y\) left unexplained by the sub-model, relative to the variation left unexplained by the full model, is then a test statistic for the sub-model (the larger this ratio, the less plausible the sub-model), and its distribution can be calculated under the sub-model as the null hypothesis.
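Writing \(\hat Y\) and \(\hat Y_0\) for the orthogonal projections of the data vector \(Y\) onto \(\mathrm{span}(X)\) and \(\mathrm{span}(X_0)\), and \(\mathrm{RSS}=\lVert Y-\hat Y\rVert^2\), \(\mathrm{RSS}_0=\lVert Y-\hat Y_0\rVert^2\) for the corresponding residual sums of squares (notation introduced here for the formula below), the Pythagorean identity for nested subspaces gives
\[\mathrm{RSS}_0=\mathrm{RSS}+\lVert\hat Y-\hat Y_0\rVert^{2},\]
so \(\mathrm{RSS}_0-\mathrm{RSS}\ge 0\) measures exactly the extra variation captured by the columns of \(X\) that the sub-model discards.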
\[F=\frac{\mathrm{RSS}_0-\mathrm{RSS}}{\mathrm{RSS}}\cdot \frac{p-r}{r-r_0} \sim F_{r-r_0,p-r}\]
Here \(p\) is the number of observations (the dimension of \(Y\)) and \(r\), \(r_0\) are the ranks of \(X\) and \(X_0\); the two factors are the relative increase in residual sum of squares and the ratio of the corresponding degrees of freedom.
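As a quick numerical illustration – a minimal sketch, not part of the text above, assuming simulated data, NumPy’s least-squares solver, and scipy.stats for the \(F\) distribution – the test can be carried out directly from the two residual sums of squares:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data: p observations; the full design X has columns [1, x1, x2],
# the sub-model X0 keeps only [1, x1] (the hypothesis "x2 doesn't matter").
p = 100
x1 = rng.normal(size=p)
x2 = rng.normal(size=p)
y = 2.0 + 1.5 * x1 + rng.normal(size=p)  # x2 genuinely irrelevant here

X = np.column_stack([np.ones(p), x1, x2])   # full model
X0 = np.column_stack([np.ones(p), x1])      # sub-model

def rss(design, response):
    """Residual sum of squares after projecting the response onto span(design)."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ beta
    return resid @ resid

RSS, RSS0 = rss(X, y), rss(X0, y)
r, r0 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X0)

# F statistic and p-value, with the sub-model as the null hypothesis
F = (RSS0 - RSS) / RSS * (p - r) / (r - r0)
p_value = stats.f.sf(F, r - r0, p - r)
print(F, p_value)
```

Since the sub-model is true in this simulation (the coefficient on x2 is zero), the resulting p-value is approximately uniform over repeated runs, and rejecting at, say, the 5% level happens about 5% of the time.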