Least-squares estimator and Gauss-Markov

Perhaps you’ve heard of least-squares estimation (e.g. from this post in the Algorithms course).

Well, the choice may have seemed a bit arbitrary to you. I mean, it feels natural, but it’s hard to actually justify why we care about least squares rather than least cubes, least line segments, etc. Squares often come up as a natural way of measuring distance, but why?

First of all, the least-squares stuff can be thought of in a linear-algebraic light: think of some parameter space for the vector \(\beta\), which is mapped by the design matrix (the independent variables) \(X\) to a subspace \(W\) of the data space (the column space of \(X\)), and the value of \(Y\) for any particular \(\beta\) is distributed around its image in \(W\). The least-squares estimate can then be thought of in terms of orthogonally projecting an observed value \(y\) onto \(W\) as \(\hat{y}\) and finding its preimage under \(X\), \(\hat{\beta}\).
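Here is a minimal numerical sketch of that picture (the design matrix, the “true” \(\beta\) and the noise level below are all made up for illustration): the least-squares estimate obtained from the normal equations coincides with what you get by orthogonally projecting \(y\) onto the column space of \(X\) and pulling the projection back through \(X\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up setup: intercept plus one covariate, a "true" beta, Gaussian noise.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix; its columns span W
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(scale=0.5, size=n)             # observed data, scattered around X beta

# Least-squares estimate via the normal equations X^T X beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalently: orthogonally project y onto W = col(X) to get y_hat,
# then find the preimage of y_hat under X.
P = X @ np.linalg.solve(X.T @ X, X.T)    # orthogonal projection matrix onto col(X)
y_hat = P @ y
beta_hat_via_projection = np.linalg.lstsq(X, y_hat, rcond=None)[0]

print(beta_hat, beta_hat_via_projection)  # the two agree up to floating-point error
```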

Then \(\hat{\beta}\) itself is a cloud/has a distribution, isn’t it? We can think about projecting the cloud of \(Y\) onto \(W\), then taking its pre-image under \(X\). The first obvious thing is that due to the spherical symmetry of the distribution of \(Y\) (because we’re assuming its covariance matrix is \(\sigma^2 I\) – do you see why this makes sense in the contexts we’re probably interested in?), the \(\hat{\beta}\) distribution has an ellipsoidal symmetry (think about what this means precisely) and is centered at \(\beta\), which trivially means \(E(\hat{\beta})=\beta\), i.e. the estimator is unbiased.
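A quick simulation sketch of this cloud (again on a made-up design matrix and noise level \(\sigma\), so the snippet stands alone): the empirical mean of the simulated \(\hat{\beta}\)’s sits at \(\beta\), and their empirical covariance matches \(\sigma^2 (X^\top X)^{-1}\), which is the ellipsoidal shape referred to above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up setup: fixed design matrix, true beta, noise level sigma.
n, sigma = 50, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

draws = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)     # a new point of the Y cloud
    draws.append(np.linalg.solve(X.T @ X, X.T @ y))    # its least-squares preimage
draws = np.array(draws)

print(draws.mean(axis=0))                    # ~ beta: the estimator is unbiased
print(np.cov(draws.T))                       # empirical covariance of beta_hat ...
print(sigma**2 * np.linalg.inv(X.T @ X))     # ... ~ sigma^2 (X^T X)^{-1}
```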

More interestingly, one can consider the covariance of \(\hat{\beta}\) and observe that the least-squares estimator is the “best” among all linear unbiased estimators. One may see that any other linear unbiased estimator \(\tilde{\beta}\) represents some oblique projection of the cloud onto \(W\) – in particular, the image of the cloud under this projection is an ellipse which necessarily “contains” the circle coming from the orthogonal projection used by the least-squares estimator (i.e. it has at least as much variance in every direction). This can be transformed back into the \(\beta\) space, and the circle is still completely contained in the ellipse. Of course this means that \(\mathrm{Var}(\tilde{\beta})-\mathrm{Var}(\hat{\beta})\) is always positive semi-definite, a result known as the Gauss-Markov theorem.
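As a sanity check, here is a small numerical sketch of that comparison (made-up data again). A linear estimator \(\tilde{\beta} = Cy\) is unbiased exactly when \(CX = I\); the particular \(C\) below is built by perturbing the least-squares matrix in a direction that annihilates \(X\), and the variance difference comes out positive semi-definite as promised.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up design matrix and noise level.
n, p, sigma = 50, 2, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])

B = np.linalg.solve(X.T @ X, X.T)     # least squares: beta_hat = B y
P = X @ B                             # orthogonal projection onto col(X)

# Another linear unbiased estimator: C = B + D with D X = 0, so C X = I still holds.
M = rng.normal(size=(p, n))           # arbitrary perturbation direction
D = M @ (np.eye(n) - P)
C = B + D
print(np.allclose(C @ X, np.eye(p)))  # True: beta_tilde = C y is unbiased

var_ols   = sigma**2 * B @ B.T        # Var(beta_hat)
var_other = sigma**2 * C @ C.T        # Var(beta_tilde)
print(np.linalg.eigvalsh(var_other - var_ols))  # eigenvalues >= 0 (up to floating point)
```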

The more general lesson is that 2-norms are fundamentally related to linear algebra – and therefore to the mean/expectation operator, because it is a linear operator: the mean of a data set is precisely the minimizer of the summed squared deviations. Indeed, the absolute-deviation norm is similarly related to the median, which minimizes the summed absolute deviations.
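A tiny illustration of that pairing, on made-up numbers: minimizing the sum of squared deviations over a grid of candidate centres lands on the mean, while minimizing the sum of absolute deviations lands on the median.

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 10.0])    # made-up numbers
grid = np.linspace(0.0, 12.0, 120001)          # candidate "centres" c

sq_loss  = ((data[None, :] - grid[:, None]) ** 2).sum(axis=1)   # sum_i (x_i - c)^2
abs_loss = np.abs(data[None, :] - grid[:, None]).sum(axis=1)    # sum_i |x_i - c|

print(grid[sq_loss.argmin()],  data.mean())      # both ~ 3.6
print(grid[abs_loss.argmin()], np.median(data))  # both ~ 2.0
```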

Adding to the list of nice things about 2-norms: they (and therefore linear algebra) are naturally related to the normal distribution. More specifically, if you assume \(Y\) to be normally distributed around \(X\beta\), then it’s easy to see that the maximum likelihood estimate is precisely the least-squares estimator. Indeed, the absolute-deviation norm would be appropriate if \(Y\) had a double-exponential (Laplace) distribution about \(X\beta\) (can you see the connection?).
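To spell out the Gaussian case (writing \(x_i\) for the rows of \(X\), so the observations are independent with \(y_i \sim N(x_i^\top \beta, \sigma^2)\)), the log-likelihood is, up to terms that don’t depend on \(\beta\),

\[
\log L(\beta) \;=\; -\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 \;+\; \text{const},
\]

so maximizing it over \(\beta\) is exactly minimizing the sum of squared residuals.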

In fact, different “methods” of linear regression correspond to different, often non-linear, projections onto \(W\). This includes Bayesian methods, or “trying to be Bayesian” methods for correcting overfitting like Lasso and Ridge regression.
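As one concrete example of such a non-orthogonal map, here is a sketch of Ridge regression on made-up data (the penalty \(\lambda\) is arbitrary): the estimate solves \((X^\top X + \lambda I)\beta = X^\top y\), so the fitted values \(X\hat{\beta}\) still live in \(W\) but are no longer the orthogonal projection of \(y\).

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data and an arbitrary penalty strength lambda.
n, p, lam = 50, 2, 5.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

beta_ols   = np.linalg.solve(X.T @ X, X.T @ y)                    # ordinary least squares
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # ridge: shrunk towards zero

print(beta_ols)
print(beta_ridge)   # X @ beta_ridge lies in col(X) but is not the orthogonal projection of y
```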

Date: 2020-05-06 Wed 00:00

Author: Abhimanyu Pallavi Sudhir
