The Statement of the Fundamental Theorem of MLEs¶
Loosely, the Fundamental Theorem of Maximum Likelihood Estimators states:
Theorem
Maximum likelihood estimators are asymptotically normal.
Making precise sense of this requires considerable work.
Recollections¶
Definition
The relative entropy:
\[\begin{split}\begin{align*} \mathcal{D}(\rho_A || \rho_B) &= \langle I_{\rho_B} - I_{\rho_A} \rangle_{\rho_A} \\ &= \langle I_{\rho_B} \rangle_{\rho_A} - \mathcal{S}(\rho_A) \end{align*}\end{split}\]
where \(I_\rho\) is the information associated to the distribution \(\rho\).
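As a concrete sanity check, here is a minimal numerical sketch of this definition for distributions on a finite sample space (the NumPy implementation and the helper name relative_entropy are illustrative choices, not part of the text):

```python
import numpy as np

def relative_entropy(rho_A, rho_B):
    """D(rho_A || rho_B) on a finite sample space, computed as <I_B>_A - S(A).

    I_rho(x) = -log rho(x) is the information; S is the Shannon entropy (nats).
    """
    rho_A, rho_B = np.asarray(rho_A, float), np.asarray(rho_B, float)
    s = rho_A > 0                                    # restrict to the support of rho_A
    cross = -np.sum(rho_A[s] * np.log(rho_B[s]))     # <I_B>_A
    entropy = -np.sum(rho_A[s] * np.log(rho_A[s]))   # S(rho_A)
    return cross - entropy

print(relative_entropy([0.5, 0.5], [0.5, 0.5]))  # 0.0: vanishes when the distributions agree
print(relative_entropy([0.9, 0.1], [0.5, 0.5]))  # > 0 otherwise
```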
We begin with a brief review of maximum likelihood estimation.
Note
Recall that, given a parametric family:
\[\begin{split}\begin{align*} \Theta &\overset{\theta}\longrightarrow \mathrm{Prob}(\Omega) \\ \theta &\longmapsto \rho_\theta \end{align*}\end{split}\]
maximum likelihood estimation provides a map:
\[\begin{split}\begin{align*} \mathfrak{D}(\Omega) &\overset{\mathrm{MLE}_\Theta}\longrightarrow \mathrm{Prob}(\Omega) \\ \rho_X &\longmapsto \hat{\rho}_\theta(X) = \mathrm{MLE}_\Theta(\rho_X) := \mathrm{argmin}_\theta \, \mathcal{D}(\rho_X || \rho_\theta) \end{align*}\end{split}\]
When the data is drawn from a probability distribution, \(\rho \in \mathrm{Prob}(\Omega)\), the MLE map gives a probability distribution on the space of probability distributions:
\[\mathrm{MLE}_*(\rho_\theta) := \hat{\rho}_\theta \in \mathrm{Prob}\bigl(\mathrm{Prob}(\Omega)\bigr)\]
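To make the \(\mathrm{argmin}\) concrete, here is a small sketch for a Bernoulli family, with a grid search standing in for \(\mathrm{argmin}_\theta\) (the helpers kl and mle_bernoulli are hypothetical names introduced only for this illustration):

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p || q) for distributions on a finite sample space (nats)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    s = p > 0
    return float(np.sum(p[s] * (np.log(p[s]) - np.log(q[s]))))

def mle_bernoulli(x):
    """MLE_Theta applied to the empirical distribution of a 0/1 sample x,
    implemented literally as argmin_theta D(rho_X || rho_theta) by grid search."""
    rho_X = np.array([np.mean(x == 0), np.mean(x == 1)])  # empirical distribution
    thetas = np.linspace(0.001, 0.999, 999)
    divergences = [kl(rho_X, [1 - t, t]) for t in thetas]
    return thetas[np.argmin(divergences)]

x = np.random.binomial(1, 0.3, size=1000)
print(mle_bernoulli(x), x.mean())  # the argmin agrees with the sample mean
```

Minimizing the relative entropy to the empirical distribution is the same as maximizing the likelihood, which is why the argmin lands on the sample mean here.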
Warning
Although \(\hat{\rho}\) is “random” (in the sense that it is a probability distribution), it is not a “random variable”. This subtle, technical point is meant to emphasize the intrinsic nature of MLEs.
However, a choice of coordinates allows us to consider \(\hat{\rho}\) a random variable.
Note
Stein’s Lemma gives the function MLEs are trying to minimize an interpretation in terms of hypothesis testing.
Geometric Preliminaries¶
Given a smooth function on a manifold (e.g. \(\mathbb{R}^n\)):
\[M \overset{f}\longrightarrow \mathbb{R}\]
along with a “dummy” metric, \(g\), we can construct a symmetric quadratic form, the Hessian of \(f\):
\[\mathrm{Hess}_g(f) \in \mathrm{Sym}^2(\mathrm{T}^*M)\]
which can be computed as second derivatives in coordinates.
In general, this form depends on the dummy metric. However, if:
\[\mathrm{d}f|_p = 0\]
then
\[\mathrm{Hess}_g(f)|_p \in \mathrm{Sym}^2(\mathrm{T}^*_p M)\]
is independent of the dummy metric.
Moreover, if \(f\) is convex, \(\mathrm{Hess}_g(f)|_p\) is a positive semi-definite symmetric quadratic form on \(\mathrm{T}_p M\), and it is positive definite when \(p\) is a nondegenerate minimum.
Given coordinates \(\varphi\), this can be computed as:
\[(\partial_i \partial_j \varphi^*f)(p) \in \mathbb{R}\]
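As an illustration, the following sketch approximates the coordinate Hessian by central finite differences at a critical point of a convex function and checks positive definiteness (the particular function, step size, and helper name hessian_fd are arbitrary choices for this example):

```python
import numpy as np

def hessian_fd(f, p, h=1e-4):
    """Finite-difference approximation of the coordinate Hessian (d_i d_j f)(p)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(p + e_i + e_j) - f(p + e_i - e_j)
                       - f(p - e_i + e_j) + f(p - e_i - e_j)) / (4 * h**2)
    return H

# a convex function with a critical point at the origin (df|_0 = 0)
f = lambda x: x[0]**2 + x[0] * x[1] + x[1]**2
H = hessian_fd(f, [0.0, 0.0])
print(H)                          # approximately [[2, 1], [1, 2]]
print(np.linalg.eigvalsh(H) > 0)  # positive definite: [True, True]
```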
Moreover, one can explicitly compute the covariance matrix appearing in the fundamental theorem using this construction, as we now explain.
Back to Statistics¶
In the setting of MLE, the dictionary is:
\[\begin{split}\begin{align*} M &:= \Theta \\ f(\rho) &:= \mathcal{D}(\rho_\theta || \rho) \in \mathrm{C}^\infty(\Theta) \end{align*}\end{split}\]
As we vary \(\theta\), we obtain for each \(\theta\) an element:
\[(\theta, f_\theta) \in \Theta \times \mathrm{C}^\infty(\Theta)\]
Each \(f_\theta\) attains its minimum at \(\rho = \rho_\theta\), where its differential vanishes, so its Hessian there is well defined and independent of the dummy metric. In other words, the function which MLEs are trying to minimize gives a positive definite quadratic form on the tangent space at \(\hat{\rho}_\theta\).
Definition
The Fisher information metric is defined as:
\[\mathbb{I}_\Theta = \left. \mathrm{Hess} \bigl( \mathcal{D}(\rho_\theta || \rho) \bigr) \right\vert_{\rho = \rho_\theta} \in \mathrm{Sym}^2(\mathrm{T}^*\Theta)\]
Note
As this metric is positive definite and symmetric, it defines a Riemannian metric on \(\Theta\), conventionally referred to as the Fisher-Rao metric.
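For example, assuming a Bernoulli model \(\rho_\theta = (1-\theta, \theta)\) (an illustrative choice), a finite-difference Hessian of \(\mathcal{D}(\rho_\theta || \rho_{\theta'})\) at \(\theta' = \theta\) recovers the familiar closed form \(1/(\theta(1-\theta))\):

```python
import numpy as np

theta = 0.3

def kl_bernoulli(p, q):
    """D(Ber(p) || Ber(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Hessian (here an ordinary second derivative) of q -> D(rho_theta || rho_q) at q = theta
h = 1e-4
f = lambda q: kl_bernoulli(theta, q)
fisher = (f(theta + h) - 2 * f(theta) + f(theta - h)) / h**2

print(fisher)                     # approximately 4.76
print(1 / (theta * (1 - theta)))  # closed form 1/(theta(1-theta)) ≈ 4.76
```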
Theorem
When the data-generating distribution \(g\) satisfies \(g \in \mathrm{Im}(\theta)\), in the limit \(n \rightarrow \infty\):
\[\hat{\rho}_\theta \sim \mathcal{N} \bigl( g, (n \cdot \mathbb{I}|_{\hat{\rho}_\theta})^{-1} \bigr)\]
where \(n\) is the number of independent samples drawn from \(g\).
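The theorem can be sanity-checked in simulation. The sketch below assumes a Bernoulli model, for which the MLE is the sample mean, and compares the spread of the estimator across repeated experiments with \((n \cdot \mathbb{I})^{-1}\) (the sample size, repetition count, and seed are arbitrary):

```python
import numpy as np

theta, n, reps = 0.3, 500, 20000
rng = np.random.default_rng(0)

# draw `reps` datasets of size n from g = Bernoulli(theta); for this family the MLE is the sample mean
samples = rng.binomial(1, theta, size=(reps, n))
mle = samples.mean(axis=1)

fisher = 1 / (theta * (1 - theta))   # Fisher information of the Bernoulli family at theta
sigma = 1 / np.sqrt(n * fisher)      # predicted asymptotic standard deviation

print(mle.var(), sigma**2)                          # empirical variance vs (n * I)^{-1}
print(np.mean(np.abs(mle - theta) < 1.96 * sigma))  # ~0.95, as the normal approximation predicts
```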
Lemma
When \(g \in \mathrm{Im}(\theta)\) (i.e. the model is correctly “specified”), MLEs are consistent.
Note
As will be seen in the next sections, the relative entropy is a generalization of the effect size between two normal distributions with identical standard deviations.
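Concretely, for two normal distributions with means \(\mu_1, \mu_2\) and a common standard deviation \(\sigma\), a standard computation gives:
\[\mathcal{D}\bigl(\mathcal{N}(\mu_1, \sigma^2) \,||\, \mathcal{N}(\mu_2, \sigma^2)\bigr) = \frac{(\mu_1 - \mu_2)^2}{2 \sigma^2} = \frac{d^2}{2}\]
where \(d = (\mu_1 - \mu_2)/\sigma\) is the effect size.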
Theorem
The function which maximum likelihood estimation minimizes, the relative entropy, is convex, and its Hessian at the minimizer is positive definite.
Corollary
The central limit theorem?