The Statement of the Fundamental Theorem of MLEs

Loosely, the Fundamental Theorem of Maximum Likelihood Estimators states:

Theorem

Maximum likelihood estimators are asymptotically normal.

Making precise sense of this requires considerable work.

Recollections

Definition

The relative entropy:

\[\begin{split}\begin{align*} \mathcal{D}(\rho_A || \rho_B) &= \langle I_{\rho_B} - I_{\rho_A} \rangle_{\rho_A} \\ &= \langle I_{\rho_B} \rangle_{\rho_A} - \mathcal{S}(\rho_A) \end{align*}\end{split}\]

where \(I_\rho\) is the information associated to the distribution \(\rho\), and \(\mathcal{S}(\rho_A) = \langle I_{\rho_A} \rangle_{\rho_A}\) is its entropy.
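For finite sample spaces this can be computed directly; a minimal sketch, assuming the distributions below are hypothetical probability vectors:

```python
import numpy as np

def information(rho):
    """Pointwise information (surprisal): I_rho = -log(rho)."""
    return -np.log(rho)

def relative_entropy(rho_A, rho_B):
    """D(rho_A || rho_B) = <I_{rho_B} - I_{rho_A}>_{rho_A} for finite distributions."""
    rho_A, rho_B = np.asarray(rho_A, float), np.asarray(rho_B, float)
    return float(np.sum(rho_A * (information(rho_B) - information(rho_A))))

# Hypothetical probability vectors on a three-point sample space.
rho_A = [0.5, 0.3, 0.2]
rho_B = [0.4, 0.4, 0.2]
print(relative_entropy(rho_A, rho_B))   # > 0
print(relative_entropy(rho_A, rho_A))   # 0.0
```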

We begin with a brief review.

Note

Recall that, given a parametric family:

\[\begin{split}\begin{align*} \Theta &\overset{\theta}\longrightarrow \mathrm{Prob}(\Omega) \\ \theta &\longmapsto \rho_\theta \end{align*}\end{split}\]

maximum likelihood estimation provides a map:

\[\begin{split}\begin{align*} \mathfrak{D}(\Omega) &\overset{\mathrm{MLE}_\Theta}\longrightarrow \mathrm{Prob}(\Omega) \\ \rho_X &\longmapsto \hat{\rho}_\theta(X) = \mathrm{MLE}_\Theta(\rho_X) := \mathrm{argmin}_\theta \, \mathcal{D}(\rho_X \,||\, \rho_\theta) \end{align*}\end{split}\]

When the data is drawn from a probability distribution, \(\rho \in \mathrm{Prob}(\Omega)\), the MLE map gives a probability distribution on the space of probability distributions:

\[\hat{\rho} := \mathrm{MLE}_*(\rho) \in \mathrm{Prob}\bigl(\mathrm{Prob}(\Omega)\bigr)\]
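A minimal sanity check, assuming a Bernoulli model and a hypothetical i.i.d. sample: minimizing the relative entropy from the empirical distribution over a parameter grid recovers the usual closed-form MLE (the sample mean).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.3, size=200)        # hypothetical i.i.d. Bernoulli sample

def kl_bernoulli(p, q):
    """D(Ber(p) || Ber(q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p_empirical = X.mean()                    # the empirical distribution rho_X: P(1) = X.mean()
thetas = np.linspace(0.01, 0.99, 981)     # a grid over the parameter space Theta
theta_mle = thetas[np.argmin(kl_bernoulli(p_empirical, thetas))]

print(theta_mle, X.mean())                # argmin_theta D(rho_X || rho_theta) ~ sample mean
```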

Warning

Although \(\hat{\rho}\) is “random” (in the sense that it is a probability distribution), it is not a “random variable”. This subtle, technical point is meant to emphasize the intrinsic nature of MLEs.

However, a choice of coordinates allows us to consider \(\hat{\rho}\) a random variable.

Note

Stein’s Lemma gives the function which MLEs are trying to minimize an interpretation in terms of hypothesis testing.

Geometric Preliminaries

Given a smooth function on a manifold (e.g. \(\mathbb{R}^n\)):

\[M \overset{f}\longrightarrow \mathbb{R}\]

along with a “dummy” metric, \(g\), we can construct a symmetric quadratic form, the Hessian of \(f\):

\[\mathrm{Hess}_g(f) \in \mathrm{Sym}^2(\mathrm{T}^*M)\]

which can be computed as second derivatives in coordinates.

In general, this form depends on the dummy metric. However, if:

\[\mathrm{d}f|_p = 0\]

then

\[\mathrm{Hess}_g(f)|_p \in \mathrm{Sym}^2(\mathrm{T}^*_p M)\]

is independent of the dummy metric.

Moreover, if \(f\) is convex, \(\mathrm{Hess}_g(f)|_p\) is a positive semi-definite symmetric quadratic form on \(\mathrm{T}_p M\); when \(f\) has a non-degenerate minimum at \(p\), it is positive definite.

Given coordinates \(\varphi\), this can be computed as:

\[(\partial_i \partial_j \varphi^*f)(p) \in \mathbb{R}\]
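A minimal numerical sketch of this coordinate computation, assuming \(M = \mathbb{R}^2\) with standard coordinates and a hypothetical convex function with a critical point at the origin:

```python
import numpy as np

def hessian(f, p, h=1e-4):
    """Matrix of second partials (d_i d_j f)(p), via central finite differences."""
    p = np.asarray(p, float)
    n = p.size
    H = np.zeros((n, n))
    eye = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(p + h*eye[i] + h*eye[j]) - f(p + h*eye[i] - h*eye[j])
                       - f(p - h*eye[i] + h*eye[j]) + f(p - h*eye[i] - h*eye[j])) / (4 * h**2)
    return H

# A hypothetical convex function with a critical point (df = 0) at the origin.
f = lambda x: x[0]**2 + x[0]*x[1] + 2*x[1]**2
print(hessian(f, [0.0, 0.0]))   # ~ [[2, 1], [1, 4]], symmetric and positive definite
```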

Moreover, as we explain next, this construction allows one to compute the covariance matrix appearing in the Fundamental Theorem explicitly.

Back to Statistics

In the setting of MLE, the dictionary is:

\[\begin{split}\begin{align*} \Theta &:= M \\ f(\rho) &:= \mathcal{D}(\rho_\theta || \rho) \in \mathrm{C}^\infty(\Theta) \end{align*}\end{split}\]

As we vary \(\theta\), we obtain a family of such functions:

\[\theta \longmapsto f_\theta := \mathcal{D}(\rho_\theta \,||\, -) \in \mathrm{C}^\infty(\Theta)\]

In other words, the function which MLEs are trying to minimize gives a positive definite quadratic form on the tangent space at \(\hat{\rho}_\theta\).

Definition

The Fisher information metric is defined as:

\[\mathbb{I}_\Theta = \left. \mathrm{Hess} \bigl( \mathcal{D}(\rho_\theta || \rho) \bigr) \right\vert_{\rho = \rho_\theta} \in \mathrm{Sym}^2(\mathrm{T}^*\Theta)\]
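A minimal one-parameter sketch, assuming the Bernoulli family with a hypothetical \(\theta_0\): differentiating \(\mathcal{D}(\rho_{\theta_0} \,||\, \rho_\theta)\) twice at \(\theta = \theta_0\) recovers the familiar Fisher information \(1/(\theta_0(1-\theta_0))\).

```python
import numpy as np

def kl_bernoulli(p, q):
    """D(Ber(p) || Ber(q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta0, h = 0.3, 1e-4                      # hypothetical true parameter and step size
f = lambda t: kl_bernoulli(theta0, t)      # f(theta) = D(rho_{theta0} || rho_theta)

# Second derivative at the minimum theta = theta0, where df = 0.
fisher_numeric = (f(theta0 + h) - 2 * f(theta0) + f(theta0 - h)) / h**2
fisher_exact = 1.0 / (theta0 * (1 - theta0))
print(fisher_numeric, fisher_exact)        # both ~ 4.76
```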

Note

As this form is symmetric and positive definite, it defines a Riemannian metric on \(\Theta\), conventionally referred to as the Fisher-Rao metric.

Theorem

When \(g \in \mathrm{Im}(\theta)\), in the limit \(n \rightarrow \infty\):

\[\hat{\rho}_\theta \sim \mathcal{N} \bigl( g, \, (n \cdot \mathbb{I}|_{\hat{\rho}_\theta})^{-1} \bigr)\]

where \(g\) denotes the distribution from which the \(n\) i.i.d. data points are drawn, and \(\mathbb{I}\) is the Fisher information metric.
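A minimal Monte Carlo sketch, assuming a Bernoulli model with hypothetical \(\theta_0 = 0.3\), \(n = 500\), and 20,000 repetitions: across repeated samples, the empirical variance of the MLE matches \((n \cdot \mathbb{I})^{-1}\) and its mean approaches the true parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, n, trials = 0.3, 500, 20_000       # hypothetical true parameter, sample size, repetitions

# For a Bernoulli model the MLE is the sample mean; draw it `trials` times.
samples = rng.binomial(1, theta0, size=(trials, n))
theta_hat = samples.mean(axis=1)

fisher = 1.0 / (theta0 * (1 - theta0))       # Fisher information of Bernoulli(theta0)
print(theta_hat.var(), 1.0 / (n * fisher))   # both ~ 4.2e-4: the asymptotic covariance
print(theta_hat.mean(), theta0)              # consistency: the MLE concentrates on theta0
```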

Lemma

When \(g \in \mathrm{Im}(\theta)\) (i.e. the model is correctly “specified”), MLEs are consistent.

Note

As will be seen in the next sections, the relative entropy is a generalization of the effect size between two normal distributions with identical standard deviations.
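For instance, a direct computation for two normals with a common variance \(\sigma^2\) (anticipating those sections) gives:

\[\mathcal{D}\bigl(\mathcal{N}(\mu_A, \sigma^2) \,||\, \mathcal{N}(\mu_B, \sigma^2)\bigr) = \frac{(\mu_A - \mu_B)^2}{2\sigma^2} = \frac{d^2}{2}\]

where \(d = (\mu_A - \mu_B)/\sigma\) is the usual effect size.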

Theorem

The function which maximum likelihood estimation minimizes is convex, and its Hessian at the minimizer is positive definite.

Corollary

The central limit theorem (for the sample mean, which is the MLE of a Gaussian location family).