The Statement of the Fundamental Theorem of MLEs
-------------------------------------------------

Loosely, the Fundamental Theorem of Maximum Likelihood Estimators states:

.. admonition:: Theorem

   Maximum likelihood estimators are asymptotically normal.

Making precise sense of this requires considerable work.

Recollections
=============

.. admonition:: Definition

   The relative entropy:

   .. math::

      \begin{align*}
      \mathcal{D}(\rho_A || \rho_B) &= \langle I_{\rho_B} - I_{\rho_A} \rangle_{\rho_A} \\
      &= \langle I_{\rho_B} \rangle_{\rho_A} - \mathcal{S}(\rho_A)
      \end{align*}

   where :math:`I_\rho` is the information associated to the distribution :math:`\rho`.

We begin with a brief review.

.. note::

   Recall that, given a parametric family:

   .. math::

      \begin{align*}
      \Theta &\overset{\theta}\longrightarrow \mathrm{Prob}(\Omega) \\
      \theta &\longmapsto \rho_\theta
      \end{align*}

   maximum likelihood estimation provides a map:

   .. math::

      \begin{align*}
      \mathfrak{D}(\Omega) &\overset{\mathrm{MLE}_\Theta}\longrightarrow \mathrm{Prob}(\Omega) \\
      \rho_X &\longmapsto \hat{\rho}_\theta(X) = \mathrm{MLE}_\Theta(\rho_X) := \mathrm{argmin}_\theta \, \mathcal{D}(\rho_X || \rho_\theta)
      \end{align*}

When the data is drawn from a probability distribution, :math:`\rho \in \mathrm{Prob}(\Omega)`, the MLE map gives a probability distribution on the space of probability distributions:

.. math::

   \mathrm{MLE}_*(\rho_\theta) := \hat{\rho}_\theta \in \mathrm{Prob}(\mathrm{Prob}(\Omega))

.. warning::

   Although :math:`\hat{\rho}` is "random" (in the sense that it is a probability distribution) it is not a "random variable". This subtle, technical point is meant to emphasize the intrinsic nature of MLEs. However, a choice of coordinates allows us to consider :math:`\hat{\rho}` a random variable.

.. note::

   Stein's Lemma interprets the function MLEs are trying to minimize in terms of hypothesis testing.

Geometric Preliminaries
=======================

Given a smooth function on a manifold (e.g. :math:`\mathbb{R}^n`):

.. math::

   M \overset{f}\longrightarrow \mathbb{R}

along with a "dummy" metric, :math:`g`, we can construct a symmetric quadratic form, the Hessian of :math:`f`:

.. math::

   \mathrm{Hess}_g(f) \in \mathrm{Sym}^2(\mathrm{T}^*M)

which can be computed as second derivatives in coordinates. In general, this form depends on the dummy metric. However, if:

.. math::

   \mathrm{d}f|_p = 0

then

.. math::

   \mathrm{Hess}_g(f)|_p \in \mathrm{Sym}^2(\mathrm{T}^*_p M)

is independent of the dummy metric. Moreover, if :math:`f` is convex, :math:`\mathrm{Hess}_g(f)|_p` is a positive definite symmetric quadratic form on :math:`\mathrm{T}_p M`.

Given coordinates :math:`\varphi`, this can be computed as:

.. math::

   (\partial_i \partial_j \varphi^* f)(p) \in \mathbb{R}

Moreover, one can explicitly compute the covariance matrix of the MLE using the following construction.

Back to Statistics
==================

In the setting of MLE, the dictionary is:

.. math::

   \begin{align*}
   \Theta &:= M \\
   f(\rho) &:= \mathcal{D}(\rho_\theta || \rho) \in \mathrm{C}^\infty(\Theta)
   \end{align*}

As we vary :math:`\theta`, we obtain:

.. math::

   f_\theta \in \Theta \times \mathrm{C}^\infty(\Theta)

In other words, the function which MLEs are trying to minimize gives a positive definite quadratic form on the tangent space at :math:`\hat{\rho}_\theta`.

.. admonition:: Definition

   The Fisher information metric is defined as:

   .. math::

      \mathbb{I}_\Theta = \left. \mathrm{Hess} \bigl( \mathcal{D}(\rho_\theta || \rho) \bigr) \right\vert_{\rho = \rho_\theta} \in \mathrm{Sym}^2(\mathrm{T}^*\Theta)

.. note::

   As this metric is positive definite and symmetric, it defines a Riemannian metric on :math:`\Theta`, conventionally referred to as the Fisher-Rao metric.
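The following is a minimal numerical sketch of the definition above (not part of the original notes, and assuming a one-parameter Bernoulli family): the Hessian of :math:`\theta \mapsto \mathcal{D}(\rho_{\theta_0} || \rho_\theta)` at :math:`\theta = \theta_0`, computed by finite differences, recovers the familiar Fisher information :math:`1/(\theta_0(1 - \theta_0))`.

.. code-block:: python

   import numpy as np

   def kl_bernoulli(p0, p):
       """Relative entropy D(rho_{p0} || rho_p) for two Bernoulli distributions."""
       return p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))

   # Hessian (here a 1x1 "matrix") of theta -> D(rho_{theta_0} || rho_theta),
   # evaluated at the minimum theta = theta_0, via a central finite difference.
   theta_0, h = 0.3, 1e-4
   hess = (kl_bernoulli(theta_0, theta_0 + h)
           - 2 * kl_bernoulli(theta_0, theta_0)
           + kl_bernoulli(theta_0, theta_0 - h)) / h**2

   # Closed-form Fisher information of the Bernoulli family at theta_0.
   fisher = 1.0 / (theta_0 * (1 - theta_0))

   print(hess, fisher)  # both approximately 4.7619

Note that the finite-difference Hessian makes no reference to a choice of metric, consistent with the observation above that the Hessian at a critical point is independent of the dummy metric.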
.. admonition:: Theorem

   When :math:`g \in \mathrm{Im}(\theta)`, in the limit of :math:`n \rightarrow \infty`:

   .. math::

      \hat{\rho}_\theta \sim \mathcal{N} \bigl( g, (n \cdot \mathbb{I}|_{\hat{\rho}_\theta})^{-1} \bigr)

   where :math:`\mathbb{I}` is the Fisher information metric defined above.

.. admonition:: Lemma

   When :math:`g \in \mathrm{Im}(\theta)` (i.e. the model is correctly "specified"), MLEs are consistent.

.. note::

   As will be seen in the next sections, the relative entropy is a generalization of the effect size between two normal distributions with identical standard deviations.

.. admonition:: Theorem

   The function which maximum likelihood estimation minimizes is positive definite and convex.

.. admonition:: Corollary

   The central limit theorem?
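To illustrate the asymptotic normality theorem above, here is a minimal simulation sketch (again not from the original notes, and again assuming a Bernoulli model, for which the MLE is the sample mean): the empirical variance of the MLE across many repeated experiments matches the predicted covariance :math:`(n \cdot \mathbb{I})^{-1}`.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   theta_0, n, trials = 0.3, 2000, 5000

   # For a Bernoulli model the MLE is the sample mean; simulate it many times.
   samples = rng.binomial(1, theta_0, size=(trials, n))
   mle = samples.mean(axis=1)

   # The theorem predicts variance (n * I)^{-1} with I = 1 / (theta_0 * (1 - theta_0)).
   predicted_var = theta_0 * (1 - theta_0) / n
   print(mle.var(), predicted_var)  # both approximately 1.05e-4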