Hypotheses, Models and Loss Functions

These notes seek to establish a precise notion of losses and functions in statistical learning using a probabilistic framework

Models and Hypotheses

Definition

We call

\[\mathcal{H}_\mathrm{gen} := \mathrm{Prob}(\mathcal{X}\times\mathcal{Y})\]

the space of generative hypotheses

A generative model is a family of generative hypotheses.

Definition

A generative model is a space \(\mathcal{M}\), along with a map:

\[\begin{split}\mathcal{M} &\overset{\Theta}\longrightarrow \mathcal{H}_\mathrm{gen} \\ f &\longmapsto \rho^f \in \mathrm{Prob}(\mathcal{X}\times\mathcal{Y})\end{split}\]

Discriminative models are similiar to generative models, but are impartial in regard to the distribution of features:

Definition

We call

\[\mathcal{H}_\mathrm{discr} := \mathrm{Maps}\bigl(\mathcal{X}, \mathrm{Prob}(\mathcal{Y})\bigl)\]

the space of discriminative hypotheses

Definition

A discriminative model is a space \(\mathcal{M}\), along with a map:

\[\begin{split}\mathcal{M} &\overset{\Theta}\longrightarrow \mathcal{H}_\mathrm{discr} \\ f &\mapsto \Bigl[x \longmapsto \rho^f_x \Bigl] \in \mathrm{Maps}\bigl(\mathcal{X},\mathrm{Prob}(\mathcal{Y})\bigl)\end{split}\]

Note

In general, \(\Theta\) is not an embedding.

Example

When \(\mathcal{Y} \simeq \mathbb{R}\), the learning problem is called regression.

Linear regression may be described as:

\[\begin{split}\mathcal{M} &\simeq (\mathbb{R}^n)^\vee \longrightarrow \mathcal{H}_\mathrm{discr} \\ f &\simeq \beta \longmapsto \bigl( x \mapsto \delta_{\beta(x)} \bigl) \in \mathrm{Maps}(\mathbb{R}^n, \mathbb{R})\end{split}\]

Bayes’ Law factors generative models into the data of a discriminative model and a probability distribution on the space of features:

Theorem

There is a natural equivalence:

\[\begin{split}\mathcal{H}_\mathrm{discr}\times\mathrm{Prob}(\mathcal{X}) &\overset{\simeq}\longrightarrow \mathcal{H}_\mathrm{gen} \\ (f, \tilde{\rho}) &\longmapsto \bigl( (x, y) \mapsto \tilde{\rho}(x) \rho_x^f(y) \bigl)\end{split}\]

Learning problems come in two flavors: supervised and unsupervised. We take the somewhat unusual approach, defining unsupervised learning as a special case of supervised learning:

Definition

A learning problem is unsupervised if:

\[\mathcal{Y} \simeq *\]

Note

Note this means supervised learning problems inherit any construction or property of supervised learning problems.

However, it may be the case that they become uninteresting in the unsupervised case. For example, there are no interesting discriminative models for unsupervised learning problems.

Warning

To deemphasize the difference between supervised and unsupervised, we’ll adopt the notation:

\[\Omega := \mathcal{X}\times\mathcal{Y}\]

Loss Functions

A loss function pairs data and models, producing a number. This both assesses models given data and allows data to act on the space of models via gradient descents.

Definition

A loss function is a map:

\[\begin{split}\mathfrak{D} \times \mathcal{M} &\overset{\mathscr{L}}\longrightarrow \mathbb{R} \\ (\rho, f) &\longmapsto \mathscr{L}_{\rho} (f)\end{split}\]

where we are viewing data, \(\rho\), as a finitely support probability distribution on \(\Omega\)

Example

A standard example comes from relative entropy:

\[\begin{split}\mathrm{Prob}(\Omega) \times \mathrm{Prob}(\Omega) &\longrightarrow \mathbb{R} \\ (\rho_A, \rho_B) &\longmapsto \mathcal{S}(\rho_A || \rho_B)\end{split}\]

and is:

\[\begin{split}\mathcal{D}(\Omega) \times \mathcal{M} \overset{\Theta}\longrightarrow &\mathrm{Prob}(\Omega) \times \mathrm{Prob}(\Omega)\overset{S( - || - )}\longrightarrow \mathbb{R} \\ (\rho^\mathrm{emp}, f) &\longmapsto \mathcal{S}( \rho^\mathrm{emp} || \rho_x)\end{split}\]

Note

Although relative entropy does not coicide with cross entropy, they differ by a term independent of the model, so that optimizing relative entropy coincides with optimizing cross entropy.

Some loss functions do not depend on the hypothesis of the underlying model. In other words, many may no reference to \(\Theta\).

Definition

In many instances, there are additional regularization terms and hyperparameters:

\[\begin{split}\mathcal{M} \times \Lambda &\overset{g}\longrightarrow \mathbb{R} \\ f, \lambda &\longmapsto g_\lambda(f)\end{split}\]

which define a regularized loss function

\[\mathscr{L}_{\rho, \lambda}(f) : = \mathscr{L}_{\rho}(f) + g_\lambda(f)\]

Example:

A standard example comes from some linear coordinates:

\[\begin{split}\mathcal{M} &\simeq \mathbb{R}^n \\ f &\mapsto \beta\end{split}\]

and \(g_\lambda = \lambda \lvert\lvert - \lvert\lvert^2\) comes from a norm with a scaling factor \(\lambda\):

\[\mathscr{L}_{\rho, \lambda}(f) : = \mathscr{L}_{\rho}(f) + \lambda \lvert \lvert \beta \lvert\lvert^2\]

Note

Regularization alters not only the loss function, but also augments the model. As a rule of thumb, this augmentation should be slight, as one normally optimzes hyperparameters though some randomized search.