Hypotheses, Models and Loss Functions
These notes seek to establish a precise notion of hypotheses, models, and loss functions in statistical learning, using a probabilistic framework.
Models and Hypotheses
Definition
We call
the space of generative hypotheses
A generative model is a family of generative hypotheses.
Definition
A generative model is a space \(\mathcal{M}\), along with a map:
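For concreteness, write \(\mathcal{X}\) for the space of features, \(\mathcal{Y}\) for the space of labels, and \(\mathcal{P}(-)\) for the space of probability distributions; this notation is assumed here rather than fixed above. One standard formalization takes
\[
\mathcal{H}_{\mathrm{gen}} := \mathcal{P}(\mathcal{X} \times \mathcal{Y}), \qquad \Theta : \mathcal{M} \longrightarrow \mathcal{H}_{\mathrm{gen}},
\]
so that a generative hypothesis is a joint distribution on feature–label pairs, and \(\Theta\) assigns such a joint distribution to each point of \(\mathcal{M}\).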
Discriminative models are similar to generative models, but are impartial with regard to the distribution of features:
Definition
We call
the space of discriminative hypotheses
Definition
A discriminative model is a space \(\mathcal{M}\), along with a map:
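Under the same assumed notation, a discriminative hypothesis may be taken to be a map from features to distributions on labels, and a discriminative model parameterizes a family of such maps:
\[
\mathcal{H}_{\mathrm{disc}} := \mathrm{Map}\big(\mathcal{X}, \mathcal{P}(\mathcal{Y})\big), \qquad \Theta : \mathcal{M} \longrightarrow \mathcal{H}_{\mathrm{disc}}.
\]
This makes precise the sense in which a discriminative hypothesis says nothing about how the features themselves are distributed.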
Note
In general, \(\Theta\) is not an embedding.
Example
When \(\mathcal{Y} \simeq \mathbb{R}\), the learning problem is called regression.
Linear regression may be described as:
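As a sketch, with features \(\mathcal{X} = \mathbb{R}^n\) and labels \(\mathcal{Y} = \mathbb{R}\), linear regression can be presented as the discriminative model
\[
\mathcal{M} = \mathbb{R}^n \times \mathbb{R}, \qquad \Theta(w, b)(x) = \mathcal{N}\big(\langle w, x\rangle + b,\; \sigma^2\big),
\]
where each hypothesis predicts a Gaussian over labels centered at an affine function of the features; the Gaussian form and the noise scale \(\sigma\) are illustrative choices rather than part of the text above.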
Bayes’ Law factors generative models into the data of a discriminative model and a probability distribution on the space of features:
Theorem
There is a natural equivalence:
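In the notation assumed earlier, and glossing over the measure-theoretic care needed to disintegrate a joint distribution along its feature marginal, the equivalence reads
\[
\mathcal{P}(\mathcal{X} \times \mathcal{Y}) \;\simeq\; \mathcal{P}(\mathcal{X}) \times \mathrm{Map}\big(\mathcal{X}, \mathcal{P}(\mathcal{Y})\big), \qquad p(x, y) = p(x)\, p(y \mid x),
\]
matching the informal statement: a generative hypothesis is the same data as a distribution on features together with a discriminative hypothesis.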
Learning problems come in two flavors: supervised and unsupervised. We take the somewhat unusual approach of defining unsupervised learning as a special case of supervised learning:
Definition
A learning problem is unsupervised if:
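One natural way to make this precise, assumed here as the intended convention, is to require the space of labels to be trivial:
\[
\mathcal{Y} \simeq \{\ast\},
\]
so that a generative hypothesis is just a distribution on the features \(\mathcal{X}\), while every discriminative hypothesis is forced to be the unique distribution on a point.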
Note
Note this means unsupervised learning problems inherit any construction or property of supervised learning problems.
However, such constructions may become uninteresting in the unsupervised case. For example, there are no interesting discriminative models for unsupervised learning problems.
Warning
To deemphasize the difference between supervised and unsupervised, we’ll adopt the notation:
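One reading of this, consistent with how \(\Omega\) is used below, is to write
\[
\Omega := \mathcal{X} \times \mathcal{Y},
\]
so that a data set in either setting is a distribution on \(\Omega\), and in the unsupervised case \(\Omega \simeq \mathcal{X}\).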
Loss Functions
A loss function pairs data and models, producing a number. This both assesses models given data and allows data to act on the space of models via gradient descent.
Definition
A loss function is a map:
where we are viewing data, \(\rho\), as a finitely supported probability distribution on \(\Omega\).
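Spelling this out, with \(\mathcal{P}_{\mathrm{fin}}(\Omega)\) denoting the finitely supported probability distributions on \(\Omega\) (notation assumed here), such a map has the shape
\[
L : \mathcal{P}_{\mathrm{fin}}(\Omega) \times \mathcal{M} \longrightarrow \mathbb{R},
\]
assigning a real number \(L(\rho, m)\) to a data set \(\rho\) and a point \(m\) of the model.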
Example
A standard example comes from relative entropy:
and is:
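Concretely, for a generative model \(\Theta\) and finitely supported \(\rho\), one common form of this loss, written here for the discrete case, is
\[
L(\rho, m) \;=\; D_{\mathrm{KL}}\big(\rho \,\big\|\, \Theta(m)\big) \;=\; \sum_{\omega \in \operatorname{supp}(\rho)} \rho(\omega)\, \log \frac{\rho(\omega)}{\Theta(m)(\omega)}.
\]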
Note
Although relative entropy does not coincide with cross entropy, they differ by a term independent of the model, so optimizing relative entropy coincides with optimizing cross entropy.
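The identity behind this remark is
\[
H\big(\rho, \Theta(m)\big) \;=\; H(\rho) \;+\; D_{\mathrm{KL}}\big(\rho \,\big\|\, \Theta(m)\big),
\]
where the cross entropy \(H(\rho, q) = -\sum_{\omega} \rho(\omega) \log q(\omega)\) differs from the relative entropy exactly by the entropy \(H(\rho)\) of the data, which does not depend on \(m\).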
Some loss functions do not depend on the hypothesis of the underlying model. In other words, they make no reference to \(\Theta\).
Definition
In many instances, there are additional regularization terms and hyperparameters:
which define a regularized loss function.
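A plausible reading of this definition is a family of penalty terms on the model, indexed by hyperparameters \(\lambda\), added to an existing loss:
\[
g_\lambda : \mathcal{M} \longrightarrow \mathbb{R}, \qquad L_\lambda(\rho, m) \;=\; L(\rho, m) + g_\lambda(m).
\]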
Example
A standard example comes from some linear coordinates:
and \(g_\lambda = \lambda \lVert - \rVert^2\) comes from a norm with a scaling factor \(\lambda\):
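As a concrete illustration, the following sketch evaluates such a regularized loss for a linear model; the squared-error loss, the toy data, and the function names are assumptions made for the example rather than anything fixed above.

```python
import numpy as np

def squared_error_loss(X, y, w, b):
    """Loss L(rho, m) for the linear model m = (w, b) on a finitely
    supported data set rho, given as feature rows X and labels y."""
    residuals = X @ w + b - y
    return np.mean(residuals ** 2)

def regularized_loss(X, y, w, b, lam):
    """Regularized loss L(rho, m) + g_lambda(m), with the penalty
    g_lambda = lam * ||w||^2 taken in the linear coordinates w."""
    return squared_error_loss(X, y, w, b) + lam * np.sum(w ** 2)

# Toy finitely supported data set on Omega = R^2 x R.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(regularized_loss(X, y, w=np.array([1.5, 0.5]), b=0.3, lam=0.1))
```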
Note
Regularization not only alters the loss function but also augments the model. As a rule of thumb, this augmentation should be slight, as one normally optimizes hyperparameters through some randomized search.
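For instance, a minimal randomized search over \(\lambda\) might look like the sketch below, where fit and validation_loss are hypothetical stand-ins for training the regularized model and evaluating it on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_search(fit, validation_loss, n_trials=20):
    """Randomized hyperparameter search: sample lambda at random, fit the
    model for each value, and keep the one with the smallest held-out loss."""
    best_lam, best_loss = None, np.inf
    for _ in range(n_trials):
        lam = 10.0 ** rng.uniform(-4, 2)  # log-uniform over [1e-4, 1e2]
        loss = validation_loss(fit(lam))
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam, best_loss
```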