From Samples to Probability Distributions¶
The goal of these notes is to show how to think of sample data as probability distributions satisfying a finite support condition.
The Empirical Distribution¶
Usually, sample data is thought of as a bunch of tuples \(z_i\) for \(i=0, \ldots, N\).
In other words:
Definition
Sample data is the data of a map:
\[S: I \rightarrow \Omega\]
for some finite indexing set \(I\). We also adopt the following notation:
\[z_i := S(i), \quad i \in I.\]
Example
We review how the technology of tabular data fits into this perspective. For simplicity, we’ll consider the case of a table with two columns. Let’s say there is a column with a string type. As we’re speaking at a very general level, we’ll identify strings with natural numbers via some hashing:
\[\mathrm{String} \simeq \mathbb{N}\]
Furthermore, let’s say the second column is a float, i.e. an element of \(\mathbb{R}\). In this case,
\[\Omega \simeq \mathbb{N}\times \mathbb{R}\]
Let’s further assume that our table is indexed by some set of “surrogate keys” \(I\).
In this case, the formal definition of sample data:
\[I \rightarrow \mathbb{N} \times \mathbb{R}\]
is the data of, for every \(i \in I\), a string and a float. This is, in particular, the data of a table with two columns. The general situation follows, mutatis mutandis.
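To make this concrete, here is a minimal Python sketch; the table contents, the helper name S, and the choice of hashing are all illustrative assumptions, not part of the formal setup:

```python
# A two-column table as a map S: I -> Omega, with Omega ~ N x R realized
# as (hashed string, float) pairs. The table below is made-up data.
table = {
    0: ("alice", 1.5),
    1: ("bob", -0.3),
    2: ("alice", 2.7),
}

def S(i):
    """The sample as a map I -> N x R: hash the string column to a natural number."""
    name, value = table[i]
    # Python's built-in hash is salted per process; any injective-enough
    # hashing works for this illustration.
    return (hash(name) % 2**32, value)

I = table.keys()           # the finite indexing set of surrogate keys
print({i: S(i) for i in I})
```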
Digression
It’s common to factor \(\Omega\) as:
\[\Omega \simeq \mathcal{X} \times \mathcal{Y}\]
and call \(\mathcal{X}\) the space of features, and \(\mathcal{Y}\) the space of labels.
The choice of a factorization establishes a supervised learning problem.
We view the “supervision” of a learning problem as a structure: the distinction between the features and labels (which we can think of as columns in a table) is a choice we make, or one that context provides.
Example
Standard examples to keep in mind are:
\(\mathcal{Y} = \{0, 1\}\) corresponds to a binary classification problem
\(\mathcal{Y} = J\), for a finite set \(J\), corresponds to a multiclass classification problem
\(\mathcal{Y} = P(J)\) corresponds to a labelling problem. Here \(P(-)\) denotes the power set construction.
\(\mathcal{Y} = \mathbb{R}\) corresponds to a regression problem
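As a quick illustration in code (the set \(J\) and the variable names are hypothetical), these label spaces correspond to the following Python values:

```python
# Illustrative elements of the label spaces listed above.
J = {"cat", "dog", "bird"}                # a finite set of classes

y_binary = 1                              # Y = {0, 1}: binary classification
y_multiclass = "dog"                      # Y = J: multiclass classification
y_multilabel = frozenset({"cat", "dog"})  # Y = P(J): a subset of classes
y_regression = 3.14                       # Y = R: regression
```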
Definition
Let \(\rho_I^\mathrm{unif}\) denote the uniform distribution on the set \(I\). Viewing the sample as a map allows us to efficiently generate a distribution on \(\Omega\):
\[S_* \rho_I^\mathrm{unif} \in \mathrm{Prob}(\Omega),\]
which we shall refer to as the empirical distribution associated to the sample \(S\).
Warning
These notes assume a familiarity with notation reviewed in probabilistic-notation.
Note
This construction is implicit in the construction of a histogram from data.
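The following sketch computes the empirical distribution exactly as a histogram would, by normalizing counts; the function name and the toy sample are assumptions made for illustration:

```python
from collections import Counter

def empirical_distribution(sample):
    """Pushforward of the uniform distribution on I along S: I -> Omega.

    `sample` is a mapping I -> Omega; each index carries mass 1/|I|, and
    each point of Omega accumulates the mass of its preimage under S.
    """
    n = len(sample)
    return {omega: count / n for omega, count in Counter(sample.values()).items()}

S = {0: "a", 1: "b", 2: "a", 3: "a"}
print(empirical_distribution(S))  # {'a': 0.75, 'b': 0.25}
```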
Motivation
In practice, the ordering of a sample has no meaning in terms of the problem at hand. More concretely, no learning algorithm should depend on any ordering of the data, as any such dependence would introduce superfluous variance into the model without any reciprocal reduction in bias.
Therefore, if we want to count the number of all possible samples, we must not overcount by individually counting samples differing by a permutation.
One benefit of the notion of the empirical distribution is that it naturally avoids this quotient process. In other words, it’s a convenient representation of the moduli space of samples.
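A short sketch of this permutation invariance (helper names and data are illustrative): precomposing the sample with any permutation of \(I\) leaves the empirical distribution unchanged.

```python
import random
from collections import Counter

def empirical(sample):
    # Normalized value counts: the pushforward of the uniform distribution on I.
    n = len(sample)
    return {omega: c / n for omega, c in Counter(sample.values()).items()}

S = {0: "a", 1: "b", 2: "a", 3: "c"}
perm = list(S)
random.shuffle(perm)                        # a random permutation of I
S_permuted = {i: S[j] for i, j in zip(S, perm)}

# The empirical distribution only sees the multiset of values.
assert empirical(S) == empirical(S_permuted)
```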
Note
Moreover, different choices for the distribution on \(I\) yield different “class weights” for the learning problem. This forms the basis for boosting algorithms.
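As a sketch of this reweighting (the function name and the specific weights are assumptions), replacing the uniform distribution on \(I\) with a general \(\rho_I\) gives the pushforward \(S_* \rho_I\):

```python
def weighted_distribution(sample, rho_I):
    """Pushforward S_* rho_I for an arbitrary distribution rho_I on I."""
    out = {}
    for i, omega in sample.items():
        out[omega] = out.get(omega, 0.0) + rho_I[i]
    return out

S = {0: "a", 1: "b", 2: "a"}
rho_I = {0: 0.25, 1: 0.5, 2: 0.25}         # upweight index 1 to rebalance "b"
print(weighted_distribution(S, rho_I))      # {'a': 0.5, 'b': 0.5}
```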
Note
In classical information theory, the empirical distribution is referred to as the type of the sample.
Probability Distributions with Finite Support¶
Below is a property which, in some sense, characterizes data probabilistically:
Definition
A probability distribution with finite support is a probability distribution of the form:
\[S_* \rho_I \in \mathrm{Prob}(\Omega)\]
for some sample \(S: I \rightarrow \Omega\) and \(\rho_I\in\mathrm{Prob}(I)\).
Note
In other words, when the distribution has finite support, only finitely many points have nonzero probability.
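Computationally, such a distribution can be stored exactly as a finite mapping from points to probabilities; the representation below is one natural choice, not a canonical one:

```python
# A finitely supported distribution on Omega ~ N x R as a finite dict;
# any point absent from the dict has probability zero.
rho = {(42, 1.5): 0.5, (7, -0.3): 0.25, (42, 2.7): 0.25}

def prob(rho, omega):
    """Probability of a single point under a finitely supported distribution."""
    return rho.get(omega, 0.0)

assert abs(sum(rho.values()) - 1.0) < 1e-12        # total mass is 1
print(prob(rho, (42, 1.5)), prob(rho, (0, 0.0)))   # 0.5 0.0
```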
We introduce the following notation to invite the reader to imagine sample data as forming a geometric object.
Definition
The space of sample data:
\[\mathfrak{D}(\Omega) \subseteq \mathrm{Prob}(\Omega)\]
is defined as the space of finitely supported probability distributions.
At times, we may abusively omit reference to \(\Omega\) and simply write \(\mathfrak{D}\).
The space of sample data admits a filtration/stratification by cardinality. Below are the strata of this stratification.
Definition
We denote:
\[\mathfrak{D}_{=n}(\Omega)\]
the space of distributions supported on exactly \(n\) points, and
\[\mathfrak{D}_{\leq n}(\Omega)\]
the space of distributions supported on at most \(n\) points.
Example
When \(n=1\),
\[\mathfrak{D}_{=1}(\Omega) \simeq \Omega,\]
via the identification of a point \(\omega \in \Omega\) with the Dirac distribution \(\delta_\omega\).
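In the finite-dict representation sketched earlier, the stratum containing a distribution is read off from its support size; a Dirac distribution occupies the \(n=1\) stratum, matching the identification with \(\Omega\) (function name assumed for illustration):

```python
def support_size(rho):
    """Cardinality of the support: which stratum of D(Omega) rho lies in."""
    return sum(1 for p in rho.values() if p > 0)

delta = {"a": 1.0}                           # a Dirac distribution delta_a
print(support_size(delta))                   # 1: the stratum identified with Omega
print(support_size({"a": 0.5, "b": 0.5}))    # 2
```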
Note
Note that the empirical distribution can be considered as a point in the space of sample data, in that there is a natural factorization: the construction \(S \mapsto S_* \rho_I^{\mathrm{unif}}\) lands in \(\mathfrak{D}(\Omega) \subseteq \mathrm{Prob}(\Omega)\).
Large-Sample Approximation of Distributions¶
Fix a probability distribution \(\rho \in \mathrm{Prob}(\Omega)\). This in turn generates a distribution
\[\rho^{\otimes n} \in \mathrm{Prob}(\Omega^{\times n})\]
for every \(n\), in the standard fashion.
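Concretely, drawing from \(\rho^{\otimes n}\) means making \(n\) independent draws from \(\rho\); a minimal sampling sketch, with names assumed for illustration:

```python
import random

def sample_iid(rho, n, rng=random):
    """Draw one point of Omega^n from the product distribution rho^{(x) n}."""
    points = list(rho)
    weights = [rho[p] for p in points]
    return tuple(rng.choices(points, weights=weights, k=n))

rho = {"a": 0.75, "b": 0.25}
print(sample_iid(rho, 5))  # e.g. ('a', 'a', 'b', 'a', 'a')
```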
Note
This distribution satisfies a maximum entropy principle: it maximizes entropy subject to the condition that its marginals coincide with \(\rho\):
\[(\pi_i)_* \rho^{\otimes n} = \rho\]
for every \(i\), where \(\pi_i: \Omega^{\times n} \rightarrow \Omega\) denotes the \(i\)-th projection.
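For finitely supported \(\rho\), the marginal condition can be checked directly; below, \(\rho^{\otimes 2}\) is built explicitly on the finite support and pushed forward along each projection (helper names assumed):

```python
from itertools import product

def marginal(rho_n, i):
    """Pushforward along the i-th projection Omega^n -> Omega."""
    out = {}
    for omegas, p in rho_n.items():
        out[omegas[i]] = out.get(omegas[i], 0.0) + p
    return out

rho = {"a": 0.75, "b": 0.25}
rho_2 = {(x, y): rho[x] * rho[y] for x, y in product(rho, rho)}  # rho^{(x) 2}

for i in range(2):
    assert marginal(rho_2, i) == rho   # each marginal coincides with rho
```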
The following theorem is essentially the inferential interpretation of probability (i.e. outcomes of repeated trials generate the probability distribution):
Theorem