Categorical distribution

In probability theory and statistics, a categorical distribution is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories, with the probability of each category separately specified. There is no innate underlying ordering of these outcomes, but numerical labels are often attached for convenience in describing the distribution. The K-dimensional categorical distribution is the most general distribution over a K-way event; any other discrete distribution over a size-K sample space is a special case. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1.
The categorical distribution is the generalization of the Bernoulli distribution for a categorical random variable, i.e. for a discrete variable with more than two possible outcomes, such as the roll of a die. On the other hand, the categorical distribution is a special case of the multinomial distribution, in that it gives the probabilities of potential outcomes of a single drawing rather than multiple drawings.

Terminology

Occasionally, the categorical distribution is termed the "discrete distribution". However, this properly refers not to one particular family of distributions but to a general class of distributions.
In some fields, such as machine learning and natural language processing, the categorical and multinomial distributions are conflated, and it is common to speak of a "multinomial distribution" when a "categorical distribution" would be more precise. This imprecise usage stems from the fact that it is sometimes convenient to express the outcome of a categorical distribution as a "1-of-K" vector rather than as an integer in the range 1 to K; in this form, a categorical distribution is equivalent to a multinomial distribution for a single observation.
However, conflating the categorical and multinomial distributions can lead to problems. For example, in a Dirichlet-multinomial distribution, which arises commonly in natural language processing models as a result of collapsed Gibbs sampling where Dirichlet distributions are collapsed out of a hierarchical Bayesian model, it is very important to distinguish categorical from multinomial. The joint distribution of the same variables with the same Dirichlet-multinomial distribution has two different forms depending on whether it is characterized as a distribution whose domain is over individual categorical nodes or over multinomial-style counts of nodes in each particular category. Both forms have very similar-looking probability mass functions, which both make reference to multinomial-style counts of nodes in a category. However, the multinomial-style PMF has an extra factor, a multinomial coefficient, that is a constant equal to 1 in the categorical-style PMF. Confusing the two can easily lead to incorrect results in settings where this extra factor is not constant with respect to the distributions of interest. The factor is frequently constant in the complete conditionals used in Gibbs sampling and the optimal distributions in variational methods.

Formulating distributions

A categorical distribution is a discrete probability distribution whose sample space is the set of k individually identified items. It is the generalization of the Bernoulli distribution for a categorical random variable.
In one formulation of the distribution, the sample space is taken to be a finite sequence of integers. The exact integers used as labels are unimportant; they might be $\{0, 1, \ldots, k-1\}$ or $\{1, 2, \ldots, k\}$ or any other arbitrary set of values. In the following descriptions, we use $\{1, 2, \ldots, k\}$ for convenience, although this disagrees with the convention for the Bernoulli distribution, which uses $\{0, 1\}$. In this case, the probability mass function f is:
$$f(x = i \mid \boldsymbol{p}) = p_i,$$
where $\boldsymbol{p} = (p_1, \ldots, p_k)$, $p_i$ represents the probability of seeing element i and $\sum_{i=1}^{k} p_i = 1$.
Another formulation that appears more complex but facilitates mathematical manipulations is as follows, using the Iverson bracket:
$$f(x \mid \boldsymbol{p}) = \prod_{i=1}^{k} p_i^{[x = i]},$$
where $[x = i]$ evaluates to 1 if $x = i$, 0 otherwise. There are various advantages of this formulation, e.g.:

It is easier to write out the likelihood function of a set of independent identically distributed categorical variables.
It connects the categorical distribution with the related multinomial distribution.
It shows why the Dirichlet distribution is the conjugate prior of the categorical distribution, and allows the posterior distribution of the parameters to be calculated.

Yet another formulation makes explicit the connection between the categorical and multinomial distributions by treating the categorical distribution as a special case of the multinomial distribution in which the parameter n of the multinomial distribution is fixed at 1. In this formulation, the sample space can be considered to be the set of 1-of-K encoded random vectors x of dimension k having the property that exactly one element has the value 1 and the others have the value 0. The particular element having the value 1 indicates which category has been chosen. The probability mass function f in this formulation is:
$$f(\mathbf{x} \mid \boldsymbol{p}) = \prod_{i=1}^{k} p_i^{x_i},$$
where $p_i$ represents the probability of seeing element i and $\sum_{i=1}^{k} p_i = 1$.
This is the formulation adopted by Bishop.
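
The three formulations above can be compared side by side in a short Python sketch. The function names (`pmf_index`, `pmf_iverson`, `pmf_one_hot`, `one_hot`) and the example parameter vector are illustrative assumptions, not part of any library; categories are labelled 1 to k as in the text.

```python
from math import prod


def pmf_index(i, p):
    """Integer-label form: f(x = i | p) = p_i, with labels 1..k."""
    return p[i - 1]


def pmf_iverson(i, p):
    """Iverson-bracket form: product over j of p_j ** [i == j]."""
    return prod(p_j ** (1 if i == j else 0) for j, p_j in enumerate(p, start=1))


def one_hot(i, k):
    """Encode label i (1..k) as a 1-of-K vector."""
    return [1 if j == i else 0 for j in range(1, k + 1)]


def pmf_one_hot(x, p):
    """1-of-K form: product over j of p_j ** x_j for a one-hot vector x."""
    return prod(p_j ** x_j for p_j, x_j in zip(p, x))


# Hypothetical parameter vector used only for illustration.
p = [0.2, 0.5, 0.3]
for i in range(1, len(p) + 1):
    values = (pmf_index(i, p), pmf_iverson(i, p), pmf_one_hot(one_hot(i, len(p)), p))
    print(i, values)
```

All three functions return the same probability for a given category; only the encoding of the outcome differs.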

Properties

The distribution is completely given by the probabilities associated with each number i: $p_i = P(X = i)$, i = 1,...,k, where $\sum_{i} p_i = 1$. The possible sets of probabilities are exactly those in the standard $(k-1)$-dimensional simplex; for k = 2 this reduces to the possible probabilities of the Bernoulli distribution being the 1-simplex, $p_1 + p_2 = 1$, $0 \le p_1, p_2 \le 1$.
The distribution is a special case of a "multivariate Bernoulli distribution" in which exactly one of the k 0-1 variables takes the value one.
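Let $X$ be the realisation from a categorical distribution. Define the random vector Y as composed of the elements $Y_i = [X = i]$, using the Iverson bracket. Then Y is a 1-of-K vector whose distribution is the special case of the multinomial distribution with n = 1.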
The conjugate prior distribution of a categorical distribution is a Dirichlet distribution. See the section on Bayesian inference below for more discussion.
The sufficient statistic from n independent observations is the set of counts of observations in each category, where the total number of trials is fixed.
The indicator function of an observation having a value i, equivalent to the Iverson bracket function $[x = i]$ or the Kronecker delta function $\delta_{xi}$, is Bernoulli distributed with parameter $p_i$.

Bayesian inference using conjugate prior

In Bayesian statistics, the Dirichlet distribution is the conjugate prior distribution of the categorical distribution. This means that in a model consisting of a data point having a categorical distribution with unknown parameter vector p, if we treat this parameter as a random variable and give it a prior distribution defined using a Dirichlet distribution, then the posterior distribution of the parameter, after incorporating the knowledge gained from the observed data, is also a Dirichlet distribution. Intuitively, in such a case, starting from what is known about the parameter prior to observing the data point, knowledge can then be updated based on the data point, yielding a new distribution of the same form as the old one. As such, knowledge of a parameter can be successively updated by incorporating new observations one at a time, without running into mathematical difficulties.
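Formally, this can be expressed as follows. Given a model
$$\begin{aligned}
\boldsymbol\alpha &= (\alpha_1, \ldots, \alpha_K) && \text{(concentration hyperparameter)} \\
\boldsymbol{p} \mid \boldsymbol\alpha &= (p_1, \ldots, p_K) \sim \operatorname{Dir}(K, \boldsymbol\alpha) \\
\mathbb{X} \mid \boldsymbol{p} &= (x_1, \ldots, x_N) \sim \operatorname{Cat}(K, \boldsymbol{p})
\end{aligned}$$
then the following holds:
$$\begin{aligned}
\boldsymbol{c} &= (c_1, \ldots, c_K), \qquad c_i = \sum_{j=1}^{N} [x_j = i] && \text{(number of occurrences of category } i\text{)} \\
\boldsymbol{p} \mid \mathbb{X}, \boldsymbol\alpha &\sim \operatorname{Dir}(K, \boldsymbol{c} + \boldsymbol\alpha) = \operatorname{Dir}(K, c_1 + \alpha_1, \ldots, c_K + \alpha_K)
\end{aligned}$$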
This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of N samples. Intuitively, we can view the hyperprior vector α as pseudocounts, i.e. as representing the number of observations in each category that we have already seen. Then we simply add in the counts for all the new observations in order to derive the posterior distribution.
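Further intuition comes from the expected value of the posterior distribution:
$$\operatorname{E}[p_i \mid \mathbb{X}, \boldsymbol\alpha] = \frac{c_i + \alpha_i}{N + \sum_{k} \alpha_k},$$
where $c_i$ is the number of observations in category i.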
This says that the expected probability of seeing a category i among the various discrete distributions generated by the posterior distribution is simply equal to the proportion of occurrences of that category actually seen in the data, including the pseudocounts in the prior distribution. This makes a great deal of intuitive sense: if, for example, there are three possible categories, and category 1 is seen in the observed data 40% of the time, one would expect on average to see category 1 40% of the time in the posterior distribution as well, which, ignoring the effect of the prior, is indeed what the posterior reveals. However, the posterior is a distribution over distributions: the true distribution that generated the data might differ somewhat from this average, and the uncertainty about it is described by the variance of the posterior, which shrinks as more data are observed.
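
The pseudocount view of the update can be illustrated with a minimal Python sketch. It assumes categories are labelled 1 to K and that the prior is given as a list of pseudocounts; the function names and the example data are hypothetical.

```python
from collections import Counter


def dirichlet_posterior(alpha, observations):
    """Posterior Dirichlet parameters: prior pseudocounts plus observed counts."""
    counts = Counter(observations)
    return [a + counts.get(i, 0) for i, a in enumerate(alpha, start=1)]


def posterior_mean(alpha, observations):
    """Expected value of each p_i under the posterior Dirichlet distribution."""
    posterior = dirichlet_posterior(alpha, observations)
    total = sum(posterior)
    return [a / total for a in posterior]


# Illustrative data: a uniform prior (one pseudocount per category)
# and seven observations over three categories.
alpha = [1.0, 1.0, 1.0]
data = [1, 3, 3, 2, 3, 1, 3]

print(dirichlet_posterior(alpha, data))  # pseudocounts after the update
print(posterior_mean(alpha, data))       # proportions including pseudocounts
```

Because the update is a simple addition of counts, observations can be incorporated one at a time or all at once with the same result.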

MAP estimation

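The maximum-a-posteriori estimate of the parameter p in the above model is simply the mode of the posterior Dirichlet distribution, i.e.,
$$\hat{p}_i = \frac{\alpha_i + c_i - 1}{\sum_{j} (\alpha_j + c_j - 1)}, \qquad \text{provided that } \alpha_i + c_i > 1 \text{ for all } i,$$
where $c_i$ is the number of observations in category i.
In many practical applications, the only way to guarantee the condition that $\alpha_i + c_i > 1$ for all i is to set $\alpha_i > 1$ for all i.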
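
As a rough sketch, the mode can be computed directly from the prior pseudocounts and the observed counts. The function name `dirichlet_map` is an assumption for illustration, and the explicit check mirrors the condition stated above.

```python
def dirichlet_map(alpha, counts):
    """Mode of Dir(alpha + counts); requires alpha_i + c_i > 1 for every i."""
    shifted = [a + c - 1 for a, c in zip(alpha, counts)]
    if any(s <= 0 for s in shifted):
        raise ValueError("mode formula requires alpha_i + c_i > 1 for every i")
    total = sum(shifted)
    return [s / total for s in shifted]


# Illustrative values: alpha_i = 2 guarantees the condition even for empty categories.
print(dirichlet_map([2.0, 2.0, 2.0], [2, 1, 4]))
```

With a uniform prior of all ones, the estimate reduces to the observed proportions but is undefined by this formula as soon as one category has no observations, which is why the sketch raises an error in that case.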

Marginal likelihood

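In the above model, the marginal likelihood of the observations is a Dirichlet-multinomial distribution:
$$p(\mathbb{X} \mid \boldsymbol\alpha) = \int_{\boldsymbol{p}} p(\mathbb{X} \mid \boldsymbol{p})\, p(\boldsymbol{p} \mid \boldsymbol\alpha)\, d\boldsymbol{p} = \frac{\Gamma\left(\sum_{k} \alpha_k\right)}{\Gamma\left(N + \sum_{k} \alpha_k\right)} \prod_{k=1}^{K} \frac{\Gamma(c_k + \alpha_k)}{\Gamma(\alpha_k)},$$
where $c_k$ is the number of observations in category k and $N = \sum_{k} c_k$.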
This distribution plays an important role in hierarchical Bayesian models, because when doing inference over such models using methods such as Gibbs sampling or variational Bayes, Dirichlet prior distributions are often marginalized out. See the article on this distribution for more details.
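
A minimal sketch of this quantity in Python, assuming the categorical-style form of the Dirichlet-multinomial (the probability of one particular sequence of observations, with no multinomial coefficient). It works in log space with `math.lgamma` to avoid overflow; the function name is illustrative.

```python
from math import exp, lgamma


def log_marginal_likelihood(alpha, counts):
    """Log probability of one observed sequence with the given category counts,
    after integrating out p under a Dirichlet(alpha) prior."""
    alpha_sum = sum(alpha)
    n = sum(counts)
    result = lgamma(alpha_sum) - lgamma(n + alpha_sum)
    for a, c in zip(alpha, counts):
        result += lgamma(c + a) - lgamma(a)
    return result


# Two categories, uniform prior, one observation of each category.
print(exp(log_marginal_likelihood([1.0, 1.0], [1, 1])))
```

The example agrees with the sequential calculation: the first observation has probability 1/2, and the second, falling in the other category, has predictive probability 1/3, giving 1/6.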

Posterior predictive distribution

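The posterior predictive distribution of a new observation in the above model is the distribution that a new observation $\tilde{x}$ would take given the set $\mathbb{X}$ of N categorical observations. As shown in the Dirichlet-multinomial distribution article, it has a very simple form:
$$p(\tilde{x} = i \mid \mathbb{X}, \boldsymbol\alpha) = \int_{\boldsymbol{p}} p(\tilde{x} = i \mid \boldsymbol{p})\, p(\boldsymbol{p} \mid \mathbb{X}, \boldsymbol\alpha)\, d\boldsymbol{p} = \frac{c_i + \alpha_i}{N + \sum_{k} \alpha_k} = \operatorname{E}[p_i \mid \mathbb{X}, \boldsymbol\alpha] \propto c_i + \alpha_i.$$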
There are various relationships among this formula and the previous ones:

The posterior predictive probability of seeing a particular category is the same as the relative proportion of previous observations in that category. This makes logical sense — intuitively, we would expect to see a particular category according to the frequency already observed of that category.
The posterior predictive probability is the same as the expected value of the posterior distribution. This is explained more below.
As a result, this formula can be expressed as simply "the posterior predictive probability of seeing a category is proportional to the total observed count of that category", or as "the expected count of a category is the same as the total observed count of the category", where "observed count" is taken to include the pseudo-observations of the prior.

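The reason for the equivalence between posterior predictive probability and the expected value of the posterior distribution of p is evident with re-examination of the above formula. As explained in the posterior predictive distribution article, the formula for the posterior predictive probability has the form of an expected value taken with respect to the posterior distribution:
$$\begin{aligned}
p(\tilde{x} = i \mid \mathbb{X}, \boldsymbol\alpha) &= \int_{\boldsymbol{p}} p(\tilde{x} = i \mid \boldsymbol{p})\, p(\boldsymbol{p} \mid \mathbb{X}, \boldsymbol\alpha)\, d\boldsymbol{p} \\
&= \operatorname{E}_{\boldsymbol{p} \mid \mathbb{X}, \boldsymbol\alpha}\left[ p(\tilde{x} = i \mid \boldsymbol{p}) \right] \\
&= \operatorname{E}_{\boldsymbol{p} \mid \mathbb{X}, \boldsymbol\alpha}\left[ p_i \right] \\
&= \operatorname{E}[p_i \mid \mathbb{X}, \boldsymbol\alpha].
\end{aligned}$$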
The crucial line above is the third. The second follows directly from the definition of expected value. The third line is particular to the categorical distribution, and follows from the fact that, in the categorical distribution specifically, the expected value of seeing a particular value i is directly specified by the associated parameter p_i. The fourth line is simply a rewriting of the third in a different notation, using the notation farther up for an expectation taken with respect to the posterior distribution of the parameters.
Observe data points one by one and each time consider their predictive probability before observing the data point and updating the posterior. For any given data point, the probability of that point assuming a given category depends on the number of data points already in that category. In this scenario, if a category has a high frequency of occurrence, then new data points are more likely to join that category — further enriching the same category. This type of scenario is often termed a preferential attachment model. This models many real-world processes, and in such cases the choices made by the first few data points have an outsize influence on the rest of the data points.
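
The sequential scheme just described can be simulated with a short sketch. It assumes the posterior predictive probability is (count + pseudocount) divided by the total, as in the formula for the posterior predictive distribution; the function names, the seed and the uniform prior are illustrative choices. Categories are indexed from 0 here for convenience.

```python
import random


def posterior_predictive(alpha, counts):
    """Predictive probability of each category given the counts seen so far."""
    total = sum(alpha) + sum(counts)
    return [(a + c) / total for a, c in zip(alpha, counts)]


def simulate_sequential_draws(alpha, n_draws, rng):
    """Draw points one at a time, updating the counts after each draw."""
    counts = [0] * len(alpha)
    for _ in range(n_draws):
        probs = posterior_predictive(alpha, counts)
        chosen = rng.choices(range(len(alpha)), weights=probs)[0]
        counts[chosen] += 1  # the chosen category becomes more likely next time
    return counts


rng = random.Random(0)
print(simulate_sequential_draws([1.0, 1.0, 1.0], 1000, rng))
```

Running the simulation with different seeds tends to give very different final proportions, which reflects the outsize influence of the first few draws.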

Posterior conditional distribution

In Gibbs sampling, one typically needs to draw from conditional distributions in multi-variable Bayes networks where each variable is conditioned on all the others. In networks that include categorical variables with Dirichlet priors, the Dirichlet distributions are often "collapsed out" of the network, which introduces dependencies among the various categorical nodes dependent on a given prior. One of the reasons for doing this is that in such a case, the distribution of one categorical node given the others is exactly the posterior predictive distribution of the remaining nodes.
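That is, for a set of nodes $\mathbb{X}$, if the node in question is denoted as $x_n$ and the remainder as $\mathbb{X}^{(-n)}$, then
$$p(x_n = i \mid \mathbb{X}^{(-n)}, \boldsymbol\alpha) = \frac{c_i^{(-n)} + \alpha_i}{N - 1 + \sum_{k} \alpha_k} \propto c_i^{(-n)} + \alpha_i,$$
where $c_i^{(-n)}$ is the number of nodes having category i among the nodes other than node n.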
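
A minimal sketch of this conditional, as it might appear inside one step of a collapsed Gibbs sampler. The function name and the data are hypothetical; category labels run from 1 to K, while the node position `n` is a 0-based list index.

```python
def collapsed_conditional(assignments, n, alpha):
    """Distribution of node n over categories 1..K given all other nodes,
    with the Dirichlet prior integrated out."""
    counts = [0] * len(alpha)
    for j, category in enumerate(assignments):
        if j != n:  # leave out the node being resampled
            counts[category - 1] += 1
    denominator = len(assignments) - 1 + sum(alpha)
    return [(c + a) / denominator for c, a in zip(counts, alpha)]


# Four nodes, three categories, uniform prior; resample the node at index 1.
print(collapsed_conditional([1, 3, 3, 2], 1, [1.0, 1.0, 1.0]))
```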

Sampling

There are a number of methods, but the most common way to sample from a categorical distribution uses a type of inverse transform sampling:
Assume a distribution is expressed as "proportional to" some expression, with unknown normalizing constant. Before taking any samples, one prepares some values as follows:

Compute the unnormalized value of the distribution for each category.
Sum them up and divide each value by this sum, in order to normalize them.
Impose some sort of order on the categories.
Convert the values to a cumulative distribution function (CDF) by replacing each value with the sum of all of the previous values. This can be done in time O(k). The resulting value for the first category will be 0.

Then, each time it is necessary to sample a value:

Pick a uniformly distributed number between 0 and 1.
Locate the greatest number in the CDF whose value is less than or equal to the number just chosen. This can be done in time O(log k), by binary search.
Return the category corresponding to this CDF value.
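
The preparation and sampling steps above can be sketched in Python using the standard `bisect` module for the binary search. The function names are illustrative, the weights are assumed to be non-negative with a positive sum, and the returned category is a 0-based index into the chosen ordering.

```python
import bisect
import random
from itertools import accumulate


def build_cdf(weights):
    """Normalize unnormalized weights and return exclusive prefix sums,
    so that the entry for the first category is 0."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return [0.0] + list(accumulate(probs))[:-1]


def locate(cdf, u):
    """Index of the greatest CDF entry that is <= u (binary search)."""
    return bisect.bisect_right(cdf, u) - 1


def sample_index(cdf, rng):
    """Draw one category index using a uniform number in [0, 1)."""
    return locate(cdf, rng.random())


cdf = build_cdf([2.0, 5.0, 3.0])  # unnormalized weights for three categories
rng = random.Random(0)
print([sample_index(cdf, rng) for _ in range(10)])
```

Building the table costs O(k) once, after which each draw costs O(log k).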

If it is necessary to draw many values from the same categorical distribution, the following approach is more efficient. It draws n samples in O(n) time, assuming that a value can be drawn from the binomial distribution in O(1) time.


function draw_categorical(n) // where n is the number of samples to draw from the categorical distribution
  r = 1
  s = 0
  for i from 1 to k // where k is the number of categories
    v = draw from a binomial(n, p[i] / r) distribution // where p[i] is the probability of category i
    for j from 1 to v
      z[s++] = i // where z is an array in which the results are stored
    n = n - v
    r = r - p[i]
  shuffle (randomly re-order) the elements in z
  return z
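
A runnable Python sketch of the same binomial-based procedure follows. It is one possible rendering, not a reference implementation: the helper `_binomial` is an assumption that uses `random.Random.binomialvariate` where the interpreter provides it (Python 3.12 and later) and otherwise falls back to a slower sum of Bernoulli trials, and the clamping of the conditional probability and the handling of the last category are added here to guard against floating-point rounding.

```python
import random


def _binomial(n, q, rng):
    """Draw from Binomial(n, q), using the standard library sampler if present."""
    if hasattr(rng, "binomialvariate"):
        return rng.binomialvariate(n, q)
    return sum(rng.random() < q for _ in range(n))  # O(n) fallback


def draw_categorical(n, p, rng):
    """Draw n samples (labels 1..k) from the categorical distribution p."""
    z = []
    r = 1.0  # probability mass not yet assigned to a category
    for i, p_i in enumerate(p, start=1):
        if n == 0:
            break
        if i == len(p):
            v = n  # whatever remains belongs to the last category
        else:
            q = min(1.0, max(0.0, p_i / r)) if r > 0 else 1.0
            v = _binomial(n, q, rng)
        z.extend([i] * v)
        n -= v
        r -= p_i
    rng.shuffle(z)
    return z


rng = random.Random(1)
samples = draw_categorical(1000, [0.2, 0.5, 0.3], rng)
print(samples[:10])
```

Each category's count is drawn conditionally on the counts already assigned, so the vector of counts is multinomially distributed, and the final shuffle restores a random order.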

Sampling via the Gumbel distribution

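In machine learning it is typical to parametrize the categorical distribution $p_1, \ldots, p_k$ via an unconstrained representation in $\mathbb{R}^k$, whose components are given by:
$$\gamma_i = \log p_i + \alpha,$$
where $\alpha$ is any real constant. Given this representation, $p_1, \ldots, p_k$ can be recovered using the softmax function, which can then be sampled using the techniques described above. There is however a more direct sampling method that uses samples from the Gumbel distribution. Let $g_1, \ldots, g_k$ be k independent draws from the standard Gumbel distribution, then
$$c = \operatorname*{arg\,max}_{i} \left( \gamma_i + g_i \right)$$
will be a sample from the desired categorical distribution. If $u_i$ is a sample from the standard uniform distribution, then $g_i = -\log(-\log u_i)$ is a sample from the standard Gumbel distribution.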
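
The Gumbel-based method can be sketched as follows. The function name, the shift constant and the example probabilities are illustrative assumptions; standard Gumbel noise is generated from uniform draws by the transformation -log(-log u), and categories are labelled 1 to k.

```python
import math
import random


def gumbel_max_sample(gamma, rng):
    """Return a category (1..k) by adding independent standard Gumbel noise
    to each unnormalized log-probability and taking the arg max."""
    best_category, best_value = None, -math.inf
    for i, gamma_i in enumerate(gamma, start=1):
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
            u = rng.random()
        value = gamma_i - math.log(-math.log(u))
        if value > best_value:
            best_category, best_value = i, value
    return best_category


# Unnormalized log-probabilities: log p_i shifted by an arbitrary constant.
p = [0.2, 0.5, 0.3]
gamma = [math.log(p_i) + 3.7 for p_i in p]

rng = random.Random(0)
draws = [gumbel_max_sample(gamma, rng) for _ in range(20000)]
print([draws.count(i) / len(draws) for i in (1, 2, 3)])
```

The empirical frequencies should be close to p, and no softmax normalization is needed at any point.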

Related distributions

Dirichlet distribution
Multinomial distribution
Bernoulli distribution
Dirichlet-multinomial distribution
