F-divergence


In probability theory, an f-divergence is a function D_f that measures the difference between two probability distributions P and Q. It helps the intuition to think of the divergence as an average, weighted by the function f, of the odds ratio given by P and Q.
These divergences were introduced by Alfréd Rényi in the same paper where he introduced the well-known Rényi entropy, and he proved that these divergences decrease in Markov processes. f-divergences were studied further and independently by Csiszár, Morimoto, and Ali and Silvey, and are sometimes known as Csiszár f-divergences, Csiszár-Morimoto divergences, or Ali-Silvey distances.

Definition

Let P and Q be two probability distributions over a space Ω such that P is absolutely continuous with respect to Q. Then, for a convex function f such that f(1) = 0, the f-divergence of P from Q is defined as

    D_f(P \| Q) = \int_\Omega f\!\left(\frac{dP}{dQ}\right) dQ.

If P and Q are both absolutely continuous with respect to a reference distribution μ on Ω, then their probability densities p and q satisfy dP = p dμ and dQ = q dμ. In this case the f-divergence can be written as

    D_f(P \| Q) = \int_\Omega f\!\left(\frac{p(x)}{q(x)}\right) q(x) \, d\mu(x).
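As a minimal numerical sketch of this definition for the discrete case (using NumPy; the helper names below are illustrative, not from any standard library), the integral reduces to a sum over outcomes, and choosing f(t) = t ln t recovers the Kullback-Leibler divergence:

    import numpy as np

    def f_divergence(p, q, f):
        """Discrete f-divergence D_f(P || Q) = sum_i q_i * f(p_i / q_i).

        Assumes P is absolutely continuous w.r.t. Q (q_i > 0 wherever p_i > 0).
        """
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        mask = q > 0
        return float(np.sum(q[mask] * f(p[mask] / q[mask])))

    def kl_f(t):
        """f(t) = t * ln(t), with the convention 0 * ln(0) = 0."""
        out = np.zeros_like(t)
        pos = t > 0
        out[pos] = t[pos] * np.log(t[pos])
        return out

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.1, 0.6, 0.3])
    print(f_divergence(p, q, kl_f))   # f-divergence with f(t) = t ln t
    print(np.sum(p * np.log(p / q)))  # direct KL-divergence: the same value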
An f-divergence can be expanded in a Taylor series about t = 1 and rewritten as a weighted sum of chi-type distances, as sketched below.
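A sketch of that expansion, assuming f is analytic at t = 1 and the series converges (e.g., when p/q is suitably bounded): since f(1) = 0 and \int_\Omega (p - q) \, d\mu = 0, the zeroth- and first-order terms vanish, leaving

    D_f(P \| Q) = \sum_{n=2}^{\infty} \frac{f^{(n)}(1)}{n!} \, \chi^n(P \| Q),
    \qquad
    \chi^n(P \| Q) = \int_\Omega \frac{(p - q)^n}{q^{\,n-1}} \, d\mu,

where each χ^n(P ‖ Q) is a chi-type distance of order n; the n = 2 term is the Pearson χ²-divergence up to the factor f''(1)/2.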

Instances of f-divergences

Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of f-divergence, coinciding with a particular choice of f. The following table lists many of the common divergences between probability distributions and the f function to which they correspond.
Divergence: Corresponding f(t)
- KL-divergence: f(t) = t \ln t
- reverse KL-divergence: f(t) = -\ln t
- squared Hellinger distance: f(t) = (\sqrt{t} - 1)^2
- Total variation distance: f(t) = \tfrac{1}{2} |t - 1|
- Pearson χ²-divergence: f(t) = (t - 1)^2
- Neyman χ²-divergence (reverse Pearson): f(t) = \frac{(1 - t)^2}{t}
- α-divergence: f(t) = \frac{4}{1 - \alpha^2} \left(1 - t^{(1 + \alpha)/2}\right) for α ≠ ±1, with the limiting cases t \ln t at α = 1 and -\ln t at α = -1
- Jensen-Shannon divergence: f(t) = \tfrac{1}{2} \left( t \ln t - (t + 1) \ln \frac{t + 1}{2} \right)
- α-divergence (alternative parametrization): f(t) = \frac{t^\alpha - \alpha t - (1 - \alpha)}{\alpha (\alpha - 1)} for α ≠ 0, 1

The function f is defined only up to the summand c(t − 1), where c is an arbitrary constant: adding c(t − 1) to f leaves D_f(P \| Q) unchanged, because \int_\Omega (p - q) \, d\mu = 0. Constant overall scale factors in the table above also vary between authors.
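A short numerical sketch of two table entries and of the summand remark, again for the discrete case (the f_divergence helper is illustrative, as in the Definition section, not a library function):

    import numpy as np

    def f_divergence(p, q, f):
        """Discrete f-divergence: sum_i q_i * f(p_i / q_i), assuming q_i > 0."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return float(np.sum(q * f(p / q)))

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.1, 0.6, 0.3])

    # Total variation: f(t) = |t - 1| / 2 reproduces (1/2) * sum_i |p_i - q_i|.
    print(f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1.0)),
          0.5 * np.sum(np.abs(p - q)))

    # Pearson chi-squared: f(t) = (t - 1)^2 reproduces sum_i (p_i - q_i)^2 / q_i.
    print(f_divergence(p, q, lambda t: (t - 1.0) ** 2),
          np.sum((p - q) ** 2 / q))

    # The summand c * (t - 1) leaves the divergence unchanged, because
    # sum_i q_i * (p_i / q_i - 1) = sum_i (p_i - q_i) = 0.
    print(f_divergence(p, q, lambda t: t * np.log(t)),
          f_divergence(p, q, lambda t: t * np.log(t) + 3.7 * (t - 1.0)))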

Properties

Every f-divergence D_f(P \| Q) is non-negative, jointly convex in the pair (P, Q), and monotone under stochastic maps: applying the same Markov kernel to both arguments cannot increase the divergence (the data-processing inequality). In particular, the monotonicity implies that if a Markov process has a positive equilibrium probability distribution P^* then D_f(P(t) \| P^*) is a monotonically non-increasing function of time, where the probability distribution P(t) is a solution of the Kolmogorov forward equations, used to describe the time evolution of the probability distribution in the Markov process. This means that all f-divergences D_f(P(t) \| P^*) are Lyapunov functions of the Kolmogorov forward equations. The reverse statement also holds: if H(P) is a Lyapunov function for all Markov chains with positive equilibrium P^* and is of the trace form

    H(P) = \sum_i h(P_i, P_i^*),

then H(P) = D_f(P \| P^*) for some convex function f. Bregman divergences, for example, do not in general have this property and can increase in Markov processes.
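A minimal simulation of the monotonicity statement, in discrete time rather than via the Kolmogorov forward equations (a simplification; the transition matrix below is invented for illustration): the KL-divergence from the positive equilibrium decreases at every step.

    import numpy as np

    # A row-stochastic transition matrix with strictly positive entries,
    # so the chain has a unique, strictly positive equilibrium distribution.
    T = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.3, 0.3, 0.4]])

    # Equilibrium: the left eigenvector of T with eigenvalue 1, normalized.
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()

    def kl(p, q):
        """KL-divergence, i.e. the f-divergence with f(t) = t ln t."""
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    p = np.array([1.0, 0.0, 0.0])  # start far from equilibrium
    for step in range(8):
        print(step, kl(p, pi))
        p = p @ T                  # one step of the forward evolution
    # The printed values decrease monotonically toward 0, as the
    # Lyapunov-function property predicts.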