F-divergence
In probability theory, an f-divergence is a function D_f(P ∥ Q) that measures the difference between two probability distributions P and Q. It helps the intuition to think of the divergence as an average, weighted by the function f, of the odds ratio given by P and Q.
These divergences were introduced by Alfréd Rényi in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. f-divergences were studied further independently by Csiszár (1963), Morimoto (1963), and Ali & Silvey (1966), and are sometimes known as Csiszár f-divergences, Csiszár–Morimoto divergences or Ali–Silvey distances.
Definition
Let P and Q be two probability distributions over a space Ω such that P is absolutely continuous with respect to Q. Then, for a convex function f such that f(1) = 0, the f-divergence of P from Q is defined as

D_f(P \parallel Q) \equiv \int_{\Omega} f\!\left(\frac{dP}{dQ}\right) dQ.

If P and Q are both absolutely continuous with respect to a reference distribution μ on Ω, then their probability densities p and q satisfy dP = p dμ and dQ = q dμ. In this case the f-divergence can be written as

D_f(P \parallel Q) = \int_{\Omega} f\!\left(\frac{p(x)}{q(x)}\right) q(x)\, d\mu(x).
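For intuition, the density form can be evaluated directly for discrete distributions. The following is a minimal sketch in Python, assuming finite supports with q(x) > 0 everywhere; the function name f_divergence and the example values are illustrative choices, not a standard API.

import numpy as np

def f_divergence(p, q, f):
    """Compute D_f(P || Q) = sum_x q(x) * f(p(x)/q(x)) for discrete distributions.

    Assumes q(x) > 0 wherever p(x) > 0 (P absolutely continuous w.r.t. Q).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    t = p / q                       # the likelihood ratio dP/dQ
    return float(np.sum(q * f(t)))

# Generator f(t) = t ln t (with f(1) = 0) recovers the KL-divergence.
p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
print(f_divergence(p, q, lambda t: t * np.log(t)))  # equals sum p*ln(p/q)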
Provided f is analytic at t = 1, an f-divergence can be expanded in a Taylor series about t = 1 and rewritten as a weighted sum of chi-type distances, as sketched below.
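A sketch of this expansion, assuming f is analytic in a neighborhood of 1 and that summation and integration can be interchanged (the notation \tilde{\chi}^n for the chi-type distances is chosen here for illustration):

D_f(P \parallel Q) = \sum_{n=2}^{\infty} \frac{f^{(n)}(1)}{n!}\, \tilde{\chi}^n(P \parallel Q), \qquad \tilde{\chi}^n(P \parallel Q) = \int_{\Omega} \frac{(p - q)^n}{q^{\,n-1}}\, d\mu.

The n = 0 term vanishes because f(1) = 0, and the n = 1 term vanishes because ∫_Ω (p − q) dμ = 0; the leading n = 2 term is f''(1)/2 times the Pearson χ²-divergence.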
Instances of f-divergences
Many common divergences, such as the KL-divergence, Hellinger distance, and total variation distance, are special cases of f-divergence, coinciding with a particular choice of f. The following table lists many of the common divergences between probability distributions and the f function to which they correspond.

Divergence | Corresponding f(t)
KL-divergence | t \ln t
reverse KL-divergence | -\ln t
squared Hellinger distance | (\sqrt{t} - 1)^2
Total variation distance | \tfrac{1}{2} |t - 1|
Pearson χ²-divergence | (t - 1)^2
Neyman χ²-divergence | \frac{(1 - t)^2}{t}
α-divergence | \frac{4}{1 - \alpha^2}\bigl(1 - t^{(1 + \alpha)/2}\bigr)
Jensen–Shannon divergence | \tfrac{1}{2}\bigl[t \ln t - (t + 1) \ln\tfrac{t + 1}{2}\bigr]
α-divergence (alternative convention) | \frac{t^{\alpha} - \alpha t - (1 - \alpha)}{\alpha(\alpha - 1)}
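As a quick check of the first row, substituting f(t) = t ln t into the density form of the definition recovers the familiar expression for the KL-divergence:

D_f(P \parallel Q) = \int_{\Omega} q(x)\, \frac{p(x)}{q(x)} \ln\frac{p(x)}{q(x)}\, d\mu(x) = \int_{\Omega} p(x) \ln\frac{p(x)}{q(x)}\, d\mu(x) = \mathrm{KL}(P \parallel Q).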
The function f is defined up to the summand c(t − 1), where c is an arbitrary constant: adding c(t − 1) to f leaves D_f(P ∥ Q) unchanged, since ∫_Ω (dP/dQ − 1) dQ = 1 − 1 = 0.
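This invariance is easy to verify numerically. The following is a self-contained sketch (it redefines the illustrative f_divergence helper from the definition section; the constant c is arbitrary):

import numpy as np

def f_divergence(p, q, f):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

kl = lambda t: t * np.log(t)             # generator for KL-divergence
c = 3.7                                   # any constant works
shifted = lambda t: kl(t) + c * (t - 1)   # same divergence: sum(q*(p/q - 1)) = 0

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
print(f_divergence(p, q, kl), f_divergence(p, q, shifted))  # identical values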
Properties
Monotonicity here refers to the data-processing inequality: applying a Markov kernel (stochastic map) to both arguments cannot increase an f-divergence. In particular, the monotonicity implies that if a Markov process has a positive equilibrium probability distribution P* then D_f(P(t) ∥ P*) is a monotonically non-increasing function of time t, where the probability distribution P(t) is a solution of the Kolmogorov forward equations, used to describe the time evolution of the probability distribution in the Markov process. This means that all f-divergences D_f(P(t) ∥ P*) are Lyapunov functions of the Kolmogorov forward equations. The converse statement is also true: if H(P) is a Lyapunov function for all Markov chains with positive equilibrium P* and is of the trace form

H(P) = \sum_{i} f(P_i, P_i^{\ast})

then H(P) = D_f(P ∥ P*) for some convex function f. Bregman divergences, for example, in general do not have this property and can increase in Markov processes.
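The Lyapunov property can be observed numerically. The following is a small illustrative sketch (the transition matrix and initial distribution are arbitrary choices made here, not from the source): iterating a row-stochastic matrix with a strictly positive stationary distribution, the KL-divergence to equilibrium never increases.

import numpy as np

# Row-stochastic transition matrix with a strictly positive stationary distribution.
K = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution pi solves pi K = pi: left eigenvector for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(K.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()

def kl(p, q):
    """KL-divergence, the f-divergence with f(t) = t ln t."""
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.9, 0.05, 0.05])   # arbitrary initial distribution
values = []
for _ in range(20):
    values.append(kl(p, pi))
    p = p @ K                      # one step of the forward (master) equation

# Monotone non-increase: D_f(P(t) || pi) acts as a Lyapunov function.
assert all(a >= b - 1e-12 for a, b in zip(values, values[1:]))
print(values[:5])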