Pinsker's inequality
In information theory, Pinsker's inequality, named after its inventor Mark Semenovich Pinsker, is an inequality that bounds the total variation distance (or statistical distance) in terms of the Kullback–Leibler divergence. The inequality is tight up to constant factors.[1]
Formal statement
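Pinsker's inequality states that, if $P$ and $Q$ are two probability distributions on a measurable space $(X, \Sigma)$, then
$$\delta(P, Q) \le \sqrt{\frac{1}{2} D_{\mathrm{KL}}(P \parallel Q)},$$
where
$$\delta(P, Q) = \sup \{ |P(A) - Q(A)| : A \in \Sigma \text{ is a measurable event} \}$$
is the total variation distance (or statistical distance) between $P$ and $Q$ and
$$D_{\mathrm{KL}}(P \parallel Q) = \operatorname{E}_P\left(\log \frac{\mathrm{d}P}{\mathrm{d}Q}\right) = \int_X \left(\log \frac{\mathrm{d}P}{\mathrm{d}Q}\right) \mathrm{d}P$$
is the Kullback–Leibler divergence in nats. When the sample space $X$ is a finite set, the Kullback–Leibler divergence is given by
$$D_{\mathrm{KL}}(P \parallel Q) = \sum_{i \in X} \left(\log \frac{P(i)}{Q(i)}\right) P(i).$$
Note that in terms of the total variation norm $\|P - Q\|$ of the signed measure $P - Q$, Pinsker's inequality differs from the one given above by a factor of two:
$$\|P - Q\| \le \sqrt{2 D_{\mathrm{KL}}(P \parallel Q)}.$$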
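The bound can be checked numerically on a finite sample space. The following is a minimal sketch, assuming distributions are given as Python lists of probabilities; the function names `total_variation` and `kl_divergence` and the example distributions are illustrative choices, not part of the source. On a finite set the supremum defining the total variation distance equals half the sum of absolute differences, which is what the code uses.

```python
import math

def total_variation(p, q):
    """Total variation distance on a finite set: half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) in nats.

    Terms with P(i) = 0 contribute 0; the result is infinite if
    P(i) > 0 while Q(i) = 0 for some i.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

# Illustrative distributions on a three-point space (assumed for the example).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

delta = total_variation(p, q)
bound = math.sqrt(0.5 * kl_divergence(p, q))
print(delta, bound, delta <= bound)
```

For this pair the total variation distance is about 0.1, and it stays below the Pinsker bound as expected.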
A proof of Pinsker's inequality uses the partition inequality for f-divergences.
Alternative version
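Note that the form of Pinsker's inequality depends on which base of logarithm is used in the definition of the KL divergence. $D_{\mathrm{KL}}$ is defined using $\ln$ (logarithm in base $e$), whereas $D$ is typically defined with $\log_2$ (logarithm in base 2). Then,
$$D(P \parallel Q) = \frac{D_{\mathrm{KL}}(P \parallel Q)}{\ln 2}.$$
Given the above comments, there is an alternative statement of Pinsker's inequality in some literature that relates information divergence to variation distance:
$$D(P \parallel Q) = \frac{D_{\mathrm{KL}}(P \parallel Q)}{\ln 2} \ge \frac{1}{2 \ln 2} V^2(p, q),$$
i.e.,
$$\sqrt{\frac{D_{\mathrm{KL}}(P \parallel Q)}{2}} \ge \frac{V(p, q)}{2},$$
in which
$$V(p, q) = \sum_{x \in \mathcal{X}} |p(x) - q(x)|$$
is the (non-normalized) variation distance between two probability density functions $p$ and $q$ on the same alphabet $\mathcal{X}$.[2]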
This form of Pinsker's inequality shows that "convergence in divergence" is a stronger notion than "convergence in variation distance".
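The base-2 form can be illustrated the same way. The sketch below assumes finite distributions given as lists and computes the divergence in bits together with the non-normalized variation distance (the sum of absolute differences); the names `kl_bits` and `variation_distance` and the sample distributions are hypothetical choices made for this example.

```python
import math

def kl_bits(p, q):
    """Divergence D(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def variation_distance(p, q):
    """Non-normalized variation distance V(p, q) = sum_x |p(x) - q(x)|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# Illustrative distributions on a three-letter alphabet (assumed).
p = [0.7, 0.2, 0.1]
q = [0.25, 0.25, 0.5]

d_bits = kl_bits(p, q)
v = variation_distance(p, q)
lower_bound = v ** 2 / (2 * math.log(2))  # V^2 / (2 ln 2)

print(d_bits, lower_bound, d_bits >= lower_bound)
```

Multiplying `d_bits` by `math.log(2)` recovers the divergence in nats, which is the only difference between the two statements.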
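A simple proof by John Pollard is shown by letting $r(x) = P(x)/Q(x) - 1 \ge -1$:
$$\begin{aligned}
D_{\mathrm{KL}}(P \parallel Q) &= \operatorname{E}_Q[(1 + r(x)) \log(1 + r(x)) - r(x)] \\
&\ge \frac{1}{2} \operatorname{E}_Q\left[\frac{r(x)^2}{1 + r(x)/3}\right] \\
&\ge \frac{1}{2} \frac{\operatorname{E}_Q[|r(x)|]^2}{\operatorname{E}_Q[1 + r(x)/3]} && \text{(from Titu's lemma)} \\
&= \frac{1}{2} \operatorname{E}_Q[|r(x)|]^2 && \text{(as } \operatorname{E}_Q[1 + r(x)/3] = 1\text{)} \\
&= \frac{1}{2} V(p, q)^2.
\end{aligned}$$
The first equality uses $\operatorname{E}_Q[r(x)] = 0$, and the last uses $\operatorname{E}_Q[|r(x)|] = \sum_x |P(x) - Q(x)|$. Here Titu's lemma is also known as Sedrakyan's inequality.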
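The proof rests on the pointwise inequality $(1 + r)\log(1 + r) - r \ge \frac{r^2}{2(1 + r/3)}$ for $r \ge -1$. The sketch below checks it on a grid and then follows the chain of inequalities for one pair of distributions. The helper names, the grid range, and the example distributions are assumptions made for illustration, and a grid check is a sanity test rather than a proof.

```python
import math

def phi(r):
    """(1 + r) log(1 + r) - r, extended by continuity to phi(-1) = 1."""
    if r == -1.0:
        return 1.0
    return (1.0 + r) * math.log1p(r) - r

def quadratic_lower_bound(r):
    """r^2 / (2 (1 + r/3)), the pointwise lower bound used in the proof."""
    return r * r / (2.0 * (1.0 + r / 3.0))

# Check the pointwise inequality on a grid of r values in [-1, 10],
# with a small tolerance for floating-point rounding.
grid = [-1.0 + 0.01 * k for k in range(1101)]
pointwise_ok = all(phi(r) >= quadratic_lower_bound(r) - 1e-12 for r in grid)

# Follow the chain for one illustrative pair of distributions (assumed).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
r_vals = [pi / qi - 1.0 for pi, qi in zip(p, q)]

kl = sum(qi * phi(ri) for qi, ri in zip(q, r_vals))  # E_Q[(1+r) log(1+r) - r]
step2 = sum(qi * quadratic_lower_bound(ri) for qi, ri in zip(q, r_vals))
v = sum(qi * abs(ri) for qi, ri in zip(q, r_vals))   # E_Q|r| = sum |p - q|

print(pointwise_ok, kl >= step2 >= 0.5 * v * v)
```

The intermediate quantity `step2` sits between the divergence and half the squared variation distance, mirroring the second and third lines of the derivation.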
History
Pinsker first proved the inequality with a greater constant. The inequality in the above form was proved independently by Kullback, Csiszár, and Kemperman.[3]
Inverse problem
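A precise inverse of the inequality cannot hold: for every $\varepsilon > 0$, there are distributions $P_\varepsilon, Q$ with $\delta(P_\varepsilon, Q) \le \varepsilon$ but $D_{\mathrm{KL}}(P_\varepsilon \parallel Q) = \infty$. An easy example is given by the two-point space $\{0, 1\}$ with $Q(0) = 0, Q(1) = 1$ and $P_\varepsilon(0) = \varepsilon, P_\varepsilon(1) = 1 - \varepsilon$.[4]
However, an inverse inequality holds on finite spaces $X$ with a constant depending on $Q$.[5] More specifically, it can be shown that with the definition $\alpha_Q := \min_{x \in X : Q(x) > 0} Q(x)$ we have, for any measure $P$ which is absolutely continuous with respect to $Q$,
$$\frac{1}{2} D_{\mathrm{KL}}(P \parallel Q) \le \frac{1}{\alpha_Q} \delta(P, Q)^2.$$
As a consequence, if $Q$ has full support (i.e. $Q(x) > 0$ for all $x \in X$), then
$$\delta(P, Q)^2 \le \frac{1}{2} D_{\mathrm{KL}}(P \parallel Q) \le \frac{1}{\alpha_Q} \delta(P, Q)^2.$$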
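Both halves of this section can be illustrated numerically. The sketch below, with illustrative function names and example values that are not from the source, first evaluates the two-point counterexample, where the total variation distance shrinks with the perturbation size while the divergence stays infinite. It then checks the two-sided bound for one full-support distribution on a three-point space.

```python
import math

def total_variation(p, q):
    """Total variation distance on a finite set: half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; infinite if P(i) > 0 while Q(i) = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

# Two-point counterexample: Q = (0, 1), P_eps = (eps, 1 - eps).
q_point = [0.0, 1.0]
counterexample = []
for eps in (0.1, 0.01, 0.001):
    p_eps = [eps, 1.0 - eps]
    counterexample.append(
        (eps, total_variation(p_eps, q_point), kl_divergence(p_eps, q_point))
    )
print(counterexample)

# Two-sided bound when Q has full support (example values are assumed).
q = [0.1, 0.3, 0.6]
p = [0.3, 0.3, 0.4]
alpha_q = min(qi for qi in q if qi > 0)  # smallest positive mass of Q
delta = total_variation(p, q)
half_kl = 0.5 * kl_divergence(p, q)
print(delta ** 2 <= half_kl <= delta ** 2 / alpha_q)
```

The upper bound degrades as the smallest mass of the reference distribution goes to zero, which is consistent with the counterexample in which that mass is exactly zero.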
References
- [1] Csiszár, Imre; Körner, János (2011). Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press. p. 44. ISBN 9781139499989.
- [2] Yeung, Raymond W. (2008). Information Theory and Network Coding. Hong Kong: Springer. p. 26. ISBN 978-0-387-79233-0.
- [3] Tsybakov, Alexandre (2009). Introduction to Nonparametric Estimation. Springer. p. 132. ISBN 9780387790527.
- [4] The divergence becomes infinite whenever one of the two distributions assigns probability zero to an event while the other assigns it a nonzero probability (no matter how small); see e.g. Basu, Mitra; Ho, Tin Kam (2006). Data Complexity in Pattern Recognition. Springer. p. 161. ISBN 9781846281723.
- [5] See Lemma 4.1 in Götze, Friedrich; Sambale, Holger; Sinulis, Arthur. "Higher order concentration for functions of weakly dependent random variables". arXiv:1801.06348.
Further reading
- Thomas M. Cover and Joy A. Thomas: Elements of Information Theory, 2nd edition, Wiley-Interscience, 2006
- Nicolo Cesa-Bianchi and Gábor Lugosi: Prediction, Learning, and Games, Cambridge University Press, 2006