Chernoff bound

In probability theory, the Chernoff bound gives exponentially decreasing bounds on the tail distributions of sums of independent random variables. Despite being named after Herman Chernoff, the author of the paper in which it first appeared,[1] the result is due to Herman Rubin.[2] It is sharper than first- or second-moment-based tail bounds such as Markov's inequality or Chebyshev's inequality, which yield only power-law bounds on tail decay. However, the Chernoff bound requires the variates to be independent, a condition that Markov's inequality does not require at all; Chebyshev's inequality, when applied to a sum, requires only that the variates be pairwise independent.

It is related to the (historically prior) Bernstein inequalities and to Hoeffding's inequality.

The generic bound

The generic Chernoff bound for a random variable X is attained by applying Markov's inequality to e^{tX}. This gives a bound in terms of the moment-generating function of X. For every t > 0:

\Pr(X \geq a) = \Pr\left(e^{tX} \geq e^{ta}\right) \leq \frac{\operatorname{E}\left[e^{tX}\right]}{e^{ta}}

Since this bound holds for every t > 0, we may take the infimum:

\Pr(X \geq a) \leq \inf_{t > 0} \frac{\operatorname{E}\left[e^{tX}\right]}{e^{ta}}
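
As an illustration (not part of the original article), the infimum over t can be approximated numerically once the moment-generating function is known. The Python sketch below is a minimal example: chernoff_bound and its grid search over t are hypothetical helpers, and the standard normal is chosen because its log-MGF t²/2 makes the exact infimum e^{−a²/2} available for comparison.

```python
import math

def chernoff_bound(log_mgf, a, t_max=20.0, steps=20_000):
    """Approximate inf_{t>0} E[e^{tX}] * e^{-t*a} by a grid search over t,
    working with log E[e^{tX}] to avoid overflow."""
    best = float("inf")
    for k in range(1, steps + 1):
        t = t_max * k / steps
        best = min(best, math.exp(log_mgf(t) - t * a))
    return best

# Standard normal X: log E[e^{tX}] = t^2 / 2, so the infimum is attained at t = a
# and equals exp(-a^2 / 2); the grid search recovers it numerically.
a = 3.0
print(chernoff_bound(lambda t: t * t / 2.0, a))   # ~ 0.0111
print(math.exp(-a * a / 2.0))                     # exact infimum exp(-4.5) ~ 0.0111
```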

The Chernoff bound sometimes refers to the above inequality,[3] which was first applied by Sergei Bernstein to prove the related Bernstein inequalities.[citation needed] It is also used to prove Hoeffding's inequality, Bennett's inequality, and McDiarmid's inequality.

This inequality can be applied generally to various classes of distributions, including sub-gaussian distributions,[4] sub-gamma distributions, and sums of independent random variables.[3] Chernoff bounds commonly refer to the case where X is the sum of independent Bernoulli random variables.[5][6]

When X is the sum of n independent random variables X1, ..., Xn, the moment generating function of X is the product of the individual moment generating functions, giving that

\Pr(X \geq a) \leq \inf_{t > 0} e^{-ta} \prod_{i=1}^{n} \operatorname{E}\left[e^{tX_i}\right] \qquad\qquad (1)

By performing the same analysis on the random variable -X, one can get the same bound in the other direction.

Specific Chernoff bounds are attained by calculating the moment-generating function for specific instances of the random variables X_i. The bounds in the following sections for Bernoulli random variables are derived by using that, for a Bernoulli random variable X with probability p of being equal to 1,

\operatorname{E}\left[e^{tX}\right] = (1 - p)e^{0} + p e^{t} = 1 + p\left(e^{t} - 1\right) \leq e^{p\left(e^{t} - 1\right)}.
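
To make equation (1) and the Bernoulli moment-generating function concrete, here is a minimal sketch (not from the article) that optimizes the right-hand side of (1) over a grid of t values for a sum of Bernoulli(p) variables and compares it with the exact binomial tail; bernoulli_sum_chernoff, binomial_tail and the parameter values are illustrative choices.

```python
import math

def bernoulli_sum_chernoff(n, p, a, t_max=5.0, steps=10_000):
    """Bound Pr(X >= a) for X a sum of n independent Bernoulli(p) variables via
    equation (1): inf_{t>0} e^{-t*a} * (E[e^{t*X_1}])^n, with
    E[e^{t*X_1}] = 1 + p*(e^t - 1)."""
    best = float("inf")
    for k in range(1, steps + 1):
        t = t_max * k / steps
        log_term = n * math.log(1.0 + p * (math.exp(t) - 1.0)) - t * a
        best = min(best, math.exp(log_term))
    return best

def binomial_tail(n, p, a):
    """Exact Pr(X >= a) for X ~ Binomial(n, p), for comparison."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(a, n + 1))

n, p, a = 100, 0.1, 20
print(bernoulli_sum_chernoff(n, p, a))   # Chernoff bound on Pr(X >= 20), ~ 0.012
print(binomial_tail(n, p, a))            # exact tail probability, smaller than the bound
```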

One can encounter many flavors of Chernoff bounds: the original additive form (which gives a bound on the absolute error) or the more practical multiplicative form (which bounds the error relative to the mean).

Multiplicative form (relative error)

Multiplicative Chernoff bound. Suppose X1, ..., Xn are independent random variables taking values in {0, 1}. Let X denote their sum and let μ = E[X] denote the sum's expected value. Then for any δ > 0,

\Pr(X \geq (1+\delta)\mu) \leq \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}.

A similar proof strategy can be used to show that, for 0 < δ < 1,

\Pr(X \leq (1-\delta)\mu) \leq \left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^{\mu}.

The above formula is often unwieldy in practice, so the following looser but more convenient bounds[7] are often used, which follow from the inequality 2δ/(2+δ) ≤ log(1+δ) from the list of logarithmic inequalities:

\Pr(X \geq (1+\delta)\mu) \leq e^{-\delta^{2}\mu/(2+\delta)}, \qquad \delta \geq 0,

\Pr(X \leq (1-\delta)\mu) \leq e^{-\delta^{2}\mu/2}, \qquad 0 < \delta < 1,

\Pr(|X - \mu| \geq \delta\mu) \leq 2 e^{-\delta^{2}\mu/3}, \qquad 0 < \delta < 1.
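
As a quick numerical comparison (illustrative, not part of the article), the sketch below evaluates the exact binomial tail, the multiplicative bound (e^δ/(1+δ)^{1+δ})^μ, and the looser bound e^{−δ²μ/(2+δ)} for one arbitrary choice of n, p and δ.

```python
import math

def exact_upper_tail(n, p, threshold):
    """Exact Pr(X >= threshold) for X ~ Binomial(n, p)."""
    lo = math.ceil(threshold)
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(lo, n + 1))

n, p, delta = 1000, 0.05, 0.5
mu = n * p                                               # mu = E[X] = 50
tight = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu
loose = math.exp(-delta**2 * mu / (2 + delta))
print(exact_upper_tail(n, p, (1 + delta) * mu))          # exact Pr(X >= 75)
print(tight)                                             # (e^d / (1+d)^(1+d))^mu ~ 0.0045
print(loose)                                             # exp(-d^2 mu / (2+d)) ~ 0.0067
```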

Additive form (absolute error)

The following theorem is due to Wassily Hoeffding[8] and hence is called the Chernoff–Hoeffding theorem.

Chernoff–Hoeffding theorem. Suppose X1, ..., Xn are i.i.d. random variables, taking values in {0, 1}. Let p = E[X1] and ε > 0. Then

\Pr\left(\frac{1}{n}\sum X_i \geq p + \varepsilon\right) \leq \left(\left(\frac{p}{p+\varepsilon}\right)^{p+\varepsilon} \left(\frac{1-p}{1-p-\varepsilon}\right)^{1-p-\varepsilon}\right)^{n} = e^{-n D(p+\varepsilon \,\|\, p)}

\Pr\left(\frac{1}{n}\sum X_i \leq p - \varepsilon\right) \leq \left(\left(\frac{p}{p-\varepsilon}\right)^{p-\varepsilon} \left(\frac{1-p}{1-p+\varepsilon}\right)^{1-p+\varepsilon}\right)^{n} = e^{-n D(p-\varepsilon \,\|\, p)}

where

D(x \,\|\, y) = x \ln\frac{x}{y} + (1-x) \ln\frac{1-x}{1-y}

is the Kullback–Leibler divergence between Bernoulli distributed random variables with parameters x and y respectively. If p ≥ 1/2, then D(p+ε ‖ p) ≥ 2ε², which means

\Pr\left(\frac{1}{n}\sum X_i > p + \varepsilon\right) \leq e^{-2n\varepsilon^{2}}.

A simpler bound follows by relaxing the theorem using D(p + ε ‖ p) ≥ 2ε², which follows from the convexity of D(p + ε ‖ p) as a function of ε and the fact that

\frac{d^{2}}{d\varepsilon^{2}} D(p+\varepsilon \,\|\, p) = \frac{1}{(p+\varepsilon)(1-p-\varepsilon)} \geq 4 = \frac{d^{2}}{d\varepsilon^{2}}\left(2\varepsilon^{2}\right).

This gives

\Pr\left(\frac{1}{n}\sum X_i > p + \varepsilon\right) \leq e^{-2n\varepsilon^{2}}, \qquad \Pr\left(\frac{1}{n}\sum X_i < p - \varepsilon\right) \leq e^{-2n\varepsilon^{2}}.

This result is a special case of Hoeffding's inequality. Sometimes, the bounds

D\left((1+x)p \,\|\, p\right) \geq \tfrac{1}{4} x^{2} p, \qquad -\tfrac{1}{2} \leq x \leq \tfrac{1}{2},

D\left((1-x)p \,\|\, p\right) \geq \tfrac{1}{4} x^{2} p, \qquad -\tfrac{1}{2} \leq x \leq \tfrac{1}{2},

which are stronger for p < 1/8, are also used.
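
The gap between the Kullback–Leibler exponent and its quadratic relaxation can be seen numerically; the following sketch (illustrative values, not from the article) evaluates D(p+ε‖p), the bound e^{−nD(p+ε‖p)}, and the relaxed bound e^{−2nε²}.

```python
import math

def kl_bernoulli(x, y):
    """Kullback-Leibler divergence D(x || y) between Bernoulli(x) and Bernoulli(y)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

n, p, eps = 500, 0.3, 0.05
d = kl_bernoulli(p + eps, p)
print(d, 2 * eps**2)                # D(p+eps || p) ~ 0.0058 >= 2 eps^2 = 0.005
print(math.exp(-n * d))             # Chernoff-Hoeffding bound, ~ 0.056
print(math.exp(-2 * n * eps**2))    # weaker Hoeffding-style bound, ~ 0.082
```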

Sums of independent bounded random variables

Chernoff bounds may also be applied to general sums of independent, bounded random variables, regardless of their distribution; this is known as Hoeffding's inequality. The proof follows a similar approach to the other Chernoff bounds, but applying Hoeffding's lemma to bound the moment generating functions (see Hoeffding's inequality).

Hoeffding's inequality. Suppose X1, ..., Xn are independent random variables taking values in [a, b]. Let X denote their sum and let μ = E[X] denote the sum's expected value. Then for any δ > 0,

\Pr(|X - \mu| \geq \delta\mu) \leq 2 e^{-\frac{2\delta^{2}\mu^{2}}{n(b-a)^{2}}}.
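
As a small worked example (the numbers below are arbitrary choices, not from the article), the bound can be evaluated directly for summands supported on a known interval [a, b]:

```python
import math

# Hypothetical example values: n independent summands, each supported on [a, b],
# and a relative deviation delta from the mean mu.
n, a, b, delta = 200, 0.0, 2.0, 0.2
mu = n * (a + b) / 2.0                       # e.g. E[X] = 200 if each X_i is uniform on [0, 2]
bound = 2 * math.exp(-2 * (delta * mu) ** 2 / (n * (b - a) ** 2))
print(bound)    # Pr(|X - mu| >= delta * mu) <= 2 exp(-2 (delta*mu)^2 / (n (b-a)^2)) ~ 0.037
```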

Applications

Chernoff bounds have very useful applications in set balancing and packet routing in sparse networks.

The set balancing problem arises while designing statistical experiments. Typically, given the features of each participant in the experiment, we need to divide the participants into two disjoint groups so that each feature is as balanced as possible between the two groups.[9]

Chernoff bounds are also used to obtain tight bounds for permutation routing problems which reduce network congestion while routing packets in sparse networks.[9]

Chernoff bounds are used in computational learning theory to prove that a learning algorithm is probably approximately correct, i.e. with high probability the algorithm has small error on a sufficiently large training data set.[10]

Chernoff bounds can be effectively used to evaluate the "robustness level" of an application/algorithm by exploring its perturbation space with randomization.[11] The use of the Chernoff bound permits one to abandon the strong—and mostly unrealistic—small perturbation hypothesis (the perturbation magnitude is small). The robustness level can be, in turn, used either to validate or reject a specific algorithmic choice, a hardware implementation or the appropriateness of a solution whose structural parameters are affected by uncertainties.

A simple and common use of Chernoff bounds is for "boosting" of randomized algorithms. If one has an algorithm that outputs a guess that is the desired answer with probability p > 1/2, then one can get a higher success rate by running the algorithm n = ⌈2p ln(1/δ) / (p − 1/2)²⌉ times and outputting the guess that is output by more than n/2 runs of the algorithm. (There cannot be more than one such guess by the pigeonhole principle.) Assuming that these algorithm runs are independent, the probability that more than n/2 of the guesses are correct is equal to the probability that the sum of independent Bernoulli random variables Xk that are 1 with probability p is more than n/2. This can be shown to be at least 1 − δ via the multiplicative Chernoff bound (Corollary 13.3 in Sinclair's class notes, with μ = np):[12]

\Pr\left[X > \frac{n}{2}\right] \geq 1 - e^{-n\left(p - \frac{1}{2}\right)^{2}/(2p)} \geq 1 - \delta.
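
A minimal sketch of this boosting scheme (illustrative, not from the cited notes): repetitions computes the number of runs n = 2p ln(1/δ)/(p − 1/2)² suggested by the bound, boosted takes a majority vote, and the base algorithm with success probability p = 0.6 and correct answer 42 is a hypothetical stand-in.

```python
import math
import random

def repetitions(p, delta):
    """Number of runs n = 2p ln(1/delta) / (p - 1/2)^2 suggested by the bound,
    for a base algorithm that is correct with probability p > 1/2."""
    return math.ceil(2 * p * math.log(1 / delta) / (p - 0.5) ** 2)

def boosted(run_once, n):
    """Run the base algorithm n times and return the most frequent answer."""
    votes = {}
    for _ in range(n):
        answer = run_once()
        votes[answer] = votes.get(answer, 0) + 1
    return max(votes, key=votes.get)

# Hypothetical base algorithm: returns the correct answer 42 with probability p = 0.6,
# otherwise an arbitrary wrong answer.
p, delta = 0.6, 0.01
run_once = lambda: 42 if random.random() < p else random.randint(0, 41)
n = repetitions(p, delta)                                    # 553 runs for these values
trials = 200
correct = sum(boosted(run_once, n) == 42 for _ in range(trials))
print(n, correct / trials)    # empirical success rate, expected to be at least 1 - delta
```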

Matrix bound

Rudolf Ahlswede and Andreas Winter introduced a Chernoff bound for matrix-valued random variables.[13] The following version of the inequality can be found in the work of Tropp.[14]

Let M1, ..., Mt be independent matrix-valued random variables such that M_i \in \mathbb{C}^{d_1 \times d_2} and \operatorname{E}[M_i] = 0. Let us denote by \lVert M \rVert the operator norm of the matrix M. If \lVert M_i \rVert \leq \gamma holds almost surely for all i \in \{1, \ldots, t\}, then for every ε > 0

\Pr\left(\left\lVert \frac{1}{t} \sum_{i=1}^{t} M_i \right\rVert > \varepsilon\right) \leq (d_1 + d_2) \exp\left(-\frac{3\varepsilon^{2} t}{8\gamma^{2}}\right).

Notice that in order to conclude that the deviation from 0 is bounded by ε with high probability, we need to choose a number of samples t proportional to the logarithm of d_1 + d_2. In general, unfortunately, a dependence on \log(\min(d_1, d_2)) is inevitable: take for example a diagonal random sign matrix of dimension d \times d. The operator norm of the sum of t independent samples is precisely the maximum deviation among d independent random walks of length t. In order to achieve a fixed bound on the maximum deviation with constant probability, it is easy to see that t should grow logarithmically with d in this scenario.[15]
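
The diagonal random-sign example can be simulated directly. The sketch below (illustrative, not from the cited works) draws t diagonal sign matrices of dimension d, measures the operator norm of their average, which here is just the largest absolute average of d independent ±1 walks, and evaluates the right-hand side of the bound stated above with d1 = d2 = d and γ = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, eps = 64, 2000, 0.1

# Each sample is a d x d diagonal matrix of independent +/-1 entries, so the operator
# norm of their average is simply the largest |mean| over d independent sign walks.
signs = rng.choice([-1.0, 1.0], size=(t, d))
deviation = np.max(np.abs(signs.mean(axis=0)))    # realised ||(1/t) sum_i M_i||
bound = 2 * d * np.exp(-3 * eps**2 * t / 8)       # (d1 + d2) exp(-3 eps^2 t / (8 gamma^2)), gamma = 1
print(deviation)   # typically near sqrt(2 ln(d) / t) ~ 0.065
print(bound)       # bound on Pr(deviation > eps), ~ 0.07
```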

The following theorem can be obtained by assuming M has low rank, in order to avoid the dependency on the dimensions.

Theorem without the dependency on the dimensions

Let 0 < ε < 1 and M be a random symmetric real matrix with \lVert \operatorname{E}[M] \rVert \leq 1 and \lVert M \rVert \leq \gamma almost surely. Assume that each element on the support of M has at most rank r. Set

t = \Omega\left(\frac{\gamma \log(\gamma/\varepsilon^{2})}{\varepsilon^{2}}\right).

If r \leq t holds almost surely, then

\Pr\left(\left\lVert \frac{1}{t} \sum_{i=1}^{t} M_i - \operatorname{E}[M] \right\rVert > \varepsilon\right) \leq \frac{1}{\mathbf{poly}(t)}

where M1, ..., Mt are i.i.d. copies of M.

Theorem with matrices that are not completely random

Garg, Lee, Song and Srivastava [16] proved a Chernoff-type bound for sums of matrix-valued random variables sampled via a random walk on an expander, confirming a conjecture due to Wigderson and Xiao.

Kyng and Song [17] proved a Chernoff-type bound for sums of Laplacian matrices of random spanning trees.

Sampling variant

The following variant of Chernoff's bound can be used to bound the probability that a majority in a population will become a minority in a sample, or vice versa.[18]

Suppose there is a general population A and a sub-population B ⊆ A. Mark the relative size of the sub-population (|B|/|A|) by r.

Suppose we pick an integer k and a random sample S ⊂ A of size k. Mark the relative size of the sub-population in the sample (|B ∩ S|/|S|) by r_S.

Then, for every fraction d ∈ [0,1]:

\Pr\left(r_S < (1 - d)\cdot r\right) < \exp\left(-r \cdot d^{2} \cdot k / 2\right)

In particular, if B is a majority in A (i.e. r > 0.5) we can bound the probability that B will remain a majority in S (r_S > 0.5) by taking d = 1 − 1/(2r):[19]

\Pr\left(r_S > 0.5\right) > 1 - \exp\left(-r \cdot \left(1 - \frac{1}{2r}\right)^{2} \cdot k / 2\right)

This bound is of course not tight at all. For example, when r=0.5 we get a trivial bound Prob > 0.
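
A short numeric illustration (example values chosen here, not from the article): for a sub-population of relative size r > 1/2, the bound with d = 1 − 1/(2r) gives the following guarantee for a sample of size k.

```python
import math

def majority_survives_bound(r, k):
    """Lower bound on Pr(r_S > 1/2) when the sub-population has relative size r > 1/2
    in the population and the sample has size k, using d = 1 - 1/(2r)."""
    d = 1.0 - 1.0 / (2.0 * r)
    return 1.0 - math.exp(-r * d**2 * k / 2.0)

print(majority_survives_bound(0.6, 1000))   # ~ 0.9998
print(majority_survives_bound(0.5, 1000))   # 0.0: the bound is trivial at r = 0.5
```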

Proofs

Multiplicative form

Following the conditions of the multiplicative Chernoff bound, let X1, ..., Xn be independent Bernoulli random variables, whose sum is X, each having probability pi of being equal to 1. For a Bernoulli variable:

\operatorname{E}\left[e^{tX_i}\right] = (1 - p_i) e^{0} + p_i e^{t} = 1 + p_i\left(e^{t} - 1\right) \leq e^{p_i\left(e^{t} - 1\right)}

So, using (1) with a = (1+\delta)\mu for any \delta > 0 and where \mu = \operatorname{E}[X] = \sum_{i=1}^{n} p_i,

\Pr(X \geq (1+\delta)\mu) \leq \inf_{t > 0} e^{-t(1+\delta)\mu} \prod_{i=1}^{n} \operatorname{E}\left[e^{tX_i}\right] \leq \inf_{t > 0} \exp\left(-t(1+\delta)\mu + \left(e^{t} - 1\right)\sum_{i=1}^{n} p_i\right) = \inf_{t > 0} \exp\left(\mu\left(e^{t} - 1\right) - t(1+\delta)\mu\right).

If we simply set t = log(1 + δ) so that t > 0 for δ > 0, we can substitute and find

\exp\left(\mu\left(e^{\log(1+\delta)} - 1\right) - (1+\delta)\mu \log(1+\delta)\right) = \exp\left(\mu\delta - (1+\delta)\mu\log(1+\delta)\right) = \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}.

This proves the result desired.

Chernoff–Hoeffding theorem (additive form)

Let q = p + ε. Taking a = nq in (1), we obtain:

\Pr\left(\frac{1}{n}\sum X_i \geq q\right) \leq \inf_{t>0} e^{-tnq} \prod_{i=1}^{n} \operatorname{E}\left[e^{tX_i}\right] = \inf_{t>0} \left(e^{-tq}\, \operatorname{E}\left[e^{tX_1}\right]\right)^{n}.

Now, knowing that Pr(Xi = 1) = p, Pr(Xi = 0) = 1 − p, we have

\left(e^{-tq}\, \operatorname{E}\left[e^{tX_1}\right]\right)^{n} = \left(e^{-tq}\left(p e^{t} + (1 - p)\right)\right)^{n} = \left(p e^{(1-q)t} + (1-p) e^{-qt}\right)^{n}.

Therefore, we can easily compute the infimum, using calculus:

\frac{d}{dt}\left(p e^{(1-q)t} + (1-p) e^{-qt}\right) = (1-q) p e^{(1-q)t} - q(1-p) e^{-qt}

Setting the equation to zero and solving, we have

(1-q) p e^{(1-q)t} = q(1-p) e^{-qt}

(1-q) p e^{t} = q(1-p)

so that

e^{t} = \frac{q(1-p)}{p(1-q)}.

Thus,

t = \log\left(\frac{q(1-p)}{p(1-q)}\right).

As q = p + ε > p, we see that t > 0, as required. Having solved for t, we can plug back into the equations above to find that

\log\left(p e^{(1-q)t} + (1-p) e^{-qt}\right) = \log\left(e^{-qt}\left(1 - p + p e^{t}\right)\right) = -qt + \log\left(1 - p + p e^{t}\right)

= -q\log\frac{q(1-p)}{p(1-q)} + \log\frac{1-p}{1-q}

= -q\log\frac{q}{p} - q\log\frac{1-p}{1-q} + \log\frac{1-p}{1-q}

= -q\log\frac{q}{p} - (1-q)\log\frac{1-q}{1-p} = -D(q \,\|\, p).

We now have our desired result, that

\Pr\left(\frac{1}{n}\sum X_i \geq p + \varepsilon\right) \leq e^{-n D(p+\varepsilon \,\|\, p)}.

To complete the proof for the symmetric case, we simply define the random variable Yi = 1 − Xi, apply the same proof, and plug it into our bound.
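
As a sanity check on the algebra above (illustrative, not part of the article), one can verify numerically that at the optimal t the per-sample exponent equals −D(q‖p); the values of p and ε below are arbitrary.

```python
import math

p, eps = 0.3, 0.1
q = p + eps
t = math.log(q * (1 - p) / (p * (1 - q)))                  # optimal t from the proof
exponent = math.log(p * math.exp((1 - q) * t) + (1 - p) * math.exp(-q * t))
kl = q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))
print(exponent, -kl)   # the two values agree: the per-sample exponent is -D(q || p)
```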

References

  1. ^ Chernoff, Herman (1952). "A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the sum of Observations". The Annals of Mathematical Statistics. 23 (4): 493–507. doi:10.1214/aoms/1177729330. ISSN 0003-4851. JSTOR 2236576.
  2. ^ Chernoff, Herman (2014). "A career in statistics" (PDF). In Lin, Xihong; Genest, Christian; Banks, David L.; Molenberghs, Geert; Scott, David W.; Wang, Jane-Ling (eds.). Past, Present, and Future of Statistics. CRC Press. p. 35. ISBN 9781482204964.
  3. ^ a b Boucheron, Stéphane (2013). Concentration Inequalities: a Nonasymptotic Theory of Independence. Gábor Lugosi, Pascal Massart. Oxford: Oxford University Press. p. 21. ISBN 978-0-19-953525-5. OCLC 837517674.
  4. ^ Wainwright, M. (January 22, 2015). "Basic tail and concentration bounds" (PDF).
  5. ^ Vershynin, Roman (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge, United Kingdom: Cambridge University Press. p. 19. ISBN 978-1-108-41519-4. OCLC 1029247498.
  6. ^ Tropp, Joel A. (2015-05-26). "An Introduction to Matrix Concentration Inequalities". Foundations and Trends in Machine Learning. 8 (1–2): 60. arXiv:1501.01571. doi:10.1561/2200000048. ISSN 1935-8237. S2CID 5679583.
  7. ^ Mitzenmacher, Michael; Upfal, Eli (2005). Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press. ISBN 978-0-521-83540-4.
  8. ^ Hoeffding, W. (1963). "Probability Inequalities for Sums of Bounded Random Variables" (PDF). Journal of the American Statistical Association. 58 (301): 13–30. doi:10.2307/2282952. JSTOR 2282952.
  9. ^ a b Refer to this book section for more info on the problem.
  10. ^ Kearns, M.; Vazirani, U. (1994). An Introduction to Computational Learning Theory. MIT Press. Chapter 9 (Appendix), pages 190–192. ISBN 0-262-11193-4.
  11. ^ Alippi, C. (2014). "Randomized Algorithms". Intelligence for Embedded Systems. Springer. ISBN 978-3-319-05278-6.
  12. ^ Sinclair, Alistair (Fall 2011). "Class notes for the course "Randomness and Computation"" (PDF). Archived from the original (PDF) on 31 October 2014. Retrieved 30 October 2014.
  13. ^ Ahlswede, R.; Winter, A. (2003). "Strong Converse for Identification via Quantum Channels". IEEE Transactions on Information Theory. 48 (3): 569–579. arXiv:quant-ph/0012127. doi:10.1109/18.985947. S2CID 523176.
  14. ^ Tropp, J. (2010). "User-friendly tail bounds for sums of random matrices". Foundations of Computational Mathematics. 12 (4): 389–434. arXiv:1004.4389. doi:10.1007/s10208-011-9099-z. S2CID 17735965.
  15. ^ Magen, A.; Zouzias, A. (2011). "Low Rank Matrix-Valued Chernoff Bounds and Approximate Matrix Multiplication". arXiv:1005.2724 [cs.DM].
  16. ^ Garg, Ankit; Lee, Yin Tat; Song, Zhao; Srivastava, Nikhil (2018). A Matrix Expander Chernoff Bound. STOC '18: Proceedings of the fiftieth annual ACM Symposium on Theory of Computing. arXiv:1704.03864.
  17. ^ Kyng, Rasmus; Song, Zhao (2018). A Matrix Chernoff Bound for Strongly Rayleigh Distributions and Spectral Sparsifiers from a few Random Spanning Trees. FOCS '18 IEEE Symposium on Foundations of Computer Science. arXiv:1810.08345.
  18. ^ Goldberg, A. V.; Hartline, J. D. (2001). "Competitive Auctions for Multiple Digital Goods". Algorithms — ESA 2001. Lecture Notes in Computer Science. 2161. p. 416. CiteSeerX 10.1.1.8.5115. doi:10.1007/3-540-44676-1_35. ISBN 978-3-540-42493-2.; lemma 6.1
  19. ^ See graphs of: the bound as a function of r when k changes and the bound as a function of k when r changes.
