Johnson–Lindenstrauss lemma

In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. The map used for the embedding is at least Lipschitz, and can even be taken to be an orthogonal projection.

The lemma has applications in compressed sensing, manifold learning, dimensionality reduction, and graph embedding. Much of the data stored and manipulated on computers, including text and images, can be represented as points in a high-dimensional space (see vector space model for the case of text). However, the essential algorithms for working with such data tend to become bogged down very quickly as dimension increases.^[1] It is therefore desirable to reduce the dimensionality of the data in a way that preserves its relevant structure. The Johnson–Lindenstrauss lemma is a classic result in this vein.

Also, the lemma is tight up to a constant factor, i.e. there exists a set of points of size m that needs dimension

\Omega \left({\frac {\log(m)}{\varepsilon ^{2}}}\right)

in order to preserve the distances between all pairs of points within a factor of $(1\pm \varepsilon )$ .^[2]^[3]

Lemma[]

Given $0<\varepsilon <1$ , a set $X$ of $m$ points in $\mathbb {R} ^{N}$ , and a number $n>8\ln(m)/\varepsilon ^{2}$ , there is a linear map $f:\mathbb {R} ^{N}\rightarrow \mathbb {R} ^{n}$ such that

(1-\varepsilon )\|u-v\|^{2}\leq \|f(u)-f(v)\|^{2}\leq (1+\varepsilon )\|u-v\|^{2}

for all $u,v\in X$ .

The formula can be rearranged:

(1+\varepsilon )^{-1}\|f(u)-f(v)\|^{2}\leq \|u-v\|^{2}\leq (1-\varepsilon )^{-1}\|f(u)-f(v)\|^{2}

One proof of the lemma takes ƒ to be a suitable multiple of the orthogonal projection onto a random subspace of dimension $n$ in $\mathbb {R} ^{N}$ , and exploits the phenomenon of concentration of measure.

An orthogonal projection will, in general, reduce the average distance between points, but the lemma can be viewed as dealing with relative distances, which do not change under scaling. In a nutshell, you roll the dice and obtain a random projection, which will reduce the average distance, and then you scale up the distances so that the average distance returns to its previous value. If you keep rolling the dice, you will, in polynomial random time, find a projection for which the (scaled) distances satisfy the lemma.

Alternate statement[]

A related lemma is the distributional JL lemma. This lemma states that for any 0 < ε, δ < 1/2 and positive integer d, there exists a distribution over R^{k × d} from which the matrix A is drawn such that for k = O(ε⁻²log(1/δ)) and for any unit-length vector x ∈ R^d, the claim below holds.^[4]

P(|\Vert Ax\Vert _{2}^{2}-1|>\varepsilon )<\delta

One can obtain the JL lemma from the distributional version by setting $x=(u-v)/\|u-v\|_{2}$ and $\delta <1/n^{2}$ for some pair u,v both in X. Then the JL lemma follows by a union bound over all such pairs.

Speeding up the JL transform[]

Given A, computing the matrix vector product takes O(kd) time. There has been some work in deriving distributions for which the matrix vector product can be computed in less than O(kd) time.

There are two major lines of work. The first, Fast Johnson Lindenstrauss Transform (FJLT),^[5] was introduced by Ailon and Chazelle in 2006. This method allows the computation of the matrix vector product in just $d\log d+k^{2+\gamma }$ for any constant $\gamma >0$ .

Another approach is to build a distribution supported over matrices that are sparse.^[6] This method allows keeping only an $\varepsilon$ fraction of the entries in the matrix, which means the computation can be done in just $kd\varepsilon$ time. Furthermore, if the vector has only $b$ non-zero entries, the Sparse JL takes time $kb\varepsilon$ , which may be much less than the $d\log d$ time used by Fast JL.

Tensorized Random Projections[]

It is possible to combine two JL matrices by taking the so-called Face-splitting product is defined as the tensor products of the rows (was proposed by V. Slyusar^[7] in 1996^[8]^[9]^[10]^[11]^[12] for radar and digital antenna array applications). More directly, let ${C}\in \mathbb {R} ^{3\times 3}$ and ${D}\in \mathbb {R} ^{3\times 3}$ be two matrices. Then the Face-splitting product ${C}\bullet {D}$ is^[8]^[9]^[10]^[11]^[12]

{C}\bullet {D}=\left[{\begin{array}{c }{C}_{1}\otimes {D}_{1}\\\hline {C}_{2}\otimes {D}_{2}\\\hline {C}_{3}\otimes {D}_{3}\\\end{array}}\right].

This idea of tensorization was used by Kasiviswanathan et al. 2010^[13] for differential privacy.

JL matrices defined like this use fewer random bits, and can be applied quickly to vectors that have tensor structure, due to the following identity:^[10]

(\mathbf {C} \bullet \mathbf {D} )(x\otimes y)=\mathbf {C} x\circ \mathbf {D} y=\left[{\begin{array}{c }(\mathbf {C} x)_{1}(\mathbf {D} y)_{1}\\(\mathbf {C} x)_{2}(\mathbf {D} y)_{2}\\\vdots \end{array}}\right]

,

where $\circ$ is the element-wise (Hadamard) product. Such computations have been used to efficiently compute polynomial kernels and many other linear algebra algorithms.^[14]

In 2020^[15] it was shown that if the matrices $C_{1},C_{2},\dots ,C_{c}$ are independent $\pm 1$ or Gaussian matrices, the combined matrix $C_{1}\bullet \dots \bullet C_{c}$ satisfies the distributional JL lemma if the number of rows is at least

O(\epsilon ^{-2}\log 1/\delta +\epsilon ^{-1}({\tfrac {1}{c}}\log 1/\delta )^{c})

.

For large $\epsilon$ this is as good as the completely random Johnson-Lindenstrauss, but a matching lower bound in the same paper shows that this exponential dependency on $(\log 1/\delta )^{c}$ is necessary. Alternative JL constructions are suggested to circumvent this.

Notes[]

^ For instance, writing about nearest neighbor search in high-dimensional data sets, Jon Kleinberg writes: "The more sophisticated algorithms typically achieve a query time that is logarithmic in n at the expense of an exponential dependence on the dimension d; indeed, even the average case analysis of heuristics such as k-d trees reveal an exponential dependence on d in the query time. Kleinberg, Jon M. (1997), "Two Algorithms for Nearest-neighbor Search in High Dimensions", Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, STOC '97, New York, NY, USA: ACM, pp. 599–608, doi:10.1145/258533.258653, ISBN 0-89791-888-6.
^ Kasper Green Larsen; Jelani Nelson (2017). Optimality of the Johnson-Lindenstrauss Lemma. Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS). pp. 633–638. arXiv:1609.02094. doi:10.1109/FOCS.2017.64.
^ Nielsen, Frank (2016). "10. Fast approximate optimization in high dimensions with core-sets and fast dimension reduction". Introduction to HPC with MPI for Data Science. Springer. pp. 259–272. ISBN 978-3-319-21903-5.
^ Johnson, William B.; Lindenstrauss, Joram (1984). "Extensions of Lipschitz mappings into a Hilbert space". In Beals, Richard; Beck, Anatole; Bellow, Alexandra; et al. (eds.). Conference in modern analysis and probability (New Haven, Conn., 1982). Contemporary Mathematics. Vol. 26. Providence, RI: American Mathematical Society. pp. 189–206. doi:10.1090/conm/026/737400. ISBN 0-8218-5030-X. MR 0737400.
^ Ailon, Nir; Chazelle, Bernard (2006). "Approximate nearest neighbors and the fast Johnson–Lindenstrauss transform". Proceedings of the 38th Annual ACM Symposium on Theory of Computing. New York: ACM Press. pp. 557–563. doi:10.1145/1132516.1132597. ISBN 1-59593-134-1. MR 2277181.
^ Kane, Daniel M.; Nelson, Jelani (2014). "Sparser Johnson-Lindenstrauss Transforms". Journal of the ACM. 61 (1): 1. arXiv:1012.1577. doi:10.1145/2559902. MR 3167920.. A preliminary version of this paper was published in the Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, 2012.
^ Anna Esteve, Eva Boj & Josep Fortiana (2009): Interaction Terms in Distance-Based Regression, Communications in Statistics - Theory and Methods, 38:19, P. 3501 [1]
^ ^a ^b Slyusar, V. I. (December 27, 1996). "End products in matrices in radar applications" (PDF). Radioelectronics and Communications Systems. 41 (3): 50–53.
^ ^a ^b Slyusar, V. I. (1997-05-20). "Analytical model of the digital antenna array on a basis of face-splitting matrix products" (PDF). Proc. ICATT-97, Kyiv: 108–109.
^ ^a ^b ^c Slyusar, V. I. (1997-09-15). "New operations of matrices product for applications of radars" (PDF). Proc. Direct and Inverse Problems of Electromagnetic and Acoustic Wave Theory (DIPED-97), Lviv.: 73–74.
^ ^a ^b Slyusar, V. I. (March 13, 1998). "A Family of Face Products of Matrices and its Properties" (PDF). Cybernetics and Systems Analysis C/C of Kibernetika I Sistemnyi Analiz.- 1999. 35 (3): 379–384. doi:10.1007/BF02733426.
^ ^a ^b Slyusar, V. I. (2003). "Generalized face-products of matrices in models of digital antenna arrays with nonidentical channels" (PDF). Radioelectronics and Communications Systems. 46 (10): 9–17.
^ Kasiviswanathan, Shiva Prasad, et al. "The price of privately releasing contingency tables and the spectra of random matrices with correlated rows." Proceedings of the forty-second ACM symposium on Theory of computing. 2010.
^ Woodruff, David P. "Sketching as a Tool for Numerical Linear Algebra." Theoretical Computer Science 10.1-2 (2014): 1-157.
^ Ahle, Thomas; Kapralov, Michael; Knudsen, Jakob; Pagh, Rasmus; Velingker, Ameya; Woodruff, David; Zandieh, Amir (2020). Oblivious Sketching of High-Degree Polynomial Kernels. ACM-SIAM Symposium on Discrete Algorithms. Association for Computing Machinery. doi:10.1137/1.9781611975994.9.

Johnson–Lindenstrauss lemma

Contents

Lemma[]

Alternate statement[]

Speeding up the JL transform[]

Tensorized Random Projections[]

See also[]

Notes[]

Further reading[]