Spectral clustering

[Figure: an example of two connected graphs]

In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.

In application to image segmentation, spectral clustering is known as segmentation-based object categorization.

Definitions

Given an enumerated set of data points, the similarity matrix may be defined as a symmetric matrix $A$, where $A_{ij} \ge 0$ represents a measure of the similarity between data points with indices $i$ and $j$. The general approach to spectral clustering is to use a standard clustering method (there are many such methods; k-means is discussed below) on relevant eigenvectors of a Laplacian matrix of $A$. There are many different ways to define a Laplacian, with different mathematical interpretations, and so the clustering will also have different interpretations. The relevant eigenvectors are those corresponding to the several smallest eigenvalues of the Laplacian, except for the smallest eigenvalue, which has the value 0. For computational efficiency, these eigenvectors are often computed as the eigenvectors corresponding to the several largest eigenvalues of a function of the Laplacian.
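As an illustration of such a similarity matrix, the following is a minimal sketch that builds $A$ from raw 2-D points; the Gaussian (RBF) kernel, the width parameter sigma, and the zeroed diagonal are assumptions made here for concreteness, not part of the definition above.

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    """Symmetric similarity matrix A with A[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    # Pairwise squared Euclidean distances between all points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)  # one common convention: no self-loops
    return A

# Toy data: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(3, 0.1, (5, 2))])
A = rbf_similarity(X, sigma=0.5)
```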

Spectral clustering is well known to relate to partitioning of a mass-spring system, where each mass is associated with a data point and each spring stiffness corresponds to a weight of an edge describing a similarity of the two related data points. Specifically, the classical reference [1] explains that the eigenvalue problem describing transversal vibration modes of a mass-spring system is exactly the same as the eigenvalue problem for the graph Laplacian matrix defined as

$L := D - A$,

where $D$ is the diagonal matrix with entries $D_{ii} = \sum_{j} A_{ij}$.

The masses that are tightly connected by the springs in the mass-spring system evidently move together from the equilibrium position in low-frequency vibration modes, so that the components of the eigenvectors corresponding to the smallest eigenvalues of the graph Laplacian can be used for meaningful clustering of the masses.

A popular related spectral clustering technique is the normalized cuts algorithm or Shi–Malik algorithm introduced by Jianbo Shi and Jitendra Malik,[2] commonly used for image segmentation. It partitions points into two sets $(B_1, B_2)$ based on the eigenvector $v$ corresponding to the second-smallest eigenvalue of the symmetric normalized Laplacian defined as

$L^{\mathrm{norm}} := I - D^{-1/2} A D^{-1/2}$.

A mathematically equivalent algorithm [3] takes the eigenvector corresponding to the largest eigenvalue of the random walk normalized adjacency matrix $P = D^{-1} A$.

Knowing the eigenvectors, partitioning may be done in various ways, such as by computing the median $m$ of the components of the second smallest eigenvector $v$, and placing all points whose component in $v$ is greater than $m$ in $B_1$, and the rest in $B_2$. The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion.
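A minimal sketch of this median-split bipartitioning, assuming a dense symmetric similarity matrix A and a dense eigensolver; the function names and the small numerical floor on the degrees are choices made here, not part of the Shi–Malik specification.

```python
import numpy as np
from scipy.linalg import eigh

def normalized_laplacian(A):
    """Symmetric normalized Laplacian L_norm = I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard against isolated nodes
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def bipartition(A):
    """Split node indices into two sets via the median of the second-smallest eigenvector."""
    L = normalized_laplacian(A)
    eigvals, eigvecs = eigh(L)   # eigenvalues in ascending order
    v = eigvecs[:, 1]            # eigenvector for the second-smallest eigenvalue
    m = np.median(v)
    B1 = np.flatnonzero(v > m)
    B2 = np.flatnonzero(v <= m)
    return B1, B2
```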

Algorithms

Basic Algorithm
  1. Calculate the Laplacian $L$ (or the normalized Laplacian)
  2. Calculate the first $k$ eigenvectors (the eigenvectors corresponding to the $k$ smallest eigenvalues of $L$)
  3. Consider the matrix formed by the first $k$ eigenvectors; the $\ell$-th row defines the features of graph node $\ell$
  4. Cluster the graph nodes based on these features (e.g., using k-means clustering); see the sketch below
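A minimal end-to-end sketch of the basic algorithm above, assuming a precomputed symmetric similarity matrix A, the unnormalized Laplacian, and scikit-learn's KMeans for the final step; these are illustrative choices rather than the only valid ones.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Basic spectral clustering of a symmetric similarity matrix A into k clusters."""
    # Step 1: unnormalized graph Laplacian L = D - A.
    D = np.diag(A.sum(axis=1))
    L = D - A
    # Step 2: eigenvectors for the k smallest eigenvalues of L.
    eigvals, eigvecs = eigh(L)
    U = eigvecs[:, :k]           # n x k matrix of spectral features
    # Step 3: row i of U is the feature vector of graph node i.
    # Step 4: cluster the rows with k-means.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```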

If the similarity matrix has not already been explicitly constructed, the efficiency of spectral clustering may be improved if the solution to the corresponding eigenvalue problem is performed in a matrix-free fashion (without explicitly manipulating or even computing the similarity matrix), as in the Lanczos algorithm.

For large-sized graphs, the second eigenvalue of the (normalized) graph Laplacian matrix is often ill-conditioned, leading to slow convergence of iterative eigenvalue solvers. Preconditioning is a key technology accelerating the convergence, e.g., in the matrix-free LOBPCG method. Spectral clustering has been successfully applied on large graphs by first identifying their community structure, and then clustering communities.[4]
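As a sketch of how such an eigenvalue problem might be solved with LOBPCG in practice: SciPy's lobpcg accepts either an explicit sparse matrix or a LinearOperator (the fully matrix-free case); an explicit sparse Laplacian and no preconditioner are used below purely for brevity.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lobpcg

def smallest_laplacian_eigenvectors(A_sparse, k, seed=0):
    """Approximate the k smallest eigenpairs of L = D - A for a sparse similarity matrix."""
    n = A_sparse.shape[0]
    D = sp.diags(np.asarray(A_sparse.sum(axis=1)).ravel())
    L = (D - A_sparse).tocsr()
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, k))  # random initial block of k vectors
    # largest=False requests the smallest eigenvalues; a preconditioner
    # (e.g. multigrid) could be supplied via the M argument to speed convergence.
    eigvals, eigvecs = lobpcg(L, X, largest=False, tol=1e-8, maxiter=200)
    return eigvals, eigvecs
```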

Spectral clustering is closely related to nonlinear dimensionality reduction, and dimension reduction techniques such as locally-linear embedding can be used to reduce errors from noise or outliers.[5]

Software

Free software implementing spectral clustering is available in large open source projects like scikit-learn[6] using LOBPCG[7] with multigrid preconditioning[8][9] or ARPACK, MLlib for pseudo-eigenvector clustering using the power iteration method,[10] and R.[11]
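For example, a typical call to scikit-learn's SpectralClustering might look as follows; the specific parameter values (RBF affinity, LOBPCG eigensolver, k-means label assignment) are illustrative choices, not defaults or recommendations.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])

model = SpectralClustering(
    n_clusters=2,
    affinity="rbf",          # build the similarity matrix with an RBF kernel
    eigen_solver="lobpcg",   # "arpack" and "amg" are alternative solvers
    assign_labels="kmeans",  # final clustering step on the spectral embedding
    random_state=0,
)
labels = model.fit_predict(X)
```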

Relationship with other clustering methods

The ideas behind spectral clustering may not be immediately obvious. It may be useful to highlight relationships with other methods. In particular, it can be described in the context of kernel clustering methods, which reveals several similarities with other approaches.[12]

Relationship with k-means

The kernel k-means problem is an extension of the k-means problem where the input data points are mapped non-linearly into a higher-dimensional feature space via a kernel function $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. The weighted kernel k-means problem further extends this problem by defining a weight $w_r$ for each cluster as the reciprocal of the number of elements in the cluster,

$\max_{\{C_s\}} \sum_{r=1}^{k} w_r \sum_{x_i, x_j \in C_r} k(x_i, x_j).$

Suppose $F$ is a matrix of the normalizing coefficients for each point for each cluster, $F_{ij} = w_r$ if $i, j \in C_r$ and zero otherwise. Suppose $K$ is the kernel matrix for all points. The weighted kernel k-means problem with $n$ points and $k$ clusters is given as

$\max_F \operatorname{trace}(KF)$

such that

$F = G_{n \times k} G_{n \times k}^T$

such that $G^T G = I$. In addition, there are identity constraints on $F$ given by

$F \cdot \mathbb{I} = \mathbb{I},$

where $\mathbb{I}$ represents a vector of ones. This problem can be recast as

$\max_G \operatorname{trace}(G^T K G).$

This problem is equivalent to the spectral clustering problem when the identity constraints on $F$ are relaxed. In particular, the weighted kernel k-means problem can be reformulated as a spectral clustering (graph partitioning) problem and vice versa. The output of the algorithms are eigenvectors, which do not satisfy the identity requirements for indicator variables defined by $F$. Hence, post-processing of the eigenvectors is required for the equivalence between the problems.[13] Transforming the spectral clustering problem into a weighted kernel k-means problem greatly reduces the computational burden.[14]
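A minimal sketch of the relaxed problem and the post-processing step, assuming a precomputed kernel matrix K: the relaxed optimum $G$ is spanned by the top-$k$ eigenvectors of $K$, and its rows are then rounded to cluster indicators with k-means (one common post-processing choice among several).

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def relaxed_kernel_kmeans(K, k):
    """Relaxation of max_G trace(G^T K G) subject to G^T G = I, followed by rounding."""
    # The relaxed optimum is spanned by the eigenvectors of K for its k largest eigenvalues.
    eigvals, eigvecs = eigh(K)   # eigenvalues in ascending order
    G = eigvecs[:, -k:]          # n x k matrix with orthonormal columns
    # Rows of G are not valid cluster indicators, so post-process them with k-means.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(G)
```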

Relationship to DBSCAN

Spectral clustering is also related to DBSCAN clustering, which finds density-connected components. Connected components correspond to optimal spectral clusters (no edges cut); DBSCAN uses an asymmetric neighbor graph with edges removed when source points are not dense.[15] Thus, DBSCAN is a special case of spectral clustering, but one which allows more efficient algorithms (worst case $O(n^2)$, and in many practical cases much faster with indexes).

Measures to compare clusterings

Ravi Kannan, Santosh Vempala and Adrian Vetta[16] proposed a bicriteria measure to define the quality of a given clustering. They said that a clustering was an (α, ε)-clustering if the conductance of each cluster (in the clustering) was at least α and the weight of the inter-cluster edges was at most an ε fraction of the total weight of all the edges in the graph. They also analyzed two approximation algorithms in the same paper.

Approximate solutions

Spectral clustering is computationally expensive unless the graph is sparse and the similarity matrix can be efficiently constructed. If the similarity matrix is an RBF kernel matrix, spectral clustering is expensive. There are approximate algorithms for making spectral clustering more efficient: the power method,[17] the Nyström method,[18] etc. However, recent research[19] pointed out problems with spectral clustering based on the Nyström method; in particular, the similarity matrix with the Nyström approximation is not elementwise positive, which can be problematic.
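For reference, a minimal sketch of a Nyström-style low-rank approximation of the similarity matrix, assuming m landmark points sampled uniformly at random; the sampling scheme and the pseudoinverse-based reconstruction are the simplest choices, not the only ones, and the result need not be elementwise positive.

```python
import numpy as np

def nystrom_approximation(A, m, seed=0):
    """Approximate a symmetric similarity matrix A by a rank-m Nystrom factorization."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, replace=False)   # landmark indices
    C = A[:, idx]                                # n x m block of landmark columns
    W = A[np.ix_(idx, idx)]                      # m x m block among the landmarks
    # A is approximated by C W^+ C^T; the approximation may contain negative entries.
    return C @ np.linalg.pinv(W) @ C.T
```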

History and related literature

Spectral clustering has a long history.[20][21][22][23][24][2][25] Spectral clustering as a machine learning method was popularized by Shi & Malik[2] and Ng, Jordan, & Weiss.[25]

Ideas and network measures related to spectral clustering also play an important role in a number of applications apparently different from clustering problems. For instance, networks with stronger spectral partitions take longer to converge in opinion-updating models used in sociology and economics.[26][27]

See also

References

  1. ^ J. Demmel, CS267: Notes for Lecture 23, April 9, 1999, Graph Partitioning, Part 2
  2. ^ Jianbo Shi and Jitendra Malik, "Normalized Cuts and Image Segmentation", IEEE Transactions on PAMI, Vol. 22, No. 8, Aug 2000.
  3. ^ Marina Meilă & Jianbo Shi, "Learning Segmentation by Random Walks", Neural Information Processing Systems 13 (NIPS 2000), 2001, pp. 873–879.
  4. ^ Zare, Habil; P. Shooshtari; A. Gupta; R. Brinkman (2010). "Data reduction for spectral clustering to analyze high throughput flow cytometry data". BMC Bioinformatics. 11: 403. doi:10.1186/1471-2105-11-403. PMC 2923634. PMID 20667133.
  5. ^ Arias-Castro, E.; Chen, G.; Lerman, G. (2011), "Spectral clustering based on local linear approximations", Electronic Journal of Statistics, 5: 1537–1587, arXiv:1001.1323, doi:10.1214/11-ejs651, S2CID 88518155.
  6. ^ http://scikit-learn.org/stable/modules/clustering.html#spectral-clustering
  7. ^ Knyazev, Andrew V. (2003). Boley; Dhillon; Ghosh; Kogan (eds.). Modern preconditioned eigensolvers for spectral image segmentation and graph bisection. Clustering Large Data Sets; Third IEEE International Conference on Data Mining (ICDM 2003) Melbourne, Florida: IEEE Computer Society. pp. 59–62.
  8. ^ Knyazev, Andrew V. (2006). Multiscale Spectral Image Segmentation Multiscale preconditioning for computing eigenvalues of graph Laplacians in image segmentation. Fast Manifold Learning Workshop, WM Williamsburg, VA. doi:10.13140/RG.2.2.35280.02565.
  9. ^ Knyazev, Andrew V. (2006). Multiscale Spectral Graph Partitioning and Image Segmentation. Workshop on Algorithms for Modern Massive Datasets Stanford University and Yahoo! Research.
  10. ^ http://spark.apache.org/docs/latest/mllib-clustering.html#power-iteration-clustering-pic
  11. ^ https://cran.r-project.org/web/packages/kernlab
  12. ^ Filippone, M.; Camastra, F.; Masulli, F.; Rovetta, S. (January 2008). "A survey of kernel and spectral methods for clustering". Pattern Recognition. 41 (1): 176–190. Bibcode:2008PatRe..41..176F. doi:10.1016/j.patcog.2007.05.018.
  13. ^ Dhillon, I. S.; Guan, Y.; Kulis, B. (2004). "Kernel k-means: spectral clustering and normalized cuts". Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 551–556.
  14. ^ Dhillon, Inderjit; Yuqiang Guan; Brian Kulis (November 2007). "Weighted Graph Cuts without Eigenvectors: A Multilevel Approach". IEEE Transactions on Pattern Analysis and Machine Intelligence. 29 (11): 1944–1957. CiteSeerX 10.1.1.131.2635. doi:10.1109/tpami.2007.1115. PMID 17848776. S2CID 9402790.
  15. ^ Schubert, Erich; Hess, Sibylle; Morik, Katharina (2018). The Relationship of DBSCAN to Matrix Factorization and Spectral Clustering (PDF). LWDA. pp. 330–334.
  16. ^ Kannan, Ravi; Vempala, Santosh; Vetta, Adrian (2004). "On Clusterings : Good, Bad and Spectral". Journal of the ACM. 51 (3): 497–515. doi:10.1145/990308.990313. S2CID 207558562.
  17. ^ Boutsidis, Christos (2015). "Spectral Clustering via the Power Method Provably" (PDF). International Conference on Machine Learning.
  18. ^ Fowlkes, C (2004). "Spectral grouping using the Nystrom method". IEEE Transactions on Pattern Analysis and Machine Intelligence. 26 (2): 214–25. doi:10.1109/TPAMI.2004.1262185. PMID 15376896. S2CID 2384316.
  19. ^ S. Wang, A. Gittens, and M. W. Mahoney (2019). "Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error Bounds". Journal of Machine Learning Research. 20: 1–49. arXiv:1706.02803.
  20. ^ Cheeger, Jeff (1969). "A lower bound for the smallest eigenvalue of the Laplacian". Proceedings of the Princeton Conference in Honor of Professor S. Bochner.
  21. ^ William Donath and Alan Hoffman (1972). "Algorithms for partitioning of graphs and computer logic based on eigenvectors of connections matrices". IBM Technical Disclosure Bulletin.
  22. ^ Fiedler, Miroslav (1973). "Algebraic connectivity of graphs". Czechoslovak Mathematical Journal. 23 (2): 298–305. doi:10.21136/CMJ.1973.101168.
  23. ^ Stephen Guattery and Gary L. Miller (1995). "On the performance of spectral graph partitioning methods". Annual ACM-SIAM Symposium on Discrete Algorithms.
  24. ^ Daniel A. Spielman and Shang-Hua Teng (1996). "Spectral Partitioning Works: Planar graphs and finite element meshes". Annual IEEE Symposium on Foundations of Computer Science.
  25. ^ Ng, Andrew Y.; Jordan, Michael I.; Weiss, Yair (2002). "On spectral clustering: analysis and an algorithm" (PDF). Advances in Neural Information Processing Systems.
  26. ^ DeMarzo, P. M.; Vayanos, D.; Zwiebel, J. (2003-08-01). "Persuasion Bias, Social Influence, and Unidimensional Opinions". The Quarterly Journal of Economics. Oxford University Press (OUP). 118 (3): 909–968. doi:10.1162/00335530360698469. ISSN 0033-5533.
  27. ^ Golub, Benjamin; Jackson, Matthew O. (2012-07-26). "How Homophily Affects the Speed of Learning and Best-Response Dynamics". The Quarterly Journal of Economics. Oxford University Press (OUP). 127 (3): 1287–1338. doi:10.1093/qje/qjs021. ISSN 0033-5533.