Sparse PCA

Sparse principal component analysis (sparse PCA) is a technique used in statistical analysis, in particular in the analysis of multivariate data sets. It extends principal component analysis (PCA), the classic method for reducing the dimensionality of data, by imposing sparsity structure on the input variables.

A particular disadvantage of ordinary PCA is that the principal components are usually linear combinations of all input variables. Sparse PCA overcomes this disadvantage by finding linear combinations that contain just a few input variables.

Contemporary datasets often have a number of input variables ( $p$ ) comparable with or even much larger than the number of samples ( $n$ ). It has been shown that if $p/n$ does not converge to zero, classical PCA is not consistent. Sparse PCA, however, can retain consistency even if $p\gg n$.^[1]

Mathematical formulation

Consider a data matrix, $X$ , where each of the $p$ columns represents an input variable, and each of the $n$ rows represents an independent sample from the data population. One assumes that each column of $X$ has mean zero; otherwise one can subtract the column-wise mean from each element of $X$ . Let $\Sigma ={\frac {1}{n-1}}X^{\top }X$ be the empirical covariance matrix of $X$ , which has dimension $p\times p$ . Given an integer $k$ with $1\leq k\leq p$ , the sparse PCA problem can be formulated as maximizing the variance along a direction represented by a vector $v\in \mathbb {R} ^{p}$ while constraining its cardinality:

$$\begin{aligned}\max \quad &v^{T}\Sigma v\\ \text{subject to}\quad &\left\Vert v\right\Vert _{2}=1\\ &\left\Vert v\right\Vert _{0}\leq k.\end{aligned}$$

Eq. 1

The first constraint specifies that $v$ is a unit vector. In the second constraint, $\left\Vert v\right\Vert _{0}$ represents the $\ell _{0}$ pseudo-norm of $v$, defined as the number of its non-zero components. The second constraint therefore requires that $v$ have at most $k$ non-zero components, where $k$ is typically much smaller than the dimension $p$. The optimal value of Eq. 1 is known as the $k$-sparse largest eigenvalue.

If one takes $k=p$, the problem reduces to ordinary PCA, and the optimal value becomes the largest eigenvalue of the covariance matrix $\Sigma$.
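
For very small $p$, Eq. 1 can be solved exactly by enumerating every support of size $k$: once the support is fixed, the problem is an ordinary eigenvalue problem on the corresponding $k\times k$ principal submatrix of $\Sigma$. The following is a minimal sketch under that assumption; the function names and the random test data are illustrative and do not come from any cited method.

```python
from itertools import combinations

import numpy as np


def empirical_covariance(X):
    """Center the columns of X and return X^T X / (n - 1)."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / (X.shape[0] - 1)


def sparse_pca_brute_force(Sigma, k):
    """Solve Eq. 1 exactly by enumerating every support of size k.

    Returns the k-sparse largest eigenvalue and a unit vector attaining it.
    The cost grows like (p choose k), so this is only usable for tiny p.
    """
    p = Sigma.shape[0]
    best_value, best_v = -np.inf, None
    for support in combinations(range(p), k):
        idx = np.array(support)
        # With the support fixed, Eq. 1 is an ordinary eigenvalue
        # problem on the k x k principal submatrix of Sigma.
        eigvals, eigvecs = np.linalg.eigh(Sigma[np.ix_(idx, idx)])
        if eigvals[-1] > best_value:
            best_value = eigvals[-1]
            best_v = np.zeros(p)
            best_v[idx] = eigvecs[:, -1]
    return best_value, best_v


# Illustrative data: 50 samples of 6 variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
Sigma = empirical_covariance(X)
value, v = sparse_pca_brute_force(Sigma, k=2)
```

Calling `sparse_pca_brute_force(Sigma, k=6)` on the same matrix returns the ordinary largest eigenvalue, matching the $k=p$ case described above.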

After finding the optimal solution $v$, one deflates $\Sigma$ to obtain a new matrix

$$\Sigma _{1}=\Sigma -(v^{T}\Sigma v)vv^{T},$$

and iterates this process to obtain further principal components. However, unlike PCA, sparse PCA cannot guarantee that different principal components are orthogonal. To achieve orthogonality, additional constraints must be enforced.
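
The deflation step can be sketched in a few lines. The matrix and the sparse direction below are hypothetical values chosen for illustration, not the output of any particular solver.

```python
import numpy as np


def deflate(Sigma, v):
    """Return Sigma - (v^T Sigma v) v v^T for a unit vector v."""
    variance = v @ Sigma @ v
    return Sigma - variance * np.outer(v, v)


# Hypothetical 3 x 3 covariance matrix and a 2-sparse unit direction.
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 1.0]])
v = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)

Sigma_1 = deflate(Sigma, v)
remaining = v @ Sigma_1 @ v  # variance left along v after deflation
```

Because $v$ is a unit vector, the variance along $v$ in the deflated matrix is zero. In this example $v$ is not an eigenvector of $\Sigma$, and the deflated matrix has a negative eigenvalue, so it is no longer positive semidefinite.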

The following equivalent definition is in matrix form. Let $V$ be a $p\times p$ symmetric matrix. One can then rewrite the sparse PCA problem as

$$\begin{aligned}\max \quad &\operatorname{Tr}(\Sigma V)\\ \text{subject to}\quad &\operatorname{Tr}(V)=1\\ &\Vert V\Vert _{0}\leq k^{2}\\ &\operatorname{Rank}(V)=1,\quad V\succeq 0.\end{aligned}$$

Eq. 2

Tr is the matrix trace, and $\Vert V\Vert _{0}$ is the number of non-zero elements of the matrix $V$. The last line specifies that $V$ has rank one and is positive semidefinite, which means that $V=vv^{T}$ for some vector $v$, so Eq. 2 is equivalent to Eq. 1.
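
The equivalence can be checked term by term. As a short worked derivation, assuming $V=vv^{T}$ with $v\in \mathbb {R} ^{p}$:

$$\begin{aligned}\operatorname{Tr}(\Sigma V)&=\operatorname{Tr}(\Sigma vv^{T})=v^{T}\Sigma v,\\ \operatorname{Tr}(V)&=v^{T}v=\Vert v\Vert _{2}^{2},\\ \Vert V\Vert _{0}&=\#\{(i,j):v_{i}v_{j}\neq 0\}=\Vert v\Vert _{0}^{2}.\end{aligned}$$

Hence the objective of Eq. 2 equals that of Eq. 1, the trace constraint is the unit-norm constraint $\Vert v\Vert _{2}=1$, and $\Vert V\Vert _{0}\leq k^{2}$ holds exactly when $\Vert v\Vert _{0}\leq k$.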

Moreover, the rank constraint in this formulation is actually redundant, and therefore sparse PCA can be cast as the following mixed-integer semidefinite program:^[2]

$$\begin{aligned}\max \quad &\operatorname{Tr}(\Sigma V)\\ \text{subject to}\quad &\operatorname{Tr}(V)=1\\ &\vert V_{i,i}\vert \leq z_{i},\quad \forall i\in \{1,\dots ,p\},\\ &\vert V_{i,j}\vert \leq {\frac {1}{2}}z_{i},\quad \forall i,j\in \{1,\dots ,p\}:i\neq j,\\ &V\succeq 0,\quad z\in \{0,1\}^{p},\quad \sum _{i}z_{i}\leq k.\end{aligned}$$

Eq. 3

Because of the cardinality constraint, the maximization problem is hard to solve exactly, especially when dimension p is high. In fact, the sparse PCA problem in Eq. 1 is NP-hard in the strong sense.^[3]
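
To give a rough sense of scale, the number of candidate supports that a naive exhaustive search over Eq. 1 would have to examine is the binomial coefficient $\binom{p}{k}$. The sketch below only counts supports for a few illustrative values of $p$ and $k$; it illustrates the cost of enumeration and is not a statement about the complexity of the best possible algorithm.

```python
from math import comb


def num_supports(p, k):
    """Number of size-k supports an exhaustive search over Eq. 1 would visit."""
    return comb(p, k)


# Illustrative problem sizes (chosen here, not taken from the references).
for p, k in [(20, 5), (100, 5), (1000, 10)]:
    print(f"p={p}, k={k}: {num_supports(p, k):.3e} candidate supports")
```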

Algorithms for Sparse PCA

Several alternative approaches to Eq. 1 have been proposed, including:

a regression framework,^[4]
a convex relaxation/semidefinite programming framework,^[5]
a generalized power method framework,^[6]
an alternating maximization framework,^[7]
forward-backward greedy search and exact methods using branch-and-bound techniques,^[8]
a certifiably optimal branch-and-bound approach,^[9]
a Bayesian formulation framework,^[10]
a certifiably optimal mixed-integer semidefinite branch-and-cut approach.^[2]

The methodological and theoretical developments of sparse PCA, as well as its applications in scientific studies, have recently been reviewed in a survey paper.^[11]
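
As a concrete illustration of a heuristic for Eq. 1, the sketch below runs a power iteration and, after each step, keeps only the $k$ largest-magnitude entries. This is a simple truncated power iteration written for illustration; it is not the algorithm of any of the references above, it carries no optimality guarantee, and the starting point and test matrix are assumptions.

```python
import numpy as np


def truncate(x, k):
    """Keep the k largest-magnitude entries of x, zero the rest, renormalise."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    out[keep] = x[keep]
    return out / np.linalg.norm(out)


def truncated_power_iteration(Sigma, k, n_iter=200):
    """Heuristic for Eq. 1: power iteration with hard thresholding."""
    # Start from the coordinate with the largest variance (an arbitrary choice).
    v = np.zeros(Sigma.shape[0])
    v[np.argmax(np.diag(Sigma))] = 1.0
    for _ in range(n_iter):
        v = truncate(Sigma @ v, k)
    return v


# Illustrative covariance: identity plus a spike on the first 3 of 10 variables.
p, k, theta = 10, 3, 4.0
u = np.zeros(p)
u[:k] = 1.0 / np.sqrt(k)
Sigma = np.eye(p) + theta * np.outer(u, u)

v = truncated_power_iteration(Sigma, k)
explained = v @ Sigma @ v
```

On this planted example the iteration recovers the spike direction `u`; on general covariance matrices it may stop at a suboptimal support.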

Regression approach via lasso (elastic net)

Semidefinite Programming Relaxation

It has been proposed that sparse PCA can be approximated by semidefinite programming (SDP).^[5] If one drops the rank constraint and replaces the cardinality constraint with a convex 1-norm constraint, one obtains a semidefinite programming relaxation, which can be solved in polynomial time:

$$\begin{aligned}\max \quad &\operatorname{Tr}(\Sigma V)\\ \text{subject to}\quad &\operatorname{Tr}(V)=1\\ &\mathbf {1} ^{T}|V|\mathbf {1} \leq k\\ &V\succeq 0.\end{aligned}$$

Eq. 3

In the second constraint, $\mathbf {1}$ is a $p\times 1$ vector of ones, and $|V|$ is the matrix whose elements are the absolute values of the elements of $V$.
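
To see why the 1-norm constraint relaxes the cardinality constraint, consider any rank-one $V=vv^{T}$ with $\Vert v\Vert _{2}=1$ and $\Vert v\Vert _{0}\leq k$. A short derivation using the Cauchy–Schwarz inequality on the non-zero entries of $v$ gives

$$\mathbf {1} ^{T}|V|\mathbf {1} =\sum _{i,j}|v_{i}||v_{j}|=\Vert v\Vert _{1}^{2}\leq \Vert v\Vert _{0}\,\Vert v\Vert _{2}^{2}\leq k.$$

Every feasible point of the cardinality-constrained problem therefore satisfies the second constraint, so the optimal value of the relaxation is an upper bound on the $k$-sparse largest eigenvalue.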

The optimal solution $V$ to the relaxed problem Eq. 3 is not guaranteed to have rank one. In that case, $V$ can be truncated to retain only the dominant eigenvector.
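
The truncation step can be sketched as follows. The matrix `V` below is a hypothetical rank-two stand-in for a relaxed solution, built by hand rather than returned by an SDP solver.

```python
import numpy as np


def dominant_eigenvector(V):
    """Return the unit eigenvector of symmetric V with the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(V)  # eigenvalues in ascending order
    return eigvecs[:, -1]


# Hypothetical relaxed solution: positive semidefinite, trace one, rank two.
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
V = 0.8 * np.outer(a, a) + 0.2 * np.outer(b, b)

v = dominant_eigenvector(V)
```

Here `v` equals `a` up to sign, the direction carrying most of the weight in `V`.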

While the semidefinite program does not scale beyond $p=300$ covariates, it has been shown that a second-order cone relaxation of the semidefinite relaxation is almost as tight and successfully solves problems with thousands of covariates.^[12]

Applications

Financial Data Analysis

If ordinary PCA is applied to a dataset in which each input variable represents a different asset, it may generate principal components that are weighted combinations of all the assets. In contrast, sparse PCA would produce principal components that are weighted combinations of only a few input assets, so their meaning is easy to interpret. Furthermore, if one uses a trading strategy based on these principal components, fewer assets imply lower transaction costs.

Biology

Consider a dataset where each input variable corresponds to a specific gene. Sparse PCA can produce a principal component that involves only a few genes, so researchers can focus on these specific genes for further analysis.

High-dimensional Hypothesis Testing

Contemporary datasets often have a number of input variables ( $p$ ) comparable with or even much larger than the number of samples ( $n$ ). It has been shown that if $p/n$ does not converge to zero, classical PCA is not consistent. In other words, if we let $k=p$ in Eq. 1, then the optimal value does not converge to the largest eigenvalue of the population covariance as the sample size $n\rightarrow \infty$ , and the optimal solution does not converge to the direction of maximum variance. Sparse PCA, however, can retain consistency even if $p\gg n$.^[1]

The $k$-sparse largest eigenvalue (the optimal value of Eq. 1) can be used to discriminate an isotropic model, where every direction has the same variance, from a spiked covariance model in the high-dimensional setting.^[13] Consider a hypothesis test where the null hypothesis specifies that the data $X$ are generated from a multivariate normal distribution with mean 0 and identity covariance, and the alternative hypothesis specifies that the data $X$ are generated from a spiked model with signal strength $\theta$ :

$$H_{0}:X\sim N(0,I_{p}),\quad H_{1}:X\sim N(0,I_{p}+\theta vv^{T}),$$

where $v\in \mathbb {R} ^{p}$ has only $k$ non-zero coordinates. The largest $k$-sparse eigenvalue can discriminate the two hypotheses if and only if $\theta >\Theta ({\sqrt {k\log(p)/n}})$ .

Since computing the $k$-sparse largest eigenvalue is NP-hard, one can approximate it by the optimal value of the semidefinite programming relaxation (Eq. 3). In that case, one can discriminate the two hypotheses if $\theta >\Theta ({\sqrt {k^{2}\log(p)/n}})$ . The additional ${\sqrt {k}}$ factor cannot be improved by any other polynomial-time algorithm if the planted clique conjecture holds.
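
The behaviour of the test statistic under the two hypotheses can be explored with a small simulation. The sketch below computes the $k$-sparse largest eigenvalue by exhaustive search, which is feasible only because $p$ is tiny; the sample size, dimension, spike direction and signal strength are all illustrative assumptions, and the example does not attempt to reproduce the detection thresholds quoted above.

```python
from itertools import combinations

import numpy as np


def k_sparse_largest_eigenvalue(Sigma, k):
    """Optimal value of Eq. 1 by exhaustive search over supports (tiny p only)."""
    p = Sigma.shape[0]
    return max(
        np.linalg.eigvalsh(Sigma[np.ix_(s, s)])[-1]
        for s in combinations(range(p), k)
    )


def sample_statistic(n, p, k, theta, rng):
    """Draw n samples from N(0, I_p + theta v v^T) and return the statistic."""
    v = np.zeros(p)
    v[:k] = 1.0 / np.sqrt(k)  # k-sparse unit spike (hypothetical choice)
    X = rng.standard_normal((n, p))
    # Adding sqrt(theta) * g * v with g ~ N(0, 1) gives covariance I + theta v v^T.
    X = X + np.sqrt(theta) * rng.standard_normal((n, 1)) * v
    Sigma = X.T @ X / (n - 1)
    return k_sparse_largest_eigenvalue(Sigma, k)


rng = np.random.default_rng(0)
stat_null = sample_statistic(n=200, p=8, k=2, theta=0.0, rng=rng)
stat_spiked = sample_statistic(n=200, p=8, k=2, theta=5.0, rng=rng)
```

With a signal this strong, the statistic under the spiked model is well separated from its value under the null; a test would reject the null when the statistic exceeds a threshold calibrated under $H_{0}$.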

Software/source code

amanpg – R package for sparse PCA using the alternating manifold proximal gradient method^[14]
elasticnet – R package for sparse estimation and sparse PCA using elastic nets^[15]
nsprcomp – R package for sparse and/or non-negative PCA based on thresholded power iterations^[16]
scikit-learn – Python library for machine learning that contains sparse PCA and other techniques in its decomposition module^[17]
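
As a usage illustration, the sketch below fits scikit-learn's `SparsePCA` to random data. Note that this estimator controls sparsity through an $\ell _{1}$ penalty weight (`alpha`) rather than through the cardinality bound $k$ of Eq. 1, so it solves a related but different formulation; the data and parameter values here are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

# Illustrative data: 100 samples of 10 variables, centered column-wise.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
X = X - X.mean(axis=0)

# alpha is the weight of the L1 penalty; larger values give sparser loadings.
model = SparsePCA(n_components=2, alpha=1.0, random_state=0)
scores = model.fit_transform(X)  # shape (n_samples, n_components)
loadings = model.components_     # shape (n_components, n_features)

# Number of variables that each component actually uses.
nonzeros_per_component = np.count_nonzero(loadings, axis=1)
```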

References

[1] Iain M Johnstone; Arthur Yu Lu (2009). "On Consistency and Sparsity for Principal Components Analysis in High Dimensions". Journal of the American Statistical Association. 104 (486): 682–693. doi:10.1198/jasa.2009.0121. PMC 2898454. PMID 20617121.
[2] Dimitris Bertsimas; Ryan Cory-Wright; Jean Pauphilet (2020). "Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality". arXiv:2005.05195.
[3] Andreas M. Tillmann; Marc E. Pfetsch (2013). "The Computational Complexity of the Restricted Isometry Property, the Nullspace Property, and Related Concepts in Compressed Sensing". IEEE Transactions on Information Theory. 60 (2): 1248–1259. arXiv:1205.2081. CiteSeerX 10.1.1.760.2559. doi:10.1109/TIT.2013.2290112. S2CID 2788088.
[4] Hui Zou; Trevor Hastie; Robert Tibshirani (2006). "Sparse principal component analysis" (PDF). Journal of Computational and Graphical Statistics. 15 (2): 262–286. CiteSeerX 10.1.1.62.580. doi:10.1198/106186006x113430. S2CID 5730904.
[5] Alexandre d’Aspremont; Laurent El Ghaoui; Michael I. Jordan; Gert R. G. Lanckriet (2007). "A Direct Formulation for Sparse PCA Using Semidefinite Programming" (PDF). SIAM Review. 49 (3): 434–448. arXiv:cs/0406021. doi:10.1137/050645506. S2CID 5490061.
[6] Michel Journee; Yurii Nesterov; Peter Richtarik; Rodolphe Sepulchre (2010). "Generalized Power Method for Sparse Principal Component Analysis" (PDF). Journal of Machine Learning Research. 11: 517–553. arXiv:0811.4724. Bibcode:2008arXiv0811.4724J. CORE Discussion Paper 2008/70.
[7] Peter Richtarik; Majid Jahani; S. Damla Ahipasaoglu; Martin Takac (2020). "Alternating Maximization: Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes".
[8] Baback Moghaddam; Yair Weiss; Shai Avidan (2005). "Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms" (PDF). Advances in Neural Information Processing Systems. 18. MIT Press.
[9] Lauren Berk; Dimitris Bertsimas (2019). "Certifiably optimal sparse principal component analysis". Mathematical Programming Computation. Springer. 11 (3): 381–420. doi:10.1007/s12532-018-0153-6. hdl:1721.1/131566. S2CID 126998398.
[10] Yue Guan; Jennifer Dy (2009). "Sparse Probabilistic Principal Component Analysis" (PDF). 5: 185.
[11] Hui Zou; Lingzhou Xue (2018). "A Selective Overview of Sparse Principal Component Analysis". Proceedings of the IEEE. 106 (8): 1311–1320. doi:10.1109/jproc.2018.2846588.
[12] Dimitris Bertsimas; Ryan Cory-Wright (2020). "On polyhedral and second-order cone decompositions of semidefinite optimization problems". Operations Research Letters. Elsevier. 48 (1): 78–85. arXiv:1910.03143. doi:10.1016/j.orl.2019.12.003.
[13] Quentin Berthet; Philippe Rigollet (2013). "Optimal Detection of Sparse Principal Components in High Dimension". Annals of Statistics. 41 (1): 1780–1815. arXiv:1202.5070. doi:10.1214/13-aos1127. S2CID 7162068.
[14] https://cran.r-project.org/web/packages/amanpg/index.html
[15] https://cran.r-project.org/web/packages/elasticnet/elasticnet.pdf
[16] https://cran.r-project.org/package=nsprcomp
[17] http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html
