Number of relevant directions in Principal Component Analysis and Wishart random matrices

Satya N. Majumdar 1, Pierpaolo Vivo 1

Physical Review Letters 108 (2012) 200601

We compute analytically, for large $N$, the probability $\mathcal{P}(N_+,N)$ that an $N\times N$ Wishart random matrix has $N_+$ eigenvalues exceeding a threshold $N\zeta$, including its large deviation tails. This probability plays a benchmark role when performing the Principal Component Analysis of a large empirical dataset. We find that $\mathcal{P}(N_+,N)\approx\exp(-\beta N^2 \psi_\zeta(N_+/N))$, where $\beta$ is the Dyson index of the ensemble and $\psi_\zeta(\kappa)$ is a rate function that we compute explicitly in the full range $0\leq \kappa\leq 1$ and for any $\zeta$. The rate function $\psi_\zeta(\kappa)$ displays a quadratic behavior modulated by a logarithmic singularity close to its minimum $\kappa^\star(\zeta)$. This is shown to be a consequence of a phase transition in an associated Coulomb gas problem. The variance $\Delta(N)$ of the number of relevant components is also shown to grow universally (independent of $\zeta$) as $\Delta(N)\sim (\beta \pi^2)^{-1}\ln N$ for large $N$.
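As an illustration of the quantity $N_+$ discussed above, the following is a minimal numerical sketch, not the analytical method of the paper. It assumes the real case ($\beta=1$), a square $N\times N$ Gaussian data matrix $X$ with unit-variance entries, and $W=X^{T}X$, so that the eigenvalues of $W$ are of order $N$. The function names and the parameter values are hypothetical choices made for this example. The typical fraction $\kappa^\star(\zeta)$ is estimated here from the Marchenko-Pastur density for a square matrix, $\rho(x)=\frac{1}{2\pi}\sqrt{(4-x)/x}$ on $[0,4]$, which is a standard result assumed for this normalization.

```python
import numpy as np


def count_above_threshold(n, zeta, rng):
    """Draw one real Wishart matrix W = X^T X (X is n x n, unit-variance
    Gaussian entries) and count its eigenvalues exceeding n * zeta."""
    x = rng.standard_normal((n, n))
    w = x.T @ x
    eigenvalues = np.linalg.eigvalsh(w)  # W is symmetric
    return int(np.sum(eigenvalues > n * zeta))


def mp_fraction_above(zeta):
    """Fraction of Marchenko-Pastur mass above zeta for a square matrix,
    i.e. the integral of sqrt((4 - x) / x) / (2 pi) from zeta to 4.
    Valid for 0 <= zeta <= 4 (substitution x = 4 sin^2(theta))."""
    theta = np.arcsin(np.sqrt(zeta) / 2.0)
    below = (2.0 / np.pi) * (theta + np.sin(theta) * np.cos(theta))
    return 1.0 - below


def simulate(n=100, zeta=1.0, samples=200, seed=0):
    """Return the empirical mean fraction N_+/N and the empirical
    variance of N_+ over independent Wishart samples."""
    rng = np.random.default_rng(seed)
    counts = np.array(
        [count_above_threshold(n, zeta, rng) for _ in range(samples)]
    )
    return counts.mean() / n, counts.var()


if __name__ == "__main__":
    mean_fraction, variance = simulate()
    print("empirical mean of N_+/N:", mean_fraction)
    print("Marchenko-Pastur estimate:", mp_fraction_above(1.0))
    print("empirical variance of N_+:", variance)
```

Direct sampling of this kind only probes typical fluctuations of $N_+$ around $\kappa^\star(\zeta)N$. The large deviation tails of $\mathcal{P}(N_+,N)$ are suppressed as $\exp(-\beta N^2\psi_\zeta)$ and are out of reach of such a simulation. The logarithmic growth of the variance is also slow, so at moderate $N$ the empirical variance should be expected to differ from $(\beta\pi^2)^{-1}\ln N$ by corrections of order one.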

  • 1. Laboratoire de Physique Théorique et Modèles Statistiques (LPTMS),
    CNRS : UMR8626 – Université Paris XI - Paris Sud