Abstract
Dimensionality reduction is fundamental for analyzing high-dimensional data, supporting visualization, denoising, and structure discovery. We present a systematic, large-scale benchmark of three widely used methods—Principal Component Analysis (PCA), Isometric Mapping (Isomap), and t-Distributed Stochastic Neighbor Embedding (t-SNE)—evaluated by average silhouette scores to quantify cluster preservation after embedding. Our full factorial simulation varies sample size (Formula presented.), noise variance (Formula presented.), and feature count (Formula presented.) under four generative regimes: (1) a linear Gaussian mixture, (2) a linear Student-t mixture with heavy tails, (3) a nonlinear Swiss-roll manifold, and (4) a nonlinear concentric-spheres manifold, each replicated 1000 times per condition. Beyond empirical comparisons, we provide mathematical results that explain the observed rankings: under standard separation and sampling assumptions, PCA maximizes silhouettes for linear, low-rank structure, whereas Isomap dominates on smooth curved manifolds; t-SNE prioritizes local neighborhoods, yielding strong local separation but less reliable global geometry. Empirically, PCA consistently achieves the highest silhouettes for linear structure (Isomap second, t-SNE third); on manifolds the ordering reverses (Isomap > t-SNE > PCA). Increasing (Formula presented.) and adding uninformative dimensions (larger p) degrade all methods, while larger n improves levels and stability. To our knowledge, this is the first integrated study combining a comprehensive factorial simulation across linear and nonlinear regimes with distribution-based summaries (density and violin plots) and supporting theory that predicts method orderings. The results offer clear, practice-oriented guidance: prefer PCA when structure is approximately linear; favor manifold learning—especially Isomap—when curvature is present; and use t-SNE for the exploratory visualization of local neighborhoods. Complete tables and replication materials are provided to facilitate method selection and reproducibility.
Author supplied keywords
Cite
CITATION STYLE
Zahed, M., & Skafyan, M. (2025). Silhouette-Based Evaluation of PCA, Isomap, and t-SNE on Linear and Nonlinear Data Structures. Stats, 8(4). https://doi.org/10.3390/stats8040105
Register to see more suggestions
Mendeley helps you to discover research relevant for your work.